Apache Kafka at trivago
2017-01-25, Munich, Germany
Clemens Valiente
Email: clemens.valiente@trivago.com
de.linkedin.com/in/clemensvaliente
Senior Data Engineer
trivago Düsseldorf
Originally a mathematician
Studied at Uni Erlangen
At trivago for 5 years
Clemens Valiente
Collecting price information for BI
As a hotel price comparison engine, our most valuable data are hotel prices. They are not only shown to our visitors to support their hotel booking decisions, but also stored and later analyzed by Business Intelligence.
With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.
The past: Data pipeline 2010 – 2015
[Diagram, built up over several slides, with the labels: Java Software Engineering, BI Warehouse]
Facts & Figures
Price dimensions
- Around one million hotels
- 250 booking websites
- Travellers search for up to 180 days in advance
- Data collected over five years
Restrictions
- Only single-night stays
- Only prices from European visitors
- Prices cached for up to 30 minutes
- One price per hotel, website and arrival date per day
- “Insert ignore”: the first price per key wins
Size of data
- We collected a total of 56 billion prices in those five years
- Towards the end of this pipeline in early 2015, around 100 million prices per day on average were written to BI
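The "insert ignore" restriction above can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the original implementation: the in-memory dictionary stands in for the BI warehouse table, and the key layout (hotel, website, arrival date, collection day) and the function name `insert_ignore` are hypothetical names derived from the restriction list.

```python
from datetime import date
from typing import Dict, Tuple

# Hypothetical key layout: (hotel_id, website_id, arrival_date, collection_day).
PriceKey = Tuple[int, int, date, date]


def insert_ignore(table: Dict[PriceKey, float], key: PriceKey, price: float) -> bool:
    """Store the price only if the key is new; return True if it was written."""
    if key in table:
        return False  # a price for this key already exists, so the new one is dropped
    table[key] = price
    return True


table: Dict[PriceKey, float] = {}
key = (42, 7, date(2015, 3, 1), date(2015, 1, 10))

first_written = insert_ignore(table, key, 99.0)   # first price of the day is stored
second_written = insert_ignore(table, key, 89.0)  # later price for the same key is ignored

# Upper bound on distinct keys per collection day implied by the figures above:
# hotels x booking websites x days in advance.
max_keys_per_day = 1_000_000 * 250 * 180
```

In this sketch the second price (89.0) never reaches the table, so any price change later on the same day is lost. The figures also imply an upper bound of 45 billion possible keys per day, against roughly 100 million prices actually written per day.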
Refactoring the pipeline: Requirements
• Scales with an arbitrary amount of data (future-proof)
• Reliable and resilient
• Low performance impact on the Java backend
• Long-term storage of raw input data
• Fast processing of filtered and aggregated data
• Open source
• We want to log everything:
  • More prices: length of stay, room type, breakfast info, room category, domain
  • With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
Present data pipeline 2017 – ingestion
[Diagram, built up over three slides, with the location labels: San Francisco, Düsseldorf, Hong Kong]
Present data pipeline 2017 – processing
Camus
Present data pipeline 2017 – results after two years in production
• Very reliable, barely any downtime or service interruptions of the system
• Java team is very happy – less load on their system
• BI team is very happy – more data, more resources to process it
• Stakeholders are very happy
  • Faster results
  • Better quality of results due to more data
  • More detailed results
  • => Shorter research phase, more and better stories
  • => Fewer requests and less workload for BI
Present data pipeline 2017 – facts & figures
Kafka cluster specifications
- Cluster of 5 machines in each data centre for logs
- An additional cluster of two machines in Düsseldorf for aggregation/stream processing
Data size (price log)
- Over 4 trillion messages collected so far
- 10 billion messages/day
- Over a hundred topics
Camus
- MapReduce application that writes prices to HDFS
- 15 mappers running in parallel
- Runs pretty much continuously in 10-minute intervals
- To be replaced by Gobblin/Kafka Connect
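To put the figures above into perspective, here is a small back-of-the-envelope calculation in Python. It uses only the numbers from the slides. The per-run and per-mapper values rest on assumptions the slides do not state: that all price-log messages pass through Camus and are spread evenly over time and over the 15 mappers. The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
CAMUS_RUNS_PER_DAY = 24 * 6          # one run every 10 minutes
CAMUS_MAPPERS = 15

messages_per_day = 10_000_000_000    # "10 billion messages/day"
total_messages = 4_000_000_000_000   # "over 4 trillion messages collected so far"

# Average ingest rate across all data centres: about 115,741 messages per second.
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY

# The total so far corresponds to 400 days at the current daily rate.
days_at_current_rate = total_messages / messages_per_day

# Assumption: even load. About 69.4 million messages per 10-minute run,
# or about 4.6 million per mapper per run.
messages_per_camus_run = messages_per_day / CAMUS_RUNS_PER_DAY
messages_per_mapper_per_run = messages_per_camus_run / CAMUS_MAPPERS
```

These are averages only; real traffic is likely to peak well above them, so they should be read as a rough order of magnitude rather than a capacity plan.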
Present data pipeline 2017 – use cases & status quo
Uses for price information
- Monitoring price parity in the hotel market
- Anomaly and fraud detection
- Price feed for online marketing
- Display of price development and delivery of price alerts to website visitors
Other data sources and usage
- Clicklog information from our website and mobile app
- Used for marketing performance analysis, product tests, invoice generation etc.
- Every euro of revenue was at some point a message in Kafka
Status quo
- Our entire BI business logic runs on and through the Kafka – Hadoop pipeline
- Almost all departments rely on data, insights and metrics delivered by Hadoop
- Most of the company could not do their job without Hadoop data
Ongoing Projects: Breaking up the Monolith
[Diagram, shown on two slides, with the location labels: Düsseldorf, Leipzig, Palma]
Key challenges and learnings
● Settle on a common message format (Avro/Protobuf, not CSV or JSON)
● A common message envelope is helpful (e.g. a header with timestamp and sender)
● For stream processing, repeat your key in your message value
● Monitor your consumer offsets with an audit log, especially across data centres
● Turn off auto-creation of topics, but have a process in place for topic creation
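Two of the learnings above, the common envelope and repeating the key in the value, can be sketched as follows. This is a hedged illustration in plain Python, not trivago's actual schema: all class, field and function names (`Header`, `PriceValue`, `Envelope`, `build_key`, `make_record`, `key_from_value`) and the key layout are assumptions, and serialization is left out on purpose, since the slides recommend Avro or Protobuf for the wire format.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Header:
    """Common envelope header shared by all message types."""
    timestamp_ms: int  # when the message was produced
    sender: str        # which service or host produced it


@dataclass(frozen=True)
class PriceValue:
    """Message value; the key fields are repeated here on purpose."""
    hotel_id: int
    website_id: int
    arrival_date: str  # ISO date, e.g. "2017-03-01"
    gross_price: float


@dataclass(frozen=True)
class Envelope:
    header: Header
    payload: PriceValue


def build_key(hotel_id: int, website_id: int, arrival_date: str) -> str:
    """Hypothetical message key: hotel, website and arrival date."""
    return f"{hotel_id}:{website_id}:{arrival_date}"


def make_record(payload: PriceValue, sender: str,
                now_ms: Optional[int] = None) -> Tuple[str, Envelope]:
    """Wrap a payload in the common envelope and derive its message key."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    key = build_key(payload.hotel_id, payload.website_id, payload.arrival_date)
    return key, Envelope(Header(now_ms, sender), payload)


def key_from_value(envelope: Envelope) -> str:
    """Recover the key from the value alone, without access to the message key."""
    p = envelope.payload
    return build_key(p.hotel_id, p.website_id, p.arrival_date)


# Fixed timestamp so the example is reproducible.
key, envelope = make_record(PriceValue(42, 7, "2017-03-01", 99.0),
                            sender="price-logger-1", now_ms=1485302400000)
recovered_key = key_from_value(envelope)
```

One plausible benefit of this layout is that a processing step which only sees values, or which re-keys or aggregates the stream, can still reconstruct the original key. The header gives every consumer the same place to find the timestamp and sender regardless of message type.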
Thank you! Questions and comments?
● Thanks to Jan Filipiak for his brainpower behind most projects
● Additional resources:
  ● https://github.com/trivago/gollum (an n:m message multiplexer written in Go)
  ● https://github.com/trivago/triava (TriavaCache, a JSR107-compliant cache)
