Building a Real-Time Data Pipeline:
Apache Kafka at LinkedIn
Hadoop Summit 2013
Joel Koshy
June 2013
LinkedIn Corporation ©2013 All Rights Reserved
Network update stream
We have a lot of data.
We want to leverage this data to build products.
Data pipeline
People you may know
System and application metrics/logging
How do we integrate this variety of data
and make it available to all these systems?
Point-to-point pipelines
LinkedIn’s user activity data pipeline (circa 2010)
Point-to-point pipelines
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Central data pipeline
First attempt: don’t re-invent the wheel
Second attempt: re-invent the wheel!
Use a central commit log
What is a commit log?
The log as a messaging system
Apache Kafka
Usage at LinkedIn
• 16 brokers in each cluster
• 28 billion messages/day
• Peak rates
  – Writes: 460,000 messages/second
  – Reads: 2,300,000 messages/second
• ~700 topics
• 40-50 live services consuming user-activity data
• Many ad hoc consumers
• Every production service is a producer (for metrics)
• 10k connections/colo
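A quick sanity check on these rates (arithmetic on the figures above, not a claim from the deck): 28 billion messages/day works out to 28,000,000,000 / 86,400 ≈ 324,000 messages/second on average, so the peak write rate of 460,000/second is roughly 1.4x the average. Peak reads run about 5x peak writes, which is consistent with each published message being fanned out to multiple consumers.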
Usage at LinkedIn
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Standardize on Avro in data pipeline
{
  "type": "record",
  "name": "URIValidationRequestEvent",
  "namespace": "com.linkedin.event.usv",
  "fields": [
    {
      "name": "header",
      "type": {
        "type": "record",
        "name": "TrackingEventHeader",
        "namespace": "com.linkedin.event",
        "fields": [
          {
            "name": "memberId",
            "type": "int",
            "doc": "The member id of the user initiating the action"
          },
          {
            "name": "timeMs",
            "type": "long",
            "doc": "The time of the event"
          },
          {
            "name": "host",
            "type": "string",
            ...
...
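To make the slide concrete, here is a rough sketch of what publishing such an event looked like against the Kafka 0.8 producer API, using a flattened stand-in for the header schema above. The topic name and broker list are placeholders, and a real deployment would also tag each payload with its schema (via a schema registry, for instance), which this sketch omits:

  import java.io.ByteArrayOutputStream;
  import java.util.Properties;
  import kafka.javaapi.producer.Producer;
  import kafka.producer.KeyedMessage;
  import kafka.producer.ProducerConfig;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.EncoderFactory;

  public class AvroEventProducer {
      // Flattened stand-in for the TrackingEventHeader schema on the slide.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"TrackingEventHeader\","
          + "\"namespace\":\"com.linkedin.event\",\"fields\":["
          + "{\"name\":\"memberId\",\"type\":\"int\"},"
          + "{\"name\":\"timeMs\",\"type\":\"long\"},"
          + "{\"name\":\"host\",\"type\":\"string\"}]}");

      public static void main(String[] args) throws Exception {
          GenericRecord header = new GenericData.Record(SCHEMA);
          header.put("memberId", 12345);
          header.put("timeMs", System.currentTimeMillis());
          header.put("host", "app42.example.com");

          // Avro binary encoding of the record.
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(SCHEMA).write(header, encoder);
          encoder.flush();

          // Kafka 0.8 producer; DefaultEncoder passes byte[] through unchanged.
          Properties props = new Properties();
          props.put("metadata.broker.list", "broker1:9092,broker2:9092");
          props.put("serializer.class", "kafka.serializer.DefaultEncoder");
          Producer<byte[], byte[]> producer =
              new Producer<byte[], byte[]>(new ProducerConfig(props));
          producer.send(new KeyedMessage<byte[], byte[]>(
              "URIValidationRequestEvent", out.toByteArray()));
          producer.close();
      }
  }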
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Hadoop data load (Camus)
• Open sourced:
  – https://github.com/linkedin/camus
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
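For reference, a Camus run of this era was a single MapReduce job launched from the command line and driven by a properties file. This sketch follows the project's README; the jar name, job name, and paths are placeholder values:

  hadoop jar camus-example-0.1.0-SNAPSHOT.jar com.linkedin.camus.etl.kafka.CamusJob \
      -P camus.properties

  # camus.properties (excerpt; illustrative values)
  camus.job.name=kafka-hdfs-etl
  etl.destination.path=/data/tracking
  etl.execution.base.path=/camus/exec
  etl.execution.history.path=/camus/exec/history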
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Does it work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
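The deck leaves the audit-trail mechanics to a diagram, but the principle behind evidence-based correctness can be sketched: each tier of the pipeline counts the messages it handles per topic and time window and publishes those counts, and an auditing service compares the tiers' tallies to detect loss or lag. The class below is an illustrative sketch under assumed names, not LinkedIn's implementation:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Illustrative audit tally: each tier (producer, broker, consumer) counts
  // messages per (topic, 10-minute window); an auditing service compares
  // tallies across tiers to detect loss or lag.
  public class AuditCounter {
      private static final long WINDOW_MS = 10 * 60 * 1000L;
      private final Map<String, Long> counts = new ConcurrentHashMap<String, Long>();

      // Called once per message handled by this tier.
      public void record(String topic, long eventTimeMs) {
          String key = topic + "@" + (eventTimeMs / WINDOW_MS);
          counts.merge(key, 1L, Long::sum);
      }

      // Periodically published (e.g., as a message on an audit topic)
      // for reconciliation.
      public Map<String, Long> snapshot() {
          return new ConcurrentHashMap<String, Long>(counts);
      }
  }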
Kafka replication (0.8)
• Intra-cluster replication feature
  – Facilitates high availability and durability
• Beta release available:
  – https://dist.apache.org/repos/dist/release/kafka/
• Rolled out in production at LinkedIn last week
Join us at our user-group meeting tonight @ LinkedIn!
– Thursday, June 27, 7.30pm to 9.30pm
– 2025 Stierlin Ct., Mountain View, CA
– http://www.meetup.com/http-kafka-apache-org/events/125887332/
– Presentations (replication overview and use-case studies) from:
  • RichRelevance
  • Netflix
  • Square
  • LinkedIn
Editor's Notes

  1. This talk covers our data pipeline, the motivations behind it, and how we built it out using Apache Kafka.
  2. LinkedIn, like most web companies, derives a lot of value from tracking user activity: page views, clicks, ad impressions, and so on. In fact, some of this activity data is visible directly, in one form or another, on your own network update stream. People in your network may add a new connection or share a URL, and you want to see these updates as soon as possible, ideally in real time. This user activity feed is a useful user-facing product in and of itself, but it really is much more important than that: LinkedIn is a data-centric company, and this activity data is also a hugely valuable ingredient in other data-driven products.
  3. We use data to provide a richer, more relevant experience to members. That engages our users more, which in turn generates more activity data, and we get into a perpetual, self-feeding cycle of successive refinement.
  4. That principle manifests in products such as People You May Know (PYMK), which is only as engaging and useful as it is relevant. If you only see people who are unrelated to you or whom you don't care about, it is pointless. A relevant suggestion leads to connecting with that person, which triggers a connect event; you may then click on their profile or company page, triggering more activity events. So each page view can directly or indirectly result in additional front-end reads and considerable downstream activity, i.e., calls into backend services. In other words, a simple page view (intuitively a read operation), when tracked as user activity, results in a bunch of writes within your activity data pipeline. And the data pipeline is not solely for activity tracking.
  5. It is also important to have a metrics data feed for tracking your system and application metrics and logs. This is critical for monitoring the health and trends of your production services, at both the low level and the application level, and log and service-call data feeds tools such as service call graphs. User activity and system metrics are just two kinds of data that you might want in your data pipeline; there are a few others as well, and there are a whole bunch of data-driven systems that need to feed off these data streams.
  6. So that is really the key problem we want to solve: integrating these different data pipelines and making the data easily available, preferably in real time, to each data-driven system. What happens at most companies, including LinkedIn, is that we end up building specialized data systems to handle each type of data, and very soon we end up with an architecture that looks something like the next slide:
  7. … where there is a different solution or pipeline for each type of data. In this picture the data sources are above and the data-driven systems (the consumers) are below. For example, for operational metrics we used JMX feeding into Zenoss, we had a separate user activity tracking system (which I'll talk more about a little later), and we used Splunk for scraping and searching logs. There are a number of data-driven systems that are directly user-facing, some that are mid-tier, and some that are more backend, and for many of them it is important to have access to the data in near real time. Take security systems, for example, which need to consume user activity events from the tracking system, detect anomalous or malicious patterns, and react quickly. Likewise, for search systems to provide more relevant results, the indexed content should be as fresh as possible. Recommendation systems can do a better job of providing relevant results if signals from activity data are incorporated early on. So in order to fulfill these use cases we needed tight coupling between the sources of the various types of data and the specialized data-driven systems that feed off that data. The cons: universal access to the data requires O(n²) point-to-point pipelines, sources and systems end up tightly coupled, and every pair of endpoints needs to know how to talk to each other. A concrete example to drive home these points a little more clearly: LinkedIn's old activity data pipeline.
  8. To drive home these points a little more clearly, here are some details on our previous (specialized) user activity data pipeline. Front-end applications would post XML blobs containing activity data to an HTTP-based logging service; activity logs were scraped from this service and periodically rsync'd over to staging servers in the offline data centers, where the ETL process took place. This pipeline had a number of limitations. First, the logging service did not really provide real-time access to the data sent to it; it just served as a point of aggregation. Other data systems in the live data center could not feed off this activity data directly; in fact, as the diagram shows, ultimately the only consumers of the user activity were the offline systems. Second, the data flowing through this pipeline was raw XML: producers used whatever structure they wanted, and a single data-warehouse team had the job of sucking in all this raw activity data and cleaning it into something beautiful that represents everything about your business. Cleaning data produced by tons of producers is hard when you are not well versed in the producers' data, and the producers in turn do not know what constitutes clean data that is amenable to treatment at the ETL stage. The data flow was fragile with respect to schema changes, labor intensive, and unscalable at the human layer: the people down at the batch layer build data-rich products that depend on a number of data sources, say 50 or more, so they are highly likely to break if the data format changes in any one of those flows, and a new application meant filing tickets and talking to consumers. Third, it was hard to verify correctness: does it work? Is all the activity data getting collected? Fourth, it was an inherently batch-oriented process: the rsyncs were periodic and the ETL jobs were periodic, giving multi-hour delays in getting the cleaned data. Furthermore, the data warehouse was the only source of clean data. The irony is that this data is very important to a data-centric company; we don't want to stop at reports, we want this clean data made available to production services as soon as possible in order to power insight-driven products. This was part of the reason we built out our Hadoop cluster: even though moving the data around this way is unscalable, just making the data available to these systems was hugely useful. After setting up Hadoop we unlocked a lot of possibilities: new computation was possible on data that would have been hard to work with before, and many new products and analyses came from putting together multiple pieces of data that had previously been locked up in specialized systems. People really wanted the data. But this setup forced solutions that were relatively heavyweight, clumsy, and not particularly effective. For example, recommendation systems match jobs, match people you may know, suggest groups you may want to join, and so on. The way it worked was that user events ended up in Hadoop, some offline processing and enrichment took place, and we generated pre-built read-only stores that were shipped to the production data center intermittently (for various reasons it is not possible to update a read-write store in the live data center from the offline data center). If those jobs aren't running frequently enough, your recommendation system is stuck with signals from the activity data of the last run.
  9. So that was just the user-activity pipeline. Similar issues plague other specialized pipelines.
10. Simple recipe: take all the organization’s data and put it into a central pipeline for real-time consumption. This has multiple benefits. Data is integrated and made available, and because the pipeline supports persistence, the data remains available for a period of time. Producers and consumers are decoupled and only need to know how to talk to the central pipeline. Adding a new data source or sink is simple and organizationally scalable. I should point out that we have had, and still have, a separate pipeline for database update streams – that’s Databus – but I won’t be going into that in this talk.
11. Since we were already using ActiveMQ at the time for ad hoc messaging purposes, we wondered if we could use it for the central data pipeline as well – i.e., the approach was to hook it up to the activity feed and just see what happens. So we tried ActiveMQ and RabbitMQ.
12. Problems with JMS messaging systems: they are not designed for high-volume data, especially with large backlogs of unconsumed data – in other words, persistence is not an ingrained concept; they are difficult to scale out, with no inherent support for distribution; and they suffer from featuritis, e.g., transactions (exactly-once semantics).
13. This borrows from the traditional database log concept: take all changes – updates to tables, indexes, materialized views, and so on – and, to do this in a way that is correct in the presence of failures, write a log of everything that happens. Someone else can read this log and apply those updates. So that’s a log: an append-only, totally ordered sequence of records indexed by time. It is not too different from a file, but the purpose of a log is specific: it captures what happened and when, and provides a persistent, replayable record of history.
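To make the abstraction concrete, here is a toy sketch of an append-only log in Java. The class and method names are hypothetical and this is not Kafka’s implementation; it only illustrates that appends are totally ordered and that any reader can replay history from an offset.

import java.util.ArrayList;
import java.util.List;

// Toy illustration of the commit-log abstraction (hypothetical; not
// Kafka's actual implementation). Records are appended in order and
// assigned a monotonically increasing offset; readers can replay from
// any offset, any number of times.
public class ToyCommitLog {
    private final List<byte[]> records = new ArrayList<byte[]>();

    // Append a record and return its offset (its position in the log).
    public synchronized long append(byte[] record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read the record at a given offset; the log is a persistent,
    // replayable record of what happened and in what order.
    public synchronized byte[] read(long offset) {
        return records.get((int) offset);
    }
}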
14. Apart from the decoupling that the central log provides, subscribers can consume at their own pace. This is important for, e.g., Hadoop, which may be on an hourly schedule or down for maintenance. Because the log is persistent, a consumer can resume from where it left off when it comes back up, and even a large backlog won’t impact consumer (or broker) performance, thanks to the linear access pattern.
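As a sketch of what consuming at your own pace looks like with the Kafka 0.8-era high-level consumer API: offsets are checkpointed in ZooKeeper, so a consumer that was down for maintenance resumes from where it left off. The hostname, group id, and topic name below are made-up examples, not our actual configuration.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class ResumableConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");  // example host
        props.put("group.id", "hadoop-etl");         // offset checkpoints are per group
        props.put("auto.commit.enable", "true");     // periodically save our position

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap("PageViewEvent", 1));

        // On restart, iteration resumes from the last committed offset,
        // even if a large backlog has accumulated in the meantime.
        ConsumerIterator<byte[], byte[]> it =
            streams.get("PageViewEvent").get(0).iterator();
        while (it.hasNext()) {
            byte[] message = it.next().message();  // process at our own pace
        }
    }
}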
15. Brokers, producers, consumers, topics. Engineering for high throughput: batching at the producer and consumer, compression, and reliance on the OS pagecache. Horizontally scalable: we can add more producers; we can add more brokers for more partitions (we don’t need that many – 16 brokers per DC); and we can add more consumer instances – e.g., one consumer can read three partitions, or three consumers can share that load. Guarantees: successfully published messages must not be lost and must be available for delivery even in the presence of broker failures; on the consumer side, delivery is at least once, and most of the time exactly once. End-to-end latency is generally under a second.
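For illustration, a producer configured along these lines with the 0.8-era API might look as follows: async mode buffers messages and sends them in compressed batches. The broker addresses, topic name, and specific settings are examples rather than our production values.

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");
        props.put("producer.type", "async");      // buffer and send in batches
        props.put("batch.num.messages", "200");   // messages per batch
        props.put("compression.codec", "gzip");   // compress batches on the wire
        props.put("request.required.acks", "1");  // wait for the leader's ack

        Producer<byte[], byte[]> producer =
            new Producer<byte[], byte[]>(new ProducerConfig(props));
        producer.send(new KeyedMessage<byte[], byte[]>(
            "PageViewEvent", "example payload".getBytes()));
        producer.close();
    }
}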
16. Cluster sizes vary – some are big, some are small; the tracking cluster has 16 brokers. Writes and reads are batched, so requests per second will be lower than messages per second. We are currently at ~700 topics, an ever-growing number.
17. This is an approximation of our topology – each large box is a data center. The main point I want to show is that we can mirror clusters to other DCs efficiently (with minutes of delay), and in doing so the data pipeline can almost seamlessly cut across data center boundaries. So the Hadoop clusters in the offline DC have near-real-time access to the activity feed and can push enriched data back for consumption by production services. This is a powerful use case: it allows live services to take large amounts of data that has been cleaned and enriched in one way or another by a batch-oriented system (Hadoop in this case) and consume it in a stream-oriented manner. Before we had this data pipeline, we were left with few options but to push batches onto the production services that needed them, which may not always be feasible: with a push model, services may not be able to deal with sudden bursts of data being pushed from Hadoop. Instead, the Hadoop jobs can now write data into the pipeline, which is mirrored back to the live DC, and each service can consume this data in a stream-oriented fashion at its own pace. For example, this is used by a system called Faust (under development by the Voldemort team, also part of DI at LinkedIn) to improve the turnaround time of incorporating recent activity data for use by recommendation systems. I mentioned earlier that we had jobs that shipped out entire pre-built read-only stores – now those jobs can just write updates to the data pipeline, which are read and applied by Faust in the live DC.
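For reference, Kafka ships a mirroring tool, MirrorMaker, which is essentially a consumer on the source cluster chained to a producer on the target cluster. An invocation looks roughly like the following; the property file names are placeholders.

bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config source-cluster.properties \
  --producer.config target-cluster.properties \
  --whitelist=".*"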
18. The second idea is to preemptively prevent data fragility due to schema changes and to ensure that only clean, well-structured data gets into the pipeline in the first place.
19. Picture: shows a sample schema. Highlights: the compatibility check is automatic. A schema review ensures best practices – well-named fields, the required header information, and amenability to future evolution. A compile-time check ensures that an updated schema is compatible with the previous version; we have a central repository of all schemas to aid in that verification. A compatibility model may seem restrictive, but if you have 50+ services consuming a given data source it makes sense, especially if you intend to evolve schemas over time. We ensure that a producer cannot send an event with a schema that is invalid or incompatible with a previous version. A reference to the schema (a hash of the canonicalized version) is embedded in each message, so a reader always uses the same schema as the writer.
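A sketch of that last point – embedding a schema reference in each message. The wire layout below (a magic byte, then a hash of the canonicalized writer schema, then the Avro payload) illustrates the idea; it is not necessarily our exact format, and the class name is made up.

import java.security.MessageDigest;
import org.apache.avro.Schema;

public class SchemaIdEmbedding {
    // Frame an Avro-encoded payload with a reference to its writer schema.
    // (A real implementation would hash a properly canonicalized form of
    // the schema; hashing schema.toString() stands in for that here.)
    public static byte[] frame(Schema writerSchema, byte[] avroPayload) throws Exception {
        byte[] schemaHash = MessageDigest.getInstance("MD5")
            .digest(writerSchema.toString().getBytes("UTF-8"));
        byte[] framed = new byte[1 + schemaHash.length + avroPayload.length];
        framed[0] = 0x0;  // magic byte for format versioning
        System.arraycopy(schemaHash, 0, framed, 1, schemaHash.length);
        System.arraycopy(avroPayload, 0, framed, 1 + schemaHash.length, avroPayload.length);
        return framed;    // a reader looks up the writer schema by its hash
    }
}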
20. The first two ideas facilitate a much more streamlined O(1) approach to ETL – by O(1) I mean with regard to human effort. Previously, new event types would typically need some custom parsing work to be ETL’d. But since we now have a central pipeline, and because all data in that pipeline uses backward-compatible schemas with a standardized encoding – Avro in our case – a new event type is just that: a new event type. You just send it and it gets ETL’d. The ETL process knows how to read Avro records, so the effort to get your new data into Hadoop is literally zero.
21. A bunch of mappers read from Kafka and write to HDFS.
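Camus runs as a regular Hadoop job; an invocation looks roughly like this (the jar and properties file names are placeholders):

hadoop jar camus-example-with-dependencies.jar \
  com.linkedin.camus.etl.kafka.CamusJob \
  -P camus.properties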
22. Does it work? To answer that, we first need a metric to measure.
23. Every message should be received by every consumer quickly. So we want to measure event loss, and we want to measure lag from the producer to the various consumers.
24. Producers keep track of how many messages they sent for each topic in every 10-minute time window; the time is taken from the event header of each message. Each producer sends an audit event every 10 minutes saying… Likewise, the Kafka cluster and the ETL tier report their counts. We have an app that reconciles these counts every few minutes. This leads into 0.8: discrepancies arise due to producer failures, typically when we upgrade the cluster, which causes unavailability for producers and consumers alike; with no acknowledgment, the producer doesn’t know whether the data made it or not.
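A sketch of the producer-side counting just described: events are bucketed into 10-minute windows by the timestamp in their header (not wall-clock time), and the per-topic, per-window counts are what get emitted as audit events. Class and method names are illustrative, not our actual code.

import java.util.HashMap;
import java.util.Map;

public class AuditCounter {
    private static final long WINDOW_MS = 10 * 60 * 1000L;

    // topic -> (window start time in ms -> message count)
    private final Map<String, Map<Long, Long>> counts =
        new HashMap<String, Map<Long, Long>>();

    // Record one event, bucketed by the event-header time.
    public synchronized void record(String topic, long eventTimeMs) {
        long windowStart = (eventTimeMs / WINDOW_MS) * WINDOW_MS;
        Map<Long, Long> perWindow = counts.get(topic);
        if (perWindow == null) {
            perWindow = new HashMap<Long, Long>();
            counts.put(topic, perWindow);
        }
        Long c = perWindow.get(windowStart);
        perWindow.put(windowStart, c == null ? 1L : c + 1L);
    }

    // The reconciliation app compares snapshots like this against the
    // counts reported by the Kafka cluster and the ETL tier for the
    // same windows; any shortfall indicates event loss.
    public synchronized Map<String, Map<Long, Long>> snapshot() {
        return new HashMap<String, Map<Long, Long>>(counts);
    }
}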