Spark Streaming makes it easy to build scalable, robust stream processing applications, but only once you’ve made your data accessible to the framework. Spark Streaming solves the real-time data processing problem; to build a large-scale data pipeline, though, we need to combine it with another tool that addresses the data integration challenge. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make importing data into and exporting data out of Kafka easier.
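As a minimal sketch of that combination (assuming the spark-streaming-kafka-0-10 integration), here is Spark Streaming consuming a topic that a Kafka Connect source job might populate; the broker address, topic name, and group id are hypothetical:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object PipelineSketch extends App {
      val conf = new SparkConf().setAppName("pipeline-sketch").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(5))

      val kafkaParams = Map[String, Object](
        "bootstrap.servers"  -> "broker1:9092", // hypothetical broker
        "key.deserializer"   -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id"           -> "spark-pipeline")

      // "ingested-topic" stands in for a topic populated by Kafka Connect.
      val stream = KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq("ingested-topic"), kafkaParams))

      stream.map(_.value).print() // processing logic goes here
      ssc.start()
      ssc.awaitTermination()
    }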
10. Option #1: One-off Tools
• A separate tool for each specific data system
• Examples (one sketched below):
• JdbcRDD, the Cassandra-Spark connector, etc.
• Sqoop, Logstash-to-Kafka, etc.
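A minimal sketch of this per-system approach, using Spark's built-in JdbcRDD; the connection string, table, and the 1..100000 key range are hypothetical:

    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    object OneOffIngest extends App {
      val sc = new SparkContext(new SparkConf().setAppName("one-off").setMaster("local[2]"))

      // One tool per source system: JdbcRDD pulls rows straight from a
      // relational table, partitioning the query by the two "?" bounds.
      val rows = new JdbcRDD(
        sc,
        () => DriverManager.getConnection("jdbc:mysql://db.example.com/shop"),
        "SELECT id, name FROM products WHERE id >= ? AND id <= ?",
        lowerBound = 1, upperBound = 100000, numPartitions = 10,
        mapRow = rs => (rs.getLong(1), rs.getString(2)))

      println(rows.count())
    }

A Cassandra table would need the Cassandra-Spark connector instead, and a log file would need Logstash: each system brings its own tool, configuration, and failure modes.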
29. Delivery Guarantees
• Offsets are automatically committed and restored
• On restart: the task checks its committed offsets and rewinds
• At-least-once delivery: flush data, then commit offsets (sketched below)
• Exactly-once for connectors that support it (e.g. the HDFS connector)
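A minimal sketch of the flush-then-commit pattern as a hypothetical SinkTask. The println is a stand-in for a durable write; a real sink must not return from flush() until the data is safely stored, because the framework commits the supplied offsets as soon as flush() returns:

    import java.util
    import scala.jdk.CollectionConverters._
    import scala.collection.mutable.ArrayBuffer
    import org.apache.kafka.clients.consumer.OffsetAndMetadata
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

    class SketchSinkTask extends SinkTask {
      private val buffer = ArrayBuffer.empty[SinkRecord]

      override def start(props: util.Map[String, String]): Unit = ()

      // Records handed to put() are only staged; nothing is durable yet.
      override def put(records: util.Collection[SinkRecord]): Unit =
        buffer ++= records.asScala

      // Called before the framework commits offsets back to Kafka. A crash
      // before we return means the same records are redelivered on restart:
      // at-least-once delivery.
      override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit = {
        buffer.foreach(r => println(s"${r.topic}/${r.kafkaPartition}@${r.kafkaOffset}: ${r.value}"))
        buffer.clear()
      }

      override def stop(): Unit = ()
      override def version(): String = "sketch"
    }

Exactly-once connectors such as the HDFS sink go one step further: they store consumed offsets atomically with the data in the destination system instead of relying on Kafka's offset commit.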
30. Format Converters
• Serialization is abstracted away, so connectors stay format-agnostic
• Convert between the Kafka Connect Data API (what connectors see) and serialized bytes (see the sketch below)
• JSON and Avro currently supported
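A minimal sketch of a round trip through the bundled JsonConverter; the topic name and the page_view schema are made up for illustration:

    import scala.jdk.CollectionConverters._
    import org.apache.kafka.connect.data.{Schema, SchemaBuilder, Struct}
    import org.apache.kafka.connect.json.JsonConverter

    object ConverterRoundTrip extends App {
      // Connectors only ever see Structs and Schemas from the Data API;
      // the configured Converter handles the bytes on the wire.
      val converter = new JsonConverter
      converter.configure(Map("schemas.enable" -> "true").asJava, false) // false = value converter

      val schema: Schema = SchemaBuilder.struct().name("page_view")
        .field("url", Schema.STRING_SCHEMA)
        .field("count", Schema.INT64_SCHEMA)
        .build()
      val value = new Struct(schema).put("url", "/index").put("count", 42L)

      val bytes = converter.fromConnectData("views", schema, value) // Data API -> bytes
      val back  = converter.toConnectData("views", bytes)           // bytes -> SchemaAndValue
      println(new String(bytes, "UTF-8"))
      println(back.value)
    }

Swapping JSON for Avro is a configuration change, not a connector change: that is the point of keeping converters separate from connectors.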
34. THANK YOU!
Guozhang Wang | guozhang@confluent.io | @guozhangwang
Confluent – Afternoon Break Sponsor for Spark Summit
• Jay Kreps – I Heart Logs book signing and giveaway
• 3:45pm – 4:15pm in Golden Gate
Kafka Training with Confluent University
• Kafka Developer and Operations Courses
• Visit www.confluent.io/training
Want more Kafka?
• Download Confluent Platform Enterprise (incl. Kafka Connect) at http://www.confluent.io/product
• Apache Kafka 0.10 upgrade documentation at http://docs.confluent.io/3.0.0/upgrade.html