This document discusses applying chaos engineering principles to complex data systems. It recommends defining a steady state, acknowledging real-world events, running manual experiments in production, and automating production experiments. This helps discover weaknesses and manage the data lifecycle through stages like development, experimentation, deployment, and production. It also presents the idea of using a "Git for Data" system like lakeFS to version data and enable features like branching, rolling back, and atomic updates to more easily develop, test, and recover from errors in data pipelines.
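The branch-and-merge pattern described above can be sketched in plain Python. This is a toy simulation of the semantics (isolated branches, atomic promotion on merge), not the actual lakeFS API; all names here are hypothetical.

```python
import copy

class VersionedStore:
    """Toy illustration of 'Git for Data' semantics: branches are
    isolated snapshots; a merge publishes the branch atomically."""

    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {path: data}

    def branch(self, name, source="main"):
        # A new branch starts as a snapshot of the source branch
        self.branches[name] = copy.deepcopy(self.branches[source])

    def write(self, branch, path, data):
        self.branches[branch][path] = data

    def merge(self, source, target="main"):
        # Atomic promotion: readers of `target` see all changes at once
        self.branches[target] = copy.deepcopy(self.branches[source])

store = VersionedStore()
store.write("main", "events/2021-01-01", ["a", "b"])
store.branch("experiment")                       # isolate the pipeline run
store.write("experiment", "events/2021-01-02", ["c"])
# main is untouched until the merge, so a bad run is easy to discard
assert "events/2021-01-02" not in store.branches["main"]
store.merge("experiment")                        # publish atomically
assert store.branches["main"]["events/2021-01-02"] == ["c"]
```

Rolling back is the same idea in reverse: throw the branch away instead of merging it.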
2. “Don’t make the assumption that things in Spark just work, there is a good chance that Spark underneath the hood is going to do something unexpected”
Holden Karau
Apache Spark PMC
2017, BeeScala
@adipolak
3. What can go wrong with Data flows?
EVERYTHING
● new component logic
● new data source
● introducing an incompatible schema change
● Kafka broker failed to validate a record
● Spark job runs twice, in parallel
● changing tables’ relationship keys
● accidentally deleting yesterday’s `events/` partition
● data duplication
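Several of these failures, such as an incompatible schema change, can be caught with a cheap guard before records enter the pipeline. A minimal sketch, assuming a hypothetical field-to-type contract:

```python
# Hypothetical schema contract for incoming records
EXPECTED_SCHEMA = {"id": int, "amount": float, "ts": str}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of violations instead of failing mid-pipeline."""
    problems = []
    for field, ftype in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    return problems

ok = validate_record({"id": 1, "amount": 9.99, "ts": "2021-06-01"})
# Upstream change: `id` became a string and `amount` was dropped
bad = validate_record({"id": "1", "ts": "2021-06-01"})
assert ok == []
assert "missing field: amount" in bad
```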
4. Testing is hard!
Distributed data systems with many moving parts
Unit testing
Integration testing
System testing
E2E …
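Even when end-to-end testing of a distributed system is hard, the pure transformation logic can still be unit tested in isolation. A sketch with a hypothetical dedupe step:

```python
def dedupe_by_key(rows, key):
    """Keep the last record seen per key -- a typical small, pure
    transformation that is cheap to unit test without a cluster."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())

# Unit test: no Spark cluster needed for the core logic
rows = [{"id": 1, "v": "old"}, {"id": 2, "v": "x"}, {"id": 1, "v": "new"}]
result = dedupe_by_key(rows, "id")
assert len(result) == 2
assert {"id": 1, "v": "new"} in result
```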
5. Chaos Engineering in the
World of Large-Scale Complex
Data Flow
Adi Polak
Treeverse
6. 4 Principles:
• Defining a steady-state
• Acknowledging a variety of real-world events
• Running manual experiments in production
• Automating production experiments
Chaos Engineering
7. #1 - Steady state
• system’s throughput
• error rates
• latency percentiles
• etc
DevOps / SRE
• Data product requirements
Data Engineer
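A steady state like the one above can be encoded as a simple predicate over observed metrics. The thresholds below are illustrative assumptions, not recommendations:

```python
import math

def steady_state(latencies_ms, errors, requests,
                 p99_budget_ms=250.0, max_error_rate=0.01):
    """Check observed metrics against a steady-state definition:
    p99 latency within budget AND error rate within tolerance."""
    ordered = sorted(latencies_ms)
    idx = min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)
    p99 = ordered[idx]
    error_rate = errors / requests
    return p99 <= p99_budget_ms and error_rate <= max_error_rate

healthy = steady_state([10, 12, 15, 200], errors=1, requests=1000)
degraded = steady_state([10, 12, 15, 900], errors=1, requests=1000)
assert healthy is True
assert degraded is False
```

An experiment that pushes this predicate to `False` has found a weakness worth investigating.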
9. In 3 lines of code..
# Read data from file
df = spark.read.parquet('s3a://bank_transactions/ts=123123123/')
# Some transformation over the data related to finance
updated_df = df…  # do some magic!
# Append the result back to the SAME path
updated_df.write.mode('append').parquet('s3a://bank_transactions/ts=123123123/')
Data Duplication
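Why does the snippet duplicate data? The append write is not idempotent: re-running the job (a retry, or the same job scheduled twice in parallel) adds its output on top of the previous run. A toy simulation with a Python list standing in for the partition:

```python
source = [100.0, 250.0, 40.0]    # toy transaction amounts

partition_append = []            # stands in for the append-mode target path

def job_append(rows):
    # mode('append') semantics: every run adds its output again
    partition_append.extend(r * 1.1 for r in rows)

job_append(source)
job_append(source)               # retry / duplicate scheduled run
assert len(partition_append) == 6    # rows duplicated!

# Idempotent alternative, akin to overwriting the partition
partition_overwrite = []

def job_overwrite(rows):
    partition_overwrite[:] = [r * 1.1 for r in rows]

job_overwrite(source)
job_overwrite(source)
assert len(partition_overwrite) == 3  # re-running is safe
```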
10. #2 - Vary Real-world Events
hardware failures like:
• servers dying
• spike in traffic
DevOps / SRE
• schema change
• corrupted data record
• data variance
• accidentally delete yesterday’s `events/` partition
Data Engineer
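One way to vary such events in an experiment is deterministic fault injection: corrupt a seeded random fraction of records and check that the pipeline survives. A minimal sketch (all names hypothetical):

```python
import random

def inject_faults(records, seed=42, corrupt_rate=0.2):
    """Chaos-style fault injection for a data pipeline test:
    deterministically corrupt a fraction of records."""
    rng = random.Random(seed)        # seeded, so the experiment is repeatable
    out = []
    for rec in records:
        if rng.random() < corrupt_rate:
            out.append({**rec, "amount": None})  # simulate a corrupted field
        else:
            out.append(rec)
    return out

def robust_sum(records):
    # The pipeline under test must tolerate the injected corruption
    return sum(r["amount"] for r in records if r["amount"] is not None)

records = [{"amount": 10}] * 100
faulty = inject_faults(records)
assert robust_sum(faulty) < 1000     # some records were corrupted
assert robust_sum(faulty) > 0        # but the pipeline still produces output
```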
14. #4 – Automate Experiments to Run Continuously in Prod & Minimize Blast Radius
• Discover weaknesses
DevOps / SRE
• Manage data lifecycle in stages
Data Engineer
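An automated experiment loop that also minimizes the blast radius might first apply the fault to a small sample, and only touch the full dataset once the pipeline survives the sample. A sketch under those assumptions (all names hypothetical):

```python
def run_experiment(dataset, fault, pipeline, sample_size=10):
    """Automate a chaos experiment while limiting the blast radius:
    try the fault on a small sample before the full dataset."""
    sample = dataset[:sample_size]
    try:
        pipeline(fault(sample))
    except Exception:
        # Weakness found cheaply; the full dataset was never touched
        return "aborted: weakness found on sample"
    pipeline(fault(dataset))
    return "completed on full dataset"

def drop_every_other(rows):          # the injected real-world event
    return rows[::2]

def pipeline(rows):                  # toy pipeline: must see non-empty input
    if not rows:
        raise ValueError("empty input")
    return sum(rows)

result = run_experiment(list(range(100)), drop_every_other, pipeline)
assert result == "completed on full dataset"
```

In a continuously running setup, this loop would be scheduled against production with the sample-first gate as the safety valve.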