Data Pipeline Challenges and Architectures

This deck discusses the challenges of building a data pipeline that is highly scalable and highly available, with low latency and zero data loss, while supporting multiple data sources. It covers the expectations for real-time versus batch processing and surveys stream and batch architectures built on tools such as Apache Storm, Apache Spark, and Kafka. It also examines the challenges of data replication, schema detection, and transformations, with particular attention to NoSQL sources. Effective implementations should include monitoring, security, and replay mechanisms. Finally, it presents the Lambda and Kappa architectures: Lambda runs separate stream and batch layers, while Kappa relies on stream processing alone.

Manish Singh
Engineer at Hevo
https://linkedin.com/in/manishsingh123/
Challenges in Building a Data Pipeline

● Data Pipeline
● Possible Implementations
● Challenges
● Data Processing Architectures
Agenda

● Highly scalable
● Highly available
● Low latency
● Zero data loss
● Support for multiple data sources (e.g. MySQL, NoSQL, Mixpanel, Analytics)
● Instrumentation, monitoring, and alerting
● Real-time vs Batch
Expectations

Stream
● Usages: Live dashboards (count, average), rate limiting, triggers
● Processing: Apache Storm, Apache Spark, Apache Samza
● Store: Elasticsearch, Druid, Spark SQL, Kafka SQL
Batch
● Batch processing and pre-computation
● Immutable Store: HDFS, Cassandra, Event Stream to S3
● Data Warehouse: HBase, Hive, Redshift, Postgres
Stream vs Batch
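
To make the contrast concrete, the following is a minimal sketch of the "count, average" dashboard use case computed both ways. The `RunningStats` class, the `batch_stats` function, and the sample values are hypothetical names chosen for illustration; a real deployment would use one of the engines listed above rather than plain Python.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count and average maintained incrementally (stream style)."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        # Each event is folded into the state as soon as it arrives.
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_stats(values: list[float]) -> tuple[int, float]:
    """Recompute count and average from the full stored data set (batch style)."""
    count = len(values)
    return count, (sum(values) / count if count else 0.0)


events = [120.0, 80.0, 100.0]  # hypothetical response times in ms

stream = RunningStats()
for value in events:
    stream.update(value)  # result is available after every event

batch_count, batch_avg = batch_stats(events)  # result is available after the full pass
```

Both paths arrive at the same numbers; the difference is when the answer becomes available and how much data has to be re-read to get it.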

● ETL (Extract -> Transform -> Load)
● ELT (Extract -> Load -> Transform)
ETL vs ELT

● Complexity of transformation logic compromises latency
● Hardware systems today are better equipped
● Efficient, reduces load time
● Cost effective in the cloud, fewer components required
Moving from traditional ETL to ELT
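
As a rough illustration of the ELT ordering, the sketch below loads rows untouched and only then transforms them with the destination's own SQL engine. It uses an in-memory SQLite database as a stand-in for a warehouse; the table names, columns, and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the destination warehouse

# Extract: rows as they come from a hypothetical source system.
source_rows = [
    (1, "alice@example.com", "1250"),
    (2, "BOB@example.com", "300"),
]

# Load: copy the rows into the destination as-is, with no transformation.
conn.execute("CREATE TABLE raw_orders (id INTEGER, email TEXT, amount_cents TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: runs after loading, inside the destination.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT id,
           lower(email) AS email,
           CAST(amount_cents AS INTEGER) / 100.0 AS amount
    FROM raw_orders
    """
)

orders = conn.execute("SELECT id, email, amount FROM orders ORDER BY id").fetchall()
```

Because the raw table is kept, the transformation can be changed and rerun later without extracting from the source again.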

● Query the source DB and keep an offset (ID, updated timestamp)
● Database change logs (e.g. MySQL binlogs, MongoDB oplog)
Replication Modes
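
The first mode can be sketched as a polling loop that remembers the last (updated timestamp, ID) pair it has seen. This is a minimal sketch against an in-memory SQLite table; the `users` table, its columns, and the `fetch_changes` helper are hypothetical.

```python
import sqlite3


def fetch_changes(conn, last_offset, batch_size=100):
    """Return rows changed after last_offset, plus the new offset to persist."""
    last_ts, last_id = last_offset
    rows = conn.execute(
        """
        SELECT id, name, updated_at FROM users
        WHERE updated_at > ? OR (updated_at = ? AND id > ?)
        ORDER BY updated_at, id
        LIMIT ?
        """,
        (last_ts, last_ts, last_id, batch_size),
    ).fetchall()
    if rows:
        # The offset is the (updated_at, id) of the last row returned.
        last_offset = (rows[-1][2], rows[-1][0])
    return rows, last_offset


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "ann", 10), (2, "bob", 10), (3, "cid", 20)],
)

offset = (0, 0)  # (updated_at, id) of the last row already replicated
first, offset = fetch_changes(conn, offset, batch_size=2)
second, offset = fetch_changes(conn, offset, batch_size=2)

# An update bumps updated_at, so the row is picked up again on the next poll.
conn.execute("UPDATE users SET name = 'anna', updated_at = 30 WHERE id = 1")
third, offset = fetch_changes(conn, offset, batch_size=2)
```

Using the ID as a tie-breaker avoids skipping rows that share a timestamp across batch boundaries. Polling of this kind cannot see hard deletes, which is one reason to read the database change log instead.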

● New fields can be added to a source at any point in time
● Character lengths of String columns in source can increase
● Data Type incompatibility between Source and Destination
● Varying type casting
● Data loss during loads: power failure, server failure, code bugs, etc.
Challenges
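
The first two challenges (new fields and growing string lengths) suggest checking every incoming record against the destination schema before loading it. The sketch below assumes a deliberately simplified model in which every destination column is a string with a maximum length; `plan_schema_changes` and the change tuples are hypothetical.

```python
def plan_schema_changes(dest_schema, record):
    """List the schema changes needed before `record` can be loaded.

    dest_schema maps column name -> maximum string length (simplified model).
    """
    changes = []
    for field, value in record.items():
        needed = len(str(value))
        if field not in dest_schema:
            # The source gained a field the destination has never seen.
            changes.append(("add_column", field, needed))
        elif needed > dest_schema[field]:
            # The value no longer fits the destination column.
            changes.append(("widen_column", field, needed))
    return changes


dest_schema = {"name": 5, "city": 10}
record = {"name": "Alexander", "city": "Pune", "plan": "pro"}

changes = plan_schema_changes(dest_schema, record)
```

Here `changes` lists one widening for `name` and one new column for `plan`; applying them before the load avoids truncation or a rejected insert.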

● Schema detection cannot be done upfront
● Different documents in a single collection can have a different set of fields
● Different documents in a single collection can have incompatible field data types
● Nested objects and arrays with a dynamic structure
Additional Challenges with NoSQL
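
Since the schema cannot be known upfront, one possible approach is to infer it from the documents as they arrive. The sketch below flattens nested objects into dotted keys and records every type seen per field, so incompatible types show up as conflicts. The `flatten` and `infer_schema` helpers and the sample documents are hypothetical, and arrays are kept as single values for brevity.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted keys; arrays are kept as single values."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def infer_schema(docs):
    """Map each flattened field to the set of type names seen for it."""
    schema = {}
    for doc in docs:
        for path, value in flatten(doc).items():
            schema.setdefault(path, set()).add(type(value).__name__)
    return schema


docs = [
    {"id": 1, "age": 30, "address": {"city": "Pune"}},
    {"id": "2", "tags": ["a", "b"]},  # different fields, and id is a string here
]

schema = infer_schema(docs)
conflicts = sorted(field for field, types in schema.items() if len(types) > 1)
```

Fields that end up in `conflicts` need a casting rule (for example, widening to string) before they can be mapped to a single destination column.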

● Transformations
● Security (Filter, Hashing)
● Replay Mechanism
● Integrity and Anomaly Detection
● Monitoring and Alerts for failures
● Activity Log
Effective Implementations
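
The "Security (Filter, Hashing)" item could look roughly like the sketch below: some fields are dropped before the record leaves the pipeline, and others are replaced by a keyed hash. The `secure_record` function, the field names, and the use of HMAC-SHA256 with a demo key are assumptions for illustration, not a description of any particular product.

```python
import hashlib
import hmac


def secure_record(record, drop_fields, hash_fields, secret):
    """Drop filtered fields and replace sensitive ones with a keyed hash."""
    out = {}
    for field, value in record.items():
        if field in drop_fields:
            continue  # filter: the field never reaches the destination
        if field in hash_fields:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            value = digest.hexdigest()
        out[field] = value
    return out


record = {"id": 1, "email": "a@example.com", "password": "hunter2"}
masked = secure_record(
    record,
    drop_fields={"password"},
    hash_fields={"email"},
    secret=b"demo-secret",  # hypothetical key; load from a secret store in practice
)
```

A keyed hash is deterministic, so the hashed column can still be used for joins and distinct counts without exposing the original value.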

● "How to beat the CAP theorem" by Nathan Marz
● Different layers for stream and batch processing
● Need to manage two different layers of the system
Lambda Architecture
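
A minimal sketch of the two layers, assuming a page-view counting use case: the batch layer recomputes a view from the stored events, the speed layer counts only events that arrived after the last batch run, and a query merges both. The function names and sample events are hypothetical.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer: recompute the view from the full stored data set."""
    return Counter(master_events)


def build_speed_view(recent_events):
    """Speed layer: count only events not yet covered by a batch run."""
    view = Counter()
    for event in recent_events:
        view[event] += 1
    return view


def query(batch_view, speed_view, page):
    """Serving side: merge both views at query time."""
    return batch_view[page] + speed_view[page]


master_events = ["home", "home", "pricing"]  # already processed by the batch layer
recent_events = ["home", "docs"]             # arrived after the last batch run

batch_view = build_batch_view(master_events)
speed_view = build_speed_view(recent_events)
home_views = query(batch_view, speed_view, "home")
```

The counting logic exists twice, once per layer, which is the maintenance cost the last bullet refers to.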

● "Questioning the Lambda Architecture" by Jay Kreps
● Only stream processing with parallelism
● Set Kafka retention policy
● Reprocess into separate table
● Switch table when done and delete the old one
Kappa Architecture
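
The reprocessing steps above can be sketched as follows. A Python list stands in for a Kafka topic whose retention is long enough to hold the full history, and an in-memory SQLite database stands in for the serving store; `build_table`, the table names, and the sample log are hypothetical.

```python
import sqlite3


def build_table(conn, table, log, normalize):
    """Run one version of the stream job over the whole retained log."""
    counts = {}
    for page in log:
        key = normalize(page)
        counts[key] = counts.get(key, 0) + 1
    # Table names are trusted constants here; never interpolate user input.
    conn.execute(f"CREATE TABLE {table} (page TEXT PRIMARY KEY, views INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", counts.items())


conn = sqlite3.connect(":memory:")
log = ["Home", "home", "pricing"]  # stand-in for a retained Kafka topic

# Version 1 of the job is case-sensitive and splits "Home" from "home".
build_table(conn, "page_views", log, normalize=lambda page: page)

# Version 2 reprocesses the same log from the start into a separate table.
build_table(conn, "page_views_v2", log, normalize=str.lower)

# Switch tables once reprocessing is done, then delete the old one.
with conn:
    conn.execute("ALTER TABLE page_views RENAME TO page_views_old")
    conn.execute("ALTER TABLE page_views_v2 RENAME TO page_views")
    conn.execute("DROP TABLE page_views_old")

views = dict(conn.execute("SELECT page, views FROM page_views"))
```

Readers keep querying `page_views` throughout; only the contents behind the name change, and there is a single code path to maintain.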

Thank You

Editor's Notes

https://youtu.be/YzAIjEQ75_c?t=6892 Explain Kafka SQL
Yahoo’s Hadoop clusters sorted 1 TB of data in 209 seconds
Petabyte sort using Spark in 4 hours
Lambda - 11th Greek letter
Kappa - 10th Greek letter
