The document discusses the challenges of building a data pipeline that is highly scalable and highly available, with low latency and zero data loss, while supporting multiple data sources. It covers expectations for real-time versus batch processing and explores stream and batch architectures using tools like Apache Storm, Spark and Kafka. Challenges of data replication, schema changes, and schema detection with NoSQL sources are also examined. Effective implementations should include transformations, monitoring, security and replay mechanisms. Finally, the Lambda architecture, which keeps separate layers for stream and batch processing, and the Kappa architecture, which relies on stream processing alone, are presented.
Data Pipeline Challenges and Architectures
1. Manish Singh
Engineer at Hevo
https://linkedin.com/in/manishsingh123/
Challenges in Building a Data Pipeline
2. ● Data Pipeline
● Possible Implementations
● Challenges
● Data Processing Architectures
Agenda
3. ● Highly scalable
● Highly available
● Low latency
● Zero data loss
● Support for multiple data sources (e.g. MySQL, NoSQL, Mixpanel, Analytics)
● Instrumentation, monitoring, and alerting
● Real-time vs Batch
Expectations
7. ● Complexity of transformation logic compromises latency
● Hardware systems today are better equipped
● Efficient, reduces load time
● Cost-effective in the cloud, fewer components required
Moving from traditional ETL to ELT
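A minimal sketch of the ELT idea, assuming SQLite stands in for the destination warehouse: raw rows are loaded untouched, and the transformation runs afterwards as SQL inside the destination. The table names raw_orders and orders_clean and both function names are hypothetical, not from the talk.

```python
import sqlite3

def load_raw(conn, rows):
    """Load step: store source values as-is (all text), with no transformation."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, amount TEXT, currency TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.commit()

def transform_in_destination(conn):
    """Transform step: runs after loading, as SQL inside the destination."""
    conn.execute("DROP TABLE IF EXISTS orders_clean")
    conn.execute(
        "CREATE TABLE orders_clean AS "
        "SELECT CAST(order_id AS INTEGER) AS order_id, "
        "       CAST(amount AS REAL) AS amount, "
        "       UPPER(TRIM(currency)) AS currency "
        "FROM raw_orders"
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_raw(conn, [("1", "10.50", " usd "), ("2", "3", "eur")])
    transform_in_destination(conn)
    for row in conn.execute("SELECT * FROM orders_clean ORDER BY order_id"):
        print(row)
```

Because the load step does no work beyond copying, the transformation can be changed and rerun later without going back to the source.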
8. ● Query Source DB and keep offset (ID, Updated timestamp)
● Database change logs (e.g. MySQL binlogs, MongoDB oplog)
Replication Modes
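The first replication mode above (query the source and keep an offset) can be sketched as follows. This is an illustration only, assuming SQLite as the source and a users table with an integer updated_at column; the table, the column names, and the function names are hypothetical. The offset is the pair (updated_at, id), so rows sharing a timestamp are not skipped at a batch boundary.

```python
import sqlite3

def make_source():
    """Build a small in-memory source table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, "ann", 100), (2, "bob", 100), (3, "cy", 105)],
    )
    conn.commit()
    return conn

def fetch_changes(conn, offset, batch_size=2):
    """Return rows changed after `offset` and the new offset to persist.

    `offset` is (updated_at, id) of the last row already replicated.
    """
    ts, last_id = offset
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
        "ORDER BY updated_at, id LIMIT ?",
        (ts, ts, last_id, batch_size),
    ).fetchall()
    if rows:
        # Advance the offset to the last row of this batch.
        offset = (rows[-1][2], rows[-1][0])
    return rows, offset

if __name__ == "__main__":
    source = make_source()
    offset = (0, 0)
    while True:
        batch, offset = fetch_changes(source, offset)
        if not batch:
            break
        print(batch, offset)
```

A query-based approach like this one only sees rows that still exist, so deletes in the source go unnoticed; reading the database change log is the usual way to capture them.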
9. ● New fields can be added to a source at any point in time
● Character lengths of String columns in source can increase
● Data Type incompatibility between Source and Destination
● Varying type casting
● Data loss during loads: power failure, server failure, code bugs, etc.
Challenges
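One way to handle the first four challenges above is to compare every incoming record against the destination schema and plan the changes before loading. The sketch below is an assumption-laden illustration: the schema layout, the type names, the widening rule (integer to float, anything else to string), and all function names are hypothetical.

```python
def infer_type(value):
    """Map a Python value to a simple destination type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    return "string"

def widen(current, incoming):
    """Pick a type that can hold both; fall back to string when unsure."""
    if current == incoming:
        return current
    if {current, incoming} == {"integer", "float"}:
        return "float"
    return "string"

def plan_schema_changes(schema, record):
    """Update `schema` in place and return the list of changes needed.

    `schema` maps column name -> {"type": str, "length": int or None}.
    Note: when a column is widened to string, the length is based only on
    the incoming value; existing values are not inspected in this sketch.
    """
    changes = []
    for name, value in record.items():
        if value is None:
            continue  # a null says nothing about the type
        incoming = infer_type(value)
        col = schema.get(name)
        if col is None:
            col = schema[name] = {"type": incoming, "length": None}
            changes.append(("add_column", name, incoming))
        target = widen(col["type"], incoming)
        if target != col["type"]:
            col["type"] = target
            changes.append(("change_type", name, target))
        if target == "string":
            needed = len(str(value))
            if (col["length"] or 0) < needed:
                col["length"] = needed
                changes.append(("widen_length", name, needed))
    return changes

if __name__ == "__main__":
    schema = {
        "id": {"type": "integer", "length": None},
        "name": {"type": "string", "length": 3},
    }
    record = {"id": 1.5, "name": "alice", "email": "a@b.c", "note": None}
    for change in plan_schema_changes(schema, record):
        print(change)
```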
10. ● Schema detection cannot be done upfront
● Different documents in a single collection can have a different set of fields
● Different documents in a single collection can have incompatible field data types
● Nested objects and arrays with a dynamic structure
Additional Challenges with NoSQL
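Since the schema cannot be known upfront, it has to be discovered from the documents themselves. A minimal sketch, assuming documents arrive as Python dicts: nested objects are flattened into dotted paths, arrays are kept as JSON text, and the set of types seen per path is recorded so that conflicts become visible. The function names and the dotted-path convention are assumptions for illustration.

```python
import json

def flatten(doc, prefix=""):
    """Flatten nested objects into dotted paths; keep arrays as JSON text."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        elif isinstance(value, list):
            flat[path] = json.dumps(value)
        else:
            flat[path] = value
    return flat

def infer_schema(docs):
    """Return {path: set of Python type names seen across all documents}."""
    schema = {}
    for doc in docs:
        for path, value in flatten(doc).items():
            if value is None:
                continue
            schema.setdefault(path, set()).add(type(value).__name__)
    return schema

def conflicts(schema):
    """Paths whose values had more than one type across documents."""
    return sorted(path for path, types in schema.items() if len(types) > 1)

if __name__ == "__main__":
    docs = [
        {"_id": 1, "user": {"name": "a", "age": 30}},
        {"_id": "x2", "user": {"name": "b"}, "tags": ["p", "q"]},
    ]
    schema = infer_schema(docs)
    print(schema)
    print(conflicts(schema))
```

Storing arrays as JSON text is only one option; splitting them into a child table is another, at the cost of more destination tables to manage.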
11. ● Transformations
● Security (Filter, Hashing)
● Replay Mechanism
● Integrity and Anomaly Detection
● Monitoring and Alerts for failures
● Activity Log
Effective Implementations
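The "Security (Filter, Hashing)" item can be illustrated with a small per-record transformation: some fields are dropped entirely and others are replaced by a keyed hash before the record leaves the pipeline. This is a sketch under assumptions; the function name, the field names, and the choice of HMAC-SHA256 are illustrative and not taken from the talk.

```python
import hashlib
import hmac

def secure_record(record, drop_fields, hash_fields, key):
    """Return a copy of `record` with sensitive fields filtered or hashed.

    drop_fields: field names removed from the output (filter).
    hash_fields: field names replaced by an HMAC-SHA256 hex digest (hashing).
    key: secret bytes; the same key always yields the same digest, so
         hashed values can still be used for joins and deduplication.
    """
    out = {}
    for name, value in record.items():
        if name in drop_fields:
            continue
        if name in hash_fields and value is not None:
            data = str(value).encode("utf-8")
            out[name] = hmac.new(key, data, hashlib.sha256).hexdigest()
        else:
            out[name] = value
    return out

if __name__ == "__main__":
    record = {"id": 1, "email": "a@b.c", "password": "secret"}
    print(secure_record(record, {"password"}, {"email"}, b"demo-key"))
```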
14. ● "How to beat the CAP theorem" by Nathan Marz
● Different layers for stream and batch processing
● Need to manage two different layers of the system
Lambda Architecture
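The two layers can be sketched with page-view counts, assuming events are dicts with a page and a timestamp: a batch view covers everything before a cutoff, a real-time view covers everything since, and a query merges the two. All names here are hypothetical, and real deployments would run each layer on a separate system, which is the management overhead noted above.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch layer: recomputed periodically over all events before `cutoff`."""
    return Counter(e["page"] for e in events if e["ts"] < cutoff)

def realtime_view(events, cutoff):
    """Stream layer: covers events the last batch run has not seen yet."""
    return Counter(e["page"] for e in events if e["ts"] >= cutoff)

def query(page, batch, realtime):
    """Answer a query by merging both views."""
    return batch[page] + realtime[page]

if __name__ == "__main__":
    events = [
        {"page": "home", "ts": 1},
        {"page": "home", "ts": 5},
        {"page": "docs", "ts": 7},
    ]
    cutoff = 5  # time of the last completed batch run
    batch = batch_view(events, cutoff)
    realtime = realtime_view(events, cutoff)
    print(query("home", batch, realtime), query("docs", batch, realtime))
```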
16. ● "Questioning the Lambda Architecture" by Jay Kreps
● Only stream processing with parallelism
● Set Kafka retention policy
● Reprocess into separate table
● Switch table when done and delete the old one
Kappa Architecture
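The reprocessing steps above can be sketched as follows, assuming a plain Python list stands in for the retained Kafka log and SQLite stands in for the serving database. The table name page_counts, the _new and _old suffixes, and the function names are hypothetical.

```python
import sqlite3

def table_exists(conn, name):
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None

def reprocess(conn, log, transform, live="page_counts"):
    """Replay the whole log with new logic into a separate table, then swap."""
    staging, old = f"{live}_new", f"{live}_old"
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"DROP TABLE IF EXISTS {old}")
    conn.execute(f"CREATE TABLE {staging} (page TEXT PRIMARY KEY, views INTEGER)")
    # Reprocess: the same streaming logic, applied from the start of the log.
    for event in log:
        page = transform(event)
        conn.execute(f"INSERT OR IGNORE INTO {staging} VALUES (?, 0)", (page,))
        conn.execute(
            f"UPDATE {staging} SET views = views + 1 WHERE page = ?", (page,)
        )
    # Switch tables once the replay is done, then delete the old one.
    if table_exists(conn, live):
        conn.execute(f"ALTER TABLE {live} RENAME TO {old}")
    conn.execute(f"ALTER TABLE {staging} RENAME TO {live}")
    conn.execute(f"DROP TABLE IF EXISTS {old}")
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    log = [{"url": "/Home"}, {"url": "/home"}, {"url": "/docs"}]
    reprocess(conn, log, lambda e: e["url"])          # first version of the logic
    reprocess(conn, log, lambda e: e["url"].lower())  # fixed logic, full replay
    print(conn.execute("SELECT page, views FROM page_counts").fetchall())
```

Replaying only works for as long as the log still holds the events, which is why the retention policy appears in the list above.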