Data Integration

An overview of data integration, from ingestion to processing and architecture.

Published in: Engineering

  1. Data Integration
  2. Contents: 1. Introduction; 2. Data Ingestion; 3. Data Processing; 4. Data Architectures; 5. Workshop
  3. Part 1: Introduction
  4. Data Needs: vision, products, data science, data access, data infrastructure
  5. Data Sources: relational DBs, log files, search indexes, NoSQL DBs, message queue, monitoring
  6. Data Warehouse Ingestion: several ETL jobs feeding the Data Warehouse
  7. ETL: Source → Extract → Transform → Load → Sink
  8. Timeline: 1990, Data Warehousing. 2008, Hadoop + MapReduce (drop the relational assumption, programmability, open source). 2015, Kafka + streaming data (batch → real-time, daily → continuous).
  9. Part 2: Data Ingestion. From ETL to ELT: Flume, Sqoop, Kafka
  10. Data Lake Ingestion: Sqoop, Flume, and Kafka (Kafka producers → Kafka → Kafka consumer) → Data Lake
  11. Apache Flume. Flume Agent: Source → Channel Processor (Interceptor #1 ... Interceptor #N) → Channel → Sink. Sources: Avro, Thrift, Kafka, Exec, JMS, Spool dir, Twitter, Netcat, Syslog, HTTP. Sinks: HDFS, Kafka, Hive, Logger, Avro, Thrift, IRC, HBase, Elastic.
  12. Apache Sqoop: the Sqoop tool imports from and exports to an RDBMS
  13. Data Pipeline Problem: inter-process communication channel
  14. Data Pipeline Problem: a publish/subscribe system (metrics pub/sub)
  15. Data Pipeline Problem: multiple publish/subscribe systems (metrics pub/sub, logging pub/sub)
  16. Apache Kafka: a Kafka cluster with Broker 1, Broker 2, Broker 3
  17. Flume + Kafka: Kafka as a reliable Flume channel (Source → Channel → Sink); Flume as Kafka producer/consumer
  18. Part 3: Data Processing
  19. Batch Processing: pageviews [url, timestamp] in the Data Lake → MapReduce → rollups [url, hour, count] → MapReduce → {url+hour : count} in a DB → Data Analysis
  20. Stream Processing (real-time technologies): Data Source (events / DB writes) → Flume / Kafka producer → Event Stream → Process Stream → Output Stream
  21. Part 4: Data Architectures
  22. Data Processing Architecture, components: Data Source, Flume / Kafka producer, Data Lake, Batch Processing, Data Analysis
  23. Data Processing Architecture, adding Stream Processing: Data Source, Flume / Kafka producer, Data Lake, Batch Processing, Stream Processing, Data Analysis
  24. Lambda Architecture: the New Data Stream feeds a Batch Layer (batch processing; precompute views with MapReduce, producing Batch Views) and a Real-Time Layer (stream processing; process the stream and increment views, producing Real-Time Views as partial aggregates of real-time data). The Serving Layer merges both into a Merged View at query time.
  25. Data Processing Architecture, adding a Serving Layer: Data Source, Flume / Kafka producer, Data Lake, Batch Processing, Stream Processing, Serving Layer, Data Analysis
  26. Kappa Architecture (where everything is a stream): New Data Stream → Data Storage (1, 2, 3, ..) → Real-Time Layer: Stream Processing System running Job Version n and Job Version n+1 → Serving Layer: Serving DB with Output Table n and Output Table n+1 → query
  27. Part 5: Workshop
  28. THANKS! Any questions? @datiobd, flasheras@datiobd.com, rbravo@datiobd.com, datio-big-data
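
As a rough illustration of the pageview rollup on slide 19 and the query-time merge on slide 24, the Python sketch below counts pageviews per (url, hour) in a batch pass, keeps a separate incrementally updated real-time view, and sums the two when queried. The names `hour_bucket`, `batch_rollup`, `RealTimeView`, and `query`, the sample data, and the use of Unix-second timestamps are all assumptions made for this example. The slides name MapReduce and a stream processor; in-memory counters stand in for both here.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def hour_bucket(timestamp):
    """Truncate a Unix timestamp (seconds) to the start of its hour."""
    return timestamp - timestamp % SECONDS_PER_HOUR


def batch_rollup(pageviews):
    """Batch layer stand-in: count pageviews per (url, hour).

    The generator plays the role of the map step (emit one key per
    pageview) and Counter plays the role of the reduce step (sum per key).
    """
    return Counter((url, hour_bucket(ts)) for url, ts in pageviews)


class RealTimeView:
    """Real-time layer stand-in: counts incremented one event at a time."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, url, timestamp):
        self.counts[(url, hour_bucket(timestamp))] += 1


def query(batch_view, realtime_view, url, hour):
    """Serving layer stand-in: merge batch and real-time counts at query time."""
    key = (url, hour)
    return batch_view[key] + realtime_view.counts[key]


# Hypothetical historical pageviews already stored in the data lake.
history = [("/home", 0), ("/home", 1800), ("/home", 3600), ("/about", 10)]
batch_view = batch_rollup(history)

# One new event that arrived after the last batch run.
realtime = RealTimeView()
realtime.on_event("/home", 3700)

print(query(batch_view, realtime, "/home", 3600))  # prints 2
```

The merge is a plain sum only because the two views are assumed to cover disjoint events. A real deployment also has to discard real-time counts once a batch run has absorbed the same events, which this sketch leaves out.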
