
Building Data Lakes - A Practical View with Hadoop and Big Data


My presentation on building data lakes for big data using Hadoop as the data platform. Learn more about our consulting and training services in Hortonworks Hadoop, Big Data, Data Warehousing, and Business Intelligence.

Published in: Technology


  1. Data Lakes: a practical view. Marco Garcia, CTO & Founder – Cetax, TutorPro. mgarcia@cetax.com.br | https://www.linkedin.com/in/mgarciacetax/
  2. With more than 20 years of experience in IT, 18 of them exclusively with Business Intelligence, Data Warehousing, and Big Data, Marco Garcia is certified by Kimball University in the USA, where he was taught in person by Ralph Kimball, one of the leading gurus of data warehousing. First Hortonworks Certified Instructor in LATAM. Data Architect and Instructor at Cetax Consultoria.
  3. Data Lake?
  4. Data Lake?
  5. What is intelligence? "The ability to learn or understand or to deal with new or trying situations: reason; also: the skilled use of reason; the ability to apply knowledge to manipulate one's environment or to think abstractly as measured by objective criteria (as tests)." Data Lake?
  6. First citation of the term "Data Lake": October 2010
  7. Data Warehouse vs. Data Lake https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html Bottled water: clean, treated, packaged, ready for consumption. Data lake: raw, untreated, must be processed before it can be consumed.
  8. "Data is the new oil", as the saying went in 2012. Like oil, data needs to be refined! DATA IS THE NEW OIL!
  9. DATA FOR BIG DATA
  10. DATA BY VALIDITY FOR BIG DATA
  11. TOOLS FOR BIG DATA
  12. A COMPLETE ARCHITECTURE FOR BIG DATA? Hadoop!
  13. What is Apache Hadoop? The Apache Hadoop project describes the technology as a software framework that: allows for the distributed processing of large data sets across clusters of computers using simple programming models; is designed to scale up from single servers to thousands of machines, each offering local computation and storage; does not rely on hardware to deliver high availability, but rather the library itself is designed to detect and handle failures at the application layer; delivers a highly available service on top of a cluster of computers, each of which may be prone to failures. Source: http://hadoop.apache.org
  14. Hadoop Core = Storage + Compute. Yet Another Resource Negotiator (YARN) manages compute (CPU, RAM); the Hadoop Distributed File System (HDFS) provides storage across the cluster.
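
To make the storage half of that equation concrete, here is a minimal sketch (not from the deck) that writes a file into HDFS through the standard Hadoop FileSystem Java API; the path and contents are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // so this runs against whatever cluster the client is configured for.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello-datalake.txt"); // illustrative path
            try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
                out.write("hello, data lake".getBytes(StandardCharsets.UTF_8));
            }
            // HDFS replicates blocks transparently; the client just sees a file.
            System.out.println("replication = " + fs.getFileStatus(path).getReplication());
        }
    }
}
```
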
  15. Hadoop Distribution
  16. Distinct Masters and Scale-Out Workers: master node 1 (ZooKeeper, NameNode); master node 2 (ZooKeeper, ResourceManager); master node 3 (ZooKeeper, HiveServer2); utility node 1 (Client Gateway, Knox); utility node 2 (Client Gateway, Ambari Server); and many worker nodes, each running a NodeManager and a DataNode.
  17. What would the data lake look like on Hadoop?
  18. [Architecture diagram] Step 1: Extract & Load: source data (app/system logs, customer/inventory data, transaction/sales data, flat files, Twitter/Facebook streams) is loaded via Sqoop, Flume, NiFi, or Kafka (DB, file, JMS, REST, HTTP, streaming). Step 2: Model/Apply Metadata: HCatalog (table metadata). Step 3: Transform, Aggregate & Materialize: Hive and Pig (data processing). Steps 1-3 are managed as a data lifecycle with Falcon. Step 4a: Publish/Exchange: Sqoop/WebHDFS out to RDBMS, No/New SQL stores (Oracle, HANA), EDW (SAP BW). Step 4b: Explore/Visualize: interactive Hive Server with query/visualization/reporting tools (SAP BO, Tableau, Excel, any JDBC-compliant tool). Step 4c: Analyze: analytical tools (SAS, Python, R, MATLAB). Everything runs on YARN over shared compute & storage, with Knox and Ambari alongside.
  19. Steps to the Data Lake: Step 1: extract and load. Step 2: model and apply metadata. Step 3: transform, aggregate, and materialize the data. Step 4a: publish or send data. Step 4b: explore and visualize. Step 4c: analyze and do data science.
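
Step 1 names Sqoop, Flume, NiFi, and Kafka as loading tools. As an illustration of the streaming side only, the sketch below publishes one event with the standard Kafka producer API; the broker address, topic name, and payload are hypothetical, and a downstream flow (NiFi, Flume, or a Kafka-to-HDFS connector) would land the records in the lake.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SalesEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One sales event keyed by order id; close() flushes pending sends.
            producer.send(new ProducerRecord<>("sales-events", "order-123",
                    "{\"item\":\"widget\",\"qty\":2}"));
        }
    }
}
```
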
  20. How to Structure and Create the Data Lake
  21. Fundamental Points: align the data lake with the organizational structure; create zones in the data lake (ingest zone, transformation zone, presentation zone); define data ingestion processes; security; data lineage; understand the business needs; integrations will be necessary!
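
As a sketch of the zoning idea above, the following snippet lays out hypothetical ingest/transformation/presentation directories on HDFS with different owners and permissions; all paths, users, groups, and modes are illustrative, and setOwner requires HDFS superuser privileges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

import java.io.IOException;

public class CreateZones {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            createZone(fs, "/data/ingest",         "etl",      (short) 0770); // ETL tools only
            createZone(fs, "/data/transformation", "etl",      (short) 0770);
            createZone(fs, "/data/presentation",   "analysts", (short) 0750); // business users read
        }
    }

    private static void createZone(FileSystem fs, String dir, String group, short mode)
            throws IOException {
        Path p = new Path(dir);
        fs.mkdirs(p, new FsPermission(mode));
        fs.setOwner(p, "datalake", group); // requires HDFS superuser
    }
}
```
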
  22. Logical Structure of the Organization: align the structure by functions, not by departments or teams; organizations change, but functions are almost always similar. Think of it as a long-term investment. Always stay alert to internal and even external regulations and controls. Think of the data lake in layers.
  23. What to Store? EVERYTHING!
  24. HDFS layer: Landing zone. Data is written into the landing zone (Sqoop, HDF, Flume, ...) in RAW format. Security: it contains PII, so the landing zone uses HDFS TDE for data protection; only ETL tools access this layer, and human access is restricted to data wranglers. Data retention is limited (< 1 month). Flow: RDBMS into Landing via Sqoop/NiFi.
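
A minimal sketch of enforcing that "< 1 month" retention rule (the later slides delegate retention to Falcon in practice): list the landing directory and delete files older than 30 days. The path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandingZoneRetention {
    public static void main(String[] args) throws Exception {
        long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000; // ~1 month ago
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            for (FileStatus st : fs.listStatus(new Path("/data/landing"))) { // illustrative path
                if (st.isFile() && st.getModificationTime() < cutoff) {
                    fs.delete(st.getPath(), false); // non-recursive delete of one file
                }
            }
        }
    }
}
```
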
  25. HDFS layer: Archive layer. Data is compressed into large files using Hadoop Archives (har), which solves the small-files problem. Data is automatically removed; the retention policy is managed via Falcon. Security: the archive zone uses HDFS TDE for data protection, and only a limited set of users can access it. HDFS tiering. Flow: Landing into Archive.
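
The slide's tool for this is Hadoop Archives (the `hadoop archive` CLI). As an alternative illustration of the same small-files idea, this sketch packs many small landing files into one SequenceFile keyed by file name; the paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path archive = new Path("/data/archive/2018-01.seq"); // illustrative path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(archive),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus st : fs.listStatus(new Path("/data/landing/done"))) {
                byte[] buf = new byte[(int) st.getLen()];
                try (FSDataInputStream in = fs.open(st.getPath())) {
                    in.readFully(buf);
                }
                // key = original file name, value = raw file bytes
                writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
            }
        }
    }
}
```
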
  26. HDFS layer: Presentation layer. Data moves from the landing zone into this layer; it is cleaned as part of ETL and stored in optimized file formats (ORC, Parquet, Avro, ...). Multiple copies of the same dataset exist depending on the use case: raw data stored in an optimized file format; tokenized, normalized, data marts, ... Security: sensitive data is tokenized; business users access this layer. Flow: Landing, Archive, Presentation.
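
A minimal sketch of writing one of those optimized formats, using the ORC core writer API with a hypothetical two-column schema; in practice this step is usually done by Hive or Pig rather than hand-written Java.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

import java.nio.charset.StandardCharsets;

public class WriteOrc {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema for a presentation-layer dataset.
        TypeDescription schema =
                TypeDescription.fromString("struct<customer_id:bigint,name:string>");
        Writer writer = OrcFile.createWriter(
                new Path("/data/presentation/customers.orc"), // illustrative path
                OrcFile.writerOptions(new Configuration()).setSchema(schema));

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector ids = (LongColumnVector) batch.cols[0];
        BytesColumnVector names = (BytesColumnVector) batch.cols[1];

        int row = batch.size++;          // append one row to the batch
        ids.vector[row] = 42L;
        byte[] name = "alice".getBytes(StandardCharsets.UTF_8);
        names.setVal(row, name, 0, name.length);

        writer.addRowBatch(batch);
        writer.close();
    }
}
```
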
  27. Multi-tenant environment: Development and test layer. Third-party tools move data from landing into the dev & test zones. PII is encrypted using a third-party solution: one-way tokenization; data is tokenized consistently, which enables joins between different datasets. Benefit: development is done against realistic datasets (in volume and format), and the data science team gets access. Flow: Landing into Dev/Test/...
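
One-way tokenization can be sketched with a keyed hash: HMAC-SHA256 maps the same input and key to the same token, so datasets tokenized with the same key still join, but the original value cannot be recovered. The key and sample value below are illustrative, not the third-party tool the slide refers to.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class OneWayTokenizer {
    private final Mac mac;

    public OneWayTokenizer(byte[] secretKey) throws Exception {
        this.mac = Mac.getInstance("HmacSHA256");
        this.mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
    }

    /** Same input + same key => same token, so joins across datasets still line up. */
    public synchronized String tokenize(String piiValue) {
        byte[] digest = mac.doFinal(piiValue.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        OneWayTokenizer t =
                new OneWayTokenizer("demo-secret-key".getBytes(StandardCharsets.UTF_8));
        // Two datasets tokenize the same CPF to the same value: joinable, not reversible.
        System.out.println(t.tokenize("123.456.789-00"));
        System.out.println(t.tokenize("123.456.789-00"));
    }
}
```
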
  28. Multi-tenant environment: Data exploration layer. Data is accessed from the presentation layer. Benefit: gives data science teams access to a version of the production data and allows them to bring in ad-hoc external datasets. Flow: Landing, Dev/Test/..., Data exploration.
  29. Multi-tenant environment: Production layer. Third-party tools move data from landing into the production zone. PII is encrypted using a third-party solution: reversible tokenization; data is tokenized consistently, which enables joins between different datasets. Flow: Landing, Dev/Test/..., Prod, Data exploration.
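
The reversible variant is trickier: the slide wants tokens that are both reversible and consistent, which third-party tools typically achieve with deterministic or format-preserving encryption. The sketch below shows only the reversible half, using standard JDK AES-GCM with a random IV (so these tokens are deliberately not deterministic); key handling is illustrative.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

public final class ReversibleTokenizer {
    private final SecretKeySpec key;
    private final SecureRandom rng = new SecureRandom();

    public ReversibleTokenizer(byte[] key16or32bytes) {
        this.key = new SecretKeySpec(key16or32bytes, "AES");
    }

    public String tokenize(String piiValue) throws Exception {
        byte[] iv = new byte[12];
        rng.nextBytes(iv); // random IV: safe, but tokens are not consistent across runs
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(piiValue.getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[iv.length + ct.length]; // token = IV || ciphertext
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return Base64.getEncoder().encodeToString(out);
    }

    public String detokenize(String token) throws Exception {
        byte[] in = Base64.getDecoder().decode(token);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key,
               new GCMParameterSpec(128, Arrays.copyOf(in, 12))); // first 12 bytes = IV
        return new String(c.doFinal(in, 12, in.length - 12), StandardCharsets.UTF_8);
    }
}
```
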
  30. Best practices. Do's: create a catalogue of datasets in Atlas (data owner, source system, projects using it); keep multiple copies of the same data (raw, optimized, tokenized); plan disaster recovery (run dev/test/data-exploration workloads on the DR cluster and define prioritized workloads). Don'ts: don't create dataset structures based on projects (datasets will be reused across projects); don't give write access to business users.
  31. Thank you! Visit us: www.cetax.com.br. We're hiring!
