
Building Data Lakes: A Practical View with Hadoop and Big Data


My presentation on building data lakes for big data using Hadoop as the data platform. Learn more about our consulting and training services in Hortonworks Hadoop, Big Data, Data Warehousing, and Business Intelligence.



  1. Data Lakes: A Practical View. Marco Garcia, CTO, Founder – Cetax, TutorPro. mgarcia@cetax.com.br https://www.linkedin.com/in/mgarciacetax/
  2. Introduction. With more than 20 years of experience in IT, 18 of them exclusively with Business Intelligence, Data Warehouse, and Big Data, Marco Garcia is certified by Kimball University in the USA, where he studied in person with Ralph Kimball, one of the leading Data Warehouse gurus. First Hortonworks Certified Instructor in LATAM. Data Architect and Instructor at Cetax Consultoria.
  3. Data Lake?
  4. Data Lake?
  5. Data Lake? What is intelligence? "The ability to learn or understand or to deal with new or trying situations: reason; also: the skilled use of reason; the ability to apply knowledge to manipulate one's environment or to think abstractly as measured by objective criteria (as tests)."
  6. First citation of "Data Lake": October 2010.
  7. Data Warehouse vs. Data Lake (https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html). Bottled water: clean, treated, packaged, ready for consumption. A data lake: raw, untreated, needs to be worked on before it can be consumed.
  8. "Data is the new oil." Like oil, data needs to be refined! DATA IS THE NEW OIL!
  9. DATA FOR BIG DATA
  10. DATA BY VALIDITY FOR BIG DATA
  11. TOOLS FOR BIG DATA
  12. A COMPLETE ARCHITECTURE FOR BIG DATA? Hadoop!
  13. What is Apache Hadoop? The Apache Hadoop project describes the technology as a software framework that: allows for the distributed processing of large data sets across clusters of computers using simple programming models; is designed to scale up from single servers to thousands of machines, each offering local computation and storage; does not rely on hardware to deliver high availability, but rather the library itself is designed to detect and handle failures at the application layer; delivers a highly available service on top of a cluster of computers, each of which may be prone to failures. Source: http://hadoop.apache.org
  14. Hadoop Core = Storage + Compute. (Diagram: YARN, Yet Another Resource Negotiator, provides the compute side (CPU, RAM); HDFS, the Hadoop Distributed File System, provides scale-out storage.)
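As a concrete illustration of the storage half of this slide, here is a minimal sketch of reading and writing HDFS from Python using pyarrow. It assumes a configured Hadoop client on the machine; the namenode host, port, and paths are hypothetical placeholders, not values from the deck.

    # Minimal HDFS round trip via pyarrow (requires libhdfs / a Hadoop client).
    from pyarrow import fs

    # Hypothetical namenode endpoint.
    hdfs = fs.HadoopFileSystem("namenode.example.com", 8020)

    # Write a small file into HDFS.
    with hdfs.open_output_stream("/tmp/hello.txt") as f:
        f.write(b"hello, data lake\n")

    # List the directory and read the file back.
    print(hdfs.get_file_info(fs.FileSelector("/tmp")))
    with hdfs.open_input_stream("/tmp/hello.txt") as f:
        print(f.read())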
  15. Hadoop Distribution
  16. Distinct Masters and Scale-Out Workers. (Diagram: three master nodes, each running ZooKeeper, with the NameNode on master 1, the ResourceManager on master 2, and HiveServer2 on master 3; two utility nodes running client gateways, with Knox on utility 1 and the Ambari Server on utility 2; twelve worker nodes, each running a NodeManager and a DataNode.)
  17. What would the data lake look like on Hadoop?
  18. (Architecture diagram: Step 1, extract & load from source data (app/system logs, customer/inventory data, transaction/sales data, flat files, Twitter/Facebook streams, over DB, file, JMS, REST, HTTP, and streaming interfaces) using Sqoop, Flume, NiFi, and Kafka, or Sqoop/Hive/WebHDFS for RDBMS sources; Step 2, model/apply metadata with HCatalog (table metadata); Step 3, transform, aggregate & materialize with Hive and Pig; Steps 1-3 managed as a data lifecycle with Falcon; Step 4a, publish/exchange to RDBMS and No/New SQL stores (Oracle, HANA) and the EDW (SAP BW); Step 4b, explore/visualize with query/visualization/reporting tools (SAP BO, Tableau/Excel, any JDBC-compliant tool) via an interactive Hive server; Step 4c, analyze with analytical tools (SAS, Python, R, Matlab); all on YARN-managed compute & storage, with Knox and Ambari.)
  19. Steps for the Data Lake. Step 1: extract and load. Step 2: model and apply the metadata. Step 3: transform, aggregate, and materialize the data. Step 4a: publish or send data. Step 4b: explore and visualize. Step 4c: analyze, do data science. (An illustrative sketch of Step 1 follows.)
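The deck names Sqoop, Flume, NiFi, and Kafka for Step 1; as one hedged alternative illustration, here is a PySpark sketch that extracts a table over JDBC and lands it raw in the ingest zone. The JDBC URL, credentials, table, and HDFS path are all hypothetical, and a matching JDBC driver jar would need to be on the Spark classpath.

    # Step 1 sketch: extract from an RDBMS and load raw into the landing zone.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("landing-ingest").getOrCreate()

    raw = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://db.example.com:3306/sales")  # hypothetical
           .option("dbtable", "orders")
           .option("user", "etl")
           .option("password", "secret")
           .load())

    # Land the data untransformed, per the landing-zone guidance later in the deck.
    raw.write.mode("overwrite").parquet("hdfs:///lake/landing/sales/orders")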
  20. How to Structure and Create the Data Lake
  21. Fundamental Points. Align the data lake with the organizational structure. Create zones in the data lake (ingest zone, transformation zone, presentation zone). Define data ingestion processes. Security. Data lineage. Understand the requirements. Integrations will be necessary!
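To make the zoning idea concrete, here is a small sketch that lays out one possible zone directory structure in HDFS, using the layer names the later slides use (landing, archive, presentation, dev/test, exploration). The paths are illustrative, not prescribed by the deck.

    # Create one possible zone layout for the lake (paths are assumptions).
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem("namenode.example.com", 8020)  # hypothetical host
    for zone in ("landing", "archive", "presentation", "dev", "test", "exploration"):
        hdfs.create_dir(f"/lake/{zone}", recursive=True)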
  22. The Logical Structure of the Organization. Align the structure by function, not by department or team: organizations change, but functions are almost always similar. Think of it as a long-term investment. Always stay alert to internal, and even external, regulations and controls. Think of the data lake in layers.
  23. What to Store? EVERYTHING!
  24. HDFS Layer: Landing Zone. Data is written into the landing zone in RAW format (from RDBMS sources via Sqoop, or via HDF, Flume, NiFi, ...). Security: it contains PII; the landing zone uses HDFS TDE (transparent data encryption) for data protection; only ETL tools access this layer; access is limited to data wranglers; data retention is limited (< 1 month).
  25. HDFS Layer: Archival. Data is compressed into large files using Hadoop archives (HAR), which solves the small-files problem. Data is automatically removed; the retention policy is managed via Falcon. Security: the archive zone uses HDFS TDE for data protection; only a limited set of users can access it. HDFS tiering. (A HAR sketch follows.)
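As an illustration of the archival step, here is a hedged sketch that invokes the standard hadoop archive tool from Python. The archive name, parent path, and destination are hypothetical; in practice this would likely be scheduled by Falcon rather than run by hand.

    # Pack a landing directory into a Hadoop archive (HAR) to avoid small files.
    import subprocess

    subprocess.run(
        [
            "hadoop", "archive",
            "-archiveName", "orders-2015.har",  # hypothetical archive name
            "-p", "/lake/landing/sales",        # parent path of the inputs
            "orders",                           # input directory, relative to parent
            "/lake/archive/sales",              # destination directory
        ],
        check=True,
    )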
  26. HDFS Layer: Presentation. Data moves from Landing to Speed; it is cleaned as part of the ETL and stored in optimized file formats (ORC, Parquet, Avro, ...). Multiple copies of the same dataset may exist depending on the use case: raw data stored in an optimized file format; tokenized, normalized, data marts, ... Security: sensitive data is tokenized; business users access this layer.
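Here is an illustrative PySpark sketch of that landing-to-presentation ETL step: read the raw copy, clean it, and rewrite it in an optimized columnar format (ORC here). The column name and paths are hypothetical.

    # Illustrative ETL: landing (raw Parquet) -> presentation (cleaned ORC).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("presentation-etl").getOrCreate()

    raw = spark.read.parquet("hdfs:///lake/landing/sales/orders")
    clean = (raw.dropDuplicates()
                .withColumn("order_date", F.to_date("order_date")))  # hypothetical column

    clean.write.mode("overwrite").orc("hdfs:///lake/presentation/sales/orders")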
  27. Multi-Tenant Environment: Development and Test Layer. Third-party tools move data from landing into the dev and test zones. PII is encrypted using a third-party solution: one-way tokenization, applied consistently, so joins between different datasets still work. Benefit: development is done against realistic datasets (volume and format), and the data science team can be given access. (A one-way tokenization sketch follows.)
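The deck leaves the tokenization solution unnamed; as one way to get the stated properties (one-way, consistent, join-preserving), here is a minimal sketch using a keyed HMAC. The same input always produces the same token, but the original value cannot be recovered from it. Key management is assumed and out of scope.

    # One-way, consistent tokenisation sketch (HMAC-SHA256 with a shared secret).
    import hashlib
    import hmac

    SECRET_KEY = b"replace-with-a-managed-secret"  # assumption: kept in a key store

    def tokenize(value: str) -> str:
        return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

    # The same input always yields the same token, so joins across datasets work.
    assert tokenize("alice@example.com") == tokenize("alice@example.com")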
  28. Multi-Tenant Environment: Data Exploration Layer. Data is accessed from the presentation layer. Benefit: gives data science teams access to a version of production data, and allows them to acquire ad hoc external datasets.
  29. Multi-Tenant Environment: Production Layer. Third-party tools move data from landing into the dev and test zones. PII is encrypted using a third-party solution: reversible tokenization, applied consistently, so joins between different datasets still work.
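For the reversible variant, one common approach (again, the deck does not name a tool) is a token vault: a mapping table that lets authorized jobs swap tokens back for the original values. A minimal in-memory sketch follows; a real vault would be a secured, persistent store.

    # Reversible, consistent tokenisation sketch using a token vault.
    import secrets

    vault: dict[str, str] = {}    # token -> original value
    reverse: dict[str, str] = {}  # original value -> token (keeps tokens consistent)

    def tokenize(value: str) -> str:
        if value not in reverse:
            token = secrets.token_hex(16)
            reverse[value] = token
            vault[token] = value
        return reverse[value]

    def detokenize(token: str) -> str:
        return vault[token]

    t = tokenize("4111-1111-1111-1111")
    assert detokenize(t) == "4111-1111-1111-1111"
    assert tokenize("4111-1111-1111-1111") == t  # consistent across datasets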
  30. Best Practices. Do's: create a catalogue of datasets in Atlas (data owner, source system, projects using it); keep multiple copies of the same data (raw, optimized, tokenized); plan disaster recovery (dev, test, and data-exploration workloads run on the DR cluster; define prioritized workloads). Don'ts: don't create dataset structures based on projects (datasets will be reused across projects); no write access for business users.
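To make the Atlas catalogue suggestion concrete, here is a hedged sketch that queries the Atlas REST API (v2 basic search) for catalogued Hive tables. The host, port, credentials, and the exact response shape are assumptions; check the API documentation for your Atlas version.

    # List catalogued hive_table entities via the Atlas v2 basic-search endpoint.
    import requests

    resp = requests.get(
        "http://atlas.example.com:21000/api/atlas/v2/search/basic",  # hypothetical host
        params={"typeName": "hive_table"},
        auth=("admin", "admin"),  # hypothetical credentials
    )
    resp.raise_for_status()
    for entity in resp.json().get("entities", []):
        print(entity.get("displayText"), entity.get("guid"))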
  31. Thank you! Visit us: www.cetax.com.br We're hiring!

Editor's Notes

  • Data may be the new oil, the new race companies will run to multiply their profits!

    Collecting, processing, and analyzing data correctly can be a competitive differentiator for any business.

    Of course, like oil, data also needs to be refined for the best results.
  • This list is an example of possible sources, but there will be many more.

    The new tools allow connecting to and capturing data from many categories of software, and even from electronic devices that support data capture.

    And of course, beyond the traditional data we already pull from other systems, databases, and text files.
  • Reference - http://voltdb.com/blog/big-data/big-data-value-continuum/
  • Too many software tools?

    Don't worry, we will talk about this a bit further on.
  • Too many software tools?

    Don't worry, we will talk about this a bit further on.
  • This “wordy” slide is straight from the project’s self-description and warrants a splash before we go much further…

    So what is Apache Hadoop? It is a scalable, fault tolerant, open source framework for the distributed storing and processing of large sets of data on commodity hardware. But what does all that mean?

    Well first of all it is scalable. Hadoop clusters can range from as few as one machine to literally thousands of machines. That is scalability!

    It is also fault tolerant. Hadoop services become fault tolerant through redundancy. For example, the Hadoop Distributed File System, called HDFS, automatically replicates data blocks to three separate machines, assuming that your cluster has at least three machines in it. Many other Hadoop services are replicated, too, in order to avoid any single points of failure.

    Hadoop is also open source. Hadoop development is a community effort governed under the licensing of the Apache Software Foundation. Anyone can help to improve Hadoop by adding features, fixing software bugs, or improving performance and scalability.

    Hadoop also uses distributed storage and processing. Large datasets are automatically split into smaller chunks, called blocks, and distributed across the cluster machines. Not only that, but each machine processes its local block of data. This means that processing is distributed too, potentially across hundreds of CPUs and hundreds of gigabytes of memory.

    All of this occurs on commodity hardware which reduces not only the original purchase price, but also potentially reduces support costs as well.
  • At the most granular level, Hadoop is an engine that provides storage via HDFS and compute via YARN.


The "ecosystem" tools wrap around this core.
  • Hadoop is not a monolithic piece of software. It is a collection of architectural pillars that contain software frameworks. Most of the frameworks are part of the Apache software ecosystem. The picture illustrates the Apache frameworks that are part of the Hortonworks Hadoop distribution.

    So why does Hadoop have so many frameworks and tools? The reason is that each tool is designed for a specific purpose. The functionality of some tools overlap but typically one tool is going to be better than others when performing certain tasks.

    For example, both Apache Storm and Apache Flume ingest data and perform real-time analysis. But Storm has more functionality and is more powerful for real-time data analysis.

  • Here is an example cluster with three master nodes, 12 worker nodes, and two utility nodes. The cluster is running various services, like YARN and HDFS. Services can be implemented by one or more service components.

    The three master nodes are running service master components. The 12 worker nodes are running service worker components, sometimes called slave components. The two utility nodes are running service components that provide access, security, and management services for the cluster.

    This page does not illustrate all services, service master, or service worker components. More detail is provided in other lessons.

  • Break Glass?
  • If data needs to be reprocessed, copy it from Archive back into Landing.
    HAR files are tracked by Atlas.
  • ISO 27001: data and processing should be separated, which doesn't mean separate environments.
    Separate dev and test environments are used for upgrade/patch testing and can be smaller, virtualized, etc.
  • ISO 27001: data and processing should be separated, which doesn't mean separate environments.
