
Architecting a Serverless Data Lake (ARC302) - AWS re:Invent 2018

In this workshop, learn how to create a serverless data lake architecture. Understand how to ingest data at scale from multiple data sources, how to transform the data, and how to catalog it to make it available for querying using a variety of tools. Also, learn how to set up governance and data quality controls. All attendees must bring their own laptop (Windows, macOS, and Linux are all supported). Tablets are not appropriate. We recommend having the current version of Chrome or Firefox installed. Also, participants must have their own AWS account and administrator permission for AWS services within their accounts.


  1. Serverless Data Lake Workshop (ARC302). Amardeep Chudda, Solutions Architect, Amazon Web Services; Mike Gillespie, Solutions Architect, Amazon Web Services. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Agenda: Development Environment Setup • Review Data Lake Architecture • Why Serverless? • Glue Extract Transform Load (ETL) • Data Governance • Bonus Content
  3. Related breakouts: Tuesday, Nov 27: ANT354-R - [REPEAT] Build a Query to Analyze Data in Your Amazon Redshift Warehouse & S3 Data Lake Together, 8:30 AM to 9:30 AM | Mirage. Thursday, Nov 29: ADT301 - Create a Serverless Web Event Pipeline, 4:00 PM to 5:00 PM | Mirage. Friday, Nov 30: AIM405-R1 - [REPEAT 1] Better Analytics Through Natural Language Processing, 11:30 AM to 12:30 PM | Venetian.
  4. Scenario: You support a successful online ecommerce website with millions of users. The website tracks your end users' activity and buying habits online. Your analytics team would like the ability to query data both ad hoc and through Business Intelligence tools, with the end goal of helping business teams derive efficiencies in their marketing campaigns. You want to enable your analytics team, but at the same time you don't want to lose focus on data quality and governance controls. Data sources include weblogs, NoSQL databases, and other sources. Your task is to build a cost-effective solution that provides a unified analytics environment.
  5. re:Invent workshop summary: • Ingest data from various data sources and join them together • Enrich raw data • Convert data to Parquet for efficient querying • Grant access to roles based on the data classification • SQL access for data scientists • Data visualization with charts and graphs
  7. Requirements: 1. Your own device for console access. 2. An AWS account that you are able to use for testing (it should not be used for production or other purposes). 3. The workshop on GitHub at https://bit.ly/2RX54o3
  8. Development environment: Your Cloud Engineering team has deployed a development environment for you, including ingestion / data generation (Kinesis, log data, data generation Lambda functions), Amazon Simple Storage Service (Amazon S3) buckets, Amazon DynamoDB, AWS Glue (Management Console / development endpoint), Amazon Athena, and Amazon QuickSight.
  9. Deploy the lab environment: 1. Deploy the lab CloudFormation template from https://bit.ly/2RX54o3 2. Examine the environment in AWS CloudFormation Designer 3. Deploy your stack.
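The workshop deploys the template through the console and CloudFormation Designer; if you prefer to script that step, a minimal boto3 sketch might look like the following. The stack name and template URL here are placeholders, not the workshop's actual values (the real template is linked from https://bit.ly/2RX54o3).

```python
# Minimal sketch: deploy the lab stack with boto3 instead of the console.
# StackName and TemplateURL are hypothetical placeholders.
import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="serverless-datalake-lab",                                  # hypothetical
    TemplateURL="https://s3.amazonaws.com/my-bucket/lab-template.yaml",   # placeholder
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the lab template creates IAM roles
)

# Block until the stack finishes creating, then print its outputs.
cfn.get_waiter("stack_create_complete").wait(StackName="serverless-datalake-lab")
stack = cfn.describe_stacks(StackName="serverless-datalake-lab")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])
```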
  11. High-level architecture
  12. Amazon Kinesis Data Firehose: • Serverless, easy to use • Seamless integration with AWS data stores • Support for serverless transformation • Near real-time ingestion • Pay only for what you use
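To illustrate the near real-time ingestion path Firehose provides, here is a hedged sketch of writing a single clickstream event to a delivery stream with boto3; the stream name and event fields are assumptions, not the workshop's exact resources.

```python
# Minimal sketch: push one clickstream event into a Kinesis Data Firehose
# delivery stream. The stream name and event shape are hypothetical.
import json
import boto3

firehose = boto3.client("firehose")

event = {"user_id": 42, "page": "/checkout", "ts": "2018-11-28T10:15:00Z"}

firehose.put_record(
    DeliveryStreamName="weblog-delivery-stream",  # hypothetical stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```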
  13. Amazon Simple Storage Service (Amazon S3): • Object store • Highly durable • Limitless scalability • Pay for what you use • Comprehensive security & compliance capabilities • Support for query in place
  14. AWS Glue: • Serverless ETL • Universal Data Catalog • Open-source Apache Spark environment • DynamicFrame with built-in functions • Seamless integration with AWS services • Support for on-premises data stores
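A minimal sketch of the kind of Glue ETL job the workshop builds: read a cataloged table as a DynamicFrame, map the columns of interest, and write the result back to S3 as Parquet. The database, table, column, and bucket names are assumptions for illustration.

```python
# Minimal Glue job sketch: catalog table -> DynamicFrame -> Parquet on S3.
# Database, table, columns, and bucket path are hypothetical.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw weblogs table registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="weblogs_db", table_name="raw_weblogs"
)

# Keep and rename only the columns we care about.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("user_id", "string", "user_id", "string"),
        ("page", "string", "page", "string"),
        ("ts", "string", "event_time", "string"),
    ],
)

# Write the curated data out as Parquet for efficient querying.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/curated/weblogs/"},
    format="parquet",
)

job.commit()
```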
  15. Amazon Athena: • Serverless interactive query service • Integrated with the AWS Glue Data Catalog • Open source, built on Presto, query with standard SQL • Pay per query • Support for standard formats like CSV, JSON, ORC, Avro, and Parquet • Fast parallel query execution
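For the SQL-access step, a hedged sketch of running an ad hoc Athena query from Python and reading back the results; the database, table, and results bucket are assumptions.

```python
# Minimal sketch: run an ad hoc SQL query against the Glue Data Catalog
# through Athena and wait for it to finish. Names are hypothetical.
import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS views FROM raw_weblogs GROUP BY page",
    QueryExecutionContext={"Database": "weblogs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

# Poll until the query succeeds, fails, or is cancelled.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```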
  16. Amazon QuickSight: • Serverless, end-to-end BI solution • Built-in SPICE engine • Smart visualizations • Seamless integration with AWS services • On-premises database support • Pay only for what you use • Multiple device support • Share and collaborate
  18. Data classification and security: • Grant S3 access by role to bucket / prefix • Approaches to segment data: multiple copies of the data in different buckets, or tokenization, joins to tokenized tables, and views to resolve them
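A hedged sketch of the first idea, granting a role access by bucket and prefix: an inline IAM policy that lets an analytics role read only one classification prefix of the data lake bucket. The role, policy, bucket, and prefix names are assumptions.

```python
# Minimal sketch: scope a role's S3 read access to a single classification
# prefix. Role, policy, bucket, and prefix names are hypothetical.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Allow listing only the non-sensitive prefix.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-datalake-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["curated/public/*"]}},
        },
        {   # Allow reading objects under that prefix only.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-datalake-bucket/curated/public/*",
        },
    ],
}

iam.put_role_policy(
    RoleName="analytics-public-reader",
    PolicyName="datalake-public-prefix-read",
    PolicyDocument=json.dumps(policy),
)
```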
  19. Duplication: UserProfile (ID, First, Last): 1, Sam, Smith; 2, Jane, Jones. UserProfileSecure (ID, First, Last, SSN): 1, Sam, Smith, 111-11-1111; 2, Jane, Jones, 222-22-2222.
  21. Tokenization: UserProfile (ID, First, Last, SSN_Token): 1, Sam, Smith, 8c9d409dcc43; 2, Jane, Jones, 06a38ea94e69. SSN_Tokens (Token, SSN): 8c9d409dcc43, 111-11-1111; 06a38ea94e69, 222-22-2222.
  22. Tokenization (views): ProfileView (ID, First, Last): 1, Sam, Smith; 2, Jane, Jones. ProfileSecureView (ID, First, Last, SSN): 1, Sam, Smith, 111-11-1111; 2, Jane, Jones, 222-22-2222.
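One way to realize the tokenization pattern from slides 21 and 22 is with Athena views: general analysts query ProfileView, while a restricted role queries ProfileSecureView, which joins the token table back in. A minimal sketch, assuming hypothetical table, view, database, and results-bucket names.

```python
# Minimal sketch: create the two views from the tokenization slides in Athena.
# Table, view, database, and results-bucket names are hypothetical.
import boto3

athena = boto3.client("athena")

CREATE_PROFILE_VIEW = """
CREATE OR REPLACE VIEW profileview AS
SELECT id, first, last
FROM userprofile
"""

CREATE_PROFILE_SECURE_VIEW = """
CREATE OR REPLACE VIEW profilesecureview AS
SELECT p.id, p.first, p.last, t.ssn
FROM userprofile p
JOIN ssn_tokens t ON p.ssn_token = t.token
"""

for ddl in (CREATE_PROFILE_VIEW, CREATE_PROFILE_SECURE_VIEW):
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "userprofile_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
```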
  23. Redshift Spectrum: UserProfileSecure (ID, First, Last, SSN): 1, Sam, Smith, 111-11-1111; 2, Jane, Jones, 222-22-2222.
  25. Bonus content: • AWS Glue development endpoints with an Apache Zeppelin notebook • Amazon Redshift / Redshift Spectrum integration • AWS Database Migration Service (DMS): importing files from S3 to DynamoDB
  26. Thank you! Amar, Mike
