Building a Server-less Data Lake on AWS - Technical 301

We will introduce key concepts for a data lake and present aspects related to its implementation. We will also discuss critical success factors, pitfalls to avoid, operational aspects, and insights on how AWS enables a server-less data lake architecture.

Speaker: Sebastien Menant, Solutions Architect, Amazon Web Services

Published in: Technology

  1. Building a Server-less Data Lake on AWS, Technical 301. Sebastien Menant & Nam Je Cho, Enterprise Solutions Architects, Amazon Web Services. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Agenda
     • What is a Data Lake?
     • Why You Need a Data Lake
     • Building the Data Lake
     • Demo
     • Next Steps
  3. What is a Data Lake?
  4. Definition: "A data lake provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs" (Wikipedia)
  5. Characteristics of a Data Lake
     • Collect Everything
     • Dive in Anywhere
     • Flexible Access
  6. Why You Need a Data Lake
  7. What About Modern Business Needs?
  8. Big Data… and The Hadoop Ecosystem
  9. But Both are Complementary: Amazon EMR and Amazon Redshift
  10. [Diagram: STORAGE on Amazon S3; multiple COMPUTE nodes on Amazon EMR]
  11. New Business Outcomes and Capabilities
     • Enable New Insights in Your Data
     • Cost Savings of Compute and Storage
     • Use the Right Tool for the Job
     • Increase Durability of Data
     • Charge Storage Costs to Owner
     • Streaming and Real-time Analysis
     Retain all your data, for years!
  12. Building the Data Lake
  13. Beware
  14. Building Blocks of the Data Lake
     • Storage and Ingestion
     • Catalogue and Search
     • Security
     • API and UI
  15. Storage and Ingestion [diagram: Storage and Ingestion | Catalogue and Search | Security | API and UI]
  16. Requirements for Storage
     • Multi-year Scalable Storage Capability
     • High Durability
     • Store Raw Data from Any Input Sources
     • Support for Any Data Type
     • Low Cost
  17. Key Services for Storage
     • Amazon S3
        1. Highly Scalable and Durable
        2. Security and Encryption
        3. Lifecycle Management
        4. Event Notifications
        5. Versioning
     • Amazon Glacier
        1. Long-term Archival Storage
        2. Lifecycle Integration with S3
        3. Extremely Low-cost
        4. Vault Lock
  18. [Diagram: Storage and Ingestion: Amazon S3, Amazon Glacier]
  19. Recommendations #1
     • S3 Buckets
        • Close to Users and Compute
        • Select Region for Regulatory Compliance
     • Naming
        • Human-readable Path
        • Random Hash Prefix for Optimal Partitioning
     • Format
        • Structured vs Unstructured + Compression
        • CSV, Parquet, ORC, JSON, XML, logs, etc.
        • GZIP for small files, Avro, LZO, Snappy
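The naming advice on this slide (a human-readable path combined with a short hash prefix) can be sketched as below. This is a minimal illustration, not the speakers' implementation: the function name `build_object_key`, the `dataset/date/filename` layout, and the four-character prefix length are all assumptions made for the example.

```python
import hashlib


def build_object_key(dataset: str, date: str, filename: str, prefix_len: int = 4) -> str:
    """Return an object key of the form <hash prefix>/<dataset>/<date>/<filename>.

    The layout and prefix length are illustrative assumptions.
    """
    readable_path = f"{dataset}/{date}/{filename}"
    # Hash the readable path so the prefix is deterministic: the same object
    # always maps to the same key, while different objects spread across prefixes.
    digest = hashlib.sha256(readable_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{readable_path}"


# Hypothetical dataset and file names, used only to show the call.
key = build_object_key("clickstream", "2016-04-28", "events-0001.json.gz")
print(key)
```

Because the prefix is derived from the readable path, a reader can always recompute the full key from the dataset, date, and file name alone.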
  20. Recommendations #2
     • Optimise
        • Store Everything
        • Use Large Files with Splittable Format
        • Lifecycle Policies for Cost-savings
        • Tagging for Cost Allocation
     • Security
        • Encryption
        • Bucket Policies, ACL, Tagging, CloudTrail
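As a rough sketch of the lifecycle and tagging recommendations, the helpers below build the request bodies that S3 expects for a lifecycle rule that moves objects to Glacier and for a bucket tag set. The helper names, the `raw/` prefix, the 90-day threshold, and the tag values are assumptions for illustration; the slides do not prescribe specific values.

```python
def lifecycle_configuration(prefix: str, archive_after_days: int,
                            rule_id: str = "archive-to-glacier") -> dict:
    """Build an S3 lifecycle configuration that transitions objects to Glacier."""
    if archive_after_days < 0:
        raise ValueError("archive_after_days must not be negative")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def cost_allocation_tags(tags: dict) -> dict:
    """Build the Tagging structure used for bucket tags."""
    return {"TagSet": [{"Key": k, "Value": v} for k, v in sorted(tags.items())]}


config = lifecycle_configuration("raw/", archive_after_days=90)
tagging = cost_allocation_tags({"owner": "marketing", "project": "data-lake"})

# With boto3 installed and credentials configured, these could be applied as:
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(Bucket="my-bucket", LifecycleConfiguration=config)
#   s3.put_bucket_tagging(Bucket="my-bucket", Tagging=tagging)
# ("my-bucket" is a placeholder bucket name.)
```

Keeping the rule as plain data makes it easy to review and version alongside the rest of the data lake configuration before it is applied.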
  21. Requirements for Ingestion
     • Batch File Support
     • Traditional ETL
     • Streaming Data
     • Consumption of any Dataset as a Stream
     • Low Latency Analytics
     • Replayability from the Data Lake
     • Server-less ETL Capabilities
  22. Key Services for Ingestion
     • Amazon Kinesis Firehose
        1. Easy to use with Agent
        2. Automatic Elasticity
        3. Near Real-time
        4. Simultaneous Destinations
     • Amazon Kinesis Streams
        1. Enables Custom Processing
        2. Continuous Data Collection
        3. Real-time
        4. API Driven for Custom Apps
  23. [Diagram: Data Sources; Amazon Kinesis Stream across three Availability Zones; AWS Lambda, KCL App, EMR; S3, DynamoDB, Redshift, Elasticsearch]
  24. [Diagram: Storage and Ingestion: Amazon S3, Amazon Glacier, Amazon Kinesis]
  25. Recommendations
     • Reminder
        • Added Complexity needs Business Justification
     • Select the Right Tools
        • Real-time Analysis: Apache Spark Streaming, Storm, Flink
        • Firehose to Redshift for BI and Dashboards
     • Tips
        • AWS Lambda for ETL Transformation
        • Persist Streams into S3
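The "AWS Lambda for ETL Transformation" tip could look roughly like the handler below, which decodes the base64 payloads that Kinesis Streams delivers to Lambda and applies a small transformation. The `transform` step (lower-casing field names and dropping empty values) is a hypothetical example, and the assumption that each record carries a JSON object is ours, not the slides'.

```python
import base64
import json


def transform(record: dict) -> dict:
    """Hypothetical ETL step: lower-case field names and drop empty values."""
    return {k.lower(): v for k, v in record.items() if v not in (None, "")}


def handler(event, context):
    """Decode Kinesis records passed to Lambda and transform each one."""
    transformed = []
    for rec in event["Records"]:
        # Kinesis record payloads arrive base64-encoded under kinesis.data.
        payload = base64.b64decode(rec["kinesis"]["data"])
        transformed.append(transform(json.loads(payload)))
    # A real function would persist `transformed` to S3 at this point,
    # in line with the "Persist Streams into S3" tip.
    return transformed
```

Keeping `transform` separate from the handler means the ETL logic can be unit-tested without constructing a Kinesis event.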
  26. http://amzn.to/23DWr5O
  27. http://amzn.to/1SRk8wG
  28. Catalogue and Search [diagram: Storage and Ingestion | Catalogue and Search | Security | API and UI]
  29. Requirements for Catalogue and Search
     • Metadata Index
     • Automated Metadata Processing
     • Discovery and Search
     • Data Classification
     • Server-less and Event-driven
  30. Key Services for Catalogue and Search
     • AWS Lambda
        1. Server-less
        2. Event Driven
        3. Auto Scaling
        4. Real-time
     • Amazon DynamoDB
        1. NoSQL
        2. Streams
        3. Logstash Plugin
     • Amazon Elasticsearch Service
        1. Deploy Simply
        2. Easy Admin
        3. Kibana
  31. [Diagram: Catalogue and Search: AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch]
  32. Recommendations
     • Tips
        • Start Small and Simple… add Capabilities
        • File names, size, state, dates, tags, owner
        • Region, versions, lineage, relationships
        • Search Metadata and Object Content
     • Events
        • S3 Triggers Lambda
        • DynamoDB Streams
        • Logstash Plugin to Elasticsearch
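The "S3 Triggers Lambda" step of the catalogue can be sketched as a function that turns an S3 event notification into metadata items ready for an index such as DynamoDB. The function name `catalogue_items` and the chosen attribute names are assumptions for illustration; only the event fields read here (bucket name, object key and size, event time and name, region) come from the S3 notification format.

```python
from urllib.parse import unquote_plus


def catalogue_items(event: dict) -> list:
    """Extract one metadata item per object from an S3 event notification."""
    items = []
    for rec in event.get("Records", []):
        s3 = rec["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys are URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            # Delete events carry no size, so fall back to 0.
            "size_bytes": s3["object"].get("size", 0),
            "event_time": rec["eventTime"],
            "event_name": rec["eventName"],
            "region": rec["awsRegion"],
        })
    return items


def handler(event, context):
    items = catalogue_items(event)
    # A real function would write each item to the metadata index here,
    # for example a DynamoDB table (table name and schema are not given in the slides).
    return {"indexed": len(items)}
```

Starting with a handful of attributes like these matches the "Start Small and Simple" tip; lineage, tags, or ownership can be added to the item later.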
  33. http://amzn.to/23E9LUp
  34. http://amzn.to/1TQVBwp
  35. Security [diagram: Storage and Ingestion | Catalogue and Search | Security | API and UI]
  36. Requirements for Security
     • Data Encryption at Rest
     • Authentication
     • Authorisation
  37. Key Services for Security
     • AWS IAM
        1. Users and Roles
        2. Identity Federation
        3. Multi Factor Authentication
        4. Granular Permissions
     • AWS KMS
        1. Seamless Service Integration
        2. Extensive Compliance
     • Also shown: AWS CloudHSM, SSE-S3
  38. [Diagram: Security: AWS KMS, AWS IAM]
  39. Recommendations
     • Start Early
        • Security Needs Practice!
        • Federate with your Corporate Directory
     • Best Practice
        • Use CloudTrail and CloudWatch
        • Encrypt Where Possible
        • Select Bucket Region for Regulatory Compliance
     • Tips
        • IAM Policies, S3 Versioning and MFA Delete
        • Lambda for Data Masking
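One way to act on "Encrypt Where Possible" is a bucket policy that rejects uploads that do not request server-side encryption. The sketch below only builds the policy document; the helper name, the statement ID, and the bucket name `example-data-lake` are placeholders, and the slides do not state that the speakers used this particular policy.

```python
import json


def require_encryption_policy(bucket: str, algorithm: str = "aws:kms") -> str:
    """Return a bucket policy (as JSON) denying PutObject without the given SSE algorithm.

    Use "aws:kms" for SSE-KMS or "AES256" for SSE-S3.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": algorithm
                    }
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(require_encryption_policy("example-data-lake"))
```

The resulting JSON can be attached to the bucket through the console, the CLI, or an SDK; it complements, rather than replaces, IAM policies on the users and roles that write to the lake.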
  40. API and UI [diagram: Storage and Ingestion | Catalogue and Search | Security | API and UI]
  41. Requirements for API and UI
     • Serve Data and Capabilities to Customers
     • Programmatically
     • Search Catalogue
     • Run Compute
     • Extend Access Control Management
     • And… Use of Familiar Visualisation Tools
  42. Key Services for API and UI
     • Amazon API Gateway
        1. Performance at Any Scale
        2. Create RESTful Frontend
        3. Managed API Lifecycle
     • AWS Lambda
        1. Enables Server-less API
        2. Custom Logic for Services
        3. Automatic Scaling
  43. [Diagram: API and UI: Amazon API Gateway, AWS Lambda]
  44. Recommendations
     • Tips
     • Go Server-less!
     • Extend Existing AWS Services and Build Custom Logic
     • Data Management, Processing and Transformations
     • API Gateway for Data Access
     • Serve the Data, Search and Compute via RESTful APIs
     • Distribute a Custom SDK
     • Extend the Solution
     • Build Advanced Security Controls using Metadata Index
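Serving catalogue search through a RESTful API could be sketched as a Lambda handler behind API Gateway. The example assumes API Gateway's Lambda proxy integration (which passes query parameters in `queryStringParameters` and expects a `statusCode`/`body` response); the in-memory `CATALOGUE` list and the `owner` filter are hypothetical stand-ins for a real metadata index.

```python
import json

# Hypothetical stand-in for the metadata index (DynamoDB or Elasticsearch in the slides).
CATALOGUE = [
    {"key": "raw/clickstream/2016-04-28/events-0001.json.gz", "owner": "marketing"},
    {"key": "raw/billing/2016-04-28/invoices.csv", "owner": "finance"},
]


def handler(event, context):
    """Return catalogue entries, optionally filtered by ?owner=<name>."""
    # queryStringParameters is null when the request has no query string.
    params = event.get("queryStringParameters") or {}
    owner = params.get("owner")
    matches = [item for item in CATALOGUE if owner is None or item["owner"] == owner]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"count": len(matches), "items": matches}),
    }
```

A request such as `GET /catalogue?owner=finance` (the path is a placeholder) would then return only the finance entry, and the same handler shape extends naturally to the access-control checks mentioned on the slide.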
  45. The Whole Picture… [diagram: Storage and Ingestion | Catalogue and Search | Security | API and UI]
  46. [Diagram: USERS; API and UI (Amazon API Gateway, AWS Lambda); Security (AWS KMS, AWS IAM); Catalogue and Search (AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch); Storage and Ingestion (Amazon S3, Amazon Glacier, Amazon Kinesis); Amazon EMR, Amazon RDS, Amazon Redshift]
  47. A Data Lake is…
     • Foundation of Data Storage and Streaming Data
     • Metadata index to help Categorise and Govern
     • Search Index to Enable Data Discovery
     • Robust Set of Security Controls
     • Governance Through Technology Not Policy
     • Interface to Expose Data and Capabilities to Users
  48. Demo (2016-04-28)
  49. Demo
  50. Building Catalogue and Search [diagram: Data Flow: Data Source, S3 Bucket, Lambda, DynamoDB, Metadata Index, Logstash, ElasticSearch]
  51. Next Steps
  52. Proof of Concept
  53. Next Steps
     • How to Get Started
     • AWS Documentation
     • Getting Started Guide
     • AWS Training & Certification
     • Big Data on AWS
     • AWS Partner Network
     • AWS Professional Services
     • Big Data Specialists
  54. AWS Training & Certification
     • Intro Videos & Labs: Free videos and labs to help you learn to work with 30+ AWS services in minutes!
     • Training Classes: In-person and online courses to build technical skills, taught by accredited AWS instructors
     • Online Labs: Practice working with AWS services in a live environment. Learn how related services work together
     • AWS Certification: Validate technical skills and expertise. Identify qualified IT talent or show you are AWS cloud ready
     Learn more: aws.amazon.com/training
  55. Your Training Next Steps:
     ✓ Visit the AWS Training & Certification pod to discuss your training plan & AWS Summit training offer
     ✓ Register & attend AWS instructor led training
     ✓ Get Certified
     AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag
     Learn more: aws.amazon.com/training
  56. Thank You!