Fast Track to Your Data Lake on AWS

1. Fast Track to Your Data Lake on AWS
   John Mallory, Business Development
   March 16, 2017
   © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
2. Data has gravity
   …easier to move processing to the data
   Examples: 4k/8k, genomics, seismic, financial, logs, IoT
3. Data has Business Value
4. Challenges with Legacy Data Architectures
   • Can’t move data across silos
   • Can’t afford to keep all of the data
   • Can’t scale with dynamic data and real-time processing
   • Can’t scale management of data
   • Can’t find the people who know how to configure and manage complex infrastructure
   • Can’t afford the investments to keep refreshing infrastructure and data centers
5. Enter Data Lake Architectures
   A data lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data.
   Benefits of a data lake:
   • All data in one place
   • Quick ingest
   • Storage vs compute
   • Schema on read
  6. 6. 1&2: Consolidate (Data) & Separate (Storage & Compute) •S3 as the data lake storage tier; not a single analytics tool like Hadoop or a data warehouse •Decoupled storage and compute is cheaper and more efficient to operate •Decoupled storage and compute allow us to evolve to clusterless architectures (i.e. Lambda, Athena & Glue) •Do not build data silos in Hadoop or the EDW •Gain flexibility to use all the analytics tools in the ecosystem around S3 & future proof the architecture
7. Why Choose Amazon S3 for the Data Lake?
   • Durable: designed for 11 9s of durability
   • Secure: multiple encryption options; robust, highly flexible access controls
   • High performance: multipart upload, Range GET, scalable throughput
   • Scalable and affordable: store as much as you need; scale storage and compute independently; scale without limits
   • Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB, Amazon Athena, Amazon Rekognition, Amazon Glue
   • Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies, simple management tools, Hadoop compatibility
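Two of the performance features named above, multipart upload and Range GET, in a minimal boto3 sketch; the bucket and object names are hypothetical:

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # upload_file switches to multipart upload automatically once the
    # object exceeds multipart_threshold.
    config = TransferConfig(multipart_threshold=8 * 1024 * 1024)  # 8 MB
    s3.upload_file("events.csv", "my-data-lake", "raw/events.csv", Config=config)

    # Range GET: read only the first kilobyte of the object.
    resp = s3.get_object(Bucket="my-data-lake", Key="raw/events.csv",
                         Range="bytes=0-1023")
    first_kb = resp["Body"].read()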
8. Case Study: Re-architecting Compliance
   “For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” - Steve Randich, CIO
   What FINRA needed:
   • Infrastructure for its market surveillance platform
   • Support for analysis and storage of approximately 75 billion market events every day
   • Storage of 5PB of historical data for analysis & training
   Why they chose AWS:
   • Fulfillment of FINRA’s security requirements
   • Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
   Benefits realized:
   • Increased agility, speed, and cost savings
   • Estimated savings of $10-20M annually by using AWS
9. 3: Implement the Right Security Controls
   Security:
   • Identity and Access Management (IAM) policies
   • Bucket policies
   • Access Control Lists (ACLs)
   • Private VPC endpoints to Amazon S3
   • SSL endpoints
   Encryption:
   • Server-side encryption (SSE-S3)
   • Server-side encryption with provided keys (SSE-C, SSE-KMS)
   • Client-side encryption
   Compliance:
   • Bucket access logs
   • Lifecycle management policies
   • Access Control Lists (ACLs)
   • Versioning & MFA deletes
   • Certifications: HIPAA, PCI, SOC 1/2/3, etc.
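To make two of these controls concrete, a hedged boto3 sketch of an SSE-KMS upload and a bucket policy that rejects non-SSL requests; the bucket name and KMS key alias are assumptions:

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-lake"  # hypothetical

    # Server-side encryption with a KMS-managed key (SSE-KMS).
    with open("trades.csv", "rb") as f:
        s3.put_object(
            Bucket=BUCKET,
            Key="raw/trades/2017-03-16.csv",
            Body=f,
            ServerSideEncryption="aws:kms",
            SSEKMSKeyId="alias/data-lake-key",  # hypothetical key alias
        )

    # Bucket policy: deny any request that does not arrive over SSL.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::" + BUCKET, "arn:aws:s3:::" + BUCKET + "/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))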
10. 4: Choose the Right Ingestion Methods
    AWS Snowball & Snowmobile:
    • Accelerate PB-scale transfers with AWS-provided appliances
    • 50, 80, and 100 TB Snowball models
    • 100PB Snowmobile
    AWS Storage Gateway:
    • Instant hybrid cloud
    • Up to 120 MB/s cloud upload rate (a 4x improvement)
    Amazon Kinesis Firehose:
    • Ingest device streams directly into AWS data stores
    AWS Direct Connect:
    • COLO to AWS
    • Use native copy tools
    Native/ISV connectors:
    • Sqoop, Flume, DistCp
    • Commvault, Veritas, etc.
    Amazon S3 Transfer Acceleration:
    • Move data up to 300% faster using AWS’s private network
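For the streaming path, a minimal sketch of a producer pushing a record into a Kinesis Firehose delivery stream that lands in S3; the stream name and payload are illustrative:

    import json
    import boto3

    firehose = boto3.client("firehose")

    event = {"device_id": "sensor-42", "temp_c": 21.7, "ts": "2017-03-16T12:00:00Z"}
    firehose.put_record(
        DeliveryStreamName="iot-to-data-lake",  # assumed delivery stream
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )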
11. 5: Catalog Your Data
    Put data in S3; capture metadata in Amazon DynamoDB and make it searchable with Amazon Elasticsearch Service. The metadata catalog documents what is in the data lake: summary statistics, classification, data sources, and search capabilities. (Glue coming mid-year.)
    https://aws.amazon.com/answers/big-data/data-lake-solution/
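One possible shape for the DynamoDB side of such a catalog, sketched with boto3; the table name and attribute schema are illustrative, not the AWS Data Lake Solution's actual schema:

    import boto3

    catalog = boto3.resource("dynamodb").Table("data-lake-catalog")  # assumed table

    # Register a newly landed S3 object so it is discoverable later.
    catalog.put_item(Item={
        "s3_key": "raw/trades/2017-03-16.csv",  # assumed partition key
        "bucket": "my-data-lake",
        "format": "csv",
        "source": "trading-system",
        "classification": "internal",
    })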
12. Amazon Glue – in Preview
    Glue automates the undifferentiated heavy lifting of ETL:
    • Cataloging data sources
    • Identifying data formats and data types
    • Generating Extract, Transform, Load code
    • Executing ETL jobs; managing dependencies
    • Handling errors
    • Managing and scaling resources
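Glue's API was still in preview when this deck was written; as a hedged illustration of the cataloging step, here is how a crawler is pointed at the raw zone with the boto3 Glue client as it later shipped (all names and the IAM role are assumptions):

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="raw-zone-crawler",
        Role="GlueCrawlerRole",  # assumed IAM role with S3 read access
        DatabaseName="data_lake_raw",
        Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
    )
    # The crawler infers formats and types and writes table definitions
    # into the catalog database.
    glue.start_crawler(Name="raw-zone-crawler")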
13. 6: Keep More Data
    Storage class                    Use case                    First-byte latency   Price
    S3 Standard                      Active data                 Milliseconds         $0.021/GB/mo
    S3 Standard - Infrequent Access  Infrequently accessed data  Milliseconds         $0.0125/GB/mo
    Amazon Glacier                   Archive data                Minutes to hours     $0.004/GB/mo
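Tiering between these classes can be automated with a lifecycle policy; a sketch with boto3, where the bucket, prefix, and day thresholds are illustrative:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",
        LifecycleConfiguration={"Rules": [{
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm -> IA
                {"Days": 365, "StorageClass": "GLACIER"},      # cold -> archive
            ],
        }]},
    )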
14. 7: Use Athena for Ad Hoc Data Exploration
    Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
15. Athena is Serverless
    • No infrastructure or administration
    • Zero spin-up time
    • Transparent upgrades
16. Query Data Directly from Amazon S3
    • No loading of data: query data in its raw format, no ETL required
    • Athena supports multiple data formats: text, CSV, TSV, JSON, weblogs, AWS service logs
    • Or convert to an optimized format like ORC or Parquet for the best performance and lowest cost
    • Stream data directly from Amazon S3
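A minimal sketch of submitting an ad hoc query through the Athena API with boto3; the database, table, and results bucket are hypothetical:

    import boto3

    athena = boto3.client("athena")

    resp = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status",
        QueryExecutionContext={"Database": "data_lake_raw"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    # Athena runs asynchronously; poll get_query_execution with this ID.
    print(resp["QueryExecutionId"])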
17. 8: Use the Right Data Formats
    • Pay by the amount of data scanned per query
    • Use compressed columnar formats: Parquet, ORC
    • Easy to integrate with a wide variety of tools

    Dataset                                Size on Amazon S3   Query run time   Data scanned   Cost
    Logs stored as text files              1 TB                237 seconds      1.15 TB        $5.75
    Logs stored in Apache Parquet format*  130 GB              5.13 seconds     2.69 GB        $0.013
    Savings                                87% less            34x faster       99% less       99.7% cheaper
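The conversion behind the comparison above can be done many ways; one hedged sketch uses pandas with the pyarrow engine (file names are illustrative, and this is not necessarily how the benchmark's files were produced):

    import pandas as pd  # requires pyarrow for the parquet engine

    df = pd.read_csv("weblogs.csv")
    # Snappy-compressed, columnar: queries scan only the columns they touch.
    df.to_parquet("weblogs.parquet", engine="pyarrow", compression="snappy")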
18. 9: Choose the Right Tools
    • Amazon Redshift: enterprise data warehouse
    • Amazon EMR: Hadoop/Spark
    • Amazon Athena: clusterless SQL
    • Amazon Glue: clusterless ETL
    • Amazon Aurora: managed relational database
    • Amazon Machine Learning: predictive analytics
    • Amazon QuickSight: business intelligence/visualization
    • Amazon Elasticsearch Service: Elasticsearch
    • Amazon ElastiCache: Redis in-memory datastore
    • Amazon DynamoDB: managed NoSQL database
    • Amazon Rekognition & Amazon Polly: image recognition & text-to-speech AI APIs
    • Amazon Lex: voice or text chatbots
19. A Sample Data Lake Pipeline
    Ad hoc access to data using Athena; Athena can query aggregated datasets as well.
20. AWS Data Lake Analytic Capabilities
    [Architecture diagram] Data sources and transactional data are ingested via Amazon Kinesis Streams & Firehose into the Amazon S3 data lake. Around S3 sit Hadoop/Spark on Amazon EMR; streaming analytics tools (AWS Lambda, Spark Streaming on EMR, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics); Amazon Glue for clusterless ETL; and Amazon Athena for clusterless SQL query. A serving tier includes Amazon Redshift (data warehouse), Amazon DynamoDB (NoSQL database), Amazon Elasticsearch Service, Amazon Aurora (relational database), and Amazon ElastiCache (Redis), alongside Amazon Machine Learning (predictive analytics), a data science sandbox, visualization/reporting, and any open-source tool of choice on EC2.
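As one hedged illustration of how the Kinesis-to-S3 leg of this diagram is often wired, a Lambda handler that drains stream records into the lake; the bucket, prefix, and event wiring are assumptions, not part of the deck:

    import base64
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Kinesis delivers records base64-encoded inside the Lambda event.
        lines = [base64.b64decode(r["kinesis"]["data"]) for r in event["Records"]]
        key = "raw/events/" + datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S") + ".jsonl"
        s3.put_object(Bucket="my-data-lake", Key=key, Body=b"\n".join(lines))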
21. 10: Evolve as Needed
    • Use S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse
    • Decoupled storage and compute is cheaper and more efficient to operate
    • Decoupled storage and compute allows us to evolve to clusterless architectures like Athena
    • Do not build data silos in Hadoop or the enterprise DW
    • Gain flexibility to use all the analytics tools in the ecosystem around S3 & future-proof the architecture
