ABD313: Building an End-to-End Serverless Data Analytics Solution on AWS
1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:Invent
Building an End-to-End Serverless Data Analytics Solution on AWS
Gowri Balasubramanian, AWS Solution Architect
Karthik Kumar Odapally, AWS Solution Architect
Rajeev Srinivasan, AWS Solution Architect
Rudy Chetty, AWS Solution Architect
ABD313
November 27, 2017
Agenda
• Presentation
Service Introduction
Reference Architecture
Query Performance Best Practices
Workshop Overview
• Hands-on Workshop
Lab 1: Serverless Analysis of Data in Amazon Simple Storage Service (Amazon S3) Using Amazon Athena
Lab 2: Visualization Using Amazon QuickSight
Lab 3: Serverless ETL and Data Discovery Using AWS Glue [Optional]
Lab 4: Analysis of Data in Amazon S3 Using Amazon Redshift Spectrum [Take Home]
SERVICE INTRODUCTION
Amazon Athena, Amazon QuickSight, AWS Glue
Amazon Athena
Start Querying Instantly
Serverless. No ETL.
Pay per Query
Only pay for data scanned.
Open. Powerful. Standard.
Built on Presto. Runs standard SQL.
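The "only pay for data scanned" model can be made concrete with a small estimate. The sketch below assumes a price of $5 per TB scanned, billing rounded up to the next megabyte, and a 10 MB minimum per query. None of these figures appear in the slides, pricing varies by region and over time, and the function name `estimate_query_cost` is hypothetical.

```python
import math

MIB = 1024 ** 2
TIB = 1024 ** 4


def estimate_query_cost(bytes_scanned, price_per_tib=5.00, min_billed_bytes=10 * MIB):
    """Estimate the charge for one query from the bytes it scanned.

    Assumptions (not from the slides): the price per TB, rounding up to the
    next MB, and a per-query minimum. Check current pricing before relying
    on the result.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    # Round up to a whole number of megabytes, then apply the assumed minimum.
    billed = math.ceil(bytes_scanned / MIB) * MIB
    billed = max(billed, min_billed_bytes)
    return billed / TIB * price_per_tib


if __name__ == "__main__":
    # 207.54 GB is the "Data scanned" figure from the CSV count(*) example
    # on the storage best-practices slide.
    csv_scan_bytes = 207.54 * 1024 ** 3
    print(f"Estimated cost: ${estimate_query_cost(csv_scan_bytes):.2f}")
```

Because the estimate depends only on bytes scanned, anything that reduces scanned data (partitioning, columnar formats, selecting fewer columns) lowers cost as well as run time.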
Amazon QuickSight
Amazon QuickSight—Data Sources
[Diagram: Amazon Athena and Amazon S3, each shown with SPICE]
Amazon QuickSight—Data Sources
[Diagram: Amazon Redshift and Redshift Spectrum, each shown with SPICE]
AWS Glue
Integrated Data Catalog
Automated Data Discovery
Code Generation
Developer Endpoints
Flexible Job Scheduler
AWS Glue—Components
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena and Amazon Redshift Spectrum
Job Execution
• Run jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring, and alerting
Job Authoring
• Auto-generates ETL code
• Built on open frameworks: Python and Spark
• Developer-centric: editing, debugging, sharing
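To give a feel for what "automatically extract metadata and create tables" involves, here is a toy schema-inference sketch for a CSV sample. It is not how AWS Glue crawlers or classifiers are implemented. The function names, the three-type ladder (bigint, double, string), the `fare_amount` column, and the sample rows are all assumptions for illustration. The other column names are borrowed from the queries later in the deck.

```python
import csv
import io


def infer_column_type(values):
    """Pick the narrowest of bigint, double, string that fits every value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for type_name, cast in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Return {column name: inferred type} for CSV text with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]
    columns = {
        name: [row[i] for row in rows if i < len(row)]
        for i, name in enumerate(header)
    }
    return {name: infer_column_type(vals) for name, vals in columns.items()}


# Hypothetical sample data, not taken from the workshop dataset.
SAMPLE = """vendorid,tpep_pickup_datetime,passenger_count,fare_amount
2,2011-05-01 00:10:00,1,7.5
1,2011-05-01 00:12:30,3,12.0
"""

if __name__ == "__main__":
    for column, column_type in infer_schema(SAMPLE).items():
        print(f"{column}: {column_type}")
```

A real crawler also has to detect file formats, partitions, and timestamps. The sketch only shows the basic idea of deriving a table definition from the data itself.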
Customer Reference—Amazon QuickSight
Customer Reference—Amazon Athena
“One of the big attractions of Amazon Athena is that it’s serverless and purely consumption based. We only pay when we’re actually querying the data, and we don’t have to keep a cluster running all the time.”
–Matt Chesler, Director of DevOps, Movable Ink
Reference Architecture
On-Premises Analytics Pipeline
Always On
Static: Not Scalable
Outages Impact
Storage
Compute
Database
User
Reporting
IoT Devices
Application
Server Logs
Data Source
On-Premises Hadoop Cluster
Reference Architecture
Kinesis Firehose: Real-Time Data Collection
AWS DMS: Data Export
Log Aggregation: AWS Service Logs, Web Application Logs, Server Logs
Landing Amazon S3 Bucket
Amazon Athena: Query data
Amazon QuickSight: Visualization
Reference Architecture—ETL
Kinesis Firehose: Real-Time Data Collection
AWS DMS: Data Export
Log Aggregation: AWS Service Logs, Web Application Logs, Server Logs
Landing Amazon S3 Bucket
Glue Crawler
Glue Data Catalog
Glue ETL
Amazon S3
Amazon Athena: Query data
Amazon QuickSight: Visualization
Reference Architecture
Amazon QuickSight
4. Visualize all your data
Query Performance
Best Practices—Storage
Partition your data
Optimize columnar data store generation
Compress and split files
Optimize file size
SELECT count(*) as count FROM taxi_rides_csv
(Run time: 20.06 seconds, Data scanned: 207.54GB, Row Count: 1,310,911,060)
SELECT count(*) as count FROM taxi_rides_parquet
(Run time: 5.76 seconds, Data scanned: 0KB, Row Count: 2,870,781,820)
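"Partition your data" usually means laying files out under Hive-style `column=value` prefixes so that a query filtering on those columns can skip whole prefixes. The sketch below illustrates that layout and the pruning idea in plain Python. The bucket name `example-bucket`, the helper names, and the file names are hypothetical. Treating `year`, `month`, and `type` as partition columns is an assumption based on the filters in the deck's `nytaxirides` queries.

```python
from typing import Dict, Iterable, List


def partition_prefix(table_root: str, partitions: Dict[str, object]) -> str:
    """Build a Hive-style prefix such as .../year=2011/month=5/type=yellow/."""
    parts = "/".join(f"{column}={value}" for column, value in partitions.items())
    return f"{table_root.rstrip('/')}/{parts}/"


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract column=value pairs from an object key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def prune(keys: Iterable[str], **wanted: object) -> List[str]:
    """Keep only keys whose partition values match every requested filter."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(col) == str(val) for col, val in wanted.items())
    ]


if __name__ == "__main__":
    root = "s3://example-bucket/nytaxirides"  # hypothetical location
    keys = [
        partition_prefix(root, {"year": y, "month": m, "type": t}) + "part-0000.parquet"
        for y in (2010, 2011)
        for m in (4, 5)
        for t in ("yellow", "green")
    ]
    for key in prune(keys, year=2011, month=5, type="yellow"):
        print(key)
```

In this toy listing, only one of the eight objects survives the filter. A query engine that understands the layout can avoid reading the rest in the same way.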
Best Practices—Query
Optimize ORDER BY
Optimize joins
Optimize GROUP BY
Optimize the LIKE operator
SELECT * FROM nytaxirides WHERE year = 2011 AND month = 5 AND type = 'yellow' ORDER BY ratecode
(Run time: 3 minutes 6 seconds)
SELECT * FROM nytaxirides WHERE year = 2011 AND month = 5 AND type = 'yellow' ORDER BY ratecode LIMIT 1000
(Run time: 3.01 seconds)
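One plausible reading of the gap between the two run times is that ORDER BY with a LIMIT lets the engine keep only the best N rows at any moment instead of fully sorting everything. The sketch below is an analogy in plain Python, not a description of Athena's internals. The `ride_id` field and the generated rows are hypothetical.

```python
import heapq
import random


def full_sort_then_take(rows, n):
    """Sort every row by ratecode, then keep the first n."""
    return sorted(rows, key=lambda r: r["ratecode"])[:n]


def top_n(rows, n):
    """Select the n smallest rows by ratecode using a bounded heap."""
    return heapq.nsmallest(n, rows, key=lambda r: r["ratecode"])


if __name__ == "__main__":
    random.seed(0)
    rides = [{"ride_id": i, "ratecode": random.randint(1, 6)} for i in range(100_000)]
    sorted_codes = [r["ratecode"] for r in full_sort_then_take(rides, 1000)]
    top_codes = [r["ratecode"] for r in top_n(rides, 1000)]
    print(sorted_codes == top_codes)
```

Both functions return the same leading rows, but the bounded version never needs to hold a fully sorted copy of the input, which is the property a LIMIT makes available.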
Best Practices—Query
Use approximate functions
Only include the columns that you need
SELECT count(distinct tpep_pickup_datetime) FROM nytaxidata
(Run time: 30.82 seconds)
SELECT approx_distinct(tpep_pickup_datetime) FROM nytaxidata
(Run time: 25.21 seconds)
SELECT * FROM nytaxirides WHERE year = 2011 AND type = 'yellow' AND month = 5
(Run time: 2 minutes 59 seconds, Data scanned: 382.88MB)
SELECT vendorid, ratecode, passenger_count FROM nytaxirides WHERE year = 2011 AND type = 'yellow' AND month = 5
(Run time: 38.79 seconds, Data scanned: 10.06MB)
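The figures on this slide are easier to compare as ratios. The short calculation below uses only the numbers quoted above. The helper name `improvement` and the constant names are hypothetical.

```python
def improvement(before, after):
    """Return how many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Figures copied from the slide.
COUNT_DISTINCT_S = 30.82
APPROX_DISTINCT_S = 25.21
SELECT_STAR_S = 2 * 60 + 59  # 2 minutes 59 seconds
SELECT_STAR_MB = 382.88
SELECT_COLUMNS_S = 38.79
SELECT_COLUMNS_MB = 10.06

if __name__ == "__main__":
    print(f"approx_distinct speedup: {improvement(COUNT_DISTINCT_S, APPROX_DISTINCT_S):.2f}x")
    print(f"column subset speedup:   {improvement(SELECT_STAR_S, SELECT_COLUMNS_S):.1f}x")
    print(f"column subset scan cut:  {improvement(SELECT_STAR_MB, SELECT_COLUMNS_MB):.1f}x")
```

On these figures, selecting three columns scans roughly 38 times less data and runs about 4.6 times faster, while approx_distinct is about 1.2 times faster than count(distinct ...).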
Hands-On Workshop
Lab 1: Serverless Analysis Using Amazon Athena
Amazon S3 Bucket (CSV Format)
Amazon Athena
Amazon S3 Bucket (Parquet Format)
Athena Data Catalog
Metadata
Lab 2: Visualization Using Amazon QuickSight
Amazon S3 Bucket (CSV Format)
Amazon Athena
Amazon S3 Bucket (Parquet Format)
Amazon QuickSight: Visualization
Athena Data Catalog
Metadata
Lab 3: ETL and Data Discovery Using AWS Glue
Amazon S3 Bucket (CSV Format)
Amazon Athena
Your S3 Bucket (Parquet Format)
AWS Glue ETL: CSV to Parquet
AWS Glue Crawler
AWS Glue Crawler
Metadata
Lab 4: Analysis Using Redshift Spectrum
Amazon S3 Bucket (CSV Format)
Amazon Redshift Spectrum
Amazon S3 Bucket (Parquet Format)
AWS Glue Crawler
AWS Glue Crawler
Metadata
Analysis & Visualization Pipeline on AWS
Stages: Generate, Collect, Store, Extract/Transform/Load, Analyze, Visualize/Report
Components shown: IoT Devices, Application, Server Logs, Kinesis Stream, Kinesis Firehose, Polling Application, AWS Glue, Amazon Redshift & Redshift Spectrum, Amazon Athena, Kinesis Analytics, Amazon QuickSight, Amazon S3, Amazon RDS, Database on EC2, Amazon EMR, AWS Lambda, Kinesis Enabled
Lab markers: Lab1, Lab2, Lab3, Lab4
Workshop
• Please collect the credit coupon. You can apply it toward the cost of completing the labs in this workshop.
• Create an AWS account if you don’t have one. Please do not use your production account for the labs.
• Provide your AWS account ID for whitelisting to any of the AWS personnel staffing the workshop.
To find it, choose Support on the navigation bar at the upper right, and then choose Support Center. Your currently
signed-in account ID appears in the upper-right corner below the Support menu.
• Navigate to the following link for the workshop lab instructions:
http://bit.ly/2jgx6vd
• Choose the Oregon region for the labs.
Thank you!
Please complete your survey