Query your data in S3 with SQL and optimize for cost and performance

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Steffen Grunwald, AWS Solutions Architect, @steffeng
AWS Pop-up Loft Berlin, 17 October 2018

What you will learn from this Session
• Benefits of raw Data in Amazon Simple Storage Service
• Query on S3 with Amazon Athena
• Optimize your Data Structure
  • Compression
  • Partitioning
  • Columnar Formats
• Derive Views from raw Data for frequent Queries

Example Application: New York Taxi Data Ingestion
[Architecture diagram: Instance, Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon CloudWatch, Amazon Kinesis Firehose, Amazon S3, AWS Glue, Amazon Athena, Amazon QuickSight]

Benefits of raw Data in
Amazon Simple Storage Service (S3)
• Highly durable and cost-effective object store
• Limitlessly scalable
• Pay for what you use - in GB per month
• Decouple storage from compute
• Widely supported API by many consumers
• Well integrated into other AWS services
Use S3 as long-term storage to answer tomorrow's as yet unknown questions.

Ingest Data with Amazon Kinesis Firehose
• Stores stream of records as files in a bucket
• Path: <Optional Prefix> + "YYYY/MM/DD/HH" (ingestion time, UTC)
• Optionally compress (GZIP, ZIP, Snappy)
• Optionally store as columnar format (ORC, Parquet)
• Optionally transform records with AWS Lambda
[Diagram: Amazon Kinesis Firehose, Amazon S3 Bucket]
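The path rule on this slide can be sketched in a few lines of Python. This is only an illustration: `firehose_prefix` is a hypothetical helper, not part of any AWS SDK, and the trailing slash is an assumption about how the object name is appended.

```python
from datetime import datetime, timezone


def firehose_prefix(ingested_at: datetime, optional_prefix: str = "") -> str:
    """Return <optional prefix> + "YYYY/MM/DD/HH/" for a timezone-aware ingestion time.

    The time is converted to UTC first, because the path is based on
    ingestion time in UTC. The trailing "/" is an assumption.
    """
    utc = ingested_at.astimezone(timezone.utc)
    return optional_prefix + utc.strftime("%Y/%m/%d/%H/")
```

Note that the path reflects when a record was ingested, not any timestamp inside the record.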

Amazon Athena is an interactive query service that
makes it easy to analyze data directly from Amazon
S3 using Standard SQL

Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• No Extract, Transform, and Load (ETL) required
• Stream data directly from Amazon S3

Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries & window functions
• Complex data types (arrays, structs, maps)
• Partitioning of data by any key
  • date, time, custom keys
• Presto built-in functions

Amazon Athena Supports Multiple Data Formats
• Text files, e.g., CSV, raw logs
• Apache Web Logs, TSV files
• JSON (simple, nested)
• Compressed files
• Columnar formats such as Parquet & ORC
• AVRO support

Amazon Athena is Cost Effective
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
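As a rough back-of-the-envelope aid, the per-query price can be estimated from the bytes scanned. The sketch below assumes a binary terabyte (1024^4 bytes) and ignores any minimum charge or rounding rules, which are not stated on the slide; `athena_scan_cost` is a hypothetical helper.

```python
PRICE_PER_TB_USD = 5.0    # from the slide: $5 per TB scanned from S3
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def athena_scan_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of a successful query that scanned bytes_scanned bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Under these assumptions, a query that scans 4.62 GB costs roughly $0.02, which is why reducing scanned bytes is the main cost lever.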

Demo: Query files from Amazon Kinesis Firehose
with Amazon Athena and AWS Glue

The Example Data
• NYC Taxi & Limousine Commission rides
• Data is generated by kinesis-taxi-stream-producer, available at [1]:
java -jar kinesis-taxi-stream-producer.jar \
  -speedup 400 -statisticsFrequency 10000 \
  -stream nyctlc-ingestion -noWatermark \
  -region eu-central-1 -adaptTime ingestion
• ~2GB/h of raw data, 11 days, 487 GB total
[1] https://github.com/aws-samples/flink-stream-processing-refarch

Test Setup: Ingesting Data with different Settings
[Diagram: Instance, Amazon Kinesis Streams, four Firehose delivery streams (raw, gzip, orc, parquet), Amazon S3]
(max Amazon Kinesis Firehose buffering hints: 128MB & 900s)

Example Query I
Show some rides on 2nd September 10-11h:
SELECT *
FROM "128mb"
WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'
LIMIT 10
Run time: 3.53 seconds, Data scanned: 4.62GB

Example Query II (gzip)
Show some rides on 2nd September 10-11h:
SELECT *
FROM "128mbgz"
WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'
LIMIT 10
Run time: 2.45 seconds, Data scanned: 303.04KB
gzip reduces 487GB to 76GB.

Example Query III (without LIMIT 10)
What was the distribution of passenger load
on 2nd September 10-11h?
SELECT passenger_count, count(*) count
FROM "128mbgz"
WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'
GROUP BY passenger_count

Partitions to the Rescue
The AWS Glue crawler adds partitions based on file prefixes/directories

Example Query IV
[...]
FROM "128mbgz"
WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'
AND partition_0 || partition_1 || partition_2 || partition_3
BETWEEN '2018090210' AND '2018090215'
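The partition filter works because the concatenated partition values compare as strings in chronological order. A minimal sketch of that selection, assuming the crawler maps the YYYY/MM/DD/HH path segments to partition_0 through partition_3 as in the query; `select_partitions` is a hypothetical helper, not an Athena API:

```python
def select_partitions(prefixes, low, high):
    """Keep the YYYY/MM/DD/HH prefixes whose concatenated parts fall in [low, high]."""
    selected = []
    for prefix in prefixes:
        # "2018/09/02/10" -> partition_0..partition_3 -> "2018090210"
        key = "".join(prefix.strip("/").split("/"))
        if low <= key <= high:  # string comparison, like BETWEEN on strings
            selected.append(prefix)
    return selected
```

Only objects under the selected prefixes need to be read; everything else is skipped before any data is scanned.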

Crawl Partitions with AWS Glue
[Diagram: Log, S3, Glue, Data Catalog, Athena. Steps: Crawl data, Create table partitions, Schema Lookup, Query data]
Why? Just schedule the crawler, no need to code!
Deals with schema evolution.

Use Hive-style File Format in S3
Move/copy:
from YYYY/MM/DD/HH/file
to year=YYYY/month=MM/day=DD/hours=HH/file
Make Athena reload partitions with: MSCK REPAIR TABLE
Why? Format easy to create on write, easy to move.
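The key rewrite itself is mechanical. A sketch, assuming keys end in YYYY/MM/DD/HH/file with an optional leading prefix; `to_hive_style` is a hypothetical helper and the actual move/copy in S3 is left out:

```python
def to_hive_style(key: str) -> str:
    """Rewrite '<prefix>/YYYY/MM/DD/HH/file' as
    '<prefix>/year=YYYY/month=MM/day=DD/hours=HH/file'."""
    parts = key.split("/")
    if len(parts) < 5:
        raise ValueError(f"expected .../YYYY/MM/DD/HH/file, got {key!r}")
    # The last five segments are year, month, day, hour and the file name.
    *prefix, year, month, day, hour, filename = parts
    hive_parts = [f"year={year}", f"month={month}", f"day={day}", f"hours={hour}"]
    return "/".join(prefix + hive_parts + [filename])
```

The partition names (year, month, day, hours) follow the slide; the file name in any example call is made up.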

Creating Partitions with AWS Lambda
[Diagram: Log, S3, Lambda, Data Catalog, Athena. Steps: New File Trigger, Add table partition, Schema Lookup, Query data]
Why? Add partitions instantly, just AWS Lambda cost.
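What such a Lambda function has to do can be sketched as follows. This is not the code from the talk: the function name, the table name `mytable` and the partition column names are assumptions, and submitting the statements to Athena is deliberately left out.

```python
from urllib.parse import unquote_plus


def partition_statements(event, table="mytable"):
    """Build one ALTER TABLE statement per object in an S3 event notification.

    Assumes object keys of the form [<prefix>/]YYYY/MM/DD/HH/<file>.
    """
    statements = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        year, month, day, hour = key.split("/")[-5:-1]
        location = "s3://" + bucket + "/" + key.rsplit("/", 1)[0] + "/"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(year='{year}',month='{month}',day='{day}',hour='{hour}') "
            f"LOCATION '{location}'"
        )
    return statements
```

IF NOT EXISTS keeps the statement harmless when several files arrive for the same hour.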

Populate Partitions if paths are known
Issue Statements with Amazon Athena:
ALTER TABLE mytable
ADD PARTITION
(year='2015',month='01',day='01')
LOCATION 's3://[...]/2015/01/01/'
Why? Easy for predictable paths. Can be prepopulated.
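For predictable daily paths, the statements can be generated ahead of time. A minimal sketch that only builds the SQL text; `daily_partition_statements` and its `base_location` parameter are hypothetical, and the bucket path has to be supplied by the caller:

```python
from datetime import date, timedelta


def daily_partition_statements(table, base_location, first_day, last_day):
    """Return one ALTER TABLE ... ADD PARTITION statement per day in [first_day, last_day]."""
    statements = []
    day = first_day
    while day <= last_day:
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{day:%Y}',month='{day:%m}',day='{day:%d}') "
            f"LOCATION '{base_location}/{day:%Y/%m/%d}/'"
        )
        day += timedelta(days=1)
    return statements
```

Because partitions can be added before data arrives, a whole year can be prepopulated in one go.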

Columnar Formats

Flat File Sample Layout
First_Name  Last_Name   Age  Gender
Tootsie     Label       34   Fem
Miriam      Le Fleming  25   Fem
Blakeley    Lisciandro  45   Fem
Ernst       Minghi      63   Mal
Brew        Jime        22   Mal

Columnar Formats Layout (Parquet & ORC)
First_Name: Tootsie, Miriam, Blakeley, Ernst, Brew (MIN: Blakeley, MAX: Tootsie)
Last_Name: Label, Le Fleming, Lisciandro, Minghi, Jime (MIN: Jime, MAX: Minghi)
Age: 34, 25, 45, 63, 22 (MIN: 22, MAX: 63)
Gender: Fem, Fem, Fem, Mal, Mal (MIN: Fem, MAX: Mal)

Benefit 1: Predicate Pushdown
SELECT * FROM ... WHERE Age > 30
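The idea behind predicate pushdown is that the per-column MIN/MAX statistics allow a reader to skip whole chunks of data. A toy sketch, not how Parquet or ORC readers are actually implemented; the second row group is made up for illustration:

```python
def scan_with_pushdown(row_groups, column, threshold):
    """Return (values where column > threshold, number of row groups actually read)."""
    matches, groups_read = [], 0
    for group in row_groups:
        # Statistics check: if even the maximum fails the predicate, skip the group.
        if group["stats"][column]["max"] <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group["data"][column] if v > threshold)
    return matches, groups_read


row_groups = [
    # Values and statistics from the slide.
    {"stats": {"Age": {"min": 22, "max": 63}}, "data": {"Age": [34, 25, 45, 63, 22]}},
    # Made-up second row group whose maximum is below the threshold.
    {"stats": {"Age": {"min": 18, "max": 29}}, "data": {"Age": [18, 29, 21]}},
]
```

With `Age > 30`, the second group is never read because its MAX of 29 cannot satisfy the predicate.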

Benefit 2: Projection Pushdown/ Column Pruning
SELECT First_Name FROM ... WHERE Age > 30
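Column pruning means that only the columns named in the query are touched. A toy column store using the sample data from the slides, assuming values at the same position belong to the same row; `project` is a hypothetical helper:

```python
columns = {
    "First_Name": ["Tootsie", "Miriam", "Blakeley", "Ernst", "Brew"],
    "Last_Name": ["Label", "Le Fleming", "Lisciandro", "Minghi", "Jime"],
    "Age": [34, 25, 45, 63, 22],
    "Gender": ["Fem", "Fem", "Fem", "Mal", "Mal"],
}


def project(columns, select, where_column, predicate):
    """Evaluate SELECT <select> FROM ... WHERE predicate(<where_column>).

    Returns (result values, sorted names of the columns that were read).
    """
    read = {select, where_column}
    # Only the filter column is scanned to find matching row positions.
    keep = [i for i, value in enumerate(columns[where_column]) if predicate(value)]
    # Only the selected column is read to produce the output.
    return [columns[select][i] for i in keep], sorted(read)
```

For the query on the slide, Last_Name and Gender are never read, so they do not count towards scanned data.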

Benefit 3: Compression & Encoding
• RLE (& Bit Packing) for numbers
• Dictionary for string repetitions (+RLE)
• Delta encoding for increasing numbers
• Delta Strings (for strings with an identical prefix)
• Plain encoding for varied strings
https://github.com/apache/parquet-format/blob/master/Encodings.md
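Two of the listed encodings are easy to illustrate. The functions below are simplified sketches of the ideas only; they do not produce Parquet's actual on-disk byte layout, which is described at the link above.

```python
def rle_encode(values):
    """Run-length encode a sequence as [(value, run_length), ...]."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return [tuple(run) for run in runs]


def delta_encode(values):
    """Store the first value followed by the differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]
```

Long runs collapse into a single pair, and slowly increasing numbers turn into small deltas that need fewer bits.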

More on Dictionary Encoding
• Builds list of unique strings, assigns numeric ID to each
• If the dictionary size exceeds 1MB (configurable) or the
number of distinct values is too high, it falls back to
Plain encoding.
• The data itself is later represented as numbers and is
further encoded using RLE
https://github.com/apache/parquet-format/blob/master/Encodings.md
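The fallback behaviour can be sketched as follows. This is a simplification: the size accounting (UTF-8 bytes of the distinct strings) and the return shape are assumptions made for illustration, not the writer's real bookkeeping.

```python
def dictionary_encode(values, max_dictionary_bytes=1024 * 1024):
    """Return ("dictionary", dictionary, ids), or ("plain", values) on fallback."""
    dictionary, ids, index, size = [], [], {}, 0
    for value in values:
        if value not in index:
            size += len(value.encode("utf-8"))
            if size > max_dictionary_bytes:
                # Dictionary grew too large: give up and store the values as they are.
                return ("plain", list(values))
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return ("dictionary", dictionary, ids)
```

The resulting list of numeric IDs is what would then be run-length encoded.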

Demo: Parquet/ ORC with Amazon
Kinesis Firehose (new!)

Example Query V (parquet)
[...]
FROM "128mbparquet"
WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'
AND partition_0 || partition_1 || partition_2 || partition_3
BETWEEN '2018090210' AND '2018090215'
Run time: 3.21 seconds, Data scanned: 300.7MB

Analyzing Parquet File
• parquet-tools
• head – view data in file
• meta – get metadata summary
• dump -d -n – get detailed metadata down to page level, statistics included

Schema Information
[Annotated screenshot of the meta output: Row Count, Total Byte Size, Size in Bytes, Value Count, Encoding]
Download and build [1].
$ java -jar parquet-tools.jar meta <parquetfile>
[1] https://github.com/apache/parquet-mr/

parquet-tools dump: Encoding & Statistics
total_amount:
- DOUBLE SNAPPY DO:0 FPO:4155231 SZ:329324/338501/1.03
[more]... ST:[min: -76.8, max: 1121.3, num_nulls: 0]
dropoff_datetime:
- BINARY SNAPPY DO:0 FPO:3315979 SZ:839131/5540639/6.60
[more]... ST:[no stats for this column]
The string-typed (BINARY) timestamp column carries no statistics. Use a numeric
timestamp (unix epoch) or partition by timestamp for time series data.
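Converting a timestamp string to a number before writing is straightforward. A sketch, assuming ISO 8601 timestamps in UTC like the ones in this data set; `to_epoch_millis` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(timestamp: str) -> int:
    """Convert an ISO 8601 timestamp such as '2018-08-30T00:13:48.573Z' to epoch milliseconds."""
    # datetime.fromisoformat() before Python 3.11 does not accept a trailing 'Z'.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```

Stored as a 64-bit integer, the column gets MIN/MAX statistics and can benefit from delta encoding.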

Example Query VI (ORC)
[...]
FROM "128mborc"
WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'
AND partition_0 || partition_1 || partition_2 || partition_3
BETWEEN '2018090210' AND '2018090215'

Analyzing ORC: orcfiledump
Spin up a single-node (master only) EMR cluster and use the
hive command:
hive --orcfiledump file://<absolutepath>/file.orc
[…]
Column 7: count: 210141 hasNull: false min: -76.96324157714844 max: 0.0 sum: -1.5329986951126099E7
Column 8: count: 210141 hasNull: false min: 2018-08-30T00:13:48.573Z max: 2018-08-30T00:28:49.564Z sum: 5043384
[…]

ETL with AWS Glue For Frequent Queries
[Diagram: Log, S3, Glue, Data Catalog, Athena. Steps: Read/Write, Write table partitions, Schema Lookup, Query data]

Demo: ETL with AWS Glue

Example Zeppelin/ AWS Glue Notebook
https://gist.github.com/steffeng/5b841a99230ba8377f161f55453d49d0

Example Query VII (repartitioned)
[...]
FROM "partitioned_by_hour"
WHERE year = 2018
AND month = 9
AND day = 2
AND hour = 10

Example Query VIII (aggregated)
SELECT passenger_count, trip_count
FROM "aggregates_by_hour"
WHERE year = 2018
AND month = 9
AND day = 2
AND hour = 10
Run time: 1.85 seconds, Data scanned: 0.37KB
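The aggregate table holds one precomputed row per hour and passenger count, so the query only reads a handful of bytes. The talk builds it with an AWS Glue job; the plain Python below is only a stand-in for the aggregation logic, with field names taken from the queries and `aggregate_by_hour` as a hypothetical helper.

```python
from collections import Counter


def aggregate_by_hour(rides):
    """Count trips per (pickup hour, passenger_count)."""
    counts = Counter()
    for ride in rides:
        hour = ride["pickup_datetime"][:13]  # 'YYYY-MM-DDTHH'
        counts[(hour, ride["passenger_count"])] += 1
    return [
        {"hour": hour, "passenger_count": passengers, "trip_count": trips}
        for (hour, passengers), trips in sorted(counts.items())
    ]
```

The trade-off is that the aggregate answers only the questions it was built for; the raw data stays in S3 for everything else.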

Recently announced and relevant...

Photo by Benjamin Davies on Unsplash
I applied these simple tricks when storing data for Amazon Athena
and you won't believe what happened next...

Measure. Then optimize.
There's no silver bullet.
Photo by Cesar Carlevarino Aragon on Unsplash

Optimize for Cost and Performance 1/2
• Use Athena in the region of your buckets.
• Compress your data for less storage & query cost.
• Use LIMIT in queries for faster results.
• Partition your data based on data access patterns.
• Use partitions in your queries.
• Add partitions by crawling or S3 triggers.

Optimize for Cost and Performance 2/2
• Columnar formats such as ORC & Parquet reduce scanned
data: faster queries, lower cost
• Pick format depending on data, access patterns, clients
• Inspect/ verify the resulting files
• Create aggregates for frequent queries
• Shorten turnaround times for Glue job development:
• Use a provisioned development endpoint
• Use small subset of your data (think KB!)

The AWS Free Tier allows you to
get hands-on experience with AWS
Glue and S3. Try it today!

Questions?
Ask the Architect
downstairs!
