
Serverless Big Data Analytics with Amazon Athena and QuickSight


  1. 1. Serverless Big Data Analytics with Amazon Athena and QuickSight • Ian Robinson, Specialist SA, Data and Analytics, EMEA • 20 September 2017 • © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. From data to answers: Collect → Store → Process/Analyze → Consume (time to first answer)
  3. 3. Agile Analytics • Experiment • Invest in promising experiments • Fail fast • React quickly
  4. 4. Serverless Analytics • Amazon S3: Highly durable object storage • AWS Glue: Data catalog and managed ETL • Amazon Athena: Serverless interactive SQL queries • Amazon QuickSight: Business analytics service
  5. 5. Diagram: Raw S3 Data → ETL Job → Canonical Data → Amazon Athena → Amazon QuickSight. The Data Catalog describes the raw and the canonical data, and Amazon Athena uses the Data Catalog.
  6. 6. AWS Glue
  7. 7. AWS Glue: Components • Data Catalog: Hive Metastore compatible with enhanced functionality; crawlers automatically extract metadata and create tables; integrated with Amazon Athena and Amazon Redshift Spectrum • Job Execution: runs jobs on a serverless Spark platform; provides flexible scheduling; handles dependency resolution, monitoring and alerting • Job Authoring: auto-generates ETL code; built on open frameworks (Python and Spark); developer-centric (editing, debugging, sharing)
  8. 8. Diagram: Taxi CSV (1.6 GB) → Taxi ETL Job → Parquet (94.8 MB) • Limo CSV (220.3 MB) → Limo ETL Job → Parquet (18 MB) • Canonical Data (Parquet) → Amazon Athena → Amazon QuickSight
  9. 9. Review Database and Table Definitions
  10. 10. Amazon Athena
  11. 11. Introducing Amazon Athena Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL
  12. 12. Athena is Serverless • No Infrastructure or administration • Zero Spin up time • Transparent upgrades
  13. 13. Amazon Athena is Easy To Use • Log into the Console • Create a table: type in a Hive DDL statement, use the console Add Table wizard, or use tables from Glue’s Data Catalog • Start querying
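The "type in a Hive DDL statement" step can be sketched as follows. This is a minimal illustration, not the deck's own code: the helper `build_external_table_ddl`, the table name `taxi_raw`, the columns and the bucket `s3://example-bucket/taxi/` are all hypothetical. The generated statement could be pasted into the Athena console query editor.

```python
def build_external_table_ddl(table, columns, location, delimiter=","):
    """Build a Hive-style CREATE EXTERNAL TABLE statement for delimited text files."""
    column_lines = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns and bucket, for illustration only.
ddl = build_external_table_ddl(
    table="taxi_raw",
    columns=[("pickup_datetime", "string"), ("fare_amount", "double")],
    location="s3://example-bucket/taxi/",
)
print(ddl)
```

Because the table is EXTERNAL, creating or dropping it only changes metadata; the files under the S3 location are left untouched.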
  14. 14. Amazon Athena is Highly Available • You connect to a service endpoint or log into the console • Athena uses warm compute pools across multiple Availability Zones • Your data is in Amazon S3, which is also highly available and designed for 99.999999999% durability
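To give the durability figure some scale, here is a small worked calculation. It assumes the 99.999999999% figure is read as a per-object, per-year design target, and the object count of 10,000,000 is a hypothetical example rather than a number from the slides.

```python
from fractions import Fraction

# Assumption: durability is interpreted per object, per year.
durability = Fraction("0.99999999999")
annual_loss_probability = 1 - durability  # exactly 1 / 10**11

# Hypothetical number of stored objects.
stored_objects = 10_000_000

expected_losses_per_year = stored_objects * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year
print(years_per_expected_loss)
```

Exact fractions are used instead of floats because subtracting a value this close to 1 in floating point loses precision.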
  15. 15. Query Data Directly from Amazon S3 • No loading of data • Query data in its raw format • Avro, Text, CSV, JSON, weblogs, AWS service logs • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  16. 16. Familiar Technologies Under the Covers • Presto, used for SQL queries: in-memory distributed query engine; ANSI-SQL compatible with extensions • Apache Hive, used for DDL functionality: complex data types; multitude of formats; supports data partitioning; EXTERNAL tables, with no impact on underlying data
  17. 17. Use ANSI SQL • Start writing ANSI SQL • Support for complex joins, nested queries & window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key (date, time, custom keys), e.g., Year, Month, Day, Hour or Customer Key, Date
  18. 18. Amazon Athena Supports Multiple Data Formats • Text files, e.g. CSV, TSV, custom delimiter • Apache Web Logs, CloudTrail logs • JSON (simple, nested), AVRO • Columnar formats, e.g. Apache Parquet & Apache ORC • Logstash Grok for unstructured text files • Compressed files (Snappy, Zlib, GZIP, and LZO) • Encrypted data (SSE-S3, SSE-KMS, CSE-KMS) • Tip: use large (128 MB to 1 GB) compressed files
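Partitioning by Year, Month, Day, Hour is commonly expressed as Hive-style `key=value` folders in S3. The sketch below shows one way to build such a prefix and the matching `ALTER TABLE ... ADD PARTITION` statement. The helper names, the table `logs` and the bucket `s3://example-bucket/logs` are hypothetical, and the table is assumed to have been created with `year`, `month`, `day` and `hour` as string partition columns.

```python
from datetime import datetime


def partition_prefix(base, ts):
    """Return the Hive-style S3 prefix (key=value folders) for one hour of data."""
    return (
        f"{base.rstrip('/')}/year={ts.year:04d}/month={ts.month:02d}"
        f"/day={ts.day:02d}/hour={ts.hour:02d}/"
    )


def add_partition_ddl(table, base, ts):
    """Return an ALTER TABLE statement that registers that prefix as a partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        f"(year='{ts.year:04d}', month='{ts.month:02d}', "
        f"day='{ts.day:02d}', hour='{ts.hour:02d}') "
        f"LOCATION '{partition_prefix(base, ts)}'"
    )


# Hypothetical bucket and table, for illustration only.
ts = datetime(2017, 9, 20, 14)
prefix = partition_prefix("s3://example-bucket/logs", ts)
statement = add_partition_ddl("logs", "s3://example-bucket/logs", ts)
print(prefix)
print(statement)
```

A query that filters on the partition columns then only needs to read the matching prefixes, which reduces the amount of data scanned.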
  19. 19. Amazon Athena is Fast • Tuned for performance • Automatically parallelizes queries • Results are streamed to console • Results also stored in S3 • Improve query performance: • Compress your data • Use columnar formats • Partition your data
  20. 20. Amazon Athena is Cost Effective • Pay per query • $5 per TB scanned from S3 • DDL Queries and failed queries are free • Reduce costs: • Compress your data • Use columnar formats • Partition your data
  21. 21. Apache Parquet and Apache ORC: Columnar Formats • PARQUET: columnar format; schema segregated into footer; column major format; all data is pushed to the leaf; integrated compression and indexes; support for predicate pushdown • ORC: Apache top level project; schema segregated into footer; column major with stripes; integrated compression, indexes, and stats; support for predicate pushdown
  22. 22. Pay By the Query - $5/TB Scanned • Pay by the amount of data scanned per query • Ways to save costs: compress, convert to columnar format, use partitioning • Free: DDL queries, failed queries
      Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
      Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
      Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
      Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
  23. 23. Converting to ORC and PARQUET • Use Glue! • You can use Hive CTAS to convert data:
      CREATE TABLE new_key_value_store
      STORED AS PARQUET
      AS
      SELECT col_1, col_2, col_3 FROM noncolumnartable
      SORT BY new_key, key_value_pair;
      • You can also use Spark to convert the file into PARQUET / ORC • 20 lines of PySpark code, running on EMR • Converts 1 TB of text data into 130 GB of Parquet with Snappy compression • Total cost $5
      https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  24. 24. Start Querying with Amazon Athena • Review console • Run Glue crawler to create canonical table definition • Change datatypes • Run some simple queries • Diagram: Raw S3 Data → ETL Job → Canonical Data → Amazon Athena → Amazon QuickSight, with the Data Catalog describing the raw and the canonical data
  25. 25. ETL Uber Data into Parquet
      glueContext.write_dynamic_frame.from_options(
          frame = datasource1,
          connection_type = "s3",
          connection_options = {"path": "s3://ianrob-nyc-transportation"},
          format = "parquet",
          transformation_ctx = "datasink4")
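The cost column in the table above follows directly from the $5 per TB rate. The sketch below reproduces that arithmetic. It assumes binary units (1 TB = 1024^4 bytes) and ignores any rounding or minimum charge the service may apply; the function name is hypothetical.

```python
PRICE_PER_TB_USD = 5.0
BYTES_PER_GB = 1024 ** 3  # assumption: binary units
BYTES_PER_TB = 1024 ** 4


def query_cost_usd(bytes_scanned):
    """Cost of one query at a flat per-TB-scanned rate."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Data-scanned figures taken from the slide.
text_cost = query_cost_usd(1.15 * BYTES_PER_TB)
parquet_cost = query_cost_usd(2.69 * BYTES_PER_GB)
saving = 1 - parquet_cost / text_cost

print(f"text: ${text_cost:.2f}")
print(f"parquet: ${parquet_cost:.3f}")
print(f"saving: {saving:.1%}")
```

The saving comes almost entirely from scanning less data: a columnar, compressed format lets the engine read only the columns a query touches.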
  26. 26. Amazon QuickSight
  27. 27. Using Amazon Athena with Amazon QuickSight • QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources, including Amazon Athena, Amazon RDS, Amazon S3, and Amazon Redshift
  28. 28. AOB
  29. 29. AWS Glue regional availability plan
      Planned schedule | Regions
      At launch | US East (N. Virginia)
      Q3 2017 | US East (Ohio), US West (Oregon)
      Q4 2017 | EU (Ireland), Asia Pacific (Tokyo), Asia Pacific (Sydney)
      2018 | Rest of the public regions
  30. 30. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
