2. What is Amazon Athena?
Athena is an interactive query service that uses ANSI-standard SQL to
analyze “big data” stored in Amazon Simple Storage Service (S3).
Typical use cases supported by Amazon Athena are data science,
machine learning, visualizations, ETL, and reporting.
Because Athena is serverless, there is no infrastructure to manage, and you can
tap into scalable storage on S3. You also pay only for the queries you run,
which benefits someone like a data analyst who wants to minimize Amazon Athena
costs.
Amazon Athena is a serverless, interactive analytics service built on open-source
frameworks, supporting open-table and file formats. Athena provides a simplified,
flexible way to analyze petabytes of data where it lives. Analyze data or build
applications from an Amazon Simple Storage Service (S3) data lake and 25+ data
sources, including on-premises data sources or other cloud systems using SQL or
3. AWS Athena is a serverless interactive analytics service offered by
Amazon that can be readily used to gain insights on data residing in S3.
Under the hood, Athena uses a distributed SQL engine called Presto
to run the SQL queries. Table definitions follow the popular
open-source technology Apache Hive and can describe structured, semi-structured and
unstructured data.
4. Amazon Athena is a serverless data query tool, which makes it scalable
and cost-effective at the same time. Customers are charged on a
pay-per-query basis, determined by the amount of data each query
scans rather than by how long it runs. The standard charge for scanning 1 TB of
data from S3 is 5 USD.
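The pricing rule above can be turned into a quick estimate. The sketch below is only illustrative: it assumes the flat 5 USD per terabyte quoted on this slide and treats a terabyte as 1024^4 bytes, and the function name `estimate_query_cost` is made up for this example. Actual prices vary by region and may include minimum charges not modelled here.

```python
# Assumption: a terabyte is counted as 1024**4 bytes.
BYTES_PER_TB = 1024 ** 4
# Price per terabyte scanned, as quoted on the slide.
PRICE_PER_TB_USD = 5.0


def estimate_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Return the approximate cost in USD of a query that scans `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must not be negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    full_scan = estimate_query_cost(1 * BYTES_PER_TB)      # scan the whole 1 TB data set
    pruned_scan = estimate_query_cost(200 * 1024 ** 3)     # scan only 200 GB of it
    print(f"full scan:   {full_scan:.2f} USD")
    print(f"pruned scan: {pruned_scan:.2f} USD")
```

Because the charge follows the bytes scanned, anything that reduces the scanned volume (compression, columnar formats, partitioning) reduces the bill in the same proportion.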
6. Working with Athena
Athena can quickly analyze data in Amazon S3 using standard SQL. The data
does not even need to be loaded into Athena.
All we need to do is point to the data in Amazon S3, define the schema,
and start querying using standard SQL. With Amazon Athena, we can
process any data, whether it is structured, semi-structured or unstructured; for example, it
can handle CSV data as well as arrays and objects.
Amazon Athena provides a simple UI. Getting started with Athena is
straightforward: all we need to do is create a database, choose a table name and specify the
location of the data on Amazon S3.
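The "point to the data and define the schema" step is usually expressed as a Hive-style CREATE EXTERNAL TABLE statement. The following is a minimal sketch that only builds such a statement as a string for comma-delimited files. The helper name `build_create_table_ddl` and the database, table, column and bucket names are hypothetical, and a real table may need extra clauses (for example to skip header rows).

```python
def build_create_table_ddl(database, table, columns, s3_location, delimiter=","):
    """Build a Hive-style DDL statement for delimited text files stored in S3.

    `columns` is a list of (name, type) pairs, for example [("status", "int")].
    """
    if not s3_location.startswith("s3://"):
        raise ValueError("s3_location must start with s3://")
    if not s3_location.endswith("/"):
        # The location refers to a folder (prefix), not to a single file.
        s3_location += "/"
    column_lines = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'"
    )


if __name__ == "__main__":
    # Hypothetical weblog table over a hypothetical bucket.
    print(build_create_table_ddl(
        "weblogs",
        "requests",
        [("request_time", "string"), ("url", "string"), ("status", "int")],
        "s3://example-bucket/logs",
    ))
```

The statement only registers metadata; the files stay where they are in S3, which is why no load step is needed.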
7. Working of AWS Athena
Amazon Athena works directly against data in S3. It uses a
distributed SQL engine for running queries, and it uses Apache Hive
for creating and altering tables and partitions. Some important
prerequisites for working with Athena include:
1. You must have an AWS account.
2. You should enable your account to export the cost and usage data into an
S3 bucket.
3. You can prepare buckets for Athena to connect to.
4. AWS also creates manifest files from metadata each time it writes
to the bucket. In fact, it creates a folder named Athena within the AWS billing data
bucket that contains only the data.
5. To simplify the setup, the US-West-2 region can be
used.
6. The last step is downloading the credentials for the new user,
because these credentials map indirectly to the database credentials.
8. Athena Benefits
Amazon Athena makes it easier to run interactive queries against large data sets by
uploading them directly to Amazon S3, without having to manage infrastructure or handle the
data. Athena is well suited to running queries against weblogs when troubleshooting
issues on a site.
•Based on SQL: You can use Athena to run SQL queries against tables that are
configured in the Glue Data Catalog or against data sources that you connect to using the
Athena Query Federation SDK. For users who already know SQL, there is no learning curve to
get started.
•Open architecture (no vendor lock-in): Athena enables open access to data rather than lock-in
to a specific tool or technology. This manifests itself in various ways:
•Ubiquitous Access: Because your data is stored in an S3 bucket and the schema is defined in
the Glue Data Catalog, you can switch between query engines that can read from these
sources without redefining the schema or creating a separate copy of the data.
9. Athena Benefits
•Separated storage and computing resources: Athena has a complete separation of compute
and storage resources. Data is stored in your Amazon S3 account, while Amazon Web
Services provides Athena computation as a shared resource among all Athena users.
•Open file formats: Unlike many high-performance databases, Athena does not use a
proprietary file format but supports standard open source formats such as Apache Parquet,
ORC, CSV, and JSON.
•Low cost: Athena’s pricing model is based on terabytes of scanned data. You can control and
keep costs down by scanning only the data you need to answer a specific query (this can be
done using data partitioning, described below).
•Access to all your data: Most organizations process only 30 to 35 percent of their data into a
traditional data warehouse due to the high operational and infrastructure costs of constantly
resizing database clusters.
10. Speed and Performance
Because Amazon Athena is serverless, it is quicker and easier to execute
queries on Amazon S3: there is no server or cluster to set up or
manage. Another factor is initialization time. In Athena, we can query
the data on Amazon S3 straight away, but in Redshift we have to wait for the cluster to become active;
only once the cluster is active can we query the data.
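To illustrate "querying straight away", here is a hedged sketch of submitting a query and waiting for it to finish. It assumes `client` is a `boto3.client("athena")` object (or anything exposing the same `start_query_execution` and `get_query_execution` methods). The helper name `run_query`, the polling parameters and the example database and bucket names are assumptions made for this example.

```python
import time

# States after which a query will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(client, sql, database, output_location, poll_seconds=1.0, max_polls=60):
    """Submit `sql` and poll until it reaches a terminal state.

    `client` is assumed to be boto3.client("athena") or a compatible object.
    Returns a (query_execution_id, final_state) tuple.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)

    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")


# Example call (requires AWS credentials; names are hypothetical):
#   import boto3
#   run_query(boto3.client("athena"), "SELECT 1", "weblogs", "s3://example-results/")
```

There is no cluster to start first: the only setup the call needs is a database name and an S3 location for the results.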
12. Speed and Performance
•The optimization is limited to queries: You can optimize your queries, not your data.
Because your data is already stored in Amazon S3, performing transformations to suit
Athena may affect other users who use the same data for other purposes.
•Multi-tenancy means pooled resources: All Athena users receive a similar SLA for queries at
any time. In other words, the entire global user base is “competing” for the same resources –
and although AWS provides more as needed, this could mean that query performance
fluctuates depending on other people’s usage.
•No indexing: Indexes are integrated into traditional databases but do not exist in Athena. This
makes joining large tables a demanding operation that increases the load on Athena and
negatively impacts performance. For example, running a query by key requires scanning all
the data and searching for the desired key in the result list. This is solved using Upsolver
lookup tables.
•Partitioning: Efficient queries in Athena require partitioning of the data. Maintaining a
number of partitions in the table that meets your performance needs is essential. Every 500
partitions scanned will add 1 second to your query.
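Partitioning, as mentioned in the last point above, is commonly done with Hive-style `key=value` folders in S3, each registered with an ALTER TABLE ... ADD PARTITION statement. The sketch below only builds the folder path and the statement as strings. The helper names and the table, bucket and partition values are hypothetical.

```python
def partition_location(table_location, partition):
    """Return the Hive-style S3 prefix for a partition.

    `partition` is an ordered list of (key, value) pairs,
    for example [("year", "2024"), ("month", "01")].
    """
    base = table_location.rstrip("/")
    suffix = "/".join(f"{key}={value}" for key, value in partition)
    return f"{base}/{suffix}/"


def add_partition_ddl(table, table_location, partition):
    """Build the DDL statement that registers one partition with the table."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition)
    location = partition_location(table_location, partition)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


if __name__ == "__main__":
    january = [("year", "2024"), ("month", "01")]
    print(partition_location("s3://example-bucket/logs/", january))
    print(add_partition_ddl("requests", "s3://example-bucket/logs/", january))
```

A query that filters on `year` and `month` then only has to read the matching folders, which is where both the speed and the cost benefit come from.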
14. Which data types does Amazon Athena support?
Athena can process numerous structured and unstructured data types, including
standard data formats like CSV (comma-separated values), JSON (JavaScript Object
Notation), ORC (Optimized Row Columnar), Apache Parquet and Apache Avro. Athena
also supports compressed data in Snappy, Zlib, LZO (Lempel-Ziv-Oberhumer) and Gzip
(GNU Zip) formats.
Examples of supported SQL column data types include:
•BOOLEAN
•TINYINT
•SMALLINT
•BIGINT
•VARCHAR
•CHAR
The Athena API also defines its own object types, for example:
•Column
•WorkGroupConfigurationUpdates
•UnprocessedNamedQueryId
15. Features of Athena
•Serverless
It is serverless so that the end-user does not have to worry about configuration,
infrastructure, scaling, or failure. Athena takes care of it all easily.
•Pay Per Query
Athena charges you only for the queries you run, based on the amount of data that gets
scanned per query. You can save a lot if you compress the data and store it in a
columnar format.
•Secure
Using AWS Identity and Access Management (IAM) policies, Amazon Athena offers complete control
over the data set. With the data being stored in S3 buckets, IAM policies can help in
managing user access.
•Available
Amazon Athena is highly available and the users can execute queries round the clock.
•Machine Learning
Developers can use Amazon SageMaker to create and deploy machine
learning models and use them from Amazon Athena.
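As a rough illustration of the "Secure" point above, the sketch below assembles an IAM policy document that lets a user run Athena queries over one data bucket and store results in another. It is an assumption-laden example, not a complete policy: the function name and bucket names are hypothetical, and a working setup typically also needs AWS Glue Data Catalog permissions, which are omitted here.

```python
import json


def build_read_only_policy(data_bucket, results_bucket):
    """Return an illustrative IAM policy document as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Submit queries and read their status and results.
                "Sid": "RunAthenaQueries",
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": "*",
            },
            {
                # Read-only access to the source data.
                "Sid": "ReadSourceData",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{data_bucket}",
                    f"arn:aws:s3:::{data_bucket}/*",
                ],
            },
            {
                # Athena writes query results on the caller's behalf.
                "Sid": "WriteQueryResults",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:GetBucketLocation"],
                "Resource": [
                    f"arn:aws:s3:::{results_bucket}",
                    f"arn:aws:s3:::{results_bucket}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_read_only_policy("example-data", "example-results"), indent=2))
```

Separating the data bucket from the results bucket keeps the source data read-only for the analyst.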
16. What are the limitations of Amazon Athena?
•Optimization is limited to queries. For example, data already stored in S3 cannot be
optimized.
•No indexing options. Indexing options commonly appear in traditional databases.
Without indexing, the operation load on Athena increases, potentially affecting
performance.
•Efficient queries require partitioning. In order to enable efficient queries, data must first
be partitioned. Partitions must then be managed for what best fits performance needs.
•Stored procedures, parameterized queries and Presto federated connectors are not
supported. Amazon Athena Federated Query is needed to connect data sources.
•When querying a table with thousands of partitions, Athena can time out.
•Source files that start with an underscore or a dot are treated as hidden.
•The row and column size cannot exceed 32 megabytes.
•Athena does not support querying data in S3 Glacier and S3 Glacier Deep Archive
storage classes.
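The hidden-file rule in the list above is easy to trip over when results look incomplete. The sketch below mimics that rule on a list of S3 object keys so that skipped files can be spotted before querying. The function name is hypothetical, the check applies the rule to the file name part of each key, and skipping keys that end in "/" (folder markers) is an assumption of this example.

```python
def visible_objects(keys):
    """Return the keys whose file name does not start with an underscore or a dot."""
    visible = []
    for key in keys:
        file_name = key.rsplit("/", 1)[-1]
        if not file_name:
            # Assumption: keys ending in "/" are folder markers, not data files.
            continue
        if file_name.startswith(("_", ".")):
            # Treated as hidden according to the limitation listed above.
            continue
        visible.append(key)
    return visible


if __name__ == "__main__":
    keys = [
        "logs/year=2024/part-0000.csv",
        "logs/year=2024/_SUCCESS",
        "logs/year=2024/.part-0001.csv.crc",
        "logs/year=2024/",
    ]
    for key in visible_objects(keys):
        print(key)
```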
17. Summary
Athena is an interactive query service offered by Amazon. Athena makes it
easy for the user to directly analyze data in Amazon S3 (Simple Storage Service) using
standard SQL. For example, in the AWS Management Console, it can be set to point to
where data is stored in Amazon S3 with a few clicks. SQL can then be used to run
ad-hoc queries, bringing results to the user in seconds.
•It does not store data. Instead, storage is managed entirely on Amazon S3. The Athena
query service is fully managed, so resources are automatically allocated by AWS as needed
to execute a query.
•Because your data is stored in an S3 bucket and the schema is defined in the Glue Data
Catalog, you can switch between query engines that can read from these sources without
redefining the schema or creating a separate copy of the data.
•As a serverless service, Amazon Athena makes data queries
easy to set up and fast to run. In fact, the pay-per-use model of Athena makes
running analytics affordable. Moreover, since Athena works with
Amazon S3 and comes with great scalability, reliability, and durability, it is
well suited to running analytics workloads.