Everything generates logs. Applications, infrastructure, security ... everything. Keeping up with the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data, and we configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad hoc reports.
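As a sketch of the Firehose transformation step mentioned above: a data-transformation Lambda receives base64-encoded records and returns them transformed, marked Ok, Dropped, or ProcessingFailed. The handler below wraps each raw log line in a JSON envelope; the envelope field name is illustrative.

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda sketch: decode each record,
    wrap the raw log line in a small JSON envelope, and re-encode."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        # "message" is an illustrative field name, not a Firehose requirement
        transformed = json.dumps({"message": raw.strip()}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Firehose buffers the returned records and delivers them onward (to Amazon Elasticsearch Service in this session's pipeline).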
2. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL
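To make the "standard SQL on S3" point concrete, the dict below holds the arguments one would pass to boto3's `start_query_execution` on an Athena client; the database, table, and results bucket are illustrative placeholders.

```python
# Arguments for boto3: athena.start_query_execution(**query_params)
# (database, table, and bucket names below are placeholders).
query_params = {
    "QueryString": (
        "SELECT status, COUNT(*) AS requests "
        "FROM access_logs "
        "WHERE year = '2017' "
        "GROUP BY status"
    ),
    "QueryExecutionContext": {"Database": "weblogs"},
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}
```

Athena runs the query against the files in S3 and writes the result set to the configured output location.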
3. Value Proposition
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM for authentication; Encryption at rest & in transit
• Standards-compliant and open storage formats
• Built on powerful community supported OSS solutions
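The "pay only for data scanned" point can be sketched with a small cost estimator. The per-terabyte price below is an assumed list price for illustration, not a quoted figure; columnar formats such as Parquet reduce the bytes scanned and therefore the cost.

```python
def athena_scan_cost(bytes_scanned, price_per_tb=5.00):
    """Estimate query cost from bytes scanned.
    price_per_tb is an assumed list price; check current AWS pricing."""
    return (bytes_scanned / 10**12) * price_per_tb

# Scanning 2 TB of raw CSV vs. reading ~10% of it via a columnar format
full_scan = athena_scan_cost(2 * 10**12)
column_scan = athena_scan_cost(0.2 * 10**12)
```

Because you pay per scan rather than for idle clusters, compression, partitioning, and columnar formats translate directly into lower query cost.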
4. Familiar Technologies Under the Covers
Presto – used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Hive – used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
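The DDL side is Hive-style. The statement below (held as a Python string) shows the shape of a partitioned external table of the kind Athena accepts; the table name, columns, and S3 location are illustrative.

```python
# Hive-style DDL of the kind Athena accepts for table creation
# (table name, columns, and S3 location are illustrative).
create_table_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
    request_time STRING,
    status INT,
    bytes_sent BIGINT
)
PARTITIONED BY (year STRING, month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-log-bucket/access/'
"""
```

Partitioning by columns such as year and month lets the engine prune whole S3 prefixes at query time, scanning (and charging for) far less data.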
5. Glue automates the undifferentiated heavy lifting of ETL
Cataloging data sources
Identifying data formats and data types
Generating Extract, Transform, Load code
Executing ETL jobs; managing dependencies
Secured by IAM policies
Handling errors
Managing and scaling resources
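Glue generates PySpark code for these steps; as a minimal, locally runnable analogue (not Glue's actual generated code), the sketch below mirrors the extract, transform, and load stages, including the error handling the list mentions.

```python
import csv
import io
import json

def extract(csv_text):
    """Extract: parse raw CSV rows into dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: cast types, dropping malformed records
    rather than failing the whole job (error handling)."""
    out = []
    for row in rows:
        try:
            out.append({"user": row["user"], "bytes": int(row["bytes"])})
        except (KeyError, ValueError):
            continue  # skip bad records
    return out

def load(rows):
    """Load: serialize to newline-delimited JSON, as for an S3 sink."""
    return "\n".join(json.dumps(r) for r in rows)
```

In a real Glue job the same three stages run as distributed Spark transformations, with Glue provisioning and scaling the underlying resources.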
6. Glue Components
Data Catalog
Hive metastore-compatible metadata repository of data sources.
Crawls data sources to infer tables, data types, and partition formats.
Job Execution
Runs jobs in Spark containers – automatic scaling based on SLA.
Glue is serverless – only pay for the resources you consume.
Job Authoring
Generates Python code to move data from source to destination.
Edit with your favorite IDE; share code snippets using Git.
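These components are driven programmatically through boto3's Glue client. The dicts below hold the arguments for `create_crawler` (to populate the Data Catalog) and `start_job_run` (Job Execution); all names, roles, and paths are placeholders.

```python
# Arguments for boto3: glue.create_crawler(**crawler_params)
# (crawler name, IAM role, database, and S3 path are placeholders).
crawler_params = {
    "Name": "weblogs-crawler",
    "Role": "AWSGlueServiceRole-demo",
    "DatabaseName": "weblogs",
    "Targets": {"S3Targets": [{"Path": "s3://my-log-bucket/access/"}]},
}

# Arguments for boto3: glue.start_job_run(**job_run_params)
# (job name is a placeholder).
job_run_params = {"JobName": "weblogs-etl"}
```

Running the crawler fills the Data Catalog with table definitions inferred from S3; the job run then executes the authored script in managed Spark containers.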