This document discusses strategies for logging at scale. It notes that logging presents challenges around temporary storage, data capture, permanent storage, and visualization. The document recommends starting with SQL databases and using NoSQL stores like Elasticsearch for very large datasets or fast data ingest. It presents Amazon Kinesis, Kinesis Firehose, and Amazon Elasticsearch Service as tools to help with data capture, transport, and search. Visualization can be done with Kibana or by loading data into Redshift for use with existing BI tools. The key lessons are to reuse existing technologies when possible and to choose the right tool for each part of the logging pipeline.
7. Stealing Content…
‘Your First 10m Users’
ARC301 – re:Invent 2015
http://bitly.com/2015arc301
Joel Williams, AWS Solutions Architect
8. >1 User
• Amazon Route 53 for DNS
• A single Elastic IP
• A single Amazon EC2 instance
• With full stack on this host
• Web app
• Database
• Management
• And so on…
[Diagram: User → Amazon Route 53 → Elastic IP → Amazon EC2 instance]
9. >1 User
• A single place to read logs from
[Diagram: User → Amazon Route 53 → Elastic IP → Amazon EC2 instance]
15. Users >1000
[Diagram: User → Amazon Route 53 → Elastic Load Balancer → Web Instances in two Availability Zones → RDS DB Instance, Active/Standby (Multi-AZ)]
32. Why start with SQL?
• Established and well-worn technology.
• Lots of existing code, communities, books, and tools.
• You aren’t going to break SQL DBs in your first 10 million users. No, really, you won’t.*
• Clear patterns for scalability (especially in analytics)
*Unless you are doing something SUPER peculiar with the data or you have MASSIVE amounts of it… but even then SQL will have a place in your stack.
34. Why might you need NoSQL?
• Super low-latency applications
• Metadata-driven datasets
• Highly nonrelational data
• Need schema-less data constructs*
• Massive amounts of data (again, in the TB range)
• Rapid ingest of data (thousands of records/sec)
*Need != “it’s easier to do dev without schemas”
37. Three Problems of Persistence
• Somewhere to stage
• Somewhere to live
• Somewhere to search
38. Log Dispatcher Architecture Revisited
[Diagram: App Servers → Kinesis Firehose → Log Index (Elasticsearch) → Visualisation; Firehose also delivers JSON to Amazon S3]
39. Amazon S3
• Simple Storage Service
• Canonical logging target for ELB, CloudFront, etc.
• Virtually unlimited amounts of storage
• Support for Lambda operations
• Very fast – ideal for feeding other services (Redshift, EMR/Hadoop)
• Data can be automatically pushed here from Amazon Kinesis Firehose
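The “support for Lambda operations” bullet means S3 can invoke a Lambda function whenever a log object lands. A minimal sketch of such a handler, assuming the standard S3 event-notification shape; the bucket and key names are hypothetical:

```python
# AWS Lambda handler for S3 object-created events: extract the location of
# each newly written log object. Keys arrive URL-encoded in the event.
from urllib.parse import unquote_plus

def handler(event, context=None):
    """Return (bucket, key) pairs for every S3 record in the event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # decode '+' and %xx escapes
        objects.append((bucket, key))
    return objects

# Example invocation with a hand-built event (no AWS needed):
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-log-bucket"},
                "object": {"key": "logs/2015/app+server.json"}}}
    ]
}
print(handler(fake_event))  # [('my-log-bucket', 'logs/2015/app server.json')]
```

From here the function could fan the object out to Firehose, Elasticsearch, or a Redshift staging prefix; the handler itself stays a thin, testable shim over the event format.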
40. Three Problems of Persistence
• Somewhere to stage
• Somewhere to live (long tail)
• Somewhere to search
41. Redshift
• PostgreSQL-based MPP database
• Petabyte-scale data warehousing
• Choice of nodes:
• Dense compute
• Dense storage
• Already compatible with your existing BI tools
[Diagram: Amazon Redshift cluster of dense compute / dense storage nodes – up to 128 nodes at 2 PB, ~256 PB/cluster]
42. Three Problems of Persistence
• Somewhere to stage
• Somewhere to live
• Somewhere to search (streaming data)
43. Amazon Elasticsearch Service
• Elasticsearch
• Popular / open source
• Commonly used for log and clickstream data
• Managed solution
• We prepackage Kibana
• Integrated with IAM, Firehose, etc.
[Diagram: Amazon Kinesis Firehose → Amazon Elasticsearch Service]
50. Logging Architecture
[Diagram: App Servers → Log Aggregators (Kafka/Kinesis/MQ) → Log Index/Persist (Elasticsearch, etc.) → Visualisation]
51. Logging Architecture
[Diagram: App Servers → Log Aggregators (Kafka/Kinesis/MQ) → Elasticsearch → Visualisation]
52. Amazon Kinesis
• Firstly, a massively scalable, low-cost way to send JSON objects to a ‘stream’ hosted by AWS
• Users can write applications (using the KCL) to take data from the stream and parse/evaluate it
• Apps can be written in Java, Lambda (Node.js, Python, Java), etc.
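The “take data from the stream and parse/evaluate” step can be sketched without any AWS dependency. A minimal, hedged example, assuming the consumer (a KCL record processor or a Lambda function) hands you a list of record payloads as JSON bytes; the log fields and level names are hypothetical:

```python
# Toy "evaluate" step for a stream consumer: parse each JSON record and
# count log levels. In a real KCL app or Lambda consumer, `payloads` would
# come from the stream (base64-decoded record data).
import json

def evaluate_records(payloads):
    """Parse each JSON payload and tally log levels."""
    counts = {}
    for raw in payloads:
        event = json.loads(raw)
        level = event.get("level", "UNKNOWN")
        counts[level] = counts.get(level, 0) + 1
    return counts

payloads = [
    b'{"level": "INFO", "msg": "user signed up"}',
    b'{"level": "ERROR", "msg": "db timeout"}',
    b'{"level": "INFO", "msg": "page view"}',
]
print(evaluate_records(payloads))  # {'INFO': 2, 'ERROR': 1}
```

The same function works unchanged whether the payloads arrive from Kinesis, Kafka, or a test fixture, which keeps the evaluation logic easy to unit-test.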
53. Amazon Kinesis: New Features (re:Invent 2015)
Kinesis Streams
• What was previously “Kinesis”
• Still very customisable, for innovative stream workloads
• Users still write an app to parse data from the stream
Kinesis Firehose
• Fully managed data ingest service
• Provision end point
• Send data to end point
• ???
• Data!
• Outputs to S3, Redshift, or Elasticsearch Service
• (And can do two at once)
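In practice the “send data to end point” step happens in batches. A sketch of client-side batching, assuming Firehose’s `PutRecordBatch` limits of 500 records and 4 MiB per call (check current quotas; the limits are parameters here, not hard-coded facts):

```python
# Group raw record payloads into batches that respect per-call limits,
# as you would before each PutRecordBatch call to Firehose.
def batch_records(records, max_count=500, max_bytes=4 * 1024 * 1024):
    batches, current, size = [], [], 0
    for rec in records:
        # Flush the current batch if adding this record would exceed a limit.
        if current and (len(current) >= max_count or size + len(rec) > max_bytes):
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches

# 1200 small records split cleanly on the 500-record limit:
recs = [b"x" * 100] * 1200
print([len(b) for b in batch_records(recs)])  # [500, 500, 200]
```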
54. Amazon Kinesis: New Features (Apr 2016)
Amazon Kinesis Agent
• Standalone Java application from AWS
• Collect and send logs to Kinesis Firehose
• Built-in:
• File rotation
• Failure retries
• Checkpoints
• Integrated with CloudWatch for alerting
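The agent is driven by a small JSON config file (`agent.json`). A hedged sketch of generating one: the `flows` / `filePattern` / `deliveryStream` keys follow the agent’s documented format, but the log path and stream name are hypothetical placeholders:

```python
# Generate an agent.json for the Amazon Kinesis Agent. The file pattern and
# delivery stream name below are examples only - adjust for your environment.
import json

config = {
    "cloudwatch.emitMetrics": True,  # ties into the CloudWatch alerting bullet
    "flows": [
        {
            "filePattern": "/var/log/app/*.log",   # files the agent tails
            "deliveryStream": "my-log-firehose",   # target Firehose delivery stream
        }
    ],
}
agent_json = json.dumps(config, indent=2)
print(agent_json)
```

The agent handles the rotation, retry, and checkpoint bullets above on its own once pointed at the files; the config only declares what to tail and where to send it.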
60. Kibana
• Prepackaged with Amazon Elasticsearch Service
• Easy to manage with freeform data
• Dashboards!
61. Your existing BI tools
• As before – your data exists on S3 (JSON)
• S3 → Redshift:
• Commission a Redshift cluster with IAM roles
• Write a manifest of the files to load (JSON)
• Issue a load (COPY)
• Redshift is PgSQL-compatible
• Drivers exist for many tools
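The steps above can be sketched end to end: build a manifest of the JSON log files, then issue a COPY that references it. The manifest shape (`entries` with `url`/`mandatory`) matches Redshift’s documented format, but the bucket, table, and IAM role names here are hypothetical:

```python
# Build a Redshift load manifest for JSON log files on S3, and the COPY
# statement that would consume it. All S3 paths and ARNs are placeholders.
import json

files = [
    "s3://my-log-bucket/logs/2015/11/part-0000.json",
    "s3://my-log-bucket/logs/2015/11/part-0001.json",
]
manifest = {"entries": [{"url": url, "mandatory": True} for url in files]}

# Upload the manifest to e.g. s3://my-log-bucket/manifests/load.manifest,
# then run a COPY like this from any PgSQL-compatible client:
copy_sql = """
COPY logs
FROM 's3://my-log-bucket/manifests/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS JSON 'auto'
MANIFEST;
"""
print(json.dumps(manifest, indent=2))
```

Because Redshift speaks the PostgreSQL wire protocol, the COPY can be issued through the same drivers your BI tools already use.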
64. Recap / Lessons / Next
• Logging is really hard.
• Use tools like Kinesis Firehose, the Kinesis Agent, and Amazon Elasticsearch Service to make it easier
• Reuse data, tools, and people where possible