Log management isn’t easy to do at scale. We designed Loggly Gen2 using the latest social-media-scale technologies—including ElasticSearch, Kafka from LinkedIn, and Apache Storm—as the backbone of ingestion processing for our multi-tenant, geo-distributed, and real-time log management system.
Since we launched Gen2, we’ve learned a lot more about these technologies. We regularly contribute back to the open source community, so we decided that it’s time to give an update on our experience with Storm and explain why we have dropped it from our platform, at least for now.
Read full blog post here: http://bit.ly/ScaleApacheStorm
What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope
1. What We Learned About Scaling with Apache Storm
Manoj Chaudhary
CTO & VP of Engineering
August 2014
| Log management as a service Simplify Log Management
2. What Loggly Does
We’re the world’s most popular cloud-based
log management service
§ More than 5,000 customers
§ Near real-time indexing of events
Distributed architecture, built on AWS
Initial production services in 2011
§ Loggly Generation 2 released in Sept 2013
3. Agenda for this Presentation
§ The unique challenges of log management
§ Overview of the Loggly event pipeline
§ Use of open source technologies
§ Lessons we have learned
§ Why we removed Storm
§ Conclusions: the Storm 411
4. How Log Management Starts
Everyone starts with …
§ A bunch of log files (syslog, application specific)
§ On a bunch of machines
Management consists of doing the
simple stuff:
§ Rotate files, compress and delete
§ Information is there, but specific events are awkward to find
§ Log retention policies evolve over time
5. As Log Data Grows
Self-inflicted pain rises with log volume:
“…hmmm, our logs are getting a bit bloated”
“…let’s spend time managing our log capacity”
“…how can I make this someone else’s problem!”
6. Loggly Makes Log Management Much Easier
Use existing logging infrastructure
§ Real time syslog forwarding is built in
§ Application log file watching
Store logs in the cloud
§ Accessible when there is a system failure
§ Cost-effective data retention
Log messages in machine parsable format
§ JSON encoding when logging structured information
§ Key-value pairs
7. Loggly’s Evolution
Gen1
• 2011-2013
• AWS EC2 deployment
• SOLR Cloud
• ZeroMQ for message queue
Gen2
• Launched September 2013
• AWS deployment
• Utilized ElasticSearch, Kafka, Storm
Incremental Improvements and Scale
8. The Challenges of Log Management at Scale
§ Big data
§ >750 billion events logged to date
§ Sustained bursts of 100,000+ events per second
§ Data space measured in petabytes
§ Need for high fault tolerance
§ Near real-time indexing requirements
§ Time-series index management
9. About Apache Storm
Open sourced by Twitter in September 2011
§ Now an Apache Software Foundation project
§ Currently in Incubator status
Framework is for stream processing
§ Distributed
§ Fault tolerant
§ Computation
§ Fail-fast components
10. Storm Logical View
Example Topology (diagram): one spout feeding bolts, some in sequence and some in parallel
§ Spouts emit the source stream
§ Bolts perform stream processing
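The spout and bolt roles can be illustrated with a small plain-Python toy. This is only a sketch of the data flow, not Storm's API: the function names (`spout`, `parse_bolt`, `count_bolt`, `error_bolt`, `run_topology`) and the "customer message" line format are assumptions made up for the example.

```python
from collections import defaultdict


def spout(lines):
    """Emit the source stream, one tuple per raw log line."""
    for line in lines:
        yield line


def parse_bolt(line):
    """First bolt: split a raw line into (customer, message)."""
    customer, message = line.split(" ", 1)
    return customer, message


def count_bolt(counts, event):
    """Downstream bolt: keep a running event count per customer."""
    customer, _ = event
    counts[customer] += 1


def error_bolt(errors, event):
    """Parallel downstream bolt: collect events that mention ERROR."""
    if "ERROR" in event[1]:
        errors.append(event)


def run_topology(lines):
    counts, errors = defaultdict(int), []
    for raw in spout(lines):
        event = parse_bolt(raw)      # sequential hop: spout -> parse bolt
        count_bolt(counts, event)    # fan-out: the same tuple is handed to
        error_bolt(errors, event)    # two bolts that work independently
    return dict(counts), errors


if __name__ == "__main__":
    sample = ["acme ERROR disk full", "acme INFO ok", "zeta INFO started"]
    print(run_topology(sample))
```

In a real cluster each bolt would run as many tasks spread across workers; the toy runs everything in one loop purely to show which component does what.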
11. Storm Physical View
§ Nimbus: master daemon; distributes code, assigns tasks, monitors failures
§ ZooKeeper: stores operational cluster state
§ Supervisor: daemon listening for work assigned to its node
§ Worker Node: runs a Supervisor and its Worker processes
§ Worker Process: Java process executing a subset of a topology
§ Executor: Java thread spawned by a Worker, runs tasks of the same component
§ Task: component (spout / bolt) instance, performs the actual data processing
12. Log Ingestion and Processing Overview
[Diagram labels: Load Balancing, Kafka Stage 2, Storm Event Processing]
13. Event Pipeline in Summary
§ Storm provides Complex Event Processing
§ Where we run much of our secret sauce
§ Stage 1 contains the raw Events
§ Stage 2 contains processed Events
§ Snapshot the last day of Stage 2 events to S3
14. What Attracted Us to Storm
§ The spout-and-bolt model fit our network approach, where logs could move from bolt to bolt sequentially or be consumed by several bolts in parallel
§ Guaranteed processing of the data stream
§ Allowed us to focus on writing the best possible code for different bolts
§ Dynamic deployment makes it easy to add or remove nodes to adjust for actual loads and requirements
§ Log data has peaks and valleys
15. Loggly Gen2 at Launch: Where Storm Fits In
[Diagram labels: Kafka Stage 1, S3 Bucket, Identify Customer, Summary Statistics, Kafka Stage 2]
16. What We Learned
17. Guaranteed Delivery Causes Big Performance Hit
Guaranteed delivery feature needed for log management resilience but…
Example Topology (diagram): the same spout and bolts, now with an ack sent at every bolt
§ Spouts emit the source stream
§ Bolts perform stream processing
2.5x hit to performance!!
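To give a feel for where the per-tuple cost comes from, here is a simplified plain-Python model of XOR-based ack tracking, the scheme Storm's documentation describes for following a tuple tree. It is a sketch under assumptions: `AckTracker` and `process` are hypothetical names, the bolts form a simple chain, and the message counts are illustrative bookkeeping, not measurements of Storm.

```python
import random


class AckTracker:
    """Toy acker: one XOR checksum per spout tuple (the "root")."""

    def __init__(self):
        self.pending = {}    # root id -> XOR of tuple ids still in flight
        self.messages = 0    # bookkeeping messages the tracker received

    def emit(self, root, tuple_id):
        # A new tuple anchored to `root` enters the tree.
        self.pending[root] = self.pending.get(root, 0) ^ tuple_id
        self.messages += 1

    def ack(self, root, tuple_id):
        # Acking XORs the same id in again, which cancels it out.
        self.pending[root] ^= tuple_id
        self.messages += 1
        if self.pending[root] == 0:
            del self.pending[root]
            return True      # every tuple in the tree has been acked
        return False


def process(tracker, root, hops):
    """Push one event through `hops` bolts in sequence.

    Returns True once the whole chain has been acked.
    """
    current = random.getrandbits(64) or 1    # non-zero random tuple id
    tracker.emit(root, current)
    done = False
    for hop in range(hops):
        is_last = hop == hops - 1
        if not is_last:
            child = random.getrandbits(64) or 1
            tracker.emit(root, child)        # bolt emits an anchored tuple
        done = tracker.ack(root, current)    # then acks its input
        if not is_last:
            current = child
    return done


if __name__ == "__main__":
    tracker = AckTracker()
    print(process(tracker, root="event-1", hops=3), tracker.messages)
```

In this model a single log event crossing three bolts generates six tracking messages, and that work repeats for every event. Real Storm combines some of these updates, so treat the numbers as a way to see the shape of the overhead rather than its size.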
18. Our Performance Testing
Preload Kafka broker
• 50 GB of raw log data from production cluster
Deploy Storm topology with Kafka spout
• Kafka partitions with 8 spouts and 20 mapper bolts
• 4K provisioned IOPS backend AWS instance
Ack’ing per tuple turned off
• TOPOLOGY_ACKERS set to 0
• Kafka disks red hot
Ack’ing per tuple enabled
• Kafka disks not saturated
• Bolts not running on high capacity
[Bar chart: average events per second processed per cluster (axis from 0 to 250,000), without guaranteed delivery vs. with guaranteed delivery]
19. Potential Workaround: Batch Logs
§ Ack a set of logs instead of individual events
§ PROBLEM: not consistent with Storm’s semantics of a “message”
It is not trivial to change the Kafka spout, as well as each bolt, to reinterpret a single message as a bunch of logs.
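The batching idea can be sketched in a few lines of plain Python. This is an illustration only, not Loggly's or Storm's code: `batches` and `acks_needed` are hypothetical helpers, and the model assumes every hop sends exactly one ack per message.

```python
from itertools import islice


def batches(events, size):
    """Group an event stream into lists of at most `size` events."""
    it = iter(events)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def acks_needed(num_events, batch_size, hops):
    """Acks sent when each hop acks once per message (one message = one batch)."""
    num_messages = -(-num_events // batch_size)   # ceiling division
    return num_messages * hops


if __name__ == "__main__":
    print(list(batches(range(5), 2)))
    print(acks_needed(1000, 1, 3), acks_needed(1000, 100, 3))
```

Under these assumptions, batching 100 events per message cuts the ack count by a factor of 100. The trade-off is that every component then has to unpack a message into many logs, and a failed batch would be replayed as a whole.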
20. Ultimate Solution: Build Custom Queue for Module-to-Module Communication
[Diagram labels: Load Balancing, Kafka Stage 2, Loggly Custom Module]
21. Benefits of New Approach
§ High-performance, reliable communication that implements our workflow
§ Supports sustained rates of 100K+ events per second
§ Relatively easy to port
22. Conclusions
Storm 0.8.2 has plenty of potential
But…
Log management’s unique challenges drive the need for a custom framework
23. Log Management is Our Full-Time Job. It Shouldn’t Be Yours.
Try Loggly for Free! → http://bit.ly/ScaleApacheStorm
Unless You Want it to Be (Join us!)
Check out our careers page to see if there’s a great match for your skills: loggly.com/careers
About Us:
Loggly is the world’s most popular cloud-based log management solution, used by
more than 5,000 happy customers to effortlessly spot problems in real-time, easily
pinpoint root causes and resolve issues faster to ensure application success.
Visit us at loggly.com or follow @loggly on Twitter.