What We Learned About Scaling with Apache Storm:
Pushing the Performance Envelope
Manoj Chaudhary 
CTO & VP of Engineering 
August 2014 
| Log management as a service Simplify Log Management
What Loggly Does 
We’re the world’s most popular cloud-based log management service
§ More than 5,000 customers 
§ Near real-time indexing of events 
Distributed architecture, built on AWS 
Initial production services in 2011 
§ Loggly Generation 2 released in Sept 2013 
Agenda for this Presentation 
§ The unique challenges of log management 
§ Overview of the Loggly event pipeline 
§ Use of open source technologies 
§ Lessons we have learned 
§ Why we removed Storm 
§ Conclusions: the Storm 411 
How Log Management Starts 
Everyone starts with … 
§ A bunch of log files (syslog, application specific) 
§ On a bunch of machines 
Management consists of doing the simple stuff:
§ Rotate files, compress and delete
§ The information is there, but specific events are awkward to find
§ Log retention policies evolve over time 
As Log Data Grows
[Chart: Self-Inflicted Pain plotted against Log Volume, annotated with:]
“…how can I make this someone else’s problem!”
“…let’s spend time managing our log capacity”
“…hmmm, our logs are getting a bit bloated”
Loggly Makes Log Management Much Easier 
Use existing logging infrastructure 
§ Real-time syslog forwarding is built in
§ Application log file watching 
Store logs in the cloud 
§ Accessible when there is a system failure 
§ Cost-effective data retention 
Log messages in machine-parsable format
§ JSON encoding when logging structured information
§ Key-value pairs 
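As an illustration of the machine-parsable format above, here is a minimal sketch of JSON-encoded logging with Python's standard `logging` module. The names `JsonFormatter`, `make_json_logger`, and the `fields` key are hypothetical, chosen for this example; they are not part of any Loggly client.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge key-value pairs passed as extra={"fields": {...}}
        # ("fields" is a convention made up for this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_json_logger(name, stream):
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_json_logger("checkout", sys.stdout)
    log.info("order placed", extra={"fields": {"order_id": 1234, "status": "paid"}})
```

Each event becomes one self-describing line, so a downstream indexer can pick out `order_id` or `status` without a custom parser.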
Loggly’s Evolution
Gen1
• 2011-2013
• AWS EC2 deployment
• SolrCloud
• ZeroMQ for message queue
Gen2
• Launched September 2013
• AWS deployment
• Utilized Elasticsearch, Kafka, Storm
Incremental improvements and scale
The Challenges of Log Management at Scale
§ Big data
§ >750 billion events logged to date
§ Sustained bursts of 100,000+ events per second
§ Data space measured in petabytes
§ Need for high fault tolerance
§ Near real-time indexing requirements
§ Time-series index management
About Apache Storm 
Open sourced by Twitter in September 2011 
§ Now an Apache Software Foundation project 
§ Currently in Incubator status
Framework is for stream processing 
§ Distributed 
§ Fault tolerant 
§ Computation 
§ Fail-fast components 
Storm Logical View
Example Topology
[Diagram: one spout connected to four bolts]
Spouts emit the source stream; bolts perform stream processing.
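To make the spout/bolt vocabulary concrete, here is a toy model in plain Python. It does not use the Storm API; the function names and the "customer level message" log format are assumptions made up for this sketch. One spout feeds a parsing bolt, whose output is then consumed by two further bolts.

```python
from collections import Counter


def log_spout(lines):
    """Spout: emits the source stream, one tuple per raw log line."""
    for line in lines:
        yield line


def parse_bolt(stream):
    """Bolt: turns 'customer level message' lines into dicts."""
    for line in stream:
        customer, level, message = line.split(" ", 2)
        yield {"customer": customer, "level": level, "message": message}


def count_by_customer_bolt(events):
    """Bolt: counts events per customer."""
    return Counter(event["customer"] for event in events)


def count_by_level_bolt(events):
    """Bolt: counts events per log level."""
    return Counter(event["level"] for event in events)


def run_topology(lines):
    """Wire spout -> parse bolt -> two bolts that consume the same stream."""
    parsed = list(parse_bolt(log_spout(lines)))
    return count_by_customer_bolt(parsed), count_by_level_bolt(parsed)


if __name__ == "__main__":
    per_customer, per_level = run_topology([
        "acme ERROR disk full",
        "acme INFO backup finished",
        "globex INFO service started",
    ])
    print(per_customer, per_level)
```

In a real Storm cluster each component runs as many parallel tasks spread across machines; the sketch only shows the data flow.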
Storm Physical View
[Diagram: Nimbus, a ZooKeeper ensemble, and Supervisors that each manage Workers on a worker node; inside a Worker Process, Executors run Tasks]
§ Nimbus: master daemon. Distributes code, assigns tasks, monitors failures.
§ ZooKeeper: stores operational cluster state.
§ Supervisor: daemon listening for work assigned to its node.
§ Worker: Java process executing a subset of a topology.
§ Executor: Java thread spawned by a Worker; runs tasks of the same component.
§ Task: component (spout / bolt) instance; performs the actual data processing.
Log Ingestion and Processing Overview
[Diagram labels: Load Balancing; Kafka Stage 2; Storm Event Processing]
Event Pipeline in Summary 
§ Storm provides Complex Event Processing 
§ Where we run much of our secret-sauce 
§ Stage 1 contains the raw Events 
§ Stage 2 contains processed Events 
§ Snapshot the last day of Stage 2 events to S3 
What Attracted Us to Storm
§ The spout-and-bolt model fit our network approach, where logs could move from bolt to bolt sequentially or be consumed by several bolts in parallel
§ Guaranteed processing of the data stream
§ Allowed us to focus on writing the best possible code for different bolts
§ Dynamic deployment makes it easy to add or remove nodes to adjust for actual loads and requirements
§ Log data has peaks and valleys
Loggly Gen2 at Launch: Where Storm Fits In 
[Diagram labels: Kafka Stage 1; S3 Bucket; Identify Customer; Summary Statistics; Kafka Stage 2]
What We Learned 
Guaranteed Delivery Causes Big Performance Hit
The guaranteed delivery feature is needed for log management resilience, but…
[Diagram: the example topology (one spout, four bolts) with an ack drawn on each connection]
2.5x hit to performance!!
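For a feel of where the per-tuple cost comes from, here is a toy model of the XOR-based ack tracking that Storm's documentation describes for guaranteed processing. It is not Storm's code: the `AckTracker` class and its method names are hypothetical, and real tuple IDs are random 64-bit values rather than the small integers used here.

```python
class AckTracker:
    """Tracks one running XOR value ("ack val") per spout tuple.

    Every tuple ID is XORed in once when the tuple is created and once
    when it is acked, so the value returns to 0 exactly when every tuple
    in the tree has been acked.
    """

    def __init__(self):
        self._ack_val = {}      # root id -> running XOR of tuple ids
        self.messages = 0       # tracking messages seen (the overhead)

    def spout_emit(self, root_id, tuple_id):
        """Register the first tuple of a new tuple tree."""
        self.messages += 1
        self._ack_val[root_id] = tuple_id

    def bolt_ack(self, root_id, input_id, child_ids=()):
        """Ack `input_id` and anchor any child tuples emitted from it.

        Returns True once the whole tree rooted at `root_id` is processed.
        """
        self.messages += 1
        value = self._ack_val[root_id] ^ input_id
        for child in child_ids:
            value ^= child
        if value == 0:
            del self._ack_val[root_id]
            return True
        self._ack_val[root_id] = value
        return False

    def pending(self):
        """Number of spout tuples whose trees are still in flight."""
        return len(self._ack_val)


if __name__ == "__main__":
    tracker = AckTracker()
    tracker.spout_emit("log-1", 0b0101)
    tracker.bolt_ack("log-1", 0b0101, child_ids=[0b0011, 0b1000])
    tracker.bolt_ack("log-1", 0b0011)
    print(tracker.bolt_ack("log-1", 0b1000), tracker.messages)
```

The bookkeeping is cheap per message, but every tuple at every hop generates an extra tracking message, which is one plausible source of the overhead measured here.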
Our Performance Testing
Preload Kafka broker
• 50 GB of raw log data from production cluster
Deploy Storm topology with Kafka spout
• Kafka partitions with 8 spouts and 20 mapper bolts
• AWS instance backed by 4K provisioned IOPS
Ack’ing per tuple turned off
• TOPOLOGY_ACKERS set to 0
• Kafka disks red hot
Ack’ing per tuple enabled
• Kafka disks not saturated
• Bolts not running on high capacity
[Bar chart: average events per second processed per cluster, without vs. with guaranteed delivery; vertical axis from 0 to 250,000]
Potential Workaround: Batch Logs
§ Ack a set of logs instead of individual events
§ PROBLEM: not consistent with Storm’s semantics of a “message”
It is not trivial to change the Kafka spout as well as each bolt to reinterpret a single message as a bunch of logs.
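The batching idea can be sketched in plain Python, outside Storm. The helper names `batch` and `process_with_batch_acks` are hypothetical; the sketch only shows how acknowledging once per batch reduces the number of acks, not how a Kafka spout or bolt would have to change.

```python
from itertools import islice


def batch(events, size):
    """Group an event stream into lists of at most `size` events."""
    iterator = iter(events)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_with_batch_acks(events, size, handle):
    """Handle every event, but acknowledge once per batch.

    Returns the number of acks issued.
    """
    acks = 0
    for chunk in batch(events, size):
        for event in chunk:
            handle(event)
        acks += 1  # one ack covers the whole chunk
    return acks


if __name__ == "__main__":
    seen = []
    print(process_with_batch_acks(range(10), 4, seen.append))
```

The trade-off is replay granularity: if any event in a batch fails, the whole batch has to be treated as failed and reprocessed.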
Ultimate Solution: Build Custom Queue for Module-to-Module Communication
[Diagram labels: Load Balancing; Kafka Stage 2; Loggly Custom Module]
Benefits of New Approach
§ High-performance, reliable communication that implements our workflow
§ Supports sustained rates of 100K+ events per second
§ Relatively easy to port
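The slides do not describe how the custom queue is built, so the following is only a generic sketch of module-to-module communication over a bounded in-process queue, using Python's standard `queue` and `threading` modules. All names (`producer`, `consumer`, `run_pipeline`, `capacity`) are assumptions for this example and say nothing about Loggly's actual design.

```python
import queue
import threading

_DONE = object()  # sentinel that tells the consumer to stop


def producer(events, out_queue):
    """Upstream module: pushes events, blocking while the queue is full."""
    for event in events:
        out_queue.put(event)  # blocks when full, which gives backpressure
    out_queue.put(_DONE)


def consumer(in_queue, results, transform):
    """Downstream module: pulls events until the sentinel arrives."""
    while True:
        item = in_queue.get()
        if item is _DONE:
            break
        results.append(transform(item))


def run_pipeline(events, transform, capacity=1000):
    """Connect one producer and one consumer through a bounded queue."""
    link = queue.Queue(maxsize=capacity)
    results = []
    workers = [
        threading.Thread(target=producer, args=(events, link)),
        threading.Thread(target=consumer, args=(link, results, transform)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a", "b", "c"], str.upper, capacity=2))
```

A bounded queue keeps a fast upstream module from overrunning a slow downstream one without acknowledging each event individually; a production version would also need persistence and failure handling, which this sketch leaves out.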
Conclusions
Storm 0.8.2 has plenty of potential
But…
Log management’s unique challenges drive the need for a custom framework
Log Management is Our Full-Time Job. It Shouldn’t Be Yours.
Try Loggly for Free! → http://bit.ly/ScaleApacheStorm
Unless You Want it to Be (Join us!)
Check out our career page to see if there’s a great match for your skills: loggly.com/careers
About Us:
Loggly is the world’s most popular cloud-based log management solution, used by more than 5,000 happy customers to effortlessly spot problems in real time, easily pinpoint root causes, and resolve issues faster to ensure application success.
Visit us at loggly.com or follow @loggly on Twitter.
