At BlackBerry we had a complex problem: several dozen services, each distinct in its instrumentation and log formats, with wildly different needs in scale and analysis. The biggest single problem we faced was how to feed that data into Hadoop, and how to manage it once it was there. In this session we will review the use cases that led to the creation of LogDriver, our toolkit for loading, analyzing and managing logs in Hadoop.
Internal Use Only
Tackle
>350TB per day (two years ago)
1. Segmented across NAS devices and services
2. 40+ services across tens of thousands of servers
3. Geographically distributed
4. Ad-hoc searching and reporting took days
5. ETL pipelines were complex and fragile
Confidential and Proprietary
Big needy data
1. Significantly reduce storage costs
2. Improve access times for searches
3. Provide an ad-hoc access system
4. Provide a secure, multitenant platform
5. Grow with us without major rearchitecture
6. Deploy with low impact on production services
LogDriver
1. Our toolkit for loading, maintaining and searching log data in Hadoop.
Includes:
- A generic Avro format for log content (Boom files)
- A high-performance Flume replacement, "Sawmill"
- Data lifecycle management tools
- Log search and access tools
The Boom File Format
1. Supports unknown, generic log types, as long as they conform to basic RFC date formats.
2. Provides mechanisms to reconstruct original order, though it does not require order on disk or during MR processing.
3. Millisecond precision.
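The ordering guarantee above can be made concrete with a small sketch. Assume (hypothetically; the real Boom/Avro schema differs) that each log line carries a millisecond timestamp plus a per-stream sequence number; original write order is then recoverable no matter how records land on disk or arrive at a reducer:

```python
# Hypothetical record shape, for illustration only: each log line carries
# a millisecond timestamp and a sequence number assigned at write time.
def reconstruct_order(records):
    """Recover original write order by sorting on (timestamp_ms, sequence)."""
    return sorted(records, key=lambda r: (r["timestamp_ms"], r["sequence"]))

# Records arrive in arbitrary order...
shuffled = [
    {"timestamp_ms": 1372342800123, "sequence": 2, "message": "b"},
    {"timestamp_ms": 1372342800123, "sequence": 1, "message": "a"},
    {"timestamp_ms": 1372342800001, "sequence": 7, "message": "first"},
]
# ...but sort back into write order, with the sequence breaking ties
# between lines that share the same millisecond.
ordered = reconstruct_order(shuffled)
print([r["message"] for r in ordered])
```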
The Boom File Format
1. Aims to avoid small compression blocks
2. Averages 87% compression with Deflate
3. Comes with Pig UDFs that unroll arrays
Syslog & Sawmill Ingest
1. Avoid changes to the front end services
2. Make data available for use as soon as possible
3. Serialize into Boom format (including compression)
4. Perform at high volume, fail predictably and report
Syslog & Sawmill Ingest
The service:
- Responsible for providing RFC* compliant log streams
- Preferably over TCP
- ...and that's it
*RFC3164/RFC5424
[Diagram: Service → Syslog → Sawmill → HDFS]
Syslog & Sawmill Ingest
Syslog:
- Provide filter and split functionality if required
- Correct badly formatted logs from services
- Deliver content to Sawmill via TCP syslog
Syslog & Sawmill Ingest
Sawmill:
- Accept all content as quickly as possible
- Parse date strings in all expected formats
- Serialize and compress content into Boom format
- Deliver one-minute files to the HDFS incoming directory
- Drop content and report in case of failures
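To make the date-parsing step concrete, here is a minimal, hypothetical parser for the two RFC timestamp shapes named earlier (RFC 5424's ISO form and RFC 3164's year-less form), normalized to epoch milliseconds. It is illustrative only; Sawmill's actual parser handles more variants:

```python
from datetime import datetime, timezone

def parse_syslog_ts(ts, default_year=2013):
    """Parse an RFC5424- or RFC3164-style timestamp to epoch milliseconds.
    Assumes UTC, and a caller-supplied year for RFC3164, which has no
    year field. Illustrative sketch, not Sawmill's real parser."""
    try:
        # RFC5424 style, e.g. "2013-06-27T14:05:01.123Z"
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        # RFC3164 style, e.g. "Jun 27 14:05:01"
        dt = datetime.strptime(f"{default_year} {ts}", "%Y %b %d %H:%M:%S")
    dt = dt.replace(tzinfo=timezone.utc)
    # Integer math keeps millisecond precision exact (no float rounding).
    return int(dt.replace(microsecond=0).timestamp()) * 1000 + dt.microsecond // 1000

print(parse_syslog_ts("2013-06-27T14:05:01.123Z"))
print(parse_syslog_ts("Jun 27 14:05:01"))
```

Normalizing every inbound format to a single millisecond epoch is what lets Boom files keep millisecond precision regardless of which service produced the line.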
Syslog & Sawmill Ingest
HDFS:
- Be up
Filesystem Structure
/service/<datacenter>/<service name>/logs/<date>/<hour>/<component name>/...
e.g. /service/dc11/bbm/logs/20130627/14/applog/...
(Or whatever you want to call them)
Filesystem Structure
.../applog/incoming/.. for incoming files from Sawmill
.../applog/working/.. for logs in merge (explained later)
.../applog/data/.. for merged, ready data
.../applog/archive/.. for archived data (explained later)
.../applog/failed/.. for content in failed state
.../applog/_READY flag indicating merged data
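The layout above fits in a tiny path helper. This is a hypothetical illustration, not a LogDriver API:

```python
def log_path(datacenter, service, date, hour, component, state="data"):
    """Build a LogDriver-style HDFS path. `state` is one of the per-component
    directories listed above: incoming, working, data, archive, failed."""
    return f"/service/{datacenter}/{service}/logs/{date}/{hour:02d}/{component}/{state}"

# New Sawmill deliveries land under .../incoming before the merge job runs.
p = log_path("dc11", "bbm", "20130627", 14, "applog", state="incoming")
print(p)
```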
File Maintenance
Focused on:
- Low delay to access newly delivered data
- Optimize data for HDFS (large files)
- Low CPU / cluster impact of maintenance
- Maintenance cannot impact query results
Merge Job
1. Rolls one-minute files into hourly files of up to 10 GB
2. Uses Zookeeper advisory locking
3. Map-Only job initiated from Oozie Workflow
4. Does not decompress log content
5. Sets _READY flag on completion
[Diagram: Incoming → (Merge) → Data → (Filter) → Archive]
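The planning half of the merge step can be sketched as a greedy grouping of one-minute files into outputs capped at 10 GB. This is an illustration of the grouping logic only; the real job is a map-only MR job that concatenates compressed Avro blocks without decompressing them:

```python
TEN_GB = 10 * 1024**3

def plan_merge(file_sizes, cap=TEN_GB):
    """Greedily group (name, size) input files into merged outputs of at
    most `cap` bytes each. Sketch of the planning step, not the MR job."""
    groups, current, total = [], [], 0
    for name, size in file_sizes:
        # Start a new output file once the next input would exceed the cap.
        if current and total + size > cap:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        groups.append(current)
    return groups

# Three one-minute files; the first nearly fills an output on its own.
sizes = [("m00", 6 * 1024**3), ("m01", 5 * 1024**3), ("m02", 3 * 1024**3)]
print(plan_merge(sizes))
```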
16. Internal Use Only
Filter Job
1. Filters data down to archive-worthy content using string match or regex
2. Keep-all or drop-all options
3. Map-only job initiated from Oozie Workflow
4. Deletes data in the archive after a configured retention window
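A toy version of that keep-all/drop-all filtering, supporting both plain string match and regex. Hypothetical sketch; the real job runs as map-only MR over Boom files:

```python
import re

def filter_lines(lines, pattern, mode="keep", use_regex=False):
    """Keep or drop lines matching a substring or regex.
    mode="keep" retains matches; mode="drop" retains non-matches."""
    if use_regex:
        rx = re.compile(pattern)
        match = lambda line: rx.search(line) is not None
    else:
        match = lambda line: pattern in line
    keep = mode == "keep"
    return [line for line in lines if match(line) == keep]

logs = ["INFO start", "DEBUG tick", "ERROR boom", "INFO done"]
# Drop-all: shed chatty lines before archiving.
dropped = filter_lines(logs, "DEBUG", mode="drop")
# Keep-all with regex: archive only errors and warnings.
kept = filter_lines(logs, r"^(ERROR|WARN)", mode="keep", use_regex=True)
print(dropped)
print(kept)
```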
Metadata
1. Tools for tracking LogDriver-managed content
2. JSON-formatted schema and nice command-line tools
Access Tools
1. Uses heavily optimized MR and Pig jobs:
   - logsearch for direct string matching (fastest)
   - logmultisearch for boolean AND/OR (still pretty fast)
   - loggrep for full regex search (speed of government)
2. Abstracts the filesystem, handles locking, guarantees order
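The three tools differ mainly in their match predicate, which is where the speed ranking comes from. A line-level sketch (function names mirror the tools, but these implementations are illustrative, not the actual MR jobs):

```python
import re

def logsearch(line, term):
    """Direct string match (fastest)."""
    return term in line

def logmultisearch(line, all_of=(), any_of=()):
    """Boolean AND over all_of terms, OR over any_of terms."""
    return all(t in line for t in all_of) and (
        not any_of or any(t in line for t in any_of)
    )

def loggrep(line, pattern):
    """Full regex search (slowest of the three)."""
    return re.search(pattern, line) is not None

line = "2013-06-27 ERROR bbm timeout contacting dc11"
print(logsearch(line, "ERROR"))
print(logmultisearch(line, all_of=("ERROR", "bbm"), any_of=("dc11", "dc12")))
print(loggrep(line, r"timeout\s+contacting"))
```

Substring checks beat compiled-regex scans per line, so pushing users toward the cheapest predicate that answers their question keeps cluster time down.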
Cool Stuff
[Chart: random ad-hoc jobs vs. merge/filter jobs]
Cool Stuff
[Callout: optimized sort approach!]
Roadmap
1. Kafka + Storm replacing Syslog and Sawmill?
   - Guaranteed delivery with disk caching
   - Ad-hoc real-time queries on incoming log streams
   - Other cool stuff with Storm
2. SolrCloud and integration with Cloudera Search?
   - Even faster search!
3. HCatalog integration?
Now Open Source!
1. https://github.com/blackberry/hadoop-logdriver
2. Apache 2.0 Licensed
3. Available Now!
Acknowledgements
1. Will Chartrand
2. Matt McDowell
3. The rest of the Hadoop teams at BlackBerry!
- Two years ago, traditional infrastructure: NAS storage, dedicated parsing and ETL pipelines feeding large OLTP Oracle databases
- Growth from 350 TB to 550 TB in about a year; over 650 TB/day now
- Requirements: growth, flexible searching, decreased cost, advanced processing
- Talk about our general needs for Hadoop
- Cover the need to avoid impacting production services in the deployment: i.e., why we left syslog as the way into Hadoop
- Note that we deal with thousands of messages per millisecond