Feeding The Elephant
Approaching 1 PB/day
Aaron Wiebe
Internal Use Only
Tackle
>350TB per day (two years ago)
1. Segmented across NAS devices and services
2. 40+ services across tens of thousands of servers
3. Geographically distributed
4. Ad-Hoc searching and reporting took days
5. ETL pipelines were complex and fragile
Confidential and Proprietary
Big needy data
1. Significantly reduce storage costs
2. Improve access times for searches
3. Provide an Ad-hoc access system
4. Secure Multitenant platform
5. Grow with us without major rearchitecture
6. Low-impact deployment
LogDriver
Our toolkit for loading, maintaining and searching log data in Hadoop.
Includes:
- Generic Avro format for log content (boom files)
- High Performance Flume replacement “Sawmill”**
- Data lifecycle management tools
- Log search and access tools
The Boom File Format
1. Supports unknown, generic log types as long as they conform to basic RFC date formats.
2. Provides mechanisms to reconstruct original order, though does not require order on disk or during MR processing.
3. Millisecond precision.
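The slides do not say how original order is reconstructed. As a minimal sketch of one way it could work, assume each line carries a millisecond timestamp plus a tie-breaking sequence number; the `LogLine` type and its `seq` field are hypothetical and are not taken from the Boom schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogLine:
    timestamp_ms: int  # event time with millisecond precision
    seq: int           # hypothetical tie-breaker for lines sharing a millisecond
    message: str


def restore_order(lines):
    """Return lines in original order, however they were stored or processed."""
    return sorted(lines, key=lambda line: (line.timestamp_ms, line.seq))


# Lines may arrive shuffled from disk or from MapReduce tasks.
shuffled = [
    LogLine(1372341600002, 2, "c"),
    LogLine(1372341600001, 0, "a"),
    LogLine(1372341600001, 1, "b"),
]
ordered = [line.message for line in restore_order(shuffled)]
```

Because the sort key travels with each record, nothing upstream has to preserve order on disk or between map tasks.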
The Boom File Format
1. Aims to avoid small compression blocks
2. Averages 87% compression with deflate()
3. Comes with Pig UDFs that unroll arrays.
Syslog & Sawmill Ingest
1. Avoid changes to the front end services
2. Make data available for use as soon as possible
3. Serialize into Boom format (including compression)
4. Perform at high volume, fail predictably and report
Syslog & Sawmill Ingest (Service)
- Responsible for providing RFC*-compliant log streams
- Preferably over TCP
- ... And that’s it
*RFC3164/RFC5424
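As an illustration of what an RFC 5424 compliant line looks like, here is a minimal sketch that formats one message. The host name, application name and message are made-up example values, and TCP framing and delivery are left out.

```python
from datetime import datetime, timezone


def format_rfc5424(facility, severity, timestamp, hostname, app_name, message):
    """Build one RFC 5424 line; PROCID, MSGID and structured data are nil ("-")."""
    pri = facility * 8 + severity  # PRI value as defined by the RFC
    ts = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return f"<{pri}>1 {ts} {hostname} {app_name} - - - {message}"


# Facility 16 (local0), severity 6 (informational); names are hypothetical.
line = format_rfc5424(
    16, 6,
    datetime(2013, 6, 27, 14, 0, 0, 123000, tzinfo=timezone.utc),
    "web01", "applog", "user login ok",
)
```

The timestamp is trimmed to milliseconds, which lines up with the millisecond precision the Boom format stores.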
Service -> Syslog -> Sawmill -> HDFS
Syslog & Sawmill Ingest (Syslog)
- Provide filter and split functionality if required
- Correct badly formatted logs from services
- Deliver content to Sawmill via TCP syslog
Syslog & Sawmill Ingest (Sawmill)
- Accept all content as quickly as possible
- Parse date strings of possible formats
- Serialize and compress content into Boom format
- Deliver one-minute files to HDFS incoming directory
- Drop content and report in case of failures
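The date-parsing and one-minute bucketing steps can be sketched as follows. This is not Sawmill's code: the helper names, the UTC default and the `assumed_year` parameter (RFC 3164 timestamps carry no year) are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def parse_timestamp(text, assumed_year):
    """Parse an RFC 5424 (ISO 8601) or RFC 3164 ("Jun 27 14:03:59") timestamp."""
    try:
        dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError:
        # RFC 3164 has no year or zone, so both are assumed here.
        dt = datetime.strptime(f"{assumed_year} {text}", "%Y %b %d %H:%M:%S")
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def minute_bucket(dt):
    """Name of the one-minute bucket a timestamp falls into."""
    return dt.strftime("%Y%m%d-%H%M")


events = [
    ("2013-06-27T14:03:27.123Z", "a"),
    ("Jun 27 14:03:59", "b"),
    ("2013-06-27T14:04:00Z", "c"),
]
buckets = defaultdict(list)
for raw_ts, message in events:
    buckets[minute_bucket(parse_timestamp(raw_ts, 2013))].append(message)
```

Each bucket would then be serialized and delivered as one file; a line whose date cannot be parsed raises `ValueError`, which a real ingest path would count and report.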
Syslog & Sawmill Ingest (HDFS)
- Be up
Filesystem Structure
/service/dc11/bbm/logs/20130627/14/applog/...
- dc11: Datacenter
- bbm: Service Name
- 20130627: Date
- 14: Hour
- applog: Component Name
(Or whatever you want to call them)
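A small sketch of how such a path could be assembled from its parts. The function name and the `root` default are assumptions; only the layout itself comes from the slide.

```python
from datetime import datetime


def log_dir(datacenter, service, component, when, root="/service"):
    """Build <root>/<datacenter>/<service>/logs/<YYYYMMDD>/<HH>/<component>."""
    return f"{root}/{datacenter}/{service}/logs/{when:%Y%m%d}/{when:%H}/{component}"


path = log_dir("dc11", "bbm", "applog", datetime(2013, 6, 27, 14))
```

Keeping date and hour as separate path levels means a search limited to a time range only has to list the matching hour directories.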
Filesystem Structure
.../applog/incoming/.. for incoming files from Sawmill
.../applog/working/.. for logs in merge (explained later)
.../applog/data/.. for merged, ready data
.../applog/archive/.. for archived data (explained later)
.../applog/failed/.. for content in failed state
.../applog/_READY flag indicating merged data
File Maintenance
Focused on:
- Low delay to access newly delivered data
- Optimize data for HDFS (large files)
- Low CPU / Cluster impact of maintenance
- Maintenance cannot impact query results
Merge Job
1. Rolls one minute files into hourly files up to 10G in size
2. Uses Zookeeper advisory locking
3. Map-Only job initiated from Oozie Workflow
4. Does not decompress log content
5. Sets _READY flag on completion
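The size-capped roll-up can be sketched as a greedy grouping of one-minute files. This is an assumption about the approach, not the actual merge job, and it reads the 10G cap as 10 GiB; locking, the Oozie workflow and the block-level copy are left out.

```python
MAX_BYTES = 10 * 1024**3  # assumed reading of the 10G cap


def plan_merge(files, max_bytes=MAX_BYTES):
    """Group (name, size_in_bytes) pairs, in order, into outputs of at most max_bytes."""
    groups, current, size = [], [], 0
    for name, nbytes in files:
        # Start a new output file when adding this input would exceed the cap.
        if current and size + nbytes > max_bytes:
            groups.append(current)
            current, size = [], 0
        current.append(name)
        size += nbytes
    if current:
        groups.append(current)
    return groups


# Tiny sizes and a cap of 10 bytes keep the example readable.
plan = plan_merge([("a", 4), ("b", 4), ("c", 4), ("d", 9)], max_bytes=10)
```

A single input larger than the cap still becomes its own output, so no data is dropped by the plan.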
[Diagram: Incoming -> (Merge) -> Data -> (Filter) -> Archive]
Filter Job
1. Filter down to archive content using string match or regex
2. Keep all or Drop all options
3. Map-Only job initiated from Oozie Workflow
4. Will delete data in the archive after configured window
Metadata
1. Tools for tracking LogDriver-managed content
2. JSON formatted schema and nice command line tools
Access Tools
1. Uses heavily optimized MR and Pig jobs
   - logsearch for direct string matching (fastest)
   - logmultisearch for boolean AND/OR (still pretty fast)
   - loggrep for full regex search (speed of government)
2. Abstracts filesystem, handles locking, guarantees order
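The three matching modes differ in how much work each line costs. The sketch below mirrors them in plain Python; the function names are illustrative and are not the command-line interface of the tools.

```python
import re


def search_substring(lines, needle):
    """Direct string match: the cheapest test per line."""
    return [line for line in lines if needle in line]


def search_multi(lines, needles, require_all=True):
    """Boolean AND (require_all=True) or OR (False) over several substrings."""
    combine = all if require_all else any
    return [line for line in lines if combine(n in line for n in needles)]


def search_regex(lines, pattern):
    """Full regular expression search: the most flexible and the slowest."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]


lines = ["login ok user=alice", "login failed user=bob", "logout user=alice"]
failed = search_substring(lines, "failed")
alice_logins = search_multi(lines, ["login", "alice"])
either = search_multi(lines, ["failed", "logout"], require_all=False)
alice_sessions = search_regex(lines, r"^log(in|out) .*alice$")
```

Substring tests avoid regex engine overhead entirely, which is consistent with the speed ranking given on the slide.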
Cool Stuff
[Figure labels: Random Ad-Hoc jobs; Merge/Filter Jobs]
[Figure annotation: Optimized sort approach!]
Roadmap
1. Kafka + Storm replacing Sawmill and Storm?
   - Guaranteed delivery with disk caching
   - Ad-hoc real-time queries to incoming logstreams
   - Other cool stuff with Storm
2. SolrCloud and integration with Cloudera Search?
   - Even faster search!
3. HCatalog integration?
Now Open Source!
1. https://github.com/blackberry/hadoop-logdriver
2. Apache 2.0 Licensed
3. Available Now!
Acknowledgements
1. Will Chartrand
2. Matt McDowell
3. The rest of the Hadoop teams at BlackBerry!
Questions?
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

Último (20)

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Feeding the Elephant: Approaching 1PB/Day

  • 1. Feeding The Elephant: Approaching 1 PB/day. Aaron Wiebe
  • 2. Tackle: >350TB per day (two years ago)
      1. Segmented across NAS devices and services
      2. 40+ services across tens of thousands of servers
      3. Geographically distributed
      4. Ad-hoc searching and reporting took days
      5. ETL pipelines were complex and fragile
  • 3. Big needy data
      1. Significantly reduce storage costs
      2. Improve access times for searches
      3. Provide an ad-hoc access system
      4. Secure multitenant platform
      5. Grow with us without major rearchitecture
      6. Low-impact deployment
  • 4. LogDriver
      Our toolkit for loading, maintaining and searching log data in Hadoop. Includes:
      - Generic Avro format for log content (boom files)
      - High-performance Flume replacement “Sawmill”**
      - Data lifecycle management tools
      - Log search and access tools
  • 5. The Boom File Format
      1. Supports unknown, generic log types as long as they conform to basic RFC date formats.
      2. Provides mechanisms to reconstruct original order, though it does not require order on disk or during MR processing.
      3. Millisecond precision.
  • 6. The Boom File Format
      1. Aims to avoid small compression blocks
      2. Averages 87% compression with deflate()
      3. Comes with Pig UDFs that unroll arrays.
  • 7. Syslog & Sawmill Ingest
      1. Avoid changes to the front end services
      2. Make data available for use as soon as possible
      3. Serialize into Boom format (including compression)
      4. Perform at high volume, fail predictably and report
  • 8. Syslog & Sawmill Ingest [diagram: Service, Syslog, Sawmill, HDFS]
      - Responsible for providing RFC* compliant log streams
      - Preferably over TCP
      - ... And that’s it
      *RFC3164/RFC5424
  • 9. Syslog & Sawmill Ingest [diagram: Service, Syslog, Sawmill, HDFS]
      - Provide filter and split functionality if required
      - Correct badly formatted logs from services
      - Deliver content to Sawmill via TCP syslog
  • 10. Syslog & Sawmill Ingest [diagram: Service, Syslog, Sawmill, HDFS]
      - Accept all content as quickly as possible
      - Parse date strings of possible formats
      - Serialize and compress content into Boom format
      - Deliver one-minute files to HDFS incoming directory
      - Drop content and report in case of failures
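The date-parsing and one-minute bucketing steps on slide 10 can be sketched roughly as follows. This is a minimal illustration, not Sawmill's code: the function names, the list of accepted formats, and the `assumed_year` parameter (RFC 3164 timestamps carry no year) are all assumptions. Timestamps are treated as naive values, and `%b` expects English month names under the default locale.

```python
from datetime import datetime

# Hypothetical format list; the slides only say "possible formats".
RFC5424_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")
RFC3164_FORMAT = "%Y %b %d %H:%M:%S"  # year is prepended by the caller


def parse_timestamp(text, assumed_year):
    """Try each known format; return None when nothing matches.

    Returning None stands in for "drop content and report".
    """
    for fmt in RFC5424_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    try:
        # RFC 3164 timestamps omit the year, so one has to be supplied.
        return datetime.strptime(f"{assumed_year} {text}", RFC3164_FORMAT)
    except ValueError:
        return None


def minute_bucket(ts):
    """Truncate a timestamp to the minute that names its output file."""
    return ts.replace(second=0, microsecond=0)
```

Every message whose timestamp falls in the same bucket would land in the same one-minute file; the millisecond part is kept on the parsed value so the original order can still be reconstructed later.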
  • 11. Syslog & Sawmill Ingest [diagram: Service, Syslog, Sawmill, HDFS]
      - Be up
  • 12. Filesystem Structure
      /service/dc11/bbm/logs/20130627/14/applog/...
      Path components: Datacenter, Service Name, Date, Hour, Component Name
      (Or whatever you want to call them)
  • 13. Filesystem Structure
      .../applog/incoming/..  for incoming files from Sawmill
      .../applog/working/..   for logs in merge (explained later)
      .../applog/data/..      for merged, ready data
      .../applog/archive/..   for archived data (explained later)
      .../applog/failed/..    for content in failed state
      .../applog/_READY       flag indicating merged data
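The directory layout on slides 12 and 13 can be sketched as a small path builder. This is only an illustration under stated assumptions: the helper names are hypothetical, and the mapping of the example path's segments to datacenter (`dc11`), service (`bbm`), date, hour, and component (`applog`) is a reading of the slide, not something taken from the LogDriver code.

```python
from datetime import datetime

# State subdirectories listed on slide 13.
STATES = ("incoming", "working", "data", "archive", "failed")


def component_dir(datacenter, service, component, when, root="/service"):
    """Build the per-hour directory for one component's logs."""
    return f"{root}/{datacenter}/{service}/logs/{when:%Y%m%d}/{when:%H}/{component}"


def state_dir(base, state):
    """Return the subdirectory for one lifecycle state under a component dir."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    return f"{base}/{state}"


def ready_flag(base):
    """Path of the _READY flag that marks merged data."""
    return f"{base}/_READY"
```

For the hour shown on the slide, `component_dir("dc11", "bbm", "applog", datetime(2013, 6, 27, 14))` yields `/service/dc11/bbm/logs/20130627/14/applog`.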
  • 14. File Maintenance
      Focused on:
      - Low delay to access newly delivered data
      - Optimize data for HDFS (large files)
      - Low CPU / cluster impact of maintenance
      - Maintenance cannot impact query results
  • 15. Merge Job [diagram labels: Incoming, Data, Archive, Merge, Filter]
      1. Rolls one-minute files into hourly files up to 10G in size
      2. Uses ZooKeeper advisory locking
      3. Map-only job initiated from an Oozie workflow
      4. Does not decompress log content
      5. Sets _READY flag on completion
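The size cap on merged files (slide 15) suggests some grouping step before the merge runs. The slides do not say how files are packed, so the greedy, in-order grouping below is an assumption, as are the function name and the reading of "10G" as 10 GiB.

```python
# "10G" from the slide, interpreted here as 10 GiB (assumption).
MAX_OUTPUT_BYTES = 10 * 1024**3


def plan_merge(files, max_bytes=MAX_OUTPUT_BYTES):
    """Group (name, size) pairs, in delivery order, into merge outputs.

    A new output group is started whenever adding the next file would
    push the current group past max_bytes. A single file larger than
    the cap still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Keeping the input in delivery order means each output covers a contiguous run of minutes, which fits the slide's point that the merge does not need to decompress or re-sort log content.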
  • 16. Filter Job [diagram labels: Incoming, Data, Archive, Merge, Filter]
      1. Filter down to archive content using string match or regex
      2. Keep all or Drop all options
      3. Map-only job initiated from an Oozie workflow
      4. Will delete data in the archive after configured window
  • 17. Metadata
      1. Tools for tracking LogDriver-managed content
      2. JSON-formatted schema and nice command line tools
  • 18. Metadata
  • 19. Access Tools
      Uses heavily optimized MR and Pig jobs:
      - logsearch for direct string matching (fastest)
      - logmultisearch for boolean AND/OR (still pretty fast)
      - loggrep for full regex search (speed of government)
      Abstracts filesystem, handles locking, guarantees order
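The three matching modes behind the access tools on slide 19 can be illustrated with plain in-memory functions. This is not how logsearch, logmultisearch, or loggrep are implemented (those run as MR and Pig jobs), and the function names and signatures here are hypothetical. It only shows why the modes differ in cost: a substring test is cheaper than several substring tests, which are in turn cheaper than a general regex.

```python
import re


def search_string(lines, needle):
    """Direct string matching: keep lines containing the needle."""
    return [line for line in lines if needle in line]


def search_multi(lines, needles, mode="AND"):
    """Boolean matching: all needles (AND) or at least one (OR)."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    return [line for line in lines if combine(n in line for n in needles)]


def search_regex(lines, pattern):
    """Full regex search: keep lines where the pattern matches anywhere."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]
```

All three preserve input order, mirroring the slide's note that the tools guarantee order.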
  • 20. Cool Stuff [figure labels: Random Ad-Hoc jobs, Merge/Filter Jobs]
  • 21. Cool Stuff [figure label: Optimized sort approach!]
  • 22. Roadmap
      1. Kafka + Storm replacing Sawmill and Storm?
         - Guaranteed delivery with disk caching
         - Ad-hoc real-time queries to incoming logstreams
         - Other cool stuff with Storm
      2. SOLRCloud and integration with Cloudera Search?
         - Even faster search!
      3. HCatalog integration?
  • 23. Now Open Source!
      1. https://github.com/blackberry/hadoop-logdriver
      2. Apache 2.0 Licensed
      3. Available Now!
  • 24. Acknowledgements
      1. Will Chartrand
      2. Matt McDowell
      3. The rest of the Hadoop teams at BlackBerry!
  • 25. Questions?

Editor's Notes

  1. - Two years ago, traditional infrastructure: NAS storage, dedicated parsing and ETL pipelines feeding large OLTP Oracle databases
     - Growth from 350TB to 550TB in about a year. Over 650TB/day now
     - Requirements: growth, flexible searching, decreased cost, advanced processing
  2. - Talk about our general needs for Hadoop
     - Cover the need to avoid impacting production services in the deployment, i.e. why we left syslog as the way into Hadoop
  3. - Note that we deal with thousands of messages per millisecond
  4. - Note that we deal with thousands of messages per millisecond