SlideShare uma empresa Scribd logo
1 de 22
INSTRUMENTING YOUR INSTRUMENTS
Premal Shah
Co-Founder @ 6sense
Hadoop Summit 2016
AGENDA
What does 6sense do?
How do we do it?
What does the pipeline look like?
Where do we do it?
What are the challenges?
How are we planning to solve them?
WHAT DOES 6SENSE DO?
• We find prospects that are in market to buy
• We empower marketing and sales teams
SAMPLE OUTPUT
Account Name Buying Stage Profile Fit
ACME Corporation Purchase Strong
ABC Corp Decision Strong
XYZ Systems Consideration Medium
Doe Inc Awareness Strong
PURCHASE
DECISION
CONSIDERATION
AWARENESS
HOW DO WE DO IT?
1st Party
•Web
•CRM
•Marketing
Automation
3rd Party
•Web
•Search
•Ad
Impressions
Modelling &
Scoring
Actionable
Data for the
Customer
Customer
Systems
WHAT DOES THE PIPELINE LOOK LIKE?
Customer
Systems
Ingest Process Export
Customer
Systems
THE DAILY PROCESS GRAPH (DAG)
THE REAL WORLD
THE REAL WORLD * N
PIPELINE COMPONENTS
Hadoop Eco System
YARN
Hive
Presto
Mesos World
Mesos
Chronos
Marathon
WORKFLOW
Chronos Queue Marathon
Jobs
•Hadoop
•Hive
•Presto
•Python
WHERE DO WE DO IT?
• AWS
─ Elastic
─ Easy to experiment
─ No CAPEX
• Hadoop
─ Data Nodes are run separately from Node Managers
─ Most of the data sits in S3
PROJECT RAVEN
WHAT AFFECTS PERFORMANCE
• Hive
─ Joins
─ Non-Partitioned tables
─ Filters
─ Bucketing
• Hadoop
─ File format
─ Compression
─ Data Locality
METRICS THAT MATTER
• # of Mappers
• # of Input Files
• # of Input Records
• # of Records passed on to the next stage
• Time taken in
─ Mappers
─ Copy
─ Shuffle
─ Reducers
• # of Reducers
• # of compressed vs uncompressed files
• File formats
• Etc.
WHAT DO WE STORE?
• Job Name 1
─ Date 1
o Yarn Job # 1
 Metrics
o Yarn Job # 2
 Metrics
─ Date 2
o Repeat as above
• Job Name 2
─ Repeat as above
WHAT DO WE USE THEM FOR?
• Finding the Job that
─ Is the slowest
─ Process the most files
─ Filter out most of the data
─ Use the most amount of memory
• Observe trends over time in the above metrics
• Get alerted on changes in the trends, both up and down
RECOMMENDATIONS
• Storage Format
• Compression Type
• Partition Columns
• Bucketing
• Etc.
OPTIMIZATIONS
• Which job is causing the bottleneck?
• How many errors can we tolerate?
• Which job is the biggest offender?
• Which job fails the most?
• What did the latest release do?
SCALING
• Can we scale the number of customers?
• What does it cost to add a customer?
• What does it cost to add a job to each customer’s pipeline?
VENDOR SHOUT OUT
• ClusterK (now AWS Spot Fleet)
─ Allows us to use different instance types to load balance and reduce costs
• Sumo Logic
─ Detect variances in behavior over a custom time period
• OpsClarity
─ Collects, monitors and alerts on the following metrics
o AWS Cloud Watch metrics (Queue length, S3 bucket size, etc.)
o Host metrics (CPU, Memory, Disk Space, etc.)
o Service metrics (YARN, HBase, Mesos, etc.)
o Container metrics - Docker
o Custom metrics – Anything else you want to send
THANK YOU
• premal at 6sense.com
• https://www.linkedin.com/in/premaljshah

Mais conteúdo relacionado

Mais procurados

Smart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive EnterpriseSmart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive EnterpriseDataWorks Summit
 
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...DataStax Academy
 
How Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftHow Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftAttunity
 
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Data Con LA
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...DataWorks Summit/Hadoop Summit
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data IntegrationsPat Patterson
 
Streaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaStreaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaAttunity
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Spark Summit
 
Real-time Distributed Stream Processing @ Scale
Real-time Distributed Stream Processing@ ScaleReal-time Distributed Stream Processing@ Scale
Real-time Distributed Stream Processing @ ScaleJerome Boulon
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Data Con LA
 
Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...DataStax Academy
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Data Con LA
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...Big Data Spain
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
 
Digital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraDigital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraAttunity
 
Break Free From Oracle with Attunity and Microsoft
Break Free From Oracle with Attunity and MicrosoftBreak Free From Oracle with Attunity and Microsoft
Break Free From Oracle with Attunity and MicrosoftAttunity
 
Disrupting Big Data with Apache Spark in the Cloud
Disrupting Big Data with Apache Spark in the CloudDisrupting Big Data with Apache Spark in the Cloud
Disrupting Big Data with Apache Spark in the CloudJen Aman
 

Mais procurados (20)

ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive EnterpriseSmart Enterprise Big Data Bus for the Modern Responsive Enterprise
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise
 
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...
 
How Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftHow Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon Redshift
 
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
 
Building Custom Big Data Integrations
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data Integrations
 
Streaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaStreaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache Kafka
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
 
Real-time Distributed Stream Processing @ Scale
Real-time Distributed Stream Processing@ ScaleReal-time Distributed Stream Processing@ Scale
Real-time Distributed Stream Processing @ Scale
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
Digital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraDigital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming Era
 
Break Free From Oracle with Attunity and Microsoft
Break Free From Oracle with Attunity and MicrosoftBreak Free From Oracle with Attunity and Microsoft
Break Free From Oracle with Attunity and Microsoft
 
Disrupting Big Data with Apache Spark in the Cloud
Disrupting Big Data with Apache Spark in the CloudDisrupting Big Data with Apache Spark in the Cloud
Disrupting Big Data with Apache Spark in the Cloud
 

Destaque

Unidad 3 vanessa
Unidad 3 vanessaUnidad 3 vanessa
Unidad 3 vanessaVaness Ilb
 
Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...
Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...
Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...Amazon Web Services
 
Chefkoch goes Drupal8
Chefkoch goes Drupal8Chefkoch goes Drupal8
Chefkoch goes Drupal8Per Bernhardt
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 

Destaque (9)

Drama
DramaDrama
Drama
 
Mapa conceptual unidad ii
Mapa conceptual unidad iiMapa conceptual unidad ii
Mapa conceptual unidad ii
 
Unidad 3 vanessa
Unidad 3 vanessaUnidad 3 vanessa
Unidad 3 vanessa
 
Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...
Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...
Zombie Apocalypse Workshop by Warren Santer and Kyle Somers, Solutions Archit...
 
La Computadora
La ComputadoraLa Computadora
La Computadora
 
Chefkoch goes Drupal8
Chefkoch goes Drupal8Chefkoch goes Drupal8
Chefkoch goes Drupal8
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 

Semelhante a Instrumenting Your Instruments: Optimizing Hadoop Pipelines at 6sense

Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Web Services
 
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation ProjectsAmazon Web Services
 
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014Amazon Web Services
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
ARC202:real world real time analytics
ARC202:real world real time analyticsARC202:real world real time analytics
ARC202:real world real time analyticsSebastian Montini
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010Christopher Brown
 
RightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersAmazon Web Services
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersAmazon Web Services
 
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planningasya999
 
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...Lucas Jellema
 
Three signs your architecture is too small for big data. Camp IT December 2014
Three signs your architecture is too small for big data.  Camp IT December 2014Three signs your architecture is too small for big data.  Camp IT December 2014
Three signs your architecture is too small for big data. Camp IT December 2014Craig Jordan
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...Julien SIMON
 
Presentation meetup ElasticSearch Paris #10
Presentation meetup ElasticSearch Paris #10Presentation meetup ElasticSearch Paris #10
Presentation meetup ElasticSearch Paris #10Renaud Boutet
 
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLPerformance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLTriNimbus
 
AWS Well Architected Framework
AWS Well Architected FrameworkAWS Well Architected Framework
AWS Well Architected FrameworkJohn McCormack
 
Logmatic at ElasticSearch November Paris meetup
Logmatic at ElasticSearch November Paris meetupLogmatic at ElasticSearch November Paris meetup
Logmatic at ElasticSearch November Paris meetuplogmatic.io
 
How We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by VersentHow We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by VersentAmazon Web Services
 
ENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million usersENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million usersAmazon Web Services
 

Semelhante a Instrumenting Your Instruments: Optimizing Hadoop Pipelines at 6sense (20)

Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
(SPOT205) 5 Lessons for Managing Massive IT Transformation Projects
 
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
(ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
ARC202:real world real time analytics
ARC202:real world real time analyticsARC202:real world real time analytics
ARC202:real world real time analytics
 
Design for Scale / Surge 2010
Design for Scale / Surge 2010Design for Scale / Surge 2010
Design for Scale / Surge 2010
 
RightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to Cloud
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million Users
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million Users
 
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
 
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
 
Three signs your architecture is too small for big data. Camp IT December 2014
Three signs your architecture is too small for big data.  Camp IT December 2014Three signs your architecture is too small for big data.  Camp IT December 2014
Three signs your architecture is too small for big data. Camp IT December 2014
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
 
Presentation meetup ElasticSearch Paris #10
Presentation meetup ElasticSearch Paris #10Presentation meetup ElasticSearch Paris #10
Presentation meetup ElasticSearch Paris #10
 
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLPerformance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
 
AWS Well Architected Framework
AWS Well Architected FrameworkAWS Well Architected Framework
AWS Well Architected Framework
 
Logmatic at ElasticSearch November Paris meetup
Logmatic at ElasticSearch November Paris meetupLogmatic at ElasticSearch November Paris meetup
Logmatic at ElasticSearch November Paris meetup
 
How We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by VersentHow We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by Versent
 
ENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million usersENT309 scaling up to your first 10 million users
ENT309 scaling up to your first 10 million users
 

Mais de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Mais de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Último

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 

Último (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Instrumenting Your Instruments: Optimizing Hadoop Pipelines at 6sense

  • 1. INSTRUMENTING YOUR INSTRUMENTS Premal Shah Co-Founder @ 6sense Hadoop Summit 2016
  • 2. AGENDA What does 6sense do? How do we do it? What does the pipeline look like? Where do we do it? What are the challenges? How are we planning to solve them?
  • 3. WHAT DOES 6SENSE DO? • We find prospects that are in market to buy • We empower marketing and sales teams
  • 4. SAMPLE OUTPUT Account Name Buying Stage Profile Fit ACME Corporation Purchase Strong ABC Corp Decision Strong XYZ Systems Consideration Medium Doe Inc Awareness Strong PURCHASE DECISION CONSIDERATION AWARENESS
  • 5. HOW DO WE DO IT? 1st Party •Web •CRM •Marketing Automation 3rd Party •Web •Search •Ad Impressions Modelling & Scoring Actionable Data for the Customer
  • 6. Customer Systems WHAT DOES THE PIPELINE LOOK LIKE? Customer Systems Ingest Process Export Customer Systems
  • 7. THE DAILY PROCESS GRAPH (DAG)
  • 10. PIPELINE COMPONENTS Hadoop Eco System YARN Hive Presto Mesos World Mesos Chronos Marathon
  • 12. WHERE DO WE DO IT? • AWS ─ Elastic ─ Easy to experiment ─ No CAPEX • Hadoop ─ Data Nodes are run separately from Node Managers ─ Most of the data sits in S3
  • 14. WHAT AFFECTS PERFORMANCE • Hive ─ Joins ─ Non-Partitioned tables ─ Filters ─ Bucketing • Hadoop ─ File format ─ Compression ─ Data Locality
  • 15. METRICS THAT MATTER • # of Mappers • # of Input Files • # of Input Records • # of Records passed on to the next stage • Time taken in ─ Mappers ─ Copy ─ Shuffle ─ Reducers • # of Reducers • # of compressed vs uncompressed files • File formats • Etc.
  • 16. WHAT DO WE STORE? • Job Name 1 ─ Date 1 o Yarn Job # 1  Metrics o Yarn Job # 2  Metrics ─ Date 2 o Repeat as above • Job Name 2 ─ Repeat as above
  • 17. WHAT DO WE USE THEM FOR? • Finding the Job that ─ Is the slowest ─ Process the most files ─ Filter out most of the data ─ Use the most amount of memory • Observe trends over time in the above metrics • Get alerted on changes in the trends, both up and down
  • 18. RECOMMENDATIONS • Storage Format • Compression Type • Partition Columns • Bucketing • Etc.
  • 19. OPTIMIZATIONS • Which job is causing the bottleneck? • How many errors can we tolerate? • Which job is the biggest offender? • Which job fails the most? • What did the latest release do?
  • 20. SCALING • Can we scale the number of customers? • What does it cost to add a customer? • What does it cost to add a job to each customer’s pipeline?
  • 21. VENDOR SHOUT OUT • ClusterK (now AWS Spot Fleet) ─ Allows us to use different instance types to load balance and reduce costs • Sumo Logic ─ Detect variances in behavior over a custom time period • OpsClarity ─ Collects, monitors and alerts on the following metrics o AWS Cloud Watch metrics (Queue length, S3 bucket size, etc.) o Host metrics (CPU, Memory, Disk Space, etc.) o Service metrics (YARN, HBase, Mesos, etc.) o Container metrics - Docker o Custom metrics – Anything else you want to send
  • 22. THANK YOU • premal at 6sense.com • https://www.linkedin.com/in/premaljshah

Notas do Editor

  1. If the queue has a backlog, we have to scale up all the systems to keep it moving. The data is in HDFS, S3, RDBMS, etc.
  2. Analyzing a hive query could be hard, but analyzing the performance metrics of many successful runs should be an easier problem to solve. Users are busy, they write a query and move on. The system should look at the performance and suggest optimizations One size does not fit all. Customer A's web data could have a very different scale than Customer B
  3. Load Balancer Queue Length CPU and Memory Services up time