Debugging data pipelines with Presto at Ola

Shubham Tagra

Talk at Presto Bangalore Meetup by Karan Kumar on Debugging data pipelines @OLA

Software

Debugging data pipelines
Karan Kumar
SDE 3
Dataplatform

Overview
● Our Journey
● Analytics@Ola
● CDC Overview
● Application Events
● Majority Sources
● Hello Presto
● The Presto Kafka Problem
● Solution
● Results
● We like Ambari!
● How to expose?
● Hue drawbacks
● Presto as a first class citizen of Hue
● Roadmap

Overview of analytics@Ola
● 25k queries run daily by business analysts.
● ~400 business analysts.
● 2.5 TB of daily data ingest.
● ~3k tables maintained by dataplatform.
● Auth managed via Ranger

Majority sources
● MySQL
● PostgreSQL
● Kafka
● MongoDB
● HBase
● ScyllaDB
● Hive

Hello Presto
● Single unified view across data sources
● Profiling and automated alerting
● Drastic reduction in turnaround time (TAT).
● Integration with Jira hooks

The Presto-Kafka Problem
● Fetches all the partitions, starts scanning from the earliest offset, and only then applies filters
● Adding a topic requires a config change

The Presto-Kafka Solution
● Query the broker for the topic list every time.
● Make use of message_timestamp in Kafka versions > 0.10.1xx
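
The timestamp idea above can be illustrated with a small, self-contained simulation. This is only a sketch and not the connector code from the talk: the `Message` type and the functions `full_scan` and `pushdown_scan` are hypothetical names, the partition is an in-memory list, and timestamps are assumed to be non-decreasing within a partition. In a real cluster the offset lookup would be done by the broker's time index rather than by a local binary search.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Message(NamedTuple):
    """Hypothetical stand-in for one Kafka record in a single partition."""
    offset: int
    timestamp_ms: int
    value: str


def full_scan(partition: List[Message], start_ms: int, end_ms: int) -> Tuple[List[Message], int]:
    """Baseline: read every message from the earliest offset, then filter."""
    scanned = 0
    matches = []
    for msg in partition:
        scanned += 1
        if start_ms <= msg.timestamp_ms < end_ms:
            matches.append(msg)
    return matches, scanned


def pushdown_scan(partition: List[Message], start_ms: int, end_ms: int) -> Tuple[List[Message], int]:
    """Use the timestamp predicate to pick the offset range before reading.

    Assumes timestamps are non-decreasing within the partition. The binary
    search below stands in for the broker-side timestamp-to-offset lookup.
    """
    timestamps = [m.timestamp_ms for m in partition]
    lo = bisect_left(timestamps, start_ms)  # first message with ts >= start_ms
    hi = bisect_left(timestamps, end_ms)    # first message with ts >= end_ms
    scanned = 0
    matches = []
    for msg in partition[lo:hi]:
        scanned += 1
        # The filter is still applied; the range only limits what is read.
        if start_ms <= msg.timestamp_ms < end_ms:
            matches.append(msg)
    return matches, scanned


# 1000 synthetic messages, 10 ms apart.
partition = [Message(i, 1000 + 10 * i, f"event-{i}") for i in range(1000)]
full_matches, full_scanned = full_scan(partition, 5000, 5100)
push_matches, push_scanned = pushdown_scan(partition, 5000, 5100)
print(full_scanned, push_scanned, full_matches == push_matches)
```

Both functions return the same rows; the difference is how many messages have to be read to produce them.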

Results
● Earlier
● With message timestamp
● With predicate pushdown

We like Ambari!!
● Exposing Presto on Ambari.
● Patching open source Ambari to fit our needs of pulling tars from S3.
● Out of the box alerting and monitoring.
● Releasing plugins via S3 poll.
● Autoscaling via AWS Auto Scaling groups.

That's okay but how to expose?
● We had 3 choices.
○ MSTR
○ Hue
○ New interface like Superset

Why will Hue not work?
● No results download
● No query progress
● No query kill functionality
● Result caching
● Download limit is on rows fetched, not on size.
● Launches a JVM for each user

Why did MSTR not work?
● Downloading was tedious.
● Per user memory issue.
● UI unfamiliarity.

Presto as a first class citizen for Hue
● Results download up to 100 MB.
● Query progress.
● Query kill supported.
● Query expiry after 7 days. No need to rerun historical queries
● Coordinator query URL
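
A size-based download limit, as opposed to a row-count limit, could look roughly like the sketch below. It is not the Hue patch described in the talk: the function `capped_csv_chunks` and the constant `MAX_DOWNLOAD_BYTES` are hypothetical, and reading "100 MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence

# Assumption: "100 MB" is read as 100 * 1024 * 1024 bytes.
MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024


def capped_csv_chunks(rows: Iterable[Sequence[object]],
                      max_bytes: int = MAX_DOWNLOAD_BYTES) -> Iterator[bytes]:
    """Yield CSV-encoded rows until the next row would exceed max_bytes.

    The cap is on encoded bytes sent, not on the number of rows, so wide
    and narrow result sets are treated the same way.
    """
    sent = 0
    for row in rows:
        buf = io.StringIO()
        csv.writer(buf).writerow(row)
        chunk = buf.getvalue().encode("utf-8")
        if sent + len(chunk) > max_bytes:
            break  # stop before exceeding the byte budget
        sent += len(chunk)
        yield chunk


# Tiny budget for demonstration: each encoded row is 5 bytes ("a,1\r\n").
demo_rows = [("a", 1)] * 10
demo_output = b"".join(capped_csv_chunks(demo_rows, max_bytes=12))
print(demo_output)
```

Because the generator streams one row at a time, the server never needs to hold the full result in memory to enforce the cap.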

Roadmap
1. Contributing the Presto Kafka connector back
2. Presto Oozie support
3. Getting the Presto Ranger PR merged
4. Deprecating Hive for analysts
