SlideShare a Scribd company logo
1 of 13
Analytics @ Inmobi
      Gaurav Agarwal
    gaurav@inmobi.com
Scale
Large data sizes
 ~3B records per day ~= 2.5T
 uncompressed json
 ~100 primary, ~300 derived
 dimensions.
 ~50 measures.
 Analysis horizon -years
Scope
Highly Dynamic Data and Analytic needs
  Frequent addition of newer dimensions.
  Very dynamic query patterns.
  Both canned and ad-hoc reports.
  Multiple phase-shifted large data
  streams.
  Different kind of consumers – sales,
  analyst, execs, machines.
The Beginnings: Perl to MR
           (Hadoop)..
Logs summarized using perl. Low volumes
(order of hundred thousand).
Perl could not handle increased volumes
(millions). (2Q, 2010)
MR jobs to aggregate logs and populate DB (3
machine cluster)
DB views increased ; creating MR jobs time
consuming, error prone and hard. (3Q, 2010)
Solving for Pipeline - Pig
MR: New job per need ; known by few.
Pig: Well suited for medium complexity
pipeline jobs.
Data gets aggregated using Pig and
pushed to DB for analytics.
Analytics gets complex
Business evolved; complex analytics needed.
DB suffers ‘limited angle view’ problems.
Proliferation of materialized views.
Hive: not mature (early 2011), too much resource
on small/medium clusters, lot of flux, not
optimal, difficult to fix things and add features.
Back to Pig: Team of engineers writing ad-hoc pig
scripts for business; Performance only as good as
person writing the query – very low productivity.
Realization
Frequently ‘tools’ don’t work as intended. Too
much customizations and constant tuning.
Difficult to absorb the dynamics of the data.
Too generic and not optimal for our data
models and cluster size.
Parts of the required stack – difficult to
integrate and maintain.
Pig not suited for analytics by business. Too
much technical knowledge needed.
Yoda
Developed in-house system to satisfy ad-
hoc analytics.
Complete Stack (ETL, Query Processor,
Query Builder, Visualization) on top of
Hadoop, for processing logs & analytics.
(Q1, 2011)
SQL like operations like Select, Sum, Avg
Min, Max, Count, Distinct, Decode,
Expressions, GroupBy, Where, Having,
Decode, UDF, UDAF etc.
Yoda cont..
Heavily optimize storage and
queries for the data model.
All the fact data streams and
metadata in a coherent, seamless
view.
Platform–UI as well as API (to
embed the functionality it in other
apps).
Life of a Query
     UI        Optimizations     Validate         Select
Convert to     Metadata->        Query            Metadata
protbuf        Fact                               Create Joins
               promotions:       Select Cube      Estimate cost
Transmit
               GroupBy                            Select Priority
Json                             Select Optimal   Determine
               Where
                                 grain            Split size




Collect data   (Reducer)       (Mapper)           (Driver)
               Do Aggregate    Filter Push Down
Format and                     at record          Optimize
output CSV     Apply Formula   reconstruction.    query via
                               Fact filters.      reorganizatio
Update         Perform         Join.              n
status         Having          Dim Filter.
                               Select/Group       Generate MR
Notify user    Top N           Partial            spec.
                               aggregation
What worked
Efficiency in modeling and joins
  Solid data modeling. Wasteful to
  perform joins on the fly. Single-stage
  MR to both group and join.
  Map side metadata joins – efficient
  horizontal, vertical & filtered data
  load.
  Pre-join metadata once.
What worked cont..
Simplicity: Transparent Cube and
Aggregate selection (no From or Join
clause).
Ability to absorb data dynamics.
Intuitive query builder.
Analytics - not ‘just’ query.
Support for ‘scheduled’ ad-hoc queries.
Demo + QA

gaurav@inmobi.com

More Related Content

What's hot

Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoopRommel Garcia
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAmazon Web Services
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?Attunity
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworksIJDKP
 
Topology Aware Resource Allocation
Topology Aware Resource AllocationTopology Aware Resource Allocation
Topology Aware Resource AllocationSujith Jay Nair
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLVenu Anuganti
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Renato Bonomini
 
MS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data miningMS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data miningDataminingTools Inc
 
Rfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configurationRfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configurationLeMeniz Infotech
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET Journal
 
About Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopAbout Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopLynn Langit
 

What's hot (14)

Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Topology Aware Resource Allocation
Topology Aware Resource AllocationTopology Aware Resource Allocation
Topology Aware Resource Allocation
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
 
MS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data miningMS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data mining
 
Rfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configurationRfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configuration
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
 
About Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopAbout Streaming Data Solutions for Hadoop
About Streaming Data Solutions for Hadoop
 

Similar to Yoda fifth elephant

Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarinn5712036
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...Yahoo Developer Network
 
Analysis Services Best Practices From Large Deployments
Analysis Services   Best Practices From Large DeploymentsAnalysis Services   Best Practices From Large Deployments
Analysis Services Best Practices From Large Deploymentsrsnarayanan
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesYang Li
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusterst_ivanov
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesAshraf Uddin
 
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...Karthik K Iyengar
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...Yann Cluchey
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergencekvnnrao
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)camunda services GmbH
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Aditya Varun Chadha
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningHoa Le
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Giridhar Addepalli
 

Similar to Yoda fifth elephant (20)

Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
BigData
BigDataBigData
BigData
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
Analysis Services Best Practices From Large Deployments
Analysis Services   Best Practices From Large DeploymentsAnalysis Services   Best Practices From Large Deployments
Analysis Services Best Practices From Large Deployments
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
 
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergence
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 

Recently uploaded

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Recently uploaded (20)

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Yoda fifth elephant

  • 1. Analytics @ Inmobi Gaurav Agarwal gaurav@inmobi.com
  • 2. Scale Large data sizes ~3B records per day ~= 2.5T uncompressed json ~100 primary, ~300 derived dimensions. ~50 measures. Analysis horizon -years
  • 3. Scope Highly Dynamic Data and Analytic needs Frequent addition of newer dimensions. Very dynamic query patterns. Both canned and ad-hoc reports. Multiple phase-shifted large data streams. Different kind of consumers – sales, analyst, execs, machines.
  • 4. The Beginnings: Perl to MR (Hadoop).. Logs summarized using perl. Low volumes (order of hundred thousand). Perl could not handle increased volumes (millions). (2Q, 2010) MR jobs to aggregate logs and populate DB (3 machine cluster) DB views increased ; creating MR jobs time consuming, error prone and hard. (3Q, 2010)
  • 5. Solving for Pipeline - Pig MR: New job per need ; known by few. Pig: Well suited for medium complexity pipeline jobs. Data gets aggregated using Pig and pushed to DB for analytics.
  • 6. Analytics gets complex Business evolved; complex analytics needed. DB suffers ‘limited angle view’ problems. Proliferation of materialized views. Hive: not mature (early 2011), too much resource on small/medium clusters, lot of flux, not optimal, difficult to fix things and add features. Back to Pig: Team of engineers writing ad-hoc pig scripts for business; Performance only as good as person writing the query – very low productivity.
  • 7. Realization Frequently ‘tools’ don’t work as intended. Too much customizations and constant tuning. Difficult to absorb the dynamics of the data. Too generic and not optimal for our data models and cluster size. Parts of the required stack – difficult to integrate and maintain. Pig not suited for analytics by business. Too much technical knowledge needed.
  • 8. Yoda Developed in-house system to satisfy ad- hoc analytics. Complete Stack (ETL, Query Processor, Query Builder, Visualization) on top of Hadoop, for processing logs & analytics. (Q1, 2011) SQL like operations like Select, Sum, Avg Min, Max, Count, Distinct, Decode, Expressions, GroupBy, Where, Having, Decode, UDF, UDAF etc.
  • 9. Yoda cont.. Heavily optimize storage and queries for the data model. All the fact data streams and metadata in a coherent, seamless view. Platform–UI as well as API (to embed the functionality it in other apps).
  • 10. Life of a Query UI Optimizations Validate Select Convert to Metadata-> Query Metadata protbuf Fact Create Joins promotions: Select Cube Estimate cost Transmit GroupBy Select Priority Json Select Optimal Determine Where grain Split size Collect data (Reducer) (Mapper) (Driver) Do Aggregate Filter Push Down Format and at record Optimize output CSV Apply Formula reconstruction. query via Fact filters. reorganizatio Update Perform Join. n status Having Dim Filter. Select/Group Generate MR Notify user Top N Partial spec. aggregation
  • 11. What worked Efficiency in modeling and joins Solid data modeling. Wasteful to perform joins on the fly. Single-stage MR to both group and join. Map side metadata joins – efficient horizontal, vertical & filtered data load. Pre-join metadata once.
  • 12. What worked cont.. Simplicity: Transparent Cube and Aggregate selection (no From or Join clause). Ability to absorb data dynamics. Intuitive query builder. Analytics - not ‘just’ query. Support for ‘scheduled’ ad-hoc queries.