River
A data workflow management system


Harel Ben Attia
Senior Software Engineer
– Tens of Billions of Recommendations per month
– Most major publishers in the World
– Hundreds of GBs of new data every day
Context
• Data Processing Workflows

• Multiple Types of Processing
  – Rollups, Grouping, Filtering, Algorithm
    Calculations


• Multiple Stages of Processing
  – Using the output of other processes as input
Problems
• Dependency “Management”
  – Hardcoded into code/scripts
  – Time-based using cron or another scheduler


• Logic is scattered around the system
  – Developers need to take care of
    monitoring, alerts, permissions etc.
  – Multiple Locations of Execution
River


Data Processing
 Management
 Infrastructure
River
• Execution Management (for Ops / NOC)
   – Full Execution History and Filtering
   – Monitoring and Actionable Alerting
   – Automatic Retries
   – Web UI

• Ease of Development (for Developers)
   – Declarative Data Processing Definitions
   – Decentralized
        • Shared Data, separate development
   – JobLogs

• Data Driven Dependencies
   – Why?
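Other Approaches
[Diagram: jobs A, B and C, with two options for placing job J. Option 1 shows A, B and C on a timeline t with J drawn beneath them; Option 2 shows the order A, J, B, C.]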
Other Approaches
[Diagram: Option 2, with jobs in the order A, J, B, C on a timeline t.]
Other Approaches
D Fails
D sends email
Developer of D still works here
Where is the code?
Other Approaches
2am is a great hour for troubleshooting!
D: Data from C is missing…
C: The data of C is all there!
Other Approaches
X:37 seems like a good time… C never finished after X:30 anyway
[Diagram: jobs A, B and C on a timeline t, followed by D …]
Job J has been working for more than a week before the incident
Other Approaches


Need to rerun processes B, C and D



• Which hours failed?
• How to run all of them for the specific hours?
  – Without running A again?
  – Without colliding with ongoing executions?
Other Approaches
“A will never take more than 15 minutes, so X:20 is more than enough”
[Diagram: timeline t with A at X:00 and J later on.]
A WILL eventually take longer
River
• Execution Management
   –   Full Execution History + Filtering and Searching
   –   Monitoring and Actionable Alerting
   –   Automatic Retries
   –   Web UI
   –   JobLogs

• Ease of Development
   – Declarative Data Processing Definitions
   – Decentralized
        • Shared Data, separate development


• Data Driven Dependencies
   – Why? Robustness, Reliability, Parallelism
River

What?            When?




Where?           How?
Execution Layer – the “What”
       Every data processing task is called a Job

           A Job can contain multiple Steps

•   Importing from MySQL to Hive
•   Hive Queries
•   JDBC Queries
•   Transfer data from Hive into MySQL and to Cassandra
•   Running External Commands:
    MapReduce, Java, bash, Legacy code, etc.

                  Jobs use Parameters
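
The Job / Step / Parameters relationship above can be pictured with a small Python sketch. River's real job definitions are not shown in these slides, so the class names `Step` and `Job`, the `run` method, and the job and parameter names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, str]


@dataclass
class Step:
    """One unit of work inside a job (hypothetical model, not River's API)."""
    name: str
    action: Callable[[Params], str]  # receives the job's parameters


@dataclass
class Job:
    """A data processing task made of ordered steps sharing one parameter set."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, params: Params) -> List[str]:
        # Every step sees the same parameters, e.g. the date and hour being handled.
        return [step.action(params) for step in self.steps]


job = Job("hourlyRollup", [
    Step("import", lambda p: f"import rows for {p['date']} hour {p['hour']}"),
    Step("rollup", lambda p: f"aggregate {p['date']} hour {p['hour']}"),
])
results = job.run({"date": "2012-10-10", "hour": "15"})
```

Running the same job for another hour only means passing different parameters; the steps themselves do not change.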
Scheduling Layer – the “When”
Each job registers to an event, which will trigger its execution
Each job emits an event at job completion

• Events that describe Data Availability
• Events that are time dependent
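
To make the register/emit idea concrete, here is a minimal Python sketch of event-driven chaining. It is not River's scheduler: the `register` and `fire` functions, the event names, and the assumption that registrations form no cycles are all illustrative.

```python
from collections import defaultdict, deque
from typing import Dict, List

# event name -> jobs registered to it (all names here are hypothetical)
registrations: Dict[str, List[str]] = defaultdict(list)
# job name -> event the job emits when it completes
emits: Dict[str, str] = {}


def register(job: str, on_event: str, emits_event: str) -> None:
    registrations[on_event].append(job)
    emits[job] = emits_event


def fire(event: str) -> List[str]:
    """Return the jobs in the order they would run once `event` is fired."""
    order: List[str] = []
    pending = deque([event])
    while pending:  # assumes the registrations contain no cycles
        current = pending.popleft()
        for job in registrations.get(current, []):
            order.append(job)           # the job "executes" here
            pending.append(emits[job])  # completion emits the job's own event
    return order


register("A", on_event="hourly_tick", emits_event="A_data_ready")
register("B", on_event="A_data_ready", emits_event="B_data_ready")
register("C", on_event="B_data_ready", emits_event="C_data_ready")
register("J", on_event="A_data_ready", emits_event="J_data_ready")

run_order = fire("hourly_tick")
```

In this sketch only A depends on the clock. B and J start when A's data is ready, however long A takes, which is the contrast with the fixed X:20 style of scheduling criticized earlier.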
The “How” and the “Where”
Both handled by the infrastructure

• Integration to other systems
   – Connecting to Hive/Hadoop/Cassandra
   – Connecting to JDBC Databases
   – Retries, throttling, timeouts
   – Logical names to all data sources: “readOnlyDataWarehouse”, “productionCassandra”

• Monitoring and Alerts
   – Centralized Management, email notifications and dashboards

• Location of Execution
   – Actual location is hidden from the developer/ops
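
A rough Python sketch of logical data source names follows. Only the two logical names come from the slide; the `DataSource` type, the `resolve` function, and the addresses are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DataSource:
    kind: str     # e.g. "jdbc" or "cassandra"
    address: str  # physical location, known only to the infrastructure


# Hypothetical registry owned by the infrastructure; the addresses are made up.
DATA_SOURCES: Dict[str, DataSource] = {
    "readOnlyDataWarehouse": DataSource(
        "jdbc", "jdbc:mysql://dwh-replica.example.internal/dwh"),
    "productionCassandra": DataSource(
        "cassandra", "cassandra-prod.example.internal:9042"),
}


def resolve(logical_name: str) -> DataSource:
    """Map a logical name used in a job definition to its physical source."""
    try:
        return DATA_SOURCES[logical_name]
    except KeyError:
        raise LookupError(f"unknown data source: {logical_name}") from None


source = resolve("readOnlyDataWarehouse")
```

Because job definitions would mention only the logical name, moving a database changes one registry entry and no jobs.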
River UI
[Screenshot with callouts: Download JobLog; Fail Job and Dependents; Restart Job]
Monitoring Dashboard
Steps

Copy Data From JDBC to Hive
sourceDB = “productionDatabase”
sourceTable = “myRawData”
targetCluster = “onlineHadoopCluster”
targetHiveTable = “rawDataTable”
Filter = “date=#handledDate#”




     Steps only contain what needs to be done
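
One way to picture how a placeholder such as `#handledDate#` might be filled in at run time is the Python sketch below. The field names are taken from the slide; the `#name#` matching rule and the `substitute` and `resolve_step` helpers are assumptions, not River's actual mechanism.

```python
import re
from typing import Dict

# The step fields from the slide, written as a plain dictionary.
step = {
    "sourceDB": "productionDatabase",
    "sourceTable": "myRawData",
    "targetCluster": "onlineHadoopCluster",
    "targetHiveTable": "rawDataTable",
    "Filter": "date=#handledDate#",
}

# Assumed placeholder syntax: a parameter name wrapped in '#' characters.
PLACEHOLDER = re.compile(r"#(\w+)#")


def substitute(value: str, params: Dict[str, str]) -> str:
    """Replace each #name# with params[name]; unknown names raise KeyError."""
    return PLACEHOLDER.sub(lambda match: params[match.group(1)], value)


def resolve_step(definition: Dict[str, str], params: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the step definition with all placeholders filled in."""
    return {key: substitute(value, params) for key, value in definition.items()}


resolved = resolve_step(step, {"handledDate": "2012-10-10"})
```

Here `resolved["Filter"]` becomes `date=2012-10-10`, while fields without placeholders pass through unchanged.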
A bit more about triggers
Triggers have parameters as well
[Diagram: triggers labeled Date=2012-10-10,hour=15 and Date=2012-10-10,hour=19]
Parameters propagate through jobs and to other triggers
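
Parameter propagation can be sketched in a few lines of Python. The `Trigger` tuple, the `run_job` function, and the trigger and job names are hypothetical; the sketch only shows the parameters of an incoming trigger being carried onto the trigger a job emits.

```python
from typing import Dict, Tuple

Params = Dict[str, str]
Trigger = Tuple[str, Params]  # (trigger name, parameters)


def run_job(job_name: str, incoming: Trigger, emits: str) -> Trigger:
    """Pretend to run `job_name`, then emit its completion trigger."""
    _, params = incoming
    # The job itself would read params["Date"] and params["hour"] here.
    # The emitted trigger gets a copy, so later jobs cannot alter earlier triggers.
    return (emits, dict(params))


first: Trigger = ("rawDataArrived", {"Date": "2012-10-10", "hour": "15"})
second = run_job("Job1", first, emits="job1Done")
third = run_job("Job2", second, emits="job2Done")
```

The developer of Job2 never states which hour to process; it arrives with the trigger.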
Developer’s Point-of-View
• Automatic Retries
• Parameters Pass-through
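
A minimal Python sketch of automatic retries, as seen from a developer's side: the step is written once and the surrounding infrastructure re-runs it on failure. The `run_with_retries` helper and the default of three attempts are assumptions; the slides do not describe River's retry policy.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(action: Callable[[], T], max_attempts: int = 3) -> T:
    """Call `action` until it succeeds or `max_attempts` is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
    raise AssertionError("unreachable")


attempts: List[int] = []


def flaky_step() -> str:
    """Fails twice (simulating an environment issue), then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("cluster temporarily unavailable")
    return "done"


outcome = run_with_retries(flaky_step)
```

A production version would likely wait between attempts and retry only on errors known to be transient; both are left out to keep the sketch short.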
[Architecture diagram: River, with its Trigger Queue, Execution Queue, Trigger Manager, Execution Manager, Topology, Spring Batch and Spring Batch DB. Hive/Hadoop, OS, Cassandra and JDBC interfaces connect River to External Systems.]
Dependencies
for detailed example
Success Example
[Diagram: the River components (Trigger Queue, Trigger Manager, Execution Queue, Execution Manager, Topology, Spring Batch, Spring Batch DB) and the Hive/Hadoop, OS, Cassandra and JDBC interfaces to External Systems. Triggers T1, T2 and T3 and jobs Job1, Job2 and Job3 appear in the queues with parameters Date=2012-01-02, hour=03. The Topology lists Job1,Job2 and Job3, and two items are marked “(from Job1)” and “(from Job2)”.]
Failure Example
[Diagram: the River components (UI, Trigger Queue, Trigger Manager, Execution Queue, Execution Manager, Topology, Spring Batch, Spring Batch DB) and the Hive/Hadoop, OS, Cassandra and JDBC interfaces to External Systems. Trigger T3 and jobs Job2 and Job3 appear with parameters Date=2012-01-02, hour=03; Job2 is shown several times.]
Notable Features
• Parameter Enrichment
   – Example: #beginningOfMonth

• Precondition Expressions
   – Example: isLastDayOfMonth(#handleDate)

• Data Comparison Capabilities
   – Data Validations
   – Supports Tolerance
      • Absolute and Percentage margins


• Command Line and Java Clients
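
The first three features can be illustrated with short Python functions. The function names echo the slide's `#beginningOfMonth` and `isLastDayOfMonth` examples, but the signatures and the rule that a comparison passes when the difference fits within either margin are assumptions, not River's documented behavior.

```python
import calendar
from datetime import date


def beginning_of_month(handled: date) -> date:
    """Parameter enrichment: derive the first day of the handled date's month."""
    return handled.replace(day=1)


def is_last_day_of_month(handled: date) -> bool:
    """Precondition: true only on the final day of the month."""
    last_day = calendar.monthrange(handled.year, handled.month)[1]
    return handled.day == last_day


def within_tolerance(expected: float, actual: float,
                     absolute: float = 0.0, percentage: float = 0.0) -> bool:
    """Data comparison: accept a difference that fits within either margin."""
    difference = abs(expected - actual)
    return difference <= absolute or difference <= abs(expected) * percentage / 100.0


handled_date = date(2012, 10, 10)
month_start = beginning_of_month(handled_date)
run_monthly_job = is_last_day_of_month(handled_date)
row_counts_match = within_tolerance(1000, 1004, absolute=5)
```

A monthly rollup could then be registered to the same hourly or daily event as other jobs and simply be skipped whenever its precondition is false.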
River at Outbrain
•   6 River Instances Running
•   5 Teams
•   ~4100 Jobs running every day
•   ~50 Different Job Types

• Job Failures due to environment issues have
  almost no overhead
• Automatic restarts of jobs when data arrives late
Future Plans
•   Multiple Dependencies
•   Offline Job Testing Capabilities
•   Improved DSL for Job Definitions
•   Support for Master/Worker River machines
•   Job Priorities
•   Analysis Tools

Outbrain is working on Open Sourcing River

Illustration by Chris Whetzel
Questions
Thank You


Harel Ben Attia                    @harelba on Twitter
harel@outbrain.com   http://www.linkedin.com/in/harelba

Outbrain River Presentation at Reversim Summit 2013
