
Apache Falcon - Simplifying Managing Data Jobs on Hadoop


  1. Apache Falcon (Incubating): Data Management Platform on Hadoop. Srikanth Sundarrajan, Venkatesh Seetharam.
  2. whoami (© Hortonworks Inc. 2011, Architecting the Future of Big Data)
      - Srikanth Sundarrajan: Principal Architect, InMobi; PMC/Committer, Apache Falcon; Apache Hadoop contributor; Hadoop team @ Yahoo!
      - Venkatesh Seetharam: Architect/Developer, Hortonworks Inc.; Apache Falcon Committer, IPMC; Apache Knox Committer; Apache Hadoop, Sqoop, and Oozie contributor; Hadoop team at Yahoo! since 2007; built 2 generations of data management at Yahoo!
  3. Agenda: 1. Motivation, 2. Falcon Overview, 3. Falcon Architecture, 4. Case Studies.
  5. Data Processing Landscape: data is acquired (imported) from external data sources, transformed through processing pipelines, replicated (copied) across clusters, and eventually exported, archived, or evicted.
  6. Core Services
      - Process Management: relays, late data handling, retries
      - Data Management: import/export, replication, retention
      - Data Governance: lineage, audit, SLA
  8. Holistic Declaration of Intent (picture courtesy: http://bigboxdetox.com)
  9. Entity Dependency Graph: a Process depends on Feeds, and Feeds depend on Clusters (Hadoop / HBase …) and external data sources.
  10. Cluster Specification
      <?xml version="1.0"?>
      <cluster colo="NJ-datacenter" description="" name="prod-cluster">
        <interfaces>
          <!-- readonly: needed by distcp for replications -->
          <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
          <!-- write: writing to HDFS -->
          <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
          <!-- execute: used to submit processes as MR -->
          <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
          <!-- workflow: submit Oozie jobs -->
          <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
          <!-- registry: Hive metastore to register/deregister partitions
               and get events on partition availability -->
          <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0"/>
          <!-- messaging: used for alerts -->
          <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
        </interfaces>
        <!-- HDFS directories used by the Falcon server -->
        <locations>
          <location name="staging" path="/apps/falcon/prod-cluster/staging"/>
          <location name="temp" path="/tmp"/>
          <location name="working" path="/apps/falcon/prod-cluster/working"/>
        </locations>
      </cluster>
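The cluster entity is plain XML, so its interfaces can be inspected with any XML parser. A minimal sketch using Python's standard library, mirroring the element names in the specification above (`interface_endpoints` is an illustrative helper, not a Falcon API):

```python
# Look up Falcon cluster interface endpoints by type using the
# standard-library XML parser. The embedded XML follows the cluster
# specification shown on the slide (abridged to four interfaces).
import xml.etree.ElementTree as ET

CLUSTER_XML = """<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
  </interfaces>
</cluster>"""

def interface_endpoints(cluster_xml):
    """Map each interface type to its endpoint."""
    root = ET.fromstring(cluster_xml)
    return {i.get("type"): i.get("endpoint")
            for i in root.find("interfaces")}

endpoints = interface_endpoints(CLUSTER_XML)
print(endpoints["write"])     # hdfs://nn:8020
print(endpoints["workflow"])  # http://os:11000/oozie/
```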
  11. Feed Specification
      <?xml version="1.0"?>
      <feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
        <!-- feed run frequency in mins/hrs/days/months -->
        <frequency>hours(1)</frequency>
        <!-- late arrival cut-off -->
        <late-arrival cut-off="hours(6)"/>
        <!-- feeds can belong to multiple groups -->
        <groups>churnAnalysisFeeds</groups>
        <!-- metadata tagging -->
        <tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
        <!-- one or more source and target clusters for retention and replication -->
        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
          <cluster name="cluster-secondary" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            <retention limit="days(7)" action="delete"/>
          </cluster>
        </clusters>
        <!-- global location across clusters: HDFS paths or Hive tables -->
        <locations>
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>
        <!-- access permissions -->
        <ACL owner="hdfs" group="users" permission="0755"/>
        <schema location="/none" provider="none"/>
      </feed>
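Falcon materializes one feed instance per frequency interval by filling the ${YEAR}/${MONTH}/${DAY}/${HOUR} variables in the location path. A hedged sketch of that substitution (`materialize` is an illustrative helper, not Falcon code):

```python
# Expand a Falcon-style feed location template into a concrete HDFS
# path for one instance time. Zero-padding matches the usual
# YYYY-MM-DD-HH layout shown in the feed specification above.
from datetime import datetime

def materialize(template, instance):
    """Substitute ${YEAR}/${MONTH}/${DAY}/${HOUR} for one instance time."""
    return (template
            .replace("${YEAR}", f"{instance.year:04d}")
            .replace("${MONTH}", f"{instance.month:02d}")
            .replace("${DAY}", f"{instance.day:02d}")
            .replace("${HOUR}", f"{instance.hour:02d}"))

path = materialize("/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}",
                   datetime(2012, 1, 1, 5))
print(path)  # /weblogs/2012-01-01-05
```

With an hours(1) frequency, each hourly tick yields one such path, which is the unit that retention, replication, and process inputs operate on.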
  12. Process Specification
      <process name="process-test" xmlns="uri:falcon:process:0.1">
        <!-- which cluster the process runs on, and when -->
        <clusters>
          <cluster name="cluster-primary">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
          </cluster>
        </clusters>
        <!-- how frequently the process runs, how many instances can run
             in parallel, and in what order -->
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>days(1)</frequency>
        <!-- input and output feeds for the process -->
        <inputs>
          <input start="today(0,0)" end="today(0,0)" feed="feed-clicks-raw" name="input"/>
        </inputs>
        <outputs>
          <output instance="now(0,2)" feed="feed-clicks-clean" name="output"/>
        </outputs>
        <!-- the processing logic -->
        <workflow engine="pig" path="/apps/clickstream/clean-script.pig"/>
        <!-- retry policy on failure -->
        <retry policy="periodic" delay="minutes(10)" attempts="3"/>
        <!-- handling late input feeds -->
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="input" workflow-path="/apps/clickstream/late"/>
        </late-process>
      </process>
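Given the validity window and frequency above (days(1) from 2011-11-02 to 2011-12-30), the scheduler derives one nominal instance per day, run in FIFO order with at most `parallel` instances at once. An illustrative sketch of that derivation (not Falcon or Oozie code):

```python
# Enumerate nominal process instance times from a validity window and a
# frequency, as a scheduler would: start inclusive, end exclusive.
from datetime import datetime, timedelta

def instance_times(start, end, frequency, limit=None):
    """Nominal run times from start (inclusive) up to end (exclusive)."""
    t, out = start, []
    while t < end and (limit is None or len(out) < limit):
        out.append(t)
        t += frequency
    return out

runs = instance_times(datetime(2011, 11, 2), datetime(2011, 12, 30),
                      timedelta(days=1))
print(runs[0])    # 2011-11-02 00:00:00
print(len(runs))  # 58 daily instances in the validity window
```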
  13. Late Data Handling
      - Defines how late (out-of-band) data is handled.
      - Each feed can define a late cut-off value:
        <late-arrival cut-off="hours(4)"/>
      - Each process can define how this late data is handled:
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="input" workflow-path="/apps/clickstream/late"/>
        </late-process>
      - Policies include: backoff, exp-backoff, final.
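The cut-off draws a line per feed instance: data arriving after the instance's nominal time but within the cut-off can trigger the late workflow, while data beyond the cut-off is no longer considered. A sketch of that rule (the helper and its return labels are illustrative, not Falcon semantics verbatim):

```python
# Classify an arrival against a feed instance's nominal time and the
# feed's late-arrival cut-off, e.g. cut-off="hours(4)".
from datetime import datetime, timedelta

def classify_arrival(nominal, arrival, cutoff):
    """Return 'on-time', 'late', or 'beyond-cutoff' for one feed instance."""
    if arrival <= nominal:
        return "on-time"
    return "late" if arrival - nominal <= cutoff else "beyond-cutoff"

nominal = datetime(2014, 4, 1, 0, 0)
print(classify_arrival(nominal, nominal + timedelta(hours=2),
                       timedelta(hours=4)))  # late
print(classify_arrival(nominal, nominal + timedelta(hours=8),
                       timedelta(hours=4)))  # beyond-cutoff
```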
  14. Retry Policies
      - Each process can define a retry policy:
        <process name="[process name]">
          ...
          <retry policy="[retry policy]" delay="[retry delay]" attempts="[attempts]"/>
          e.g. <retry policy="backoff" delay="minutes(10)" attempts="3"/>
          ...
        </process>
      - Policies include: backoff, exp-backoff.
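The two policies differ only in how the delay grows between attempts. A sketch of the resulting delay schedules; the exponential doubling rule here is an assumption about exp-backoff's semantics, not taken from the Falcon source:

```python
# Delay (in minutes) before each retry attempt under the two policies:
# "backoff" waits a fixed delay; "exp-backoff" is assumed to double the
# delay on every attempt.
def retry_delays_minutes(policy, delay_min, attempts):
    """List the per-attempt retry delays in minutes."""
    if policy == "backoff":
        return [delay_min] * attempts
    if policy == "exp-backoff":
        return [delay_min * (2 ** i) for i in range(attempts)]
    raise ValueError(f"unknown policy: {policy}")

print(retry_delays_minutes("backoff", 10, 3))      # [10, 10, 10]
print(retry_delays_minutes("exp-backoff", 10, 3))  # [10, 20, 40]
```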
  15. Lineage
  16. Falcon: One-stop Shop for Data Management
      - Falcon provides: multi-cluster management, replication, scheduling, data reprocessing, dependency management, eviction, governance.
      - It orchestrates existing tools to meet these needs: Oozie, Sqoop, DistCp, Flume, Map/Reduce, Hive and Pig jobs.
      Falcon provides a single interface to orchestrate the data lifecycle; sophisticated DLM is easily added to Hadoop applications.
  18. High Level Architecture: entities are submitted to the Falcon server over CLI/REST and persisted in the config store; Falcon orchestrates them through Oozie, tracks entity status, and receives process status/notifications over JMS messaging, integrating with HCatalog and HDFS.
  19. Feed Schedule: the cluster and feed XML are submitted to Falcon, which records them in its config store/graph and generates retention/replication workflows for the Oozie scheduler to run against HDFS and the catalog service; a JMS notification per action drives instance management.
  20. Process Schedule: the cluster/feed and process XML are submitted to Falcon, which records them in its config store/graph and generates the process workflow for the Oozie scheduler to run against HDFS and the catalog service; a JMS notification per available feed drives instance management.
  21. Physical Architecture
      - STANDALONE: single data center, single Falcon server; Hadoop jobs and the relevant processing involve only one cluster.
      - DISTRIBUTED: multiple data centers, one Falcon server per DC, multiple Hadoop clusters and workflow schedulers; a Falcon Prism server fronts the per-site standalone servers, with replication between the sites' store-and-process clusters.
  22. CASE STUDY: Multi Cluster Failover
  23. Multi Cluster Failover
      > Falcon manages workflow, replication, or both.
      > Enables business continuity without requiring full data reprocessing.
      > Failover clusters require less storage and CPU.
      The primary Hadoop cluster holds staged, cleansed, conformed, and presented data; replication copies the staged and presented data to the failover cluster, and the presented data backs BI and analytics.
  24. Retention Policies
      Staged Data: retain 5 years. Cleansed Data: retain 3 years. Conformed Data: retain 3 years. Presented Data: retain last copy only.
      > Sophisticated retention policies expressed in one place.
      > Simplify data retention for audit, compliance, or data re-processing.
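A retention limit such as days(2) with action="delete" means instances older than the limit become eviction candidates. A sketch of that selection over dated feed instances (the helper is illustrative, not Falcon's eviction code):

```python
# Select feed instances older than the retention horizon (now - limit)
# as candidates for the "delete" retention action.
from datetime import datetime, timedelta

def eviction_candidates(instances, now, limit):
    """Instance times strictly older than now - limit should be evicted."""
    horizon = now - limit
    return [t for t in instances if t < horizon]

now = datetime(2014, 4, 10)
instances = [now - timedelta(days=d) for d in range(5)]  # today .. 4 days old
old = eviction_candidates(instances, now, timedelta(days=2))
print(len(old))  # 2 instances (3 and 4 days old) are evicted
```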
  25. CASE STUDY: Distributed Processing. Example: Digital Advertising @ InMobi
  26. Processing in a Single Data Center: ad request data, impression render events, click events, and conversion events arrive via continuous streaming (minutely); enrichment runs minutely/5-minutely; a summarizer produces the hourly summary.
  27. Global Aggregation: each data center (DataCenter1 … DataCenterN) runs the same pipeline of continuous streaming (minutely), enrichment (minutely/5-minutely), and hourly summarization; the per-datacenter summaries are then combined into a consumable global aggregate.
  28. HIGHLIGHTS
  29. Future: data governance, data pipeline designer, authorization, monitoring/management dashboard.
  30. Summary
  31. Questions?
      - Apache Falcon: http://falcon.incubator.apache.org, dev@falcon.incubator.apache.org
      - Srikanth Sundarrajan: sriksun@apache.org, #sriksun
      - Venkatesh Seetharam: venkatesh@apache.org, #innerzeal