Apache Falcon (Incubating)
Data Management Platform on Hadoop
Venkatesh Seetharam
© Hortonworks Inc. 2011
whoami
Hortonworks Inc.
–Architect/Developer
–Lead Data Management efforts
Apache
–Apache Falcon Committer, IPMC
–Apache Knox Committer
–Apache Hadoop, Sqoop, Oozie Contributor
Part of the Hadoop team at Yahoo! since 2007
–Senior Principal Architect of Hadoop Data at Yahoo!
–Built 2 generations of Data Management at Yahoo!
Agenda
1 Motivation
2 Falcon Overview
3 Falcon Architecture
4 Case Studies
MOTIVATION
Data Processing Landscape
External data source → Acquire (Import) → Data Processing (Transform/Pipeline) → Eviction / Archive / Replicate (Copy) / Export
Core Services
• Process Management: relays, late data handling, retries
• Data Management: import/export, replication, retention
• Data Governance: lineage, audit, SLA
FALCON OVERVIEW
Holistic Declaration of Intent
picture courtesy: http://bigboxdetox.com
Entity Dependency Graph
A Process depends on one or more Feeds, and both Feeds and Processes depend on a Cluster (Hadoop/HBase, …); Feeds may also draw on external data sources.
Cluster Specification

<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
    <interface type="execute" endpoint="rm:8050" version="2.2.0" />
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>
Interface annotations:
• readonly – needed by distcp for replications
• write – writing to HDFS
• execute – used to submit processes as MR
• workflow – used to submit Oozie jobs
• registry – Hive metastore, to register/deregister partitions and get events on partition availability
• messaging – used for alerts
• locations – HDFS directories used by the Falcon server
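The feed example that follows replicates from cluster-primary to cluster-secondary, so each of those names must exist as its own cluster entity. A minimal sketch of the second cluster, with every endpoint invented for illustration:

<?xml version="1.0"?>
<!-- Hypothetical second cluster, referenced as the replication target in the
     feed example below; hosts (nn2, rm2, os2, hms2, mq2) are placeholders. -->
<cluster colo="CA-datacenter" description="" name="cluster-secondary">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn2:50070" version="2.2.0" />
    <interface type="write" endpoint="hdfs://nn2:8020" version="2.2.0" />
    <interface type="execute" endpoint="rm2:8050" version="2.2.0" />
    <interface type="workflow" endpoint="http://os2:11000/oozie/" version="4.0.0" />
    <interface type="registry" endpoint="thrift://hms2:9083" version="0.12.0" />
    <interface type="messaging" endpoint="tcp://mq2:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/cluster-secondary/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/cluster-secondary/working" />
  </locations>
</cluster>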
Feed Specification
<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
Annotations:
• frequency – feed run frequency in mins/hrs/days/months
• late-arrival – late arrival cut-off
• locations – global location across clusters; HDFS paths or Hive tables (see the sketch below)
• groups – feeds can belong to multiple groups
• clusters – one or more source & target clusters for retention & replication
• ACL – access permissions
• tags – metadata tagging
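The locations annotation says a feed can be backed by Hive tables as well as HDFS paths; the editor's notes (note 6) give the catalog URI form. A sketch, with database and table names purely illustrative:

<!-- Replaces the <locations> element in the feed above.
     URI form per the editor's notes: catalog:$database:$table#(partition-key=partition-value);+ -->
<table uri="catalog:logs-db:clicks#ds=${YEAR}-${MONTH}-${DAY}" />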
Process Specification
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late" />
  </late-process>
</process>
Annotations:
• parallel/order/frequency – how frequently the process runs, how many instances can run in parallel, and in what order
• clusters – which cluster the process should run on, and when
• workflow – the processing logic
• retry – retry policy on failure
• late-process – handling late input feeds
• inputs/outputs – input & output feeds for the process (see the sketch below)
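The input window and output instance above use Falcon's expression language, which the deck does not define; the reading in the comments below is a hedged paraphrase of the Falcon docs, not taken from this deck:

<inputs>
  <!-- start="today(0,0)" end="today(0,0)": a window covering the single
       instance of feed-clicks-raw at the start of the nominal day
       (arguments are hour and minute offsets) -->
  <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
</inputs>
<outputs>
  <!-- instance="now(0,2)": the output instance resolved from the nominal
       run time, offset by 0 hours and 2 minutes -->
  <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
</outputs>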
Late Data Handling
• Defines how late (out-of-band) data is handled
• Each feed can define a late cut-off value:
  <late-arrival cut-off="hours(4)"/>
• Each process can define how this late data is handled:
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late" />
  </late-process>
• Policies include: backoff, exp-backoff, final (see the sketch below)
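A sketch of how policy and delay combine; the check times are inferred by analogy with the retry semantics in the editor's notes (note 9), so treat them as illustrative rather than definitive:

<!-- With <late-process policy="exp-backoff" delay="hours(1)">, late-data
     checks for an instance run roughly 1h, 2h then 4h after it, doubling
     until the feed's late cut-off (hours(4) above) expires.
     policy="backoff" would check at fixed multiples (1h, 2h, 3h, ...);
     policy="final" checks exactly once, at the feed's late cut-off. -->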
Retry Policies
• Each process can define a retry policy:
  <process name="[process name]">
    ...
    <retry policy="[retry policy]" delay="[retry delay]" attempts="[attempts]"/>
    <retry policy="backoff" delay="minutes(10)" attempts="3"/>
    ...
  </process>
• Policies include: backoff, exp-backoff (see the worked example below)
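A worked example of the two policies, taken directly from the editor's notes (note 9):

<!-- With delay="minutes(10)" and attempts="3":
     policy="backoff"     retries after 10, 20 and 30 minutes
     policy="exp-backoff" retries after 10, 20 and 40 minutes -->
<retry policy="backoff" delay="minutes(10)" attempts="3"/>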
Lineage
Falcon: One-stop Shop for Data Management
Apache Falcon provides (data management needs): multi-cluster management, replication, scheduling, data reprocessing, dependency management, eviction, governance.
Apache Falcon orchestrates (tools): Oozie, Sqoop, distcp, Flume, MapReduce, Hive and Pig jobs.
Falcon provides a single interface to orchestrate the data lifecycle; sophisticated DLM is easily added to Hadoop applications.
FALCON ARCHITECTURE
High Level Architecture
Entities are submitted to the Apache Falcon server over CLI/REST and persisted in a config store; Falcon drives Oozie, messaging (JMS), HCatalog and HDFS, and reports entity status and process status/notifications back over JMS.
Feed Schedule
A feed is scheduled by submitting cluster and feed XMLs to Falcon, which records them in its config store/graph and generates retention and replication workflows on the Oozie scheduler against HDFS and the catalog service; a JMS notification is emitted per action and drives instance management.
Process Schedule
A process is scheduled by submitting cluster/feed and process XMLs to Falcon, which records them in its config store/graph and generates the process workflow on the Oozie scheduler against HDFS and the catalog service; a JMS notification is emitted per available feed and drives instance management.
Physical Architecture
• STANDALONE
  – Single data center
  – Single Falcon server
  – Hadoop jobs and the relevant processing involve only one cluster
• DISTRIBUTED
  – Multiple data centers
  – One Falcon server per data center
  – Multiple instances of Hadoop clusters and workflow schedulers
[Diagram: a standalone deployment pairs one Falcon server with a site's Hadoop store & process clusters, replicating between them; the distributed deployment adds a Falcon Prism server coordinating the standalone Falcon servers at Site 1 and Site 2.]
CASE STUDY
Multi Cluster Failover
CASE STUDY
Distributed Processing
Example: Digital Advertising @ InMobi
Processing – Single Data Center
Ad request data, impression-render events, click events and conversion events arrive as continuous streams (minutely); an enrichment step (minutely/5-minutely) feeds a summarizer, which produces the hourly summary.
Global Aggregation
The same pipeline – continuous streaming (minutely), enrichment (minutely/5-minutely), summarizer, hourly summary – runs identically in each data center (DataCenter1 … DataCenterN); the per-DC hourly summaries are then combined into a consumable global aggregate.
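Editor's note 12 describes the hourly summary as a single feed with one source cluster per data center and one global target. A hedged sketch of that wiring against the feed schema shown earlier – cluster names, paths and the partition attribute usage are illustrative, not taken from the deck:

<?xml version="1.0"?>
<feed description="" name="hourly-summary" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- one source cluster per data center, identically configured -->
    <cluster name="dc1-cluster" type="source" partition="dc1">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="dcN-cluster" type="source" partition="dcN">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <!-- the global data center is the single replication target -->
    <cluster name="global-cluster" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/summary/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <!-- ACL and schema elements omitted for brevity -->
</feed>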
HIGHLIGHTS
Future
• Data Governance
• Data Pipeline Designer
• Data Acquisition – file-based
• Monitoring/Management Dashboard
Summary
Questions?
• Apache Falcon
  http://falcon.incubator.apache.org
  dev@falcon.incubator.apache.org
• Venkatesh Seetharam
  venkatesh@apache.org
  #innerzeal

Editor's Notes

  1. In a typical big data environment involving Hadoop, the use cases tend to be around processing very large volumes of data for either machine or human consumption. Some of the data that gets to the Hadoop platform can contain critical business & financial information. The data processing team in such an environment is often distracted by the multitude of data management and process orchestration challenges. To name a few: ingesting large volumes of events/streams; ingesting slowly changing data typically available in a traditional database; creating a pipeline/sequence of processing logic to extract the desired piece of insight/information; handling processing complexities relating to changes of data or failures; managing eviction of older data elements; backing up the data to an alternate location, or archiving it in cheaper storage, for DR/BCP & compliance requirements; and shipping data out of the Hadoop environment periodically for machine or human consumption. These tend to be standard challenges that are better handled in a platform, which might allow the data processing team to focus on their core business application. A platform approach also allows us to adopt best practices in solving each of these, which subsequent users of the platform can leverage.
  2. As we just noted, there are numerous data and process management services that, when made available to the data processing team, can reduce their day-to-day complexities significantly and allow them to focus on their business application. This is an enumeration of such services, which we intend to cover in adequate detail as we go along. More often than not, pipelines are sequences of data processing or data movement tasks that need to happen before raw data can be transformed into a meaningfully consumable form. Normally the end stage of the pipeline, where the final sets of data are produced, is in the critical path and may be subject to tight SLA bounds. Any step in the sequence/pipeline, if either delayed or failed, could cause the pipeline to stall. It is important that each step in the pipeline hands off to the next step, to avoid any buffering of time and to allow seamless progression of the pipeline. People who are familiar with Apache Oozie might appreciate this feature, provided through the Coordinator. As pipelines get more and more time-critical and time-sensitive, this becomes very critical and ought to be available off the shelf for application developers; it is also important for this feature to be scalable, to support the needs of concurrent pipelines. The fact that data volumes are large and increasing by the day is the reason one adopts a big data platform like Hadoop, and that would automatically mean we would run out of space pretty soon if we didn't take care of evicting & purging older instances of data. A few problems to consider for retention: avoid using a general-purpose superuser with world-writable privileges to delete old data (for obvious reasons); different types of data may require different criteria for aging and hence purging; and other lifecycle functions, like archival of old data, if defined, ought to be scheduled before eviction kicks in. Hadoop is increasingly critical for many businesses; for some users the raw data volumes are too large to be shipped to one place for processing, while for others data needs to be redundantly available for business continuity reasons. In either scenario, replication of data from one cluster to another plays a vital role, and having it available as a service would again free the application developer of these responsibilities. The key challenges to consider while offering this as a service: bandwidth consumption and management; chunking/bulking strategy; correctness guarantees; and HDFS version compatibility issues.
  3. Data lifecycle is challenging in spite of some good Hadoop tools – a patchwork of tools complicates data lifecycle management. Some of the things we have spoken about so far can be done if we took a siloed approach. For instance, it is possible to process a few data sets and produce a few more through a scheduler. However, if there are two other consumers of the data produced by the first workflow, then the same will be repeatedly defined by the other two consumers, and so on. There is serious duplication of metadata about what data is ingested, processed or produced, where it is processed and how it is produced. A single system which creates a complete view of this can provide a fairly complete picture of what is happening, compared to a collection of independently scheduled applications. Both the production support and application development teams on a Hadoop platform otherwise have to scramble and write custom scripts and monitoring systems to get a broader and holistic view of what is happening. An approach where this information is systemically collected and used for seamless management can alleviate much of the pain of folks operating or developing data processing applications on Hadoop. There is a tendency to burn in feed locations, apps, cluster locations and cluster services, but things may change over time: from where you ingest, the feed frequency, file locations, file formats, format conversions, compressions, the app, … You may end up with multiple clusters; a dataset location may be different in different clusters; some datasets and apps may move from one cluster to another; and things are slightly different in the BCP cluster.
  4. The entity graph at the core is what makes Falcon what it is, and in a way enables all the unique features that Falcon has to offer or can potentially make available in the future. At the core: dependencies between data processing logic and cluster end points, and rules governing data management, processing management and metadata management.
  5. The cluster specification is per cluster. Each cluster can have the following interfaces: readonly specifies the Hadoop hftp address; its endpoint is the value of dfs.http.address, e.g. hftp://corp.namenode:50070/. write specifies the interface to write to HDFS; its endpoint is the value of fs.default.name, e.g. hdfs://corp.namenode:8020 (use the value defined in fs.default.name). execute specifies the interface for the resource manager; its endpoint is the value of mapred.job.tracker, e.g. corp.jt:8021 (use the value defined in yarn.resourcemanager.address). workflow specifies the interface for the workflow engine; an example of its endpoint is the value of OOZIE_URL, e.g. http://corp.oozie:11000/oozie. messaging specifies the interface for sending feed availability messages; its endpoint is the broker URL with a tcp address, e.g. tcp://corp.messaging:61616?daemon=true. registry specifies the interface for HCatalog.
  6. LATE DATA – Source & target clusters: you can configure multiple source & target clusters.
     ACL tag:
       <xs:complexType name="ACL">
         <xs:annotation>
           <xs:documentation>Access control list for this feed.</xs:documentation>
         </xs:annotation>
         <xs:attribute type="xs:string" name="owner"/>
         <xs:attribute type="xs:string" name="group"/>
         <xs:attribute type="xs:string" name="permission"/>
       </xs:complexType>
     Retention policy actions:
       <xs:simpleType name="action-type">
         <xs:restriction base="xs:string">
           <xs:annotation>
             <xs:documentation>action type specifies the action that should be taken on a feed when the retention period of a feed expires on a cluster; the valid actions are archive, delete, chown and chmod.</xs:documentation>
           </xs:annotation>
           <xs:enumeration value="archive"/>
           <xs:enumeration value="delete"/>
           <xs:enumeration value="chown"/>
           <xs:enumeration value="chmod"/>
         </xs:restriction>
       </xs:simpleType>
     Specifying locations:
       <xs:complexType name="location">
         <xs:annotation>
           <xs:documentation>location specifies the type of location (data, meta, stats) and the corresponding paths for them. A feed should at least define the location for type data, which specifies the HDFS path pattern where the feed is generated periodically, e.g. type="data" path="/projects/TrafficHourly/${YEAR}-${MONTH}-${DAY}/traffic".</xs:documentation>
         </xs:annotation>
         <xs:attribute type="location-type" name="type" use="required"/>
         <xs:attribute type="xs:string" name="path" use="required"/>
     Each location has a type:
       <xs:simpleType name="location-type">
         <xs:restriction base="xs:string">
           <xs:enumeration value="data"/>
           <xs:enumeration value="stats"/>
           <xs:enumeration value="meta"/>
           <xs:enumeration value="tmp"/>
         </xs:restriction>
       </xs:simpleType>
     Specifying Hive tables:
       <xs:complexType name="catalog-table">
         <xs:annotation>
           <xs:documentation>catalog specifies the URI of a Hive table along with the partition spec: uri="catalog:$database:$table#(partition-key=partition-value);+", e.g. catalog:logs-db:clicks#ds=${YEAR}-${MONTH}-${DAY}.</xs:documentation>
         </xs:annotation>
         <xs:attribute type="xs:string" name="uri" use="required"/>
       </xs:complexType>
  7. A process defines configuration for a workflow. A workflow is a directed acyclic graph (DAG) which defines the job for the workflow engine. A process definition defines the configurations required to run the workflow job: for example, the frequency at which the workflow should run, the clusters on which it should run, the inputs and outputs for the workflow, how workflow failures should be handled, how late inputs should be handled, and so on. Process-level validity – how long the process itself is valid; each cluster specified within a process in turn has a validity, which gives the times between which the job should run on that cluster. Parallel – how many instances of the process can run in parallel; a new instance is started every time the process is kicked off based on the specified frequency. Order – the order in which ready instances of the process are picked up, mostly FIFO. Timeout – on a per-instance basis.
  8. A certain class of applications – SLA-critical, machine-consumable data (with some tolerance for error) – isn't affected much if some small percentage of data arrives late; examples include forecasting, predictions, risk management, etc. However, for applications with a "close of books" notion, which are human-consumable and used for factual reporting whose results may be subject to audit, it is not acceptable to ignore data that arrived out of order or late. Late data handling defines how late data should be handled. Each feed is defined with a late cut-off value which specifies the time until which late data is valid: for example, a late cut-off of hours(6) means that data for the nth hour can be delayed by up to 6 hours. The late-data specification in the process defines how this late data is handled. The late-data policy defines how frequently a check is done to detect late data; the supported policies are backoff, exp-backoff (exponential backoff) and final (at the feed's late cut-off). The policy, along with the delay, defines the interval at which the late-data check is done. The late-input specification for each input defines the workflow that should run when late data is detected for that input.
  9. The workflow is re-tried after 10 mins, 20 mins and 30 mins. With exponential backoff, the workflow will be re-tried after 10 mins, 20 mins and 40 mins.
  10. Falcon provides the key services data processing applications need. Complex data processing logic is handled by Falcon instead of being hard-coded in apps, giving faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
  11. The system accepts entities using a DSL (infrastructure, datasets, pipeline/processing logic) and transforms the input into automated and scheduled workflows. It orchestrates the workflows, instruments execution of configured policies, handles retry logic and late data processing, and records audit and lineage, with seamless integration with the metastore/catalog (WIP); it provides notifications based on availability. Integrated: a seamless experience for users; automates processing and tracks end-to-end progress; dataset management (replication, retention, etc.) offered as a service; users can cherry-pick, with no coupling between primitives; provides hooks for monitoring and metrics collection.
  12. Ad request, click, impression and conversion feeds: minutely (with identical location and retention configuration, but across many data centers). Summary data: hourly (with multiple partitions – one per DC, each configured as a source, and one target, which is the global data center). Click, impression and conversion enrichment & summarizer processes: a single definition with multiple data centers, with identical periodicity and scheduling configuration.