3. What is in store for you?
● Some history
● Motivation for Apache Falcon
● Overview of Apache Falcon
● How Apache Falcon is used @ InMobi
● Roadmap
● Q&A
5. Life was simple...
[Diagram: a simple pipeline — Click Logs → Click Enhancer → Enhanced Clicks → Hourly Aggregation → Hourly Clicks → Daily Aggregation → Daily Clicks, annotated with per-feed metadata: retention of 2 hours at a 5-minute frequency with late data arrival; retention of 2 days with replication required; retention of 1 day; retention of 7 days with replication required]
● Cron jobs to delete/archive data.
● Cron jobs to copy data from one cluster to another.
● Email notifications and manual retries in case of failures.
6. But, it didn’t remain that way...
❏ Failures
❏ Data arriving late
❏ Re-processing
❏ Varied Data Replication
❏ Varied Data Retention
❏ Data Archival
❏ Lineage
❏ SLA monitoring
15. Process Relays
● Data dependencies among the processes in a pipeline.
● Can wait on imported data or replicated data.
16. Late Data Arrival
● Defines how late data is handled.
● How long to wait for the data and how to check for late data.
● Optional separate logic to handle late data processing.
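The wait window is declared on the feed itself; a minimal sketch of the feed-side element, with an illustrative feed name and cut-off value:

```xml
<!-- Feed-side late-arrival declaration (feed name and cut-off are
     illustrative). Falcon waits up to the cut-off for an instance's
     data before treating it as late. -->
<feed name="clicks-enhanced" description="enhanced clicks" xmlns="uri:falcon:feed:0.1">
  <late-arrival cut-off="hours(6)"/>
  <!-- frequency, clusters, locations, schema elided -->
</feed>
```

The process that consumes the feed can then attach its own late-process policy, as shown in the process specification.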
17. Process Specification
<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="corp">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>LIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
  </inputs>
  <outputs>
    <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
  </outputs>
  <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
  <retry policy="periodic" delay="hours(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
  </late-process>
</process>
Callouts on the spec:
● Where should the process run? (clusters)
● How should the process run? (parallel, order, frequency)
● What to consume? (inputs)
● What to produce? (outputs)
● Processing logic (workflow)
● Late data processing (late-process)
20. Data Retention As a Service
● Data grows rapidly and cannot be kept around forever; it needs to be deleted or archived.
● The retention policy is dictated by the kind of data.
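In the feed specification, retention is a per-cluster element; Falcon purges instances older than the limit. A sketch with illustrative values:

```xml
<!-- Retention declared per cluster inside a feed's <clusters> section
     (cluster name and limit are illustrative). Instances older than
     the limit are deleted by Falcon's retention service. -->
<clusters>
  <cluster name="corp" type="source">
    <validity start="2012-01-01T00:00Z" end="2032-01-01T00:00Z"/>
    <retention limit="days(7)" action="delete"/>
  </cluster>
</clusters>
```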
21. Data Replication As a Service
● Replication required for
○ Disaster Recovery.
○ Local/Global Aggregation.
● Configurable resource consumption.
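In a feed specification, replication is expressed by listing one source cluster and one or more target clusters; Falcon copies each instance from source to targets. A sketch with illustrative cluster names and values:

```xml
<!-- Replication sketch: the same feed declared on a source and a
     target cluster (names and retention values are illustrative).
     Falcon's replication service copies instances from "primary"
     to "backup"; each cluster keeps its own retention policy. -->
<clusters>
  <cluster name="primary" type="source">
    <validity start="2012-01-01T00:00Z" end="2032-01-01T00:00Z"/>
    <retention limit="days(2)" action="delete"/>
  </cluster>
  <cluster name="backup" type="target">
    <validity start="2012-01-01T00:00Z" end="2032-01-01T00:00Z"/>
    <retention limit="days(7)" action="delete"/>
  </cluster>
</clusters>
```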
25. Capabilities
● Dependency Graph
● Lineage - Data flow
○ Current
○ Historical
● SLA - Data monitoring
26. Dependency Graph
● Relationships between the various elements in a pipeline.
● Entity Dependency
● Instance Dependency
● Instance - one run of a scheduled entity
○ Feed instance
○ Process instance
27. Entity dependency
● Returns a graph depicting the relationships between the various processes and feeds in a given pipeline.
[Diagram: entity dependency graph — Click Process produces Click Feed; Impression Process produces Impression Feed; Click Enriched Process consumes Click Feed and Impression Feed and produces Enriched Feed; Summary consumes Enriched Feed]
28. Instance dependency
● Gives information about the producers and consumers of a particular instance.
[Diagram: instance dependency graph for the 02:00 run — produces/consumes edges among the 02:00 instances of Click Feed, Impression Feed, Click Enriched Feed, and Billing Feed and of the Click Enriched, Summary, and Billing processes; e.g. Click Enriched Process 02:00 consumes Click Feed 02:00 and Impression Feed 02:00 and produces Click Enriched Feed 02:00]
29. Lineage
● Logical flow/movement of data from source to destination.
● Ability to filter data on certain attributes of the entities.
● Captures the metadata tags associated with each entity as relationships.
● Falcon can capture lineage for both entities and their associated instances.
● Implemented using a graph DB.
30. Graph DB
● DAG notation.
● Stores all CRUD operations.
● Used to store information about entities, instances, and their dependencies.
● Graph APIs exposed to analyse metrics.
● Example: pipeline health check.
33. How to query...
Lineage data can be queried in three ways:
● REST API
● CLI
● Dashboard (upcoming)
34. SLA Monitoring
● Feed - alerts based on data availability.
● Process - triggering of process instances.
● Two types of SLA
○ SLA Low
○ SLA High
● Dashboard
● Pluggable alerting system
○ Email
○ JMS notifications
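On the feed side, later Falcon releases let the two thresholds be declared directly in the feed specification; a sketch with illustrative values:

```xml
<!-- Feed-side SLA sketch (values illustrative): slaLow raises a warning
     if the instance's data is not yet available that long after its
     nominal time; slaHigh raises a critical alert. -->
<sla slaLow="hours(2)" slaHigh="hours(4)"/>
```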
37. Clusters @ InMobi
● Hadoop usage at InMobi
○ ~6 clusters
○ >1 PB of storage
○ >5 TB of new data ingested each day
○ >20 TB of data crunched each day
○ >200 nodes in HDFS/MR clusters & >40 nodes in HBase
○ >175K Hadoop jobs / day
○ >60K Oozie workflows / day
○ 300+ Falcon feed definitions
○ 100+ Falcon process definitions
41. Other Notable Features
● Security
○ Authentication and authorization.
○ Ability to interface with secure endpoints.
● Recipes
○ A template that multiple processes can use.
○ HiveDR and HDFS Replication recipes come out of the box.
● Falcon Unit
○ Unit test framework for testing pipelines.
● Triage
● Audit
42. What else is out there?
gobblin - https://github.com/linkedin/gobblin
nifi - https://nifi.apache.org/
Compare and contrast - https://www.linkedin.com/pulse/nifi-vs-falcon-oozie-birender-saini
43. Towards the Future
● Pluggable lifecycle
○ Data Acquisition as a Service.
● A more powerful scheduler
○ Pipeline recovery
○ Data availability/manual trigger support
● Improved application packaging/deployment.
● Better UI
● Improved monitoring