1. Page 1 © Hortonworks Inc. 2014
Discover HDP 2.1
Apache Falcon for Data Governance in Hadoop
Hortonworks. We do Hadoop.
2. Page 2 © Hortonworks Inc. 2014
Speakers
Justin Sears
Hortonworks Product Marketing Manager
Himanshu Bari
Hortonworks Senior Product Manager & PM for Apache
Falcon & Apache Storm in Hortonworks Data Platform
Venkatesh Seetharam
Foundational Hadoop Architect, Engineer & Committer for
Apache Falcon and Apache Knox Gateway projects
3. Page 3 © Hortonworks Inc. 2014
Agenda
• Why You Need Apache Falcon
• Key New Falcon Features
• Demo
– Defining data pipelines
– Policies for retention
– Managing Falcon server with Apache Ambari
4. Page 4 © Hortonworks Inc. 2014
A Modern Data Architecture
• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DEV & DATA TOOLS: Build & Test
• OPERATIONS TOOLS: Provision, Manage & Monitor
• DATA SYSTEM
– REPOSITORIES: RDBMS, EDW, MPP
– ENTERPRISE HADOOP: Governance & Integration, Security, Operations, Data Access, Data Management
• SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data
5. Page 5 © Hortonworks Inc. 2014
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
• GOVERNANCE & INTEGRATION
– Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
• DATA ACCESS
– Batch: MapReduce
– Script: Pig
– SQL: Hive/Tez, HCatalog
– NoSQL: HBase, Accumulo
– Stream: Storm
– Search: Solr
– Others: In-Memory Analytics, ISV engines
• DATA MANAGEMENT
– YARN: Data Operating System
– HDFS (Hadoop Distributed File System), nodes 1 ... N
• SECURITY
– Authentication, Authorization, Accounting, Data Protection
– Storage: HDFS
– Resources: YARN
– Access: Hive, …
– Pipeline: Falcon
– Cluster: Knox
• OPERATIONS
– Provision, Manage & Monitor: Ambari, Zookeeper
– Scheduling: Oozie
7. Page 7 © Hortonworks Inc. 2014
Outline
Falcon
Overview
Features
Architecture
& Demo
8. Page 8 © Hortonworks Inc. 2014
Simple Data Pipeline in Hadoop
• Simple data pipeline in the HDFS data lake: Data Sources → Raw Data → (MR/Pig/Hive) → Clean Data → (MR/Pig/Hive) → Prepped Data → BI Tools
• Has a relatively simple Oozie workflow: Job1, Job2, Job3, ... JobN
9. Page 9 © Hortonworks Inc. 2014
Quickly Gets Complicated…
Typical data governance requirements on the Raw → Clean → Prep pipeline:
• Data stewards: impact analysis, monitor pipeline, track ownership, late data & failure handling
• Compliance teams: audit, retention, eviction
• IT admins: monitor infra, replication, archival
• Business & data analysts: verify data quality
Each requirement is manually written & wired into multiple complex Oozie workflows and other Hadoop tools (e.g. DistCp).
10. Page 10 © Hortonworks Inc. 2014
Apache Falcon to the Rescue
• The data pipeline (Raw → Clean → Prep) is defined in Falcon.
• Falcon auto-generates & orchestrates the multiple complex Oozie workflows and other Hadoop ecosystem tools (e.g. DistCp).
• Falcon adds the required data governance features:
– DEFINITION: Replication | Retention | Eviction | Late data
– MONITORING
– TRACING: Audit | Lineage | Tagging
12. Page 12 © Hortonworks Inc. 2014
Falcon Basic Concepts
• Cluster: Represents the “interfaces” to a Hadoop cluster
• Feed: Defines a “dataset”, so feeds are a.k.a. “datasets”
• Process: Consumes feeds, invokes processing logic & produces feeds
All these put together represent “Data Pipelines” in Hadoop.
Diagram: CLUSTER, FEED (aka DATASET), PROCESS. A feed is INPUT TO a process; a process CREATES a feed.
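The three concepts can be sketched as plain data types. This is a minimal illustration only: the class names, fields, and the `describe_pipeline` helper are hypothetical and are not part of Falcon's API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster entity: a named set of interfaces."""
    name: str
    interfaces: dict = field(default_factory=dict)  # keys and values are illustrative


@dataclass
class Feed:
    """Hypothetical stand-in for a feed (dataset) stored on a cluster."""
    name: str
    cluster: Cluster


@dataclass
class Process:
    """Hypothetical stand-in for a process that consumes and produces feeds."""
    name: str
    cluster: Cluster
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def describe_pipeline(processes):
    """Return one 'inputs -> process -> outputs' line per process."""
    lines = []
    for p in processes:
        ins = ", ".join(f.name for f in p.inputs) or "(none)"
        outs = ", ".join(f.name for f in p.outputs) or "(none)"
        lines.append(f"{ins} -> {p.name} -> {outs}")
    return lines


primary = Cluster("primary")
raw, clean = Feed("raw", primary), Feed("clean", primary)
cleanse = Process("cleanse", primary, inputs=[raw], outputs=[clean])
```

Calling `describe_pipeline([cleanse])` lists the single step that links the `raw` feed to the `clean` feed, which is the "pipeline" formed by putting the entities together.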
13. Page 13 © Hortonworks Inc. 2014
Data Pipeline Definition
XML-based pipeline specification
• Modular: clusters, feeds & processes are defined separately and then linked together
• Easy to re-use across multiple pipelines
Out-of-the-box policies
• Predefined policies for replication, retention & late data handling
• Easy customization of policies
Extensible
• Plug in external solutions at any step of the pipeline
• E.g. invoke third-party data obfuscation components
14. Page 14 © Hortonworks Inc. 2014
Replication & Retention
Staged Data (retain 5 years) → Cleansed Data (retain 3 years) → Conformed Data (retain 3 years) → Presented Data (retain last copy only)
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
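To make "retention policies expressed in one place" concrete, here is a small sketch that keeps the slide's four retention rules in a single table and computes which dataset instances fall outside them. It does not use Falcon itself: the `RETENTION` table, the `instances_to_evict` function, and the 365-day year are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Retention windows from the slide, approximating a year as 365 days (an assumption).
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,  # "retain last copy only"
}


def instances_to_evict(stage, instance_times, now):
    """Return the instance timestamps of `stage` that fall outside its policy."""
    limit = RETENTION[stage]
    ordered = sorted(instance_times)
    if limit is None:
        # Keep only the most recent instance and evict the rest.
        return ordered[:-1]
    return [t for t in ordered if now - t > limit]


now = datetime(2014, 5, 1)
instances = [datetime(2008, 1, 1), datetime(2010, 6, 1), datetime(2013, 1, 1)]
```

With these sample timestamps, the staged stage would evict only the 2008 instance, the cleansed stage would evict the 2008 and 2010 instances, and the presented stage would keep only the 2013 instance.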
15. Page 15 © Hortonworks Inc. 2014
Data Pipeline Monitoring
Centralized monitoring of data pipeline with Falcon + Ambari
• Pipeline scheduling
• Pipeline run history
• Pipeline run alerts
DATA: Primary site (Hadoop Cluster-1: raw, clean, prep) → DR site (Hadoop Cluster-2: raw, clean, prep)
16. Page 16 © Hortonworks Inc. 2014
Data Pipeline Tracing
• Data pipeline dependencies: View dependencies between clusters, datasets and processes (e.g. Purchase feed, Customer feed, Product feed, Store feed)
• Data pipeline tagging: Add arbitrary tags to feeds & processes (e.g. Credit feed: Sensitive, encrypted)
• Data pipeline audits: Know who modified a dataset, when, and into what
• Data pipeline lineage: Analyze how a dataset reached a particular state (File-1, File-2, File-3)
17. Page 17 © Hortonworks Inc. 2014
Falcon User Flow
User → Falcon CLI or API → Falcon Server
• Define pipeline
– Create cluster entity & process XML specifications
• Deploy pipeline
– SUBMIT: Validate and save specifications to HDFS
– SCHEDULE: Kick off feeds & processes; schedule “instances” of feeds & processes to run
• Manage pipeline
– Ensure feeds & processes run as expected
– Update feeds & processes as needed
– “Instance” suspend, resume, kill
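The submit, schedule, suspend, resume and kill steps in this flow can be read as a small state machine. The sketch below is a hypothetical model for illustration: the state names and the `TRANSITIONS` table are assumptions, not Falcon's internal representation.

```python
# Hypothetical lifecycle model; state names and transitions are illustrative only.
TRANSITIONS = {
    ("defined", "submit"): "submitted",    # validate and save the specification
    ("submitted", "schedule"): "running",  # kick off instances
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "kill"): "killed",
    ("suspended", "kill"): "killed",
}


def apply_actions(actions, state="defined"):
    """Apply user actions in order and return the final state.

    Raises ValueError if an action is not allowed in the current state,
    for example scheduling an entity that was never submitted.
    """
    for action in actions:
        key = (state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action!r} an entity in state {state!r}")
        state = TRANSITIONS[key]
    return state
```

For instance, `apply_actions(["submit", "schedule", "suspend", "resume"])` ends in the running state, while `apply_actions(["schedule"])` is rejected because the entity has not been submitted yet.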
19. Page 19 © Hortonworks Inc. 2014
Falcon Architecture
Centralized Falcon Orchestration Framework
• Users: data stewards + Hadoop admins, via API & UI and AMBARI
• Falcon Server
– Entity specs: HDFS / Hive
– Scheduled jobs: Oozie
– Process status: JMS
• Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCP
20. Page 20 © Hortonworks Inc. 2014
Clickstream enrichment data pipeline
Use case description
• Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../{date}).
• The cluster is located in the Oregon data center.
• Data arrives from all NA-west-coast production servers.
• The input data feeds are often late by up to 4 hrs.
• We need to enrich the clickstream data with ad impression metadata and make it available to our marketing data science team for customer segmentation analysis.
• The primary Hadoop cluster does not need the raw and enriched click data after 3 months.
• Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
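The requirements above map naturally onto feed definitions. The sketch below uses Python's standard `xml.etree.ElementTree` to generate XML shaped like a Falcon feed entity. The element and attribute names (`frequency`, `late-arrival`, `retention`, the `uri:falcon:feed:0.1` namespace) are recalled from the Falcon feed schema and should be treated as assumptions to check against the Falcon documentation. The output is deliberately incomplete (a real feed also needs elements such as validity windows and an ACL). The cluster names `oregon` and `virginia`, the enriched feed's path, and its hourly frequency are hypothetical.

```python
import xml.etree.ElementTree as ET

FEED_NS = "uri:falcon:feed:0.1"  # namespace as recalled; an assumption to verify


def build_feed_xml(name, data_path, frequency, primary, primary_retention,
                   late_cutoff=None, backup=None, backup_retention=None):
    """Return a partial feed definition as an XML string (layout is approximate)."""
    feed = ET.Element("feed", {"name": name, "xmlns": FEED_NS})
    ET.SubElement(feed, "frequency").text = frequency
    if late_cutoff:
        ET.SubElement(feed, "late-arrival", {"cut-off": late_cutoff})
    clusters = ET.SubElement(feed, "clusters")
    src = ET.SubElement(clusters, "cluster", {"name": primary, "type": "source"})
    ET.SubElement(src, "retention", {"limit": primary_retention, "action": "delete"})
    if backup:
        # A target cluster stands for the replica kept at the backup site.
        tgt = ET.SubElement(clusters, "cluster", {"name": backup, "type": "target"})
        ET.SubElement(tgt, "retention", {"limit": backup_retention, "action": "delete"})
    locations = ET.SubElement(feed, "locations")
    ET.SubElement(locations, "location", {"type": "data", "path": data_path})
    return ET.tostring(feed, encoding="unicode")


# Raw clicks: hourly, up to 4 hrs late, kept 3 months on the primary cluster.
raw_clicks = build_feed_xml(
    name="clicks", data_path="/../{date}",  # path is elided on the slide; kept as-is
    frequency="hours(1)", primary="oregon", primary_retention="months(3)",
    late_cutoff="hours(4)")

# Enriched clicks: 3 months on primary, 3 years (36 months) on the backup cluster.
enriched_clicks = build_feed_xml(
    name="enriched-clicks", data_path="/hypothetical/enriched/{date}",
    frequency="hours(1)",  # assumed; the slide does not state it
    primary="oregon", primary_retention="months(3)",
    backup="virginia", backup_retention="months(36)")
```

The point of the sketch is that late-data tolerance, retention, and replication all live in the feed definition rather than in hand-written workflow code.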
21. Page 21 © Hortonworks Inc. 2014
Falcon Entity Relationships
CLICKSTREAM ENRICHMENT PIPELINE
• Clicks ingest (PROCESS) creates Clicks (DATASET)
• Impressions ingest (PROCESS) creates Impressions (DATASET)
• Click enrichment (PROCESS) creates Enriched clicks (DATASET)
• Processes run on, and datasets are stored on, the Oregon Hadoop cluster (PRIMARY CLUSTER)
• Enriched clicks is backed up to the Virginia Hadoop cluster (BACKUP CLUSTER)
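These relationships form a dependency graph, which is what makes impact analysis possible. The sketch below encodes the pipeline as an adjacency list and walks it breadth-first. The entity identifiers are hypothetical, and the edges from the Clicks and Impressions datasets into the Click enrichment process are inferred from the use case description rather than read off the diagram.

```python
from collections import deque

# Edges read "A feeds into B"; identifiers are hypothetical names for the slide's entities.
EDGES = {
    "clicks-ingest": ["clicks"],
    "impressions-ingest": ["impressions"],
    "clicks": ["click-enrichment"],        # inferred input to the enrichment process
    "impressions": ["click-enrichment"],   # inferred input to the enrichment process
    "click-enrichment": ["enriched-clicks"],
    "enriched-clicks": [],
}


def downstream(entity, edges=EDGES):
    """Return every entity reachable from `entity`, in breadth-first order."""
    seen, order, queue = {entity}, [], deque([entity])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

Asking for `downstream("impressions")` shows which processes and datasets would be affected if the Impressions dataset were late or malformed.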
22. Page 22 © Hortonworks Inc. 2014
Learn More About Data Governance in Hadoop
Hortonworks.com/labs/data-management/
Register for the remaining 4
Discover HDP 2.1 Webinars
Hortonworks.com/webinars
Next Webinar:
Apache Hadoop 2.4.0,
YARN and HDFS
Wednesday, May 28, 9am Pacific
23. Page 23 © Hortonworks Inc. 2014
Thank you!