Page 1 © Hortonworks Inc. 2014
Discover HDP 2.1
Apache Falcon for Data Governance in Hadoop
Hortonworks. We do Hadoop.
Speakers
•  Justin Sears: Hortonworks Product Marketing Manager
•  Himanshu Bari: Hortonworks Senior Product Manager & PM for Apache Falcon & Apache Storm in Hortonworks Data Platform
•  Venkatesh Seetharam: Foundational Hadoop Architect, Engineer & Committer for Apache Falcon and Apache Knox Gateway projects
Agenda
•  Why You Need Apache Falcon
•  Key New Falcon Features
•  Demo
–  Defining data pipelines
–  Policies for retention
–  Managing Falcon server with Apache Ambari
A Modern Data Architecture
[Diagram with these blocks:]
•  SOURCES: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data
•  DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP) and ENTERPRISE HADOOP (Governance & Integration, Security, Operations, Data Access, Data Management)
•  APPLICATIONS: business analytics, custom applications, packaged applications
•  OPERATIONS TOOLS: provision, manage & monitor
•  DEV & DATA TOOLS: build & test
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
[Diagram: the HDP 2.1 stack]
•  GOVERNANCE & INTEGRATION: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
•  DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (in-memory analytics, ISV engines)
•  DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System) across nodes 1 to N
•  SECURITY: authentication, authorization, accounting, data protection. Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
•  OPERATIONS: provision, manage & monitor (Ambari, ZooKeeper); scheduling (Oozie)
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
[Diagram: the HDP 2.1 stack again, with Falcon (data workflow, lifecycle & governance) in the GOVERNANCE & INTEGRATION block next to Sqoop, Flume, NFS and WebHDFS.]
Outline
•  Falcon Overview
•  Features
•  Architecture & Demo
Simple Data Pipeline in Hadoop
[Diagram: Data Sources → HDFS data lake (Raw Data → Clean Data → Prepped Data, with MR/Pig/Hive between stages) → BI TOOLS. A simple data pipeline has a relatively simple Oozie workflow (Job1, Job2, Job3, … JobN).]
Quickly Gets Complicated…
Typical data governance requirements
Data stewards
•  Impact analysis
•  Monitor pipeline
•  Track ownership
•  Late data & failure handling
Compliance teams
•  Audit
•  Retention
•  Eviction
IT admins
•  Monitor infra
•  Replication
•  Archival
Business & data analysts
•  Verify data quality
[Diagram: to cover these requirements for the Raw → Clean → Prep pipeline, teams manually write & wire multiple complex Oozie workflows (Job1 … JobN) and other Hadoop tools, e.g. DistCp.]
Apache Falcon to the Rescue
Falcon adds the required data governance features
•  DEFINITION: Replication | Retention | Eviction | Late data
•  MONITORING
•  TRACING: Audit | Lineage | Tagging
[Diagram: the data pipeline (Raw → Clean → Prep) is defined in Falcon, which auto-generates & orchestrates the multiple complex Oozie workflows (Job1 … JobN) and other Hadoop ecosystem tools, e.g. DistCp.]
Falcon Basic Concepts
•  Cluster: Represents the “interfaces” to a Hadoop cluster
•  Feed: Defines a “dataset” (feeds are a.k.a. datasets)
•  Process: Consumes feeds, invokes processing logic & produces feeds
All these put together represent “Data Pipelines” in Hadoop
[Diagram: within a CLUSTER, a FEED (aka DATASET) is INPUT TO a PROCESS, and a PROCESS CREATES a FEED.]
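The three entity types and the INPUT TO / CREATES relationships can be sketched as a small data model. This is a minimal illustration in Python, not Falcon's API: the class names, fields, and sample entity names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Stands for the interfaces to one Hadoop cluster (hypothetical fields)."""
    name: str
    interfaces: tuple = ()  # e.g. ("hdfs", "oozie"); illustrative only


@dataclass(frozen=True)
class Feed:
    """A dataset that lives on one or more clusters, referenced by name."""
    name: str
    clusters: tuple = ()


@dataclass(frozen=True)
class Process:
    """Consumes input feeds, runs processing logic, produces output feeds."""
    name: str
    cluster: str
    inputs: tuple = ()
    outputs: tuple = ()


def pipeline_edges(processes):
    """Return (source, relation, target) triples linking feeds and processes."""
    edges = []
    for p in processes:
        edges += [(f, "INPUT TO", p.name) for f in p.inputs]
        edges += [(p.name, "CREATES", f) for f in p.outputs]
    return edges


# Hypothetical entities: one cluster, two feeds, one process linking them.
primary = Cluster("primary", ("hdfs", "oozie"))
raw = Feed("raw", (primary.name,))
clean = Feed("clean", (primary.name,))
cleanse = Process("cleanse", primary.name, inputs=(raw.name,), outputs=(clean.name,))

for edge in pipeline_edges([cleanse]):
    print(edge)
```

Entities refer to each other by name only, so each one can be defined on its own and the pipeline emerges from the links.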
Data Pipeline Definition
XML-based pipeline specification
•  Modular: clusters, feeds & processes are defined separately and then linked together
•  Easy to re-use across multiple pipelines
Out-of-the-box policies
•  Predefined policies for replication, retention & late data handling
•  Easy customization of policies
Extensible
•  Plug in external solutions at any step of the pipeline
•  E.g. invoke third-party data obfuscation components
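As a rough idea of what an XML feed specification with a retention policy and a late-data cut-off might look like, here is a sketch that generates one with Python's standard library. The element and attribute names are modeled on the Apache Falcon feed schema as best recalled and should be verified against the entity specification for your Falcon version. The sketch omits other elements a real definition would need, and the entity name, cluster name, dates, and path are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_feed(name, cluster, path, frequency, retention, late_cutoff):
    """Build a partial feed entity as an ElementTree element.

    Element names are an assumption to verify against the Falcon schema.
    The feed points at its cluster by name only, which is what keeps the
    cluster, feed, and process definitions modular.
    """
    feed = ET.Element("feed", {"name": name, "xmlns": "uri:falcon:feed:0.1"})
    ET.SubElement(feed, "frequency").text = frequency
    ET.SubElement(feed, "late-arrival", {"cut-off": late_cutoff})
    clusters = ET.SubElement(feed, "clusters")
    c = ET.SubElement(clusters, "cluster", {"name": cluster, "type": "source"})
    # Validity window for the feed on this cluster (hypothetical dates).
    ET.SubElement(c, "validity", {"start": "2014-01-01T00:00Z", "end": "2099-01-01T00:00Z"})
    # Retention policy: delete instances older than the limit.
    ET.SubElement(c, "retention", {"limit": retention, "action": "delete"})
    locations = ET.SubElement(feed, "locations")
    ET.SubElement(locations, "location", {"type": "data", "path": path})
    return feed


feed = build_feed(
    name="rawClicks",                 # hypothetical entity name
    cluster="primaryCluster",         # must match a separately defined cluster
    path="/data/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}",  # hypothetical path
    frequency="hours(1)",
    retention="days(90)",
    late_cutoff="hours(4)",
)
xml_text = ET.tostring(feed, encoding="unicode")
print(xml_text)
```

The policies sit in the feed definition itself, so changing retention or the late-data window means editing one attribute rather than rewriting a workflow.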
Replication & Retention
[Diagram: Staged Data (retain 5 years) → Cleansed Data (retain 3 years) → Conformed Data (retain 3 years) → Presented Data (retain last copy only)]
•  Sophisticated retention policies expressed in one place
•  Simplify data retention for audit, compliance, or for data re-processing
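To make "expressed in one place" concrete, the four retention rules from the diagram can be held in a single table and applied by one function. This is a sketch of the idea only, not how Falcon implements eviction: the stage keys, the function name, and the approximation of a year as 365 days are assumptions.

```python
from datetime import datetime, timedelta

# Retention limits from the slide; a year is approximated as 365 days.
# None stands for "retain last copy only".
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,
}


def instances_to_evict(stage, instance_times, now):
    """Return the instance timestamps of `stage` that fall outside its policy."""
    limit = RETENTION[stage]
    ordered = sorted(instance_times)
    if limit is None:
        return ordered[:-1]  # keep only the most recent copy
    return [t for t in ordered if now - t > limit]


# Hypothetical instance dates, evaluated as of 1 May 2014.
now = datetime(2014, 5, 1)
instances = [datetime(2008, 1, 1), datetime(2010, 6, 1), datetime(2013, 1, 1)]
for stage in RETENTION:
    print(stage, [t.date().isoformat() for t in instances_to_evict(stage, instances, now)])
```

With every stage reading from the same table, an audit or compliance change touches one entry instead of several cleanup scripts.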
Data Pipeline Monitoring
Centralized monitoring of data pipeline with Falcon + Ambari
•  Pipeline run alerts
•  Pipeline run history
•  Pipeline scheduling
[Diagram: Hadoop Cluster-1 (primary site) and Hadoop Cluster-2 (DR site), connected by DATA; each shows a raw → clean → prep pipeline.]
Data Pipeline Tracing
•  Data pipeline dependencies: view dependencies between clusters, datasets and processes (diagram: Purchase feed, Customer feed, Product feed, Store feed)
•  Data pipeline tagging: add arbitrary tags to feeds & processes (diagram: Credit feed tagged “Sensitive” and “encrypted”)
•  Data pipeline audits: know who modified a dataset, when, and into what
•  Data pipeline lineage: analyze how a dataset reached a particular state
[Diagram also shows File-1, File-2, File-3.]
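Lineage and tagging both come down to lookups over pipeline metadata. The sketch below shows one possible shape for that metadata in Python. It does not reflect Falcon's storage or query interface: the feed names echo the slide, while "sales-report", the derivation edges, and the function names are assumptions.

```python
from collections import deque

# Hypothetical lineage records: each derived dataset lists its direct inputs.
DERIVED_FROM = {
    "sales-report": ["purchase-feed", "customer-feed"],
    "purchase-feed": ["store-feed", "product-feed"],
}

# Arbitrary tags attached to entities, as on the tagging panel of the slide.
TAGS = {"credit-feed": {"sensitive", "encrypted"}}


def lineage(dataset):
    """Return every upstream dataset of `dataset`, nearest first (breadth-first)."""
    seen, order, queue = {dataset}, [], deque([dataset])
    while queue:
        for parent in DERIVED_FROM.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def tagged(tag):
    """Return the names of all entities carrying `tag`, sorted."""
    return sorted(name for name, tags in TAGS.items() if tag in tags)


print(lineage("sales-report"))
print(tagged("sensitive"))
```

Answering "how did this dataset reach its current state" is then a walk up the derivation records, and "which feeds are sensitive" is a filter over tags.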
Falcon User Flow
[Flow diagram: the User works with the Falcon Server through the Falcon CLI or API.]
Define pipeline (SUBMIT)
•  Create cluster entity & process XML specifications
•  Validate and save specifications to HDFS
Deploy pipeline (SCHEDULE)
•  Kick off feeds & processes
•  Schedule “instances” of feeds & processes to run
Manage pipeline
•  Ensure feeds & processes run as expected
•  Update feeds & processes as needed
•  “Instance” suspend, resume, kill
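The submit, schedule, suspend, resume, and kill actions in this flow suggest a simple lifecycle. Here is a minimal state-machine sketch in Python. The action names come from the slide, but the state names and the set of allowed transitions are assumptions made for illustration, and the sketch does not distinguish entities from their instances.

```python
# Allowed transitions: (current state, action) -> next state.
# State names are hypothetical; actions are the ones named on the slide.
TRANSITIONS = {
    ("defined", "submit"): "submitted",
    ("submitted", "schedule"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "kill"): "killed",
    ("suspended", "kill"): "killed",
}


def next_state(state, action):
    """Apply one action, rejecting anything the table does not allow."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} while {state!r}") from None


def run(actions, state="defined"):
    """Apply a sequence of actions and return the final state."""
    for action in actions:
        state = next_state(state, action)
    return state


print(run(["submit", "schedule", "suspend", "resume"]))
```

Keeping the rules in one table makes an invalid request, such as scheduling something that was never submitted, fail early with a clear error.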
Falcon Architecture
Centralized Falcon Orchestration Framework
[Diagram: data stewards + Hadoop admins reach the Falcon Server through the API & UI and AMBARI. Labeled links: Entity Specs (HDFS / Hive), Scheduled Jobs (Oozie), Process Status (JMS). The Hadoop ecosystem tools shown are MapRed / Pig / Hive / Sqoop / Flume / DistCP.]
Clickstream enrichment data pipeline
Use case description
•  Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../{date}).
•  The cluster is located in the Oregon data center.
•  Data arrives from all NA-west-coast production servers.
•  The input data feeds are often late by up to 4 hrs.
•  We need to enrich the clickstream data with ad impression metadata and make it available to our marketing data science team for customer segmentation analysis.
•  The primary Hadoop cluster does not need the raw and enriched click data after 3 months.
•  Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
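The numbers in this use case (hourly arrival, up to 4 hours late, 3 months on the primary cluster, 3 years on the backup) can be written down as parameters and checked against individual hourly instances. The sketch below is one reading of those requirements in Python, not Falcon's late-data logic: measuring the cut-off from the end of the hourly window, and treating 3 months as 90 days and 3 years as 3 × 365 days, are assumptions.

```python
from datetime import datetime, timedelta

FREQUENCY = timedelta(hours=1)              # data lands hourly
LATE_CUTOFF = timedelta(hours=4)            # feeds are late by up to 4 hrs
PRIMARY_RETENTION = timedelta(days=90)      # "3 months", approximated
BACKUP_RETENTION = timedelta(days=3 * 365)  # "3 years", approximated


def classify_arrival(nominal, arrived):
    """Classify data for the hourly instance that starts at `nominal`."""
    window_end = nominal + FREQUENCY
    if arrived <= window_end:
        return "on-time"
    if arrived <= window_end + LATE_CUTOFF:
        return "late"          # still eligible for late-data handling
    return "past-cutoff"


def eviction_dates(nominal):
    """When an instance may be dropped from each cluster under the policy."""
    return {
        "primary": nominal + PRIMARY_RETENTION,
        "backup": nominal + BACKUP_RETENTION,
    }


# Hypothetical instance: the 10:00 hour on 1 May 2014.
nominal = datetime(2014, 5, 1, 10)
print(classify_arrival(nominal, datetime(2014, 5, 1, 13, 0)))
print(eviction_dates(nominal))
```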
Falcon Entity Relationships
CLICKSTREAM ENRICHMENT PIPELINE
•  PROCESS: Clicks ingest (creates DATASET: Clicks)
•  PROCESS: Impressions ingest (creates DATASET: Impressions)
•  PROCESS: Click enrichment (creates DATASET: Enriched clicks)
•  PRIMARY CLUSTER: Oregon Hadoop cluster (processes run on it; datasets are stored on it)
•  BACKUP CLUSTER: Virginia Hadoop cluster (backup-to target)
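One way to put these relationships to work is impact analysis: given an entity, find everything downstream of it. The Python sketch below encodes one reading of the diagram as triples. The "input to" edges into Click enrichment are inferred from the use case description rather than read off the slide, and the function name is hypothetical.

```python
from collections import defaultdict

# (subject, relation, object) triples; one interpretation of the diagram.
RELATIONS = [
    ("Clicks ingest", "creates", "Clicks"),
    ("Impressions ingest", "creates", "Impressions"),
    ("Clicks", "input to", "Click enrichment"),        # inferred from the use case
    ("Impressions", "input to", "Click enrichment"),   # inferred from the use case
    ("Click enrichment", "creates", "Enriched clicks"),
    ("Enriched clicks", "backup to", "Virginia Hadoop cluster"),
]


def downstream(entity):
    """Return everything reachable from `entity` by following relations forward."""
    graph = defaultdict(list)
    for src, _relation, dst in RELATIONS:
        graph[src].append(dst)
    found, stack = [], [entity]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in found:
                found.append(nxt)
                stack.append(nxt)
    return found


print(downstream("Clicks"))
```

A change to the Clicks dataset would therefore touch the enrichment process, the enriched dataset, and the backup copy in Virginia.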
Learn More About Data Governance in Hadoop
Hortonworks.com/labs/data-management/
Register for the remaining 4
Discover HDP 2.1 Webinars
Hortonworks.com/webinars
Next Webinar:
Apache Hadoop 2.4.0,
YARN and HDFS
Wednesday, May 28, 9am Pacific
Thank you!

 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

  • 1. Page 1 © Hortonworks Inc. 2014 Discover HDP 2.1 Apache Falcon for Data Governance in Hadoop Hortonworks. We do Hadoop.
  • 2. Speakers: Justin Sears, Hortonworks Product Marketing Manager. Himanshu Bari, Hortonworks Senior Product Manager & PM for Apache Falcon & Apache Storm in Hortonworks Data Platform. Venkatesh Seetharam, Foundational Hadoop Architect, Engineer & Committer for the Apache Falcon and Apache Knox Gateway projects.
  • 3. Agenda: Why You Need Apache Falcon. Key New Falcon Features. Demo: defining data pipelines, policies for retention, managing Falcon server with Apache Ambari.
  • 4. A Modern Data Architecture. Sources: OLTP, ERP, CRM systems; documents, emails; web logs, click streams; social networks; machine generated; sensor data; geolocation data. Data system: repositories (RDBMS, EDW, MPP) and Enterprise Hadoop (Governance & Integration, Data Access, Data Management, Security, Operations). Applications: business analytics, custom applications, packaged applications. Operations tools: provision, manage & monitor. Dev & data tools: build & test.
  • 5. HDP 2.1: Enterprise Hadoop (Hortonworks Data Platform). Governance & Integration (data workflow, lifecycle & governance): Falcon, Sqoop, Flume, NFS, WebHDFS. Data Access: batch (MapReduce), script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), stream (Storm), search (Solr), others (in-memory analytics, ISV engines). Data Management: YARN (Data Operating System) and HDFS (Hadoop Distributed File System) across nodes 1 to N. Security (authentication, authorization, accounting, data protection): storage: HDFS; resources: YARN; access: Hive, …; pipeline: Falcon; cluster: Knox. Operations: provision, manage & monitor (Ambari, ZooKeeper); scheduling (Oozie).
  • 6. HDP 2.1: Enterprise Hadoop (repeats the platform diagram from slide 5).
  • 7. Outline: Falcon Overview. Features. Architecture & Demo.
  • 8. Simple Data Pipeline in Hadoop. A simple data pipeline: data sources, then raw data, clean data and prepped data in an HDFS data lake (processed with MR/Pig/Hive), then BI tools. It has a relatively simple Oozie workflow (Job1, Job2, Job3, JobN).
  • 9. Quickly Gets Complicated… Typical data governance requirements on the raw, clean and prep stages. Data stewards: impact analysis, monitor pipeline, track ownership, late data & failure handling. Compliance teams: audit, retention, eviction. IT admins: monitor infra, replication, archival. Business & data analysts: verify data quality. Result: manually write and wire multiple complex Oozie workflows, plus other Hadoop tools, e.g. DistCp.
  • 10. Apache Falcon to the Rescue. The data pipeline (raw, clean, prep) is defined in Falcon, which auto-generates and orchestrates the multiple complex Oozie workflows and other Hadoop ecosystem tools (e.g. DistCp). Falcon adds the required data governance features. Definition: replication, retention, eviction, late data. Monitoring. Tracing: audit, lineage, tagging.
  • 11. Outline: Falcon Overview. Features. Architecture & Demo.
  • 12. Falcon Basic Concepts. Cluster: represents the “interfaces” to a Hadoop cluster. Feed: defines a “dataset”, so feeds are also known as datasets. Process: consumes feeds, invokes processing logic & produces feeds. All these put together represent “data pipelines” in Hadoop. Diagram labels: cluster; feed (a.k.a. dataset) is input to a process; a process creates a feed.
  • 13. Data Pipeline Definition. XML-based pipeline specification. Modular: clusters, feeds & processes are defined separately and then linked together, so they are easy to reuse across multiple pipelines. Out-of-the-box policies: predefined policies for replication, retention & late data handling, with easy customization of policies. Extensible: plug in external solutions at any step of the pipeline, e.g. invoke third-party data obfuscation components.
  • 14. Replication & Retention. Staged data: retain 5 years. Cleansed data: retain 3 years. Conformed data: retain 3 years. Presented data: retain last copy only. Sophisticated retention policies are expressed in one place. This simplifies data retention for audit, compliance, or data re-processing.
  • 15. Data Pipeline Monitoring. Centralized monitoring of the data pipeline with Falcon + Ambari: pipeline scheduling, pipeline run history, pipeline run alerts. Diagram: data at a primary site and a DR site (Hadoop Cluster-1 and Hadoop Cluster-2), each with raw, clean and prep stages.
  • 16. Data Pipeline Tracing. Data pipeline dependencies: view dependencies between clusters, datasets and processes (example: purchase feed, customer feed, product feed, store feed). Data pipeline tagging: add arbitrary tags to feeds & processes (example: credit feed tagged sensitive, encrypted). Data pipeline audits: know who modified a dataset, when, and into what. Data pipeline lineage: analyze how a dataset reached a particular state (example: File-1, File-2, File-3).
  • 17. Falcon User Flow. Actors: user, Falcon CLI or API, Falcon server. Stages: define pipeline, deploy pipeline, manage pipeline (“instance” suspend, resume, kill). Steps, in order: create cluster entity & process XML specifications; validate and save specifications to HDFS; kick off feeds & processes; schedule “instances” of feeds & processes to run; ensure feeds & processes run as expected; update feeds & processes as needed. Commands shown: SUBMIT, SCHEDULE.
  • 18. Outline: Falcon Overview. Features. Architecture & Demo.
  • 19. Falcon Architecture: Centralized Falcon Orchestration Framework. Users: data stewards + Hadoop admins. Components: Falcon server with API & UI, Ambari, JMS, Oozie, HDFS / Hive. Flow labels: entity specs, scheduled jobs, process status. Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCp.
  • 20. Clickstream enrichment data pipeline: use case description. Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../ {date}). The cluster is located in the Oregon data center. Data arrives from all NA-west-coast production servers. The input data feeds are often late by up to 4 hours. We need to enrich the clickstream data with ad impression metadata and make it available to our marketing data science team for customer segmentation analysis. The primary Hadoop cluster does not need the raw and enriched click data after 3 months. Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
  • 21. Falcon Entity Relationships: clickstream enrichment pipeline. Datasets: clicks, impressions, enriched clicks. Processes: clicks ingest, impressions ingest, click enrichment. Clusters: Oregon Hadoop cluster (primary cluster), Virginia Hadoop cluster (backup cluster). Relationship labels: creates, runs on, stored on, backup to.
  • 22. Learn More About Data Governance in Hadoop: Hortonworks.com/labs/data-management/. Register for the remaining 4 Discover HDP 2.1 webinars: Hortonworks.com/webinars. Next webinar: Apache Hadoop 2.4.0, YARN and HDFS, Wednesday, May 28, 9am Pacific.
  • 23. Thank you!
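
The clickstream use case on slides 20 and 21 can be sketched as a small Python model. This is only an illustration of the cluster, feed and process concepts and of the retention and late-data rules; it is not Falcon's XML specification format or its API. All class names, field names, helper functions and entity names below are hypothetical. "3 months" is approximated as 90 days and "3 years" as 3 × 365 days, and the retention used for the impressions feed is an assumption, since the slides state retention only for the click data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Cluster:
    """Hypothetical stand-in for a cluster entity."""
    name: str
    location: str


@dataclass
class Feed:
    """Hypothetical stand-in for a feed (dataset) entity."""
    name: str
    frequency: timedelta
    late_cutoff: timedelta   # how late an instance may arrive and still be used
    retention: dict          # cluster name -> timedelta to keep instances

    def is_evictable(self, cluster, instance_time, now):
        # An instance can be evicted once it is older than the retention
        # period configured for that cluster.
        return now - instance_time > self.retention[cluster.name]

    def accepts_late_data(self, instance_time, arrival_time):
        # Late data is accepted only within the cut-off window.
        return arrival_time - instance_time <= self.late_cutoff


@dataclass
class Process:
    """Hypothetical stand-in for a process entity."""
    name: str
    cluster: Cluster
    inputs: list
    outputs: list


def upstream_feeds(feed_name, processes):
    """Return the names of all feeds that feed_name is derived from."""
    seen = set()
    stack = [feed_name]
    while stack:
        current = stack.pop()
        for proc in processes:
            if any(out.name == current for out in proc.outputs):
                for inp in proc.inputs:
                    if inp.name not in seen:
                        seen.add(inp.name)
                        stack.append(inp.name)
    return seen


# Entities from the use case (names are illustrative).
oregon = Cluster("primary", "Oregon")
virginia = Cluster("backup", "Virginia")

hourly = timedelta(hours=1)
three_months = timedelta(days=90)        # approximation of "3 months"
three_years = timedelta(days=3 * 365)    # approximation of "3 years"

clicks = Feed("clicks", hourly, timedelta(hours=4), {"primary": three_months})
# Assumption: impressions follow the same retention as raw clicks.
impressions = Feed("impressions", hourly, timedelta(hours=4),
                   {"primary": three_months})
enriched = Feed("enriched-clicks", hourly, timedelta(0),
                {"primary": three_months, "backup": three_years})

processes = [
    Process("clicks-ingest", oregon, [], [clicks]),
    Process("impressions-ingest", oregon, [], [impressions]),
    Process("click-enrichment", oregon, [clicks, impressions], [enriched]),
]

now = datetime(2014, 5, 28)
old_instance = now - timedelta(days=100)
print(enriched.is_evictable(oregon, old_instance, now))
print(enriched.is_evictable(virginia, old_instance, now))
print(sorted(upstream_feeds("enriched-clicks", processes)))
```

Keeping retention per cluster on the feed mirrors the point on slide 14 that retention policies are expressed in one place: the same enriched-clicks instance is eligible for eviction on the primary cluster long before it is on the backup cluster.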