Falcon - Data Management Platform on Hadoop (Beyond ETL)

Hadoop and its ecosystem of products have made storing and processing massive amounts of data commonplace. This has enabled numerous businesses to gain valuable insights that they never could have in the past. While it is easy to leverage Hadoop for crunching large volumes of data, organizing data, managing the life cycle of data and processing data are fairly involved. These problems are solved adequately well in a traditional data platform involving data warehouses and standard ETL (extract-transform-load) tools, but remain largely unsolved on Hadoop today. Besides data processing complexities, Hadoop presents a new set of challenges relating to the management of data. Data management on Hadoop encompasses data motion (import/export), process orchestration (data pipelines, late data handling, re-processing, scheduling), lifecycle management (retention, replication, DR, anonymization, archival) and data discovery (data classification, lineage), among other concerns that are beyond ETL. The presentation focuses on Falcon, a new data processing and management platform for Hadoop that attempts to solve this problem by leveraging existing stacks in the Hadoop ecosystem. Falcon has been in production for nearly a year at InMobi and has been managing hundreds of feeds and processes.

1. Data Management Platform on Hadoop (Incubating). Srikanth Sundarrajan, Venkatesh Seetharam

2. whoami. Srikanth Sundarrajan: Principal Architect, InMobi; Apache Hadoop Contributor; Hadoop Team @ Yahoo!. Venkatesh Seetharam: Architect/Developer, Hortonworks; Apache Hadoop Contributor; Data Management @ Yahoo!

3. Agenda: 1 Motivation, 2 Falcon Overview, 3 Case Studies, 4 Questions & Answers

4. MOTIVATION

5. Data Processing Landscape: External data source; Acquire (Import); Data Processing (Transform/Pipeline); Eviction; Archive; Replicate (Copy); Export

6. Core Services. Process: late data management, relays. Data management: acquisition, replication, retention. Operability: SLA, lineage.

7. Process Management – Relays (picture courtesy: http://istockphoto.com/)

8. Late Data Management (picture courtesy: http://iwebask.com)

9. Data Retention As Service (picture courtesy: http://vimeo.com/)

10. Data Replication As Service (picture courtesy: http://boylesmedia.com)

11. Data Acquisition As Service (picture courtesy: http://wmpu.org)

12. Operability – Dashboard (picture courtesy: http://www.opentrack.ch/)

13. FALCON OVERVIEW

14. Holistic Declaration of Intent (picture courtesy: http://bigboxdetox.com)

15. Entity Dependency Graph. Nodes: Hadoop / HBase … Cluster, External data source, Feed, Process. Edges: two arrows labeled "depends".

16. High Level Architecture. Components: Apache Falcon, Oozie, Messaging, HCatalog, Hadoop, Config store. Labels: Entity, Entity status, Process status / notification, CLI/REST, JMS.

17. Feed Schedule. Cluster xml, Feed xml; Falcon; Falcon config store / Graph; Retention / Replication workflow; Oozie Scheduler; HDFS; JMS Notification per action; Catalog service; Instance Management.

18. Process Schedule. Cluster/feed xml, Process xml; Falcon; Falcon config store / Graph; Process workflow; Oozie Scheduler; HDFS; JMS Notification per available feed; Catalog service; Instance Management.

19. Physical Architecture. Falcon Colo 1, Falcon Colo 2, Falcon Colo 3, each with a Scheduler; Falcon – Prism: Global view.

20. CASE STUDY Multi Cluster Failover

21.

22.

23. CASE STUDY Distributed Processing Example: Digital Advertising @ InMobi

24. Hadoop @ InMobi
• About InMobi
  • World's leading independent mobile advertising company
• Hadoop usage at InMobi
  • ~ 6 clusters
  • > 1 PB of storage
  • > 5 TB new data ingested each day
  • > 20 TB data crunched each day
  • > 200 nodes in HDFS/MR clusters & > 40 nodes in HBase
  • > 175K Hadoop jobs / day
  • > 60K Oozie workflows / day
  • 300+ Falcon feed definitions
  • 100+ Falcon process definitions

25. Processing – Single Data Center. Ad Request data, Impression render event, Click event, Conversion event; Continuous Streaming (minutely); Hourly summary; Enrichment (minutely/5 minutely); Summarizer.

26. Global Aggregation. DataCenter1 … DataCenterN, each with: Ad Request data, Impression render event, Click event, Conversion event; Continuous Streaming (minutely); Hourly summary; Enrichment (minutely/5 minutely); Summarizer. Output: Consumable global aggregate.

27. HIGHLIGHTS

28. Future: Security; Embed Pig/Hive scripts; Data Acquisition – file-based; Monitoring/Management Dashboard

29. Summary

30. Questions?
• Apache Falcon
  • http://falcon.incubator.apache.org
  • mailto: dev@falcon.incubator.apache.org
• Srikanth Sundarrajan
  • sriksun@apache.org
  • #sriksun
• Venkatesh Seetharam
  • venkatesh@apache.org
  • #innerzeal

Editor's Notes

In a typical big data environment involving Hadoop, the use cases tend to be around processing very large volumes of data for either machine or human consumption. Some of the data that reaches the Hadoop platform can contain critical business and financial information. The data processing team in such an environment is often distracted by a multitude of data management and process orchestration challenges. To name a few:
• Ingesting large volumes of events/streams
• Ingesting slowly changing data, typically available in a traditional database
• Creating a pipeline / sequence of processing logic to extract the desired piece of insight / information
• Handling processing complexities relating to change of data / failures
• Managing eviction of older data elements
• Backing up the data in an alternate location, or archiving it in cheaper storage, for DR/BCP and compliance requirements
• Shipping data out of the Hadoop environment periodically for machine or human consumption, etc.
These tend to be standard challenges that are better handled in a platform, which allows the data processing team to focus on its core business application. A platform approach also lets us adopt best practices in solving each of these, for subsequent users of the platform to leverage.
========================
What do we mean by DM
• Platform should provide these as services, so users worry only about business processing
• Captures common themes and follows best practices
• Frees users from such
As we just noted, numerous data and process management services, when made available to the data processing team, can significantly reduce its day-to-day complexities and allow it to focus on the business application. This is an enumeration of such services, which we intend to cover in adequate detail as we go along.
More often than not, pipelines are sequences of data processing or data movement tasks that need to happen before raw data can be transformed into a meaningfully consumable form. Normally the end stage of the pipeline, where the final sets of data are produced, is in the critical path and may be subject to tight SLA bounds. Any step in the sequence, if either delayed or failed, could cause the pipeline to stall. It is important that each step in the pipeline hands off to the next step, to avoid any time buffering and to allow seamless progression of the pipeline. People who are familiar with Apache Oozie might appreciate this feature, which is provided through the Coordinator. As pipelines get more time-critical, this hand-off becomes essential and ought to be available off the shelf for application developers. It is also important for this feature to be scalable enough to support concurrent pipelines.
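
The relay idea in this note can be illustrated with a small sketch. It is not Falcon or Oozie code: the `Step` class, the `run_relay` function and the data set names are hypothetical, and the only point is that a step starts as soon as all of its inputs exist, instead of waiting for a fixed time buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One pipeline stage (hypothetical model, not a Falcon or Oozie class)."""
    name: str
    inputs: frozenset  # names of the data sets this step consumes
    output: str        # name of the data set this step produces


def run_relay(steps, initially_available):
    """Start each step as soon as all of its inputs are available.

    Returns (execution_order, stalled_steps). A step stalls when one of its
    inputs is never produced, which in turn stalls everything downstream.
    """
    available = set(initially_available)
    pending = list(steps)
    order = []
    progressed = True
    while pending and progressed:
        progressed = False
        for step in list(pending):
            if step.inputs <= available:
                order.append(step.name)
                available.add(step.output)  # completion hands off to consumers
                pending.remove(step)
                progressed = True
    return order, [step.name for step in pending]


PIPELINE = [
    Step("export", frozenset({"summary"}), "exported"),
    Step("summarize", frozenset({"enriched"}), "summary"),
    Step("enrich", frozenset({"raw"}), "enriched"),
]
```

With `raw` available, `run_relay(PIPELINE, {"raw"})` runs enrich, summarize and export in that order even though the steps were declared in reverse; without `raw`, every step is reported as stalled.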
From our experience there are typically two reasons why large volumes of data are processed, namely:
• SLA-critical machine-consumable data (with some tolerance to error)
• Factual reporting with a “Close of Books” notion for human consumption (not always, but frequently enough)
The first class of application is not affected much if some small percentage of data arrives late. Examples of this class include forecasting, predictions, risk management, etc. The second class of application, however, is used for factual reporting, the results of which may be subject to audit. For these use cases it is not acceptable to ignore data that arrived out of order or late. The platform in such cases needs to give the application author the ability to detect the arrival of late data and enable re-processing. This might also require a cascading reprocess flow of all downstream apps. Having this service available off the shelf would relieve application developers of the pain of managing this themselves.
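
The cascading reprocess flow mentioned above can be sketched as a walk over the feed/process graph. This is a minimal illustration under stated assumptions: the two dictionaries, the function name `affected_processes` and the feed and process names are all hypothetical and do not reflect how Falcon stores or resolves dependencies.

```python
from collections import deque


def affected_processes(consumers, produces, late_feed):
    """Return the set of processes that need a re-run when late data
    arrives for `late_feed`.

    consumers: feed name -> list of processes that read the feed
    produces:  process name -> list of feeds the process writes
    (both mappings are hypothetical inputs for this sketch)
    """
    affected = set()
    queue = deque([late_feed])
    while queue:
        feed = queue.popleft()
        for process in consumers.get(feed, []):
            if process in affected:
                continue  # already scheduled for re-processing
            affected.add(process)
            # Everything this process writes is now stale as well.
            queue.extend(produces.get(process, []))
    return affected


CONSUMERS = {
    "clicks": ["enrich"],
    "enriched": ["hourly_summary"],
    "summary": ["export"],
}
PRODUCES = {
    "enrich": ["enriched"],
    "hourly_summary": ["summary"],
    "export": [],
}
```

Late data on `clicks` marks all three processes for re-processing, while late data on `summary` only affects `export`. The sketch returns an unordered set; a real scheduler would additionally have to re-run the affected processes in dependency order.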
Data volumes that are large and increasing by the day are the reason one adopts a big data platform like Hadoop, and that automatically means we would run out of space pretty soon if we didn't take care of evicting and purging older instances of data. A few problems to consider for retention are:
• Avoid using a general-purpose super user with world-writable privileges to delete old data (for obvious reasons)
• Different types of data may require different criteria for aging, and hence purging
• Other life cycle functions, like archival of old data, if defined, ought to be scheduled before eviction kicks in
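
Two of the retention concerns listed above (a per-data-type aging criterion, and archival before eviction) can be sketched as follows. The function `plan_retention`, its parameters and the example dates are hypothetical and are not taken from Falcon; the sketch only shows the ordering rule.

```python
from datetime import datetime, timedelta


def plan_retention(instances, now, retention, archived=None):
    """Split expired feed instances into (evict, archive_first).

    instances: timestamps of the existing instances of one feed
    retention: how long this type of data is kept (differs per feed)
    archived:  set of instances already archived, or None when no archival
               is defined for the feed
    """
    cutoff = now - retention
    evict, archive_first = [], []
    for instance in sorted(instances):
        if instance >= cutoff:
            continue  # still inside the retention window
        if archived is None or instance in archived:
            evict.append(instance)
        else:
            # Archival is defined but has not happened yet: hold eviction.
            archive_first.append(instance)
    return evict, archive_first


NOW = datetime(2013, 6, 10)
INSTANCES = [datetime(2013, 6, 1), datetime(2013, 6, 5), datetime(2013, 6, 9)]
evict, archive_first = plan_retention(
    INSTANCES, NOW, timedelta(days=3), archived={datetime(2013, 6, 1)}
)
```

Here only the 1 June instance is evicted; the 5 June instance has expired but is held back until it has been archived, and the 9 June instance is still within the three-day window.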
Hadoop is becoming increasingly critical for many businesses. For some users the raw data volumes are too large to be shipped to one place for processing; for others, data needs to be redundantly available for business continuity reasons. In either scenario, replication of data from one cluster to another plays a vital role. Having this available as a service would again free the application developer from these responsibilities. The key challenges to consider while offering this as a service are:
• Bandwidth consumption and management
• Chunking/bulking strategy
• Correctness guarantees
• HDFS version compatibility issues
=========================
2 dimensions:
• BCP/DR
• Local/global aggregation: ship local aggregates as part of a pipeline
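
As one possible reading of the "chunking/bulking strategy" challenge, the sketch below greedily groups files into copy batches bounded by a byte budget. The notes do not say which strategy Falcon uses, so the function name `plan_copy_batches`, the size limit and the file names are assumptions made purely for illustration.

```python
def plan_copy_batches(files, max_batch_bytes):
    """Greedily group (path, size_in_bytes) pairs into copy batches.

    A batch is closed as soon as adding the next file would exceed
    max_batch_bytes. A single file larger than the limit still gets a
    batch of its own, so nothing is ever dropped.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


FILES = [("part-0", 40), ("part-1", 50), ("part-2", 30), ("part-3", 200)]
```

Calling `plan_copy_batches(FILES, 100)` yields three batches: the first two files together, then `part-2`, then the oversized `part-3` alone. Bounding the batch size is one simple way to keep the bandwidth used by any single transfer predictable.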
An integrated view of what is currently happening in the system, based on holistic information about all the elements in the system (data, associated management functions, processing logic and the location), provides a compelling view of the “state of the system” at any time. This is a much-needed platform feature for the larger goal of “allowing the data application developer to focus on the business or processing logic”. Adding alerting and notifications to this will complete the operability story.
===============================
• Dashboard
• Alerts
• Notifications
Some of the things we have spoken about so far can be done with a siloed approach. For instance, it is possible to process a few data sets and produce a few more through a scheduler. However, if there are two other consumers of the data produced by the first workflow, then the same data set will be defined again by each of those consumers, and so on. There is serious duplication of metadata about what data is ingested, processed or produced, where it is processed and how it is produced. A single system that creates a complete view of this can provide a fairly complete picture of what is happening in the system, compared to a collection of independently scheduled applications. Otherwise, both the production support and application development teams on the Hadoop platform have to scramble and write custom scripts and monitoring systems to get a broader, holistic view of what is happening. An approach where this information is systematically collected and used for seamless management can alleviate much of the pain of those operating or developing data processing applications on Hadoop.
The entity graph at the core is what makes Falcon what it is, and in a way it enables all the unique features that Falcon has to offer or can potentially make available in future. At the core:
• Dependency between data, processing logic and cluster end points
• Rules governing
  • Data management
  • Processing management
  • Metadata management
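
A dependency graph of this kind can be sketched in a few lines. The `EntityGraph` class, its method names and the rule that an entity with dependents cannot be deleted are assumptions made for this illustration; they are not Falcon's actual data structures or API.

```python
from collections import defaultdict


class EntityGraph:
    """Toy registry of clusters, feeds and processes (hypothetical)."""

    def __init__(self):
        self._depends_on = {}                # entity -> entities it refers to
        self._dependents = defaultdict(set)  # entity -> entities referring to it

    def submit(self, name, depends_on=()):
        """Register an entity; everything it refers to must already exist."""
        missing = [dep for dep in depends_on if dep not in self._depends_on]
        if missing:
            raise ValueError(f"{name} refers to unknown entities: {missing}")
        self._depends_on[name] = set(depends_on)
        for dep in depends_on:
            self._dependents[dep].add(name)

    def dependents(self, name):
        """Entities that directly refer to `name`."""
        return set(self._dependents.get(name, ()))

    def can_delete(self, name):
        """In this sketch an entity is deletable only if nothing refers to it."""
        return not self._dependents.get(name)


graph = EntityGraph()
graph.submit("primary-cluster")
graph.submit("clicks-feed", depends_on=["primary-cluster"])
graph.submit("enrich-process", depends_on=["primary-cluster", "clicks-feed"])
```

Because every feed and process is declared against the same graph, questions such as "what is affected if this cluster goes away?" become a lookup (`graph.dependents("primary-cluster")`) rather than a search through independently scheduled applications.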
• System accepts entities using a DSL: infrastructure, datasets, pipeline/processing logic
• Transforms the input into automated and scheduled workflows
• System orchestrates workflows
• Instruments execution of configured policies
• Handles retry logic and late data processing
• Records audit, lineage
• Seamless integration with metastore/catalog (WIP)
• Provides notifications based on availability
• Integrated, seamless experience for users
• Automates processing and tracks the end-to-end progress
• Data set management (replication, retention, etc.) offered as a service
• Users can cherry-pick; no coupling between primitives
• Provides hooks for monitoring, metrics collection
• Ad Request, Click, Impression, Conversion feeds
  • Minutely (with identical location and retention configuration, but with many data centers)
• Summary data
  • Hourly (with multiple partitions, one per data center, each configured as a source, and one target, which is the global data center)
• Click, Impression, Conversion enrichment and Summarizer processes
  • Single definition with multiple data centers
  • Identical periodicity and scheduling configuration
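
The hourly summary with one partition per data center can be pictured with the sketch below, which merges per-data-center summaries into a global aggregate only once every expected partition has arrived. The function `global_aggregate`, the data center names and the counter fields are hypothetical; the notes do not describe how the actual aggregation job is implemented.

```python
from collections import Counter


def global_aggregate(partitions, expected_dcs):
    """Merge per-data-center hourly summaries into one global summary.

    partitions:   data center name -> {metric name: count} for one hour
    expected_dcs: data centers that must all have delivered their partition
    Returns None while any expected partition is still missing.
    """
    missing = set(expected_dcs) - set(partitions)
    if missing:
        return None  # wait until every source partition has been replicated
    total = Counter()
    for dc in expected_dcs:
        total.update(partitions[dc])  # adds counts metric by metric
    return dict(total)


HOUR = {
    "dc1": {"clicks": 3, "impressions": 10},
    "dc2": {"clicks": 2, "impressions": 5},
}
```

With both partitions present, `global_aggregate(HOUR, ["dc1", "dc2"])` sums the metrics across data centers; if a third data center is expected but has not delivered yet, the function returns `None` and the global instance is not produced.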
