March 2016
Data Movement &
Management Meet-up
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2015
Agenda
• Networking
• Brief introduction - Venkat Ranganathan
• Falcon Use Case Discussion
• Falcon 0.9 Release and Demo
• New Features coming in 0.10
• Hive DR: Balu Vellanki
• Server side extensions – Sowmya Ramesh
• ADF and Instance search – Ying Zheng
• Hive based ingestion and export – Venkatesan Ramachandran
• Spark integration - Peeyush
• Sqoop 2 Features – Abraham Fine
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon At a Glance
> Falcon offers a high-level abstraction of key services for Hadoop data processing needs.
> Complex data processing logic such as late data handling and retries is handled by Falcon instead
of being hard-coded in data processing apps.
> Falcon maximizes reuse and consistency, enabling faster development of data processing apps.
[Diagram: Data Processing Applications run on top of the Falcon Framework, which provides Data Ingest and Replication, Scheduling and Coordination, Data Lifecycle Policies, Multi-Cluster Management, and SLA Management]
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Usage Scenarios
• Dataset Replication
• Replicate datasets (whether HDFS files or Hive Tables) as part of your Disaster Recovery, Backup and
Archival plans.
• Falcon triggers processes for retries and handles late data arrival.
• Dataset Lifecycle Management
• Establish the retention policies for datasets.
• Falcon schedules and handles eviction.
• Dataset Lineage + Traceability
• View coarse-grained dependencies between clusters, datasets and processes.
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dataset Replication + Retention
HDFS Hive Tables
Weblog Dataset
retention
policy
HDFS
retention
policy
Hive Tables
Recommendations Dataset
retention
policy
retention
policy
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Datasets Across Environments
• Disaster Recovery and Backup between environments
• Publishing data between environments for Discovery
Site to Site Site to Cloud
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
> Falcon manages process workflow and replication at different stages.
> Enables data continuity without requiring full data representation.
Falcon Example: Replication
Staged Data
Staged Data
Cleansed
Data
Presented
Data
Processed
Data
Conformed
Data
Replication
Replication
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or data reprocessing.
Falcon Example: Retention
Staged Data: Retain 5 Years
Cleansed Data: Retain 3 Years
Conformed Data: Retain 3 Years
Presented Data: Retain Last Copy Only
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon Example: Late Data Handling
> Processing waits until all required input data is available.
> Checks for late data arrivals and retriggers processing as necessary.
> Eliminates writing complex data handling rules within applications.
[Diagram: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) land in Staged Data and feed a Combined Dataset; processing waits up to 4 hours for the FTP data to arrive]
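The late-data rule on this slide can be sketched in a few lines. This is only an illustration of the idea, not Falcon's implementation: the function name, the three outcome labels, and the example timestamps are assumptions, and only the 4-hour cut-off comes from the slide.

```python
from datetime import datetime, timedelta

# Cut-off taken from the slide ("wait up to 4 hours"); everything else is illustrative.
LATE_CUTOFF = timedelta(hours=4)


def late_data_action(nominal_time: datetime,
                     arrival_time: datetime,
                     already_processed: bool,
                     cutoff: timedelta = LATE_CUTOFF) -> str:
    """Decide what to do with input that arrives for a given nominal instance time.

    Returns one of three hypothetical labels:
      "process" - data arrived within the cut-off and the instance has not run yet
      "rerun"   - data arrived within the cut-off but after the instance already ran
      "ignore"  - data arrived after the cut-off
    """
    if arrival_time - nominal_time > cutoff:
        return "ignore"
    return "rerun" if already_processed else "process"


if __name__ == "__main__":
    nominal = datetime(2016, 3, 24, 9, 0)
    # FTP data shows up three hours late, after the instance already ran.
    print(late_data_action(nominal, nominal + timedelta(hours=3), already_processed=True))
```

Keeping this decision in one place is the point the slide makes: the application itself does not need to carry late-arrival rules.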
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Learn Falcon Using Tutorials
• http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/
• More to come…
• Questions – Please reach out to cnormile@hortonworks.com
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon Usage in a Pharma Company
Ivo Lašek
03/24 2016 Hadoop Data Management and Data Movement
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Search
Data
Integration
Data
Analytics
Open
Data
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Public Brazilian Data
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Lake
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Lake
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Lake
Merge
Clean
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Lake
Merge
Clean
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Lake
Merge
Clean
Security and Data Governance
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Lake
Merge
Clean
Data Catalog
Security and Data Governance
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon Usage
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Datasets and Feeds
• HDFS and Hive based datasets
• An HDFS folder or a Hive table is a single feed in Falcon
• Our Dataset represents an HDFS folder or a collection of Hive tables
• Our Dataset corresponds to 1 or more Falcon feeds
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dataset Level Properties
• Need to set dataset level properties, not table level
• Retention policy, frequency etc.
• Currently we use a middleware layer that translates datasets to
feeds
• Need to keep the primary information in the middleware layer
• Potential synchronization issues
• Falcon can’t be accessed directly
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parametrized Scripts
INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table}
PARTITION (${falcon_output_partitions_hive})
SELECT *
FROM ${falcon_input1_database}.${falcon_input1_table} i1
JOIN ${falcon_input2_database}.${falcon_input2_table} i2
ON i1.common_id = i2.common_id
WHERE ${falcon_input1_partition_filter_hive} AND
${falcon_input2_partition_filter_hive}
-- Example of the partition filters after substitution:
-- WHERE ds = '2015-08-14-09-00' AND ds = '2015-08-14-10-00'
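To see how such a parametrized script resolves, the substitution can be mimicked outside Falcon with Python's standard `string.Template`, which uses the same `${name}` placeholder syntax. This is a sketch for illustration only: the placeholder names come from the slide, while the database names, table names, and filter values are hypothetical, and the real substitution is performed by the workflow engine rather than by code like this.

```python
from string import Template

# Script text follows the slide; the aliases i1/i2 match the join condition.
HIVE_SCRIPT = Template("""\
INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table}
PARTITION (${falcon_output_partitions_hive})
SELECT *
FROM ${falcon_input1_database}.${falcon_input1_table} i1
JOIN ${falcon_input2_database}.${falcon_input2_table} i2
  ON i1.common_id = i2.common_id
WHERE ${falcon_input1_partition_filter_hive}
  AND ${falcon_input2_partition_filter_hive}
""")

# Hypothetical values, chosen only to make the example concrete.
EXAMPLE_PARAMS = {
    "falcon_output_database": "analytics",
    "falcon_output_table": "combined",
    "falcon_output_partitions_hive": "ds='2015-08-14-10-00'",
    "falcon_input1_database": "raw",
    "falcon_input1_table": "weblogs",
    "falcon_input2_database": "raw",
    "falcon_input2_table": "transactions",
    "falcon_input1_partition_filter_hive": "i1.ds = '2015-08-14-09-00'",
    "falcon_input2_partition_filter_hive": "i2.ds = '2015-08-14-10-00'",
}


def render(params: dict) -> str:
    """Fill in every ${...} placeholder; raises KeyError if one is missing."""
    return HIVE_SCRIPT.substitute(params)


if __name__ == "__main__":
    print(render(EXAMPLE_PARAMS))
```

Using `substitute` rather than `safe_substitute` makes a missing parameter fail loudly, which is the kind of specification error that is otherwise only caught at deployment time.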
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Processes
• Process chaining
• Previously used a pull model, but it was limited to Sqoop- and Oozie-based ingests
• Need to support external ingestion tools (e.g. ETL)
• Push model enabled by availability flag
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Collaboration on Falcon (Done)
• Falcon REST API trusted user support
• Impersonation is possible
• Necessary for our Middleware layer
• FALCON-1027
• Available in Falcon 0.8
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Wish List
• Retention policy
• For hundreds of tables there are hundreds of Oozie jobs launched at the same
time to check the retention
• Kerberos
• Kerberos ticket for Falcon principal expires after 1 day and Falcon needs to
renew it
• Workaround: Falcon restarts twice a day
• Explicitly triggered run of a process (off schedule)
• Version based retention policy (not only time based)
• Support for streaming
• Additional storage systems (e.g. HBase)
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Contacts
• Ivo Lasek (ivo.lasek@merck.com)
• Twitter: @ilasek
• http://www.merck.com/
• http://www.msdit.cz/
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon Features: What’s New in 0.9?
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
$whoami
• Pallavi Rao
• Architect, InMobi
• Committer, Apache Falcon
• Contributor, Apache PIG (on Spark)
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
New Features
• Import from DB and export to a DB.
• Native Scheduler
• Enhanced Falcon Unit API
• Hive DR replication metrics via CLI
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Import/ Export
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Management Actions
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Missing Piece
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Import
Falcon
Feed
• Different Modes of extraction: Full or incremental
• Different Modes of output (merge): Snapshot, append
• Include/Exclude columns
RDBMS
HDFS
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Export
Falcon
Feed
• Different Modes of Load: Insert, update-only
• Include/ Exclude columns
RDBMS
HDFS
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Native Scheduler
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Build a Native Scheduler?
• Falcon uses Oozie for:
• DAG Execution
• Scheduling - Gaps Exist
• Simple periodic scheduling without data gating
• Cron + calendar based scheduling with/without data gating.
• Flexible data gating
• Support for a-periodic datasets and triggers based on data
availability.
• Support for external triggers.
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheduler - Before
Falcon
Server Scheduler
Execution
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheduler - The Plan
• Time based scheduling - Available in 0.9
• Data based gating - Will be available in 0.10
• Complete parity with Oozie and additional features - The release after.
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheduler - After
[Diagram: the Falcon Server now contains the Scheduler and hands execution to any DAG executor]
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Additional Benefits
• Understands the notion of a pipeline
• Better throttling primitives
• Prioritization and backlog catch up
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon Unit
47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Motivation for Falcon Unit
• User errors caught only at deployment time
• Input/ Output feeds and paths not getting resolved
• Errors in specification
• Integration Tests require environment setup/teardown.
• Messy deployment scripts
• Time consuming
• Debugging was cumbersome.
• Logs scattered
48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon Unit
Falcon
Unit
In Process execution
env.
• Local Oozie
• Local File System
• Local Job Runner
• Local Message
Queue
Actual cluster
• Oozie
• HDFS
• YARN
• Active MQ
Test
suite
49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What You Can Test
Data
Management
Data
Governance
Process
Management
● Data creation
● Data injection
● Retention
● Replication
● Lineage
● Data availability for verification
● Validation of definition
● Entity scheduling and status verification
● Correctness of data window being picked up
● Reruns
● Missing dependencies/properties
Future
Available in 0.8
Available in 0.9
50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
For More Information
• https://cwiki.apache.org/confluence/display/FALCON/Release+Notes
• https://blogs.apache.org/falcon/entry/what_s_new_in_falcon
• http://falcon.apache.org
51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo
52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo Pipeline
In
RDBMS
HDFS
Falcon Feed
Falcon Process
Copy Cat Out
HDFS
RDBMS
53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive Disaster Recovery
• Hive Event based replication
• Hive should set hive.metastore.event.listeners property to
org.apache.hive.hcatalog.listener.DbNotificationListener
• Requires Hive version 1.2.0 or above.
• Uses Falcon Recipe framework to support Hive DR.
• Requires Bootstrap operation from user.
• Will replicate: DB, Table, Partition
– Add/drop partition, update, delete, alter
• Won't replicate: Views, roles, direct HDFS writes without registering Metadata
54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
New Features in 0.10
55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Server Side Extensions
• Provide capability to add Falcon extensions that can be used to
provide a specific data management function
– Data anonymization, masking etc.
• Managed and accessed like other standard Falcon entities
– UI, CLI and REST API access
• Better manageability than client side recipes
• Types of extensions
– Trusted/provided extensions: out-of-the-box (OOTB) extensions that run in
the Falcon context
– Custom extensions: user-defined recipes. Extension cooking is done outside
the Falcon context in a new process.
56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Server side Extensions (Cont.)
• Extension Repository Management
– Templatized entities and parameterized workflows used during extension
cooking to realize well constructed Falcon entities are referred to as
extension artifacts
– The extension artifact store is an HDFS-based store that Falcon
maintains to hold the extension artifacts
– It should be configured using the “*.extension.store.uri” property in the Falcon
startup properties
– REST API/CLI support should be provided for extension store
management
57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Integration
• Support Spark as a processing option with Hive, Pig and Oozie workflows
• Enables users to easily do data management functions using Spark
• Both Java and Python applications are supported
58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Ingestion and Export
• Relational Database Ingestion and Export
– Falcon supports defining and scheduling import and export jobs
– Supports Datasource as a top level abstraction
– Leverages Sqoop 1 internally
– Now supports HCatalog tables as Source and Target for Export and Import
– Support for jceks based password alias
– WIP to support resource throttling on Data Sources
• Support for other types of Data Sources
– WIP to support data sources other than Relational databases
59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS Snapshot Management and Replication
• Use Case
– Cost effective replication that only copies modified blocks
– Provides ability to rollback in case of data corruption
• Falcon will use server side extensions to implement this feature
• Extension will do the following
– Create the snapshot on source directory
– Replicate the directory using current and previous snapshot (If exists)
– Create snapshot on Target directory
• Snapshot retention policy
– Users can specify age limit and N number of snapshots to retain.
– Falcon will delete snapshots on source and target that are older than the age
limit while retaining at least N snapshots.
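The snapshot retention rule (delete what is older than the age limit, but always keep at least N) can be sketched as a small selection function. This is an illustrative sketch under stated assumptions, not the extension's code: the function name and parameters are hypothetical, and it assumes "at least N" means the N newest snapshots are always protected.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def snapshots_to_delete(snapshot_times: Iterable[datetime],
                        now: datetime,
                        age_limit: timedelta,
                        keep_at_least: int) -> List[datetime]:
    """Return the snapshot timestamps that the retention rule would delete.

    A snapshot is deleted only if it is older than age_limit AND it is not
    among the keep_at_least newest snapshots. Result is sorted oldest first.
    """
    newest_first = sorted(snapshot_times, reverse=True)
    protected = set(newest_first[:keep_at_least])
    return sorted(
        t for t in newest_first
        if now - t > age_limit and t not in protected
    )


if __name__ == "__main__":
    now = datetime(2016, 3, 24)
    snaps = [now - timedelta(days=d) for d in (1, 2, 10, 20, 30)]
    # With a 7-day limit and N = 4, only the oldest snapshot is removed.
    for t in snapshots_to_delete(snaps, now, timedelta(days=7), keep_at_least=4):
        print("delete", t.date())
```

The same function would be applied to the source and the target directory independently, since the slide states that both sides are cleaned up.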
60 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
New Feature : Cluster Entity Update
• Falcon will provide the ability to update a cluster entity without
having to delete and resubmit it
• Use Cases
– Update Hadoop installation from unsecure to secure.
– Update from non-HA to high availability
• Cluster entity update expects the underlying HDFS and Oozie
installations to remain the same
• Cluster update is only allowed by a super user while Falcon is in safe mode.
• Falcon will do the following
– Update cluster entity when server is in safe mode
– When Falcon starts in normal mode, the coordinator/bundle jobs for all
dependent Feed/Process entities will be updated in workflow engine
61 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Falcon Safe Mode
• Falcon server will support starting in safe-mode
• Use cases
– Supports rolling upgrades
– Useful while updating cluster entities
• When in safe-mode, users can do the following
– Read operations on all entity/instances
– Suspend or Kill feed/process instances
– Update cluster entity.
• When in safe-mode, users cannot do the following
– Submit entity operations.
– Schedule operations on feed/process
– Validate, touch, dry-run operations
– Delete entity
– Instance rerun/resume operations
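The safe-mode rules listed above amount to a small permission check. The sketch below is one way to express them; the operation labels and the function are hypothetical and do not correspond to Falcon API names, and the super-user condition for cluster updates is taken from the Cluster Entity Update slide.

```python
# Hypothetical operation labels that stay available in safe mode, per the slide:
# read operations, and suspend/kill of feed/process instances.
SAFE_MODE_ALLOWED = frozenset({"read", "suspend", "kill"})

# Everything else on the slide (submit, schedule, validate, touch, dry_run,
# delete, rerun, resume) is rejected while the server is in safe mode.


def is_allowed(operation: str, safe_mode: bool, is_superuser: bool = False) -> bool:
    """Return True if the operation may run in the given server mode."""
    if operation == "update_cluster":
        # Cluster update: only in safe mode, and only by a super user.
        return safe_mode and is_superuser
    if safe_mode:
        return operation in SAFE_MODE_ALLOWED
    return True


if __name__ == "__main__":
    for op in ("read", "kill", "submit", "update_cluster"):
        print(op, is_allowed(op, safe_mode=True, is_superuser=True))
```

An allow-list is used for safe mode so that any operation not explicitly named on the slide is rejected by default.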
62 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory
• On-premise and cloud hybrid Hadoop data pipeline
– Build pipeline for HDP data processing on Azure (e.g. Hive)
– Copy data to Azure blobs (e.g. aggregation result from Hive)
– Use Azure Machine Learning platform for predictive analysis
• Keep sensitive data (e.g. PII) on-premises for privacy and compliance
reasons
• Share non-sensitive data on cloud for cross-region replication,
recovery, data prediction, etc.
63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory
• Pipeline building and job tracking
64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Search and Lineage - Current
• Entity search: Filter by name subsequence, tags, …
• Instance search of one entity: Filter by time and status
• Lineage for succeeded instances
65 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Search and Lineage - New
• Global instance search
– Provide instance status summary
– Improve search performance
• Lineage for instances in all statuses
66 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Abraham Fine
Apache Sqoop 2
67 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
• Software Engineer at Cloudera
• Previously:
• Software Engineer at Yahoo!
• Software Engineer at BrightRoll
• Student at The University of Illinois
at Urbana-Champaign
Who am I?
68 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Committer – Apache Sqoop
69 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Apache Sqoop?
70 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
“Apache Sqoop™ is a tool
designed for efficiently transferring
bulk data between Apache Hadoop
and structured datastores such as
relational databases.”
71 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
So much more than that…
(S)FTP
72 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop 1
A brief overview
73 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop 1
• Based on Connectors
• Responsible for Metadata lookups, and Data Transfer
• Majority of connectors are JDBC based
• Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
• HBase Import, Avro Support, ...
74 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop 1 Architecture
Job
Submission
75 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop 1 Shortcomings
• Client needs…
• direct access to the database
• Access to Hadoop configuration
• Connectors strictly coupled with MapReduce
• No way to manage database passwords for users
• Resource management is difficult
• Client needs the JDK
• Very long complicated command line scripts
76 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop as a service
Sqoop 2
77 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop 2 Architecture
Repository
78 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop 2 Internals
79 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Connectors
• Connectors implement an interface that allows Sqoop to retrieve and
write data
• JDBC and HDFS are implemented with connectors
• They define the configuration needed to work with a type of data source
• Anyone can write connectors for Sqoop 2
80 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Links
• If connectors are classes, then links are instances
• Links define connections to individual data sources
• Links contain inputs which are values assigned to the configuration
specified in the link’s connector
• “Sensitive values” are hidden from the user and encrypted in the
repository
81 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sqoop 2 Internals
Connector
Link A
Input A.A Input A.B
Link B
Input B.A
82 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Jobs
• From link
• To link
• Some extra configuration (for resource management, etc…)
Job
Link A Link B FromJobConf ToJobConf
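The relationship between connectors, links, and jobs can be modelled with a few data classes. This is a conceptual sketch only and not the Sqoop 2 client API: every class, field, and value below is hypothetical, chosen to mirror the "connectors are classes, links are instances" analogy from the Links slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Connector:
    """Defines which configuration a link of this type must supply."""
    name: str
    config_keys: tuple
    sensitive_keys: frozenset = frozenset()


@dataclass
class Link:
    """A connection to one data source: a connector plus concrete inputs."""
    name: str
    connector: Connector
    inputs: dict

    def __post_init__(self):
        missing = set(self.connector.config_keys) - set(self.inputs)
        if missing:
            raise ValueError(f"link {self.name!r} is missing inputs: {sorted(missing)}")

    def visible_inputs(self) -> dict:
        """Inputs as a regular user would see them, with sensitive values hidden."""
        return {
            key: ("****" if key in self.connector.sensitive_keys else value)
            for key, value in self.inputs.items()
        }


@dataclass
class Job:
    """A transfer from one link to another, plus per-side job configuration."""
    from_link: Link
    to_link: Link
    from_config: dict = field(default_factory=dict)
    to_config: dict = field(default_factory=dict)


if __name__ == "__main__":
    jdbc = Connector("jdbc", ("connection_string", "username", "password"),
                     frozenset({"password"}))
    hdfs = Connector("hdfs", ("uri",))
    # The admin/DBA would create the links; a user would only assemble the job.
    source = Link("orders-db", jdbc, {"connection_string": "jdbc:example://db/orders",
                                      "username": "etl", "password": "secret"})
    target = Link("cluster", hdfs, {"uri": "hdfs://example/data"})
    job = Job(source, target, from_config={"table": "orders"},
              to_config={"output_dir": "/data/orders"})
    print(job.from_link.visible_inputs())
```

Separating links from jobs is what allows the two classes of user described on the following slide: passwords live in the link, and jobs only reference it.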
83 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
• Admin/DBA
• Sets up links and manages
passwords to databases
• User
• Sets up and runs jobs
2 Classes of Sqoop User
84 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo!
85 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Questions?
Abraham Fine
abefine@cloudera.com
https://www.linkedin.com/in/abrahamfine
@abrahamfine
86 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You

Mais conteúdo relacionado

Mais procurados

Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Hortonworks
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceHortonworks
 
Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters
Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters
Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters Hortonworks
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchHortonworks
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Hortonworks for Financial Analysts Presentation
Hortonworks for Financial Analysts PresentationHortonworks for Financial Analysts Presentation
Hortonworks for Financial Analysts PresentationHortonworks
 
Discover.hdp2.2.h base.final[2]
Discover.hdp2.2.h base.final[2]Discover.hdp2.2.h base.final[2]
Discover.hdp2.2.h base.final[2]Hortonworks
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPHortonworks
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSHortonworks
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...Hortonworks
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Hortonworks
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopHortonworks
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Hortonworks
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextHortonworks
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution Hortonworks
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks
 

Mais procurados (20)

Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
Hortonworks Protegrity Webinar: Leverage Security in Hadoop Without Sacrifici...
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
 
Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters
Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters
Discover HDP 2.1: Using Apache Ambari to Manage Hadoop Clusters
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop Search
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Hortonworks for Financial Analysts Presentation
Hortonworks for Financial Analysts PresentationHortonworks for Financial Analysts Presentation
Hortonworks for Financial Analysts Presentation
 
Discover.hdp2.2.h base.final[2]
Discover.hdp2.2.h base.final[2]Discover.hdp2.2.h base.final[2]
Discover.hdp2.2.h base.final[2]
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 

Semelhante a Falcon Meetup

Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Falcon Meetup

  • 1. March 2016 Data Movement & Management Meet-up
  • 2. Agenda (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
      • Networking
      • Brief introduction - Venkat Ranganathan
      • Falcon Use Case Discussion
      • Falcon 0.9 Release and Demo
      • New Features coming in 0.10
      • Hive DR - Balu Vellanki
      • Server side extensions - Sowmya Ramesh
      • ADF and Instance search - Ying Zheng
      • Hive based ingestion and export - Venkatesan Ramachandran
      • Spark integration - Peeyush
      • Sqoop 2 Features - Abraham Fine
  • 3. Falcon At a Glance
      > Falcon offers a high-level abstraction of key services for Hadoop data processing needs.
      > Complex data processing logic, such as late data handling and retries, is handled by Falcon instead of being hard-coded in data processing apps.
      > Falcon maximizes reuse and consistency, enabling faster development of data processing apps.
      Diagram labels: Data Processing Applications; Falcon Framework (Data Ingest and Replication, Scheduling and Coordination, Data Lifecycle Policies, Multi-Cluster Management, SLA Management).
  • 4. Usage Scenarios
      • Dataset Replication
          • Replicate datasets (whether HDFS files or Hive Tables) as part of your Disaster Recovery, Backup and Archival plans.
          • Falcon triggers processes for retries and handles late data arrival.
      • Dataset Lifecycle Management
          • Establish the retention policies for datasets.
          • Falcon schedules and handles eviction.
      • Dataset Lineage + Traceability
          • View coarse-grained dependencies between clusters, datasets and processes.
  • 5. Dataset Replication + Retention
      Diagram labels: Weblog Dataset and Recommendations Dataset, each with HDFS and Hive Tables storage and a retention policy on each.
  • 6. Datasets Across Environments
      • Disaster Recovery and Backup between environments
      • Publishing data between environments for Discovery
      Diagram labels: Site to Site, Site to Cloud.
  • 7. Falcon Example: Replication
      > Falcon manages process workflow and replication at different stages.
      > Enables data continuity without requiring full data representation.
      Diagram labels: Staged Data, Cleansed Data, Conformed Data, Presented Data, Processed Data; Replication of Staged Data and of a later stage.
  • 8. Falcon Example: Retention
      > Sophisticated retention policies expressed in one place.
      > Simplify data retention for audit, compliance, or data reprocessing.
      • Staged Data: Retain 5 Years
      • Cleansed Data: Retain 3 Years
      • Conformed Data: Retain 3 Years
      • Presented Data: Retain Last Copy Only
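The retention rules on this slide can be sketched in Python. This only illustrates time-based eviction and is not Falcon's implementation: the function name `instances_to_evict`, the `RETENTION` table, and the sample dates are hypothetical, and a year is approximated as 365 days.

```python
from datetime import datetime, timedelta

# Retention limits from the slide; None stands for "retain last copy only".
# A year is approximated as 365 days (an assumption of this sketch).
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,
}


def instances_to_evict(stage, instance_times, now):
    """Return the instance timestamps that fall outside the stage's retention."""
    limit = RETENTION[stage]
    if limit is None:
        # Keep only the newest instance and evict everything else.
        newest = max(instance_times, default=None)
        return sorted(t for t in instance_times if t != newest)
    return sorted(t for t in instance_times if now - t > limit)


# Hypothetical instances of one dataset, checked on the meet-up date.
now = datetime(2016, 3, 24)
instances = [datetime(2010, 1, 1), datetime(2014, 6, 1), datetime(2016, 3, 1)]
staged_evicted = instances_to_evict("staged", instances, now)
presented_evicted = instances_to_evict("presented", instances, now)
```

Keeping the limits in one table mirrors the slide's point that retention policies are expressed in one place rather than inside each application.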
  • 9. Falcon Example: Late Data Handling
      > Processing waits until all required input data is available.
      > Checks for late data arrivals and retriggers processing as necessary.
      > Eliminates writing complex data handling rules within applications.
      Diagram labels: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) feed Staged Data, which produces the Combined Dataset; wait up to 4 hours for FTP data to arrive.
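A minimal Python sketch of the gating idea on this slide, assuming a 4-hour wait window as in the diagram. The names `gate` and `needs_rerun`, the `"timed_out"` outcome, and the size-based change check are hypothetical; the slide does not say how Falcon detects late data or what happens when the window closes.

```python
from datetime import datetime, timedelta

MAX_WAIT = timedelta(hours=4)  # "Wait up to 4 hours for FTP data to arrive"


def gate(nominal_time, now, available, required):
    """Decide what to do for one process instance.

    Returns "run" when every required input is available, "wait" while the
    wait window is still open, and "timed_out" once it has closed.
    """
    missing = set(required) - set(available)
    if not missing:
        return "run"
    if now - nominal_time <= MAX_WAIT:
        return "wait"
    return "timed_out"


def needs_rerun(inputs_used, inputs_now):
    """Late-data check: rerun if any input differs from what the run consumed.

    Both arguments map an input name to its size in bytes (hypothetical).
    """
    return any(inputs_now.get(name) != size for name, size in inputs_used.items())


t0 = datetime(2015, 8, 14, 10, 0)
required = {"transactions", "weblogs"}
early = gate(t0, t0 + timedelta(hours=1), {"transactions"}, required)
ready = gate(t0, t0 + timedelta(hours=2), required, required)
late = gate(t0, t0 + timedelta(hours=5), {"transactions"}, required)
```

The point of the slide is that this logic lives in the platform, so individual applications do not need to carry their own versions of it.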
  • 10. Learn Falcon Using Tutorials
      • http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/
      • More to come…
      • Questions: please reach out to cnormile@hortonworks.com
  • 11. Falcon Usage in a Pharma Company - Ivo Lašek, 03/24/2016, Hadoop Data Management and Data Movement
  • 13. Search, Data Integration, Data Analytics, Open Data
  • 14. Public Brazilian Data
  • 17. Data Lake
  • 18. Data Lake
  • 19. Data Lake: Merge, Clean
  • 20. Data Lake: Merge, Clean
  • 21. Data Lake: Merge, Clean; Security and Data Governance
  • 22. Data Lake: Merge, Clean; Data Catalog; Security and Data Governance
  • 24. Falcon Usage
  • 25. Datasets and Feeds
      • HDFS and Hive based datasets
      • An HDFS folder or a Hive table is a single feed in Falcon
      • Our Dataset represents an HDFS folder or a collection of Hive tables
      • Our Dataset corresponds to one or more Falcon feeds
  • 26. Dataset Level Properties
      • Need to set dataset level properties, not table level
      • Retention policy, frequency etc.
      • Currently we use a middleware layer that translates datasets to feeds
      • Need to keep the primary information in the middleware layer
      • Potential synchronization issues
      • Falcon can’t be accessed directly
  • 27. Parametrized Scripts
      INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table}
      PARTITION (${falcon_output_partitions_hive})
      SELECT * FROM ${falcon_input1_database}.${falcon_input1_table} table1
      JOIN ${falcon_input2_database}.${falcon_input2_table} table2
      ON table1.common_id = table2.common_id
      WHERE ${falcon_input1_partition_filter_hive} AND ${falcon_input2_partition_filter_hive}
      Example of the expanded partition filters: WHERE ds = '2015-08-14-09-00' AND ds = '2015-08-14-10-00'
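How such a parametrized script gets filled in can be illustrated with Python's standard `string.Template`, which happens to use the same `${name}` placeholder syntax. The variable names are taken from the slide; every value in `params` is made up for illustration, and this sketch says nothing about how Falcon itself performs the substitution.

```python
from string import Template

# The script from the slide, with placeholders left intact.
SCRIPT = Template(
    "INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table} "
    "PARTITION (${falcon_output_partitions_hive}) "
    "SELECT * FROM ${falcon_input1_database}.${falcon_input1_table} table1 "
    "JOIN ${falcon_input2_database}.${falcon_input2_table} table2 "
    "ON table1.common_id = table2.common_id "
    "WHERE ${falcon_input1_partition_filter_hive} "
    "AND ${falcon_input2_partition_filter_hive}"
)

# Hypothetical values for a single process instance.
params = {
    "falcon_output_database": "default",
    "falcon_output_table": "combined",
    "falcon_output_partitions_hive": "ds='2015-08-14-10-00'",
    "falcon_input1_database": "default",
    "falcon_input1_table": "transactions",
    "falcon_input2_database": "default",
    "falcon_input2_table": "weblogs",
    "falcon_input1_partition_filter_hive": "ds = '2015-08-14-09-00'",
    "falcon_input2_partition_filter_hive": "ds = '2015-08-14-10-00'",
}

# substitute() raises KeyError if any placeholder has no value.
rendered = SCRIPT.substitute(params)
print(rendered)
```

Because the script never names a concrete table or partition, the same file can be reused for every instance of the process.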
  • 28. Processes
      • Process chaining
      • Previously used a pull model, but it was limited to Sqoop and Oozie based ingests
      • Need to support external ingestion tools (e.g. ETL)
      • Push model enabled by an availability flag
  • 29. Collaboration on Falcon (Done)
      • Falcon REST API trusted user support
      • Impersonation is possible
      • Necessary for our Middleware layer
      • FALCON-1027
      • Available in Falcon 0.8
  • 30. Wish List • Retention policy • For hundreds of tables, hundreds of Oozie jobs are launched at the same time to check retention • Kerberos • Kerberos ticket for the Falcon principal expires after 1 day and Falcon needs to renew it • Workaround: Falcon restarts twice a day • Explicitly triggered run of a process (off schedule) • Version-based retention policy (not only time-based) • Support for streaming • Additional storage types (e.g. HBase)
  • 31. Contacts • Ivo Lasek (ivo.lasek@merck.com) • Twitter: @ilasek • http://www.merck.com/ • http://www.msdit.cz/
  • 32. Falcon Features: What’s New in 0.9?
  • 33. $whoami • Pallavi Rao • Architect, InMobi • Committer, Apache Falcon • Contributor, Apache PIG (on Spark)
  • 34. New Features • Import from DB and export to a DB. • Native Scheduler • Enhanced Falcon Unit API • Hive DR replication metrics via CLI
  • 35. Data Import/ Export
  • 36. Data Management Actions
  • 37. The Missing Piece
  • 38. Data Import (RDBMS to HDFS via a Falcon feed) • Different modes of extraction: full or incremental • Different modes of output (merge): snapshot, append • Include/exclude columns
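The import options on the slide above can be made concrete with a toy, in-memory sketch. It does not use Falcon or Sqoop; the function names, the `check_column`/`last_value` parameters, and the row layout are all assumptions for illustration. "Incremental" is modeled here as "rows whose check column exceeds the last imported value", which is one common interpretation.

```python
def extract(rows, mode, check_column=None, last_value=None):
    """Select source rows: everything ('full') or only new rows ('incremental')."""
    if mode == "full":
        return list(rows)
    if mode == "incremental":
        # Assumption: new rows are those with check_column > last_value.
        return [r for r in rows if r[check_column] > last_value]
    raise ValueError(f"unknown extraction mode: {mode}")


def project(rows, include=None, exclude=()):
    """Keep only included columns (if given) and drop excluded ones."""
    def keep(col):
        return (include is None or col in include) and col not in exclude
    return [{c: v for c, v in r.items() if keep(c)} for r in rows]


def merge(target, extracted, mode):
    """Write to the target: replace it ('snapshot') or add to it ('append')."""
    if mode == "snapshot":
        return list(extracted)
    if mode == "append":
        return list(target) + list(extracted)
    raise ValueError(f"unknown merge mode: {mode}")


source = [
    {"id": 1, "name": "a", "secret": "x"},
    {"id": 2, "name": "b", "secret": "y"},
    {"id": 3, "name": "c", "secret": "z"},
]
already_imported = [{"id": 1, "name": "a"}]

new_rows = project(extract(source, "incremental", "id", 1), exclude=("secret",))
result = merge(already_imported, new_rows, "append")
print(result)
```

Swapping `"append"` for `"snapshot"` in the last step would discard the previously imported rows and keep only the newly extracted ones.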
  • 39. Data Export (HDFS to RDBMS via a Falcon feed) • Different modes of load: insert, update-only • Include/exclude columns
  • 40. Native Scheduler
  • 41. Why Build a Native Scheduler? • Falcon uses Oozie for: • DAG execution • Scheduling, where gaps exist: • Simple periodic scheduling without data gating • Cron + calendar based scheduling with/without data gating • Flexible data gating • Support for aperiodic datasets and triggers based on data availability • Support for external triggers
  • 42. Scheduler - Before (diagram labels: Falcon Server, Scheduler, Execution)
  • 43. Scheduler - The Plan • Time based scheduling - Available in 0.9 • Data based gating - Will be available in 0.10 • Complete parity with Oozie and additional features - The release after.
  • 44. Scheduler - After (diagram labels: Falcon Server, Scheduler, DAG Executor, Execution, ANY DAG Executor)
  • 45. Additional Benefits • Understands the notion of a pipeline • Better throttling primitives • Prioritization and backlog catch up
  • 46. Falcon Unit
  • 47. Motivation for Falcon Unit • User errors caught only at deployment time • Input/output feeds and paths not getting resolved • Errors in specification • Integration tests require environment setup/teardown • Messy deployment scripts • Time consuming • Debugging was cumbersome • Logs scattered
  • 48. Falcon Unit • In-process execution environment: local Oozie, local file system, local job runner, local message queue • Actual cluster: Oozie, HDFS, YARN, ActiveMQ • (diagram label: Test suite)
  • 49. What You Can Test • Column headings: Data Management, Data Governance, Process Management ● Data creation ● Data injection ● Retention ● Replication ● Lineage ● Data availability for verification ● Validation of definition ● Entity scheduling and status verification ● Correctness of data window being picked up ● Reruns ● Missing dependencies/properties • Legend: Future, Available in 0.8, Available in 0.9
  • 50. For More Information • https://cwiki.apache.org/confluence/display/FALCON/Release+Notes • https://blogs.apache.org/falcon/entry/what_s_new_in_falcon • http://falcon.apache.org
  • 51. Demo
  • 52. Demo Pipeline (diagram labels: In: RDBMS, HDFS; Falcon Feed; Falcon Process: Copy Cat; Out: HDFS, RDBMS)
  • 53. Hive Disaster Recovery • Hive event-based replication • Set the Hive property hive.metastore.event.listeners to org.apache.hive.hcatalog.listener.DbNotificationListener • Requires Hive version 1.2.0 or above • Uses the Falcon Recipe framework to support Hive DR • Requires a bootstrap operation from the user • Will replicate: DB, Table, Partition – Add/drop partition, update, delete, alter • Won't replicate: views, roles, direct HDFS writes that are not registered in the metadata
  • 54. New Features in 0.10
  • 55. Server Side Extensions • Provide the capability to add Falcon extensions that deliver a specific data management function – Data anonymization, masking etc. • Managed and accessed like other standard Falcon entities – UI, CLI and REST API access • Better manageability than client side recipes • Types of extensions – Trusted/provided extensions: out-of-the-box (OOTB) extensions that run in the Falcon context – Custom extensions: user-defined recipes; extension cooking is done outside the Falcon context in a new process
  • 56. Server Side Extensions (Cont.) • Extension Repository Management – Extension artifacts are the templatized entities and parameterized workflows used during extension cooking to produce well-constructed Falcon entities – The extension artifact store is an HDFS-based store that Falcon maintains to hold the extension artifacts – The store should be configured using the “*.extension.store.uri” property in the Falcon startup properties – REST API/CLI support should be provided for extension store management
  • 57. Spark Integration • Support Spark as a processing option with Hive, Pig and Oozie workflows • Enables users to easily do data management functions using Spark • Both Java and Python applications are supported
  • 58. Data Ingestion and Export • Relational Database Ingestion and Export – Falcon supports defining and scheduling import and export jobs – Supports Datasource as a top level abstraction – Leverages Sqoop 1 internally – Now supports HCatalog tables as source and target for export and import – Support for JCEKS-based password alias – WIP to support resource throttling on data sources • Support for other types of data sources – WIP to support data sources other than relational databases
  • 59. HDFS Snapshot Management and Replication • Use Case – Cost-effective replication that only copies modified blocks – Provides the ability to roll back in case of data corruption • Falcon will use server side extensions to implement this feature • The extension will do the following – Create the snapshot on the source directory – Replicate the directory using the current and previous snapshot (if one exists) – Create a snapshot on the target directory • Snapshot retention policy – Users can specify an age limit and a number N of snapshots to retain – Falcon will delete snapshots on source and target that are older than the age limit while retaining at least N snapshots
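The snapshot retention rule on the slide above (delete snapshots older than the age limit, but always keep at least N) can be sketched as a small selection function. This is an illustrative sketch only, not Falcon's implementation: the function name, the dict-based snapshot listing, and the assumption that the N snapshots kept are the newest ones are all hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, now, age_limit, min_keep):
    """Return names of snapshots to delete.

    snapshots: dict mapping snapshot name -> creation datetime (hypothetical layout)
    age_limit: timedelta; snapshots older than this are eligible for deletion
    min_keep:  number of snapshots that must always be retained (N on the slide)
    """
    # Newest first, so the first min_keep entries are the ones we always protect.
    ordered = sorted(snapshots.items(), key=lambda item: item[1], reverse=True)
    candidates = ordered[min_keep:]
    # Only unprotected snapshots that exceed the age limit are deleted.
    return [name for name, created in candidates if now - created > age_limit]


now = datetime(2016, 3, 10)
snapshots = {
    "s1": datetime(2016, 3, 1),
    "s2": datetime(2016, 3, 5),
    "s3": datetime(2016, 3, 8),
    "s4": datetime(2016, 3, 9),
}

# With a 3-day age limit, s1 and s2 are too old; min_keep decides what survives.
print(snapshots_to_delete(snapshots, now, timedelta(days=3), min_keep=2))
print(snapshots_to_delete(snapshots, now, timedelta(days=3), min_keep=3))
```

In the second call the older snapshot `s2` survives even though it exceeds the age limit, because it is among the three newest and the minimum count takes precedence.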
  • 60. New Feature: Cluster Entity Update • Falcon will provide the ability to update a cluster entity without having to delete and re-submit it • Use Cases – Update a Hadoop installation from unsecure to secure – Update from non-HA to high availability • Cluster entity update expects the underlying HDFS and Oozie installations to remain the same • Cluster update is only allowed for the super user while Falcon is in safe mode • Falcon will do the following – Update the cluster entity when the server is in safe mode – When Falcon starts in normal mode, the coordinator/bundle jobs for all dependent feed/process entities will be updated in the workflow engine
  • 61. Falcon Safe Mode • Falcon server will support starting in safe mode • Use cases – Supports rolling upgrades – Useful while updating cluster entities • When in safe mode, users can do the following – Read operations on all entities/instances – Suspend or kill feed/process instances – Update cluster entity • When in safe mode, users cannot do the following – Submit entity operations – Schedule operations on feed/process – Validate, touch, dry-run operations – Delete entity – Instance rerun/resume operations
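The safe-mode rules on the two slides above amount to a small permission table. The sketch below encodes them as a lookup; it is an assumption-laden illustration, not Falcon code. The `Op` enum, its member names, and `is_permitted` are hypothetical, and the superuser requirement for cluster updates is taken from the Cluster Entity Update slide.

```python
from enum import Enum, auto


class Op(Enum):
    """Hypothetical operation names, grouped as on the slide."""
    READ = auto()
    SUSPEND_INSTANCE = auto()
    KILL_INSTANCE = auto()
    UPDATE_CLUSTER = auto()
    SUBMIT = auto()
    SCHEDULE = auto()
    VALIDATE = auto()
    TOUCH = auto()
    DRY_RUN = auto()
    DELETE = auto()
    RERUN_INSTANCE = auto()
    RESUME_INSTANCE = auto()


# Operations the slide lists as available while the server is in safe mode.
ALLOWED_IN_SAFE_MODE = {
    Op.READ,
    Op.SUSPEND_INSTANCE,
    Op.KILL_INSTANCE,
    Op.UPDATE_CLUSTER,
}


def is_permitted(op, safe_mode, is_superuser=False):
    """Return True if the operation may run in the given server mode."""
    if op is Op.UPDATE_CLUSTER:
        # Cluster update: only for the superuser, and only in safe mode.
        return safe_mode and is_superuser
    if safe_mode:
        return op in ALLOWED_IN_SAFE_MODE
    # Normal mode: all other operations are available.
    return True


print(is_permitted(Op.SUBMIT, safe_mode=True))
print(is_permitted(Op.UPDATE_CLUSTER, safe_mode=True, is_superuser=True))
```

Keeping the allowed set explicit means any operation not listed is rejected in safe mode by default, which matches the slide's framing of safe mode as a restricted state.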
  • 62. Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory • On-premises and cloud hybrid Hadoop data pipeline – Build a pipeline for HDP data processing on Azure (e.g. Hive) – Copy data to Azure blobs (e.g. aggregation result from Hive) – Use the Azure Machine Learning platform for predictive analysis • Keep sensitive data (e.g. PII) on-premises for privacy and compliance reasons • Share non-sensitive data in the cloud for cross-region replication, recovery, data prediction, etc.
  • 63. Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory • Pipeline building and job tracking
  • 64. Search and Lineage - Current • Entity search: Filter by name subsequence, tags, … • Instance search of one entity: Filter by time and status • Lineage for succeeded instances
  • 65. Search and Lineage - New • Global instance search – Provide instance status summary – Improve search performance • Lineage for instances in all statuses
  • 66. Abraham Fine Apache Sqoop 2
  • 67. Who am I? • Software Engineer at Cloudera • Previously: • Software Engineer at Yahoo! • Software Engineer at BrightRoll • Student at The University of Illinois at Urbana-Champaign
  • 68. Committer – Apache Sqoop
  • 69. What is Apache Sqoop?
  • 70. “Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”
  • 71. So much more than that… (S)FTP
  • 72. Sqoop 1: A brief overview
  • 73. Sqoop 1 • Based on Connectors • Responsible for Metadata lookups, and Data Transfer • Majority of connectors are JDBC based • Non-JDBC (direct) connectors for optimized data transfer • Connectors responsible for all supported functionality • HBase Import, Avro Support, ...
  • 74. Sqoop 1 Architecture Job Submission
  • 75. Sqoop 1 Shortcomings • Client needs… • Direct access to the database • Access to Hadoop configuration • Connectors strictly coupled with MapReduce • No way to manage database passwords for users • Resource management is difficult • Client needs the JDK • Very long, complicated command line scripts
  • 76. Sqoop 2: Sqoop as a service
  • 77. Sqoop 2 Architecture Repository
  • 78. Sqoop 2 Internals
  • 79. Connectors • Connectors implement an interface that allows Sqoop to retrieve and write data • JDBC and HDFS are implemented with connectors • They define the configuration needed to work with a type of data source • Anyone can write connectors for Sqoop 2
  • 80. Links • If connectors are classes, then links are instances • Links define connections to individual data sources • Links contain inputs which are values assigned to the configuration specified in the link’s connector • “Sensitive values” are hidden from the user and encrypted in the repository
  • 81. Sqoop 2 Internals (diagram: a Connector with Link A, holding Input A.A and Input A.B, and Link B, holding Input B.A)
  • 82. Jobs • From link • To link • Some extra configuration (for resource management, etc.) • (diagram labels: Job, Link A, Link B, FromJobConf, ToJobConf)
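The connector, link and job relationship described on the slides above ("if connectors are classes, then links are instances") can be modeled in a few lines. This is a conceptual sketch only and does not use the Sqoop 2 API; every class, field and value below is hypothetical, and masking stands in for the repository-side encryption the slides mention.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Connector:
    """Defines which configuration a type of data source needs."""
    name: str
    config_keys: tuple
    sensitive_keys: frozenset = frozenset()


@dataclass
class Link:
    """A connection to one concrete data source: a connector plus input values."""
    name: str
    connector: Connector
    inputs: dict

    def __post_init__(self):
        # A link must supply a value for everything its connector asks for.
        missing = set(self.connector.config_keys) - set(self.inputs)
        if missing:
            raise ValueError(f"link {self.name!r} is missing inputs: {sorted(missing)}")

    def visible_inputs(self):
        """Inputs as a regular user would see them, with sensitive values hidden."""
        return {
            key: ("****" if key in self.connector.sensitive_keys else value)
            for key, value in self.inputs.items()
        }


@dataclass
class Job:
    """Moves data from one link to another, with per-direction configuration."""
    from_link: Link
    to_link: Link
    from_config: dict = field(default_factory=dict)
    to_config: dict = field(default_factory=dict)


jdbc = Connector("jdbc", ("url", "user", "password"), frozenset({"password"}))
hdfs = Connector("hdfs", ("uri",))

# An admin/DBA would create the links; a user would only define and run the job.
link_a = Link("orders-db", jdbc, {"url": "jdbc:mysql://db.example/orders",
                                  "user": "etl", "password": "s3cret"})
link_b = Link("lake", hdfs, {"uri": "hdfs://namenode.example:8020"})
job = Job(link_a, link_b, {"table": "orders"}, {"output_dir": "/data/orders"})

print(job.from_link.visible_inputs())
```

Separating `Link` from `Job` mirrors the two user classes in the talk: credentials live only on the link, so a job author never needs to handle them.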
  • 83. 2 Classes of Sqoop User • Admin/DBA – Sets up links and manages passwords to databases • User – Sets up and runs jobs
  • 84. Demo!
  • 85. Questions? Abraham Fine abefine@cloudera.com https://www.linkedin.com/in/abrahamfine @abrahamfine
  • 86. Thank You