In this presentation we show a bit more about this open source integration tool, and also a bit about the Hortonworks Data Flow (HDF) product.
With NiFi it is possible to integrate distinct sources such as APIs, databases, Hadoop, HDFS, etc.
Data Integration with Apache NiFi - Marco Garcia, Cetax
1. Data Integration with
Apache NiFi
Marco Garcia
CTO, Founder – Cetax, TutorPro
mgarcia@cetax.com.br
https://www.linkedin.com/in/mgarciacetax/
2. With more than 20 years of experience in IT, 18 of them exclusively with Business
Intelligence, Data Warehousing, and Big Data, Marco Garcia is certified by Kimball University
in the USA, where he was taught in person by Ralph Kimball, one of the foremost authorities on the
Data Warehouse.
1st Hortonworks Certified Instructor in LATAM
Data Architect and Instructor at Cetax Consultoria.
Introduction
4. • Remote sensor delivery (Internet of Things - IoT)
• Intra-site / Inter-site / global distribution (Enterprise)
• Ingest for feeding analytics (Big Data)
• Data Processing (Simple Event Processing)
Where do we find Data Flow?
6. Basics of Connecting Systems
For every connection,
these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
(Diagram: Producer P1 connected to Consumer C1)
8. IoAT Data Grows Faster Than We Consume It
Much of the new data
exists in-flight between
systems and devices as
part of the Internet of
Anything
The Opportunity
Unlock transformational business value
from a full fidelity of data and analytics
for all data.
Geolocation
Server logs
Files & emails
ERP, CRM, SCM
Traditional Data Sources
Internet of Anything
Sensors
and machines
Clickstream
Web & social
9. Internet of Anything is Driving New Requirements
Need trusted insights from data at the very edge to the data lake in real-time with full fidelity
Data generated by sensors, machines, geo-location devices, logs, clickstreams, social feeds, etc.
Modern applications need access to both data-in-motion and data-at-rest
IoAT data flows are multi-directional and point-to-point
Very different than existing ETL, data movement, and streaming technologies which are generally one direction
The perimeter is outside the data center and can be very jagged
This “Jagged Edge” creates new opportunity for security, data protection, data governance and provenance
10. Meeting IoAT Edge Requirements
GATHER / DELIVER / PRIORITIZE
Track from the edge through to the datacenter
Small Footprints: operate with very little power
Limited Bandwidth: can create high latency
Data Availability: exceeds transmission bandwidth; recoverability matters
Data Must Be Secured: throughout its journey, on both the data plane and control plane
11. The Need for Data Provenance
For Operators
• Traceability, lineage
• Recovery and replay
For Compliance
• Audit trail
• Remediation
For Business
• Value sources
• Value IT investment
(Diagram: data lineage traced from BEGIN to END)
12. The Need for Fine-grained Security and
Compliance
It’s not enough to say you have
encrypted communications
• Enterprise authorization
services –entitlements
change often
• People and systems with
different roles require
different access levels
• Tagged/classified data
13. Real-time Data Flow
It’s not just how quickly you
move data – it’s about how
quickly you can change behavior
and seize new opportunities
14. HDF Powered by Apache NiFi Addresses Modern
Data Flow Challenges
Aggregate all IoAT data from sensors, geo-location devices, machines, logs,
files, and feeds via a highly secure lightweight agent
Collect: Bring Together
• Logs
• Files
• Feeds
• Sensors
Mediate point-to-point and bi-directional data flows, delivering data reliably
to real-time applications and storage platforms such as HDP
Conduct: Mediate the Data Flow
• Deliver
• Secure
• Govern
• Audit
Parse, filter, join, transform, fork, and clone data in motion to
empower analytics and perishable insights
Curate: Gain Insights
• Parse
• Filter
• Transform
• Fork
• Clone
16. NiFi Developed by the National Security Agency
Developed by the NSA over the last 8 years.
"NSA's innovators work on some of the most challenging national security problems imaginable."
"Commercial enterprises could use it to quickly control, manage, and analyze the flow of information from geographically dispersed sites – creating comprehensive situational awareness"
-- Linda L. Burger, Director of the NSA Technology Transfer Program
17. A Brief History
2006: NiagaraFiles (NiFi) was first incepted at the National Security Agency (NSA).
November 2014: NiFi is donated to the Apache Software Foundation (ASF) through NSA's Technology Transfer Program and enters ASF's incubator.
July 2015: NiFi reaches ASF top-level project status.
18. Designed In Response to Real World Demands
Visual User Interface
Drag and drop for efficient, agile operations
Immediate Feedback
Start, stop, tune, replay dataflows in real-time
Adaptive to Volume and Bandwidth
Any data, big or small
Provenance Metadata
Governance, compliance & data evaluation
Secure Data Acquisition & Transport
Fine grained encryption for controlled data sharing
HDF Powered by
Apache NiFi
19. Apache NiFi
• Powerful and reliable system to process and
distribute data.
• Directed graphs of data routing and
transformation.
• Web-based User Interface for creating,
monitoring, & controlling data flows
• Highly configurable - modify data flow at runtime,
dynamically prioritize data
• Data Provenance tracks data through entire
system
• Easily extensible through development of custom components [1]
[1] https://nifi.apache.org/
20. NiFi Use Cases
Ingest Logs for Cyber Security:
Integrated and secure log collection for real-time
data analytics and threat detection
Feed Data to Streaming Analytics:
Accelerate big data ROI by streaming data into
analytics systems such as Apache Storm or Apache
Spark Streaming
Data Warehouse Offload:
Convert source data to streaming data and use
HDF for data movement before delivering it for
ETL processing. Enable ETL processing to be
offloaded to Hadoop without having to change
source systems.
Move Data Internally:
Optimize resource utilization by
moving data between data centers or
between on-premises infrastructure
and cloud infrastructure
Capture IoT Data:
Transport disparate and often remote
IoT data in real time, despite any
limitations in device footprint, power
or connectivity—avoiding data loss
Big Data Ingest
Easily and efficiently ingest data into Hadoop
22. Apache NiFi: The three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and
data plane
23. Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and
control
• Flow templates
• Multi-tenant
Authorization
• Designed for extension
• Clustering
24. Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
27. Primary Components
NiFi executes within a JVM living within a host operating system. The primary components of NiFi then living
within the JVM are as follows:
Web Server
• The purpose of the web server is to host NiFi’s HTTP-based command and control API.
Flow Controller
• The flow controller is the brains of the operation.
• It provides threads for extensions to run on and manages their schedule of when they’ll receive resources to
execute.
Extensions
• There are various types of extensions for NiFi which will be described in other documents.
• But the key point here is that extensions operate/execute within the JVM.
28. Primary Components (cont.)
FlowFile Repository
• The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is
presently active in the flow.
• The default approach is a persistent Write-Ahead Log that lives on a specified disk partition.
Content Repository
• The Content Repository is where the actual content bytes of a given FlowFile live.
• The default approach stores blocks of data in the file system.
• More than one file system storage location can be specified so as to get different physical partitions engaged
to reduce contention on any single volume.
Provenance Repository
• The Provenance Repository is where all provenance event data is stored.
• The repository construct is pluggable with the default implementation being to use one or more physical
disk volumes.
• Within each location event data is indexed and searchable.
29. NiFi Cluster
Starting with the NiFi 1.x/HDF-2.x release, a Zero-Master Clustering paradigm is employed.
NiFi Cluster Coordinator:
• A Cluster Coordinator is the node in a NiFi cluster that is responsible for managing the nodes in the cluster.
• Determines which nodes are allowed in the cluster.
• Provides the most up-to-date flow to newly joining nodes.
Nodes:
• Each cluster is made up of one or more nodes. The nodes do the actual data processing.
Primary Node:
• Every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors" (see below).
ZooKeeper Server:
• It is used to automatically elect a Primary Node and Cluster Coordinator.
We will learn about NiFi clusters in detail in the following lessons.
30. NiFi - User Interface
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
31. NiFi - Provenance
• Tracks data at each point as it flows
through the system
• Records, indexes, and makes
events available for display
• Handles fan-in/fan-out, i.e. merging
and splitting data
• View attributes and content at given
points in time
32. NiFi - Queue Prioritization
• Configure a prioritizer per connection
• Determine what is important for your
data – time based, arrival order,
importance of a data set
• Funnel many connections down to a
single connection to prioritize across
data sets
• Develop your own prioritizer if needed
33. NiFi - Extensibility
Built from the ground up with extensions in mind
Service-loader pattern for…
• Processors
• Controller Services
• Reporting Tasks
• Prioritizers
Extensions packaged as NiFi Archives (NARs)
• Deploy to the NiFi lib directory and restart
• Provides ClassLoader isolation
• Same model as standard components
34. NiFi - Security
Administration
Central management and consistent
security
• Automatic NiFi Cluster Coordinator and Primary Node election with ZooKeeper
• Multiple entry points
Authentication
Authenticate users and systems
• 2-Way SSL support out of the box; LDAP Integration; Kerberos Integration
Authorization
Provision access to data
• Multitenant Authorization
• File-based authority provider – Global and Component level Access policies
• Ranger Based Authority Provider
Audit
Maintain a record of data access
• Detailed logging of all user actions
• Detailed logging of key system behaviors
• Data Provenance enables unparalleled tracking from the edge through the Lake
Data Protection
Protect data at rest and in motion
• Support a variety of SSL/encrypted protocols
• Tag and utilize tags on data for fine grained access controls
• Encrypt/decrypt content using pre-shared key mechanisms
• Encrypted Passwords in Configuration Files
Initial Admin: manually designate an initial admin user granted access to the UI
Legacy Authorized Users: convert previously configured users and roles to the multi-tenant model
Cluster Node Identities: secure identities for each node
Where do we find Data Flow?
Nearly everything that moves has sensors nowadays, transferring data in and out of it.
Enterprise data flows from branches or data endpoints toward a central data center or a data hub before reaching the central warehouse.
Social media information: tweets, posts, comments, likes; clickstream data for analytics.
Simple messaging and processing of data as it arrives.
Simplistic View of Enterprise Data Flow
- The diagram above shows a simplistic view of enterprise data flow: how a data flow solution helps acquire, process, analyze, and store data.
Basics of Connecting Systems
When we look at basics of connecting systems these must agree:
Protocol
Format
Schema
Priority
Size of event
Frequency of event
Authorization access
Relevance
IoAT Data Grows Faster Than We Consume It
The emergence and explosion of Internet of Anything data has put tremendous pressure on existing platforms.
- The data from these new paradigm sources has created several key challenges:
Exponential Growth. As of 2013 there was an estimated 2.8ZB [Zettabyte] of data across the cybersphere, and that is expected to grow to 44ZB by 2020, with 85% of this data growth coming from new types of data including connected devices.
Varied Nature. The incoming data can have little or no structure, or structure that changes too frequently for reliable schema creation at time of ingest.
Value at High Volumes. The incoming data can have little or no value as individual records or small groups of records, but at high volumes and over longer historical perspectives it can be inspected for patterns and used for advanced analytic applications.
This New Data Paradigm opens up the Opportunity for both an architectural and business transformation that applies to virtually every industry.
Abbreviations:
Enterprise Resource Planning (ERP)
Customer Relationship Management (CRM)
Supply Chain Management (SCM)
Internet of Anything is Driving New Requirements
As more and more data is generated from the Internet of Anything (IoAT), including from sensors, geo-location devices, server logs, clicks, machines, social feeds, and any other data source at the edge, securely ingesting and processing data from the "jagged edge" has become a real technical challenge.
Customers and developers have had no choice but to create custom, disjointed, and loosely integrated solutions to solve the problem of analyzing data and providing insights.
Traditional data plus multiple streams from a variety of sources created the need for those custom solutions, thus driving up cost and complexity.
The IoAT data edges created specific data flow requirements that Hortonworks DataFlow satisfies:
Edges with small footprints operate with very little power
Limited bandwidth and high latency are commonplace
Data availability often exceeds transmission bandwidth
Data must be secured throughout its journey
Who Needs Data Provenance, and Why?
- For Operators- Traceability, lineage, Recovery and replay
- For Compliance - Audit trail, Remediation
- For Business - Value sources, Value IT investment
The Need for Fine-grained Security and Compliance
- LDAP Integration coming up as pluggable authentication
- User roles and control with different access levels.
- Tagging the data with priority or classification
Real-time Data Flow
LEVEREGE NOTES:
- The Leverege IoT platform makes extensive use of HDP already; they basically host the platform for customers like “Special Forces”
- They’re looking at NiFi to replace the Ingestors and Translators portion of their architecture
- NiFi would then flow the data into Kafka for downstream data delivery to real-time and historical analytic applications
- NiFi gives them the ability to add new data feeds (with corresponding NiFi processors) in a matter of hours (rather than days/weeks); they use a JSON spec file that contains the info needed to plumb in the new NiFi processor
- NiFi data provenance capabilities are a big value (knowing where data come from and tracking where/how it flows is a key operational capability)
- NiFi’s logging and tracing capabilities make it easy to debug dataflows, and NiFi’s ability to replay flows is invaluable as well (ex. they were able to replay a week’s worth of inbound data in an hour)
- They like the ability to fork a flow to plug in a new processor (agility is a key attribute)
- Leverege is not dealing with large volumes (ex. only dealing with thousands of messages per minute) so they have no input into scalability / sizing yet
- NiFi is currently running on 2 servers
PRESCIENT EDGE NOTES:
- “Traveler Safety” is a key application they provide
- They built their own “data curation” toolset (comprised of lots of Python scripts) for getting data from a range of sources
- 355 independent data sources, with many sources being aggregators of other data sources; so they deal with a total of ~3,500 sources in aggregate
- Sources are mostly from IP endpoints such as Twitter feeds to Closed Caption video feeds (that they’re interested in scraping through the video file for travel security-related breaking news items)
- Existing tools lacked data provenance, so they looked at NiFi and got very excited at its capabilities
- They wrapped their existing toolset of Python scripts as NiFi processors which makes them available with NiFi tool with consistent provenance capabilities
- NiFi provides the "data curation" and “fork in the road” capability they need to deal with data before storing in SAP Hana (and potentially other data systems including HDP)
- SAP Hana provides a COTS solution for geo-coding, language translation from 37 languages, and visualization abilities thought SAP tools for their “Traveler Safety” app
- They’re using SAP tools since it helped them accelerate time to solution (i.e. they don’t have a lot of time and resources to build analytic apps and visualizations from raw open source tools)
- Their application is able to dynamically draw threat zones and, with NiFi, they are able to tie back to the specific data sources that were involved in flagging the threat
WARGAMING.NET NOTES:
- Using Hadoop (CDH in pre-prod and Oracle BDA 12 nodes / 700TB) in prod
- Lots of data in relational DBMSs
- Logs in MySQL, managing schemas, changing databases, etc.
- Funnel all data into Oracle BDA on a daily batch for Impala and Hive and then Oracle database for downstream aggregated reports and Tableau
- Looking to use NiFi to front-end data flow that forks into Kafka and HDFS (they use Avro to format the HDFS data)
- Using Kafka for enterprise analytical events/messaging bus; while NiFi may do some similar things, they’re committed to Kafka as the standard messaging protocol
- They also aggregate game stats (how many kills, shots fired, etc.) and store those logs into S3 using Amazon Kinesis; they then pull down from there for analytic needs with Hadoop
- They essentially see NiFi as the data Collector and pipeline Conductor that ultimately forks the data flow into a Kafka stream and an HDFS stream
- The thing they like about NiFi is that it enables them to hand a runbook and the NiFi tool to the Ops team, who can operate the dataflows, start/stop processors when needed, etc. without a Java developer having to be involved every time something goes wrong or generates warnings/errors. Fewer beepers for developers == good.
HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
- HDF provides three key capabilities: the ability to collect data from different types of data sources via a highly secure lightweight agent, the ability to mediate the data flow to/from the data source and the “collector”, and the ability to trace, parse, and transform data in motion to enable analytics and derive insights within an operationally relevant time window.
Systems fail
Networks fail, disks fail, software crashes, people make mistakes.
Data access exceeds capacity to consume
Sometimes a given data source can outpace some part of the processing or delivery chain - it only takes one weak link to have an issue.
Boundary conditions are mere suggestions
You will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
What is noise one day becomes signal the next
Priorities of an organization change - rapidly. Enabling new flows and changing existing ones must be fast.
Systems evolve at different rates
The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
Compliance and security
Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, accountable.
Continuous improvement occurs in production
It is often not possible to come even close to replicating production environments in the lab.
Hortonworks: Powering the Future of Data
NiFi Developed by the National Security Agency
Hortonworks DataFlow is based on technology originally created by the NSA, which encountered big data collection and processing issues at a scale and stage beyond most enterprise implementations today.
DataFlow was designed from the start to meet timely decision-making needs: collecting and analyzing data from a wide range of disparate sources securely, efficiently, and over geographically dispersed and possibly fragmented networks, the likes of which are becoming commonplace in many industries today.
Deployed at scale for almost a decade before being contributed to the open source community, Hortonworks DataFlow has proven to be an excellent and effective tool that integrates the most common current and future needs of big data acquisition and ingestion for accurately informed, on-time decision making.
A Brief History Of NiFi
2006 - NiagaraFiles (NiFi) was first incepted at the National Security Agency (NSA)
November 2014 - NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters ASF’s incubator.
July 2015 - NiFi reaches ASF top-level project status
HDF Designed In Response to Real World Demands
HDF provides a number of benefits to customers, developers, data stewards including:
Use of standard open source software with the Hortonworks DataFlow powered by Apache NiFi and Hortonworks Data Platform powered by Apache Hadoop
An easy, web-based, seamless experience that allows for simple drag and drop design, control, feedback, and monitoring of all data sources “Off the Shelf”
A highly configurable solution that optimizes for high throughput and low bandwidth on all data
Fine-grained provenance metadata supporting compliance and governance
Secure end-to-end data routing includes encryption & compression
SSL, SSH, HTTPS, encrypted content
Pluggable role-based authentication/authorization
Apache NiFi
Powerful and reliable system to process and distribute data.
Directed graphs of data routing and transformation.
Web-based User Interface for creating, monitoring, & controlling data flows
Highly configurable - modify data flow at runtime, dynamically prioritize data
Data Provenance tracks data through entire system
Easily extensible through development of custom components
HDF Use Cases
They optimize their Splunk investment by pre-filtering data before sending to Splunk for storage.
They ingest logs for cyber security and threat detection.
They feed data to streaming analytics engines like Apache Spark or Apache Storm (both of which ship with Hortonworks Data Platform).
They move their own data internally between data centers on premises or to the cloud.
And of course, they capture data from the Internet of Things. HDF was originally designed to be robust, so that it could continue to move data despite varying device footprints or fluctuating power or connectivity levels. The data keeps flowing, without being lost in transit.
Predictive Analytics - Ensure the highest value data is captured and available for analysis
Fraud Detection - Move sales transaction data in real time to analyze on demand
Accelerated Data Collection - An integrated, data collection platform with full transparency into provenance and flow of data
IoT Optimization - Secure, Prioritize, Enrich and Trace data at the edge
Big Data Ingest - Easily and efficiently ingest data into Hadoop
You can find more Details on use cases below:
http://hortonworks.com/hdf/use-cases/
The Three Central Themes of NiFi
Really solid flow control/management of bidirectional data flow.
Fine-grained detail on data and its life cycle, with a UI that addresses enterprise data governance problems.
Rock-solid security of data and control.
Apache NiFi – Key Features
Guaranteed Delivery
A core philosophy of NiFi has been that even at very high scale, guaranteed delivery is a must. This is achieved through effective use of a purpose-built persistent write-ahead log and content repository. Together they are designed in such a way as to allow for very high transaction rates, effective load-spreading, copy-on-write, and play to the strengths of traditional disk read/writes.
Data Buffering w/ Back Pressure and Pressure Release
NiFi supports buffering of all queued data as well as the ability to provide back pressure as those queues reach specified limits or to age off data as it reaches a specified age (its value has perished).
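The buffering behavior just described can be sketched as a bounded queue between two processors: when the queue reaches its threshold, back pressure stops the upstream producer from being scheduled, and an age-off policy discards data whose value has perished. A toy Python model (names are illustrative, not NiFi's API):

```python
import time
from collections import deque

class BackPressureConnection:
    """Bounded queue between processors with back pressure and age-off."""

    def __init__(self, back_pressure_threshold, max_age_seconds):
        self.queue = deque()                 # (enqueue time, flowfile) pairs
        self.threshold = back_pressure_threshold
        self.max_age = max_age_seconds

    def back_pressure_engaged(self):
        # While True, the upstream processor should not be scheduled.
        return len(self.queue) >= self.threshold

    def offer(self, flowfile, now=None):
        if self.back_pressure_engaged():
            return False                     # producer must wait
        self.queue.append((now if now is not None else time.time(), flowfile))
        return True

    def poll(self, now=None):
        now = now if now is not None else time.time()
        # Pressure release: age off expired data before handing any out.
        while self.queue and now - self.queue[0][0] > self.max_age:
            self.queue.popleft()             # its value has perished
        return self.queue.popleft()[1] if self.queue else None
```

Offering into a full connection fails until a consumer drains it, and data older than `max_age_seconds` is silently dropped on the next poll, which mirrors the two relief valves described above.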
Prioritized Queuing
NiFi allows the setting of one or more prioritization schemes for how data is retrieved from a queue. The default is oldest first, but there are times when data should be pulled newest first, largest first, or some other custom scheme.
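A prioritizer in this sense is essentially a comparator applied when data is pulled from a queue. A hedged sketch using Python's `heapq` (class and strategy names are hypothetical, not NiFi's):

```python
import heapq
import itertools

class PrioritizedQueue:
    """Queue whose dequeue order is governed by a pluggable prioritizer.

    The prioritizer is a key function: lower key = dequeued first,
    mirroring how a prioritizer ranks FlowFiles on a connection.
    """
    _counter = itertools.count()

    def __init__(self, prioritizer):
        self.prioritizer = prioritizer
        self.heap = []

    def enqueue(self, flowfile):
        # The counter breaks ties stably, preserving arrival order.
        heapq.heappush(
            self.heap,
            (self.prioritizer(flowfile), next(self._counter), flowfile))

    def dequeue(self):
        return heapq.heappop(self.heap)[2]

# Example strategies; each flowfile here is a dict of attributes.
oldest_first  = lambda ff: ff["timestamp"]     # the default behavior
newest_first  = lambda ff: -ff["timestamp"]
largest_first = lambda ff: -ff["size"]
```

Swapping the key function changes the dequeue order without touching the queue itself, which is the essence of a pluggable prioritization scheme.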
Flow Specific QoS (latency v throughput, loss tolerance, etc.)
There are points of a dataflow where the data is absolutely critical and it is loss intolerant. There are also times when it must be processed and delivered within seconds to be of any value. NiFi enables the fine-grained flow specific configuration of these concerns.
Data Provenance
NiFi automatically records, indexes, and makes available provenance data as objects flow through the system even across fan-in, fan-out, transformations, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
Recovery / Recording a rolling buffer of fine-grained history
NiFi’s content repository is designed to act as a rolling buffer of history. Data is removed only as it ages off the content repository or as space is needed. This combined with the data provenance capability makes for an incredibly useful basis to enable click-to-content, download of content, and replay, all at a specific point in an object’s lifecycle which can even span generations.
Visual Command and Control
Dataflows can become quite complex. Being able to visualize those flows and express them visually can help greatly to reduce that complexity and to identify areas that need to be simplified. NiFi enables not only the visual establishment of dataflows but it does so in real-time. Rather than being design and deploy it is much more like molding clay. If you make a change to the dataflow that change immediately takes effect. Changes are fine-grained and isolated to the affected components. You don’t need to stop an entire flow or set of flows just to make some specific modification.
Flow Templates
Dataflows tend to be highly pattern oriented and while there are often many different ways to solve a problem, it helps greatly to be able to share those best practices. Templates allow subject matter experts to build and publish their flow designs and for others to benefit and collaborate on them.
Security
System to system
A dataflow is only as good as it is secure. NiFi at every point in a dataflow offers secure exchange through the use of protocols with encryption such as 2-way SSL. In addition NiFi enables the flow to encrypt and decrypt content and use shared-keys or other mechanisms on either side of the sender/recipient equation.
User to system
NiFi enables 2-Way SSL authentication and provides pluggable authorization so that it can properly control a user’s access and at particular levels (read-only, dataflow manager, admin). If a user enters a sensitive property like a password into the flow, it is immediately encrypted server side and never again exposed on the client side even in its encrypted form.
Designed for Extension
NiFi is at its core built for extension and as such it is a platform on which dataflow processes can execute and interact in a predictable and repeatable manner.
Points of extension
Processors, Controller Services, Reporting Tasks, Prioritizers, Custom User Interfaces
Classloader Isolation
For any component-based system, dependency nightmares can quickly occur. NiFi addresses this by providing a custom class loader model, ensuring that each extension bundle is exposed to a very limited set of dependencies. As a result, extensions can be built with little concern for whether they might conflict with another extension. The concept of these extension bundles is called NiFi Archives and will be discussed in greater detail in the developer’s guide.
Clustering (scale-out)
NiFi is designed to scale out through the use of clustering many nodes together as described above. If a single node is provisioned and configured to handle hundreds of MB/s, then a modest cluster could be configured to handle GB/s. This then brings about interesting challenges of load balancing and fail-over between NiFi and the systems from which it gets data. Use of asynchronous queuing-based protocols like messaging services, Kafka, etc. can help. Use of NiFi's site-to-site feature is also very effective, as it is a protocol that allows NiFi and a client (which could be another NiFi cluster) to talk to each other, share information about loading, and exchange data on specific authorized ports.
Flow Based Programming (FBP)
Introducing Flow Based Programming fundamentals, why they matter, and how NiFi adopts them
FlowFile
Unit of data moving through the system
Content + Attributes (key/value pairs)
Processor
Performs the work, can access FlowFiles
Connection
Links between processors
Queues that can be dynamically prioritized
Process Group
Set of processors and their connections
Receive data via input ports, send data via output ports
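The FBP terms above can be made concrete with a toy model: FlowFiles as content plus attributes, processors as black boxes, and connections as queues. This is a pedagogical sketch, not NiFi's actual classes:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    # Each object moving through the system: content + key/value attributes.
    content: bytes
    attributes: dict = field(default_factory=dict)

class Connection:
    """Bounded buffer linking processors; lets them run at differing rates."""
    def __init__(self):
        self.queue = deque()
    def put(self, ff): self.queue.append(ff)
    def get(self): return self.queue.popleft() if self.queue else None

class UppercaseProcessor:
    """A black box: reads from its input connection, transforms, writes out."""
    def __init__(self, inbound, outbound):
        self.inbound, self.outbound = inbound, outbound
    def on_trigger(self):
        ff = self.inbound.get()
        if ff is not None:
            ff.content = ff.content.upper()
            ff.attributes["transformed"] = "true"
            self.outbound.put(ff)

# Wire a tiny flow: source connection -> processor -> sink connection.
src, sink = Connection(), Connection()
proc = UppercaseProcessor(src, sink)
src.put(FlowFile(b"hello nifi", {"filename": "greeting.txt"}))
proc.on_trigger()   # a flow controller would schedule this repeatedly
```

In the real system the flow controller owns the scheduling of `on_trigger`-style callbacks, and a process group would bundle the processors and connections behind input and output ports.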
NiFi Architecture
Introducing the architecture of NiFi.
NiFi executes within a JVM living within a host operating system. The primary components of NiFi then living within the JVM are described in following slides.
Primary Components
NiFi executes within a JVM living within a host operating system. The primary components of NiFi then living within the JVM are as follows:
Web Server
The purpose of the web server is to host NiFi’s HTTP-based command and control API.
Flow Controller
The flow controller is the brains of the operation.
It provides threads for extensions to run on and manages their schedule of when they’ll receive resources to execute.
Extensions
There are various types of extensions for NiFi which will be described in other documents.
But the key point here is that extensions operate/execute within the JVM.
Custom processors
NiFi Plugins for applications to talk to Ports
Controller services
Primary Components
FlowFile Repository
The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow.
The default approach is a persistent Write-Ahead Log that lives on a specified disk partition.
Content Repository
The Content Repository is where the actual content bytes of a given FlowFile live.
The default approach stores blocks of data in the file system.
More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume.
Provenance Repository
The Provenance Repository is where all provenance event data is stored.
The repository construct is pluggable with the default implementation being to use one or more physical disk volumes.
Within each location event data is indexed and searchable.
NiFi Cluster Components
NiFi is also able to operate within a cluster, components are:
NiFi Cluster Coordinator:
A Cluster Coordinator is the node in a NiFi cluster that is responsible for managing the nodes in the cluster.
Determines which nodes are allowed in the cluster.
Provides the most up-to-date flow to newly joining nodes.
NiFi Nodes:
These nodes do the actual data processing.
Primary Node:
An elected node on which it is possible to run “Isolated Processors”.
We will learn about NiFi clusters in detail in the following lessons.
NiFi User Interface
The NiFi User Interface (UI) provides mechanisms for creating automated dataflows, as well as visualizing, editing, monitoring, and administering those dataflows.
The UI can be broken down into several segments, each responsible for different functionality of the application.
This section provides screenshots of the application and highlights the different segments of the UI.
When the application is started, the user is able to navigate to the User Interface by going to the default address of http://<hostname>:8080/nifi in a web browser.
There are no permissions configured by default, so anyone is able to view and modify the dataflow.
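Because command and control is HTTP-based, that same unsecured default endpoint can be scripted against. A hedged standard-library sketch: it assumes an unsecured NiFi 1.x instance on port 8080 and the `/nifi-api/flow/status` endpoint from the NiFi REST API documentation.

```python
import json
import urllib.request

def flow_status_url(host, port=8080):
    # Build the URL for NiFi's flow status endpoint.
    return f"http://{host}:{port}/nifi-api/flow/status"

def fetch_flow_status(host, port=8080, timeout=5):
    """Fetch controller status (active threads, queued FlowFiles, etc.).

    Works only against an instance with no authentication configured,
    which is the default setup described above.
    """
    with urllib.request.urlopen(flow_status_url(host, port),
                                timeout=timeout) as resp:
        return json.load(resp)

# Usage (requires a running, unsecured NiFi instance):
#   status = fetch_flow_status("localhost")
#   print(status["controllerStatus"]["queued"])
```

This is also why the next step after standing up NiFi is normally to configure TLS and an authorizer: anything reachable on that port can read and modify the flow.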
Data Provenance
While monitoring a dataflow, users often need a way to determine what happened to a particular data object (FlowFile).
NiFi’s Data Provenance page provides that information.
Because NiFi records and indexes data provenance details as objects flow through the system, users may perform searches, conduct troubleshooting and evaluate things like dataflow compliance and optimization in real time.
By default, NiFi updates this information every five minutes, but that is configurable.
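The record-index-search cycle described here can be sketched as an append-only event store with a simple per-FlowFile index. This is a hypothetical structure for illustration; NiFi's real provenance events carry many more fields.

```python
from collections import defaultdict

class ProvenanceRepository:
    """Append-only provenance events, indexed for search and lineage walks."""

    def __init__(self):
        self.events = []                       # append-only event log
        self.by_flowfile = defaultdict(list)   # index: flowfile id -> events

    def record(self, event_type, flowfile_id, component, details=""):
        event = {
            "id": len(self.events),
            "type": event_type,        # e.g. RECEIVE, ROUTE, SEND, DROP
            "flowfile": flowfile_id,
            "component": component,
            "details": details,
        }
        self.events.append(event)
        self.by_flowfile[flowfile_id].append(event)
        return event

    def lineage(self, flowfile_id):
        # Every recorded step for one FlowFile, in order: its provenance trail.
        return [(e["type"], e["component"]) for e in self.by_flowfile[flowfile_id]]

    def search(self, event_type):
        return [e for e in self.events if e["type"] == event_type]
```

Because events are only ever appended and indexed, queries like "show me everything that happened to this FlowFile" stay cheap even as the log grows, which is what makes troubleshooting and compliance audits practical.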
NiFi - Queue Prioritization
Configure a prioritizer per connection
Determine what is important for your data – time based, arrival order, importance of a data set
Funnel many connections down to a single connection to prioritize across data sets
Develop your own prioritizer if needed
NiFi – Extensibility
Built from the ground up with extensions in mind
Extensions packaged as NiFi Archives (NARs)
Deploy NiFi lib directory and restart
Provides ClassLoader isolation
Same model as standard components
Service-loader pattern for…
Processors
Controller Services
Reporting Tasks
Prioritizers
NiFi Security
NiFi provides several different configuration options for security purposes.
The most important properties are those under the "security properties" heading in the nifi.properties file.
NiFi supports user authentication via client certificates or via username/password.
Username/password authentication is performed by a Login Identity Provider.
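The Login Identity Provider is a pluggable contract: given a username and password, either authenticate and return an identity or reject. A minimal sketch of the shape of that contract (class and method names are hypothetical; NiFi's real providers are Java components configured in login-identity-providers.xml):

```python
import hashlib
import hmac
import os

class SimpleLoginIdentityProvider:
    """Toy username/password provider with salted-hash credential storage.

    Mirrors only the contract: authenticate(username, password) returns
    an identity string on success, or None on failure.
    """

    def __init__(self):
        self._users = {}   # username -> (salt, password hash)

    def add_user(self, username, password):
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        self._users[username] = (salt, digest)

    def authenticate(self, username, password):
        record = self._users.get(username)
        if record is None:
            return None
        salt, expected = record
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        # Constant-time comparison to avoid timing side channels.
        return username if hmac.compare_digest(digest, expected) else None
```

A real provider would typically delegate to LDAP or Kerberos rather than store credentials itself, but the authenticate-or-reject shape of the interface is the same.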