In this presentation we show a bit more about this open source integration tool, and also a bit about the Hortonworks Data Flow (HDF) product.
With NiFi it is possible to integrate distinct sources such as APIs, databases, Hadoop, HDFS, etc.
Data Integration with Apache NiFi - Marco Garcia, Cetax
1. Data Integration with
Apache NiFi
Marco Garcia
CTO, Founder – Cetax, TutorPro
mgarcia@cetax.com.br
https://www.linkedin.com/in/mgarciacetax/
2. With more than 20 years of experience in IT, 18 of them exclusively with Business
Intelligence, Data Warehousing, and Big Data, Marco Garcia is certified by Kimball University
in the USA, where he was taught in person by Ralph Kimball, one of the foremost authorities on the
Data Warehouse.
1st Hortonworks Certified Instructor in LATAM
Data Architect and Instructor at Cetax Consultoria.
Introduction
4. • Remote sensor delivery (Internet of Things - IoT)
• Intra-site / Inter-site / global distribution (Enterprise)
• Ingest for feeding analytics (Big Data)
• Data Processing (Simple Event Processing)
Where do we find Data Flow?
6. Basics of Connecting Systems
For every connection,
these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
(Diagram: Producer P1 connected to Consumer C1)
8. IoAT Data Grows Faster Than We Consume It
Much of the new data
exists in-flight between
systems and devices as
part of the Internet of
Anything
The Opportunity
Unlock transformational business value
from a full fidelity of data and analytics
for all data.
Geolocation
Server logs
Files & emails
ERP, CRM, SCM
Traditional Data Sources
Internet of Anything
Sensors
and machines
Clickstream
Web & social
9. Internet of Anything is Driving New Requirements
Need trusted insights from data at the very edge to the data lake in real-time with full fidelity
Data generated by sensors, machines, geo-location devices, logs, clickstreams, social feeds, etc.
Modern applications need access to both data-in-motion and data-at-rest
IoAT data flows are multi-directional and point-to-point
Very different than existing ETL, data movement, and streaming technologies which are generally one direction
The perimeter is outside the data center and can be very jagged
This “Jagged Edge” creates new opportunity for security, data protection, data governance and provenance
10. Meeting IoAT Edge Requirements
GATHER / DELIVER / PRIORITIZE
Track from the edge through to the datacenter
Small Footprints: operate with very little power
Limited Bandwidth: can create high latency
Data Availability: exceeds transmission bandwidth; recoverability matters
Data Must Be Secured: throughout its journey, on both the data plane and control plane
11. The Need for Data Provenance
For Operators
• Traceability, lineage
• Recovery and replay
For Compliance
• Audit trail
• Remediation
For Business
• Value sources
• Value IT investment
(Diagram: data lineage traced from BEGIN to END)
12. The Need for Fine-grained Security and
Compliance
It’s not enough to say you have
encrypted communications
• Enterprise authorization
services –entitlements
change often
• People and systems with
different roles require
different access levels
• Tagged/classified data
13. Real-time Data Flow
It’s not just how quickly you
move data – it’s about how
quickly you can change behavior
and seize new opportunities
14. HDF Powered by Apache NiFi Addresses Modern
Data Flow Challenges
Aggregate all IoAT data from sensors, geo-location devices, machines, logs,
files, and feeds via a highly secure lightweight agent
Collect: Bring Together
• Logs
• Files
• Feeds
• Sensors
Mediate point-to-point and bi-directional data flows, delivering data reliably
to real-time applications and storage platforms such as HDP
Conduct: Mediate the Data Flow
• Deliver
• Secure
• Govern
• Audit
Parse, filter, join, transform, fork, and clone data in motion to
empower analytics and perishable insights
Curate: Gain Insights
• Parse
• Filter
• Transform
• Fork
• Clone
16. NiFi Developed by the National Security Agency
Developed by the NSA over the last 8 years.
"NSA's innovators work on some of the most challenging national security problems imaginable."
"Commercial enterprises could use it to quickly control, manage, and analyze the flow of information from geographically dispersed sites – creating comprehensive situational awareness"
-- Linda L. Burger, Director of the NSA Technology Transfer Program
17. A Brief History
2006: NiagaraFiles (NiFi) was first incepted at the National Security Agency (NSA).
November 2014: NiFi is donated to the Apache Software Foundation (ASF) through NSA's Technology Transfer Program and enters ASF's incubator.
July 2015: NiFi reaches ASF top-level project status.
18. Designed In Response to Real World Demands
Visual User Interface
Drag and drop for efficient, agile operations
Immediate Feedback
Start, stop, tune, replay dataflows in real-time
Adaptive to Volume and Bandwidth
Any data, big or small
Provenance Metadata
Governance, compliance & data evaluation
Secure Data Acquisition & Transport
Fine grained encryption for controlled data sharing
HDF Powered by
Apache NiFi
19. Apache NiFi
• Powerful and reliable system to process and
distribute data.
• Directed graphs of data routing and
transformation.
• Web-based User Interface for creating,
monitoring, & controlling data flows
• Highly configurable - modify data flow at runtime,
dynamically prioritize data
• Data Provenance tracks data through entire
system
• Easily extensible through development of custom components [1]
[1] https://nifi.apache.org/
20. NiFi Use Cases
Ingest Logs for Cyber Security:
Integrated and secure log collection for real-time
data analytics and threat detection
Feed Data to Streaming Analytics:
Accelerate big data ROI by streaming data into
analytics systems such as Apache Storm or Apache
Spark Streaming
Data Warehouse Offload:
Convert source data to streaming data and use
HDF for data movement before delivering it for
ETL processing. Enable ETL processing to be
offloaded to Hadoop without having to change
source systems.
Move Data Internally:
Optimize resource utilization by
moving data between data centers or
between on-premises infrastructure
and cloud infrastructure
Capture IoT Data:
Transport disparate and often remote
IoT data in real time, despite any
limitations in device footprint, power
or connectivity—avoiding data loss
Big Data Ingest
Easily and efficiently ingest data into Hadoop
22. Apache NiFi: The three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and
data plane
23. Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and
control
• Flow templates
• Multi-tenant
Authorization
• Designed for extension
• Clustering
24. Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
27. Primary Components
NiFi executes within a JVM living within a host operating system. The primary components of NiFi then living
within the JVM are as follows:
Web Server
• The purpose of the web server is to host NiFi’s HTTP-based command and control API.
Flow Controller
• The flow controller is the brains of the operation.
• It provides threads for extensions to run on and manages their schedule of when they’ll receive resources to
execute.
Extensions
• There are various types of extensions for NiFi which will be described in other documents.
• But the key point here is that extensions operate/execute within the JVM.
28. Primary Components (cont.)
FlowFile Repository
• The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is
presently active in the flow.
• The default approach is a persistent Write-Ahead Log that lives on a specified disk partition.
Content Repository
• The Content Repository is where the actual content bytes of a given FlowFile live.
• The default approach stores blocks of data in the file system.
• More than one file system storage location can be specified so as to get different physical partitions engaged
to reduce contention on any single volume.
Provenance Repository
• The Provenance Repository is where all provenance event data is stored.
• The repository construct is pluggable with the default implementation being to use one or more physical
disk volumes.
• Within each location event data is indexed and searchable.
29. NiFi Cluster
Starting with the NiFi 1.x/HDF-2.x release, a Zero-Master Clustering paradigm is employed.
NiFi Cluster Coordinator:
• A Cluster Coordinator is the node in a NiFi cluster that is responsible for managing the nodes in the cluster.
• Determines which nodes are allowed in the cluster.
• Provides the most up-to-date flow to newly joining nodes.
Nodes:
• Each cluster is made up of one or more nodes. The nodes do the actual data processing.
Primary Node:
• Every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors" (see below).
ZooKeeper Server:
• It is used to automatically elect a Primary Node and Cluster Coordinator.
We will learn about NiFi clusters in detail in the following lessons.
30. NiFi - User Interface
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
31. NiFi - Provenance
• Tracks data at each point as it flows
through the system
• Records, indexes, and makes
events available for display
• Handles fan-in/fan-out, i.e. merging
and splitting data
• View attributes and content at given
points in time
32. NiFi - Queue Prioritization
• Configure a prioritizer per connection
• Determine what is important for your
data – time based, arrival order,
importance of a data set
• Funnel many connections down to a
single connection to prioritize across
data sets
• Develop your own prioritizer if needed
33. NiFi - Extensibility
Built from the ground up with extensions in mind
Service-loader pattern for…
• Processors
• Controller Services
• Reporting Tasks
• Prioritizers
Extensions packaged as NiFi Archives (NARs)
• Deploy to the NiFi lib directory and restart
• Provides ClassLoader isolation
• Same model as standard components
34. NiFi - Security
Administration
Central management and consistent
security
• Automatic NiFi Cluster Coordinator and Primary Node election with ZooKeeper
• Multiple entry points
Authentication
Authenticate users and systems
• 2-Way SSL support out of the box; LDAP Integration; Kerberos Integration
Authorization
Provision access to data
• Multitenant Authorization
• File-based authority provider – Global and Component level Access policies
• Ranger Based Authority Provider
Audit
Maintain a record of data access
• Detailed logging of all user actions
• Detailed logging of key system behaviors
• Data Provenance enables unparalleled tracking from the edge through the Lake
Data Protection
Protect data at rest and in motion
• Support a variety of SSL/encrypted protocols
• Tag and utilize tags on data for fine grained access controls
• Encrypt/decrypt content using pre-shared key mechanisms
• Encrypted Passwords in Configuration Files
Initial Admin: manually designate an initial admin user granted access to the UI
Legacy Authorized Users: convert previously configured users and roles to the multi-tenant model
Cluster Node Identities: secure identities for each node
Where do we find Data Flow?
Nearly everything that moves has sensors nowadays, transferring data in and out of it.
Enterprise data flows from branches or data endpoints toward a central data center or a data hub before reaching the central warehouse.
Social media information: tweets, posts, comments, likes; clickstream data for analytics.
Simple messaging and processing of data as it arrives.
Simplistic View of Enterprise Data Flow
- The diagram above shows a simplistic view of enterprise data flow: how a data flow solution helps acquire, process, analyze, and store data.
Basics of Connecting Systems
When we look at basics of connecting systems these must agree:
Protocol
Format
Schema
Priority
Size of event
Frequency of event
Authorization access
Relevance
IoAT Data Grows Faster Than We Consume It
The emergence and explosion of Internet of Anything data has put tremendous pressure on existing platforms.
- The data from these new paradigm sources has created several key challenges:
Exponential Growth. As of 2013 there was an estimated 2.8ZB [Zettabyte] of data across the cybersphere, and that is expected to grow to 44ZB by 2020, with 85% of this data growth coming from new types of data including connected devices.
Varied Nature. The incoming data can have little or no structure, or structure that changes too frequently for reliable schema creation at time of ingest.
Value at High Volumes. The incoming data can have little or no value as individual records or small groups of records, but at high volumes and over longer historical perspectives it can be inspected for patterns and used for advanced analytic applications.
This New Data Paradigm opens up the Opportunity for both an architectural and business transformation that applies to virtually every industry.
Abbreviations:
Enterprise Resource Planning (ERP)
Customer Relationship Management (CRM)
Supply Chain Management (SCM)
Internet of Anything is Driving New Requirements
As more and more data is generated from the Internet of Anything (IoAT), including from sensors, geo-location devices, server logs, clicks, machines, social feeds, and any other data source at the edge, securely ingesting and processing data from the "jagged edge" has become a real technical challenge.
Customers and developers have had no choice but to create custom, disjointed, and loosely integrated solutions to solve the problem of analyzing data and providing insights.
Traditional data plus multiple streams from a variety of sources created the need for those custom solutions, thus driving up cost and complexity.
The IoAT data edges created specific data flow requirements that Hortonworks DataFlow satisfies:
Edges with small footprints operate with very little power
Limited bandwidth and high latency are commonplace
Data availability often exceeds transmission bandwidth
Data must be secured throughout its journey
Who Needs Data Provenance, and Why?
- For Operators- Traceability, lineage, Recovery and replay
- For Compliance - Audit trail, Remediation
- For Business - Value sources, Value IT investment
The Need for Fine-grained Security and Compliance
- LDAP Integration coming up as pluggable authentication
- User roles and control with different access levels.
- Tagging the data with priority or classification
Real-time Data Flow
LEVEREGE NOTES:
- The Leverege IoT platform makes extensive use of HDP already; they basically host the platform for customers like “Special Forces”
- They’re looking at NiFi to replace the Ingestors and Translators portion of their architecture
- NiFi would then flow the data into Kafka for downstream data delivery to real-time and historical analytic applications
- NiFi gives them the ability to add new data feeds (with corresponding NiFi processors) in a matter of hours (rather than days/weeks); they use a JSON spec file that contains the info needed to plumb in the new NiFi processor
- NiFi data provenance capabilities are a big value (knowing where data come from and tracking where/how it flows is a key operational capability)
- NiFi’s logging and tracing capabilities make it easy to debug dataflows, and NiFi’s ability to replay flows is invaluable as well (ex. they were able to replay a week’s worth of inbound data in an hour)
- They like the ability to fork a flow to plug in a new processor (agility is a key attribute)
- Leverege is not dealing with large volumes (ex. only dealing with thousands of messages per minute) so they have no input into scalability / sizing yet
- NiFi is currently running on 2 servers
PRESCIENT EDGE NOTES:
- “Traveler Safety” is a key application they provide
- They built their own “data curation” toolset (comprised of lots of Python scripts) for getting data from a range of sources
- 355 independent data sources, with many sources being aggregators of other data sources; so they deal with a total of ~3,500 sources in aggregate
- Sources are mostly from IP endpoints such as Twitter feeds to Closed Caption video feeds (that they’re interested in scraping through the video file for travel security-related breaking news items)
- Existing tools lacked data provenance, so they looked at NiFi and got very excited at its capabilities
- They wrapped their existing toolset of Python scripts as NiFi processors which makes them available with NiFi tool with consistent provenance capabilities
- NiFi provides the "data curation" and “fork in the road” capability they need to deal with data before storing in SAP Hana (and potentially other data systems including HDP)
- SAP Hana provides a COTS solution for geo-coding, language translation from 37 languages, and visualization abilities thought SAP tools for their “Traveler Safety” app
- They’re using SAP tools since it helped them accelerate time to solution (i.e. they don’t have a lot of time and resources to build analytic apps and visualizations from raw open source tools)
- Their application is able to dynamically draw threat zones and, with NiFi, they are able to tie back to the specific data sources that were involved in flagging the threat
WARGAMING.NET NOTES:
- Using Hadoop (CDH in pre-prod and Oracle BDA 12 nodes / 700TB) in prod
- Lots of data in relational DBMSs
- Logs in MySQL, managing schemas, changing databases, etc.
- Funnel all data into Oracle BDA on a daily batch for Impala and Hive and then Oracle database for downstream aggregated reports and Tableau
- Looking to use NiFi to front-end data flow that forks into Kafka and HDFS (they use Avro to format the HDFS data)
- Using Kafka for enterprise analytical events/messaging bus; while NiFi may do some similar things, they’re committed to Kafka as the standard messaging protocol
- They also aggregate game stats (how many kills, shots fired, etc.) and store those logs into S3 using Amazon Kinesis; they then pull down from there for analytic needs with Hadoop
- They essentially see NiFi as the data Collector and pipeline Conductor that ultimately forks the data flow into a Kafka stream and an HDFS stream
- The thing they like about NiFi is that it enables them to hand a runbook and the NiFi tool to the Ops team, who can operate the dataflows, start/stop processors when needed, etc. without a Java developer having to be involved every time something goes wrong or generates warnings/errors. Fewer beepers for developers == good.
HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
- HDF provides three key capabilities: the ability to collect data from different types of data sources via a highly secure lightweight agent, the ability to mediate the data flow to/from the data source and the “collector”, and the ability to trace, parse, and transform data in motion to enable analytics and derive insights within an operationally relevant time window.
Systems fail
Networks fail, disks fail, software crashes, people make mistakes.
Data access exceeds capacity to consume
Sometimes a given data source can outpace some part of the processing or delivery chain - it only takes one weak link to have an issue.
Boundary conditions are mere suggestions
You will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
What is noise one day becomes signal the next
Priorities of an organization change - rapidly. Enabling new flows and changing existing ones must be fast.
Systems evolve at different rates
The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
Compliance and security
Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, accountable.
Continuous improvement occurs in production
It is often not possible to come even close to replicating production environments in the lab.
Hortonworks: Powering the Future of Data
NiFi Developed by the National Security Agency
Hortonworks DataFlow is based on technology originally created by the NSA, which encountered big data collection and processing issues at a scale and stage beyond most enterprise implementations today.
DataFlow was designed from the start to meet timely decision-making needs: collecting and analyzing data from a wide range of disparate sources securely, efficiently, and over geographically dispersed and possibly fragmented networks, the likes of which are becoming commonplace in many industries today.
Deployed at scale for almost a decade before being contributed to the open source community, Hortonworks DataFlow has proven to be an excellent and effective tool that integrates the most common current and future needs of big data acquisition and ingestion for accurately informed, on-time decision making.
A Brief History Of NiFi
2006 - NiagaraFiles (NiFi) was first incepted at the National Security Agency (NSA)
November 2014 - NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters ASF’s incubator.
July 2015 - NiFi reaches ASF top-level project status
HDF Designed In Response to Real World Demands
HDF provides a number of benefits to customers, developers, data stewards including:
Use of standard open source software with the Hortonworks DataFlow powered by Apache NiFi and Hortonworks Data Platform powered by Apache Hadoop
An easy, web-based, seamless experience that allows for simple drag and drop design, control, feedback, and monitoring of all data sources “Off the Shelf”
A highly configurable solution that optimizes for high throughput and low bandwidth on all data
Fine-grained provenance metadata supporting compliance and governance
Secure end-to-end data routing includes encryption & compression
SSL, SSH, HTTPS, encrypted content
Pluggable role-based authentication/authorization
Apache NiFi
Powerful and reliable system to process and distribute data.
Directed graphs of data routing and transformation.
Web-based User Interface for creating, monitoring, & controlling data flows
Highly configurable - modify data flow at runtime, dynamically prioritize data
Data Provenance tracks data through entire system
Easily extensible through development of custom components
HDF Use Cases
They optimize their Splunk investment by pre-filtering data before sending to Splunk for storage.
They ingest logs for cyber security and threat detection.
They feed data to streaming analytics engines like Apache Spark or Apache Storm (both of which ship with Hortonworks Data Platform).
They move their own data internally between data centers on premises or to the cloud.
And of course, they capture data from the Internet of Things. HDF was originally designed to be robust, so that it could continue to move data despite varying device footprints or fluctuating power or connectivity levels. The data keeps flowing, without being lost in transit.
Predictive Analytics - Ensure the highest value data is captured and available for analysis
Fraud Detection - Move sales transaction data in real time to analyze on demand
Accelerated Data Collection - An integrated, data collection platform with full transparency into provenance and flow of data
IoT Optimization - Secure, Prioritize, Enrich and Trace data at the edge
Big Data Ingest - Easily and efficiently ingest data into Hadoop
You can find more Details on use cases below:
http://hortonworks.com/hdf/use-cases/
The Three Central Themes of NiFi
Really solid flow control/management of bidirectional data flow.
Fine-grained detail on data and its life cycle, with a UI that addresses enterprise data governance problems.
Rock-solid security of data and control.
Apache NiFi – Key Features
Guaranteed Delivery
A core philosophy of NiFi has been that even at very high scale, guaranteed delivery is a must. This is achieved through effective use of a purpose-built persistent write-ahead log and content repository. Together they are designed in such a way as to allow for very high transaction rates, effective load-spreading, copy-on-write, and play to the strengths of traditional disk read/writes.
Data Buffering w/ Back Pressure and Pressure Release
NiFi supports buffering of all queued data as well as the ability to provide back pressure as those queues reach specified limits or to age off data as it reaches a specified age (its value has perished).
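The buffering behavior just described can be sketched as a bounded queue between two processors: when the queue reaches its threshold, back pressure stops the upstream producer from being scheduled, and an age-off policy discards data whose value has perished. A toy Python model (names are illustrative, not NiFi's API):

```python
import time
from collections import deque

class BackPressureConnection:
    """Bounded queue between processors with back pressure and age-off."""

    def __init__(self, back_pressure_threshold, max_age_seconds):
        self.queue = deque()                 # (enqueue time, flowfile) pairs
        self.threshold = back_pressure_threshold
        self.max_age = max_age_seconds

    def back_pressure_engaged(self):
        # While True, the upstream processor should not be scheduled.
        return len(self.queue) >= self.threshold

    def offer(self, flowfile, now=None):
        if self.back_pressure_engaged():
            return False                     # producer must wait
        self.queue.append((now if now is not None else time.time(), flowfile))
        return True

    def poll(self, now=None):
        now = now if now is not None else time.time()
        # Pressure release: age off expired data before handing any out.
        while self.queue and now - self.queue[0][0] > self.max_age:
            self.queue.popleft()             # its value has perished
        return self.queue.popleft()[1] if self.queue else None
```

Offering into a full connection fails until a consumer drains it, and data older than `max_age_seconds` is silently dropped on the next poll, which mirrors the two relief valves described above.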
Prioritized Queuing
NiFi allows the setting of one or more prioritization schemes for how data is retrieved from a queue. The default is oldest first, but there are times when data should be pulled newest first, largest first, or some other custom scheme.
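A prioritizer in this sense is essentially a comparator applied when data is pulled from a queue. A hedged sketch using Python's `heapq` (class and strategy names are hypothetical, not NiFi's):

```python
import heapq
import itertools

class PrioritizedQueue:
    """Queue whose dequeue order is governed by a pluggable prioritizer.

    The prioritizer is a key function: lower key = dequeued first,
    mirroring how a prioritizer ranks FlowFiles on a connection.
    """
    _counter = itertools.count()

    def __init__(self, prioritizer):
        self.prioritizer = prioritizer
        self.heap = []

    def enqueue(self, flowfile):
        # The counter breaks ties stably, preserving arrival order.
        heapq.heappush(
            self.heap,
            (self.prioritizer(flowfile), next(self._counter), flowfile))

    def dequeue(self):
        return heapq.heappop(self.heap)[2]

# Example strategies; each flowfile here is a dict of attributes.
oldest_first  = lambda ff: ff["timestamp"]     # the default behavior
newest_first  = lambda ff: -ff["timestamp"]
largest_first = lambda ff: -ff["size"]
```

Swapping the key function changes the dequeue order without touching the queue itself, which is the essence of a pluggable prioritization scheme.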
Flow Specific QoS (latency v throughput, loss tolerance, etc.)
There are points of a dataflow where the data is absolutely critical and it is loss intolerant. There are also times when it must be processed and delivered within seconds to be of any value. NiFi enables the fine-grained flow specific configuration of these concerns.
Data Provenance
NiFi automatically records, indexes, and makes available provenance data as objects flow through the system even across fan-in, fan-out, transformations, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
Recovery / Recording a rolling buffer of fine-grained history
NiFi’s content repository is designed to act as a rolling buffer of history. Data is removed only as it ages off the content repository or as space is needed. This combined with the data provenance capability makes for an incredibly useful basis to enable click-to-content, download of content, and replay, all at a specific point in an object’s lifecycle which can even span generations.
Visual Command and Control
Dataflows can become quite complex. Being able to visualize those flows and express them visually can help greatly to reduce that complexity and to identify areas that need to be simplified. NiFi enables not only the visual establishment of dataflows but it does so in real-time. Rather than being design and deploy it is much more like molding clay. If you make a change to the dataflow that change immediately takes effect. Changes are fine-grained and isolated to the affected components. You don’t need to stop an entire flow or set of flows just to make some specific modification.
Flow Templates
Dataflows tend to be highly pattern oriented and while there are often many different ways to solve a problem, it helps greatly to be able to share those best practices. Templates allow subject matter experts to build and publish their flow designs and for others to benefit and collaborate on them.
Security
System to system
A dataflow is only as good as it is secure. NiFi at every point in a dataflow offers secure exchange through the use of protocols with encryption such as 2-way SSL. In addition NiFi enables the flow to encrypt and decrypt content and use shared-keys or other mechanisms on either side of the sender/recipient equation.
User to system
NiFi enables 2-Way SSL authentication and provides pluggable authorization so that it can properly control a user’s access and at particular levels (read-only, dataflow manager, admin). If a user enters a sensitive property like a password into the flow, it is immediately encrypted server side and never again exposed on the client side even in its encrypted form.
Designed for Extension
NiFi is at its core built for extension and as such it is a platform on which dataflow processes can execute and interact in a predictable and repeatable manner.
Points of extension
Processors, Controller Services, Reporting Tasks, Prioritizers, Custom User Interfaces
Classloader Isolation
For any component-based system, dependency nightmares can quickly occur. NiFi addresses this by providing a custom class loader model, ensuring that each extension bundle is exposed to a very limited set of dependencies. As a result, extensions can be built with little concern for whether they might conflict with another extension. The concept of these extension bundles is called NiFi Archives and will be discussed in greater detail in the developer’s guide.
Clustering (scale-out)
NiFi is designed to scale out through the use of clustering many nodes together as described above. If a single node is provisioned and configured to handle hundreds of MB/s, then a modest cluster could be configured to handle GB/s. This then brings about interesting challenges of load balancing and fail-over between NiFi and the systems from which it gets data. Use of asynchronous queuing-based protocols like messaging services, Kafka, etc. can help. Use of NiFi's site-to-site feature is also very effective, as it is a protocol that allows NiFi and a client (which could be another NiFi cluster) to talk to each other, share information about loading, and exchange data on specific authorized ports.
Flow Based Programming (FBP)
Introducing Flow Based Programming fundamentals, why they matter, and how NiFi adopts them
FlowFile
Unit of data moving through the system
Content + Attributes (key/value pairs)
Processor
Performs the work, can access FlowFiles
Connection
Links between processors
Queues that can be dynamically prioritized
Process Group
Set of processors and their connections
Receive data via input ports, send data via output ports
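The FBP terms above can be made concrete with a toy model: FlowFiles as content plus attributes, processors as black boxes, and connections as queues. This is a pedagogical sketch, not NiFi's actual classes:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    # Each object moving through the system: content + key/value attributes.
    content: bytes
    attributes: dict = field(default_factory=dict)

class Connection:
    """Bounded buffer linking processors; lets them run at differing rates."""
    def __init__(self):
        self.queue = deque()
    def put(self, ff): self.queue.append(ff)
    def get(self): return self.queue.popleft() if self.queue else None

class UppercaseProcessor:
    """A black box: reads from its input connection, transforms, writes out."""
    def __init__(self, inbound, outbound):
        self.inbound, self.outbound = inbound, outbound
    def on_trigger(self):
        ff = self.inbound.get()
        if ff is not None:
            ff.content = ff.content.upper()
            ff.attributes["transformed"] = "true"
            self.outbound.put(ff)

# Wire a tiny flow: source connection -> processor -> sink connection.
src, sink = Connection(), Connection()
proc = UppercaseProcessor(src, sink)
src.put(FlowFile(b"hello nifi", {"filename": "greeting.txt"}))
proc.on_trigger()   # a flow controller would schedule this repeatedly
```

In the real system the flow controller owns the scheduling of `on_trigger`-style callbacks, and a process group would bundle the processors and connections behind input and output ports.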
NiFi Architecture
Introducing the architecture of NiFi.
NiFi executes within a JVM living within a host operating system. The primary components of NiFi then living within the JVM are described in following slides.
Primary Components
NiFi executes within a JVM living within a host operating system. The primary components of NiFi then living within the JVM are as follows:
Web Server
The purpose of the web server is to host NiFi’s HTTP-based command and control API.
Flow Controller
The flow controller is the brains of the operation.
It provides threads for extensions to run on and manages their schedule of when they’ll receive resources to execute.
Extensions
There are various types of extensions for NiFi which will be described in other documents.
But the key point here is that extensions operate/execute within the JVM.
Custom processors
NiFi Plugins for applications to talk to Ports
Controller services
Primary Components
FlowFile Repository
The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow.
The default approach is a persistent Write-Ahead Log that lives on a specified disk partition.
Content Repository
The Content Repository is where the actual content bytes of a given FlowFile live.
The default approach stores blocks of data in the file system.
More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume.
Provenance Repository
The Provenance Repository is where all provenance event data is stored.
The repository construct is pluggable with the default implementation being to use one or more physical disk volumes.
Within each location event data is indexed and searchable.
NiFi Cluster Components
NiFi is also able to operate within a cluster, components are:
NiFi Cluster Coordinator:
A Cluster Coordinator is the node in a NiFi cluster that is responsible for managing the nodes in the cluster.
Determines which nodes are allowed in the cluster.
Provides the most up-to-date flow to newly joining nodes.
NiFi Nodes:
These nodes do the actual data processing.
Primary Node:
An elected node on which it is possible to run “Isolated Processors”.
We will learn about NiFi clusters in detail in the following lessons.
NiFi User Interface
The NiFi User Interface (UI) provides mechanisms for creating automated dataflows, as well as visualizing, editing, monitoring, and administering those dataflows.
The UI can be broken down into several segments, each responsible for different functionality of the application.
This section provides screenshots of the application and highlights the different segments of the UI.
When the application is started, the user is able to navigate to the User Interface by going to the default address of http://<hostname>:8080/nifi in a web browser.
There are no permissions configured by default, so anyone is able to view and modify the dataflow.
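Because command and control is HTTP-based, that same unsecured default endpoint can be scripted against. A hedged standard-library sketch: it assumes an unsecured NiFi 1.x instance on port 8080 and the `/nifi-api/flow/status` endpoint from the NiFi REST API documentation.

```python
import json
import urllib.request

def flow_status_url(host, port=8080):
    # Build the URL for NiFi's flow status endpoint.
    return f"http://{host}:{port}/nifi-api/flow/status"

def fetch_flow_status(host, port=8080, timeout=5):
    """Fetch controller status (active threads, queued FlowFiles, etc.).

    Works only against an instance with no authentication configured,
    which is the default setup described above.
    """
    with urllib.request.urlopen(flow_status_url(host, port),
                                timeout=timeout) as resp:
        return json.load(resp)

# Usage (requires a running, unsecured NiFi instance):
#   status = fetch_flow_status("localhost")
#   print(status["controllerStatus"]["queued"])
```

This is also why the next step after standing up NiFi is normally to configure TLS and an authorizer: anything reachable on that port can read and modify the flow.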
Data Provenance
While monitoring a dataflow, users often need a way to determine what happened to a particular data object (FlowFile).
NiFi’s Data Provenance page provides that information.
Because NiFi records and indexes data provenance details as objects flow through the system, users may perform searches, conduct troubleshooting and evaluate things like dataflow compliance and optimization in real time.
By default, NiFi updates this information every five minutes, but that is configurable.
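The record-index-search cycle described here can be sketched as an append-only event store with a simple per-FlowFile index. This is a hypothetical structure for illustration; NiFi's real provenance events carry many more fields.

```python
from collections import defaultdict

class ProvenanceRepository:
    """Append-only provenance events, indexed for search and lineage walks."""

    def __init__(self):
        self.events = []                       # append-only event log
        self.by_flowfile = defaultdict(list)   # index: flowfile id -> events

    def record(self, event_type, flowfile_id, component, details=""):
        event = {
            "id": len(self.events),
            "type": event_type,        # e.g. RECEIVE, ROUTE, SEND, DROP
            "flowfile": flowfile_id,
            "component": component,
            "details": details,
        }
        self.events.append(event)
        self.by_flowfile[flowfile_id].append(event)
        return event

    def lineage(self, flowfile_id):
        # Every recorded step for one FlowFile, in order: its provenance trail.
        return [(e["type"], e["component"]) for e in self.by_flowfile[flowfile_id]]

    def search(self, event_type):
        return [e for e in self.events if e["type"] == event_type]
```

Because events are only ever appended and indexed, queries like "show me everything that happened to this FlowFile" stay cheap even as the log grows, which is what makes troubleshooting and compliance audits practical.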
NiFi - Queue Prioritization
Configure a prioritizer per connection
Determine what is important for your data – time based, arrival order, importance of a data set
Funnel many connections down to a single connection to prioritize across data sets
Develop your own prioritizer if needed
NiFi – Extensibility
Built from the ground up with extensions in mind
Extensions packaged as NiFi Archives (NARs)
Deploy NiFi lib directory and restart
Provides ClassLoader isolation
Same model as standard components
Service-loader pattern for…
Processors
Controller Services
Reporting Tasks
Prioritizers
NiFi Security
NiFi provides several different configuration options for security purposes.
The most important properties are those under the "security properties" heading in the nifi.properties file.
NiFi supports user authentication via client certificates or via username/password.
Username/password authentication is performed by a Login Identity Provider.
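The Login Identity Provider is a pluggable contract: given a username and password, either authenticate and return an identity or reject. A minimal sketch of the shape of that contract (class and method names are hypothetical; NiFi's real providers are Java components configured in login-identity-providers.xml):

```python
import hashlib
import hmac
import os

class SimpleLoginIdentityProvider:
    """Toy username/password provider with salted-hash credential storage.

    Mirrors only the contract: authenticate(username, password) returns
    an identity string on success, or None on failure.
    """

    def __init__(self):
        self._users = {}   # username -> (salt, password hash)

    def add_user(self, username, password):
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        self._users[username] = (salt, digest)

    def authenticate(self, username, password):
        record = self._users.get(username)
        if record is None:
            return None
        salt, expected = record
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        # Constant-time comparison to avoid timing side channels.
        return username if hmac.compare_digest(digest, expected) else None
```

A real provider would typically delegate to LDAP or Kerberos rather than store credentials itself, but the authenticate-or-reject shape of the interface is the same.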