Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, GetInData
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
About me
● Big Data DevOps Engineer - GetInData
● Focused on infrastructure, cloud, Big Data, AI, scalable
web applications
● Certified Google Cloud Architect
● Certified Kubernetes Administrator
Content
● Apache NiFi - Overview
● Kubernetes + NiFi = ?
● How to deploy?
● Managing CICD pipelines
● Managing ETL pipelines
● Observability of NiFi
● Lessons learnt
Use cases
Apache NiFi is a popular big data processing engine with a graphical web UI.
Use cases:
● Managing ETL pipelines
● Making streams from batches
● Downloading files, processing them, and saving them to the final destination
● etc.
Why Kubernetes?
● Makes it simple to run NiFi for another team
● Simplifies management of a complex NiFi ecosystem
Perception
[Diagram labels: Business logic, CI/CD, Idempotency, Reprocessing, Explainability, Monitoring, Testing, Serving, Infrastructure, Data Ingestion, Security]
Reality
[Diagram labels: Business logic, CI/CD, Idempotency, Reprocessing, Explainability, Monitoring, Testing, Serving, Infrastructure, Data Ingestion, Security]
Custom NiFi image
Custom NARs build phase
NiFi Plugins
NiFi Base image
Apache NiFi and Kubernetes
Apache NiFi was not designed to run on Kubernetes; running it there is like running a typical stateful Java app in a container.
What about NiFi Stateless?
● Great for simple pipelines, but for those we can use simpler tools
● We are stuck with the simplest steps
● We can lose data
Planning resources usage
NiFi can be a heavy-load service, and it is still easy to find processors that cause memory leaks, so we need to set resources wisely.
● JVM settings
● Final specs must be based on a PoC
● Fast storage may be key
● NiFi loves RAM
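One way to turn "NiFi loves RAM" into concrete JVM settings is to derive the heap size from the container memory limit. The sketch below is only illustrative: the function name is hypothetical and the 50% heap fraction is an assumption that should be validated in a PoC, not a NiFi recommendation. The rest of the limit is left for off-heap memory and for the OS page cache.

```python
def jvm_heap_args(container_limit_mib: int, heap_fraction: float = 0.5) -> list[str]:
    """Return -Xms/-Xmx flags sized as a fraction of the container memory limit.

    heap_fraction=0.5 is an assumed starting point, to be tuned in a PoC.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(container_limit_mib * heap_fraction)
    # Same initial and maximum heap, so the JVM does not resize it at runtime.
    return [f"-Xms{heap_mib}m", f"-Xmx{heap_mib}m"]


print(jvm_heap_args(8192))  # ['-Xms4096m', '-Xmx4096m']
```

The resulting flags would then be placed wherever the image configures JVM arguments for NiFi.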
Network performance
● What is the network performance between Kubernetes and the Hadoop cluster or the target storage, such as object storage?
● Verify the stability of the connections
● Consider using HDFS DataNodes as Kubernetes nodes
Storage performance
● How much data do we process in NiFi?
● How much data from provenance, content or flowfile
repositories do we need to store?
● Which storage can we add to our Kubernetes?
Planning migration
Migration is the perfect time to clean up NiFi pipelines.
● Start with the simplest pipelines
● Verify used processors
● Monitor resource usage
Helm Chart makes sense
An Apache NiFi deployment consists of many Services and ConfigMaps.
The simplest way to make it future-proof (ready to be reused for another instance) is to create a Helm chart with a dedicated values file.
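The idea of one chart plus a dedicated values file per instance can be sketched in Python. Helm overlays a values file on the chart defaults map by map; the function below imitates that behaviour under the assumption that nested maps are merged and everything else (including lists) is replaced. All keys and numbers are hypothetical and do not come from any real NiFi chart.

```python
from copy import deepcopy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides into defaults.

    Nested dicts are merged; any other value (including lists) is replaced.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults and a per-instance values file.
chart_defaults = {
    "replicaCount": 1,
    "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
    "registry": {"enabled": False},
}
ingest_instance = {
    "resources": {"limits": {"memory": "8Gi"}},
    "registry": {"enabled": True},
}

values = merge_values(chart_defaults, ingest_instance)
print(values["resources"]["limits"])  # {'cpu': '2', 'memory': '8Gi'}
```

Each additional NiFi instance then needs only its own small overrides file, while the chart itself stays unchanged.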
NiFi is a typical stateful app
Apache NiFi is a perfect example of an application that we can run as a StatefulSet on Kubernetes.
We need:
● Stable, unique network identifiers.
● Ordered, graceful deployment and scaling.
Kubernetes StatefulSet
A StatefulSet manages the deployment and scaling of a set of Pods and provides guarantees about the ordering and uniqueness of these Pods.
Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods.
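The "sticky identity" is easiest to see in the DNS names. With a headless governing Service, each Pod of a StatefulSet is reachable under a predictable name built from the StatefulSet name and an ordinal. The sketch below generates those names; the StatefulSet, Service and namespace names are hypothetical, and `cluster.local` is assumed as the cluster domain.

```python
def statefulset_pod_dns(
    name: str,
    service: str,
    namespace: str,
    replicas: int,
    cluster_domain: str = "cluster.local",
) -> list[str]:
    """Return the stable DNS name of every Pod in a StatefulSet.

    Pods are named <statefulset>-<ordinal>, with ordinals starting at 0.
    """
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


for host in statefulset_pod_dns("nifi", "nifi-headless", "etl", replicas=2):
    print(host)
# nifi-0.nifi-headless.etl.svc.cluster.local
# nifi-1.nifi-headless.etl.svc.cluster.local
```

Because these names survive Pod restarts, they can be used safely in NiFi configuration and in certificates.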
Cluster or single instances?
A clustered NiFi requires ZooKeeper instances, which makes the deployment more complicated and less robust.
Running many separate single-instance NiFis, each responsible for a smaller part of the data processing pipelines, is well suited to the Kubernetes world and can be even better than a cluster.
Each NiFi should trust each NiFi
When we use NiFi and NiFi Registry, remember to add all required certificates to the truststore.
If we want connections between separate NiFi instances, each instance can import the other's certificate using a simple OpenSSL command.
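The slide mentions OpenSSL; as an illustration only, the same certificate fetch can be sketched with Python's standard `ssl` module. The function names are hypothetical. The fingerprint helper is a way to compare the fetched certificate with the expected one before it is imported into a truststore.

```python
import hashlib
import ssl


def fetch_certificate_pem(host: str, port: int) -> str:
    """Download the certificate that a TLS server presents, as PEM text.

    The certificate is NOT validated here; verify its fingerprint before trusting it.
    """
    return ssl.get_server_certificate((host, port))


def sha256_fingerprint(pem: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()
```

A typical use would be to call `fetch_certificate_pem` with the host and HTTPS port of the other NiFi instance, compare `sha256_fingerprint` of the result with a value obtained through a trusted channel, and only then add the PEM file to the truststore.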
Certificates: keystore and truststore
HTTPS is a requirement for NiFi and NiFi Registry if we want to use the authentication and authorization layer, which is necessary for any production deployment.
● Import certificates from the external services we use, such as AD; this can be done during container startup.
● It’s a MUST-HAVE for any production-grade platform.
Certificates: keystore and truststore
Keystores and truststores can be managed in multiple ways.
● Manually created: stored in Secrets Manager, Vault, or K8s Secrets
● Dynamically created: stored in the pod, created by NiFi Toolkit while the container boots
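For the "manually created, stored in K8s Secrets" option, the following sketch builds a Secret manifest from keystore and truststore files. Kubernetes expects the values under `data` to be base64-encoded. The Secret name, namespace, file names and the placeholder bytes are hypothetical.

```python
import base64
import json


def secret_manifest(name: str, namespace: str, files: dict[str, bytes]) -> dict:
    """Build a Kubernetes Secret manifest holding binary files.

    Values under `data` must be base64-encoded strings.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            filename: base64.b64encode(content).decode("ascii")
            for filename, content in files.items()
        },
    }


# Placeholder bytes stand in for real keystore/truststore contents.
manifest = secret_manifest(
    "nifi-tls",
    "etl",
    {"keystore.jks": b"<keystore bytes>", "truststore.jks": b"<truststore bytes>"},
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be applied like any other manifest and the Secret mounted into the NiFi pod as a volume.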
Kerberos setup
A simple action that opens all doors to the Kerberized Hadoop world.
Three steps:
● Mount krb5.conf
● Install Kerberos client packages
● Create a headless keytab
Apache Ranger - robust assistant
Apache Ranger is the perfect match for NiFi to manage all permissions to its resources.
Three steps:
● Multistage Dockerfile with Ranger Plugin
● Policies settings, audit features
● Start and test if it works
Apache Ranger - how to add it?
Managing permissions from the Ranger UI is simple, but configuration on the NiFi side can be tough.
Audit:
● HDFS setup to send audit logs
● Kerberos setup for Infra Solr
Apache Ranger - how to add it?
There may still be open bugs and issues but, fortunately, we can overcome all challenges.
Issues:
● No buckets in the NiFi Registry with Ranger? Possibly it can’t map its roles to Ranger roles or users.
● Remember to check the plugin version and whether it supports your Ranger version.
● Do not forget to use a single JDK version.
CICD for NiFi and pipelines
The only CICD approach NiFi supports is to use NiFi Registry together with all its parameter features, but it may fail.
Apache NiFi Registry
Apache NiFi Registry was created to become a kind of Git repository for Apache NiFi pipelines.
● Set up the external database
● Read the release notes carefully
External & Custom Scripts
Migration to Kubernetes might be a great time to implement a full CICD pipeline for any scripts used by users.
● External storage
● Downloaded from an external source
● Built into the image
New environment, new challenges
Migration always shows how many components we can update, change or replace.
● Tests
● External resources
● Technical users
● Security
● Documentation
Divide and conquer
Running single-instance NiFi requires dividing our pipelines across separate instances.
● How to calculate required resources?
● How to divide?
● How to observe?
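One possible answer to "How to divide?" is to treat it as a bin-packing problem. The sketch below uses a first-fit-decreasing heuristic on estimated RAM per pipeline; this is an illustrative technique, not the method used in the talk, and the pipeline names and sizes are invented. Real estimates should come from monitoring the existing NiFi.

```python
def divide_pipelines(pipeline_ram_gib: dict[str, int], instance_ram_gib: int) -> list[list[str]]:
    """Assign pipelines to single NiFi instances with first-fit decreasing.

    Pipelines are placed largest first into the first instance with enough
    free RAM; a new instance is opened when none fits.
    """
    instances: list[list[str]] = []
    free: list[int] = []
    ordered = sorted(pipeline_ram_gib.items(), key=lambda item: item[1], reverse=True)
    for name, ram in ordered:
        if ram > instance_ram_gib:
            raise ValueError(f"{name} needs more RAM than one instance provides")
        for index, remaining in enumerate(free):
            if ram <= remaining:
                instances[index].append(name)
                free[index] -= ram
                break
        else:
            # No existing instance has room: start a new one.
            instances.append([name])
            free.append(instance_ram_gib - ram)
    return instances


# Hypothetical RAM estimates in GiB.
plan = divide_pipelines({"clickstream": 6, "billing": 3, "logs": 4, "exports": 2}, 8)
print(plan)  # [['clickstream', 'exports'], ['logs', 'billing']]
```

In practice the grouping would also consider ownership, security boundaries and shared dependencies, not only memory.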
NiFi strengths and weaknesses
What we loved: NiFi web UI allows lightning fast
development.
What we hated: It has serious limitations when
compared to programming languages.
Choosing NiFi does not mean avoiding writing any
code, but the amount of code that has to be written can
be significantly reduced.
Resources monitoring layer
Start with basic stuff: monitor Kubernetes and its resources.
Nodes
Resources
Events
Storage
Pods status
NiFi monitoring layer
Starting from 1.10.0, Apache NiFi ships a client that exposes default metrics to Prometheus.
● Push metrics to PushGateway
● Expose metrics to Prometheus
● Default metrics
● Custom metrics
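To show what "exposing metrics to Prometheus" looks like on the wire, here is a minimal parser for the Prometheus text exposition format. It is a simplified sketch: it assumes samples carry no trailing timestamp, and the metric and label names in the sample are illustrative rather than a guaranteed list of what NiFi exports.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Assumes each sample is '<series> <value>' with no timestamp.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


# Illustrative scrape output; names are assumptions.
scrape = """\
# HELP nifi_amount_bytes_read Total number of bytes read
# TYPE nifi_amount_bytes_read gauge
nifi_amount_bytes_read{instance="nifi-0"} 1048576
nifi_amount_threads_active{instance="nifi-0"} 3
"""

metrics = parse_metrics(scrape)
print(metrics['nifi_amount_threads_active{instance="nifi-0"}'])  # 3.0
```

A Prometheus server performs this scrape itself; a parser like this is mainly useful for quick checks that an endpoint returns the expected series.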
Prometheus - Stories
● Service discovery: simple on k8s
● Limited security
● Archived data: how old does the data need to be?
● Monitor the monitoring
Logs Analytics
In many cases, NiFi and NiFi Registry logs can show the root cause of issues when DEBUG logging is turned on.
● Use a logs analytics tool
● Do not run DEBUG in production
● Sidecars tailing logs are great
Logs analytics - which tool should I choose?
● Logs analytics for developers: Loki
● Logs analytics for business: Elasticsearch
ELK vs. Loki
Aspect | ELK | Loki + Promtail/Fluentd
Indexing | Keys and content of each key | Only labels
Query language | Query DSL or Lucene QL | LogQL
Tool for data visualisation | Kibana | Grafana
Query performance | Faster, because all the data is indexed | Slower, because only labels are indexed
Resource requirements | Higher, because everything must be indexed | Lower, because only labels are indexed
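The trade-off in the comparison can be illustrated with a toy model of label-only indexing. This is not how Loki is implemented; it is a simplified sketch with hypothetical label names. Selecting streams by label is cheap, but finding text inside the lines requires scanning every line of the matching streams.

```python
from collections import defaultdict


class LabelIndexedLogs:
    """Toy log store: only label sets are indexed, log lines are stored as-is."""

    def __init__(self) -> None:
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        """Append a line to the stream identified by its full label set."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str) -> list[str]:
        """Return lines from streams matching the selector that contain the text."""
        wanted = set(selector.items())
        hits: list[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap: match on indexed labels
                # costly: scan the unindexed line content
                hits.extend(line for line in lines if contains in line)
        return hits


logs = LabelIndexedLogs()
logs.push({"app": "nifi", "pod": "nifi-0"}, "ERROR Failed to write to HDFS")
logs.push({"app": "nifi", "pod": "nifi-0"}, "INFO Flow started")
logs.push({"app": "nifi-registry", "pod": "registry-0"}, "ERROR Bucket not found")

print(logs.query({"app": "nifi"}, "ERROR"))  # ['ERROR Failed to write to HDFS']
```

A full-text index, as in the ELK column, would avoid the scan at the cost of building and storing an index over every line.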
The Most Important Questions before migration
● Does Kubernetes add value?
● How big a Kubernetes cluster do we have?
● Do I have experience with Kubernetes?
The Most Important Aspects before migration
● Cluster vs. Single Node
● Secured or not?
● With Registry or not?
● Categorizing pipelines
● Count required CPU and RAM
● Required write/read disk performance
Apache NiFi on Kubernetes
Helm charts, CICD pipelines and a tuned NiFi configuration can be great and simplify managing the platform.
It takes a lot of time spent on configuration and on finding the right parameters to make NiFi faster and to make it work with external services like NiFi Registry or Ranger.
Challenging journey
● NiFi on Kubernetes is still an undiscovered world
● NiFi is developing rapidly, so it’s worth updating frequently to the latest releases
● Making it a CICD-friendly application can be tough
● Managing certificates on Kubernetes may not be a piece of cake
● Exposing the app to users requires choosing the right solution
NiFi Scripted Components
Article.
Apache NiFi and NiFi Registry on Kubernetes
Article.
Q&A
Contact details
albert.lewandowski@getindata.com
LinkedIn:
https://www.linkedin.com/in/albert-lewandowski