Best Practices for ETL with Apache NiFi on Kubernetes
Author: Albert Lewandowski
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
About me
● Big Data DevOps Engineer - GetInData
● Focused on infrastructure, cloud, Big Data, AI, scalable
web applications
● Certified Google Cloud Architect
● Certified Kubernetes Administrator
Content
● Apache NiFi - Overview
● Kubernetes + NiFi = ?
● How to deploy?
● Managing CICD pipelines
● Managing ETL pipelines
● Observability of the NiFi
● Lessons learnt
Introduction to the
jungle
Use cases
Apache NiFi is a popular big data processing engine with a graphical web UI.
Use cases:
● Managing ETL pipelines
● Turning batch data into streams
● Downloading files, processing them, and saving them to the final destination
● etc.
Why Kubernetes?
● Makes it simple to run a NiFi instance for another team
● Simplifies management of a complex NiFi ecosystem
Perception
Business logic
CI/CD
Idempotency
Reprocessing
Explainability
Monitoring
Testing
Serving
Infrastructure
Data Ingestion
Security
Reality
Business logic
CI/CD
Idempotency
Reprocessing
Explainability
Monitoring
Testing
Serving
Infrastructure
Data Ingestion
Security
Kubernetes + NiFi
Custom NiFi image
Custom NARs build phase
NiFi Plugins
NiFi Base image
Apache NiFi and Kubernetes
Apache NiFi was not designed to run on Kubernetes; running it there is like running any typical stateful Java application in a container.
What about NiFi Stateless?
● Great for simple pipelines, but for those we could use simpler tools
● We are stuck with the simplest steps
● We can lose data
Planning resources usage
NiFi can be a heavy-load service, and it is still easy to find processors that leak memory, so we need to allocate resources wisely.
● JVM settings
● Final specs must be based on a PoC
● Fast storage may be key
● NiFi loves RAM
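As a rough illustration of tying JVM settings to a container memory limit, the sketch below derives heap flags from a limit given in MiB. The function name `jvm_heap_flags` and the 50% heap fraction are assumptions made for this example, not NiFi recommendations; the final values should still come from a PoC.

```python
def jvm_heap_flags(container_limit_mib: int, heap_fraction: float = 0.5) -> list[str]:
    """Return -Xms/-Xmx flags sized as a fraction of the container memory limit.

    heap_fraction is an assumed starting point; the remainder is left for
    off-heap memory, thread stacks and the OS page cache.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(container_limit_mib * heap_fraction)
    # Equal initial and maximum heap avoids heap resizing at runtime.
    return [f"-Xms{heap_mib}m", f"-Xmx{heap_mib}m"]


print(jvm_heap_flags(8192))
```

Keeping the heap well below the container limit leaves room for non-heap memory, which matters because Kubernetes kills a container that exceeds its memory limit.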
Network performance
● What is the performance of the network between Kubernetes and the Hadoop cluster or the target storage, such as object storage?
● Verify the stability of the connections
● Consider using HDFS DataNodes as Kubernetes nodes
Storage performance
● How much data do we process in NiFi?
● How much data from the provenance, content and FlowFile repositories do we need to store?
● Which storage can we add to our Kubernetes cluster?
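One way to start answering these questions is a back-of-the-envelope estimate. The sketch below multiplies an hourly data volume by a retention period and adds headroom; the function name, the 30% headroom and the example workload figures are all hypothetical and only show the shape of the calculation.

```python
def repository_size_gib(ingest_gib_per_hour: float, retention_hours: float,
                        headroom: float = 1.3) -> float:
    """Estimate the disk needed to retain data for a given period, plus headroom."""
    if ingest_gib_per_hour < 0 or retention_hours < 0:
        raise ValueError("volume and retention must be non-negative")
    return ingest_gib_per_hour * retention_hours * headroom


# Hypothetical workload: 20 GiB/h of content kept for 12 h, and provenance
# events assumed to be about 5% of that volume, kept for 24 h.
content = repository_size_gib(20, 12)
provenance = repository_size_gib(20 * 0.05, 24)
print(f"content: {content:.0f} GiB, provenance: {provenance:.0f} GiB")
```

The resulting numbers are only a first input for choosing persistent volume sizes; measured values from a test run should replace the assumed ratios.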
Planning migration
Migration is the perfect time to clean up NiFi pipelines.
● Start with the simplest pipelines
● Verify used processors
● Monitor resource usage
Helm Chart makes sense
An Apache NiFi deployment consists of many Services and ConfigMaps.
The simplest way of making it future-proof (ready to be reused for another instance) is to create a Helm chart with a dedicated values file.
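To illustrate the idea of one chart with a dedicated values file per instance, the sketch below keeps shared defaults in one place and renders per-instance overrides. All keys (`image`, `resources`, `registry`, `fullnameOverride`) are hypothetical and depend on the chart you write. The output is JSON, which Helm can read as a values file because JSON is valid YAML.

```python
import json

# Hypothetical defaults; the real keys depend on your own chart.
BASE_VALUES = {
    "image": {"repository": "example.org/custom-nifi", "tag": "1.0.0"},
    "resources": {"limits": {"memory": "8Gi", "cpu": "4"}},
    "registry": {"enabled": True},
}


def merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base (nested maps are merged)."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def values_for(instance: str, override: dict) -> str:
    """Render the values file content for one NiFi instance."""
    values = merge(BASE_VALUES, {"fullnameOverride": instance, **override})
    return json.dumps(values, indent=2, sort_keys=True)


print(values_for("nifi-ingest", {"resources": {"limits": {"memory": "16Gi"}}}))
```

Each rendered file could then be passed to `helm install` or `helm upgrade` with `-f`, so a new instance only needs a new, small override.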
NiFi is a typical stateful app
Apache NiFi is a perfect example of an application that we can run as a StatefulSet on Kubernetes.
We need:
● Stable, unique network identifiers.
● Ordered, graceful deployment and scaling.
Kubernetes StatefulSet
Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods.
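The sticky identity can be made concrete with a small sketch. It builds a minimal StatefulSet manifest as a Python dict and lists the stable DNS names that Pods receive behind a headless Service. The names `nifi`, `nifi-headless` and `etl` and the image reference are placeholders, and the default cluster domain `cluster.local` is assumed.

```python
import json


def pod_dns_names(name: str, service: str, namespace: str, replicas: int) -> list[str]:
    """Stable per-Pod DNS names: <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local."""
    return [f"{name}-{i}.{service}.{namespace}.svc.cluster.local" for i in range(replicas)]


def statefulset(name: str, service: str, image: str, replicas: int = 1) -> dict:
    """Minimal StatefulSet manifest; volumes, probes and resources are omitted."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": service,  # the headless Service that governs Pod DNS
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "nifi", "image": image}]},
            },
        },
    }


manifest = statefulset("nifi", "nifi-headless", "example.org/custom-nifi:1.0.0")
print(json.dumps(manifest, indent=2))
print(pod_dns_names("nifi", "nifi-headless", "etl", 2))
```

Because the names do not change when a Pod is rescheduled, they can be used in certificates and in the configuration of services that talk to NiFi.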
Cluster or single instances?
Clustered NiFi requires adding ZooKeeper instances, which makes the application more complicated and less robust.
Running many separate single-instance NiFis, each responsible for a smaller part of the data processing pipelines, is well suited to the Kubernetes world and can be even better than a clustered setup.
Each NiFi should trust each NiFi
When we use NiFi and NiFi Registry, remember to add all required certificates to the truststore.
If we want connections between separate NiFi instances, each instance can import the other's certificate with a simple OpenSSL command.
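A minimal sketch of that exchange, assuming the certificate is fetched with `openssl s_client` and added to a Java truststore with `keytool`. The host name, alias, file paths and password below are placeholders. The functions only build the commands, so they can be reviewed before being run from a startup script.

```python
import shlex


def fetch_cert_cmd(host: str, port: int) -> list[str]:
    """openssl command that prints the certificate chain presented by host:port."""
    return ["openssl", "s_client", "-connect", f"{host}:{port}", "-showcerts"]


def import_cert_cmd(alias: str, pem_file: str, truststore: str, storepass: str) -> list[str]:
    """keytool command that adds a PEM certificate to a Java truststore."""
    return [
        "keytool", "-importcert", "-noprompt",
        "-alias", alias,
        "-file", pem_file,
        "-keystore", truststore,
        "-storepass", storepass,
    ]


# Placeholder values for illustration only.
print(shlex.join(fetch_cert_cmd("nifi-export.example.org", 8443)))
print(shlex.join(import_cert_cmd("nifi-export", "/tmp/nifi-export.pem",
                                 "/opt/nifi/conf/truststore.jks", "changeit")))
```

When running the first command non-interactively, its standard input should be closed (for example with `subprocess.run(..., stdin=subprocess.DEVNULL)`), and the PEM block has to be extracted from its output before the import step.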
How to deploy?
Certificates: keystore and truststore
HTTPS is a requirement for NiFi and NiFi Registry if we want to use the authentication and authorization layer, which is necessary for any production deployment.
● Import certificates from the external services we use, such as AD; this can be done during container startup.
● It's a MUST-HAVE for any production-grade platform.
Certificates: keystore and truststore
Keystores and truststores can be managed in multiple ways.
Manually created, stored in:
● Secrets Manager
● Vault
● K8s Secrets
Dynamically created:
● Stored in the pod, created by NiFi Toolkit while the container boots
Kerberos setup
A simple action that opens all doors to the Kerberized Hadoop world.
1. Mount krb5.conf
2. Install Kerberos client packages
3. Create a headless keytab
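The mounting and keytab steps might look like the sketch below. It assumes krb5.conf is stored in a ConfigMap under the key `krb5.conf` and the keytab in a Secret; the resource names, mount paths and principal are hypothetical.

```python
def kerberos_pod_fragment(krb5_configmap: str, keytab_secret: str) -> dict:
    """Volume definitions (Pod level) and mounts (container level) for Kerberos files."""
    return {
        "volumes": [
            {"name": "krb5", "configMap": {"name": krb5_configmap}},
            {"name": "keytab", "secret": {"secretName": keytab_secret}},
        ],
        "volumeMounts": [
            # subPath mounts the single file without hiding the rest of /etc.
            {"name": "krb5", "mountPath": "/etc/krb5.conf", "subPath": "krb5.conf"},
            {"name": "keytab", "mountPath": "/etc/security/keytabs", "readOnly": True},
        ],
    }


def kinit_cmd(principal: str, keytab_path: str) -> list[str]:
    """kinit invocation that obtains a ticket from a keytab instead of a password."""
    return ["kinit", "-kt", keytab_path, principal]


fragment = kerberos_pod_fragment("krb5-config", "nifi-keytab")
print(fragment["volumeMounts"])
print(" ".join(kinit_cmd("nifi@EXAMPLE.ORG", "/etc/security/keytabs/nifi.keytab")))
```

Running `kinit` by hand inside the container is a quick way to test the setup before pointing NiFi processors at the same principal and keytab.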
Apache Ranger - robust assistant
Apache Ranger is the perfect match for NiFi when it comes to managing all permissions to its resources.
1. Multistage Dockerfile with the Ranger plugin
2. Policy settings and audit features
3. Start and test that it works
Apache Ranger - how to add it?
Managing permissions from the Ranger UI is simple, but the configuration on the NiFi side can be tough.
Audit:
● HDFS setup to send audit logs
● Kerberos setup for Infra Solr
Apache Ranger - how to add it?
There may still be open bugs and issues but, fortunately, we can overcome all of these challenges.
Issues:
● No buckets in NiFi Registry with Ranger? Possibly it cannot map its roles to Ranger roles or users.
● Remember to check the plugin version and whether it supports your Ranger version.
● Do not forget to use a single JDK version.
Managing CICD
CICD for NiFi and pipelines
The only CICD approach supported by NiFi is to use NiFi Registry together with the parameters features, but it may fail.
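The parameters idea can be pictured as follows: the versioned flow stays identical across environments and only parameter values differ. The sketch below resolves the values for one environment and rejects overrides for parameters the flow does not declare. The parameter names and values are hypothetical, and nothing here calls NiFi or NiFi Registry.

```python
def resolve_parameters(base: dict[str, str], env_overrides: dict[str, str]) -> dict[str, str]:
    """Return the parameter values for one environment.

    Overrides for parameters that the flow does not declare are treated as
    errors, so typos surface in CI instead of at runtime.
    """
    unknown = set(env_overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**base, **env_overrides}


# Hypothetical parameters declared by a flow, with development defaults.
BASE = {"hdfs.target.dir": "/data/dev/landing", "source.poll.interval": "60 sec"}
PROD = {"hdfs.target.dir": "/data/prod/landing"}

print(resolve_parameters(BASE, PROD))
```

A check like this can run in the CI pipeline before a flow version is promoted, which catches one class of failures early.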
Apache NiFi Registry
Apache NiFi Registry was created to be a kind of Git repository for Apache NiFi pipelines.
● Set up the external database
● Read the release notes carefully
Apache NiFi Registry
Managing ETL pipelines
External & Custom Scripts
Migration to Kubernetes might be a great time to implement a full CICD pipeline for any scripts used by users.
● External storage
● Downloaded from an external source
● Built into the image
New environment, new challenges
Migration always shows how many components we can update, change or replace.
● Tests
● External resources
● Technical users
● Security
● Documentation
Divide and conquer
Running single NiFi instances requires dividing our pipelines across separate instances.
● How to calculate the required resources?
● How to divide?
● How to observe?
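One possible way to approach the "how to divide" question is to treat it as bin packing. The sketch below uses a first-fit decreasing heuristic to group pipelines so that the summed memory estimate of each group fits one instance. The pipeline names and memory figures are invented, and memory is only one of several dimensions (CPU, disk, ownership) a real split would consider.

```python
def assign_pipelines(pipelines: dict[str, int], capacity_mib: int) -> list[list[str]]:
    """Group pipelines into instances using first-fit decreasing on memory estimates."""
    bins: list[dict] = []  # each bin tracks remaining capacity and assigned names
    for name, need in sorted(pipelines.items(), key=lambda kv: kv[1], reverse=True):
        if need > capacity_mib:
            raise ValueError(f"{name} needs {need} MiB, more than one instance offers")
        for b in bins:
            if b["free"] >= need:
                b["free"] -= need
                b["names"].append(name)
                break
        else:
            # No existing instance has room, so open a new one.
            bins.append({"free": capacity_mib - need, "names": [name]})
    return [b["names"] for b in bins]


# Hypothetical memory estimates in MiB per pipeline.
estimates = {"clickstream": 6000, "billing": 3000, "logs": 2500, "exports": 1500}
print(assign_pipelines(estimates, capacity_mib=8000))
# → [['clickstream', 'exports'], ['billing', 'logs']]
```

The heuristic does not guarantee the minimum number of instances, but it is simple to reason about and easy to rerun when estimates change.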
NiFi strengths and weaknesses
What we loved: The NiFi web UI allows lightning-fast development.
What we hated: It has serious limitations when
compared to programming languages.
Choosing NiFi does not mean avoiding writing any
code, but the amount of code that has to be written can
be significantly reduced.
Observability of the NiFi
Resources monitoring layer
Start with basic stuff: monitor Kubernetes and its resources.
Nodes
Resources
Events
Storage
Pods status
NiFi monitoring layer
Starting from version 1.10.0, Apache NiFi ships a client for exposing metrics to Prometheus, which provides default metrics.
● Push metrics to Pushgateway
● Expose metrics to Prometheus
● Default metrics
● Custom metrics
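To show what exposed metrics look like to a consumer, the sketch below parses a small sample in the Prometheus text exposition format. The metric name `nifi_example_queued_flowfiles` is made up for illustration and is not a real NiFi metric name. The parser is deliberately simplified: it skips lines that carry timestamps and does not handle escaped quotes in label values.

```python
import re

# Illustrative sample; the metric name is hypothetical, not taken from NiFi.
SAMPLE = """\
# HELP nifi_example_queued_flowfiles Illustrative gauge
# TYPE nifi_example_queued_flowfiles gauge
nifi_example_queued_flowfiles{instance="nifi-ingest"} 42
nifi_example_queued_flowfiles{instance="nifi-export"} 7
"""

LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice Prometheus scrapes and parses this format itself; a parser like this is mainly useful in tests that check whether a custom metric is actually exposed.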
Prometheus - Stories
● Service discovery: simple on k8s
● Limited security
● Archived data: how old does the data need to be?
● Monitor the monitoring
Logs Analytics
In many cases, NiFi and NiFi Registry logs reveal the root cause of an issue once DEBUG is turned on.
● Use a logs analytics tool
● Do not run DEBUG in production
● Sidecars tailing logs are great
Logs analytics - which tool should I choose?
Logs analytics for developers: Loki
Logs analytics for business: Elasticsearch
ELK vs. Loki
Aspect | ELK | Loki + Promtail/Fluentd
Indexing | Keys and content of each key | Only labels
Query language | Query DSL or Lucene QL | LogQL
Tool for data visualisation | Kibana | Grafana
Query performance | Faster, because all data is indexed | Slower, because only labels are indexed
Resource requirements | Higher, due to the need for indexing | Lower, because only labels are indexed
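Since Loki indexes only labels, a typical query selects a stream by label and then filters the line content. The sketch below builds such a LogQL query and the matching URL for Loki's `query_range` HTTP endpoint. The `app` label, the base URL and the time range are assumptions that depend on how Promtail or Fluentd labels the NiFi logs.

```python
from urllib.parse import urlencode


def logql_error_query(app: str, needle: str = "ERROR") -> str:
    """LogQL: select streams by label, then keep lines containing the needle."""
    return f'{{app="{app}"}} |= "{needle}"'


def loki_query_range_url(base_url: str, query: str, start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Build a query_range URL; start and end are Unix epoch nanoseconds."""
    params = urlencode({"query": query, "start": start_ns, "end": end_ns, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


query = logql_error_query("nifi")
print(query)
print(loki_query_range_url("http://loki.example.org:3100", query,
                           1_700_000_000_000_000_000, 1_700_000_600_000_000_000))
```

The label selector is cheap because it uses the index, while the `|=` filter scans log content, which reflects the performance trade-off in the comparison above.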
Lessons learnt
The Most Important Questions before migration
● Does Kubernetes add value?
● How big is our Kubernetes cluster?
● Do I have experience with Kubernetes?
The Most Important Aspects before migration
● Cluster vs. single node
● Secured or not?
● With Registry or not?
● Categorizing pipelines
● Calculate required CPU and RAM
● Required write/read disk performance
Apache NiFi on Kubernetes
Helm charts, CICD pipelines and a tuned NiFi configuration can be great and simplify managing the platform.
It requires a lot of time spent on configuration and on finding the right parameters to make NiFi faster and to make it work with external services like NiFi Registry or Ranger.
Challenging journey
● NiFi on Kubernetes is still an undiscovered world
● NiFi is developing rapidly, so it is worth updating it frequently to the latest releases
● Making it a CICD-friendly application can be tough
● Managing certificates on Kubernetes may not be a piece of cake
● Exposing the app to users requires choosing the right solution
NiFi Scripted Components
Article.
Apache NiFi and NiFi Registry on Kubernetes
Article.
Join Us!
Cloud DevOps Engineer
Kubernetes, Terraform, public cloud
Link
Data Engineer (AWS)
Spark, Snowflake, AWS
Link
Data Scientist
Data Science, SQL
Link
Backend Developer
Java / Scala, GCP, NoSQL
Link
Q&A
Contact details
albert.lewandowski@getindata.com
LinkedIn:
https://www.linkedin.com/in/albert-lewandowski
Thank you for your
attention!

TECUNIQUE: Success Stories: IT Service provider
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, GetInData

  • 1. Best Practices for ETL with Apache NiFi on Kubernetes Author: Albert Lewandowski
  • 2. © Copyright. All rights reserved. Not to be reproduced without prior written consent. About me ● Big Data DevOps Engineer - GetInData ● Focused on infrastructure, cloud, Big Data, AI, scalable web applications ● Certified Google Cloud Architect ● Certified Kubernetes Administrator
  • 3. Content ● Apache NiFi - Overview ● Kubernetes + NiFi = ? ● How to deploy? ● Managing CI/CD pipelines ● Managing ETL pipelines ● Observability of NiFi ● Lessons learnt
  • 5. Use cases: Apache NiFi is a popular big data processing engine with a graphical web UI. ● Managing ETL pipelines ● Turning batch into stream ● Downloading files, processing them and then saving them to the final destination ● etc.
  • 6. Why Kubernetes? ● Simple to run NiFi for another team ● Simplifies management of a complex NiFi ecosystem
  • 8. Perception: Business logic ● CI/CD ● Idempotency ● Reprocessing ● Explainability ● Monitoring ● Testing ● Serving ● Infrastructure ● Data Ingestion ● Security
  • 9. Reality: Business logic ● CI/CD ● Idempotency ● Reprocessing ● Explainability ● Monitoring ● Testing ● Serving ● Infrastructure ● Data Ingestion ● Security
  • 11. Custom NiFi image: Custom NARs build phase ● NiFi Plugins ● NiFi Base image
  • 12. Apache NiFi and Kubernetes: Apache NiFi was not designed to run on Kubernetes; running it there is like running a typical stateful Java app in a container. What about NiFi Stateless? ● Great for simple pipelines, but then we could use simpler tools ● We are stuck with the simplest steps ● We can lose data
  • 13. Planning resource usage: NiFi can be a heavy-load service, and it is still easy to find processors that cause memory leaks, so we need to set resources wisely. ● JVM settings ● Final specs must be based on a PoC ● Fast storage may be key ● NiFi loves RAM
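As a rough illustration of the "JVM settings" point, the sketch below derives heap flags from a container memory limit. The function name and the 50% heap fraction are assumptions made for this example, not values from the talk; in NiFi the resulting flags would typically go into conf/bootstrap.conf, and the final numbers should come from a PoC.

```python
def suggest_jvm_heap_mib(container_limit_mib: int, heap_fraction: float = 0.5) -> dict:
    """Return hypothetical -Xms/-Xmx flags for a NiFi container.

    heap_fraction is an assumed starting point, not a NiFi recommendation.
    The remaining memory is left for off-heap usage, the page cache and
    repository I/O.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(container_limit_mib * heap_fraction)
    return {
        "xms": f"-Xms{heap_mib}m",  # initial heap equal to max heap avoids resizing
        "xmx": f"-Xmx{heap_mib}m",
        "non_heap_mib": container_limit_mib - heap_mib,
    }


if __name__ == "__main__":
    # Example: a pod with an 8 GiB memory limit.
    print(suggest_jvm_heap_mib(8192))
```

Keeping the heap well below the container limit is the design choice here: a JVM whose total footprint exceeds the limit gets killed by Kubernetes rather than throwing a recoverable error.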
  • 14. Network performance ● What is the performance of the network between Kubernetes and the Hadoop cluster or the target storage, such as object storage? ● Verify the stability of the connections ● Think of using HDFS Data Nodes as the Kubernetes Nodes
  • 15. Storage performance ● How much data do we process in NiFi? ● How much data from the provenance, content or flowfile repositories do we need to store? ● Which storage can we add to our Kubernetes?
  • 16. Planning migration: Migration is the perfect time to clean up NiFi pipelines. ● Start with the simplest pipelines ● Verify used processors ● Monitor resource usage
  • 17. Helm Chart makes sense: An Apache NiFi deployment consists of many Services and ConfigMaps. The simplest way to make it future-proof (ready to be reused for another instance) is to create a Helm chart with a dedicated values file.
  • 18. NiFi is a typical stateful app: Apache NiFi is a perfect example of an application that we can run as a StatefulSet on Kubernetes. We need: ● Stable, unique network identifiers. ● Ordered, graceful deployment and scaling.
  • 19. Kubernetes StatefulSet: Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods.
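To make the "sticky identity" idea concrete, here is a minimal sketch that builds a StatefulSet manifest as a Python dictionary and lists the Pod names Kubernetes would assign. The names (nifi, nifi-headless, data), the image tag, the mount path and the storage size are illustrative assumptions, not the configuration used in the talk.

```python
import json


def nifi_statefulset(name: str, replicas: int, storage: str) -> dict:
    """Build a minimal StatefulSet manifest (illustrative values only)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Headless Service that gives each Pod a stable DNS name.
            "serviceName": f"{name}-headless",
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "nifi",
                            "image": "apache/nifi:latest",  # assumed image tag
                            "volumeMounts": [
                                # Hypothetical mount path for the repositories.
                                {"name": "data", "mountPath": "/data"}
                            ],
                        }
                    ]
                },
            },
            # One PersistentVolumeClaim per Pod, kept across restarts.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


def pod_names(name: str, replicas: int) -> list:
    """StatefulSet Pods are named <statefulset-name>-<ordinal>."""
    return [f"{name}-{i}" for i in range(replicas)]


if __name__ == "__main__":
    print(json.dumps(nifi_statefulset("nifi", 2, "50Gi"), indent=2))
    print(pod_names("nifi", 2))
```

The printed JSON can be saved to a file and applied with kubectl, since Kubernetes accepts JSON manifests as well as YAML; a production setup would more likely render this from a Helm chart.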
  • 20. Cluster or single instances? Clustered NiFi requires adding ZooKeeper instances, which makes the application more complicated and less robust. Running many separate single-instance NiFis, each responsible for a smaller part of the data processing pipelines, is well suited to the Kubernetes world and can be even better than a clustered setup.
  • 21. Each NiFi should trust each NiFi: When we use NiFi and NiFi Registry, remember to add all required certificates to the truststore. If we want connections between separate NiFi instances, each instance can import the other's certificate using a simple OpenSSL command.
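The slide mentions an OpenSSL command; as an alternative sketch, the same first step (fetching a peer's certificate so it can be added to a truststore) can be done with Python's standard ssl module. The function names are made up for this example, and importing the PEM into the truststore itself (for example with keytool) is outside this sketch.

```python
import hashlib
import ssl


def fetch_peer_certificate(host: str, port: int) -> str:
    """Fetch the certificate presented by host:port as a PEM string.

    No validation is performed here, so the fingerprint should be
    compared against a trusted source before the certificate is imported.
    """
    return ssl.get_server_certificate((host, port))


def sha256_fingerprint(pem: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


if __name__ == "__main__":
    # Hypothetical hostname and port of another NiFi instance.
    pem = fetch_peer_certificate("nifi-b.example.internal", 8443)
    print(sha256_fingerprint(pem))
    with open("nifi-b.pem", "w") as handle:
        handle.write(pem)
```

Checking the fingerprint before importing is the point of the second function: a certificate fetched over an unauthenticated connection should not be trusted blindly.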
  • 23. Certificates: keystore and truststore. HTTPS is a requirement for NiFi and NiFi Registry if we want the authentication and authorization layer that is necessary for any production deployment. ● Import certificates from the external services we use, like AD; this can be done during container startup. ● It's a MUST-HAVE for any production-grade platform.
  • 24. Certificates: keystore and truststore. Keystores and truststores can be managed in multiple ways. ● Manually created, stored in: Secrets Manager, Vault or K8s Secrets ● Dynamically created, stored in the pod: created by NiFi Toolkit while the container boots up
  • 25. Kerberos setup: A simple action that opens all doors to the Kerberized Hadoop world. (1) Mount krb5.conf (2) Install Kerberos client packages (3) Create headless keytab
  • 26. Apache Ranger - robust assistant: Apache Ranger is the perfect match for NiFi to manage all permissions to its resources. ● Multistage Dockerfile with Ranger plugin ● Policies settings ● Audit features ● Start and test if it works
  • 27. Apache Ranger - how to add it? Managing permissions from the Ranger UI is simple, but configuration on the NiFi side can be tough. ● HDFS setup to send audit logs ● Kerberos setup for Infra Solr ● Audit
  • 28. Apache Ranger - how to add it? There may still be open bugs and issues but, fortunately, we can overcome all challenges. Issues: ● No buckets in the NiFi Registry with Ranger? Possibly it can't map its roles to Ranger roles or users. ● Remember to check the plugin version and whether it supports your Ranger version. ● Do not forget to use a single JDK version.
  • 30. CI/CD for NiFi and pipelines: The only CI/CD approach supported by NiFi is to use NiFi Registry and all the parameter features, but it may fail.
  • 31. Apache NiFi Registry: Apache NiFi Registry was created to be a kind of Git repository for Apache NiFi pipelines. ● Set up the external database ● Read the release notes carefully
  • 32. Apache NiFi Registry
  • 34. External & Custom Scripts: Migration to Kubernetes might be a great time to implement a full CI/CD pipeline for any scripts used by users. ● External Storage ● Downloaded From External Source ● Built into the Image
  • 35. New environment, new challenges: Migration always shows how many components we can update, change or replace. ● Tests ● External Resources ● Technical Users ● Security ● Documentation
  • 36. Divide and conquer: Running single-instance NiFi requires dividing our pipelines across separate instances. ● How to calculate required resources? ● How to divide? ● How to observe?
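One possible way to approach the "how to divide?" question is a simple first-fit-decreasing packing of pipelines into instances by estimated memory. This is an illustrative sketch only: the pipeline names, the RAM estimates and the per-instance budget are hypothetical, and the talk does not prescribe this method.

```python
from typing import Dict, List


def assign_pipelines(pipelines: Dict[str, int], instance_ram_mib: int) -> List[Dict[str, int]]:
    """Group pipelines into NiFi instances using first-fit decreasing.

    pipelines maps a pipeline name to its estimated RAM need in MiB
    (for example, measured during a PoC). Each returned dict is one
    instance with the pipelines assigned to it.
    """
    instances: List[Dict[str, int]] = []
    # Place the largest pipelines first so they are not left without room.
    for name, ram in sorted(pipelines.items(), key=lambda kv: kv[1], reverse=True):
        if ram > instance_ram_mib:
            raise ValueError(f"{name} needs {ram} MiB, more than one instance provides")
        for instance in instances:
            if sum(instance.values()) + ram <= instance_ram_mib:
                instance[name] = ram
                break
        else:
            # No existing instance has room, so start a new one.
            instances.append({name: ram})
    return instances


if __name__ == "__main__":
    estimates = {"ingest": 3000, "enrich": 2000, "export": 1000, "audit": 2500}
    for index, instance in enumerate(assign_pipelines(estimates, 4096)):
        print(f"nifi-{index}: {instance}")
```

In practice the grouping would also respect ownership, security boundaries and data dependencies between pipelines, so a packing like this is at most a starting point for sizing.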
  • 38. NiFi strengths and weaknesses. What we loved: the NiFi web UI allows lightning-fast development. What we hated: it has serious limitations compared to programming languages. Choosing NiFi does not mean avoiding writing any code, but the amount of code that has to be written can be significantly reduced.
  • 40. Resources monitoring layer. Start with the basics: monitor Kubernetes and its resources. ● Nodes ● Resources ● Events ● Storage ● Pods status
  • 41. NiFi monitoring layer: Starting from 1.10.0, Apache NiFi ships a client for exposing metrics to Prometheus, and it provides default metrics. ● Push metrics to PushGateway ● Expose metrics to Prometheus ● Default metrics ● Custom metrics
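To show what "exposing metrics to Prometheus" looks like on the wire, the sketch below parses a small sample in the Prometheus text exposition format. The metric name and labels in the sample are hypothetical, not actual NiFi metric names, and the parser is deliberately simplified (it ignores timestamps and some escaping rules).

```python
import re

# Simplified patterns for "name{label="value",...} value" sample lines.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical payload as a scrape endpoint might return it.
SAMPLE_TEXT = """\
# HELP nifi_example_queued_flowfiles Hypothetical metric for illustration.
# TYPE nifi_example_queued_flowfiles gauge
nifi_example_queued_flowfiles{instance="nifi-0",component="ingest"} 42
nifi_example_queued_flowfiles{instance="nifi-1",component="export"} 7
"""


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples for each sample line."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, label_text, value = match.groups()
        labels = dict(_LABEL.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_TEXT):
        print(name, labels, value)
```

A parser like this is handy for quick checks in CI or a readiness script; for real monitoring, Prometheus itself scrapes and stores these samples.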
  • 42. Prometheus - Stories ● Service discovery: simple on k8s ● Limited security ● Archived data: how old data is required? ● Monitor monitoring
  • 43. Logs Analytics: In many cases, NiFi and NiFi Registry can show the root cause of issues when DEBUG is turned on. ● Use a logs analytics tool ● Do not run DEBUG in production ● Sidecars tailing logs are great
  • 44. Logs analytics - which tool should I choose? ● Logs analytics for developers: Loki ● Logs analytics for business: Elasticsearch
  • 45. ELK vs. Loki (each row: ELK | Loki + Promtail/Fluentd) ● Indexing: keys and content of each key | only labels ● Query language: Query DSL or Lucene QL | LogQL ● Tool for data visualisation: Kibana | Grafana ● Query performance: faster because all data is indexed | slower because only labels are indexed ● Resource requirements: higher due to the need for indexing | lower because only labels are indexed
  • 47. The Most Important Questions before migration ● Does Kubernetes add value? ● How big is our Kubernetes cluster? ● Do I have experience with Kubernetes?
  • 48. The Most Important Aspects before migration ● Cluster vs. Single Node ● Secured or not? ● With Registry or not? ● Categorizing pipelines ● Count required CPU and RAM ● Required write/read disk performance
  • 49. Apache NiFi on Kubernetes: Helm charts, CI/CD pipelines and a tuned NiFi configuration can be great and simplify managing the platform. It requires a lot of time spent on configuration and on finding the right parameters to make NiFi faster and to make it work with external services like NiFi Registry or Ranger.
  • 50. Challenging journey ● NiFi on Kubernetes is still an undiscovered world ● NiFi is developing rapidly, so it's worth updating it frequently to the latest releases ● Making it a CI/CD-friendly application can be tough ● Managing certificates on Kubernetes may not be a piece of cake ● Exposing the app to users requires choosing the right solution
  • 51. NiFi Scripted Components (article)
  • 52. Apache NiFi and NiFi Registry on Kubernetes (article)
  • 53. Join Us! ● Cloud DevOps Engineer: Kubernetes, Terraform, public cloud ● Data Engineer (AWS): Spark, Snowflake, AWS ● Data Scientist: Data Science, SQL ● Backend Developer: Java / Scala, GCP, NoSQL
  • 54. Q&A
  • 55. Contact details: albert.lewandowski@getindata.com LinkedIn: https://www.linkedin.com/in/albert-lewandowski
  • 56. Thank you for your attention!