Kafka & Hadoop in Rakuten
Apr 21st, 2021
Yongduck Lee
Cloud Platform. Dept.
Rakuten Group, Inc.
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform used by
thousands of companies for high-performance data pipelines, streaming
analytics, data integration, and mission-critical applications.
• Unified platform for handling real-time data feeds
• High-throughput to support high volume event streams
• Graceful handling of large data backlogs
• Low-latency delivery to handle more traditional messaging use cases
• Fault tolerance in the presence of machine failures
• Does not keep an in-process cache of the data (it relies on the OS page cache)
https://kafka.apache.org
What is Elasticsearch?
Elasticsearch is a distributed, RESTful search and analytics engine capable
of addressing a growing number of use cases. As the heart of the Elastic
Stack, it centrally stores your data for lightning-fast search, fine-tuned
relevancy, and powerful analytics that scale with ease.
https://www.elastic.co/elasticsearch/
[Figure: Elasticsearch architecture diagram. Labels: Client; Cluster A, Cluster B, Cluster C; Master Nodes; Data Nodes (primary, replica); Coordinating Nodes; ML Nodes; Transform Nodes; Remote Cluster Nodes]
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high availability, the library itself is
designed to detect and handle failures at the application layer, so delivering
a highly available service on top of a cluster of computers, each of which may
be prone to failures.
https://hadoop.apache.org
Data Pipeline Concept
Pipeline stages: Data Provider → Data Collection → Data Wrangling → Data Process & Analysis → Visualization
Data Provider
System / Network
• Application logs
• Access logs
• Transaction logs
• OS logs / Network traffic
User Behaviors
• Purchase
• Page View
• Click
• RQ/LDTime
• Geo Location
• Review
• Product Search
Service Platform: Event / Product / Profile Info
• Email
• Campaign
• Questionnaires
• Product/Item
• Demography
Heterogeneous Data
- Unstructured data
- Semi-structured data
- Structured data
Data Collection
• Real-Time Collection
• CDC
• Full Dump Data
Data Wrangling
• Enrichment
• Normalization
• Cleaning
Data Users (Actors)
• INFRA
• RCMD
• RANKING
• Log Management
• Data Analysis
• ….
Visualization
• Data investigation
• Reporting
• Historical data
• Near time consumers
Realtime data (5-15 sec)
• Realtime dashboards
• Traffic anomalies
• Initial research
• Recent data
Other diagram labels: Product; Near Real-Time or Batch
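The wrangling stage named on this slide (cleaning, normalization, enrichment) can be sketched as a small function applied to each collected event. This is an illustration only: the field names (`ts`, `user_id`, `event`, `ip`), the `GEO_BY_PREFIX` lookup, and the `wrangle` helper are hypothetical and do not come from the slides.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical lookup table used for enrichment (not from the slides).
GEO_BY_PREFIX = {"10.1.": "JP", "10.2.": "EU", "10.3.": "NA"}


def wrangle(raw: dict) -> Optional[dict]:
    """Clean, normalize, and enrich one raw event; return None if unusable."""
    # Cleaning: drop records that lack the fields every consumer needs.
    if not raw.get("user_id") or "ts" not in raw:
        return None

    # Normalization: epoch seconds -> ISO 8601 UTC, event names -> lower case.
    ts = datetime.fromtimestamp(float(raw["ts"]), tz=timezone.utc)
    event = str(raw.get("event", "unknown")).strip().lower()

    # Enrichment: attach a region derived from the client IP prefix.
    ip = str(raw.get("ip", ""))
    region = next(
        (r for p, r in GEO_BY_PREFIX.items() if ip.startswith(p)), "unknown"
    )

    return {
        "user_id": str(raw["user_id"]),
        "event": event,
        "ts": ts.isoformat(),
        "region": region,
    }
```

In a real pipeline the same three steps would typically run inside the stream or batch framework rather than as a standalone function; the sketch only shows the order of operations.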
Data Pipeline Concept
• Sub-second, interactive investigation of data as time series
• Complex analytics, data processing, AI, etc. over large datasets; may take from seconds to days to run, depending on workload and processing framework
Kafka in Rakuten
We have been providing a Kafka service from Kafka 0.8 through 2.4 with PLAINTEXT, SASL_PLAINTEXT, and
SASL_SSL (SASL over TLS), handling around 1.3 million messages/sec (10 GB/sec in/out) at peak time on a normal day.
During the 2021 Super Sale, we handled more than 2.5 times that message volume and traffic.
• 5 Mar 2021, 22:45: 62 Kafka clusters (7,440 cores, 21 TB memory, 4,972 topics)
• 8 Apr 2021: 77 Kafka clusters (7,904 cores, 22 TB memory, 5,091 topics)
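Kafka clients choose the listener type through the `security.protocol` property, whose valid values are PLAINTEXT, SSL, SASL_PLAINTEXT, and SASL_SSL. The sketch below builds a client property map for each case. It assumes Kerberos (GSSAPI) as the SASL mechanism, since later slides mention realms and a KDC; the `client_properties` helper, the host name, and the truststore path are hypothetical.

```python
VALID_PROTOCOLS = {"PLAINTEXT", "SSL", "SASL_PLAINTEXT", "SASL_SSL"}


def client_properties(bootstrap, protocol, service_name="kafka", truststore=None):
    """Build Kafka client properties for one of the listener types."""
    if protocol not in VALID_PROTOCOLS:
        raise ValueError(f"unknown security.protocol: {protocol}")
    props = {"bootstrap.servers": bootstrap, "security.protocol": protocol}
    if protocol.startswith("SASL_"):
        # Assumption: Kerberos (GSSAPI) is the SASL mechanism in use.
        props["sasl.mechanism"] = "GSSAPI"
        props["sasl.kerberos.service.name"] = service_name
    if protocol.endswith("SSL") and truststore is not None:
        # Java-client property name; other clients use different keys.
        props["ssl.truststore.location"] = truststore
    return props
```

For example, `client_properties("broker1:9093", "SASL_SSL", truststore="/etc/kafka/truststore.jks")` yields the SASL and TLS keys together, while `"PLAINTEXT"` yields only the bootstrap servers and the protocol.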
Kafka in Rakuten
[Figure: map of Kafka clusters by region (NA, EU, JP); cluster counts shown: 69, 4, 4]
• Near real-time one-way mirroring
• Cross-DC Active/Active | Active/Hot Standby Kafka using MirrorMaker 2 + Kafka Connect
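The difference between one-way mirroring and Active/Active comes down to which replication flows are enabled in the MirrorMaker 2 configuration. The sketch below renders a minimal MM2 properties file from a list of flows. The property keys (`clusters`, `<alias>.bootstrap.servers`, `<source>-><target>.enabled`, `<source>-><target>.topics`) are standard MM2 settings; the `mm2_properties` helper, the cluster aliases, and the host names are hypothetical.

```python
def mm2_properties(clusters, flows, topics=".*"):
    """Render a minimal MirrorMaker 2 properties file.

    clusters: mapping of cluster alias -> bootstrap servers string
    flows: list of (source_alias, target_alias) replication directions
    """
    lines = ["clusters = " + ", ".join(clusters)]
    for alias, servers in clusters.items():
        lines.append(f"{alias}.bootstrap.servers = {servers}")
    for src, dst in flows:
        if src not in clusters or dst not in clusters:
            raise KeyError(f"unknown cluster alias in flow {src}->{dst}")
        # Each enabled flow replicates matching topics from src to dst.
        lines.append(f"{src}->{dst}.enabled = true")
        lines.append(f"{src}->{dst}.topics = {topics}")
    return "\n".join(lines) + "\n"
```

Passing `[("jp", "eu")]` gives one-way mirroring; passing `[("jp", "eu"), ("eu", "jp")]` enables both directions for an Active/Active layout. With the default replication policy, mirrored topics are prefixed with the source alias on the target cluster, which keeps the two directions from looping.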
Elasticsearch in Rakuten
We have been providing an ES service from 2.X through 7.X with Basic and Commercial subscriptions, indexing
hundreds of thousands of docs/sec for near-real-time log management and monitoring and for user behavior and KPI
analysis. During the 2021 Super Sale, we handled more than twice the usual document volume and traffic.
• 47 ES clusters (5,960 cores, 6.4 TB memory, 71 TB of indices)
Hadoop In Rakuten
As of 8 Apr 2021: 1K nodes, 72K vcores, 442 TB RAM, 130 PB disk
We provide HDP2 and HDP3 clusters in the JP, EU, and US regions. Our clusters serve very aggressive multi-tenant
workloads: tenants use them as a data lake and for data analysis, backup, archiving, and more. CPU-intensive,
memory-intensive, and disk-intensive workloads all run on the clusters at the same time, yet we provide a stable,
high-performance service, drawing on Hadoop administration experience that goes back to the first generation of
Apache Hadoop.
Hadoop in Rakuten
Challenge on Kafka
Mirroring throughput between regions or zones
- Temporary network failures
- High latency
- Pros and cons of where MirrorMaker is located
Solution: Parallelism
- Increase the number of partitions
- Relocate MirrorMaker to the source side and increase producer parallelism
Instability or broken clusters
- High load during rebalancing or recovery
- Rack awareness
- Major/minor upgrades or patching
Solution: Scale out rather than scale up
- Reduce the amount of data replicated during recovery or rebalancing by using small servers with
disk, CPU, and memory sized properly for Java/Scala
JDK & cross-realm issues
- Consuming and producing between clusters with different realms or service names
- JDK specification regarding Kerberos authentication
Solution: Global KDC and a proper streaming solution
- Use a streaming framework (Spark, Flink, and so on)
- Use middleware that supports different service names and versions (NiFi)
- Use a global KDC and one realm for the Kafka clusters
OOM on brokers or ZooKeeper
- Many consumers or producers
- Large messages
- Znode creation
Solution: Confirm the use case and use a dedicated ZK
- Guide users through proactive, professional consultation
- Authorization on ZK nodes
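The scale-out argument on this slide can be checked with rough arithmetic: when a broker is replaced, it has to fetch roughly its share of the cluster's data again, so smaller brokers mean less data to copy per recovery. The sketch below is a simplified model, not the author's sizing method. It assumes data is spread evenly across brokers and that recovery is limited only by the recovering broker's inbound replication bandwidth; the `recovery_hours` helper and all figures are hypothetical.

```python
def recovery_hours(total_data_tb, brokers, inbound_gbps):
    """Estimate data to re-replicate and time to catch up for one broker.

    total_data_tb: on-disk data across the cluster, replicas included (TB)
    brokers: number of brokers sharing that data evenly (assumption)
    inbound_gbps: replication bandwidth into the recovering broker (Gbit/s)
    """
    if brokers < 1 or inbound_gbps <= 0:
        raise ValueError("brokers must be >= 1 and inbound_gbps > 0")
    per_broker_tb = total_data_tb / brokers
    # 1 TB = 8000 Gbit in decimal units.
    seconds = per_broker_tb * 8000 / inbound_gbps
    return per_broker_tb, seconds / 3600
```

With 600 TB on 10 large brokers and 10 Gbit/s inbound, one broker holds 60 TB and needs about 13.3 hours to catch up; the same data on 60 small brokers leaves 10 TB per broker and about 2.2 hours. Real recovery is also throttled by disk I/O and by the load on the serving replicas, which this model ignores.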
Challenge on Elasticsearch
Mixed indexing and query patterns
• Indexing rate (1K docs/sec to 100K docs/sec)
• Size per index (1 GB/hour to 1 TB/hour)
• Short-term or long-term queries
Unbalanced shard distribution
• Total number of shards per node
• Balance of high-load and low-load shards per node
Too many indices and shards
• Long retention
• Many shards per index for load distribution
Arbitrary documents indexed into ES
• Arbitrary number of JSON fields
• Invalid data that does not match the mapped data type
• Too many JSON fields per document
Fast queries needed during heavy indexing load
OOM on data nodes and coordinating nodes
Hard to scale out for only the high-load indices
……
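The "arbitrary documents" problems listed on this slide (unbounded JSON fields, values that do not match the mapped type) are commonly contained with index template settings. The sketch below builds a template body using the real Elasticsearch settings `index.mapping.total_fields.limit` and `index.mapping.ignore_malformed` and the `dynamic` mapping parameter. It is an illustration, not the configuration used in the talk: it assumes the composable index template format of Elasticsearch 7.8 or later, and the `guarded_template` helper, index pattern, and field names are hypothetical.

```python
def guarded_template(pattern, fields, shards=1, field_limit=200):
    """Build an index template body that bounds mapping growth.

    fields: mapping of field name -> Elasticsearch field type
    """
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "index.number_of_shards": shards,
                # Reject mappings that grow past this many fields.
                "index.mapping.total_fields.limit": field_limit,
                # Index the document but skip values that fail type parsing.
                "index.mapping.ignore_malformed": True,
            },
            "mappings": {
                # Unknown fields stay in _source but are not added to the mapping.
                "dynamic": False,
                "properties": {name: {"type": t} for name, t in fields.items()},
            },
        },
    }
```

The returned dictionary would be sent as the JSON body of a `PUT _index_template/<name>` request. Setting `dynamic` to `"strict"` instead would reject documents that carry unmapped fields rather than silently ignoring those fields.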
Challenge on Elasticsearch
[Figure: Elasticsearch cluster layout. Labels: Client; Coordinating Nodes; Master Nodes; Data Nodes split into Hot and Cold tiers and organized into HL, ML, and LL groups; Routing; SEH; Template; IDX Move/Merge/READ-ONLY]
Challenge on Elasticsearch
[Figure: Elasticsearch cluster layout. Labels: Client; Coordinating Nodes; Master Nodes; Data Nodes split into Hot and Cold tiers and organized into HL, ML, and LL groups; SEH; Template; IDX Move/READ-ONLY]
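One common way to implement a hot/cold split with "move", "merge", and "read-only" steps is shard allocation filtering on a custom node attribute. The sketch below lists the REST calls that would retire an index from hot to cold nodes. The settings `index.blocks.write` and `index.routing.allocation.require.<attribute>` and the `_forcemerge` endpoint are real Elasticsearch features, but whether the talk's setup uses them is an assumption. The attribute name `box_type`, the `cold_transition_steps` helper, and the index name are hypothetical.

```python
def cold_transition_steps(index, attr="box_type", cold_value="cold", merge=True):
    """Return the REST calls (method, path, body) that retire an index to cold nodes."""
    steps = [
        # READ-ONLY: block writes first so the index stops changing.
        ("PUT", f"/{index}/_settings", {"index.blocks.write": True}),
    ]
    if merge:
        # MERGE: collapse segments while the index is still on hot hardware.
        steps.append(("POST", f"/{index}/_forcemerge?max_num_segments=1", None))
    # MOVE: require nodes tagged with the cold attribute value.
    steps.append(
        (
            "PUT",
            f"/{index}/_settings",
            {f"index.routing.allocation.require.{attr}": cold_value},
        )
    )
    return steps
```

Calling `cold_transition_steps("logs-2021.04.08")` returns three calls in order; passing `merge=False` skips the force merge, which corresponds to a move plus read-only flow. Cold nodes would need the matching attribute set in their node configuration for the last step to relocate any shards.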
Challenge on Hadoop
Aggressive multi-tenancy on a big-box cluster
- Job pending or execution delay
- NameNode slowdown
- ZooKeeper timeout
- NameNode heap
- Localization issues
- Large number of files
High performance & low cost
- CPU-intensive
- Memory-intensive
- Disk-intensive
Countermeasures
- Preemption
- Federation
- ZooKeeper separation
- Continuous balancing
- Dedicated nodes with labeling
- Heterogeneous
- Proper node design based on needs
- Utilizing SSD & HDD
- On-Premise
- Training Course
- NameNode RPC QoS
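The link between "Large # of Files" and "NameNode Heap" on this slide can be made concrete with a widely quoted rule of thumb: each file, directory, and block is an object in NameNode memory costing roughly 150 bytes. The sketch below applies that rule. The 150-byte figure is a coarse assumption rather than a measured value, and the `namenode_heap_gb` helper and the example counts are hypothetical.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_heap_gb(files, blocks_per_file=1.0, directories=0):
    """Rough NameNode heap needed to hold the namespace, in GB (10**9 bytes)."""
    if files < 0 or blocks_per_file < 0 or directories < 0:
        raise ValueError("counts must be non-negative")
    # One object per file, per block, and per directory.
    objects = files + files * blocks_per_file + directories
    return objects * BYTES_PER_OBJECT / 1e9
```

For 10 million single-block files this gives about 3 GB, and the estimate grows linearly with file count, which is why many small files put pressure on the NameNode. Splitting the namespace across several NameNodes with Federation divides the object count, and so the heap estimate, per NameNode.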
Future Challenges
Self-Service
• Self-Operation
• Data Profiling & Governance
• Broker Level Administration
• Active-Active Mirroring
Next Generation
• Kafka vs ???
• Elasticsearch vs ???
Return To Apache Hadoop
• HDP Subscription Policy
• Ambari to Chef or Ansible
• Rakuten Distribution Hadoop
Containerization
• Service Discovery
• Persistent Storage or Local Storage
• Physical vs Logical Separation