Technical tips for secure
Apache Hadoop cluster
Akira Ajisaka, Kei Kori
Yahoo Japan Corporation
Big Data
Akira Ajisaka (@ajis_ka)
• Software Engineer in Hadoop team @ Yahoo! JAPAN
– Upgraded HDFS to 3.3.0 and enabled RBF
– R&D for a Hadoop cluster more secure than just enabling
Kerberos auth
• Apache Hadoop committer/PMC
– ~800 commits in various components in 6 years
– Handled and announced several CVEs
– Manages build and QA environment
Kei KORI (@2k0ri)
• Data Platform Engineer
in Hadoop team @ Yahoo! JAPAN
– Built the upgrade to HDFS 3.3.0 and its continuous delivery
– Research on operations for a more secure Hadoop cluster
• Kubernetes admin for Hadoop client environment
– Migrates users from VM/BM to a cloud-native environment
– Integrates ML/DL workloads with Hadoop ecosystem
Session Overview
Prerequisites:
• Hadoop is not secure by default
• Kerberos authentication is required
This talk introduces further details in practice:
• Wire encryption in Hadoop ecosystem
• HDFS transparent data encryption at rest
• Other considerations
Wire encryption
in Hadoop ecosystem
Background
For making the Hadoop ecosystem more secure than
perimeter security alone
• Not only authenticate but encrypt communications
• Protection and mitigation from internal threats like
packet sniffing
• Part of security compliance like NIST SP800-171
Overview: wire encryption types
between components
• HTTP encryption
– HDFS, YARN, MapReduce, KMS, HttpFS, Spark, Hive, Oozie, Livy
• RPC encryption
– HDFS, YARN, MapReduce, KMS, Spark, Hive, Oozie, ZooKeeper
• Block data transfer encryption
– HDFS
• Shuffle encryption
– MapReduce, Spark, Tez
HTTP encryption for Hadoop
• dfs.http.policy: HTTPS_ONLY in hdfs-site,
yarn.http.policy: HTTPS_ONLY in yarn-site,
mapreduce.jobhistory.http.policy: HTTPS_ONLY in mapred-site
etc.
– Enable TLS on WebUI/REST API endpoints
– Use HTTP_AND_HTTPS while rolling out updates to endpoints
• yarn.timeline-service.webapp.https.address in yarn-site,
mapreduce.jobhistory.webapp.https.address in mapred-site
– Set History/Timeline Server endpoints with HTTPS
• Store certs and passphrases with the Hadoop Credential Provider,
referenced by hadoop.security.credential.provider.path
(see the sketch below)
– Separates permissions from configs
– Prevents exposure outside of hadoop.security.sensitive-config-keys filtering
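A minimal hdfs-site.xml sketch of the policy above, assuming a local JCEKS credential store (the path is a placeholder); yarn-site and mapred-site take the analogous *.http.policy keys:

<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
<property>
  <!-- usually set in core-site; localjceks keeps the store on local disk -->
  <name>hadoop.security.credential.provider.path</name>
  <value>localjceks://file/etc/hadoop/conf/ssl.jceks</value>
</property>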
RPC encryption for Hadoop
• hadoop.rpc.protection: privacy in core-site
– Encrypts RPC incl. Kerberos authentication on SASL layer
– Propagates to
hadoop.security.saslproperties.resolver.class,
dfs.data.transfer.saslproperties.resolver.class and
dfs.data.transfer.protection
• hadoop.rpc.protection: privacy,authentication
while rolling updates across all Hadoop servers/clients
– Accepts fallback to unencrypted RPC (sketched below)
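A core-site.xml sketch of the rolling-update setting above:

<property>
  <name>hadoop.rpc.protection</name>
  <!-- privacy,authentication during the rolling update;
       tighten to privacy alone once all servers/clients are updated -->
  <value>privacy,authentication</value>
</property>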
Block data transfer encryption for
Hadoop
• dfs.encrypt.data.transfer: true,
dfs.encrypt.data.transfer.cipher.suites:
AES/CTR/NoPadding in hdfs-site
– Only encrypts payload between HDFS client and DataNodes
• Rolling update is not supported by configs alone
– Requires managing a list of encrypted nodes, or extending/implementing
your own dfs.trustedchannel.resolver.class
– Nodes trusted by dfs.trustedchannel.resolver.class
are forced to transfer without encryption regardless of their
encryption settings
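An hdfs-site.xml sketch of the settings above:

<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <!-- negotiates AES instead of the legacy 3des/rc4 algorithms -->
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>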
Encryption for Spark
In spark-defaults:
• HTTP encryption
– spark.ssl.sparkHistory.enabled true
• Switches the protocol on a single port; does not support HTTP_AND_HTTPS
– spark.yarn.historyServer.address https://...
• RPC encryption
– spark.authenticate: true
• Also in yarn-site
– spark.authenticate.enableSaslEncryption true
– spark.network.sasl.serverAlwaysEncrypt true
• Enable only after all Spark components recognize enableSaslEncryption
• Shuffle encryption
– spark.network.crypto.enabled true
– spark.io.encryption.enabled true
• Encrypts spilled caches and RDDs on local disks
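A spark-defaults.conf sketch combining the settings above (the History Server address is a placeholder, and the key names follow this slide):

spark.ssl.sparkHistory.enabled           true
spark.yarn.historyServer.address         https://history.example.com:18480
# per the slide above, spark.authenticate must also be set in yarn-site
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
# turn this on only after all components recognize enableSaslEncryption
spark.network.sasl.serverAlwaysEncrypt   true
spark.network.crypto.enabled             true
spark.io.encryption.enabled              true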
Encryption for Hive
• hive.server2.thrift.sasl.qop: auth-conf in hive-site
– Encrypts JDBC between clients and HiveServer2 in binary mode
– And Thrift between clients and the Hive Metastore
• hive.server2.use.SSL: true in hive-site
– Only for HS2 http mode
– HS2 binary mode cannot enable both TLS and SASL
• Encryption for JDBC between HS2/Hive Metastore and remote RDBMS
• Shuffle encryption
– Tez:
tez.runtime.shuffle.ssl.enable: true,
tez.runtime.shuffle.keep-alive.enabled: true in tez-site
– MapReduce:
mapreduce.ssl.enabled: true,
mapreduce.shuffle.ssl.enabled: true in mapred-site
– Requires server certs for all NodeManagers
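A sketch of the Hive and Tez pieces above (the MapReduce keys go in mapred-site the same way):

In hive-site.xml:
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth-conf</value>
</property>

In tez-site.xml:
<property>
  <name>tez.runtime.shuffle.ssl.enable</name>
  <value>true</value>
</property>
<property>
  <name>tez.runtime.shuffle.keep-alive.enabled</name>
  <value>true</value>
</property>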
Challenges in HTTP encryption: for
Application Master / Spark Driver
• Server certs for ApplicationMaster / SparkDriver
need to be readable by the user who submitted the application
– ApplicationMaster and SparkDriver run as the user
– WebApplicationProxy between ResourceManager and
ApplicationMaster relies on this encryption
• Applications support TLS and can bundle certs as of:
– Spark 3.0.0: SPARK-24621
– MapReduce 3.3.0: MAPREDUCE-4669
– Tez: not supported yet
Encryption for ZooKeeper server
• Authenticate with SASL, encrypt with TLS
– ZooKeeper does not respect SASL QOP
• Requires ZooKeeper 3.5.6 or above for servers/quorums
– serverCnxnFactory=org.apache.zookeeper.server.Nett
yServerCnxnFactory
– sslQuorum=true
– ssl.clientAuth=NONE
– ssl.quorum.clientAuth=NONE
• Needs ZOOKEEPER-4276 to follow "Upgrading existing
non-TLS cluster with no downtime"
– Lets ZK serve with secureClientPort only (see the sketch below)
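A zoo.cfg sketch of the server-side settings above, assuming JKS stores (paths and passwords are placeholders):

secureClientPort=2281
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
sslQuorum=true
ssl.clientAuth=NONE
ssl.quorum.clientAuth=NONE
# placeholders; quorum TLS uses the separate ssl.quorum.* keys
ssl.keyStore.location=/etc/zookeeper/conf/server.jks
ssl.keyStore.password=changeit
ssl.trustStore.location=/etc/zookeeper/conf/truststore.jks
ssl.trustStore.password=changeit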
Encryption for ZooKeeper client
• Also requires ZooKeeper 3.5.6 or above for clients
-Dzookeeper.client.secure=true
-Dzookeeper.clientCnxnSocket=
org.apache.zookeeper.ClientCnxnSocketNetty
in client JVM args
– HADOOP_OPTS environment variable
– mapreduce.admin.map.child.java.opts,
mapreduce.admin.reduce.child.java.opts in mapred-site
for Oozie Coordinator MapReduce jobs
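A sketch of wiring the JVM args above through HADOOP_OPTS; the trust store flags are assumptions based on the standard zookeeper.ssl.* client properties (paths and passwords are placeholders):

export HADOOP_OPTS="$HADOOP_OPTS \
  -Dzookeeper.client.secure=true \
  -Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty \
  -Dzookeeper.ssl.trustStore.location=/etc/zookeeper/conf/truststore.jks \
  -Dzookeeper.ssl.trustStore.password=changeit"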
• Needs replacing and updating ZooKeeper jars in all components
that communicate with ZooKeeper
– ZKFC, ResourceManager, Hive clients incl. HS2, Oozie and Livy
– Apache Curator must also be updated to 4.2.0, and Netty from 4.0 to 4.1
Enforcing Kerberos AuthN/Z for
ZooKeeper
• Requires ZooKeeper 3.6.0 or above for servers
– 3.6.0+:
zookeeper.sessionRequireClientSASLAuth=true
– 3.7.0+:
enforce.auth.enabled=true
enforce.auth.schemes=sasl
• Oozie Hive action will not work when ZK SASL is enforced
– Fails when acquiring the lock for the Hive Metastore
– Has no mechanism to delegate authentication or
impersonation for ZooKeeper
– Using HiveServer2 / the Oozie Hive2 action solves it
HDFS transparent data
encryption (TDE) at rest
Background
HDFS blocks are written to the local filesystem of the DataNodes
• the data is not encrypted by default
• encryption is required in several use cases
Encryption can be done at several layers:
• Application: most secure, but hardest to do
• Database: most databases have this, but may incur performance
penalties
• Filesystem: high performance, transparent, but may not be flexible
• Disk: only really protects against physical theft
HDFS TDE fits between the database and filesystem levels
Overview: encryption/decryption is
transparent to the clients
KeyProvider: Where KEK is saved
Implementations of KeyProvider API
• Hadoop KMS: JavaKeyStoreProvider
– JCEKS files in Hadoop compatible filesystems (localFS, HDFS,
cloud storage)
– Not recommended
• Apache Ranger KMS: RangerKeyStoreProvider
– RDBMS
– master key can be stored in Luna HSM (optional)
– HSM is required in some use cases
• PCI-DSS, FIPS 140-2
Extending KeyProvider API is
not difficult
• Mandatory methods for HDFS TDE
– getKeyVersion, getCurrentKey, getMetadata
• Optional methods (nice to have for operation)
– getKeys, getKeysMetadata, getKeyVersions, createKey, deleteKey,
rollNewVersion
– If not implemented, you need to create/delete/list/roll keys in some
way
• Use cases:
– LinkedIn integrated with its own key management service, LiKMS:
https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr
– Yahoo! JAPAN also integrated with our own credential store in only
~500 LOC (including test code)
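Wiring a custom provider in is just configuration: KeyProviderFactory implementations are discovered via ServiceLoader and selected by URI scheme. A core-site.xml sketch, where mykms:// is a hypothetical scheme for a custom provider (a stock Hadoop KMS would use kms://https@...):

<property>
  <name>hadoop.security.key.provider.path</name>
  <!-- mykms:// is a hypothetical scheme registered by a custom KeyProviderFactory -->
  <value>mykms://https@keystore.example.com:9600/keys</value>
</property>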
KeyProvider is actually stable,
can be used safely
• KeyProvider is @Public and @Unstable
– @Unstable in Hadoop means "incompatible changes are
allowed at any time"
• Actually, the API is very stable
– No incompatible changes
– Ranger uses it since 2015: RANGER-247
• Provided a patch to mark it stable
– HADOOP-17544
Hadoop KMS: Where KEK is
cached and performs
authorization
• KMS interacts with HDFS clients, NameNodes, and KeyProvider
• KMS has its own ACLs, separate from HDFS ACLs
– An attacker cannot decrypt data even if HDFS ACLs are compromised
– If 'usera' reads/writes data in the encryption zone with 'keya', the
configuration in kms-acls.xml will be:

<property>
  <name>key.acl.keya.DECRYPT_EEK</name>
  <value>usera</value>
</property>

– The configuration is hot-reloaded
• For HA and scalability, multiple KMS instances are supported
How to deploy multiple KMS
instances
Two Approaches:
1. Behind a load-balancer or VIP
2. Using LoadBalancingKMSClientProvider
– Implicitly used when multiple URIs are specified in
hadoop.security.key.provider.path
If you have a LB or VIP, use it
• No configuration change to scale-out/decommission
• LB saves clients' retry cost
– LoadBalancingKMSClientProvider first tries one KMS and, if that fails,
connects to another
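A core-site.xml sketch of approach 2, using the multi-host URI form that implicitly enables LoadBalancingKMSClientProvider (hostnames are placeholders):

<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://https@kms01.example.com;kms02.example.com:9600/kms</value>
</property>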
How to configure multiple KMS
instances
• Delegation tokens must be synchronized
– Use ZKDelegationTokenSecretManager
– An example configuration is documented in HADOOP-17794 (sketched below)
• hadoop.security.token.service.use_ip
– If true (default), SSL certificate validation fails in multi-
homed environments
– Documented: HADOOP-12665
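A kms-site.xml sketch of the ZKDelegationTokenSecretManager setup above (connection string is a placeholder; see HADOOP-17794 for a full example):

<property>
  <name>hadoop.kms.authentication.zk-dt-secret-manager.enable</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.kms.authentication.zk-dt-secret-manager.zkConnectionString</name>
  <value>zk01.example.com:2181,zk02.example.com:2181</value>
</property>
<property>
  <!-- use sasl when ZK enforces Kerberos, per the ZooKeeper sections above -->
  <name>hadoop.kms.authentication.zk-dt-secret-manager.zkAuthType</name>
  <value>sasl</value>
</property>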
Tuning Hadoop KMS
• Documented and discussed in HADOOP-15743
– Reduce SSL session cache size and TTL
– Tune the HTTPS idle timeout
– Increase max file descriptors
– etc.
• This tuning is effective in HttpFS as well
– Both KMS/HttpFS use Jetty via HttpServer2
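A partial sketch of such tuning in the KMS host's core-site.xml, assuming the HttpServer2 keys hadoop.http.max.threads and hadoop.http.idle_timeout.ms (values are illustrative; HADOOP-15743 covers more, including the SSL session cache settings):

<property>
  <name>hadoop.http.max.threads</name>
  <value>1000</value>
</property>
<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <value>10000</value>
</property>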
Recap: HDFS TDE
• Careful configuration required
– How to save KEK
– Running multiple KMS instances
– KMS Tuning
– Where to create encryption zones
– ACLs (including key ACLs and impersonation)
• They are still not straightforward, despite the long time
since the feature was developed
Other considerations
Updating SSL certificates
• Hadoop >= 3.3.1 allows updating SSL certificates
without downtime: HADOOP-16524
– Uses the hot-reload feature in Jetty
– Except the DataNode, since the DN doesn't rely on Jetty
• Useful especially for the NameNode, because it takes >
30 minutes to restart in a large cluster
Other considerations
• It is important to be ready to upgrade at any time
– Sometimes CVEs are published and vendors
warn users to upgrade
• Security requirements may increase later, so be
prepared for that early
• Operational considerations are also necessary
– Not only the cluster configuration but also the operations
will change
Conclusion & Future work
We introduced many technical tips for a secure Hadoop
cluster
• However, they might change in the future
• Need to catch up with the OSS community
Future work
• How to enable SSL/TLS in ApplicationMaster & Spark Driver
Web UIs
• Impersonation does not work correctly in KMSClientProvider:
HDFS-13697
THANK YOU
QUESTIONS?
@aajisaka @2k0ri