Every enterprise spends significant resources to protect its data. This is especially true in the case of big data, since some of this data may include sensitive or confidential customer and financial information. Common methods for protecting data include permissions and access controls as well as the encryption of data at rest and in flight.
The Hadoop community has rolled out Transparent Data Encryption (TDE) support in HDFS. With TDE, data is encrypted and decrypted by the HDFS client, transparently to the application writing or reading it, with no application code changes. The data is encrypted during its entire lifespan—in transit and at rest—except when it is being specifically accessed by an authorized processing application.
TDE is an excellent approach for protecting data stored in data lakes built on the latest versions of HDFS. However, it does have its challenges and limitations. Systems that want to use TDE require tight integration with enterprise-wide Kerberos Key Distribution Center (KDC) services and Key Management Systems (KMS). This integration isn’t easy to set up or maintain. These issues can be even more challenging in a virtualized or containerized environment where one Kerberos realm may be used to secure the big data compute cluster and a different Kerberos realm may be used to secure the HDFS filesystem accessed by this cluster.
BlueData has developed significant expertise in configuring, managing, and optimizing access to TDE-protected HDFS. This session at the Strata Data Conference in March 2018 (by Thomas Phelan, co-founder and chief architect at BlueData) offers a detailed overview of how transparent data encryption works with HDFS, with a particular focus on containerized environments.
You’ll learn how HDFS TDE is configured and maintained in an environment where many big data frameworks run simultaneously (e.g., in a hybrid cloud architecture using Docker containers). Moreover, you’ll learn how KDC credentials can be managed in a Kerberos cross-realm environment to provide data scientists and analysts with the greatest flexibility in accessing data while maintaining complete enterprise-grade data security.
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63763
How to Protect Big Data in a Containerized Environment
1. How to Protect Big Data in a
Containerized Environment
Thomas Phelan
Chief Architect, BlueData
@tapbluedata
2. Outline
Securing a Big Data Environment
Data Protection
Transparent Data Encryption
Transparent Data Encryption in a Containerized Environment
Takeaways
3. In the Beginning …
Hadoop was used to process public web data
- No compelling need for security
• No user or service authentication
• No data security
5. Layers of Security in Hadoop
Access
Authentication
Authorization
Data Protection
Auditing
Policy (protect from human error)
6. Hadoop Security: Data Protection
Reference: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_edh_overview.html
7. Focus on Data Security
Confidentiality
- Confidentiality is lost when data is accessed by someone not authorized to do so
Integrity
- Integrity is lost when data is modified in unexpected ways
Availability
- Availability is lost when data is erased or becomes inaccessible
Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf
8. Hadoop Distributed File System (HDFS)
Data Security Features
- Access Control
- Data Encryption
- Data Replication
9. Access Control
Simple
- Identity determined by host operating system
Kerberos
- Identity determined by Kerberos credentials
- One realm for both compute and storage
- Required for HDFS Transparent Data Encryption
11. Data Replication
3 way replication
- Can survive any 2 failures
Erasure Coding
- Can survive more than 2 failures depending on parity bit configuration
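The trade-off between replication and erasure coding can be made concrete with a toy example. The sketch below uses a single XOR parity block, which survives only one lost block; production HDFS erasure coding uses Reed-Solomon (e.g., RS(6,3), which survives three losses), but the principle is the same: store parity instead of full copies and rebuild lost data from the survivors.

```python
# Toy illustration of erasure coding with one XOR parity block.
# Real HDFS uses Reed-Solomon; XOR parity survives only 1 failure,
# but shows the idea: parity instead of full replicas.

data_blocks = [b"block-one!", b"block-two!", b"block-3!!!"]

def xor_blocks(blocks):
    """XOR byte-wise across equal-length blocks to form a parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

parity = xor_blocks(data_blocks)

# Simulate losing one data block, then rebuild it from survivors + parity.
lost_index = 1
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])

assert rebuilt == data_blocks[lost_index]
```

Because XOR is its own inverse, XOR-ing the parity with the surviving blocks cancels them out and leaves exactly the missing block.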
12. HDFS with End-to-End Encryption
Confidentiality
- Data Access
Integrity
- Data Access + Data Encryption
Availability
- Data Access + Data Replication
13. Data Encryption
How to transform the data?
10101110001001000101110
00101000111010101010101
00011101010101110
Cleartext
XXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXX
XXX
Ciphertext
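The cleartext-to-ciphertext transformation on this slide can be sketched with a toy stream cipher. Note this is only an illustration: it derives a keystream from the key with SHA-256 in counter mode and XORs it with the data, whereas HDFS TDE uses a real cipher (AES-CTR). The two properties the deck relies on still hold: the same key both encrypts and decrypts, and the ciphertext is the same size as the cleartext.

```python
import hashlib

# Toy symmetric stream cipher, illustration only (HDFS TDE uses AES-CTR).
# A keystream derived from the key is XORed with the data, so applying
# crypt() twice with the same key returns the original bytes.

def crypt(key: bytes, data: bytes) -> bytes:
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, keystream))

key = b"secret key"
cleartext = b"10101110001001000101110"
ciphertext = crypt(key, cleartext)

assert ciphertext != cleartext                 # data is transformed
assert len(ciphertext) == len(cleartext)       # size is preserved
assert crypt(key, ciphertext) == cleartext     # same key decrypts
```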
14. Data Encryption – At Rest
Data is encrypted while on persistent media (disk)
15. Data Encryption – In Transit
Data is encrypted while traveling over the network
17. HDFS Transparent Data Encryption (TDE)
End-to-end encryption
- Data is encrypted/decrypted at the client
• Data is protected at rest and in transit
Transparent
- No application level code changes required
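What "transparent" means here can be sketched as a file-like wrapper: the cipher lives in the client I/O layer, so the application calls ordinary write()/read() unchanged. The single-byte XOR "cipher" below is a stand-in; in real HDFS TDE this role is played by the HDFS client library using AES-CTR.

```python
import io

# Sketch of client-side transparent encryption: the application writes
# and reads normally; the wrapper encrypts on the way out and decrypts
# on the way in. Toy XOR cipher, illustration only.

KEY = 0x42  # toy single-byte key; a real client uses a per-file DEK

def xor(data: bytes) -> bytes:
    return bytes(b ^ KEY for b in data)

class EncryptingFile:
    def __init__(self, backing: io.BytesIO):
        self.backing = backing              # stands in for remote HDFS storage
    def write(self, data: bytes) -> None:
        self.backing.write(xor(data))       # encrypt at the client
    def read(self) -> bytes:
        return xor(self.backing.getvalue()) # decrypt at the client

storage = io.BytesIO()
f = EncryptingFile(storage)
f.write(b"sensitive record")                # application code is unchanged

assert storage.getvalue() != b"sensitive record"  # at rest: ciphertext only
assert f.read() == b"sensitive record"            # client sees cleartext
```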
18. HDFS TDE – Design
Goals:
- Only an authorized client/user can access cleartext
- HDFS never stores cleartext or unencrypted data encryption keys
19. HDFS TDE – Terminology
Encryption Zone
- A directory whose file contents will be encrypted upon write and decrypted upon read
- An EZKEY is generated for each zone
20. HDFS TDE – Terminology
EZKEY – encryption zone key
DEK – data encryption key
EDEK – encrypted data encryption key
21. HDFS TDE - Data Encryption
The same key is used to encrypt and decrypt data
The size of the ciphertext is exactly the same as the size of the original cleartext
- EZKEY + DEK => EDEK
- EDEK + EZKEY => DEK
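The two key relations above are an envelope-encryption scheme, and can be sketched directly. The "wrapping" below is XOR for illustration (the real KMS uses AES), but the roles are the same: the DEK encrypts file data, while the EZKEY never touches data and only wraps and unwraps the DEK.

```python
import secrets

# Toy sketch of the EZKEY/DEK/EDEK envelope. XOR stands in for the
# AES wrapping done by the real KMS.

def wrap(key: bytes, other: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(key, other))

ezkey = secrets.token_bytes(16)  # held by the KMS, never by HDFS
dek = secrets.token_bytes(16)    # per-file data encryption key

edek = wrap(ezkey, dek)          # EZKEY + DEK  => EDEK (safe to store in HDFS)
recovered = wrap(ezkey, edek)    # EDEK + EZKEY => DEK  (done inside the KMS)

assert recovered == dek          # the round trip recovers the working key
assert edek != dek               # HDFS stores only the encrypted form
```

This is why HDFS can store the EDEK in file metadata without violating the design goal on slide 18: without the EZKEY, the EDEK is useless.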
22. HDFS TDE - Services
HDFS NameNode (NN)
Kerberos Key Distribution Center (KDC)
Hadoop Key Management Server (KMS)
- Key Trustee Server
23. HDFS TDE – Security Concepts
Division of Labor
- KMS creates the EZKEY & DEK
- KMS encrypts/decrypts the DEK/EDEK using the EZKEY
- HDFS NN communicates with the KMS to create EZKEYs & EDEKs to store in the extended attributes in the encryption zone
- HDFS client communicates with the KMS to get the DEK using the EZKEY and EDEK
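This division of labor can be sketched as three cooperating parties. The XOR "wrapping" is again a toy stand-in for AES; the point is who holds what: only the KMS ever sees the EZKEY, the NameNode stores only key names and EDEKs in extended attributes, and the client obtains the DEK only by asking the KMS to unwrap the EDEK.

```python
import secrets

# Sketch of the TDE division of labor: KMS, NameNode, and client.
# Toy XOR wrapping stands in for the KMS's real AES operations.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class KMS:
    def __init__(self):
        self._ezkeys = {}                          # EZKEYs never leave the KMS
    def create_ezkey(self, name: str) -> None:
        self._ezkeys[name] = secrets.token_bytes(16)
    def generate_edek(self, ezkey_name: str) -> bytes:
        dek = secrets.token_bytes(16)
        return xor(self._ezkeys[ezkey_name], dek)  # hand back only the EDEK
    def decrypt_edek(self, ezkey_name: str, edek: bytes) -> bytes:
        return xor(self._ezkeys[ezkey_name], edek) # EDEK + EZKEY => DEK

class NameNode:
    def __init__(self, kms: KMS):
        self.kms, self.xattrs = kms, {}
    def create_zone(self, path: str, ezkey_name: str) -> None:
        self.xattrs[path] = {"ezkey_name": ezkey_name}
    def create_file(self, zone: str, path: str) -> None:
        name = self.xattrs[zone]["ezkey_name"]
        self.xattrs[path] = {"ezkey_name": name,
                             "edek": self.kms.generate_edek(name)}

kms = KMS()
kms.create_ezkey("zone1key")
nn = NameNode(kms)
nn.create_zone("/encrypted_dir", "zone1key")
nn.create_file("/encrypted_dir", "/encrypted_dir/file")

# Client side: fetch key name + EDEK from the NameNode, ask KMS for the DEK.
attrs = nn.xattrs["/encrypted_dir/file"]
dek = kms.decrypt_edek(attrs["ezkey_name"], attrs["edek"])
assert dek != attrs["edek"]   # what HDFS stored is not the usable key
```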
24. HDFS TDE – Security Concepts
The name of the EZKEY is stored in the HDFS extended attributes of the directory associated with the encryption zone
The EDEK is stored in the HDFS extended attributes of the file in the encryption zone
$ hadoop key …
$ hdfs crypto …
25. HDFS Examples
Simplified for the sake of clarity:
- Kerberos actions not shown
- NameNode EDEK cache not shown
28. HDFS TDE – File Write Work Flow
[Diagram: client writing /encrypted_dir/file, which carries xattr: EDEK]
3. Client requests the DEK from the KMS, passing the EDEK & EZKEYNAME
4. KMS decrypts the DEK from the EDEK
5. KMS returns the DEK
The client reads unencrypted data from the application, encrypts it with the DEK, and writes encrypted data to /encrypted_dir/file
29. HDFS TDE – File Read Work Flow
[Diagram: client reading /encrypted_dir/file, which carries xattr: EDEK]
3. Client requests the DEK from the KMS, passing the EDEK & EZKEYNAME
4. KMS decrypts the DEK from the EDEK
5. KMS returns the DEK
The client reads encrypted data from /encrypted_dir/file, decrypts it with the DEK, and returns unencrypted data to the application
30. Bring in the Containers (e.g., Docker)
Issues with containers are the same for any virtualization platform
- Multiple compute clusters
- Multiple HDFS file systems
- Multiple Kerberos realms
- Cross-realm trust configuration
31. Containers as Virtual Machines
Note – this is not about using containers to run Big Data tasks:
32. Containers as Virtual Machines
This is about running Hadoop / Big Data clusters in containers:
34. KDC Cross-Realm Trust
Different KDC realms for corporate, data, and compute
Must interact correctly in order for the Big Data cluster to function
CORP.ENTERPRISE.COM
End Users
COMPUTE.ENTERPRISE.COM
Hadoop/Spark Service Principals
DATALAKE.ENTERPRISE.COM
HDFS Service Principals
35. KDC Cross-Realm Trust
Different KDC realms for corporate, data, and compute
- One-way trust
• Compute realm trusts the corporate realm
• Data realm trusts corporate realm
• Data realm trusts the compute realm
37. Key Management Service
Must be enterprise quality
- Key Trustee Server
• Java KeyStore KMS
• Cloudera Navigator Key Trustee Server
38. Containers as Virtual Machines
A true containerized Big Data environment:
[Diagram: multiple containerized compute clusters, each backed by its own DataLake, all spanning the same three Kerberos realms]
CORP.ENTERPRISE.COM
End Users
COMPUTE.ENTERPRISE.COM
Hadoop/Spark Service Principals
DATALAKE.ENTERPRISE.COM
HDFS Service Principals
39. Key Takeaways
Hadoop has many security layers
- HDFS Transparent Data Encryption (TDE) is best of breed
- Security is hard (complex)
- Virtualization / containerization only makes it potentially harder
- Compute and storage separation with virtualization / containerization can make it even harder still
40. Key Takeaways
Be careful with a build vs. buy decision for containerized Big Data
- Recommendation: buy one already built
- There are turnkey solutions
(e.g. BlueData EPIC)
Reference: www.bluedata.com/blog/2017/08/hadoop-spark-docker-ten-things-to-know