This document provides an overview of the key security challenges in Big Data (Apache Hadoop) systems and showcases the solutions used by the Hortonworks distribution to address them.
Contents
• BigData and Security - Key Asks and Challenges
• BigData Security - what it encompasses
• BigData Security - Approaches
• Reference Architecture & Hadoop Security
• Security - HDP Solutions components
• Hadoop Security - Popular Hadoop Distributions
"I can do so much with Big Data"
BigData and Security
• Cheaper, faster decisions
• Real-time analytics
• Competitive advantage
With the many benefits of BigData and organizations' growing dependence on it, BigData presents new challenges for data security:
• Increased threat due to increased device and machine data
• Distributed platforms sometimes lack perimeter security
• Interactions between distributed nodes are not secured
• Threat of exposing sensitive data through unstructured data
• Lack of auditing & logging features
and many more …
Inside BigData Ecosystem - Security Challenges
• Distributed environment: lack of perimeter security to streamline & monitor external requests to the BigData system
• Need to protect data against untrusted processing tracks
• Most NoSQL databases do not provide comprehensive security features, including for unstructured data
• Distributed environment involves inter-node communications and client communication with resource managers & nodes, which are not secured
• Lack of audit & logging features to monitor & handle security threats
• Handling/masking sensitive data, e.g. PII data
BigData Security - what it encompasses
• Administration: How do I set policies across the entire environment?
• Authentication: Who am I?
• Authorization: What can I do?
• Audit: What, who, when?
• Data Protection: Protection/encryption of data at rest & in motion, PII data encryption, masking
BigData Security - Approaches
Perimeter Security (Walled Garden)
• Secure the cluster by tightly controlling access through firewalls or API gateways
• Simple to set up
• If the gateway is breached, there is no protection
Data Centric Security
• Data is secured using techniques such as:
  • Encryption (data encryption using Hadoop KMS or third-party KMS)
  • Masking (e.g. PII data masking)
  • Role-based access to data
Cluster Security
• Security at every step within the cluster
• Multi-level security: authentication (LDAP, Kerberos, etc.) & authorization
• Leveraging Kerberos for authentication
• SSL/TLS for encrypting data in motion
• Audit: who, what, when?
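The PII masking technique listed under data-centric security can be illustrated with a minimal sketch. It assumes US-style social security numbers as the sensitive value; the pattern, the function names, and the choice to keep the last four digits are illustrative assumptions, not part of any Hadoop or Ranger API.

```python
import re

# Hypothetical pattern for SSN-like tokens (e.g. 123-45-6789); illustration only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Hide all but the last four digits of every SSN-like token in text."""
    return SSN_PATTERN.sub(lambda m: "XXX-XX-" + m.group(1), text)


def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of record with the named fields passed through mask_ssn."""
    masked = dict(record)
    for name in sensitive_fields:
        if name in masked:
            masked[name] = mask_ssn(str(masked[name]))
    return masked
```

In a real deployment the masking rule would normally live in a central policy rather than in application code, so that every access path applies the same rule.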
Big Data Security - Reference Architecture
[Figure: reference architecture. External data sources pass through a firewall and perimeter security into the Big Data platform (data integration hub; data ingestion, data storage & processing, data access), which feeds the DW, data visualization, and data intelligence (predictive models, time series analysis, regression analysis, recommendation engine). Security & governance spans the platform: authentication using Kerberos, TLS/SSL for data in motion, and data-centric security (authorization, data protection, audit). Users: data scientist, app developer, admin.]
BigData - Security at each step
Security aspect | Data Ingestion | Data Storage | Data Processing | Data Access & Visualization
Authentication (LDAP/Kerberos) | y | - | y | y
Authorization | y | y | y | y
Audit | y | y | y | y
Data Protection (Encryption, Masking) | y | y | - | y
Apache Hadoop - Cluster Security
[Figure: an application communicating with the master NameNode and three DataNodes; the cluster is secured with Kerberos (identity & authentication), SSL/TLS, data encryption, and logging & monitoring.]
Systems like Apache Hadoop rely on distributed computing, inter-node communication, replication, and other cluster services, which expose the cluster at multiple levels.
Representative diagram; it does not show all components, e.g. YARN.
Hadoop security - Solution components used by Hortonworks
Apache Ranger
Framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.
Centralized platform for security policies, wire encryption, and fine-grained access control.
Apache Knox
• Enables perimeter security
• Kerberos encapsulation
• Single access point for all REST interactions with Apache Hadoop clusters
• Centralized authentication, authorization, and audit for Hadoop REST/HTTP services
• Integrates with existing systems to simplify identity maintenance (SSO, LDAP, AD)
• Eliminates the need for clients to know the cluster topology
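The "single access point" idea can be sketched as follows: a client only needs the gateway address and a topology name to reach WebHDFS, not the NameNode's host. This is a minimal sketch that builds the URL without sending a request. The host name is hypothetical, and the port 8443, the `gateway` context path, and the `default` topology are commonly used defaults that a given deployment may change.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host: str, topology: str, hdfs_path: str,
                     op: str, port: int = 8443) -> str:
    """Build a WebHDFS URL routed through a Knox gateway.

    The client never references the NameNode directly; Knox maps the
    topology name to the cluster's internal service addresses.
    """
    path = quote(hdfs_path.lstrip("/"))   # keeps "/" separators, escapes the rest
    query = urlencode({"op": op})
    return f"https://{gateway_host}:{port}/gateway/{topology}/webhdfs/v1/{path}?{query}"
```

For example, `knox_webhdfs_url("knox.example.com", "default", "/tmp/data", "LISTSTATUS")` addresses a directory listing through the gateway; the same client code keeps working if NameNodes are moved or renamed behind Knox.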
Kerberos
Authentication protocol that works on the basis of 'tickets' to authenticate requests.
Enables authentication of all external requests to the BigData ecosystem, as well as requests internal to the ecosystem (incl. inter-node communications and communications between the resource manager and nodes).
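The ticket idea can be illustrated with a deliberately simplified toy model. This is not the Kerberos protocol: real Kerberos encrypts tickets, uses ticket-granting tickets, session keys, and authenticators. The sketch only shows the core notion that a trusted third party issues a time-limited token that a service can verify with a key it shares with that party. All class and function names are hypothetical.

```python
import hashlib
import hmac
import json
import time


class ToyKDC:
    """Toy key distribution centre: signs tickets with each service's key."""

    def __init__(self):
        self._service_keys = {}

    def register_service(self, service: str, key: bytes) -> None:
        self._service_keys[service] = key

    def issue_ticket(self, principal: str, service: str,
                     lifetime_s: float = 300, now: float = None):
        now = time.time() if now is None else now
        payload = json.dumps(
            {"principal": principal, "service": service, "expires": now + lifetime_s},
            sort_keys=True,
        ).encode()
        signature = hmac.new(self._service_keys[service], payload, hashlib.sha256).hexdigest()
        return payload, signature


def verify_ticket(service: str, service_key: bytes, ticket, now: float = None):
    """Return the authenticated principal, or None if the ticket is invalid."""
    payload, signature = ticket
    expected = hmac.new(service_key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return None                      # forged or tampered ticket
    claims = json.loads(payload)
    now = time.time() if now is None else now
    if claims["service"] != service or claims["expires"] < now:
        return None                      # wrong service or expired
    return claims["principal"]
```

The service never sees the user's password; it trusts the ticket because only the issuing party and the service hold the signing key, and the expiry limits how long a stolen ticket is useful.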
SSL/TLS
SSL/TLS is the standard security technology for establishing an encrypted link between server & client.
This ensures that data transferred between the client & server remains private and unaltered.
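On the client side, an encrypted and verified link can be set up with Python's standard `ssl` module. This is a minimal sketch, assuming a client that must verify the server certificate; the helper name and the choice of TLS 1.2 as the minimum version are assumptions for illustration.

```python
import ssl


def make_client_context(ca_file: str = None) -> ssl.SSLContext:
    """Create a TLS client context that verifies the server certificate.

    ca_file is an optional path to a CA bundle (e.g. an internal cluster CA);
    when omitted, the system's default trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return context
```

The returned context can wrap a socket with `context.wrap_socket(sock, server_hostname=host)`, which encrypts the traffic and checks that the certificate matches the host name.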
BigData Security Solution Components
• Apache Ranger (formerly XA Secure, which Hortonworks acquired in 2014) is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.
• It provides:
  • Centralized platform for security policy administration
  • Fine-grained access control
  • Centralized audit reporting
  • Wire encryption
  • HDFS encryption with Ranger KMS
• Supports fine-grained authorization & auditing for the following Apache projects: Apache Hadoop, Apache Hive, Apache HBase, Apache Storm, Apache Knox, Apache Solr, Apache Kafka, YARN
Source - Hortonworks
Ranger plugins run within the specific applications they protect and therefore do not have an adverse effect on system performance.
Ranger UI: enables user/user-group level authorization for components such as HDFS, YARN, HBase, Hive, Kafka, and others.
Using Apache Ranger to enable authorization for HDFS at the user/user-group level.
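User/user-group level authorization on HDFS paths can be sketched with a toy evaluator. This is not Ranger's implementation or API: real Ranger policies also support deny rules, exceptions, and wildcards, among other features. The class and function names below are hypothetical, and the sketch assumes a simple default-deny model.

```python
from dataclasses import dataclass, field


@dataclass
class PathPolicy:
    """Toy allow-policy for one HDFS path (names are illustrative)."""
    path: str
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)
    accesses: set = field(default_factory=set)   # e.g. {"read", "write"}
    recursive: bool = True


def path_matches(policy: PathPolicy, path: str) -> bool:
    base = policy.path.rstrip("/")
    if path.rstrip("/") == base:
        return True
    # Recursive policies also cover everything below the base directory.
    return policy.recursive and path.startswith(base + "/")


def is_allowed(policies, user: str, user_groups, path: str, access: str) -> bool:
    """Default deny: access is granted only if some policy explicitly allows it."""
    for policy in policies:
        granted = user in policy.users or bool(policy.groups & set(user_groups))
        if granted and access in policy.accesses and path_matches(policy, path):
            return True
    return False
```

Granting access through groups rather than individual users keeps the policy list short and lets identity changes in LDAP/AD take effect without editing policies.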
Kafka and Security
• Kafka security was introduced in Kafka 0.9 and included in Confluent Platform 2.0
Key features:
• Client authentication using Kerberos or TLS client certificates, so Kafka brokers know who is making requests
• Unix-like permissions to control which users can access which data
• Encrypted network communication, allowing messages to be sent securely across networks
• Authentication required for communication between brokers and ZooKeeper
• Apache Ranger support for Kafka authorization & audit
Source - Hortonworks
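A client that uses both Kerberos authentication and encrypted communication needs a handful of settings. The sketch below renders them as a properties file; the configuration keys are standard Kafka client settings, while the helper name, broker address, and truststore path are hypothetical. The truststore password and the JAAS/keytab setup are deliberately left out and would be required in practice.

```python
def kafka_client_properties(bootstrap_servers: str, truststore_path: str,
                            service_name: str = "kafka") -> str:
    """Render client settings for Kerberos (SASL) authentication over TLS."""
    props = {
        "bootstrap.servers": bootstrap_servers,
        # SASL_SSL = Kerberos authentication plus TLS-encrypted transport.
        "security.protocol": "SASL_SSL",
        # Must match the primary of the brokers' Kerberos principal.
        "sasl.kerberos.service.name": service_name,
        # Trust store holding the CA that signed the broker certificates.
        "ssl.truststore.location": truststore_path,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())
```

Some distributions use different protocol names or defaults, so the values should be checked against the documentation of the Kafka version in use.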
HBase & Spark - Security
• Apache HBase
  • Leverages Apache Ranger, Apache Knox, and Kerberos for HBase security
  • Authentication
  • Authorization (ACLs)
  • Encryption for data at rest
  • Wire encryption
• Apache Spark
  • Kerberos: token-based authentication
  • Spark communication encryption settings
  • Leverages YARN SSL for YARN NodeManager - executor communication
  • Other settings include:
    • spark.authenticate=true (enables RPC authentication)
    • spark.authenticate.enableSaslEncryption=true (wire encryption, shuffle)
    • spark.ssl.enabled=true
  • Available from HDP 2.5 onwards
  • Fine-grained column-level access control using Hive LLAP (Live Long and Process)
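The Spark settings listed above can be passed on the command line with `spark-submit --conf key=value`. The following minimal sketch builds such a command as an argument list without running it. The helper name and the application file are hypothetical, and `spark.ssl.enabled` is the spelling used in Spark's configuration documentation.

```python
# Security-related settings from the list above, as Spark configuration keys.
SPARK_SECURITY_SETTINGS = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.ssl.enabled": "true",
}


def spark_submit_args(application: str, settings: dict = None) -> list:
    """Build a spark-submit argument list with one --conf per setting."""
    settings = SPARK_SECURITY_SETTINGS if settings is None else settings
    args = ["spark-submit"]
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    args.append(application)
    return args
```

The same key/value pairs could instead be placed in `spark-defaults.conf` so that every job on the cluster picks them up without per-job flags.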
Spark Column Security with LLAP
1. SparkSQL gets data locations, known as "splits", from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger. Per-user policies such as row filtering are applied.
3. Spark gets a modified query plan based on the dynamic security policy.
4. Spark reads data from LLAP; filtering/masking is done by LLAP.
• Fine-grained column-level access control for Spark
• Dynamic policies per user; doesn't require views
• Uses standard Ranger policies and tools to control access and masking policies
[Figure: Spark client + LLAP context, HiveServer2 (authorization), Ranger server (dynamic policies), and LLAP (data read, filter pushdown), with arrows numbered 1-4 matching the steps above.]
Source - Hortonworks
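The effect of per-user row filtering and column masking at read time can be sketched in plain Python. This toy model does not use Spark, Hive, LLAP, or Ranger APIs; the policy table, user names, and the fixed mask string are all hypothetical. It only illustrates that the reader receives data already filtered and masked according to who is asking.

```python
# Hypothetical per-user policies: a row filter plus a set of masked columns.
POLICIES = {
    "analyst": {"row_filter": lambda row: row["region"] == "EU", "masked": {"ssn"}},
    "auditor": {"row_filter": lambda row: True, "masked": set()},
}


def read_as(user: str, rows: list, columns: list, policies: dict = POLICIES) -> list:
    """Return the requested columns with the user's filter and masks applied."""
    if user not in policies:
        raise PermissionError(f"no policy grants access to {user!r}")
    policy = policies[user]
    result = []
    for row in rows:
        if not policy["row_filter"](row):
            continue                      # row-level filtering
        result.append({
            col: ("****" if col in policy["masked"] else row[col])  # column masking
            for col in columns
        })
    return result
```

Because the policy is resolved per user at query time, the same table serves different users without maintaining a separate view for each one.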
BigData Security - Popular Hadoop Distributions
Cloudera
• Description: Provides comprehensive security for CDH, leveraging Cloudera Manager, Apache Sentry, and Kerberos.
• Perimeter Security: Cloudera Manager provides authentication and network isolation. It can be integrated with Kerberos/LDAP/AD.
• Data Centric Security: Leverages Apache Sentry to provide fine-grained, cell-level security. Auditing is provided through Cloudera Navigator. Security for data in transit is provided through TLS and other mechanisms, centrally deployed through Cloudera Manager. Transparent data-at-rest protection is provided through the combination of HDFS encryption, Navigator Encrypt, and Navigator Key Trustee.
MapR
• Description: Built-in support for authentication, authorization, impersonation, encryption, and auditing. TLS/SSL for data in motion.
• Perimeter Security: Has a native authentication mechanism and can also be integrated with Kerberos.
• Data Centric Security: MapR supports Hadoop Access Control Lists (ACLs) for regulating user privileges to the job queue and cluster. The Secure Sockets Layer/Transport Layer Security (SSL/TLS) protocol secures several channels of HTTP traffic.
Hortonworks
• Description: Provides comprehensive security for the Apache Hadoop stack: perimeter security, data-centric security, and cluster security.
• Perimeter Security: Leverages the Apache Knox gateway for perimeter security.
• Data Centric Security: Uses Apache Ranger for authorization (incl. fine-grained, cell-level), audit, and data protection through support for HDFS Transparent Encryption. Ranger supports security for multiple solutions, including Apache Kafka, HBase, and YARN.