This talk was delivered by Anurag Shrivastava at Hadoop Summit 2015 Brussels. It covers how Apache Ranger, Apache Sentry, Apache Knox and Project Rhino can help you pass IT risk assessment in Hadoop projects.
Hadoop Security Features that make your risk officer happy
1. HADOOP SECURITY FEATURES
That make your risk officer happy
By Anurag Shrivastava, ING Commercial Bank, Amsterdam
@shri2201
2. Security for Hadoop
Source: http://blogs.gartner.com/merv-adrian/2014/01/21/security-for-hadoop-dont-look-now/
Hadoop Security Features 2
3. Hadoop in Enterprise
Data Lake – an important information asset for the enterprise
Data from systems of record and logs is stored in Hadoop
Significant cost savings for the enterprise
Diverse types of users
Picture Source: http://arunkottolli.blogspot.nl/2014/03/understanding-data-in-big-data.html
4. Operational Security in Enterprise
• User Access Management
• Security Event Monitoring
• Application State Monitoring
• Security Testing
• Patch Management
• Data Protection
• Backup and restore
5. User Access Management
Requirements
Privileged, group and generic accounts
Separation of technical and business users
Separation of environments (DTAP)
Separation of admins and other users
Separation of users in different business roles
Application of the four-eyes principle when entering or changing data
6. Security Event Monitoring
• Definition of application specific events
• All login attempts, failed or successful
• Unauthorized attempts to access a table or file
• Operational performance of application
• Name node performance
• CPU, Disk
• Integration with Master Control Room
• Alerting the asset manager
7. Data Protection (1/2)
• Confidentiality
• Protect information from unauthorized disclosure
• Integrity
• Ensure the accuracy, completeness and timeliness of information and prevent data tampering
• Availability
• Ensure that information and services are available when required
Picture Source: http://www.attix5.co.uk/thought-leadership/why-data-protection-software-essential-good-nights-sleep
8. Data Protection (2/2)
• Confidentiality
• Logon
• Access Control
• Malicious code protection
• Security Event Monitoring
• Encryption
• Integrity
• Message authentication code
• Data Lineage
Picture Source: http://www.attix5.co.uk/thought-leadership/why-data-protection-software-essential-good-nights-sleep
9. Security under spotlight in Data Lake
• All kinds of enterprise data – structured, semi-structured and unstructured
• Many groups of users – Data Scientists, Analysts, Engineers, Marketers, Managers
• Long term retention of data
• Different types of workloads
• Value of data grows as the data from different sources are combined in the Data Lake
Picture source: http://beyondplm.com/2014/05/05/plm-downstream-usage-and-future-information-rivers/
10. Data Lake Risks
• The Data Lake is an attractive target for inside and outside attackers
• A security compromise in the Data Lake can have a major or catastrophic business impact
IT risk assessment gives a Hadoop implementation the highest risk rating for the Data Lake use case.
11. Lab Like Security is not Enough
Play Area
Big Data Predictive Analytics Lab
Production System
12. Predictive Analytics Lab
Diagram labels:
Stepping Stone (Citrix)
18 x Hadoop Nodes
GIT, Libraries, Build Tools
Monitoring Services
Data Files in Batches
Dedicated VLAN
Shared Services
SMTP Relay
Internet via Corporate Infrastructure
Firewall Rules Guard the Perimeter Security of Hadoop Cluster
Lab like security works for a small group of people
13. Limitations of Hadoop
• No “Data at Rest” Encryption
• A Kerberos-Centric Approach
• Limited Authorization Capabilities
• Complexity of the Security Model and Configuration
Unfortunately, this is not sufficient for a Data Lake that ingests all the data and caters to thousands of users.
14. Hadoop Security
Hadoop Security Solutions from Major Vendors
Hortonworks acquires XASecure to bring ACLs to Hadoop
Apache Ranger
Apache Knox
Apache Falcon
Cloudera is working on Project Rhino
Apache Sentry
16. Apache Ranger
Apache Ranger currently supports authorization, auditing and security administration for a limited number of HDP components:
Hive
HBase
Storm
Knox
HDFS
17. Apache Ranger Goals
1. Centralized security administration to manage all security-related tasks in a central UI or using REST APIs.
2. Fine-grained authorization to do a specific action and/or operation with a Hadoop component/tool, managed through a central administration tool.
3. Standardized authorization method across all Hadoop components.
4. Enhanced support for different authorization methods: role-based access control, attribute-based access control, etc.
5. Centralized auditing of user access and (security-related) administrative actions within all the components of Hadoop.
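The goals listed on this slide can be illustrated with a small sketch. It is not Ranger's API or policy format: the `Policy` and `PolicyStore` classes, the `authorize` method and the resource names are hypothetical. The sketch only shows the idea of one central policy store that makes fine-grained, role-based decisions (deny by default) and writes an audit record for every decision.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: roles that may perform actions on one resource."""
    resource: str        # e.g. "hive:sales.orders" (made-up naming scheme)
    roles: frozenset     # roles the policy applies to
    actions: frozenset   # actions the policy allows


@dataclass
class PolicyStore:
    """Central place for policies and for the audit trail of decisions."""
    policies: List[Policy] = field(default_factory=list)
    audit_log: List[Dict] = field(default_factory=list)

    def authorize(self, user: str, user_roles: Iterable[str],
                  resource: str, action: str) -> bool:
        roles = set(user_roles)
        # Deny by default: access needs at least one matching policy.
        allowed = any(
            p.resource == resource
            and action in p.actions
            and bool(p.roles & roles)
            for p in self.policies
        )
        # Every decision is audited, whether it was allowed or denied.
        self.audit_log.append(
            {"user": user, "resource": resource,
             "action": action, "allowed": allowed}
        )
        return allowed
```

A caller would create one store, register policies such as `Policy("hive:sales.orders", frozenset({"analyst"}), frozenset({"select"}))`, and route every access check through `authorize`, so that denied attempts also end up in `audit_log` for security event monitoring.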
19. Apache Falcon
• Visualize Data Pipeline Lineage
• Track Data Pipeline audit logs
• End to End Monitoring of Data Pipeline
• Policies for Data Replication and Retention
21. Goals of Project Rhino
• Provide encryption with hardware-enhanced performance
• Support enterprise-grade authentication and single sign-on for Hadoop services
• Provide role-based access control in Hadoop with cell-level granularity in HBase
• Ensure consistent auditing across essential Apache Hadoop components
23. Making Risk Officer Happy
• Hadoop security has more to offer
• Role based access
• Audit logging
• Data encryption
• User Access Management
• Security Event Monitoring
• Application State Monitoring
• Security Testing
• Patch Management
• Data Protection
• Backup and restore
Overlapping vendor efforts, lack of complete coverage for all products, and varying commitment to open source would slow down the adoption of Hadoop.
Ask a question about the biggest data security breaches.
Target: 40 million debit/credit card numbers stolen
Sony Online: 102 million records
Home Depot: 56 million payment cards
Hadoop Security was completely ineffective
APT is real.
We are a bunch of people who get very excited about the technology when we hear about Hadoop.
However, when it comes to security, it seems that nobody is bothered about it except the risk officer.
This creates some tension between IT, business and risk.
Technology has not kept up with marketing.
All the sweet marketing and enterprise sales guys sell Hadoop as the right system for the enterprise.
Hadoop becomes an important information system asset in the enterprise.
Enterprises find Hadoop attractive because of its lower cost.
Hadoop analytics is not limited to web logs alone but also covers data stored in systems of record.
Hadoop caters to a diverse group of business and technical users.
I see a paradox here.
A system for the enterprise where the CIO does not bother about security.
State monitoring is about monitoring the application settings.
Security testing involves static and dynamic code scans.
Patch management requires that patch history is maintained, that systems are tested after patching, and that someone decides which patch is appropriate for the system.
Backup and restore covers backup frequency, logging of restore activity, detection of incomplete backups and safe storage of backups as per the CIA rating.
Typical requirements of user access management are explained.
Role based access.
You can use several techniques to convince your risk officer about data protection.
However, as you bring all the data into the data lake, you have to take all the measures.
A very important Hadoop use case (Data Lake) puts the Hadoop security story to a hard test.
Multitenancy
A beautiful house without door locks.
Multi tenancy, workload segregation
User separation
A sanitized Hadoop cluster does not work.
Perimeter security with a stepping stone has its limitations.
We had to implement two-factor authentication.
Put the Hadoop team in a sanitized area.
Hadoop provides an all-or-nothing model for security.
It relied heavily upon file system security.
1. No “Data at Rest” Encryption. Currently, data is not encrypted at rest on HDFS. Organizations with strict security requirements for the encryption of their data in Hadoop clusters are forced to use third-party tools for implementing HDFS disk-level encryption, or security-enhanced Hadoop distributions (like Intel’s distribution from earlier this year).
2. A Kerberos-Centric Approach. Hadoop security relies on Kerberos for authentication. For organizations that use other approaches not involving Kerberos, this means setting up a separate authentication system in the enterprise.
3. Limited Authorization Capabilities. Although Hadoop can be configured to perform authorization based on user and group permissions and Access Control Lists (ACLs), this may not be enough for every organization. Many organizations use flexible and dynamic access control policies based on XACML and Attribute-Based Access Control. Although it is certainly possible to perform this level of authorization filtering using Accumulo, Hadoop’s authorization credentials are limited.
4. Complexity of the Security Model and Configuration. There are a number of data flows involved in Hadoop authentication: Kerberos RPC authentication for applications and Hadoop services, HTTP SPNEGO authentication for web consoles, and the use of delegation tokens, block tokens, and job tokens. For network encryption, there are also three encryption mechanisms that must be configured: Quality of Protection for SASL mechanisms, SSL for web consoles, and HDFS Data Transfer Encryption. All of these settings need to be configured separately, and it is easy to make mistakes.
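Because these settings are spread over several files, a small script can help spot gaps. The sketch below is only an illustration: the property names are standard Hadoop configuration keys, but the expected values in `EXPECTED` are an assumed strict baseline chosen for this example, not a recommendation from the talk, and `parse_hadoop_xml` and `find_gaps` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumed baseline for this example; adjust to your own risk requirements.
EXPECTED = {
    "hadoop.security.authentication": "kerberos",  # core-site.xml
    "hadoop.security.authorization": "true",       # core-site.xml
    "hadoop.rpc.protection": "privacy",            # SASL quality of protection
    "dfs.encrypt.data.transfer": "true",           # hdfs-site.xml
    "dfs.http.policy": "HTTPS_ONLY",               # SSL for web consoles
}


def parse_hadoop_xml(xml_text):
    """Read <property><name>/<value> pairs from a Hadoop *-site.xml string."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def find_gaps(props):
    """Return (name, expected, actual) for every setting off the baseline."""
    gaps = []
    for name, expected in EXPECTED.items():
        actual = props.get(name)
        if actual is None or actual.lower() != expected.lower():
            gaps.append((name, expected, actual))
    return gaps
```

In practice you would merge the properties parsed from `core-site.xml` and `hdfs-site.xml` into one dictionary before calling `find_gaps`, and treat a missing property as a gap because the default may be weaker than the baseline.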
As the Wall Street Journal reported, Bank of New York Mellon Corp.’s Hadoop system bogged down after too many employees accessed it. Ms. Crisp is hedging her bets by maintaining the bank’s commercial database and data warehouse software.
How Hadoop leaders have responded to these challenges, in addition to several proprietary initiatives which are not covered here.
HDP 2.2 brings a major change in Hadoop security.
The acquisition of XASecure has been significant in terms of user access management.
Role based access for several components
Logging
Single console
Not a single point of failure
Apache Ranger is very promising from the user access management and security event monitoring perspectives.
But not all the Hadoop components are covered.
Most security is geared toward the consumers of data.
Goals 4 and 5 are very promising features.
The following Hadoop services have integrations with the Knox Gateway:
WebHDFS (HDFS)
Templeton (HCatalog)
Stargate (HBase)
Oozie
Hive/JDBC
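As an illustration of what going through the gateway means for a client, the sketch below builds a WebHDFS URL that points at Knox instead of at the NameNode. The `https://{host}:{port}/{gateway-path}/{topology}/webhdfs/v1/...` layout follows the pattern described in the Knox documentation as I understand it; the helper name `knox_webhdfs_url`, the host name, and the default port, gateway path and topology values are assumptions for this example and depend on your deployment.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path,
                     op="LISTSTATUS", port=8443, gateway_path="gateway"):
    """Build a WebHDFS REST URL that is routed through a Knox gateway.

    All defaults are assumptions; check your own gateway configuration.
    """
    # Keep "/" separators, percent-encode anything unsafe in the path.
    path = quote(hdfs_path.lstrip("/"))
    query = urlencode({"op": op})
    return (
        f"https://{gateway_host}:{port}/{gateway_path}/"
        f"{topology}/webhdfs/v1/{path}?{query}"
    )


if __name__ == "__main__":
    # Hypothetical host and topology, used only to show the URL shape.
    print(knox_webhdfs_url("knox.example.com", "default", "/data/raw"))
```

The point of the design is that clients only ever see the gateway address, so the cluster's internal host names and ports stay behind the perimeter.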
Sentry: Unified authorization and RBAC. Overlaps with Ranger.
Secure authorization
Limited coverage: Hive and Impala
Pluggable interfaces, binding with Pig
Cloudera CDH 4.3
The open source commitment of Cloudera is a big question mark.
DG Secure alternative for HDP.
Key distribution and management is included.
Snapshots, log etc. can be encrypted.
Crypto codecs.
Integration with the PKI infrastructure in a large enterprise is a challenge.
Compared to the previous year, Hadoop security has a lot more to offer, but it is still far from being a complete system suited for Data Lake use cases.
You have to mix and match the components, which is hard. Ranger is strong in user access management and security monitoring. Rhino is strong in data protection.
Hadoop is ready for the enterprise, but we are still working on readiness.
You can’t make the risk officer very happy.
All kinds of reasons are given for not building in security:
performance, architecture, “you did not need it before”.
Time to improve it.