Securing the Hadoop Ecosystem
Patrick Angeles
Big Data Warehouse Meetup
Feb 10, 2014

Why is Security Important?

About Me
• Hadooping for 5+ years
• Responsible for several secure Hadoop deployments
• Did e-commerce and consumer analytics (PCI, PII, etc.)
• Crypto and PKI in a previous life

Why Secure Hadoop?
• Multi-tenancy
  • You want your cluster to store data and run workloads from multiple users and groups
• Compliance
  • You have policies on which personnel can view what data

Agenda
• Hadoop Ecosystem Interactions
• Security Concepts
• Security in Practice
  • IT Infrastructure Integration
  • Deployment Recommendations

Hadoop on its Own
[Diagram. Labels: end users; WebHDFS client, HDFS client, MR client; HttpFS; Hadoop: NN, SNN, JT, and three DN/TT nodes running Map and Reduce tasks; service users: hdfs, httpfs & mapred; protocols: RPC/data transfer/HTTP]

Hadoop and Friends
[Diagram. Labels: end users, service users, clients, services; components: HBase, ZooKeeper, Oozie, WebHDFS, Pig, Hue, Crunch, browser, Cascading, MapRed, Hadoop, Flume, Sqoop, Impala, Hive, Hive Metastore; protocols: RPCs/data/HTTP/Thrift/Avro-RPC]

Security Concepts
• Authentication
• Authorization
• Confidentiality
  • Encryption
• Auditing
  • Traceability

Authentication
• End Users to Services, as a user
  • CLI & libraries: Kerberos (kinit or keytab)
  • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
  • MR tasks use delegation tokens
• Services to Services, as a service
  • Credentials: Kerberos (keytab)
  • Client SSL certificates (for shuffle encryption)
• Services to Services, on behalf of a user
  • Proxy-user (after Kerberos for service)
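
The proxy-user case above can be sketched in a few lines. This is a minimal illustration, not Hadoop's implementation: `ProxyRule`, `may_impersonate`, and the host and group names are hypothetical. The shape is modeled on Hadoop's `hadoop.proxyuser.<service>.hosts` and `hadoop.proxyuser.<service>.groups` settings, where `*` means "any". The service is assumed to have already authenticated with Kerberos.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRule:
    """Who a service may impersonate, and from where (hypothetical structure)."""
    hosts: frozenset   # hosts the service is allowed to call from
    groups: frozenset  # groups whose members may be impersonated


def _matches(allowed, candidates):
    # "*" is treated as a wildcard, mirroring the proxy-user convention.
    return "*" in allowed or bool(allowed & set(candidates))


def may_impersonate(rules, service, caller_host, user_groups):
    """Return True if the authenticated service may act for a user in user_groups."""
    rule = rules.get(service)
    if rule is None:
        return False  # the service is not a configured super-user
    return _matches(rule.hosts, [caller_host]) and _matches(rule.groups, user_groups)


rules = {
    "oozie": ProxyRule(hosts=frozenset({"oozie1.example.com"}),
                       groups=frozenset({"analysts"})),
}
print(may_impersonate(rules, "oozie", "oozie1.example.com", ["analysts", "eng"]))
```

Both conditions must hold: the call has to come from an allowed host, and the impersonated user has to belong to an allowed group.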

Authorization
• HDFS Data
  • File system permissions (Unix-like user/group permissions)
• HBase Data
  • Read/Write Access Control Lists (ACLs) at table level
• HiveServer2 and Impala
  • Fine-grained authorization through Apache Sentry (Incubating)
• Jobs (Hadoop, Oozie)
  • Job ACLs for Hadoop Scheduler Queues, manage & view jobs
• ZooKeeper
  • ACLs at znodes, authenticated & read/write
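
The Unix-like permission model used for HDFS data can be sketched as below. This is an illustration of the general owner/group/other rule, not HDFS code: `can_access` is a hypothetical helper, and special cases such as the superuser bypass are left out.

```python
def can_access(mode, owner, group, user, user_groups, want):
    """Unix-like check: pick the owner, group, or other bits, then test them.

    mode is the permission bits (e.g. 0o640); want is a subset of "rwx".
    """
    if user == owner:
        bits = (mode >> 6) & 0o7   # owner class
    elif group in user_groups:
        bits = (mode >> 3) & 0o7   # group class
    else:
        bits = mode & 0o7          # everyone else
    needed = 0
    for ch in want:
        needed |= {"r": 4, "w": 2, "x": 1}[ch]
    return (bits & needed) == needed


# A file owned by patrick:analysts with mode 640.
print(can_access(0o640, "patrick", "analysts", "alice", ["analysts"], "r"))
```

Only one class of bits is consulted per request, so a group member never falls through to the "other" bits.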

Confidentiality
• Data in transit
  • RPC: using SASL
  • HDFS data: using SASL
  • HTTP: using SSL (web UIs, shuffle). Requires SSL certs
• Data at rest
  • Nothing out of the box
  • Doable by: custom ‘compression’ codec or local file system encryption
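
As a rough sketch of what enabling encryption in transit looks like in configuration, the snippet below renders Hadoop-style `<configuration>` XML from a dictionary. The property names (`hadoop.rpc.protection`, `dfs.encrypt.data.transfer`, `mapreduce.shuffle.ssl.enabled`) are given as I understand them from the Hadoop documentation and should be checked against the version in use. `render_site_xml` is a hypothetical helper, and SSL keystore settings are not covered.

```python
import xml.etree.ElementTree as ET

# Assumed property names; verify against the documentation for your Hadoop version.
WIRE_ENCRYPTION = {
    "core-site.xml": {"hadoop.rpc.protection": "privacy"},       # RPC over SASL with encryption
    "hdfs-site.xml": {"dfs.encrypt.data.transfer": "true"},      # HDFS block data transfer
    "mapred-site.xml": {"mapreduce.shuffle.ssl.enabled": "true"},  # shuffle over SSL
}


def render_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in sorted(props.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, props in WIRE_ENCRYPTION.items():
    print(filename, render_site_xml(props))
```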

Auditing
• Who accessed (read/write) FS data
  • NN audit log contains all file opens, creates
  • NN audit log contains all metadata ops, e.g. rename, listdir
• Who submitted, managed, or viewed a Job or a Query
  • JT, RM, and Job History Server logs contain history of all jobs run on a cluster
• Who submitted, managed, or viewed a workflow
  • Oozie audit logs contain history of all user requests
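
To make the NN audit log concrete, here is a small parsing sketch. It assumes the common layout of tab-separated `key=value` fields after an `FSNamesystem.audit:` prefix. The sample line is made up for illustration, and `parse_audit_line` is a hypothetical helper, so check the layout against real logs before relying on it.

```python
def parse_audit_line(line):
    """Split one audit line into a dict of its key=value fields."""
    _, _, payload = line.partition("FSNamesystem.audit: ")
    fields = {}
    for item in payload.strip().split("\t"):
        key, sep, value = item.partition("=")
        if sep:  # skip anything that is not key=value
            fields[key] = value
    return fields


# Made-up sample line in the assumed layout.
SAMPLE = ("2014-02-10 10:15:01,123 INFO FSNamesystem.audit: allowed=true\t"
          "ugi=patrick (auth:KERBEROS)\tip=/10.1.2.3\tcmd=open\t"
          "src=/data/sales/part-00000\tdst=null\tperm=null")

record = parse_audit_line(SAMPLE)
print(record["ugi"], record["cmd"], record["src"])
```

Once lines are dictionaries, "who opened which file" becomes a filter on `cmd` and `src`.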

Auditing Gaps
• Not all projects have explicit audit logs
  • Audit-like information can be extracted by processing logs
  • E.g. Impala query logs are distributed across all nodes
• It is difficult to correlate jobs & data access
  • E.g. Map-Reduce jobs launched by a Pig job
  • E.g. HDFS data accessed by a Map-Reduce job
  • Tools written on top of Hadoop can do this well
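
One simple form of "processing logs" when they are spread across nodes is to merge them into a single timeline. The sketch below assumes each node's entries are already sorted by timestamp and that timestamps are comparable strings. The node names, messages, and `merge_node_logs` are hypothetical.

```python
import heapq


def merge_node_logs(per_node):
    """Merge per-node lists of (timestamp, message) into one sorted timeline.

    Each node's list must already be sorted by timestamp.
    """
    tagged = (
        [(ts, node, msg) for ts, msg in entries]
        for node, entries in sorted(per_node.items())
    )
    return list(heapq.merge(*tagged))


logs = {
    "node2": [("2014-02-10T10:00:02", "query q2 start"), ("2014-02-10T10:00:05", "query q2 end")],
    "node1": [("2014-02-10T10:00:01", "query q1 start"), ("2014-02-10T10:00:04", "query q1 end")],
}
for entry in merge_node_logs(logs):
    print(entry)
```

This only orders events. Correlating a job with the data it touched still needs a shared identifier in the logs.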

Security in Practice

Integration: Kerberos
• Users don’t want Yet Another Credential
• Corp IT doesn’t want to provision thousands of service principals
• Solution: local KDC + one-way trust
  • Run a KDC (usually MIT Kerberos) in the cluster
    • Put all service principals here
  • Set up one-way trust of central corporate realm by local KDC
    • Normal user credentials can be used to access Hadoop
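
The split between the two realms shows up in the principal names themselves. The sketch below parses a principal of the form `primary[/instance]@REALM` and classifies it. It uses the realm names that appear in the deck's Kerberos + LDAP diagram (`EXAMPLE.COM`, `HADOOP.EXAMPLE.COM`). `parse_principal` and `classify` are hypothetical helpers that illustrate the naming only; they do not set up or verify any trust.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str
    instance: Optional[str]
    realm: str


def parse_principal(name):
    """Parse 'primary[/instance]@REALM' into its parts."""
    local, sep, realm = name.rpartition("@")
    if not sep or not local or not realm:
        raise ValueError(f"not a full principal: {name!r}")
    primary, slash, instance = local.partition("/")
    return Principal(primary, instance if slash else None, realm)


CLUSTER_REALM = "HADOOP.EXAMPLE.COM"  # local KDC: service principals
CORPORATE_REALM = "EXAMPLE.COM"       # central realm: end users


def classify(name):
    """Label a principal as a cluster service, a corporate user, or unexpected."""
    p = parse_principal(name)
    if p.realm == CLUSTER_REALM and p.instance is not None:
        return "service"
    if p.realm == CORPORATE_REALM and p.instance is None:
        return "user"
    return "unexpected"


print(classify("hdfs/host1@HADOOP.EXAMPLE.COM"), classify("me@EXAMPLE.COM"))
```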

Integration: Groups
• Much of Hadoop authorization uses “groups”
  • User ‘patrick’ might belong to groups ‘analysts’, ‘eng’, etc.
• Users’ groups are not stored in Hadoop anywhere
  • Refers to external system to determine group membership
  • NN/JT/Oozie/Hive servers all must perform group mapping
• Default plugins for user/group mapping:
  • ShellBasedUnixGroupsMapping – forks/runs ‘/bin/id’
  • JniBasedUnixGroupsMapping – makes a system call
  • LdapGroupsMapping – talks directly to an LDAP server
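
The pluggable group mapping can be pictured as a one-method interface with interchangeable back ends. The classes below are hypothetical Python stand-ins, not the Hadoop plugins: `ShellGroupsMapping` shells out to `id -Gn`, roughly in the spirit of the shell-based plugin, and `StaticGroupsMapping` is an in-memory table for testing.

```python
import subprocess


class StaticGroupsMapping:
    """In-memory mapping, handy for tests (hypothetical stand-in)."""

    def __init__(self, table):
        self._table = {user: list(groups) for user, groups in table.items()}

    def get_groups(self, user):
        return list(self._table.get(user, []))


class ShellGroupsMapping:
    """Asks the operating system by running `id -Gn <user>`."""

    def get_groups(self, user):
        result = subprocess.run(["id", "-Gn", user], capture_output=True, text=True)
        # Unknown users make `id` exit non-zero; treat that as "no groups".
        return result.stdout.split() if result.returncode == 0 else []


def is_member(mapping, user, group):
    """Authorization code only needs get_groups, whichever back end supplies it."""
    return group in mapping.get_groups(user)


mapping = StaticGroupsMapping({"patrick": ["analysts", "eng"]})
print(is_member(mapping, "patrick", "analysts"))
```

Because every server resolves groups itself, all of them need to be pointed at the same source, or the same user can end up with different rights on different services.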

Integration: Kerberos + LDAP
[Diagram. Labels: Central Active Directory (me@EXAMPLE.COM, …); Hadoop Cluster with NN, JT, and Local KDC (hdfs/host1@HADOOP.EXAMPLE.COM, yarn/host2@HADOOP.EXAMPLE.COM, …); LDAP group mapping; cross-realm trust]

Integration: Web Interfaces
• Most web interfaces authenticate using SPNEGO
  • Standard HTTP authentication protocol
  • Used internally by services which communicate over HTTP
  • Most browsers support Kerberos SPNEGO authentication
• Hadoop components which use servlets for web interfaces can plug in a custom filter
  • Integrate with intranet SSO HTTP solution
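
At the HTTP level, SPNEGO is a challenge/response on two headers: the server answers `401` with `WWW-Authenticate: Negotiate`, and the client retries with `Authorization: Negotiate <base64 token>`. The sketch below shows only that header handling. `handle` and `verify_token` are hypothetical, actual token validation (GSS-API/Kerberos) is left to the caller-supplied callback, and header lookup is case-sensitive here for brevity.

```python
import base64
import binascii


def extract_negotiate_token(headers):
    """Return the raw token bytes from 'Authorization: Negotiate ...', or None."""
    value = headers.get("Authorization", "")
    scheme, _, encoded = value.partition(" ")
    if scheme.lower() != "negotiate" or not encoded.strip():
        return None
    try:
        return base64.b64decode(encoded.strip(), validate=True)
    except (binascii.Error, ValueError):
        return None


def handle(headers, verify_token):
    """Return (status, response_headers, user).

    verify_token is a caller-supplied callback that maps token bytes to a
    user name, or None if the token is not accepted.
    """
    token = extract_negotiate_token(headers)
    user = verify_token(token) if token is not None else None
    if user is None:
        # Challenge the client to authenticate with SPNEGO.
        return 401, {"WWW-Authenticate": "Negotiate"}, None
    return 200, {}, user


print(handle({}, lambda token: None))
```

A servlet filter plugged into a web UI plays the role of `handle`: it either challenges the browser or passes the authenticated user on to the application.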

Recommendations
• Security configuration is a PITA
  • Do only what you really need
• Enable cluster security (Kerberos) only if untrusted groups of users are sharing the cluster
  • Otherwise use edge-security to keep outsiders out
• Only enable wire encryption if required
• Only enable web interface authentication if required

Security Enablement
• Secure Hadoop enablement order
  1. HDFS RPC (including SNN check-pointing)
  2. JobTracker RPC
  3. TaskTrackers RPC & LinuxTaskController
  4. Hadoop web UI
  5. Configure monitoring to work with security
  6. Other services (HBase, Oozie, Hive Metastore, etc.)
  7. Continue with authorization and network encryption if needed

Administration
• Use an admin/management tool
  • Several inter-related configuration knobs
  • To manage principals/keytabs creation and distribution
  • Automatically configures monitoring for security

Q&A


Big Data Warehousing Meetup: Securing the Hadoop Ecosystem by Cloudera


Editor's Notes

  1. Proxy-user setup: the relying party is configured to recognize super-users who are allowed to impersonate