My name is Sean Kane and I'm a solutions architect at Cloudera.
I'm going to be giving a talk today on RecordService and the surrounding technologies.
The talk will cover some of the motivations for creating RecordService, as well as how it fits into our platform.
We will also dig into some of the technical details of the project and the complementary components of the architecture.
As we start talking about Hadoop or, more specifically, an enterprise data hub (EDH), you’ll notice that many of the benefits of an EDH come with some interesting security side effects.
With an EDH, you can have a single platform for all of your data, but you’re also now combining data and audiences that used to be siloed into separate, secure systems. Hadoop offers a rich, flexible ecosystem of tools and utilities, but you want to be sure that the ecosystem doesn’t come with an equally abundant ecosystem of authentication and access controls. You don’t want to manage a unique set of permissions for every tool, as that becomes unwieldy very quickly.
Hadoop allows you to ingest data of any type very quickly, but this means you don’t always know when sensitive data is coming in or who is accessing it.
Lastly, active archive is a key benefit of an EDH, providing much lower storage costs than legacy systems. But existing systems, while expensive, have a lot of compliance controls built into them, and that raises the question: how do you get those same compliance and privacy controls inside the new environment?
Hadoop provides a lot of flexibility, but it’s important to find a platform that maintains this flexibility while still providing the necessary security controls.
Another key part of security is that there are multiple stakeholders, all concerned about the security of the system and what you can and cannot do.
The Business Manager is interested in using the EDH to run high-value workloads inside the cluster, answer new questions, and gain new insights. They want to put sensitive data into the cluster to reap the benefits for the business. They also want to be able to quickly adopt new innovations within the Hadoop ecosystem and take full advantage of all its capabilities.
The InfoSec Team supports this, but has established internal rules for how new technologies can be adopted, along with existing policies and procedures around how systems are authenticated and how people access sensitive data. While Hadoop may be a great advancement for the business, the InfoSec team will not change their policies just for one new system. Additionally, in some environments, the system and data must meet external compliance requirements such as HIPAA, PCI, etc.
Lastly, for IT/Ops, this isn’t the first system that has needed to be secured; they have already made investments in security tools such as Active Directory, Kerberos, SIEMs, etc. They want to leverage this existing infrastructure as much as possible for any new systems being introduced. They also want a system that can be set up without too much end-user support and that automates the security configuration.
So, not only do we need to address the security concerns introduced with Hadoop and big data, we also need to balance them against the viewpoints of all the stakeholders.
Developer Joke
Story about how I’ve seen InfoSec accept security violations
There are many aspects to security, and it's all too easy for a vendor to claim their platform is "secure" because it covers one or more of these pillars. To achieve comprehensive security, all four pillars must be addressed: Perimeter, Access, Visibility, and Data.
A quick plug for Cloudera Enterprise: we address all of these pillars, and a CDH installation is compliance-ready out of the box to ensure you’re protected.
It offers a comprehensive set of security controls that balance the flexibility of Hadoop against the concerns of stakeholders. We’re proud to be the most secure Hadoop distribution on the market.
And… we were the first and only distribution to achieve PCI compliance.
We comprehensively address all the traditional security concerns around authentication, authorization, audit, and compliance – for a full compliance-ready stack. We will walk through each of these controls and discuss how these security constraints are addressed.
So, the first pillar… perimeter security… addresses the concept of authentication.
I.e., which users and services have access to the cluster itself (such as Impala, Hive, or Spark).
For Business Users, we need to preserve the choice of which Hadoop service they use to get the job done, so they can take full advantage of the Hadoop ecosystem of services.
From an InfoSec perspective, all those services need to conform to a centrally managed set of authentication policies – meaning … one way to authenticate … regardless of what service you’re using.
From the IT/Ops perspective, this isn’t the first time they’ve tackled a problem like this, so it needs to integrate with existing standard systems such as Active Directory (AD) and Kerberos, which is how they’ve solved this for other systems.
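To make this concrete, here is a minimal sketch of what that looks like from a client: the user authenticates once against Kerberos (backed by AD), and the service connection reuses that ticket. This assumes the impyla client library; the host and table names are hypothetical.

```python
# Minimal sketch: authenticate once via Kerberos, then connect to a
# cluster service with that ticket. Host/table names are hypothetical.
#
# First, obtain a ticket from the KDC (backed by Active Directory):
#   $ kinit skane@EXAMPLE.COM

from impala.dbapi import connect

# impyla supports GSSAPI (Kerberos) authentication; no password is
# passed here -- the service validates the Kerberos ticket instead.
conn = connect(
    host="impala-gateway.example.com",  # hypothetical gateway host
    port=21050,                         # default Impala HiveServer2 port
    auth_mechanism="GSSAPI",
    kerberos_service_name="impala",
)

cursor = conn.cursor()
cursor.execute("SELECT count(*) FROM accounts")  # hypothetical table
print(cursor.fetchall())
```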
Access security … once the user has authenticated against services, what data can they access?
Can they query everything in the cluster? Can they insert and transform, or only do reads? Are they limited to a certain set of data? That falls under access controls: defining what users and applications can do with data.
For Access requirements, we want to provide users access to the data they need to do their job. At the very top level, that starts with a job-, function-, or role-based view of access.
InfoSec’s position is that they need a centrally managed way to define access policies; they’re not going to configure access controls path by path.
For IT, again this isn’t a new problem. They’ve solved it before through role-based access controls built on AD. They want to be able to leverage that again.
Sentry is an open source Apache project, and it’s emerging as an open standard for unified authorization. It has a broad set of contributions from Cloudera, Intel, IBM, and Oracle, and it ships in multiple distributions. We want to provide unified authorization not only for Hadoop services… but also for the third-party tools … that users are choosing to access the cluster with.
Sentry allows you to define fine-grained access control policies. In earlier versions it wasn’t quite so pretty, but now there is a GUI that simplifies the creation and management of Sentry policies.
It’s all done via Hue. You can go in and select a table or database, define roles and permissions, and create group associations, all in this GUI interface.
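Under the hood, the same policies can be expressed as SQL GRANT statements issued through Impala or Hive, which is roughly what the Hue GUI manages for you. A minimal sketch, with hypothetical role, group, and table names (connection setup as in the earlier example):

```python
# Sketch: Sentry role-based authorization expressed as SQL statements
# issued over an Impala connection. The role, group, and table names
# are hypothetical.

from impala.dbapi import connect

cursor = connect(host="impala-gateway.example.com", port=21050,
                 auth_mechanism="GSSAPI").cursor()

# A role is the unit that privileges attach to.
cursor.execute("CREATE ROLE analyst")

# Map the role to an existing AD/LDAP group -- this is where the
# existing role-based infrastructure gets reused.
cursor.execute("GRANT ROLE analyst TO GROUP analysts")

# Grant read-only access to a single table. Sentry enforces this for
# Hive and Impala (and, with RecordService, other frameworks too).
cursor.execute("GRANT SELECT ON TABLE sales.accounts TO ROLE analyst")
```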
So, before RecordService and Sentry came on the scene, this was the only way to meet these security concerns. But it worked… heh. You can achieve row- and column-based security by duplicating the ever-living heck out of the data. You take your original master table and split it up into sub-tables, which are then governed by filesystem permissions. If you have a user who can only see US accounts, you create a new table with only those rows…
Splitting up data into individual files for each group that needs access works, but there are serious scalability issues. Imagine if you also needed to split these files again to regulate who gets to see the SSN column: that doubles the number of files again.
What if only some brokers in each group are allowed to see full SSN?
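To make the maintenance burden concrete, here is a sketch of the ETL this approach forces on you; table, column, and group names are hypothetical (connection setup as in the earlier examples). The problems listed below follow directly from it.

```python
# Sketch of the duplicate-and-split workaround: every combination of
# row filter and column mask becomes its own physical table, guarded
# only by filesystem permissions. All names are hypothetical.

groups = ["us", "eu", "apac"]

for g in groups:
    # One copy per broker group (row-level "security")...
    cursor.execute(f"""
        CREATE TABLE accounts_{g} AS
        SELECT * FROM accounts WHERE lower(region) = '{g}'
    """)
    # ...and a second copy per group with the SSN masked (column-level
    # "security"), doubling the file count again.
    cursor.execute(f"""
        CREATE TABLE accounts_{g}_masked AS
        SELECT account_id, broker,
               concat('XXX-XX-', substr(ssn, 8, 4)) AS ssn_last4
        FROM accounts WHERE lower(region) = '{g}'
    """)
```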
Problems with this approach:
- Batch processing only, not near real-time.
- Difficult to maintain: keeping the copies fresh, and splitting again each time a new column is added, means a complex ETL workflow.
- Applications using the data need custom logic to know which file to use, and that logic has to change when a new column is added.
- Extra processing is required.
- Data overlap means more storage.
- Small file sizes hurt performance.
At a high level, RecordService is a highly scalable, distributed data access service that provides unified authorization for Hadoop.
It sits between the compute layer and the storage layer of Hadoop.
It provides a unified data access path, with uniformly applied data access policies, for all compute frameworks.
Good software development policies
Before we get into the details, I'd like to discuss some of the motivations for how we got here.
So, let’s go back to the previous example and look at how it would work with RecordService.
A control on the SSN column limits who can see the full SSN and who can see only the last four digits of the SSN. A control on the Broker column means queries from each broker group only return records from their group.
This is not unlike what’s possible in Oracle, Teradata or other mature traditional data warehouses.
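With Sentry and RecordService, the same effect comes from a single copy of the data plus a view-based policy, enforced for every compute framework. A minimal sketch of the policy side, again with hypothetical names:

```python
# Sketch: one physical table; row filtering and column masking are
# expressed as a view, and access is granted per role. RecordService
# enforces the same policy for MapReduce and Spark, not just SQL
# engines. All names are hypothetical.

cursor.execute("""
    CREATE VIEW accounts_us_brokers AS
    SELECT account_id, broker,
           concat('XXX-XX-', substr(ssn, 8, 4)) AS ssn  -- last 4 only
    FROM accounts
    WHERE region = 'US'  -- row-level control
""")

# US brokers can read the view, but not the underlying base table.
cursor.execute(
    "GRANT SELECT ON TABLE accounts_us_brokers TO ROLE us_broker")
```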
In short, RecordService is a highly scalable, distributed data access service for Hadoop that provides unified authorization while also simplifying the platform.
Before digging into the details of RecordService, let’s take a step back and look at the current state of the Hadoop ecosystem.
What we have seen is more and more components being added to the stack at an accelerating rate.
* RecordService provides a layer of abstraction over storage, so compute frameworks don’t need to care where the data is stored.
* It provides a platform for uniform, fine-grained security across all compute engines.
* It helps simplify Hadoop with a unified data access path (see the sketch below).
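To give a feel for the read path, here is an architecture-level sketch. The planner/worker flow it shows matches the design, but the Python below is a hypothetical, stubbed stand-in; the actual clients are Java-based integrations for MapReduce and Spark.

```python
# Architecture sketch only: the planner authorizes and plans, workers
# execute tasks and stream back records. These classes and their data
# are hypothetical stand-ins.

class PlannerClient:
    """The planner authenticates the caller, checks Sentry policies,
    and splits the request into locality-aware tasks."""
    def plan(self, request):
        # Stub: a real planner returns tasks annotated with the hosts
        # holding the underlying HDFS blocks.
        return [{"task_id": i, "hosts": ["node-a", "node-b"]}
                for i in range(2)]

class WorkerClient:
    """A worker executes one task and streams back records that have
    already been filtered and masked according to policy."""
    def exec_task(self, task):
        # Stub: a real worker streams records read from storage.
        yield {"account_id": task["task_id"], "broker": "jdoe",
               "ssn": "XXX-XX-6789"}

# 1. The compute framework (MapReduce, Spark, ...) asks the planner
#    for tasks, naming a table or view so policies can apply, rather
#    than reading raw HDFS paths directly.
tasks = PlannerClient().plan("SELECT * FROM accounts_us_brokers")

# 2. Tasks are scheduled near the data; user code only ever sees
#    records it is authorized to see.
for task in tasks:
    for record in WorkerClient().exec_task(task):
        print(record)
```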
Mention that views are accessible through MapReduce and Spark as well.