1© Cloudera, Inc. All rights reserved.
RecordService for Unified
Access Control
Michael Lazar, Sr. Systems Engineer
The Benefits of Hadoop...
One place for unlimited data
• All types
• More sources
• Faster, larger ingestion
Unified, multi-framework data access
• More users
• More tools
• Faster changes
…Can Create Information Security Challenges
Business Manager
• Run high-value workloads in cluster
• Quickly adopt new innovations
Information Security
• Follow established policies and procedures
• Maintain compliance
IT/Operations
• Integrate with existing IT investments
• Minimize end-user support
• Automate configuration
Comprehensive, Compliance-Ready Security
Authentication, Authorization, Audit, and Compliance
Perimeter
Guarding access to the cluster itself
Technical Concepts: Authentication, Network isolation
Cloudera Manager
Access
Defining what users and applications can do with data
Technical Concepts: Permissions, Authorization
Apache Sentry & RecordService
Visibility
Reporting on where data came from and how it’s being used
Technical Concepts: Auditing, Lineage
Cloudera Navigator
Data
Protecting data in the cluster from unauthorized visibility
Technical Concepts: Encryption, Tokenization, Data masking
Navigator Encrypt & Key Trustee | Partners
Perimeter Security Requirements
Preserve user choice of the right Hadoop service (e.g. Impala, Spark)
Conform to centrally managed authentication policies
Implement with existing standard systems: Active Directory and Kerberos
Cloudera Manager
Perimeter
Guarding access to the cluster itself
Technical Concepts: Authentication, Network isolation
Access Security Requirements
Provide users access to data needed to do their job
Centrally manage access policies
Leverage a role-based access control model built on AD
Access
Defining what users and applications can do with data
InfoSec Concept: Authorization
Apache Sentry & RecordService
Sentry – The Open Standard
Broad Contributions
• Cloudera
• IBM
• Intel
• Oracle
Multi-Vendor Support
• Cloudera
• IBM
• MapR
• Oracle
Wide Industry Adoption
• Banking
• Healthcare
• Insurance
• Pharma
• Telco
Third-Party Integrations
• Oracle Endeca
• Platfora
Visual Policy Management
Fine-Grained HDFS Access without RecordService
Original file:
Date/time            | Accnt #    | SSN         | Asset | Trade | Country
09:33:11 16-Feb-2015 | 0234837823 | 238-23-9876 | AAPL  | Sell  | US
11:33:01 16-Feb-2015 | 3947848494 | 329-44-9847 | TBT   | Buy   | EU
14:12:34 16-Feb-2015 | 4848367383 | 123-56-2345 | IBM   | Sell  | UK
09:22:03 16-Feb-2015 | 3485739384 | 585-11-2345 | INTC  | Buy   | US
11:55:33 16-Feb-2015 | 3847598390 | 234-11-8765 | F     | Buy   | US
10:22:55 16-Feb-2015 | 8765432176 | 344-22-9876 | UA    | Buy   | UK
13:45:24 16-Feb-2015 | 3456789012 | 412-22-8765 | AMZN  | Sell  | EU
09:03:44 16-Feb-2015 | 4857389329 | 123-44-5678 | TMV   | Buy   | US
15:55:55 16-Feb-2015 | 4756983234 | 234-76-9274 | MA    | Buy   | UK

UK file:
Date/time            | Accnt #    | SSN         | Asset | Trade | Country
14:12:34 16-Feb-2015 | 4848367383 | 123-56-2345 | IBM   | Sell  | UK
10:22:55 16-Feb-2015 | 8765432176 | 344-22-9876 | UA    | Buy   | UK
15:55:55 16-Feb-2015 | 4756983234 | 234-76-9274 | MA    | Buy   | UK

EU file:
Date/time            | Accnt #    | SSN         | Asset | Trade | Country
11:33:01 16-Feb-2015 | 3947848494 | 329-44-9847 | TBT   | Buy   | EU
13:45:24 16-Feb-2015 | 3456789012 | 412-22-8765 | AMZN  | Sell  | EU

US file:
Date/time            | Accnt #    | SSN         | Asset | Trade | Country
09:33:11 16-Feb-2015 | 0234837823 | 238-23-9876 | AAPL  | Sell  | US
09:22:03 16-Feb-2015 | 3485739384 | 585-11-2345 | INTC  | Buy   | US
11:55:33 16-Feb-2015 | 3847598390 | 234-11-8765 | F     | Buy   | US
09:03:44 16-Feb-2015 | 4857389329 | 123-44-5678 | TMV   | Buy   | US

Split the original file
Use HDFS permissions to limit access
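
The workaround on this slide can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the comma-separated record layout and the class and method names (SplitByCountry, split) are hypothetical and do not come from the deck. It groups trade records by their last field, the country, which is the step that would precede writing one file per region and protecting each with HDFS permissions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitByCountry {
    // Groups comma-separated trade lines by their last field (the country),
    // mimicking the "one file per region" workaround. Hypothetical helper.
    public static Map<String, List<String>> split(List<String> lines) {
        Map<String, List<String>> byCountry = new LinkedHashMap<>();
        for (String line : lines) {
            String country = line.substring(line.lastIndexOf(',') + 1).trim();
            byCountry.computeIfAbsent(country, k -> new ArrayList<>()).add(line);
        }
        return byCountry;
    }

    public static void main(String[] args) {
        List<String> trades = Arrays.asList(
            "09:33:11 16-Feb-2015,0234837823,238-23-9876,AAPL,Sell,US",
            "11:33:01 16-Feb-2015,3947848494,329-44-9847,TBT,Buy,EU",
            "14:12:34 16-Feb-2015,4848367383,123-56-2345,IBM,Sell,UK");
        // Each group would become its own file with its own permissions.
        split(trades).forEach((country, rows) ->
            System.out.println(country + " -> " + rows.size() + " row(s)"));
    }
}
```

Every record ends up stored twice (master file plus regional copy), which is the duplication cost the later slides set out to remove.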
RecordService
Unified Access Control Enforcement
• New high-performance security layer that centrally enforces access control policies across Hadoop
• Complements Apache Sentry’s unified policy definition
• Row- and column-based security
• Dynamic data masking
• Apache-licensed open source
• Beta (TODO)
[Diagram: platform stack]
PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce); Stream (Spark); SQL (Impala); Search (Solr); Other (Kite)
UNIFIED SERVICES: Resource Management (YARN); Security (Sentry, RecordService)
STORE: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase); Other (Object Store)
INTEGRATE: Structured (Sqoop); Unstructured (Kafka, Flume)
Fine-Grained HDFS Access Control with RecordService
• Apply controls to the master data file
• Row, column, and sub-column (masking) controls
• Enforce these across all access paths
Master data file:
Date/time            | Accnt #    | SSN         | Asset | Trade | Country
09:33:11 16-Feb-2015 | 0234837823 | 238-23-9876 | AAPL  | Sell  | US
11:33:01 16-Feb-2015 | 3947848494 | 329-44-9847 | TBT   | Buy   | EU
14:12:34 16-Feb-2015 | 4848367383 | 123-56-2345 | IBM   | Sell  | EU
09:22:03 16-Feb-2015 | 3485739384 | 585-11-2345 | INTC  | Buy   | US
11:55:33 16-Feb-2015 | 3847598390 | 234-11-8765 | F     | Buy   | US
10:22:55 16-Feb-2015 | 8765432176 | 344-22-9876 | UA    | Buy   | EU
13:45:24 16-Feb-2015 | 3456789012 | 412-22-8765 | AMZN  | Sell  | EU

Column-Level Controls
Row-Level Controls
What U.S. Brokers See
[Figure: the same table with column-level and row-level controls applied; three SSN values are masked as XXX-XX.]
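
As a rough sketch of what row-level filtering and sub-column masking over a single master dataset could look like, the Java below filters by country and masks the SSN. All names (FineGrainedView, Trade, maskSsn, viewFor) are hypothetical, and the exact mask format (hide the first five digits, keep the last four) is an assumption; the slide only shows "XXX-XX". This is not RecordService code, only an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FineGrainedView {
    // One trade record; field names are illustrative.
    public static final class Trade {
        public final String account, ssn, asset, side, country;
        public Trade(String account, String ssn, String asset, String side, String country) {
            this.account = account;
            this.ssn = ssn;
            this.asset = asset;
            this.side = side;
            this.country = country;
        }
    }

    // Sub-column control (assumed format): hide the first five SSN digits.
    public static String maskSsn(String ssn) {
        return "XXX-XX-" + ssn.substring(ssn.length() - 4);
    }

    // Row-level control (country filter) plus masking over one master list,
    // so no per-region copies of the data are needed.
    public static List<Trade> viewFor(String country, List<Trade> master) {
        List<Trade> out = new ArrayList<>();
        for (Trade t : master) {
            if (t.country.equals(country)) {
                out.add(new Trade(t.account, maskSsn(t.ssn), t.asset, t.side, t.country));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Trade> master = Arrays.asList(
            new Trade("0234837823", "238-23-9876", "AAPL", "Sell", "US"),
            new Trade("3947848494", "329-44-9847", "TBT", "Buy", "EU"));
        for (Trade t : viewFor("US", master)) {
            System.out.println(t.account + " " + t.ssn + " " + t.asset + " " + t.country);
        }
    }
}
```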
RecordService is a distributed, scalable data access service for unified authorization in Hadoop.
RecordService
Motivation
• As the Hadoop ecosystem expands, new components continue to be added
• Speaks to the overall flexibility of Hadoop
• This is good - more functionality, more workloads, more use cases.
• As use cases for Hadoop mature, user requirements and expectations increase:
• Security
• Performance
• Compatibility
• Maintainability
• The flexibility of Hadoop has come at the cost of increased complexity
Introducing RecordService
Example: Security
Challenge: Provide unified fine-grained security across compute frameworks
• Integrating a consistent security layer into every component is not scalable.
• Securing data at the file level precludes fine-grained access control (column/row)
• File ACLs are not enough: a user can view all or nothing.
• Currently, must split files and duplicate data, at a large operational cost.
Solution: Add a level of abstraction: a secure service to access datasets in “record” format
• Can now apply fine-grained constraints on projection of dataset
• Same access control policy can be applied uniformly across compute frameworks; uncoupled from underlying storage layer
Architecture
Architecture
• Runs as a distributed service: Planner Servers & Worker Servers
• Servers are stateless
• Easy HA, fault tolerance
• Planner Servers responsible for request planning
• Retrieve and combine metadata (NN, HMS, Sentry)
• Split generation: creates tasks for workers
• Perform authorization
• Worker Servers read from storage and construct records.
• IO, file parsing, predicate evaluation
• Run as the “source” for a DAG computation
Architecture – Server APIs
• Planner and Worker services expose Thrift APIs:
• PlanRequest(), Exec(), Fetch()
• PlanRequest()
• Accepts SQL to specify the request; supports SELECT and PROJECT
• Access to stored tables and views requires HMS
• Does not run operators with data exchange; “map only”
• Generates a list of tasks, each containing the request and locality information
• Exec()/Fetch()
• Returns records in a canonical, optimized columnar format.
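
The plan-then-fetch flow described above can be pictured with a small in-memory stand-in. This is a sketch under assumptions, not the Thrift API: the class PlanExecFetchSketch, its Task type, and the method signatures are hypothetical, and a task here is just a row range rather than a signed request with locality hints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PlanExecFetchSketch {
    // A task covers a contiguous slice of the input. Real tasks would also
    // carry the request and locality information.
    public static final class Task {
        public final int start, end;
        public Task(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in for PlanRequest(): split rowCount rows into tasks of at most taskSize rows.
    public static List<Task> planRequest(int rowCount, int taskSize) {
        if (taskSize <= 0) {
            throw new IllegalArgumentException("taskSize must be positive");
        }
        List<Task> tasks = new ArrayList<>();
        for (int s = 0; s < rowCount; s += taskSize) {
            tasks.add(new Task(s, Math.min(s + taskSize, rowCount)));
        }
        return tasks;
    }

    // Stand-in for Exec()/Fetch(): a worker returns the records for one task.
    public static List<String> execAndFetch(Task task, List<String> table) {
        return new ArrayList<>(table.subList(task.start, task.end));
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("r0", "r1", "r2", "r3", "r4");
        // "Map only": each task is processed independently, with no data exchange.
        for (Task t : planRequest(table.size(), 2)) {
            System.out.println(execAndFetch(t, table));
        }
    }
}
```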
Architecture – Fault tolerance
• Cluster state persisted in ZK
• Planner/Worker membership, delegation tokens, secret keys
• Servers do not communicate with each other directly => scalability
• Planner services
• Expected to run a few (e.g. 3) for HA
• Fault tolerance handled with clients getting a list of planners and failing over
• Plan requests are short
• Worker services
• Expect to run on each node in the cluster with data
• Fault tolerance handled by framework (e.g. MR) rescheduling task
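
Client-side failover across planners might look like the sketch below. It assumes the client already holds a list of planner addresses; the class PlannerFailover, the method planWithFailover, and the Function used as a stand-in for the RPC are all hypothetical and not part of any RecordService client library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class PlannerFailover {
    // Tries each planner in turn and returns the first successful plan.
    // `call` is a hypothetical stand-in for the plan RPC; it throws when a
    // planner is unreachable.
    public static String planWithFailover(List<String> planners, Function<String, String> call) {
        RuntimeException last = null;
        for (String planner : planners) {
            try {
                return call.apply(planner);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next planner
            }
        }
        throw new IllegalStateException("no planner reachable", last);
    }

    public static void main(String[] args) {
        List<String> planners = Arrays.asList("planner-1", "planner-2", "planner-3");
        String plan = planWithFailover(planners, p -> {
            if (p.equals("planner-1")) {
                throw new RuntimeException(p + " is down");
            }
            return "plan from " + p;
        });
        System.out.println(plan);
    }
}
```

Because plan requests are short and the servers are stateless, simply retrying the whole request against the next planner is a reasonable strategy in this sketch.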
Architecture – Security
• Authentication using Kerberos and delegation tokens
• Planner authorizes request using metadata in Sentry
• Column level ACLs
• Row level ACLs – create a view with a predicate
• Masking – create a view with the masking function in the select list
• Tasks generated by the planner are signed with a shared key
• Worker executes generated tasks
• Does not authorize, relies on signed tasks
• Runs as user with full access to data, does not run user code
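
One common way to sign a task with a shared key is an HMAC, sketched below with the JDK's javax.crypto classes. The deck does not say which signing scheme RecordService uses, so HMAC-SHA256, the class name SignedTask, and the string task encoding are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SignedTask {
    // Planner side: compute an HMAC-SHA256 tag over the task with the shared key.
    public static byte[] sign(byte[] key, String task) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(task.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }

    // Worker side: recompute the tag and compare. The worker does not
    // re-authorize; it only trusts tasks whose signature checks out.
    public static boolean verify(byte[] key, String task, byte[] signature) {
        return MessageDigest.isEqual(sign(key, task), signature);
    }

    public static void main(String[] args) {
        byte[] sharedKey = "demo-shared-key".getBytes(StandardCharsets.UTF_8);
        String task = "SELECT name, balance FROM data_view";
        byte[] sig = sign(sharedKey, task);
        System.out.println("valid task accepted: " + verify(sharedKey, task, sig));
        System.out.println("tampered task accepted: " + verify(sharedKey, "SELECT * FROM data", sig));
    }
}
```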
Architecture – Security example
CREATE VIEW data_view AS
SELECT mask(credit_card_number) AS ccn,
       name, balance, region
FROM data WHERE region = 'Europe'
1. Restrict access to the data set: disable access to the data table and underlying files in HDFS
2. Give access by creating a view, data_view
3. Set column-level permissions on data_view per user if necessary
Write path (ingest) unchanged. Job expected to run as privileged user.
Client APIs – Integration with ecosystem
• Similar APIs designed to integrate with MapReduce and Spark
• Client APIs make things simpler
• Don’t need to interact with HMS
• Don’t need to care about the underlying storage format: worker always returns records in a canonical format.
• Don’t need to care about storage engine details (e.g. S3)
Client Integration APIs
• Drop in replacements for common existing InputFormats
• Can be used with Spark as well
• SparkSQL: integration with the Data Sources API
• Predicate pushdown, projection
• Migration should be easy
MR Example
// Before: read Avro files directly from a path.
// FileInputFormat.setInputPaths(job, new Path(args[0]));
// job.setInputFormatClass(AvroKeyInputFormat.class);
// After: read the same data as a table through RecordService.
RecordServiceConfig.setInputTable(configuration, null, args[0]);
job.setInputFormatClass(
    com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);
Spark Example
// Comment out one or the other
val file = sc.recordServiceTextFile(path)
// val file = sc.textFile(path)
Spark SQL Example
ctx.sql(s"""
|CREATE TEMPORARY TABLE $tbl
|USING com.cloudera.recordservice.spark.DefaultSource
|OPTIONS (
| RecordServiceTable '$db.$tbl',
| RecordServiceTableSize '$size'
|)
""".stripMargin)
Performance
• Shares some core components with Impala
• IO management, optimized C++ code, runtime code generation, uses low-level storage APIs
• Highly efficient implementation of the scan functionality
• Optimized columnar on-wire format
• Inspired by Apache Parquet
• Accelerates performance for many workloads
Terasort
• ~Worst case scenario. Minimal schema: a single STRING column
• Custom RecordServiceTeraInputFormat (similar to TeraInputFormat)
• 78-node cluster (12 cores/24 Hyper-Threaded, 12 disks)
• Ran at 1 billion, 50 billion, and 1 trillion (~100TB) record scales
• See GitHub repo for more details and runnable examples.
TeraChecksum
[Chart: TeraChecksum. Normalized job time, without RecordService vs. with RecordService, at 1B, 50B, and 1T records on MapReduce and on Spark. Data labels, in the order extracted: 1, 0.48, 0.23, 1.03, 0.8, 0.85.]
Spark SQL
• Represents a more typical use case
• Data has a full schema
• TPC-DS
• 500GB scale factor, on Parquet
• Cluster
• 5-node cluster, 12 cores/24 Hyper-Threaded, 126GB memory
Spark SQL
[Chart: TPC-DS, SparkSQL vs. SparkSQL with RecordService; vertical axis from 0 to 350.]
~15% improvement in query times; queries are not scan bound
Spark SQL
[Chart: SparkSQL vs. SparkSQL with RecordService for two queries, 2% Selective Scan and Sum(col); vertical axis from 0 to 35. Data labels, in the order extracted: 29.5, 31, 14, 23.5.]
State of the project
• Available for beta
• Integration with Spark and MR, and Pig/HCatalog.
• Working on complex type support
• More InputFormat support
• We’ll continually refresh beta, in particular client libraries.
• Apache 2.0 Licensed
• Intent to donate to Apache Software Foundation
Conclusion
• RecordService provides a schemed data access service for Hadoop
• Logical data access instead of physical
• Much more powerful abstraction
• Demonstrated security enforcement, improved performance
• Simpler: clients don’t need to worry about low-level details: storage APIs, file formats
• Opens the door for future improvements
Contributing!
• Mailing list: recordservice-user@googlegroups.com
• Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
• Contributions: http://github.com/cloudera/RecordServiceClient/
• Documentation: http://recordservice.io
• Bug Reporting: Open GitHub Issue
• Beta Download: http://www.cloudera.com/downloads/beta/record-service/0-2-0.html
• Quickstart Virtual Machine: http://recordservice.io/vm/
Thank You

%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 

RecordService for Unified Access Control

  • 1. 1© Cloudera, Inc. All rights reserved. RecordService for Unified Access Control Michael Lazar, Sr. Systems Engineer
  • 2. The Benefits of Hadoop... One place for unlimited data • All types • More sources • Faster, larger ingestion Unified, multi-framework data access • More users • More tools • Faster changes
  • 3. …Can Create Information Security Challenges Business Manager • Run high value workloads in cluster • Quickly adopt new innovations Information Security • Follow established policies and procedures • Maintain compliance IT/Operations • Integrate with existing IT investments • Minimize end-user support • Automate configuration
  • 4. Comprehensive, Compliance-Ready Security: Authentication, Authorization, Audit, and Compliance. Perimeter (Cloudera Manager): guarding access to the cluster itself; technical concepts: authentication, network isolation. Access (Apache Sentry & RecordService): defining what users and applications can do with data; technical concepts: permissions, authorization. Visibility (Cloudera Navigator): reporting on where data came from and how it's being used; technical concepts: auditing, lineage. Data (Navigator Encrypt & Key Trustee | Partners): protecting data in the cluster from unauthorized visibility; technical concepts: encryption, tokenization, data masking.
  • 5. Perimeter Security Requirements • Preserve user choice of the right Hadoop service (e.g. Impala, Spark) • Conform to centrally managed authentication policies • Implement with existing standard systems: Active Directory and Kerberos. Perimeter (Cloudera Manager): guarding access to the cluster itself; technical concepts: authentication, network isolation.
  • 6. Access Security Requirements • Provide users access to data needed to do their job • Centrally manage access policies • Leverage a role-based access control model built on AD. Access (Apache Sentry & RecordService): defining what users and applications can do with data; InfoSec concept: authorization.
  • 7. Sentry – The Open Standard Broad Contributions • Cloudera • IBM • Intel • Oracle Multi-Vendor Support • Cloudera • IBM • MapR • Oracle Wide Industry Adoption • Banking • Healthcare • Insurance • Pharma • Telco Third-Party Integrations • Oracle Endeca • Platfora
  • 8. Visual Policy Management
  • 9. Fine-Grained HDFS Access without RecordService: split the original file, then use HDFS permissions to limit access.
      Original file:
        Date/time             Accnt #     SSN          Asset  Trade  Country
        09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
        11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    EU
        14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   UK
        09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
        11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
        10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    UK
        13:45:24 16-Feb-2015  3456789012  412-22-8765  AMZN   Sell   EU
        09:03:44 16-Feb-2015  4857389329  123-44-5678  TMV    Buy    US
        15:55:55 16-Feb-2015  4756983234  234-76-9274  MA     Buy    UK
      Split file with UK rows:
        Date/time             Accnt #     SSN          Asset  Trade  Country
        14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   UK
        10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    UK
        15:55:55 16-Feb-2015  4756983234  234-76-9274  MA     Buy    UK
      Split file with EU rows:
        Date/time             Accnt #     SSN          Asset  Trade  Country
        11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    EU
        13:45:24 16-Feb-2015  3456789012  412-22-8765  AMZN   Sell   EU
      Split file with US rows:
        Date/time             Accnt #     SSN          Asset  Trade  Country
        09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
        09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
        11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
        09:03:44 16-Feb-2015  4857389329  123-44-5678  TMV    Buy    US
  • 10. RecordService: Unified Access Control Enforcement • New high performance security layer that centrally enforces access control policies across Hadoop • Complements Apache Sentry’s unified policy definition • Row- and column-based security • Dynamic data masking • Apache-licensed open source • Beta (TODO) [Diagram: platform stack. Integrate: structured (Sqoop), unstructured (Kafka, Flume). Process, analyze, serve: batch (Spark, Hive, Pig, MapReduce), stream (Spark), SQL (Impala), search (Solr), other (Kite). Unified services: resource management (YARN), security (Sentry, RecordService). Store: filesystem (HDFS), relational (Kudu), NoSQL (HBase), other (object store).]
  • 11. Fine-Grained HDFS Access Control with RecordService • Apply controls to the master data file • Row, column, and sub-column (masking) controls • Enforce these across all access paths
      Master data file (annotated with column-level controls and row-level controls):
        Date/time             Accnt #     SSN          Asset  Trade  Country
        09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
        11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    EU
        14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   EU
        09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
        11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
        10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    EU
        13:45:24 16-Feb-2015  3456789012  412-22-8765  AMZN   Sell   EU
      What U.S. Brokers See (same annotations; the slide overlays "XXX-XX" three times on the SSN column):
        Date/time             Accnt #     SSN          Asset  Trade  Country
        09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
        11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    group2
        14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   group3
        09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
        11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
        10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    group3
        13:45:24 16-Feb-2015  3456789012  412-22-8765  AMZN   Sell   group2
  • 12. RecordService is a distributed, scalable data access service for unified authorization in Hadoop.
  • 13. Motivation • As the Hadoop ecosystem expands, new components continue to be added • Speaks to the overall flexibility of Hadoop • This is good: more functionality, more workloads, more use cases. • As use cases for Hadoop mature, user requirements and expectations increase: • Security • Performance • Compatibility • Maintainability • The flexibility of Hadoop has come at the cost of increased complexity
  • 14. Introducing RecordService
  • 15. Example: Security. Challenge: Provide unified fine-grained security across compute frameworks • Integrating a consistent security layer into every component is not scalable. • Securing data at file level precludes fine-grained access control (column/row) • File ACLs are not enough: a user can view all or nothing. • Currently, must split files and duplicate data, at large operational cost. Solution: Add a level of abstraction, a secure service to access datasets in “record” format • Can now apply fine-grained constraints on projection of dataset • Same access control policy can be applied uniformly across compute frameworks; uncoupled from underlying storage layer
  • 16. Architecture
  • 17. Architecture • Runs as a distributed service: Planner Servers & Worker Servers • Servers are stateless • Easy HA, fault tolerance • Planner Servers are responsible for request planning • Retrieve and combine metadata (NN, HMS, Sentry) • Split generation: creates tasks for workers • Perform authorization • Worker Servers read from storage and construct records. • IO, file parsing, predicate evaluation • Run as the “source” for a DAG computation
  • 18. Architecture – Server APIs • Planner and Worker services expose Thrift APIs: • PlanRequest(), Exec(), Fetch() • PlanRequest() • Accepts SQL to specify the request: supports SELECT and PROJECT • Access to stored tables and views requires HMS • Does not run operators with data exchange; “map only” • Generates a list of tasks which contain the request, each with locality • Exec()/Fetch() • Returns records in a canonical, optimized columnar format.
  • 19. Architecture – Fault tolerance • Cluster state persisted in ZK • Planner/Worker membership, delegation tokens, secret keys • Servers do not communicate with each other directly => scalability • Planner services • Expected to run a few (e.g. 3) for HA • Fault tolerance handled with clients getting a list of planners and failing over • Plan requests are short • Worker services • Expected to run on each node in the cluster with data • Fault tolerance handled by the framework (e.g. MR) rescheduling the task
  • 20. Architecture – Security • Authentication using Kerberos and delegation tokens • Planner authorizes request using metadata in Sentry • Column level ACLs • Row level ACLs – create a view with a predicate • Masking – create a view with the masking function in the select list • Tasks generated by the planner are signed with a shared key • Worker executes generated tasks • Does not authorize, relies on signed tasks • Runs as user with full access to data, does not run user code
  • 21. Architecture – Security example
        CREATE VIEW data_view AS
        SELECT mask(credit_card_number) AS ccn, name, balance, region
        FROM data
        WHERE region = 'Europe'
      1. Restrict access to the data set: disable access to the data table and underlying files in HDFS 2. Give access by creating a view, data_view 3. Set column level permissions on data_view per user if necessary. Write path (ingest) unchanged. Job expected to run as privileged user.
  • 22. Client APIs – Integration with ecosystem • Similar APIs designed to integrate with MapReduce and Spark • Client APIs make things simpler • Don’t need to interact with HMS • Don’t need to care about the underlying storage format: worker always returns records in a canonical format. • Don’t need to care about storage engine details (e.g. S3)
  • 23. Client Integration APIs • Drop-in replacements for common existing InputFormats • Can be used with Spark as well • SparkSQL: integration with the Data Sources API • Predicate pushdown, projection • Migration should be easy
  • 24. MR Example
        // Before: read Avro files directly from an HDFS path.
        // FileInputFormat.setInputPaths(job, new Path(args[0]));
        // job.setInputFormatClass(AvroKeyInputFormat.class);
        // After: read through RecordService; args[0] is a table name.
        RecordServiceConfig.setInputTable(configuration, null, args[0]);
        job.setInputFormatClass(
            com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);
  • 25. Spark Example
        // Comment out one or the other
        val file = sc.recordServiceTextFile(path)
        // val file = sc.textFile(path)
  • 26. Spark SQL Example
        ctx.sql(s"""
          |CREATE TEMPORARY TABLE $tbl
          |USING com.cloudera.recordservice.spark.DefaultSource
          |OPTIONS (
          |  RecordServiceTable '$db.$tbl',
          |  RecordServiceTableSize '$size'
          |)
          """.stripMargin)
  • 27. Performance • Shares some core components with Impala • IO management, optimized C++ code, runtime code generation, uses low level storage APIs • Highly efficient implementation of the scan functionality • Optimized columnar on-wire format • Inspired by Apache Parquet • Accelerates performance for many workloads
  • 28. Terasort • ~Worst case scenario. Minimal schema: a single STRING column • Custom RecordServiceTeraInputFormat (similar to TeraInputFormat) • 78 node cluster (12 cores/24 Hyper-Threaded, 12 disks) • Ran on 1 billion, 50 billion and 1 trillion (~100TB) scales • See GitHub repo for more details and runnable examples.
  • 29. TeraChecksum [Bar chart: normalized job time, without RecordService vs. with RecordService, for 1B, 50B, and 1T rows on MapReduce and on Spark. Data labels in slide order: 1, 0.48, 0.23, 1.03, 0.8, 0.85.]
  • 30. Spark SQL • Represents a more expected use case • Data is fully schemed • TPCDS • 500GB scale factor, on Parquet • Cluster • 5 node cluster, 12 cores/24 Hyper-Threaded, 126GB memory
  • 31. Spark SQL: ~15% improvement in query times; queries are not scan bound. [Bar chart: TPCDS, SparkSQL vs. SparkSQL with RecordService; axis from 0 to 350.]
  • 32. Spark SQL [Bar chart: SparkSQL vs. SparkSQL with RecordService for "2% Selective Scan" and "Sum(col)"; axis from 0 to 35. Data labels in slide order: 29.5, 31, 14, 23.5.]
  • 33. State of the project • Available for beta • Integration with Spark and MR, and Pig/HCatalog. • Working on complex type support • More InputFormat support • We’ll continually refresh beta, in particular client libraries. • Apache 2.0 Licensed • Intent to donate to Apache Software Foundation
  • 34. Conclusion • RecordService provides a schemed data access service for Hadoop • Logical data access instead of physical • Much more powerful abstraction • Demonstrated security enforcement, improved performance • Simpler: clients don’t need to worry about low level details: storage APIs, file formats • Opens the door for future improvements
  • 35. Contributing! • Mailing list: recordservice-user@googlegroups.com • Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta • Contributions: http://github.com/cloudera/RecordServiceClient/ • Documentation: http://recordservice.io • Bug Reporting: Open GitHub Issue • Beta Download: http://www.cloudera.com/downloads/beta/record-service/0-2-0.html • Quickstart Virtual Machine: http://recordservice.io/vm/
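
The signed-task handoff on slide 20 (the planner authorizes and signs, the worker only checks the signature) can be sketched as follows. This is a minimal illustration, not RecordService code: the slides do not name the signature scheme, so HMAC-SHA256 is an assumption, and the class name, method names, and task string are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of planner-side signing and worker-side verification.
final class SignedTaskSketch {
    // Assumption: the slides only say "signed with a shared key".
    private static final String ALGORITHM = "HmacSHA256";

    // Planner side: after authorization succeeds, sign the serialized task.
    static byte[] sign(byte[] sharedKey, String task) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(sharedKey, ALGORITHM));
            return mac.doFinal(task.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC is unavailable", e);
        }
    }

    // Worker side: no policy lookup, only a constant-time signature check.
    static boolean verify(byte[] sharedKey, String task, byte[] signature) {
        return MessageDigest.isEqual(sign(sharedKey, task), signature);
    }

    public static void main(String[] args) {
        byte[] key = "demo-shared-key".getBytes(StandardCharsets.UTF_8);
        String task = "scan data_view, split 0";
        byte[] signature = sign(key, task);

        System.out.println("untampered task accepted: " + verify(key, task, signature));
        // A client that rewrites the task to read the raw table is rejected.
        String tampered = task.replace("data_view", "data");
        System.out.println("tampered task accepted: " + verify(key, tampered, signature));
    }
}
```

Because the worker trusts only the signature, it can stay stateless and never needs to contact Sentry, which is consistent with the slide's point that workers do not authorize.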

Editor's Notes

  1. My name is Sean Kane and I'm a solutions architect at Cloudera. I'm going to be giving a talk today on RecordService and the surrounding technologies. The talk is going to cover some of the motivations for creating RecordService as well as how it fits into our platform. We will also dig into some of the technical details of the project and complementary components of the architecture.
  2. As we start talking about Hadoop or, more specifically, an enterprise data hub, you’ll notice that many of the benefits of an EDH are offset by some interesting security side effects. With an EDH, you can have a single platform for all of your data. But you’re also now combining data and audiences that used to be siloed into separate, secure systems. Hadoop offers a rich, flexible ecosystem of tools and utilities, but you want to be sure that the ecosystem doesn’t come with an equally abundant ecosystem of authentication and access controls. You don’t want to manage a unique permission for every tool to control access and privileges, as that becomes unwieldy very quickly. Hadoop allows you to ingest data of any type, very quickly. But this means you don’t always know when sensitive data is coming in and who is accessing it. Lastly, active archive is a key benefit of an EDH, providing much lower storage costs than legacy systems. But you also realize that existing systems, while extensive, have a lot of compliance controls built into them, which raises the question: how do you get those same compliance and privacy controls inside the new environment? Hadoop provides a lot of flexibility, but it’s important to find a platform that maintains this flexibility while still providing the necessary security controls.
  3. Another key part about security is that there are multiple stakeholders, all concerned about the security of the system and what you can and cannot do. The Business Manager is interested in using the EDH to run high value workloads inside the cluster, answer new questions, and gain new insights. They want to put sensitive data into the cluster to reap the benefits back to the business. They also want to be able to quickly adopt new innovations within the Hadoop ecosystem and take full advantage of all the capabilities. The InfoSec Team supports this but has established internal rules for how new technologies can be adopted, and existing policies and procedures around how systems are authenticated and how people access sensitive data. While Hadoop may be a great advancement for the business, the InfoSec team will not change their policies just for one new system. Additionally, in some environments, the system and data must maintain external compliance to meet HIPAA, PCI, etc. Lastly, for IT/Ops, this isn’t the first system that has needed to be secured, and they have already made investments in security tools such as Active Directory, Kerberos, SIEMs, etc. They want to leverage this existing infrastructure as much as possible for any new systems being introduced. They also want a system that can be set up without too much end-user support and that automates the security configurations. So, not only do we need to balance the security concerns introduced with Hadoop/Big Data, but we also need to balance them against the viewpoints of all the stakeholders. (Developer joke. Story about how I’ve seen InfoSec accept security violations.)
  4. There are many aspects to security, and it's all too easy to claim that a platform is "secure" because it covers one or more of these pillars. To achieve comprehensive security, all four pillars of security must be addressed: Perimeter, Access, Visibility, and Data. A quick plug for Cloudera Enterprise: we address these, and a CDH installation is compliance-ready out of the box to ensure you’re protected. It offers a comprehensive set of security controls that balances the flexibility of Hadoop against the concerns of stakeholders. We’re proud that we’re the most secure Hadoop distribution in the market, and we are the first and only distribution to achieve PCI compliance. We comprehensively address all the traditional security concerns around authentication, authorization, audit, and compliance, for a full compliance-ready stack. We will walk through each of these controls and discuss how these security constraints are addressed.
  5. So, the first pillar: perimeter security. Here we are addressing the concept of authentication, i.e. what services can have access to the cluster itself (such as Impala, Hive, or Spark). For Business Users, we need to preserve the choice of which Hadoop service they use to get the job done, so they can take full advantage of the Hadoop ecosystem of services. From an InfoSec perspective, all those services need to conform to a centrally managed set of authentication policies, meaning one way to authenticate regardless of what service you’re using. From the IT/Ops perspective, this isn’t the first time they’ve tackled a problem like this, so it needs to be implemented with existing standard systems such as Active Directory (AD) and Kerberos, which is how they’ve solved this for other systems.
  6. Access security … once the user has authenticated against services, what is the data they can access? Can they query everything in the cluster? Inserts, transforms, only do reads, limited to certain set of data? That falls under access controls - defining what users and applications can do with data. For Access requirements, we want to provide users access to data needed to do their job. The very top level starts with a job, or function, or role based view of access. InfoSec’s position is they need a centrally managed way to define access policies and they’re not going to go through configuring access controls for each path. For IT, again this isn’t a new problem. They’ve solved it before through role-based access controls built on AD. They want to be able to leverage that again.
  7. Sentry is an open source Apache project and it's emerging as an open standard for unified authorization. It has a broad set of contributions from Cloudera, Intel, IBM, and Oracle. It ships in multiple distributions. We want to provide unified authorization not only for Hadoop services but also for the third-party tools that users are choosing to access the cluster with.
  8. Sentry allows you to define fine-grained access control policies. In the earlier version it wasn’t quite so pretty, but now you have a GUI that simplifies the creation and management of Sentry policies. It’s all done via Hue. You can go in and select a table or database, define roles and permissions, and create group associations, all in this GUI interface.
  9. So, before RecordService and Sentry came on the scene, this was the only way to meet these security concerns. But it worked. You can achieve row- and column-based security by duplicating the ever-living heck out of the data. You can take your original master table and split it up into sub-tables, which are then governed by filesystem permissions. If you have a user who can only see US accounts, you can create a new table with only those rows. Splitting up data into individual files for each group that needs access works, but there are serious scalability issues. Imagine if you also needed to split these files again to regulate who gets to see the SSN column: that doubles the number of files again. What if only some brokers in each group are allowed to see the full SSN? Problems: - Batch processing only, not near real-time. - Difficult to maintain: keeping it fresh enough and splitting again each time a new column is added means a complex ETL workflow. - Applications using the data need to know which file to use, which takes custom logic, and that logic needs to change when a new column is added! - Extra processing required. - Data overlap means more storage. - Small file sizes affect performance.
  10. At a high level, RecordService is a highly scalable distributed data access service that provides unified authorization for Hadoop. It sits between the compute layer and the storage layer of Hadoop. It provides a unified data access path to uniformly apply data access policies for all compute frameworks. (Good software development policies.) Before we go into the details, I'd like to discuss some of the motivations for how we got here.
  11. So, let’s go back to the previous example and look at how it would work with RecordService. A control on the SSN column limits who can see the full SSN and who can see only the last 4 digits of the SSN. A control on the Broker column means queries from each broker group only return records from their group. This is not unlike what’s possible in Oracle, Teradata, or other mature traditional data warehouses.
  12. In this talk we will be introducing RecordService. In short, RecordService is a highly scalable, distributed data access service for Hadoop that provides unified authorization while also simplifying the platform.
  13. Before digging in to the details of RecordService, let’s take a step back and look at the current state of the Hadoop ecosystem. What we have seen is that more components continue to be added to the stack at an accelerated rate.
  14. RS provides a layer of abstraction over storage so compute frameworks don’t need to care where data is stored. Provides a platform for uniform, fine-grained security across all compute engines. Helps to simplify Hadoop: unified data access path. TODO: make the picture more clear
  15. Cap1 -> Use case TODO: better to explain with an example.
  16. TODO: think about making a new image.
  17. TODO: improve this one and the previous one! TODO: talk about the resource management
  18. TODO: have some diagram to illustrate this.
  19. TODO: zk is secure
  20. Mention that views are accessible through MR/Spark as well
  21. TODO: HCatalog?
  22. Args[0] is a table name
  23. 5 node / 126GB memory 24 cores
  24. TODO: update this Nested types
  25. TODO: add website TODO: update download
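
The controls described in note 11 (a row filter per broker group plus an SSN mask that leaves only the last four digits) can be sketched in plain Java as below. This illustrates the effect of such a view, not how RecordService evaluates it: the `Trade` class, `maskSsn`, and `viewFor` are hypothetical names, and the `XXX-XX-` prefix format is an assumption based on the "XXX-XX" overlay on slide 11.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a row-filtered, SSN-masked projection of the trade table.
final class MaskedViewSketch {
    // A trimmed-down row of the example table from the slides.
    static final class Trade {
        final String account;
        final String ssn;
        final String asset;
        final String country;

        Trade(String account, String ssn, String asset, String country) {
            this.account = account;
            this.ssn = ssn;
            this.asset = asset;
            this.country = country;
        }
    }

    // Column-level (sub-column) control: keep only the last four digits.
    static String maskSsn(String ssn) {
        if (ssn.length() < 4) {
            return "XXX-XX-XXXX";
        }
        return "XXX-XX-" + ssn.substring(ssn.length() - 4);
    }

    // Row-level control: return only rows for one country, with the SSN masked.
    static List<Trade> viewFor(String country, List<Trade> master) {
        List<Trade> visible = new ArrayList<>();
        for (Trade t : master) {
            if (t.country.equals(country)) {
                visible.add(new Trade(t.account, maskSsn(t.ssn), t.asset, t.country));
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Trade> master = Arrays.asList(
            new Trade("0234837823", "238-23-9876", "AAPL", "US"),
            new Trade("3947848494", "329-44-9847", "TBT", "EU"),
            new Trade("3485739384", "585-11-2345", "INTC", "US"));
        for (Trade t : viewFor("US", master)) {
            System.out.println(t.account + " " + t.ssn + " " + t.asset + " " + t.country);
        }
    }
}
```

The master list is never copied into per-group files; each caller gets a filtered, masked projection of the same data, which is the contrast with the file-splitting approach in note 9.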