My name is Sean Kane and I'm a solutions architect at Cloudera.
I'm going to be giving a talk today on RecordService and the surrounding technologies.
The talk will cover some of the motivations for creating RecordService, as well as how it fits into our platform.
We will also dig into some of the technical details of the project and the complementary components of the architecture.
As we start talking about Hadoop or, more specifically, an enterprise data hub (EDH), you’ll notice that many of the benefits of an EDH come with some interesting security side effects.
With an EDH, you can have a single platform for all of your data, but you’re also now combining data and audiences that used to be siloed into separate, secure systems. Hadoop offers a rich, flexible ecosystem of tools and utilities, but you want to be sure that the ecosystem doesn’t come with an equally abundant ecosystem of authentication and access controls. You don’t want to manage a unique set of permissions for every tool, as that becomes unwieldy very quickly.
Hadoop allows you to ingest data of any type very quickly, but this means you don’t always know when sensitive data is coming in or who is accessing it.
Lastly, active archive is a key benefit of an EDH, providing much lower storage costs than legacy systems. But existing systems, while expensive, have a lot of compliance controls built into them, and that raises the question: how do you get those same compliance and privacy controls inside the new environment?
Hadoop provides a lot of flexibility, but it’s important to find a platform that maintains this flexibility while still providing the necessary security controls.
Another key part of security is that there are multiple stakeholders, all concerned about the security of the system and what you can and cannot do.
The Business Manager is interested in using the EDH to run high-value workloads inside the cluster, answer new questions, and gain new insights. They want to put sensitive data into the cluster to reap the benefits for the business. They also want to be able to quickly adopt new innovations within the Hadoop ecosystem and take full advantage of all its capabilities.
The InfoSec Team supports this, but has established internal rules for how new technologies can be adopted, along with existing policies and procedures around how systems are authenticated and how people access sensitive data. While Hadoop may be a great advancement for the business, the InfoSec team will not change their policies just for one new system. Additionally, in some environments, the system and data must meet external compliance requirements such as HIPAA, PCI, etc.
Lastly, for IT/Ops, this isn’t the first system that has needed to be secured; they have already made investments in security tools such as Active Directory, Kerberos, SIEMs, etc. They want to leverage this existing infrastructure as much as possible for any new systems being introduced. They also want a system that can be set up without too much end-user support and that automates the security configuration.
So, not only do we need to address the security concerns introduced with Hadoop and big data, we also need to balance them against the viewpoints of all the stakeholders.
Developer Joke
Story about how I’ve seen InfoSec accept security violations
There are many aspects to security, and it's all too easy for a vendor to claim their platform is "secure" because it covers one or more of these pillars. To achieve comprehensive security, all four pillars must be addressed: Perimeter, Access, Visibility, and Data.
A quick plug for Cloudera Enterprise: we address all of these pillars, and a CDH installation is compliance-ready out of the box to ensure you’re protected.
It offers a comprehensive set of security controls that balance the flexibility of Hadoop against the concerns of stakeholders. We’re proud to be the most secure Hadoop distribution on the market.
And… we were the first and only distribution to achieve PCI compliance.
We comprehensively address all the traditional security concerns around authentication, authorization, audit, and compliance – for a full compliance-ready stack. We will walk through each of these controls and discuss how these security constraints are addressed.
So, the first pillar… perimeter security… addresses the concept of authentication.
I.e., which users and services have access to the cluster itself (such as Impala, Hive, or Spark).
For Business Users, we need to preserve the choice of which Hadoop service they use to get the job done, so they can take full advantage of the Hadoop ecosystem of services.
From an InfoSec perspective, all those services need to conform to a centrally managed set of authentication policies – meaning … one way to authenticate … regardless of what service you’re using.
From the IT/Ops perspective, this isn’t the first time they’ve tackled a problem like this, so it needs to integrate with existing standard systems such as Active Directory (AD) and Kerberos, which is how they’ve solved this for other systems.
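To make this concrete, here is a minimal sketch of what that looks like from a client: the user authenticates once against Kerberos (backed by AD), and the service connection reuses that ticket. This assumes the impyla client library; the host and table names are hypothetical.

```python
# Minimal sketch: authenticate once via Kerberos, then connect to a
# cluster service with that ticket. Host/table names are hypothetical.
#
# First, obtain a ticket from the KDC (backed by Active Directory):
#   $ kinit skane@EXAMPLE.COM

from impala.dbapi import connect

# impyla supports GSSAPI (Kerberos) authentication; no password is
# passed here -- the service validates the Kerberos ticket instead.
conn = connect(
    host="impala-gateway.example.com",  # hypothetical gateway host
    port=21050,                         # default Impala HiveServer2 port
    auth_mechanism="GSSAPI",
    kerberos_service_name="impala",
)

cursor = conn.cursor()
cursor.execute("SELECT count(*) FROM accounts")  # hypothetical table
print(cursor.fetchall())
```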
Access security … once the user has authenticated against services, what data can they access?
Can they query everything in the cluster? Can they insert and transform, or only do reads? Are they limited to a certain set of data? That falls under access controls: defining what users and applications can do with data.
For Access requirements, we want to provide users access to the data they need to do their job. At the very top level, that starts with a job-, function-, or role-based view of access.
InfoSec’s position is that they need a centrally managed way to define access policies; they’re not going to configure access controls path by path.
For IT, again this isn’t a new problem. They’ve solved it before through role-based access controls built on AD. They want to be able to leverage that again.
Sentry is an open source Apache project, and it’s emerging as an open standard for unified authorization. It has a broad set of contributions from Cloudera, Intel, IBM, and Oracle, and it ships in multiple distributions. We want to provide unified authorization not only for Hadoop services… but also for the third-party tools … that users are choosing to access the cluster with.
Sentry allows you to define fine-grained access control policies. In earlier versions it wasn’t quite so pretty, but now there is a GUI that simplifies the creation and management of Sentry policies.
It’s all done via Hue. You can go in and select a table or database, define roles and permissions, and create group associations, all in this GUI interface.
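Under the hood, the same policies can be expressed as SQL GRANT statements issued through Impala or Hive, which is roughly what the Hue GUI manages for you. A minimal sketch, with hypothetical role, group, and table names (connection setup as in the earlier example):

```python
# Sketch: Sentry role-based authorization expressed as SQL statements
# issued over an Impala connection. The role, group, and table names
# are hypothetical.

from impala.dbapi import connect

cursor = connect(host="impala-gateway.example.com", port=21050,
                 auth_mechanism="GSSAPI").cursor()

# A role is the unit that privileges attach to.
cursor.execute("CREATE ROLE analyst")

# Map the role to an existing AD/LDAP group -- this is where the
# existing role-based infrastructure gets reused.
cursor.execute("GRANT ROLE analyst TO GROUP analysts")

# Grant read-only access to a single table. Sentry enforces this for
# Hive and Impala (and, with RecordService, other frameworks too).
cursor.execute("GRANT SELECT ON TABLE sales.accounts TO ROLE analyst")
```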
So, before RecordService and Sentry came on the scene, this was the only way to meet these security concerns. But it worked… heh. You can achieve row- and column-based security by duplicating the ever-living heck out of the data. You take your original master table and split it up into sub-tables, which are then governed by filesystem permissions. If you have a user who can only see US accounts, you create a new table with only those rows…
Splitting up data into individual files for each group that needs access works, but there are serious scalability issues. Imagine if you also needed to split these files again to regulate who gets to see the SSN column: that doubles the number of files again.
What if only some brokers in each group are allowed to see full SSN?
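To make the maintenance burden concrete, here is a sketch of the ETL this approach forces on you; table, column, and group names are hypothetical (connection setup as in the earlier examples). The problems listed below follow directly from it.

```python
# Sketch of the duplicate-and-split workaround: every combination of
# row filter and column mask becomes its own physical table, guarded
# only by filesystem permissions. All names are hypothetical.

groups = ["us", "eu", "apac"]

for g in groups:
    # One copy per broker group (row-level "security")...
    cursor.execute(f"""
        CREATE TABLE accounts_{g} AS
        SELECT * FROM accounts WHERE lower(region) = '{g}'
    """)
    # ...and a second copy per group with the SSN masked (column-level
    # "security"), doubling the file count again.
    cursor.execute(f"""
        CREATE TABLE accounts_{g}_masked AS
        SELECT account_id, broker,
               concat('XXX-XX-', substr(ssn, 8, 4)) AS ssn_last4
        FROM accounts WHERE lower(region) = '{g}'
    """)
```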
Problems with this approach:
- Batch processing only, not near real-time.
- Difficult to maintain: keeping the copies fresh, and splitting again each time a new column is added, means a complex ETL workflow.
- Applications using the data need custom logic to know which file to use, and that logic has to change when a new column is added.
- Extra processing is required.
- Data overlap means more storage.
- Small file sizes hurt performance.
At a high level, RecordService is a highly scalable, distributed data access service that provides unified authorization for Hadoop.
It sits between the compute layer and the storage layer of Hadoop.
It provides a unified data access path, with uniformly applied data access policies, for all compute frameworks.
Good software development policies
Before we get into the details, I'd like to discuss some of the motivations for how we got here.
So, let’s go back to the previous example and look at how it would work with RecordService.
A control on the SSN column limits who can see the full SSN and who can see only the last four digits of the SSN. A control on the Broker column means queries from each broker group only return records from their group.
This is not unlike what’s possible in Oracle, Teradata or other mature traditional data warehouses.
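With Sentry and RecordService, the same effect comes from a single copy of the data plus a view-based policy, enforced for every compute framework. A minimal sketch of the policy side, again with hypothetical names:

```python
# Sketch: one physical table; row filtering and column masking are
# expressed as a view, and access is granted per role. RecordService
# enforces the same policy for MapReduce and Spark, not just SQL
# engines. All names are hypothetical.

cursor.execute("""
    CREATE VIEW accounts_us_brokers AS
    SELECT account_id, broker,
           concat('XXX-XX-', substr(ssn, 8, 4)) AS ssn  -- last 4 only
    FROM accounts
    WHERE region = 'US'  -- row-level control
""")

# US brokers can read the view, but not the underlying base table.
cursor.execute(
    "GRANT SELECT ON TABLE accounts_us_brokers TO ROLE us_broker")
```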
In short, RecordService is a highly scalable, distributed data access service for Hadoop that provides unified authorization while also simplifying the platform.
Before digging into the details of RecordService, let’s take a step back and look at the current state of the Hadoop ecosystem.
What we have seen is more and more components being added to the stack at an accelerating rate.
* RecordService provides a layer of abstraction over storage, so compute frameworks don’t need to care where the data is stored.
* It provides a platform for uniform, fine-grained security across all compute engines.
* It helps simplify Hadoop with a unified data access path (see the sketch below).
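To give a feel for the read path, here is an architecture-level sketch. The planner/worker flow it shows matches the design, but the Python below is a hypothetical, stubbed stand-in; the actual clients are Java-based integrations for MapReduce and Spark.

```python
# Architecture sketch only: the planner authorizes and plans, workers
# execute tasks and stream back records. These classes and their data
# are hypothetical stand-ins.

class PlannerClient:
    """The planner authenticates the caller, checks Sentry policies,
    and splits the request into locality-aware tasks."""
    def plan(self, request):
        # Stub: a real planner returns tasks annotated with the hosts
        # holding the underlying HDFS blocks.
        return [{"task_id": i, "hosts": ["node-a", "node-b"]}
                for i in range(2)]

class WorkerClient:
    """A worker executes one task and streams back records that have
    already been filtered and masked according to policy."""
    def exec_task(self, task):
        # Stub: a real worker streams records read from storage.
        yield {"account_id": task["task_id"], "broker": "jdoe",
               "ssn": "XXX-XX-6789"}

# 1. The compute framework (MapReduce, Spark, ...) asks the planner
#    for tasks, naming a table or view so policies can apply, rather
#    than reading raw HDFS paths directly.
tasks = PlannerClient().plan("SELECT * FROM accounts_us_brokers")

# 2. Tasks are scheduled near the data; user code only ever sees
#    records it is authorized to see.
for task in tasks:
    for record in WorkerClient().exec_task(task):
        print(record)
```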
Mention that views are accessible through MapReduce and Spark as well.