3. Hadoop File Permissions
• Added in HADOOP-1298
• Hadoop 0.16
• Early 2008
• Authorization without authentication
• POSIX-like RWX bits
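
To make the RWX model concrete, here is a minimal sketch of a POSIX-style permission check. It is an illustration under assumptions, not HDFS's actual implementation: the function name, the example users, and the example mode are all hypothetical. It picks the owner, group, or other bit triplet and tests the requested access against it.

```python
# Access bits, as in POSIX: read = 4, write = 2, execute = 1.
READ, WRITE, EXECUTE = 4, 2, 1


def is_permitted(mode, owner, group, user, user_groups, wanted):
    """Return True if `user` may perform `wanted` on a file with this mode.

    Exactly one permission class applies: owner if the user is the owner,
    else group if the user belongs to the file's group, else other.
    """
    if user == owner:
        bits = (mode >> 6) & 0o7
    elif group in user_groups:
        bits = (mode >> 3) & 0o7
    else:
        bits = mode & 0o7
    return (bits & wanted) == wanted


# Hypothetical file: rw-r----- owned by alice, group analysts.
MODE = 0o640
print(is_permitted(MODE, "alice", "analysts", "alice", [], READ | WRITE))
print(is_permitted(MODE, "alice", "analysts", "bob", ["analysts"], WRITE))
```

Note that `user` is simply whatever name the caller supplies. That is the sense of "authorization without authentication": the check is only as trustworthy as the claimed identity.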
4. MapReduce ACLs
• Added in HADOOP-3698
• Hadoop 0.19
• Late 2008
• ACLs per job queue
• Set a list of allowed users or groups per operation
  • Job submission
  • Job administration
• No authentication
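
The per-queue, per-operation ACL idea can be sketched as a small lookup structure. This is an assumed model for illustration only: the queue name, operation names, users, and the deny-by-default behavior are hypothetical and are not taken from Hadoop's configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class OperationAcl:
    """Allowed users and groups for one operation on one queue."""
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)

    def allows(self, user, user_groups):
        # Access is granted by a direct user match or by any shared group.
        return user in self.users or bool(self.groups & set(user_groups))


# Hypothetical ACL table: queue -> operation -> ACL.
QUEUE_ACLS = {
    "research": {
        "submit": OperationAcl(users={"alice"}, groups={"analysts"}),
        "administer": OperationAcl(users={"ops"}),
    },
}


def can_perform(queue, operation, user, user_groups=()):
    """Deny by default when the queue or operation has no ACL (an assumption)."""
    acl = QUEUE_ACLS.get(queue, {}).get(operation)
    return acl is not None and acl.allows(user, user_groups)


print(can_perform("research", "submit", "bob", ["analysts"]))
print(can_perform("research", "administer", "alice"))
```

As with file permissions, the user and group names are taken on trust, so the ACL stops accidents rather than a determined attacker.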
5. Securing a Cluster Through a Gateway
• Hadoop cluster runs on a private network
• Gateway server is dual-homed (Hadoop network and public network)
• Users SSH onto the gateway
• Optionally, can create an SSH proxy so jobs can be submitted from the client machine
• Provides a minimum level of protection
7. Prevent Accidental Access
• Don’t let users shoot themselves in the foot
• Main driver for early features
• Not security per se, but a critical first step
• Doesn’t require strong authentication
8. Stop Malicious Users
• Early features were necessary, but not sufficient
• Security has to get real
• Hadoop runs arbitrary code
• Implicit trust doesn’t prevent the insider threat
9. Co-mingle All Your Data
• Often overlooked
• Big data means getting rid of stovepipes
• Scalability and flexibility are only 50% of the problem
• Trust your data in a multi-tenant environment
• Most critical driver
11. Authorization
• Files
• MapReduce/YARN job queues
• Service-level authorization
• Whitelists and blacklists of hosts and users
12. Authentication
• HADOOP-4487
• Hadoop 0.22 and 0.20.205
• Late 2010
• Based on Kerberos and internal delegation tokens
• Provides strong user authentication
• Also used for service-to-service authentication

2.2 High Level Use Cases
1. Applications accessing files on HDFS clusters: Non-MapReduce applications, including hadoop fs, access files stored on one or more HDFS clusters. The application should only be able to access files and services they are authorized to access. See figure 1. Variations:
(a) Access HDFS directly using HDFS protocol.
(b) Access HDFS indirectly through HDFS proxy servers via the HFTP FileSystem or HTTP get.
[Figure 1: HDFS High-level Dataflow. Labels: Application, MapReduce Task, Name Node, Data Node, kerb(joe), kerb(hdfs), delg(joe), block token]
2. Applications accessing third-party (non-Hadoop) services: Non-MapReduce applications and MapReduce tasks accessing files or opera-
13. Encryption
• Over-the-wire encryption for some socket connections
• RPC encryption added soon after Kerberos
• Shuffle encryption (HTTPS) added in Hadoop 2.0.2-alpha, backported to CDH4 MR1
• HDFS block streamer encryption added in Hadoop 2.0.2-alpha
• Volume-level encryption for data at rest
15. Apache Accumulo
• Robust, scalable, high-performance data storage and retrieval system
• Built by NSA, now an Apache project
• Based on Google’s BigTable
• Built on top of HDFS, ZooKeeper and Thrift
• Iterators for server-side extensions
• Cell labels for flexible security models
16. Data Model
• Multi-dimensional, persistent, sorted map
• Key/Value store with a twist
• A single primary key (Row ID)
• Secondary key (Column) internal to a row
  • Family
  • Qualifier
• Per-cell timestamp
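
The "sorted map" idea can be sketched in a few lines. This is a toy in-memory model, not Accumulo's storage engine: the class and method names are hypothetical, and the sort order (row, family, qualifier ascending, newest timestamp first) is stated here as an assumption of the sketch.

```python
import bisect


class SortedCellMap:
    """Toy sorted map keyed by (row, family, qualifier, timestamp)."""

    def __init__(self):
        self._keys = []    # sort keys, kept in sorted order
        self._values = {}  # sort key -> value

    @staticmethod
    def _sort_key(row, family, qualifier, timestamp):
        # Negating the timestamp makes newer versions sort first.
        return (row, family, qualifier, -timestamp)

    def put(self, row, family, qualifier, timestamp, value):
        key = self._sort_key(row, family, qualifier, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_row(self, row):
        """Yield every cell of one row in sorted order."""
        # (row,) sorts before every full key that starts with `row`.
        start = bisect.bisect_left(self._keys, (row,))
        for key in self._keys[start:]:
            if key[0] != row:
                break
            r, family, qualifier, neg_ts = key
            yield (r, family, qualifier, -neg_ts, self._values[key])


cells = SortedCellMap()
cells.put("alice", "contact", "email", 1, "old@example.com")
cells.put("alice", "contact", "email", 2, "new@example.com")
cells.put("bob", "contact", "email", 1, "bob@example.com")
for cell in cells.scan_row("alice"):
    print(cell)
```

Because all cells of a row are adjacent in sort order, reading a row is a single contiguous range scan rather than a set of point lookups.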
17. Cell-Level Security
• Labels stored per cell
• Labels consist of Boolean expressions (AND, OR, nesting)
• Labels associated with each user
• Cell labels checked against user’s labels with a built-in iterator
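
To illustrate how a cell label might be checked against a user's labels, here is a minimal evaluator sketch. It is not Accumulo's iterator code. It assumes `&` for AND, `|` for OR, and parentheses for nesting; treating an empty label as visible to everyone and rejecting unparenthesized mixes of `&` and `|` are choices made for this sketch, and all function names are hypothetical.

```python
import re

_TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def _tokenize(label):
    label = label.strip()
    tokens, pos = [], 0
    while pos < len(label):
        match = _TOKEN.match(label, pos)
        if not match:
            raise ValueError(f"bad character at position {pos} in {label!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse_term(tokens, pos, auths):
    if pos >= len(tokens):
        raise ValueError("unexpected end of label")
    token = tokens[pos]
    if token == "(":
        value, pos = _parse_expr(tokens, pos + 1, auths)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return value, pos + 1
    if token in ("&", "|", ")"):
        raise ValueError(f"unexpected {token!r}")
    # A bare term is satisfied when the user holds that label.
    return token in auths, pos + 1


def _parse_expr(tokens, pos, auths):
    value, pos = _parse_term(tokens, pos, auths)
    operator = None
    while pos < len(tokens) and tokens[pos] in ("&", "|"):
        if operator is None:
            operator = tokens[pos]
        elif tokens[pos] != operator:
            raise ValueError("mixing & and | requires parentheses")
        rhs, pos = _parse_term(tokens, pos + 1, auths)
        value = (value and rhs) if operator == "&" else (value or rhs)
    return value, pos


def is_visible(label, authorizations):
    """Return True if a user holding `authorizations` may see the cell."""
    tokens = _tokenize(label)
    if not tokens:
        return True  # assumption: an unlabeled cell is visible to all
    value, pos = _parse_expr(tokens, 0, set(authorizations))
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return value


print(is_visible("admin|(audit&finance)", {"audit", "finance"}))
print(is_visible("admin|(audit&finance)", {"audit"}))
```

A scan would apply a check like `is_visible` to every cell it reads and silently drop cells that fail, so users never learn that hidden cells exist.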
18. Pluggable Authentication
• Currently supports username/password authentication backed by ZooKeeper
• ACCUMULO-259
• Targeted for Accumulo 1.5.0
• Authentication info replaced with generic tokens
• Supports multiple implementations (e.g. Kerberos)
19. Application Level
• Accumulo often paired with application-level authentication/authorization
• Accumulo users created per application
• Each application granted access level of most permitted user
• Application authenticates users, grabs user authorizations, passes user labels with requests
20. Apache HBase
• Also based on Google’s BigTable
• Started as a Hadoop contrib project
• Supports column-level ACLs
• Kerberos for authentication
• Discussion and early prototypes of cell-level security ongoing
22. Encryption for Data at Rest
• Need multiple levels of granularity
• Encryption keys tied to authorization labels (like Accumulo labels or HBase ACLs)
• APIs for file-level, block-level, or record-level encryption