Big Data Challenges at NASA

Chris A. Mattmann
Senior Computer Scientist, NASA
Adjunct Assistant Professor, USC
Member, Apache Software Foundation

And you are?

•  Senior Computer Scientist at NASA JPL in Pasadena, CA, USA
•  Software Architecture/Engineering Prof at Univ. of Southern California
•  Apache Member involved in
–  OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)

13-Jun-12 HADOOPSUMMIT12

Agenda

•  Big Data Challenges and where we’re headed
•  Example systems at NASA and other agencies
•  Apache OODT: a primer
•  Apache OODT + Hadoop
•  Where we’re headed and wrap-up

Some “Big Data” Grand Challenges I’m interested in

•  How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?
–  Required by the Square Kilometre Array
•  Joe scientist says I’ve got an IDL or Matlab algorithm that I will not change and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products
–  Required by the Western Snow Hydrology project
•  How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, GRIB, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?
–  Required by the 5th IPCC assessment and the Earth System Grid and NASA
•  How do we catalog all of NASA’s current planetary science data?
–  Required by the NASA Planetary Data System
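
To put the first challenge in perspective, the arithmetic below converts the quoted 700 TB/sec into daily and yearly volumes. It is only a back-of-the-envelope sketch: it assumes the rate is sustained around the clock, decimal units (1 EB = 1,000,000 TB), and a 365-day year, none of which the slide states. The function names are illustrative.

```python
# Back-of-the-envelope volume estimate for a sustained data rate.
# Assumptions (not from the talk): the rate is sustained around the clock,
# units are decimal (1 EB = 1,000,000 TB), and a year has 365 days.

SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_EB = 1_000_000


def daily_volume_eb(rate_tb_per_sec: float) -> float:
    """Return the volume accumulated in one day, in exabytes."""
    return rate_tb_per_sec * SECONDS_PER_DAY / TB_PER_EB


def yearly_volume_eb(rate_tb_per_sec: float, days: int = 365) -> float:
    """Return the volume accumulated over `days` days, in exabytes."""
    return daily_volume_eb(rate_tb_per_sec) * days


if __name__ == "__main__":
    rate = 700  # TB/sec, the figure quoted for the Square Kilometre Array
    print(f"{daily_volume_eb(rate):.2f} EB/day")
    print(f"{yearly_volume_eb(rate):.1f} EB/year")
```

Under these assumptions, 700 TB/sec works out to roughly 60 EB per day, which is why "keeping it around" is the hard part of the question.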

Copyright 2012. Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295

The NASA ESDS Context

Where is open source most useful?

Which area should produce open source software?

Lessons from 90’s era missions

•  Increasing data volumes (exponential growth)
•  Increasing complexity of instruments and algorithms
•  Increasing availability of proxy/sim/ancillary data
•  Increasing rate of technology refresh
… all of this while NASA Earth Mission funding was decreasing

A data system framework based on a standard architecture and reusable software components for supporting all future missions.

Where do Big Data technologies fit into this?

U.S. National Climate Assessment (pic credit: Dr. Tom Painter)
SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)

Credit: Cameron Goodale

EVLA demonstration architecture

[Figure: architecture diagram for the EVLA demonstration, labelled day2_TDEM0003_10s_norx and hosted at ska-dc.jpl.nasa.gov. Legible labels: EVLA, WWW, Staging Area, Disk Area, Crawler, Curator, Browser, Monitor, FM, WM, CAS Data Services, PCS Services, Science Data System Operator, evlascube event. Legend: data flow, control flow, Apache OODT.]

Apache OODT
•  Entered incubation at the Apache
Software Foundation in 2010
•  Selected as a top level Apache Software
Foundation project in January 2011
•  Developed by a community of participants
from many companies, universities, and
organizations
•  Used for a diverse set of science data
system activities in planetary science,
earth science, radio astronomy,
biomedicine, astrophysics, and more

http://oodt.apache.org
OODT Development & user community includes:


Apache OODT: OSS “big data” platform originally pioneered at NASA

•  OODT is meant to be a set of tools to help build data systems
–  It’s not meant to be turn key
–  It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
–  Each discipline/project extends
•  Projects that are deploying it operationally at
–  Decadal-survey recommended NASA Earth science missions, NIH, and NCI, CHLA, USC, South African SKA project
•  Why Apache?
–  Less than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
–  Differs from other open source communities; it provides a governance and management structure

Why Apache and OODT?
•  OODT is meant to be a set of tools to
help build data systems
–  It’s not meant to be turn key
–  It attempts to exploit the boundary
between bringing in capability vs.
being overly rigid in science
–  Each discipline/project extends

•  Apache is the elite open source
community for software developers
–  Less than 100 projects have been
promoted to top level (Apache Web
Server, Tomcat, Solr, Hadoop)
–  Differs from other open source
communities; it provides a
governance and management
structure


Governance Model + NASA = ♥

•  NASA and other government agencies have tons of process
–  They like that

OODT Framework and PCS

[Figure: Object Oriented Data Technology Framework diagram. Legible labels: OODT/Science Web Tools, Archive Client, Navigation Service, Archive Service, Profile Service, Product Service, Query Service, Bridge to External Services, Catalog & Archive Service (CAS), Process Control System (PCS), Profile XML Data, Data System 1, Data System 2, Other Service 1, Other Service 2.]

CAS has recently become known as Process Control System when applied to mission work.

Current PCS deployments

Orbiting Carbon Observatory (OCO-2) - spectrometer instrument
NASA ESSP Mission, launch date: TBD 2013
PCS supporting Thermal Vacuum Tests, Ground-based instrument data processing, Space-based instrument data processing and Science Computing Facility
EOM Data Volume: 61-81 TB in 3 yrs; Processing Throughput: 200-300 jobs/day

NPP Sounder PEATE - infrared sounder
Joint NASA/NPOESS mission, launch date: October 2011
PCS supporting Science Computing Facility (PEATE)
EOM Data Volume: 600 TB in 5 yrs; Processing Throughput: 600 jobs/day

QuikSCAT - scatterometer
NASA Quick-Recovery Mission, launch date: June 1999
PCS supporting instrument data processing and science analyst sandbox
Originally planned as a 2-year mission

SMAP - high-res radar and radiometer
NASA decadal study mission, launch date: 2014
PCS supporting radar instrument and science algorithm development testbed

Other PCS applications

Astronomy and Radio
Prototype work on MeerKAT with South Africans and KAT-7 telescope
Discussions ongoing with NRAO Socorro (EVLA and ALMA)

Bioinformatics
National Institutes of Health (NIH) National Cancer Institute’s (NCI) Early Detection Research Network (EDRN)
Children’s Hospital LA Virtual Pediatric Intensive Care Unit (VPICU)

Earth Science
National Climate Assessment – Snow Hydrology in the Western US and Alaska
National Climate Assessment – Regional Climate Modeling and Evaluation

Technology Demonstration
JPL’s Active Mirror Telescope (AMT)
White Sands Missile Range

PCS Core Components

•  All Core components implemented as web services
–  XML-RPC used to communicate between components
–  Servers implemented in Java
–  Clients implemented in Java, scripts, Python, PHP and web-apps
–  Service configuration implemented in ASCII and XML files
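
As a rough illustration of components talking over XML-RPC, the sketch below does one client/server round trip using only the Python standard library. Everything service-specific is an assumption made for the example: the stand-in Python server (the slide says the real servers are Java), the method name `count_products`, and the toy in-memory catalog are hypothetical and are not the OODT File Manager API.

```python
# Minimal XML-RPC round trip using only the Python standard library.
# Everything service-specific here is hypothetical: the stand-in server, the
# method name `count_products`, and the in-memory catalog are illustrations,
# not the real OODT File Manager API.
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Toy catalog standing in for a component's state.
CATALOG = {"GenericFile": ["a.h5", "b.h5"], "Level1Product": ["c.nc"]}


def count_products(product_type: str) -> int:
    """Return how many products of the given type the toy catalog holds."""
    return len(CATALOG.get(product_type, []))


def start_server() -> SimpleXMLRPCServer:
    """Start a stand-in XML-RPC server on a free localhost port."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(count_products)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_count(url: str, product_type: str) -> int:
    """Client side: call the remote method over XML-RPC."""
    with xmlrpc.client.ServerProxy(url) as proxy:
        return proxy.count_products(product_type)


if __name__ == "__main__":
    server = start_server()
    host, port = server.server_address
    print(query_count(f"http://{host}:{port}", "GenericFile"))
    server.shutdown()
    server.server_close()
```

Because XML-RPC is plain XML over HTTP, the client needs nothing beyond a URL and a method name, which is one plausible reason clients can be written in Java, Python, PHP, or shell scripts alike.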

Core Capabilities

•  File Manager does Data Management
–  Tracks all of the stored data, files & metadata
–  Moves data to appropriate locations before and after initiating PGE runs and from staging area to controlled access storage
•  Workflow Manager does Pipeline Processing
–  Automates processing when all run conditions are ready
–  Monitors and logs processing status
•  Resource Manager does Resource Management
–  Allocates processing jobs to computing resources
–  Monitors and logs job & resource status
–  Copies output data to storage locations where space is available
–  Provides the means to monitor resource usage
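
The division of labour between the Workflow Manager and the Resource Manager can be sketched in a few lines. This is a toy model under stated assumptions, not OODT code: the names `Task`, `Node`, `pick_node`, and `run_ready_tasks` are hypothetical, conditions are plain Python callables, and placement simply picks the node with the most free slots.

```python
# Sketch of "run when all conditions are ready" plus naive job placement.
# All names (Task, Node, run_ready_tasks, pick_node) are hypothetical and do
# not mirror OODT's Workflow Manager or Resource Manager classes.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    name: str
    # Each condition is a zero-argument callable returning True when satisfied.
    conditions: List[Callable[[], bool]] = field(default_factory=list)

    def ready(self) -> bool:
        return all(check() for check in self.conditions)


@dataclass
class Node:
    name: str
    capacity: int
    load: int = 0

    def free(self) -> int:
        return self.capacity - self.load


def pick_node(nodes: List[Node]) -> Optional[Node]:
    """Choose the node with the most free slots, or None if all are full."""
    candidates = [n for n in nodes if n.free() > 0]
    return max(candidates, key=Node.free) if candidates else None


def run_ready_tasks(tasks: List[Task], nodes: List[Node]) -> Dict[str, str]:
    """Assign every ready task to a node; return {task name: node name}."""
    placement: Dict[str, str] = {}
    for task in tasks:
        if not task.ready():
            continue  # preconditions not met yet; leave it for a later pass
        node = pick_node(nodes)
        if node is None:
            break  # no capacity left on any node
        node.load += 1
        placement[task.name] = node.name
        print(f"status: {task.name} -> {node.name}")  # stand-in for logging
    return placement


if __name__ == "__main__":
    staged = {"granule.h5"}  # files currently present in a staging area
    tasks = [
        Task("calibrate", [lambda: "granule.h5" in staged]),
        Task("retrieve", [lambda: "ancillary.nc" in staged]),
    ]
    nodes = [Node("node-a", capacity=1), Node("node-b", capacity=2)]
    run_ready_tasks(tasks, nodes)
```

In this run only `calibrate` is placed, because the input that `retrieve` waits on has not been staged yet; a later pass would pick it up once the condition holds.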

File/Metadata Capabilities


Advanced Workflow Monitoring


Resource Monitoring


How do we deploy PCS for a mission?
•  We implement the following mission-specific customizations
–  Server Configuration
•  Implemented in ASCII properties files

–  Product metadata specification
•  Implemented in XML policy files

–  Processing Rules
•  Implemented as Java classes and/or XML policy files

–  PGE Configuration

–  Compute Node Usage Policies

•  Here’s what we don’t change
–  All PCS Servers (e.g. File Manager, Workflow Manager, Resource Manager)
•  Core data management, pipeline process management and job scheduling/submission
capabilities
–  File Catalog schema
–  Workflow Model Repository Schema
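
The split between ASCII properties files and XML policy files might look roughly like the sketch below, which loads one of each. The property keys, XML element names, and values are invented for illustration and are not the actual OODT property names or policy schema; the parsing helpers are likewise hypothetical.

```python
# Sketch of loading mission-specific configuration: an ASCII properties file
# for server settings and an XML policy file for product metadata.
# The keys, element names, and values below are invented for illustration;
# they are not the actual OODT property names or policy schema.
import xml.etree.ElementTree as ET
from typing import Dict, List

SERVER_PROPERTIES = """\
# hypothetical server settings
server.port = 9000
catalog.type = lucene
"""

METADATA_POLICY = """\
<policy mission="example">
  <productType name="Level1Product">
    <element name="StartDateTime" required="true"/>
    <element name="OrbitNumber" required="false"/>
  </productType>
</policy>
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, skipping blanks and `#` comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def required_elements(policy_xml: str, product_type: str) -> List[str]:
    """List the metadata elements a product type marks as required."""
    root = ET.fromstring(policy_xml)
    names: List[str] = []
    for ptype in root.findall("productType"):
        if ptype.get("name") != product_type:
            continue
        for elem in ptype.findall("element"):
            if elem.get("required") == "true":
                names.append(elem.get("name", ""))
    return names


if __name__ == "__main__":
    print(parse_properties(SERVER_PROPERTIES))
    print(required_elements(METADATA_POLICY, "Level1Product"))
```

The point of the pattern is that a new mission edits files like these while the server code and catalog schema stay untouched.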


Server and PGE Configuration


Latest Apache OODT release: 0.3

•  First appearance of PCS
–  Core, Services (JAX-RS)
•  Web Applications
–  Balance (PHP), and Wicket (Java)-based apps for file management and workflow monitoring
•  First release deployed to Maven Central
–  We did backport 0.2 there after this
–  Over 60 issues fixed in JIRA
•  June 2011: recommended stable release

Working on: 0.4

•  Operator Interface (OODT-157)
•  Workflow2 integration (OODT-215) and all of its sub-issues
–  Global workflow conditions, dynamic workflows, parallel/sequential model, new workflow engine, etc.
•  OODT RADIX for super easy deployment (OODT-120)
•  Solr sync with File Manager (OODT-326)
•  Improvements to XMLPS (OODT-333) and new crawler actions (OODT-33, OODT-34, OODT-35, OODT-36, OODT-37)
•  CLI rewrite and refactor
•  Over 130 issues currently resolved
•  Likely to come before end of Q2 2012

How do these fit together?

•  Hadoop HDFS
–  OODT file manager leveraging HDFS for virtual disk path, replication, archiving, scalability
•  Hadoop M/R
–  Work done in OODT branch to connect OODT Workflow + Resource Mgmt to Hadoop (pre YARN)
•  Hadoop HIVE used in Regional Climate Modeling DB

Where are we headed with OODT + Hadoop?

•  Investigate and integrate YARN
–  Workflow and Resource Mgmt
•  Plug in HBase as File Manager Catalog
–  Already plugged in HIVE
–  Potentially leverage Gora?
•  OODT + Hadoop Virtual Machines and RPMs
–  Easy Installation leveraging OODT RADIX
•  Remote file acquisition (Push/Pull) as Hadoop M/R

Key Takeaway

Apache OODT, Apache Hadoop, and other big data technologies are preparing the world to handle all of these diverse use cases!

Constantly evolving and improving frameworks – join up and help.

Free and open source from Apache, and helping government demonstrate the public good

Apache OODT Project Contact Info
•  Learn more and track our progress at:
–  http://oodt.apache.org
–  WIKI: https://cwiki.apache.org/OODT/
–  JIRA: https://issues.apache.org/jira/browse/OODT
•  Join the mailing list:
–  dev@oodt.apache.org
•  Chat on IRC:
–  #oodt on irc.freenode.net
•  Acknowledgements
–  Key Members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew
Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn,
Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid
–  Projects, Sponsors, Collaborators: Planetary Data System, Early Detection Research Network,
Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA
OCO-2 Mission, NASA NPP Sounder Peate, NASA ACOS Mission, Earth System Grid
Federation


Alright, I’ll shut up now

•  Any questions?
•  THANK YOU!
–  chris.a.mattmann@nasa.gov
–  @chrismattmann on Twitter
