Big Data Challenges at NASA
1. Big Data Challenges at NASA
Chris A. Mattmann
Senior Computer Scientist, NASA
Adjunct Assistant Professor, USC
Member, Apache Software Foundation
2. And you are?
• Senior Computer Scientist at NASA JPL in Pasadena, CA, USA
• Software Architecture/Engineering Prof at Univ. of Southern California
• Apache Member involved in
– OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)
13-Jun-12 HADOOPSUMMIT12
3. Agenda
• Big Data Challenges and where we’re headed
• Example systems at NASA and other agencies
• Apache OODT: a primer
• Apache OODT + Hadoop
• Where we’re headed and wrapup
4. Some “Big Data” Grand Challenges I’m interested in
• How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?
– Required by the Square Kilometre Array
• Joe scientist says I’ve got an IDL or Matlab algorithm that I will not change and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products
– Required by the Western Snow Hydrology project
• How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, GRIB, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?
– Required by the 5th IPCC assessment and the Earth System Grid and NASA
• How do we catalog all of NASA’s current planetary science data?
– Required by the NASA Planetary Data System
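To put the first challenge in perspective, the quoted 700 TB/sec can be converted into a daily volume. The sketch below is only a back-of-the-envelope calculation; it assumes decimal units (1 EB = 1,000,000 TB) and a sustained, constant rate, neither of which the slide states.

```python
# Back-of-the-envelope: what does a sustained 700 TB/sec add up to per day?
# Assumptions (not from the slide): decimal units and a constant rate.
RATE_TB_PER_SEC = 700
SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_EB = 1_000_000


def daily_volume_eb(rate_tb_per_sec: float) -> float:
    """Return the data volume per day, in exabytes, for a sustained rate."""
    return rate_tb_per_sec * SECONDS_PER_DAY / TB_PER_EB


if __name__ == "__main__":
    print(f"{daily_volume_eb(RATE_TB_PER_SEC):.2f} EB/day")
```

Under those assumptions the rate works out to roughly 60 exabytes per day, which is why "keeping it around" is the hard part.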
Copyright 2012. Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
5. The NASA ESDS Context
Where is open source most useful?
Which area should produce open source software?
6. Lessons from 90’s era missions
• Increasing data volumes (exponential growth)
• Increasing complexity of instruments and algorithms
• Increasing availability of proxy/sim/ancillary data
• Increasing rate of technology refresh
… all of this while NASA Earth Mission funding was decreasing
A data system framework based on a standard architecture and reusable software components for supporting all future missions.
7. Where do Big Data technologies fit into this?
U.S. National Climate Assessment (pic credit: Dr. Tom Painter)
SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)
9. EVLA demonstration architecture (day2_TDEM0003_10s_norx)
[Figure: architecture diagram of the EVLA demonstration on ska-dc.jpl.nasa.gov, built with Apache OODT. Labels recoverable from the diagram: EVLA, day2_TDEM0003_10s_norx, WWW, Staging Area, Disk Area, CAS Crawler, Curator, Browser, FM, WM, PCS, Monitor, Operator, Data Services, Science Data System, products, metadata, system status, evlascube event. Legend: data flow, control flow.]
10. Apache OODT
• Entered incubation at the Apache Software Foundation in 2010
• Selected as a top level Apache Software Foundation project in January 2011
• Developed by a community of participants from many companies, universities, and organizations
• Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more
http://oodt.apache.org
OODT Development & user community includes:
11. Apache OODT: OSS “big data” platform originally pioneered at NASA
• OODT is meant to be a set of tools to help build data systems
– It’s not meant to be turn key
– It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
– Each discipline/project extends
• Projects that are deploying it operationally at
– Decadal-survey recommended NASA Earth science missions, NIH, and NCI, CHLA, USC, South African SKA project
• Why Apache?
– Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
– Differs from other open source communities; it provides a governance and management structure
12. Why Apache and OODT?
• OODT is meant to be a set of tools to help build data systems
– It’s not meant to be turn key
– It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
– Each discipline/project extends
• Apache is the elite open source community for software developers
– Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
– Differs from other open source communities; it provides a governance and management structure
13. Governance Model + NASA = ♥
• NASA and other government agencies have tons of process
– They like that
14. OODT Framework and PCS
[Figure: Object Oriented Data Technology (OODT) framework diagram. Labels recoverable from the diagram: OODT/Science Web Tools, Archive Client, Navigation Service, Profile Service, Product Service, Query Service, Archive Service, Bridge to External Services, Catalog & Archive Service (CAS), Process Control System (PCS), Other Service 1, Other Service 2, Profile XML Data, Data System 1, Data System 2.]
CAS has recently become known as Process Control System when applied to mission work.
15. Current PCS deployments
Orbiting Carbon Observatory (OCO-2) - spectrometer instrument
NASA ESSP Mission, launch date: TBD 2013
PCS supporting Thermal Vacuum Tests, Ground-based instrument data processing, Space-based instrument data processing and Science Computing Facility
EOM Data Volume: 61-81 TB in 3 yrs; Processing Throughput: 200-300 jobs/day
NPP Sounder PEATE - infrared sounder
Joint NASA/NPOESS mission, launch date: October 2011
PCS supporting Science Computing Facility (PEATE)
EOM Data Volume: 600 TB in 5 yrs; Processing Throughput: 600 jobs/day
QuikSCAT - scatterometer
NASA Quick-Recovery Mission, launch date: June 1999
PCS supporting instrument data processing and science analyst sandbox
Originally planned as a 2-year mission
SMAP - high-res radar and radiometer
NASA decadal study mission, launch date: 2014
PCS supporting radar instrument and science algorithm development testbed
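The end-of-mission (EOM) volumes quoted above can be turned into rough average daily rates. This is a sketch only: it assumes 365-day years and uniform accumulation over the mission, which real missions will not follow exactly.

```python
# Average daily data volume implied by the quoted end-of-mission totals.
# Assumptions (not from the slide): 365-day years, uniform accumulation.
def avg_tb_per_day(total_tb: float, years: float) -> float:
    """Return the mean data volume per day, in TB."""
    return total_tb / (years * 365)


# (total TB, mission years) as quoted on the slide; OCO-2 uses the upper bound.
deployments = {
    "OCO-2 (upper bound)": (81, 3),
    "NPP Sounder PEATE": (600, 5),
}

for name, (total_tb, years) in deployments.items():
    print(f"{name}: {avg_tb_per_day(total_tb, years):.3f} TB/day")
```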
16. Other PCS applications
Astronomy and Radio
Prototype work on MeerKAT with South Africans and KAT-7 telescope
Discussions ongoing with NRAO Socorro (EVLA and ALMA)
Bioinformatics
National Institutes of Health (NIH) National Cancer Institute’s (NCI) Early Detection Research Network (EDRN)
Children’s Hospital LA Virtual Pediatric Intensive Care Unit (VPICU)
Earth Science
National Climate Assessment – Snow Hydrology in the Western US and Alaska
National Climate Assessment – Regional Climate Modeling and Evaluation
Technology Demonstration
JPL’s Active Mirror Telescope (AMT)
White Sands Missile Range
17. PCS Core Components
• All Core components implemented as web services
– XML-RPC used to communicate between components
– Servers implemented in Java
– Clients implemented in Java, scripts, Python, PHP and web-apps
– Service configuration implemented in ASCII and XML files
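As an illustration of the XML-RPC pattern described above, here is a minimal, self-contained sketch using only Python's standard library. It does not talk to a real PCS server: the toy service, the method name `demo.isAlive`, and the loopback address are all assumptions made for the example.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def is_alive() -> bool:
    """Hypothetical health-check method exposed by the toy service."""
    return True


def start_server() -> SimpleXMLRPCServer:
    # Port 0 lets the OS pick a free port; logRequests=False keeps output quiet.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    # "demo.isAlive" is an illustrative name, not a documented PCS method.
    server.register_function(is_alive, "demo.isAlive")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_is_alive() -> bool:
    """Start the toy server, call it over XML-RPC, then shut it down."""
    server = start_server()
    try:
        host, port = server.server_address
        with ServerProxy(f"http://{host}:{port}/") as client:
            return client.demo.isAlive()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("service alive:", call_is_alive())
```

Because XML-RPC is plain XML over HTTP, a client in any of the languages listed above could make the same call against a Java server.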
18. Core Capabilities
• File Manager does Data Management
– Tracks all of the stored data, files & metadata
– Moves data to appropriate locations before and after initiating PGE runs and from staging area to controlled access storage
• Workflow Manager does Pipeline Processing
– Automates processing when all run conditions are ready
– Monitors and logs processing status
• Resource Manager does Resource Management
– Allocates processing jobs to computing resources
– Monitors and logs job & resource status
– Copies output data to storage locations where space is available
– Provides the means to monitor resource usage
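The job-allocation responsibility listed for the Resource Manager can be pictured with a toy scheduler. The sketch below is not OODT code: the `Node` class, the capacity limit, and the least-loaded policy are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical compute node with a fixed number of job slots."""
    name: str
    capacity: int
    running: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.running) < self.capacity


def allocate(jobs, nodes):
    """Assign each job to the least-loaded node with room; queue the rest."""
    queued = []
    for job in jobs:
        candidates = [n for n in nodes if n.has_room()]
        if not candidates:
            queued.append(job)  # no free slot anywhere: job waits
            continue
        target = min(candidates, key=lambda n: len(n.running))
        target.running.append(job)
    return queued


if __name__ == "__main__":
    nodes = [Node("node-a", capacity=2), Node("node-b", capacity=1)]
    waiting = allocate(["job1", "job2", "job3", "job4"], nodes)
    for n in nodes:
        print(n.name, n.running)
    print("queued:", waiting)
```

A real deployment would also track job status and storage space, as the bullets above note; those parts are left out here.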
22. How do we deploy PCS for a mission?
• We implement the following mission-specific customizations
– Server Configuration
• Implemented in ASCII properties files
– Product metadata specification
• Implemented in XML policy files
– Processing Rules
• Implemented as Java classes and/or XML policy files
– PGE Configuration
• Implemented in XML policy files
– Compute Node Usage Policies
• Implemented in XML policy files
• Here’s what we don’t change
– All PCS Servers (e.g. File Manager, Workflow Manager, Resource Manager)
• Core data management, pipeline process management and job scheduling/submission
capabilities
– File Catalog schema
– Workflow Model Repository Schema
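To make the idea of mission-specific XML policy files concrete, the sketch below parses a made-up product metadata policy and lists the required keys per product type. The element names, attribute names, and the `L1BRadiance` product type are invented for illustration and are not the actual OODT policy schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy document; element and attribute names are illustrative,
# not the actual OODT policy schema.
POLICY_XML = """
<productTypes>
  <type name="L1BRadiance">
    <metadata key="Instrument" required="true"/>
    <metadata key="OrbitNumber" required="true"/>
    <metadata key="Comment" required="false"/>
  </type>
</productTypes>
"""


def required_keys(policy_xml: str) -> dict:
    """Map each product type name to its list of required metadata keys."""
    root = ET.fromstring(policy_xml)
    return {
        t.get("name"): [
            m.get("key")
            for m in t.findall("metadata")
            if m.get("required") == "true"
        ]
        for t in root.findall("type")
    }


if __name__ == "__main__":
    print(required_keys(POLICY_XML))
```

The point of the pattern is that a new mission changes files like this one, while the servers that read them stay the same.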
23. Server and PGE Configuration
24. Latest Apache OODT release: 0.3
• First appearance of PCS
– Core, Services (JAX-RS)
• Web Applications
– Balance (PHP), and Wicket (Java)-based apps for file management and workflow monitoring
• First release deployed to Maven Central
– We did backport 0.2 there after this
– Over 60 issues fixed in JIRA
• June 2011: recommended stable release
25. Working on: 0.4
• Operator Interface (OODT-157)
• Workflow2 integration (OODT-215) and all of its sub-issues
– Global workflow conditions, dynamic workflows, parallel/sequential model, new workflow engine, etc.
• OODT RADIX for super easy deployment (OODT-120)
• Solr sync with File Manager (OODT-326)
• Improvements to XMLPS (OODT-333) and new crawler actions (OODT-33, OODT-34, OODT-35, OODT-36, OODT-37)
• CLI rewrite and refactor
• Over 130 issues currently resolved
• Likely to come before end of Q2 2012
26. How do these fit together?
• Hadoop HDFS
– OODT file manager leveraging HDFS for virtual disk path, replication, archiving, scalability
• Hadoop M/R
– Work done in OODT branch to connect OODT Workflow + Resource Mgmt to Hadoop (pre YARN)
• Hadoop HIVE used in Regional Climate Modeling DB
27. Where are we headed with OODT + Hadoop?
• Investigate and integrate YARN
– Workflow and Resource Mgmt
• Plug in HBase as File Manager Catalog
– Already plugged in HIVE
– Potentially leverage Gora?
• OODT + Hadoop Virtual Machines and RPMs
– Easy Installation leveraging OODT RADIX
• Remote file acquisition (Push/Pull) as Hadoop M/R
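The last bullet, remote file acquisition expressed as map/reduce, can be pictured with a pure-Python stand-in. Nothing here uses Hadoop or OODT APIs: the three phases are simulated in-process, the URLs are placeholders, and the "pull" is reduced to counting files per host.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_phase(urls):
    """Emit (host, url) pairs so files from one host land on one reducer."""
    for url in urls:
        yield urlsplit(url).netloc, url


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Stand-in for the actual pull: report how many files each host serves."""
    return {host: len(files) for host, files in groups.items()}


if __name__ == "__main__":
    # Placeholder URLs; a real run would list files at a remote data provider.
    remote_files = [
        "ftp://archive-a.example.org/data/granule1.h5",
        "ftp://archive-a.example.org/data/granule2.h5",
        "http://archive-b.example.org/data/granule3.nc",
    ]
    print(reduce_phase(shuffle(map_phase(remote_files))))
```

Grouping by host is one possible design choice; it would let each reducer hold a single connection to its remote site.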
28. Key Takeaway
Apache OODT, Apache Hadoop, other big data technologies preparing the world to handle all of these diverse use cases!
Constantly evolving and improving frameworks – join up and help.
Free and open source from Apache and helping government demonstrate the public good
29. Apache OODT Project Contact Info
• Learn more and track our progress at:
– http://oodt.apache.org
– WIKI: https://cwiki.apache.org/OODT/
– JIRA: https://issues.apache.org/jira/browse/OODT
• Join the mailing list:
– dev@oodt.apache.org
• Chat on IRC:
– #oodt on irc.freenode.net
• Acknowledgements
– Key Members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew
Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn,
Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid
– Projects, Sponsors, Collaborators: Planetary Data System, Early Detection Research Network,
Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA
OCO-2 Mission, NASA NPP Sounder Peate, NASA ACOS Mission, Earth System Grid
Federation
30. Alright, I’ll shut up now
• Any questions?
• THANK YOU!
– chris.a.mattmann@nasa.gov
– @chrismattmann on Twitter