Big Data Challenges at NASA
1. Big Data Challenges at NASA
Chris A. Mattmann
Senior Computer Scientist, NASA
Adjunct Assistant Professor, USC
Member, Apache Software Foundation
2. And you are?
• Senior Computer Scientist at NASA JPL in Pasadena, CA USA
• Software Architecture/Engineering Prof at Univ. of Southern California
• Apache Member involved in
– OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)
13-Jun-12 HADOOPSUMMIT12 2
3. Agenda
• Big Data Challenges and where we're headed
• Example systems at NASA and other agencies
• Apache OODT: a primer
• Apache OODT + Hadoop
• Where we're headed and wrapup
4. Some "Big Data" Grand Challenges I'm interested in
• How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?
– Required by the Square Kilometre Array
• Joe Scientist says "I've got an IDL or Matlab algorithm that I will not change, and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products"
– Required by the Western Snow Hydrology project
• How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, Grib, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?
– Required by the 5th IPCC assessment and the Earth System Grid and NASA
• How do we catalog all of NASA's current planetary science data?
– Required by the NASA Planetary Data System
Copyright 2012. Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
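To put the 700 TB/sec SKA figure above in perspective, here is a quick back-of-envelope; the per-day total is my own arithmetic, not a number from the slides:

```python
# Back-of-envelope for the SKA number above: what does sustaining
# 700 TB/sec imply per day if you "actually have to keep it around"?
tb_per_sec = 700
seconds_per_day = 86_400
tb_per_day = tb_per_sec * seconds_per_day  # 60,480,000 TB
print(f"{tb_per_day:,} TB/day, i.e. ~{tb_per_day / 1_000_000:.1f} exabytes/day")
```

Roughly 60 exabytes per day, which is why "keep it around" is the hard part of the challenge.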
5. The NASA ESDS Context
Where is open source most useful?
Which area should produce open source software?
6. Lessons from 90's era missions
• Increasing data volumes (exponential growth)
• Increasing complexity of instruments and algorithms
• Increasing availability of proxy/sim/ancillary data
• Increasing rate of technology refresh
… all of this while NASA Earth Mission funding was decreasing
A data system framework based on a standard architecture and reusable software components for supporting all future missions.
7. Where do Big Data technologies fit into this?
U.S. National Climate Assessment (pic credit: Dr. Tom Painter)
SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)
9. EVLA demonstration architecture
[Diagram: the day2_TDEM0003_10s_norx EVLA data cube flows through a staging area into Apache OODT CAS Data Services (crawler, File Manager with repository/catalog, curator) and the PCS Data System (Workflow Manager, Resource Manager, monitor); products, metadata, and status are exposed to a WWW browser, science system services, and an operator on ska-dc.jpl.nasa.gov. The legend distinguishes data flow from control flow.]
10. Apache OODT
• Entered incubation at the Apache Software Foundation in 2010
• Selected as a top level Apache Software Foundation project in January 2011
• Developed by a community of participants from many companies, universities, and organizations
• Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more
http://oodt.apache.org
OODT development & user community includes: [logos]
11. Apache OODT: OSS "big data" platform originally pioneered at NASA
• OODT is meant to be a set of tools to help build data systems
– It's not meant to be turn-key
– It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
– Each discipline/project extends
• Projects that are deploying it operationally
– Decadal-survey recommended NASA Earth science missions, NIH and NCI, CHLA, USC, South African SKA project
• Why Apache?
– Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
– Differs from other open source communities; it provides a governance and management structure
12. Why Apache and OODT?
• OODT is meant to be a set of tools to help build data systems
– It's not meant to be turn-key
– It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
– Each discipline/project extends
• Apache is the elite open source community for software developers
– Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
– Differs from other open source communities; it provides a governance and management structure
13. Governance Model + NASA = ♥
• NASA and other government agencies have tons of process
– They like that
14. OODT Framework and PCS
[Diagram: the OBJECT ORIENTED DATA TECHNOLOGY FRAMEWORK connects web tools and client navigation services to profile, product, and query services, with bridges to external services and XML data from multiple data systems; the Catalog & Archive Service (CAS) and Process Control System (PCS) sit at its core.]
CAS has recently become known as Process Control System when applied to mission work.
15. Current PCS deployments
Orbiting Carbon Observatory (OCO-2) – spectrometer instrument
– NASA ESSP Mission, launch date: TBD 2013
– PCS supporting Thermal Vacuum Tests, ground-based instrument data processing, space-based instrument data processing and Science Computing Facility
– EOM Data Volume: 61-81 TB in 3 yrs; Processing Throughput: 200-300 jobs/day
NPP Sounder PEATE – infrared sounder
– Joint NASA/NPOESS mission, launch date: October 2011
– PCS supporting Science Computing Facility (PEATE)
– EOM Data Volume: 600 TB in 5 yrs; Processing Throughput: 600 jobs/day
QuikSCAT – scatterometer
– NASA Quick-Recovery Mission, launch date: June 1999
– PCS supporting instrument data processing and science analyst sandbox
– Originally planned as a 2-year mission
SMAP – high-res radar and radiometer
– NASA decadal study mission, launch date: 2014
– PCS supporting radar instrument and science algorithm development testbed
16. Other PCS applications
Astronomy and Radio
– Prototype work on MeerKAT with South Africans and KAT-7 telescope
– Discussions ongoing with NRAO Socorro (EVLA and ALMA)
Bioinformatics
– National Institutes of Health (NIH) National Cancer Institute's (NCI) Early Detection Research Network (EDRN)
– Children's Hospital LA Virtual Pediatric Intensive Care Unit (VPICU)
Earth Science
– National Climate Assessment – Snow Hydrology in the Western US and Alaska
– National Climate Assessment – Regional Climate Modeling and Evaluation
Technology Demonstration
– JPL's Active Mirror Telescope (AMT)
– White Sands Missile Range
17. PCS Core Components
• All core components implemented as web services
– XML-RPC used to communicate between components
– Servers implemented in Java
– Clients implemented in Java, scripts, Python, PHP and web-apps
– Service configuration implemented in ASCII and XML files
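Because the components speak XML-RPC, any language with an XML-RPC library can act as a PCS client. A minimal sketch of what such a call looks like on the wire, using Python's standard library; the method name `filemgr.ingestProduct` and the metadata keys are illustrative stand-ins, not the actual File Manager RPC interface:

```python
import xmlrpc.client

# Hypothetical product metadata; keys are illustrative, not the
# real OODT File Manager schema.
metadata = {"ProductName": "granule_001.h5", "ProductType": "L1B"}

# dumps() marshals the call into the XML payload a PCS server would
# receive, letting us inspect the wire format without a live server.
payload = xmlrpc.client.dumps((metadata,), methodname="filemgr.ingestProduct")
print(payload)
```

The same marshaled structure is what a Java server unpacks on the other end, which is why heterogeneous clients (Java, Python, PHP, shell scripts) interoperate cleanly.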
18. Core Capabilities
• File Manager does Data Management
– Tracks all of the stored data, files & metadata
– Moves data to appropriate locations before and after initiating PGE runs, and from staging area to controlled access storage
• Workflow Manager does Pipeline Processing
– Automates processing when all run conditions are ready
– Monitors and logs processing status
• Resource Manager does Resource Management
– Allocates processing jobs to computing resources
– Monitors and logs job & resource status
– Copies output data to storage locations where space is available
– Provides the means to monitor resource usage
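The Resource Manager's "allocate jobs to computing resources" role can be sketched as slot-based assignment. The node names, slot counts, and greedy policy below are invented for illustration and far simpler than the real scheduler:

```python
# Toy model: each node advertises free job slots; assign each PGE job
# to the first node with remaining capacity (illustrative policy only).
nodes = {"node-1": 2, "node-2": 1}          # node -> free slots
jobs = ["pge-run-a", "pge-run-b", "pge-run-c"]

assignments = {}
for job in jobs:
    node = next((n for n, free in nodes.items() if free > 0), None)
    if node is None:
        break  # no capacity left: a real scheduler would queue the job
    assignments[job] = node
    nodes[node] -= 1

print(assignments)
```

Monitoring job and resource status, as the slide notes, is what lets the real system make this decision with live capacity data rather than a static table.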
22. How do we deploy PCS for a mission?
• We implement the following mission-specific customizations
– Server Configuration
• Implemented in ASCII properties files
– Product metadata specification
• Implemented in XML policy files
– Processing Rules
• Implemented as Java classes and/or XML policy files
– PGE Configuration
• Implemented in XML policy files
– Compute Node Usage Policies
• Implemented in XML policy files
• Here's what we don't change
– All PCS Servers (e.g. File Manager, Workflow Manager, Resource Manager)
• Core data management, pipeline process management and job scheduling/submission capabilities
– File Catalog schema
– Workflow Model Repository Schema
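To give a feel for the XML-policy-file approach to product metadata, here is a sketch that reads static metadata out of a policy document. The XML layout is a made-up stand-in for illustration, not OODT's actual product-type policy schema:

```python
import xml.etree.ElementTree as ET

# Invented policy fragment; element names are illustrative only.
policy = """
<type id="urn:example:L1B" name="L1B">
  <metadata>
    <keyval><key>ProductType</key><val>L1B</val></keyval>
    <keyval><key>MissionPhase</key><val>Ops</val></keyval>
  </metadata>
</type>
"""

root = ET.fromstring(policy)
# Collect the static key/value pairs a crawler would attach at ingest.
static_met = {kv.findtext("key"): kv.findtext("val")
              for kv in root.iter("keyval")}
print(static_met)
```

The design point the slide makes is that missions edit declarative files like this one, while the server code and catalog schemas stay untouched.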
23. Server and PGE Configuration
24. Latest Apache OODT release: 0.3
• First appearance of PCS
– Core, Services (JAX-RS)
• Web Applications
– Balance (PHP), and Wicket (Java)-based apps for file management and workflow monitoring
• First release deployed to Maven Central
– We did backport 0.2 there after this
– Over 60 issues fixed in JIRA
• June 2011: recommended stable release
25. Working on: 0.4
• Operator Interface (OODT-157)
• Workflow2 integration (OODT-215) and all of its sub-issues
– Global workflow conditions, dynamic workflows, parallel/sequential model, new workflow engine, etc.
• OODT RADIX for super easy deployment (OODT-120)
• Solr sync with File Manager (OODT-326)
• Improvements to XMLPS (OODT-333) and new crawler actions (OODT-33, OODT-34, OODT-35, OODT-36, OODT-37)
• CLI rewrite and refactor
• Over 130 issues currently resolved
• Likely to come before end of Q2 2012
26. How do these fit together?
• Hadoop HDFS
– OODT File Manager leveraging HDFS for virtual disk path, replication, archiving, scalability
• Hadoop M/R
– Work done in OODT branch to connect OODT Workflow + Resource Mgmt to Hadoop (pre YARN)
• Hadoop HIVE
– Used in Regional Climate Modeling DB
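The "virtual disk path" idea above is that the File Manager computes where a product lives from its metadata, so an HDFS URI can slot in as just another archive root. The layout policy and field names below are invented for illustration, not OODT's actual versioner behavior:

```python
# Sketch: derive an HDFS-style archive location from product metadata.
def archive_path(met, root="hdfs://namenode/oodt/archive"):
    # Year/month partitioning keeps directory fan-out manageable for
    # high-volume missions (an illustrative policy, not OODT's default).
    year, month, _ = met["AcquisitionDate"].split("-")
    return f"{root}/{met['ProductType']}/{year}/{month}/{met['Filename']}"

path = archive_path({"ProductType": "L1B",
                     "AcquisitionDate": "2012-06-13",
                     "Filename": "granule_001.h5"})
print(path)  # hdfs://namenode/oodt/archive/L1B/2012/06/granule_001.h5
```

Swapping `root` between a POSIX path and an HDFS URI is the whole integration story from the client's point of view; replication and scalability come along for free from HDFS.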
27. Where are we headed with OODT + Hadoop?
• Investigate and integrate YARN
– Workflow and Resource Mgmt
• Plug in HBase as File Manager Catalog
– Already plugged in HIVE
– Potentially leverage Gora?
• OODT + Hadoop Virtual Machines and RPMs
– Easy installation leveraging OODT RADIX
• Remote file acquisition (Push/Pull) as Hadoop M/R
28. Key Takeaway
Apache OODT, Apache Hadoop, and other big data technologies are preparing the world to handle all of these diverse use cases!
Constantly evolving and improving frameworks – join up and help.
Free and open source from Apache, and helping government demonstrate the public good.
29. Apache OODT Project Contact Info
• Learn more and track our progress at:
– http://oodt.apache.org
– WIKI: https://cwiki.apache.org/OODT/
– JIRA: https://issues.apache.org/jira/browse/OODT
• Join the mailing list:
– dev@oodt.apache.org
• Chat on IRC:
– #oodt on irc.freenode.net
• Acknowledgements
– Key Members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn, Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid
– Projects, Sponsors, Collaborators: Planetary Data System, Early Detection Research Network, Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA OCO-2 Mission, NASA NPP Sounder PEATE, NASA ACOS Mission, Earth System Grid Federation
30. Alright, I'll shut up now
• Any questions?
• THANK YOU!
– chris.a.mattmann@nasa.gov
– @chrismattmann on Twitter