Addressing Big Data Challenges
through Innovative Architecture,
Databases and Software
UNCLASSIFIED
Dr. Vijay Gadepally
vijayg@ll.mit.edu
Accumulo Summit
College Park, MD
June 12, 2014
This work is sponsored by the Assistant Secretary of Defense for Research
and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions,
interpretations, recommendations and conclusions are those of the authors
and are not necessarily endorsed by the United States Government
Acknowledgements
• Bill Arcand
• Bill Bergeron
• David Bestor
• Chansup Byun
• Matt Hubbell
• Jeremy Kepner
• Pete Michaleas
• Julie Mullen
• Andy Prout
• Albert Reuther
• Tony Rosa
• Charles Yee
And many more …
Outline
• Introduction
• Cloud Computing and Challenges
• Innovative Architecture: MIT SuperCloud
• Innovative Databases: Apache Accumulo
• Innovative Software: D4M
• R&D Examples
• Conclusions
Introduction to MIT Lincoln Laboratory
Established 1951
Lincoln Laboratory is a Department of Defense FFRDC operated by MIT
MIT Lincoln Laboratory
Purpose: Technology in Support of National Security
Core Work Areas:
• Sensors
• Information Extraction
• Communications
• Integrated Sensing and Decision Support (Secure – Countermeasure Resistant)
Current Mission Areas:
• Space Control
• Intelligence, Surveillance, and Reconnaissance Systems and Technology
• Tactical Systems
• Air and Missile Defense Technology
• Homeland Protection
• Air Traffic Control
• Communication Systems
• Advanced Technology
• Cyber Security and Information Sciences
• Engineering
MIT Lincoln Laboratory
- Focus Areas -
[Figure: focus areas with their methodology and outputs. Labels recoverable from the slide:]
• System Analysis: capabilities against existing & future threats
• Rapid Prototyping: rapid development; operationally relevant; validated by testing
• Testing: highly instrumented; field / operational testing
• Outputs: Technology Assessments to Senior Leadership; Architects and Requirements Definition; Advanced Technology Prototypes
• Trusted Government Advisor: Broad Multi-Mission Technology Strength; Architecture Analysis and Test
• University Affiliations
• Outreach: Conferences, Workshops
Common Big Data Challenge
[Figure: the gap between Data and Users, with world total information stored (MB) and world total computing capacity (MIPS, millions of instructions per second) plotted by year, 1986 to 2013.]
• Users: Warfighters, Operators, Analysts
• Data: Maritime, Ground, Space, C2, Cyber, Text and Social Media, Air, HUMINT, Weather
• World total information stored: 3 × 10^14 MB
• World total computing capacity: 7 × 10^12 MIPS
Source: M. Hilbert and P. López, Science, Vol. 332 (2011) and associated online material
Rapidly increasing
- Data volume
- Data velocity
- Data variety
Common Big Data Architecture
[Figure: architecture diagram connecting data sources to users through the components listed below.]
• Users: Warfighters, Operators, Analysts
• Data: Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather
• Components: Analytics, Computing, Web, Files, Scheduler, Ingest & Enrichment, Ingest, Databases
Common Big Data Architecture
- Data Volume: Cloud Computing -
[Figure: Common Big Data Architecture diagram, with MIT SuperCloud shown as the combination of the Enterprise Cloud, Big Data Cloud, Database Cloud, and Compute Cloud.]
MIT SuperCloud merges four clouds
LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping, Reuther et al., IEEE HPEC 2013
Common Big Data Architecture
- Data Velocity: Accumulo Database -
[Figure: Common Big Data Architecture diagram.]
Lincoln benchmarking validated Accumulo performance
Common Big Data Architecture
- Data Variety: D4M Schema -
[Figure: Common Big Data Architecture diagram, with a schema illustration labeled rows, columns, Σ, and raw.]
D4M demonstrated a universal approach to diverse data
Intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … all handled by a common 4 table schema
D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., IEEE HPEC 2013
The Cloud within the Common Big Data Architecture
[Figure: Common Big Data Architecture diagram, with a region labeled The “Cloud”.]
Four Ecosystems Dominate Large Scale Cloud Computing
Enterprise
- Interactive
- On-demand
- Virtualization
Big Data
- Java
- Distributed
- Easy admin
Supercomputing
- High performance
- Scientific computing
- Batch jobs
Database
- Indexing
- Search
- Atomic
• Each cloud ecosystem supports many multi-$B industries
• Each cloud ecosystem uses different software and hardware
MIT SuperCloud
[Figure: the four ecosystems (Enterprise, Big Data, Supercomputing, Database) and their characteristics, combined as MIT SuperCloud.]
• MIT SuperCloud adds virtual machines and security
• Combines all four ecosystems without sacrificing performance
Core Technologies
• VMware is the main enterprise computing virtualization technology
• Message Passing Interface (MPI) is the primary supercomputing API
• Structured Query Language (SQL) is the primary database API
• Hadoop, Accumulo, and D4M are core to government big data clouds
[Figure: MIT SuperCloud spanning the four ecosystems, each labeled with its core technology: Enterprise (VMware), Big Data (Hadoop), Supercomputing (MPI), Database (SQL).]
D4M = Dynamic Distributed Dimensional Data Model
MIT SuperCloud
• Developed to address the challenges associated with big data
volume
• Cloud system allows all four ecosystems of the cloud to exist
within the same computational architecture
• Key Innovations:
– Shared HPC cloud capabilities
– High performance
– Reliable
• Brings the power of cloud computing to the HPC community
Cloud Ecosystems
[Figure: MIT SuperCloud combining the Enterprise (VMware), Big Data (Hadoop), Supercomputing (MPI), and Database (SQL) ecosystems.]
• Allows different architectures to be dynamically combined and tested
MIT SuperCloud
[Figure: TX-E1 system diagram. Labels: Network Storage, Scheduler, Monitoring System, Compute Nodes, Service Nodes, Cluster Switch, LAN Switch, Project Data. Job types: Interactive Compute Job, Interactive VM Job, Interactive Database Job.]
Cloud Computing @ MIT
• The cloud computing infrastructure at Lincoln Laboratory is
based on the MIT SuperCloud infrastructure, which allows the
different cloud ecosystems to coexist
• MIT SuperCloud architecture addresses the issues of big data
volume
• Centerpiece of MIT SuperCloud: Accumulo database
Apache Accumulo
• Highest performance open source database
• Contributed to Apache project by the NSA in 2011
• Used extensively for government applications
• Requires a schema for storing and organizing data to obtain full
benefits
Accumulo and the MIT SuperCloud
• Apache Accumulo is a high performance database used for a
variety of purposes
– Helps address the big data velocity challenge
• Accumulo is the centerpiece of the Common Big Data
Architecture developed by MIT Lincoln Laboratory
• Key features:
– Open Source
– High Performance
– Widely adopted
– Vibrant Developer Community
• MIT Lincoln Laboratory has developed a set of tools, D4M, to
help researchers use Accumulo for novel research
Common Big Data Architecture
- Data Velocity: Accumulo Database -
[Figure: Common Big Data Architecture diagram.]
High Level Language: D4M
http://www.mit.edu/~kepner/D4M
[Figure: D4M (Dynamic Distributed Dimensional Data Model) connects the Accumulo distributed database to a numerical computing environment through associative arrays. Labels: Query: Alice, Bob, Cathy, David, Earl; graph nodes A, B, C, D, E.]
A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
D4M
• The Dynamic Distributed Dimensional Data Model
– Supports database and computation systems that deal with
Big Data
– Developed at Lincoln Laboratory
• Key Features:
– Applies linear algebra and signal processing techniques to
databases through associative arrays
– D4M data schema offers a one-stop solution for most types of
data source for any type of database
– Low barrier to entry – API accessible to those even with
minimal database and/or big-data background
Associative Arrays
• Extends associative arrays to 2D and mixed data types
A('#a1','#b2') = 'same_tweet'
• Key innovation: 2D is 1-to-1 with triple store
('#a1','#b2','same_tweet')
• Enables composable mathematical operations
A + B    A - B    A & B    A|B    A*B
• Enables composable query operations via array indexing
A('#a1 #b2 ',:)    A('#a1,',:)    A('#a* ',:)
A('#a1 : #b2 ',:)    A(1:2,:)    A == #b2
[Figure: illustration of the entry with labels #a1, #b2, and same_tweet.]
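The 1-to-1 correspondence between a 2D associative array and a triple store can be sketched in a few lines. This is a minimal Python illustration rather than D4M itself; the function names triples_to_assoc and assoc_to_triples are hypothetical, and a plain dict of dicts stands in for the associative array.

```python
from collections import defaultdict


def triples_to_assoc(triples):
    """Build a 2D associative array (dict of dicts) from (row, col, value) triples."""
    assoc = defaultdict(dict)
    for row, col, value in triples:
        assoc[row][col] = value
    return dict(assoc)


def assoc_to_triples(assoc):
    """Flatten a 2D associative array back into a sorted list of triples."""
    return sorted(
        (row, col, value)
        for row, cols in assoc.items()
        for col, value in cols.items()
    )


# The single entry used on the slide: A('#a1','#b2') = 'same_tweet'
triples = [("#a1", "#b2", "same_tweet")]
A = triples_to_assoc(triples)
round_trip = assoc_to_triples(A)
```

Because each (row, column) pair holds exactly one value, converting in either direction loses nothing, which is what lets a triple store such as Accumulo back the array.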
Data Schema
• A structure described in a language supported by the database
management system
• Accumulo supports triples
– How can we represent heterogeneous data types in a common data
schema?
– Use the D4M schema
• D4M schema converts structured or unstructured raw data to
the 3-tuple representation supported by Accumulo:
– row is a unique identifier (often some variation of a time stamp)
– column is a unique representation of the data
– value is typically just ‘1’
• Usually use a 4 table representation
– The Edge Table, the Transpose Table, Degree Table, Raw Table
(row, column) → value
Exploded Table

Raw data:
row_num  col1      col2      col3
001      row1col1  row1col2  word1 word2 word3
002      row2col1  row2col2  word2 word3
003      …         …         word1 word3

Use row_num as row indices. Create columns for each unique type/value pair.

Tedge (edge table):
             col1|row1col1  col1|row2col1  col2|row1col2  col2|row2col2  col3|word1  col3|word2  col3|word3
row_num|001  1                             1                             1           1           1
row_num|002                 1                             1                          1           1
row_num|003                                                              1                       1

TedgeDeg (degree table):
             col1|row1col1  col1|row2col1  col2|row1col2  col2|row2col2  col3|word1  col3|word2  col3|word3
Degree       1              1              1              1              2           2           3

TedgeT (transpose of Tedge):
               row_num|001  row_num|002  row_num|003
col1|row1col1  1
col1|row2col1               1
col2|row1col2  1
col2|row2col2               1
col3|word1     1                         1
col3|word2     1            1
col3|word3     1            1            1

TedgeTxt (raw text table):
             text
row_num|001  word1 word2 word3
row_num|002  word2 word3
row_num|003  word1 word3
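The four tables above can be derived mechanically from the raw records. The following is a minimal Python sketch of that explosion step, not the D4M implementation: the explode function and the records dictionary are hypothetical, plain dicts stand in for Accumulo tables, and the col1/col2 values of row 003 are left out because the slide elides them.

```python
from collections import Counter

# Raw records from the example; row 003's col1 and col2 are elided in the
# source, so only its text column is included here.
records = {
    "001": {"col1": "row1col1", "col2": "row1col2", "col3": "word1 word2 word3"},
    "002": {"col1": "row2col1", "col2": "row2col2", "col3": "word2 word3"},
    "003": {"col3": "word1 word3"},
}


def explode(records, text_field="col3"):
    """Return (Tedge, TedgeT, TedgeDeg, TedgeTxt) as plain dictionaries."""
    tedge, tedge_t, tedge_txt = {}, {}, {}
    degree = Counter()
    for row_num, fields in records.items():
        row_key = f"row_num|{row_num}"
        tedge[row_key] = {}
        for name, raw in fields.items():
            # The text field is split into words; other fields stay whole.
            values = raw.split() if name == text_field else [raw]
            for value in values:
                col_key = f"{name}|{value}"
                if col_key in tedge[row_key]:
                    continue  # count each type|value pair once per row
                tedge[row_key][col_key] = 1
                tedge_t.setdefault(col_key, {})[row_key] = 1
                degree[col_key] += 1
        if text_field in fields:
            tedge_txt[row_key] = fields[text_field]
    return tedge, tedge_t, dict(degree), tedge_txt


tedge, tedge_t, tedge_deg, tedge_txt = explode(records)
```

Keeping the transpose and degree tables alongside the edge table trades storage for fast lookups by column, and lets a query check how many rows a column touches before fetching them.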
Composable Associative Arrays
• Key innovation: mathematical closure
– All associative array operations return associative arrays
• Enables composable mathematical operations
A + B    A - B    A & B    A|B    A*B
• Enables composable query operations via array indexing
A('alice bob ',:)    A('alice ',:)    A('al* ',:)
A('alice : bob ',:)    A(1:2,:)    A == 47.0
• Simple to implement in a library (~3500 lines) in programming
environments with: 1st class support of 2D arrays, operator
overloading, sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
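Mathematical closure is easiest to see in code: every operation returns another associative array, so operations and queries chain freely. The class below is a toy Python sketch, assuming a dictionary keyed by (row, column); the name Assoc and the methods rows and where are illustrative choices, the & semantics (1 wherever both arrays have an entry) are an assumption, and Python's @ operator stands in for the A*B product on the slide.

```python
class Assoc:
    """Tiny sparse 2D associative array keyed by (row, col) strings."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def __add__(self, other):
        # Element-wise sum over the union of keys.
        out = dict(self.entries)
        for key, value in other.entries.items():
            out[key] = out.get(key, 0) + value
        return Assoc(out)

    def __and__(self, other):
        # Assumed semantics: 1 wherever both arrays hold an entry.
        shared = self.entries.keys() & other.entries.keys()
        return Assoc({key: 1 for key in shared})

    def __matmul__(self, other):
        # Sparse matrix product: sum over the shared inner key.
        out = {}
        for (i, k), a in self.entries.items():
            for (k2, j), b in other.entries.items():
                if k == k2:
                    out[(i, j)] = out.get((i, j), 0) + a * b
        return Assoc(out)

    def rows(self, prefix):
        # Query by row-key prefix, similar in spirit to A('al* ',:).
        return Assoc({(r, c): v for (r, c), v in self.entries.items()
                      if r.startswith(prefix)})

    def where(self, value):
        # Query by value, similar in spirit to A == 47.0.
        return Assoc({k: v for k, v in self.entries.items() if v == value})


A = Assoc({("alice", "bob"): 1, ("alice", "carl"): 1, ("bob", "carl"): 47.0})
B = Assoc({("alice", "bob"): 2})

C = (A + B).rows("al")    # arithmetic, then a query, in one expression
D = (A @ A).where(47.0)   # two-hop paths, then a value filter
```

Since C and D are themselves Assoc objects, either could feed another operation, which is the composability the slide describes.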
Using D4M for Advanced Analytics
• D4M allows researchers to harness the versatility of the MIT
SuperCloud architecture and the speed of Apache Accumulo
through the familiarity of high-level languages such as MATLAB
or GNU Octave.
• D4M schema provides an approach to mitigate challenges
associated with big data variety
• D4M is used for a variety of applications across the Department
of Defense and Intelligence Community
Supporting National Security
-Rapid Solution Prototyping-
336592592584179712 2013-05-20 21:21:42 20798128 kiefpief web 3b77caf94bfc81fe I am sending love to Oklahoma. And actually -- to everyone who may need it. You are loved. And you are not alone. Promise. #PrayforOklahoma
336600956710027264 2013-05-20 21:54:56 35.99894978 -78.90660222 -8783842.7781526 4300476.86376416 22435220 RyanBLeslie Twitter for iPad 348803787 bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe: The devastation in Oklahoma is
…
Step 1: Start an instance of Accumulo and Ingest Data
Step 2: Find all tweets with keyword:
>>A = Tedge(Row(Tedge(:, 'word|#prayforoklahoma,')),:);
Step 3: Filter tweets by location:
>>B = A(:, 'latlon|+-003934,:,latlon|+-003979,');
Step 4: Visualize results:
>>Assoc2KML(B);
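Steps 2 and 3 above can be mimicked on an in-memory edge table to show what the two queries select. This is a Python sketch under stated assumptions, not the D4M API: the row keys t1 to t3 and their latlon column keys are made-up illustrative data, the helper names are hypothetical, and the column range is treated as an inclusive lexicographic range.

```python
# Hypothetical edge table: row key -> {column key: 1}
tedge = {
    "t1": {"word|#prayforoklahoma": 1, "latlon|+-003950": 1},
    "t2": {"word|#prayforoklahoma": 1, "latlon|+-004200": 1},
    "t3": {"word|weather": 1, "latlon|+-003960": 1},
}


def rows_with_column(table, column):
    """Row keys that have an entry in the given column."""
    return sorted(row for row, cols in table.items() if column in cols)


def select_rows(table, rows):
    """Sub-table holding only the requested rows, with all their columns."""
    return {row: dict(table[row]) for row in rows if row in table}


def column_range(table, start, stop):
    """Keep columns whose key sorts between start and stop (inclusive)."""
    out = {}
    for row, cols in table.items():
        kept = {col: val for col, val in cols.items() if start <= col <= stop}
        if kept:
            out[row] = kept
    return out


# Step 2: all tweets containing the keyword, with every column they carry.
A = select_rows(tedge, rows_with_column(tedge, "word|#prayforoklahoma"))

# Step 3: keep only location columns that fall inside the key range.
B = column_range(A, "latlon|+-003934", "latlon|+-003979")
```

The nested lookup in Step 2 first finds matching row keys through the keyword column and then pulls the full rows, which is why the location columns are still available to filter in Step 3.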
Promoting big data discovery
-Domain Agnostic Analytics-
Model Background Data to Extract Signal from Observations
Goal: Find subgraph of interest using background model to identify noise
Example background model: Power Law Graph
[Figure: signal and noise in N-D space.]
[Figure: Degree Distribution, Count versus Degree on log-log axes, with dmax marked.]
Observed Data (Signal & Noise) - Background Model of Data (Noise) = Residual Data (Signal)
Big Data Filtering and Sampling
Detecting Subgraphs of Interest from Large Graphs
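The "observed minus background equals residual" idea can be sketched for the power-law example. This Python snippet is an illustration only: fitting count ≈ c · degree^(-alpha) by least squares in log-log space is an assumed, simplistic choice of background model fit, and the function names and numbers are hypothetical.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Map each degree to the number of vertices that have it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def fit_power_law(counts):
    """Least-squares fit of count = c * degree**(-alpha) in log-log space.

    Needs at least two distinct degrees with positive counts.
    """
    points = [(math.log(d), math.log(n)) for d, n in counts.items()]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope, math.exp(mean_y - slope * mean_x)


def residual(observed, alpha, c):
    """Observed count minus the background model's expected count."""
    return {d: n - c * d ** (-alpha) for d, n in observed.items()}


# Hypothetical background degree distribution and an observation with
# extra vertices of degree 10.
background = {1: 1000, 10: 100, 100: 10}
observed = {1: 1000, 10: 130, 100: 10}

alpha, c = fit_power_law(background)
signal = residual(observed, alpha, c)
star = degree_counts([("a", "b"), ("a", "c"), ("a", "d")])
```

Degrees where the residual is far from zero are candidates for the subgraph of interest; everything the background model explains is treated as noise.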
Securing the Cloud
-The Lincoln Secure and Resilient Cloud-
[Figure: Common Big Data Architecture components (Analytics, Computing, Web, Files, Scheduler, Ingest & Enrichment, Databases) overlaid with three areas: Secure and Resilient Communication + Provenance, Secure and Resilient Storage, Secure and Resilient Processing.]
• Big Data systems are vulnerable to a variety of attacks
• Improve security of cloud systems by researching:
– Security in Communication and Provenance
– Security in Data Storage
– Security in Processing
– Security in the underlying architecture
Ensuring Privacy
-Computing On Masked Data-
Big Data Veracity
[Figure: three panels of the architecture (Analytics, Computing, Files, Scheduler, Ingest & Enrichment).]
• Challenges: Remote Code Injection, Hypervisor Privilege Escalation, Cross VM Side Channels, Data Loss / Exfiltration, Data Integrity Attack
• Current Approaches: encrypted link, encrypted storage
• Vision: Compute on Encrypted Data
Step 1: Mask data and ingest into database
>>put(Tedge, Mask(Aedge, maskcode));
Step 2: Query DB for results with masked queries
>>Aedge_mt = Tedge(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:);
>>Atxt_mt = TedgeTxt(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:);
Step 3: Unmask Results
>>Aedge = Unmask(Aedge_mt, maskcode);
>>Atxt = Unmask(Atxt_mt, maskcode);
Use D4M and Computing on Masked Data (CMD) to protect the 4th V of Big Data: Veracity
• Big Data systems are vulnerable to a variety of attacks
• Currently encrypt data at rest but data in flight is in the clear
• Compute on Encrypted Data: Data is always protected by
encryption throughout the system.
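The mask, query, unmask flow in Steps 1 to 3 can be illustrated with a deterministic keyed hash, so that equal plaintext keys map to equal masked keys and the database can match them without seeing the originals. This Python sketch is a stand-in, not the scheme behind Mask, StrMask, and Unmask (which the slides do not describe): the Masker class, the use of HMAC-SHA256, the client-side reverse lookup, and the sample rows are all assumptions.

```python
import hashlib
import hmac


class Masker:
    """Deterministic masking with a client-held key and reverse lookup."""

    def __init__(self, key: bytes):
        self._key = key
        self._reverse = {}  # masked token -> original, kept on the client only

    def mask(self, token: str) -> str:
        digest = hmac.new(self._key, token.encode("utf-8"),
                          hashlib.sha256).hexdigest()
        self._reverse[digest] = token
        return digest

    def unmask(self, masked: str) -> str:
        return self._reverse[masked]


masker = Masker(b"demo-key-not-for-production")

# Step 1: mask row and column keys, then ingest into the (stand-in) database.
plain_edges = [("tweet1", "word|bieber"), ("tweet2", "word|weather")]
store = {}  # masked column key -> set of masked row keys
for row, col in plain_edges:
    store.setdefault(masker.mask(col), set()).add(masker.mask(row))

# Step 2: query with a masked key; the store never sees the plaintext.
masked_hits = store.get(masker.mask("word|bieber"), set())

# Step 3: unmask the results on the client side.
hits = sorted(masker.unmask(m) for m in masked_hits)
```

Deterministic masking supports exact-match queries like this one but reveals which entries are equal, so it is only one point in the trade-off between query capability and protection.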
Summary
• Lincoln Laboratory missions collect and process vast amounts of data
from many sources
• MIT Lincoln Laboratory makes use of innovations in system architecture
(MIT SuperCloud), database technologies (Apache Accumulo) and
software (D4M) to develop technology in support of national security
Mission Areas: Air and Missile Defense, Homeland Protection, Air Traffic Control, Communication Systems, Advanced Technology, Space Control, ISR Systems and Technology, Tactical Systems, Cyber Security, Engineering
Data Sources: Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather
Lincoln Laboratory is always interested in technical exchange with the big data community!
Backup
Cyber Security and Information Sciences
• Cyber Situational Awareness: Correlation and visualization of cyber alert data makes it possible to detect and understand attacks on large, enterprise networks.
• Cyber Testing and Range Development: Lincoln Laboratory builds, supports, and uses cyber ranges to evaluate the performance of cyber security technology.
• Cyber Security Metrics: Metrics are defined and measured to estimate the defensive posture of enterprise-class networks.
• Anti-Tamper Hardware: Physically unclonable functions are used to embed cryptographic key material in a coating around a computing module, permitting detection of tampering.
• Net-Centric Operations: Research and prototyping of Service-Oriented Architectures that enable the dynamic composition of systems involving complex sensors, processing and decision-support elements.
• Human Language Technology: Algorithms are developed and implemented for speech and biometric applications, including language/speaker identification, machine translation, and face comparison.


  • 1. Addressing Big Data Challenges through Innovative Architecture, Databases and Software UNCLASSIFIED Dr. Vijay Gadepally vijayg@ll.mit.edu Accumulo Summit College Park, MD June 12, 2014 This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government
  • 2. Acknowledgements • Bill Arcand • Bill Bergeron • David Bestor • Chansup Byun • Matt Hubbell • Jeremy Kepner • Pete Michaleas • Julie Mullen • Andy Prout • Albert Reuther • Tony Rosa • Charles Yee And many more …
  • 3. Outline • Introduction • Cloud Computing and Challenges • Innovative Architecture: MIT SuperCloud • Innovative Databases: Apache Accumulo • Innovative Software: D4M • R&D Examples • Conclusions
  • 4. Introduction to MIT Lincoln Laboratory. Established 1951. Lincoln Laboratory is a Department of Defense federally funded research and development center (FFRDC) operated by MIT.
  • 5. MIT Lincoln Laboratory. Purpose: Technology in Support of National Security. Core Work Areas: Sensors; Information Extraction; Communications; Integrated Sensing and Decision Support (Secure – Countermeasure Resistant). Current Mission Areas: Space Control; Intelligence, Surveillance, and Reconnaissance Systems and Technology; Tactical Systems; Air and Missile Defense Technology; Homeland Protection; Air Traffic Control; Communication Systems; Advanced Technology; Cyber Security and Information Sciences; Engineering.
  • 6. MIT Lincoln Laboratory - Focus Areas - [Diagram labels: Rapid Prototyping; Trusted Government Advisor; University Affiliations. Methodology: System Analysis; Testing; Technology Prototyping • Highly instrumented • Field / operational testing • Capabilities against existing & future threats • Rapid development • Operationally relevant • Validated by testing. Outputs: Assessments to Senior Leadership; Architects and Requirements Definition; Advanced Technology Prototypes. Also shown: Broad Multi-Mission Technology Strength; Architecture Analysis and Test; Conferences, Workshops; Outreach.]
  • 7. Outline • Introduction • Cloud Computing and Challenges • Innovative Architecture: MIT SuperCloud • Innovative Databases: Apache Accumulo • Innovative Software: D4M • R&D Examples • Conclusions
  • 8. Common Big Data Challenge. Users: Warfighters, Operators, Analysts. Data: Maritime, Ground, Space, C2, Cyber, Text and Social Media, Air, HUMINT, Weather. A gap separates users from data. [Figure: World Total Information Stored (MB) and World Total Computing Capacity (millions of instructions per second, MIPS) by year, 1986 to 2013, on logarithmic axes; annotated values 3 × 10^14 MB and 7 × 10^12 MIPS. Source: M. Hilbert and P. López, Science, Vol. 332 (2011) and associated online material.] Rapidly increasing: data volume, data velocity, data variety.
  • 9. Common Big Data Architecture. [Diagram: Users (Warfighters, Operators, Analysts); Data (Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather); components: Web, Analytics, Computing, Scheduler, Files, Databases, Ingest & Enrichment, Ingest.]
  • 10. Common Big Data Architecture - Data Volume: Cloud Computing - [Common Big Data Architecture diagram.] MIT SuperCloud merges four clouds: Enterprise Cloud, Big Data Cloud, Database Cloud, Compute Cloud. Reference: LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping, Reuther et al., IEEE HPEC 2013.
  • 11. Common Big Data Architecture - Data Velocity: Accumulo Database - [Common Big Data Architecture diagram.] Lincoln benchmarking validated Accumulo performance.
  • 12. Common Big Data Architecture - Data Variety: D4M Schema - [Common Big Data Architecture diagram; raw data mapped to rows and columns.] D4M demonstrated a universal approach to diverse data: intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … all handled by a common 4 table schema. Reference: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., IEEE HPEC 2013.
  • 13. The Cloud within the Common Big Data Architecture. [Common Big Data Architecture diagram with the “Cloud” marked.]
  • 14. Four Ecosystems Dominate Large Scale Cloud Computing • Each cloud ecosystem supports many multi-$B industries • Each cloud ecosystem uses different software and hardware. Enterprise: Interactive, On-demand, Virtualization. Big Data: Java, Distributed, Easy admin. Database: Indexing, Search, Atomic. Supercomputing: High performance, Scientific computing, Batch jobs.
  • 15. MIT SuperCloud • MIT SuperCloud adds virtual machines and additional security • Combines all four ecosystems without sacrificing performance. Enterprise: Interactive, On-demand, Virtualization. Big Data: Java, Distributed, Easy admin. Database: Indexing, Search, Atomic. Supercomputing: High performance, Scientific computing, Batch jobs.
  • 16. MIT SuperCloud: Core Technologies • VMware is the main enterprise computing virtualization technology • Message Passing Interface (MPI) is the primary supercomputing API • Structured Query Language (SQL) is the primary database API • Hadoop, Accumulo, and D4M are core to government big data clouds. Enterprise (VMware): Interactive, On-demand, Virtualization. Big Data (Hadoop): Java, Distributed, Easy admin. Supercomputing (MPI): High performance, Scientific computing, Batch jobs. Database (SQL): Indexing, Search, Atomic. D4M = Dynamic Distributed Dimensional Data Model
  • 17. Outline • Introduction • Cloud Computing and Challenges • Innovative Architecture: MIT SuperCloud • Innovative Databases: Apache Accumulo • Innovative Software: D4M • R&D Examples • Conclusions
  • 18. MIT SuperCloud • Developed to address the challenges associated with big data volume • Cloud system allows all four ecosystems of the cloud to exist within the same computational architecture • Key Innovations: – Shared HPC cloud capabilities – High performance – Reliable • Brings the power of cloud computing to the HPC community
  • 19. Cloud Ecosystems • MIT SuperCloud allows different architectures to be dynamically combined and tested. Enterprise (VMware): Interactive, On-demand, Virtualization. Big Data (Hadoop): Java, Distributed, Easy admin. Supercomputing (MPI): High performance, Scientific computing, Batch jobs. Database (SQL): Indexing, Search, Atomic.
  • 20. MIT SuperCloud. [Diagram of the TX-E1 system: Network Storage, Scheduler, Monitoring System, Compute Nodes, Service Nodes, Cluster Switch, LAN Switch; job types: Interactive Compute Job, Interactive VM Job, Interactive Database Job; Project Data.]
  • 21. Cloud Computing @ MIT • The cloud computing infrastructure at Lincoln Laboratory is based on the MIT SuperCloud infrastructure, which allows the different cloud ecosystems to coexist • MIT SuperCloud architecture addresses the issues of big data volume • Centerpiece of MIT SuperCloud: Accumulo database
  • 22. Outline • Introduction • Cloud Computing and Challenges • Innovative Architecture: MIT SuperCloud • Innovative Databases: Apache Accumulo • Innovative Software: D4M • R&D Examples • Conclusions
  • 23. Apache Accumulo • Highest performance open source database • Contributed to Apache project by the NSA in 2011 • Used extensively for government applications • Requires a schema for storing and organizing data to obtain full benefits
  • 24. Accumulo and the MIT SuperCloud • Apache Accumulo is a high performance database used for a variety of purposes – Helps address the big data velocity challenge • Accumulo is the centerpiece of the Common Big Data Architecture developed by MIT Lincoln Laboratory • Key features: – Open Source – High Performance – Widely adopted – Vibrant Developer Community • MIT Lincoln Laboratory has developed a set of tools, D4M, to help researchers use Accumulo for novel research
  • 25. Common Big Data Architecture - Data Velocity: Accumulo Database - [Common Big Data Architecture diagram.]
  • 26. Outline • Introduction • Cloud Computing and Challenges • Innovative Architecture: MIT SuperCloud • Innovative Databases: Apache Accumulo • Innovative Software: D4M • R&D Examples • Conclusions
  • 27. High Level Language: D4M (http://www.mit.edu/~kepner/D4M). D4M = Dynamic Distributed Dimensional Data Model. [Diagram: a query (Alice, Bob, Cathy, David, Earl) against the Accumulo distributed database returns associative arrays in a numerical computing environment.] A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB. D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization.
  • 28. D4M • The Dynamic Distributed Dimensional Data Model – Supports database and computation systems that deal with Big Data – Developed at Lincoln Laboratory • Key Features: – Applies linear algebra and signal processing techniques to databases through associative arrays – D4M data schema offers a one-stop solution for most types of data source and most types of database – Low barrier to entry – API accessible even to those with minimal database and/or big-data background
  • 29. Associative Arrays • Extends associative arrays to 2D and mixed data types: A('#a1','#b2') = 'same_tweet' • Key innovation: 2D is 1-to-1 with triple store: ('#a1','#b2','same_tweet') • Enables composable mathematical operations: A + B, A - B, A & B, A|B, A*B • Enables composable query operations via array indexing: A('#a1 b2',:), A('#a1,',:), A('#a* ',:), A('#a1: b2',:), A(1:2,:), A == #b2 [Figure: the array entry at row #a1, column #b2 with value same_tweet, shown next to the equivalent triple.]
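The 1-to-1 link between a 2D associative array and a triple store can be illustrated with a small sketch. This is plain Python rather than D4M, and the `Assoc` class with its `triples` and `rows` methods is a hypothetical stand-in, not the D4M API. It only shows that every operation returns another associative array, which is what makes the operations composable.

```python
class Assoc:
    """Toy 2D associative array stored as {(row, col): value} (hypothetical, not D4M)."""

    def __init__(self, triples=()):
        self.data = {(r, c): v for r, c, v in triples}

    def triples(self):
        # The same content viewed as a sorted list of (row, col, value) triples.
        return sorted((r, c, v) for (r, c), v in self.data.items())

    def __add__(self, other):
        # Element-wise sum over the union of keys; the result is again an Assoc.
        out = dict(self.data)
        for key, v in other.data.items():
            out[key] = out[key] + v if key in out else v
        return Assoc((r, c, v) for (r, c), v in out.items())

    def __and__(self, other):
        # Keep only entries whose (row, col) key appears in both arrays.
        keys = self.data.keys() & other.data.keys()
        return Assoc((r, c, self.data[(r, c)]) for r, c in keys)

    def rows(self, prefix):
        # Query by row-key prefix, loosely analogous to a wildcard row query.
        return Assoc((r, c, v) for (r, c), v in self.data.items()
                     if r.startswith(prefix))


A = Assoc([("#a1", "#b2", 1)])
B = Assoc([("#a1", "#b2", 1), ("#a2", "#b2", 1)])
print((A + B).triples())
print((A & B).rows("#a").triples())
```

Because `+`, `&`, and `rows` all return `Assoc` objects, they can be chained freely, as in the last line.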
  • 30. Data Schema • A structure described in a language supported by the database management system • Accumulo supports triples: (row, column) → value – How can we represent heterogeneous data types in a common data schema? – Use the D4M schema • D4M schema converts structured or unstructured raw data to the 3-tuple representation supported by Accumulo: – row is a unique identifier (often some variation of a time stamp) – column is a unique representation of the data – value is typically just ‘1’ • Usually use a 4 table representation – The Edge Table, the Transpose Table, the Degree Table, and the Raw Table
  • 31. Exploded Table. Row numbers are used as row indices, and a column is created for each unique type/value pair.
      Source table:
        row_num | col1     | col2     | col3
        001     | row1col1 | row1col2 | word1 word2 word3
        002     | row2col1 | row2col2 | word2 word3
        003     | …        | …        | word1 word3
      Tedge (row → columns holding the value 1):
        row_num|001 → col1|row1col1, col2|row1col2, col3|word1, col3|word2, col3|word3
        row_num|002 → col1|row2col1, col2|row2col2, col3|word2, col3|word3
        row_num|003 → col3|word1, col3|word3
      TedgeDeg (column → Degree):
        col1|row1col1 → 1, col1|row2col1 → 1, col2|row1col2 → 1, col2|row2col2 → 1, col3|word1 → 2, col3|word2 → 2, col3|word3 → 3
      TedgeT: the transpose of Tedge (the type/value pairs become rows and the row_num keys become columns).
      TedgeTxt (row → text):
        row_num|001 → word1 word2 word3
        row_num|002 → word2 word3
        row_num|003 → word1 word3
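The exploded schema on this slide can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the D4M implementation: the `explode` function and its `text_field` parameter are hypothetical names, and the elided col1 and col2 values of row 003 are simply left out rather than guessed.

```python
def explode(records, text_field):
    """Build the four tables of the exploded schema from dict records (hypothetical helper)."""
    tedge, tedge_t, tedge_deg, tedge_txt = {}, {}, {}, {}
    for row_id, fields in records.items():
        row = f"row_num|{row_id}"
        for name, value in fields.items():
            if name == text_field:
                tedge_txt[row] = value                # raw text is kept as-is
                values = sorted(set(value.split()))   # one column per distinct word
            else:
                values = [value]
            for v in values:
                col = f"{name}|{v}"                   # column key is "type|value"
                tedge[(row, col)] = 1                 # edge table
                tedge_t[(col, row)] = 1               # transpose table
                tedge_deg[col] = tedge_deg.get(col, 0) + 1  # degree table
    return tedge, tedge_t, tedge_deg, tedge_txt


records = {
    "001": {"col1": "row1col1", "col2": "row1col2", "col3": "word1 word2 word3"},
    "002": {"col1": "row2col1", "col2": "row2col2", "col3": "word2 word3"},
    "003": {"col3": "word1 word3"},  # col1/col2 are elided on the slide, so omitted here
}
tedge, tedge_t, tedge_deg, tedge_txt = explode(records, text_field="col3")
print(tedge_deg)
```

The degree table makes it cheap to check how many rows a column key touches before querying the edge table for it.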
  • 32. Composable Associative Arrays • Key innovation: mathematical closure – All associative array operations return associative arrays • Enables composable mathematical operations: A + B, A - B, A & B, A|B, A*B • Enables composable query operations via array indexing: A('alice bob ',:), A('alice ',:), A('al* ',:), A('alice : bob ',:), A(1:2,:), A == 47.0 • Simple to implement in a library (~3500 lines) in programming environments with 1st class support of 2D arrays, operator overloading, and sparse linear algebra • Complex queries with ~50x less effort than Java/SQL • Naturally leads to high performance parallel implementation
  • 33. Using D4M for Advanced Analytics • D4M allows researchers to harness the versatility of the MIT SuperCloud architecture and the speed of Apache Accumulo through the familiarity of high level languages such as MATLAB or GNU Octave • D4M schema provides an approach to mitigate challenges associated with big data variety • D4M is used for a variety of applications across the Department of Defense and Intelligence Community
  • 34. Outline • Introduction • Cloud Computing and Challenges • Innovative Architecture: MIT SuperCloud • Innovative Databases: Apache Accumulo • Innovative Software: D4M • R&D Examples • Conclusions
  • 35. Supporting National Security - Rapid Solution Prototyping -
      Sample raw tweet records:
        336592592584179712 2013-05-20 21:21:42 20798128 kiefpief web 3b77caf94bfc81fe I am sending love to Oklahoma. And actually -- to everyone who may need it. You are loved. And you are not alone. Promise. #PrayforOklahoma
        336600956710027264 2013-05-20 21:54:56 35.99894978 -78.90660222 -8783842.7781526 4300476.86376416 22435220 RyanBLeslie Twitter for iPad 348803787 bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe: The devastation in Oklahoma is …
      Step 1: Start an instance of Accumulo and ingest data
      Step 2: Find all tweets with keyword: >>A = Tedge(Row(Tedge(:, 'word|#prayforoklahoma,')),:);
      Step 3: Filter tweets by location: >>B = A(:, 'latlon|+-003934,:,latlon|+-003979,');
      Step 4: Visualize results: >>Assoc2KML(B);
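Steps 2 and 3 amount to a column lookup followed by a range filter over column keys. The sketch below mimics that logic in plain Python over a dictionary of (row, column) keys. The three helper functions, the row ids `t1` to `t3`, and the `latlon|` keys other than the two range bounds from the slide are all made up for illustration, and the range is assumed to be an inclusive lexicographic range.

```python
def rows_with_column(tedge, col):
    """Row keys that have an entry in the given column (hypothetical helper)."""
    return {r for (r, c) in tedge if c == col}


def select_rows(tedge, rows):
    """All entries belonging to the given rows."""
    return {(r, c): v for (r, c), v in tedge.items() if r in rows}


def select_column_range(assoc, lo, hi):
    """Entries whose column key falls in the inclusive lexicographic range [lo, hi]."""
    return {(r, c): v for (r, c), v in assoc.items() if lo <= c <= hi}


# Made-up edge table: three tweets with one keyword and one location column each.
tedge = {
    ("t1", "word|#prayforoklahoma"): 1, ("t1", "latlon|+-003950"): 1,
    ("t2", "word|#prayforoklahoma"): 1, ("t2", "latlon|+-004100"): 1,
    ("t3", "word|weather"): 1,          ("t3", "latlon|+-003960"): 1,
}

# Step 2: all entries of tweets containing the keyword.
A = select_rows(tedge, rows_with_column(tedge, "word|#prayforoklahoma"))
# Step 3: keep only location columns inside the range used on the slide.
B = select_column_range(A, "latlon|+-003934", "latlon|+-003979")
print(B)
```

Only `t1` survives both steps: `t2` has the keyword but lies outside the range, and `t3` lies inside the range but lacks the keyword.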
  • 36. Promoting big data discovery - Domain Agnostic Analytics - Model Background Data to Extract Signal from Observations. Goal: Find subgraph of interest using background model to identify noise. Example background model: Power Law Graph. [Figure: signal and noise in an N-D space; degree distribution plot of Count versus Degree on log-log axes with dmax marked; Observed Data (Signal & Noise) - Background Model of Data (Noise) = Residual Data (Signal).] Applications: Big Data Filtering and Sampling; Detecting Subgraphs of Interest from Large Graphs.
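The "observed minus background equals residual" idea can be sketched on a degree distribution. This is a minimal illustration, assuming a background of the form n(d) = n1 · d^(−alpha). The values of `n1` and `alpha` below are arbitrary placeholders rather than fitted parameters, and the function names are hypothetical.

```python
from collections import Counter


def degree_distribution(edges):
    """Map degree -> number of vertices with that degree, for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def power_law_background(n1, alpha, degrees):
    """Expected vertex count per degree under n(d) = n1 * d**(-alpha) (assumed model)."""
    return {d: n1 * d ** (-alpha) for d in degrees}


def residual(observed, background):
    """Observed counts minus the background model, per degree."""
    return {d: observed[d] - background[d] for d in observed}


# Small made-up graph: hub "h" linked to a, b, c, d, plus one extra edge a-b.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("a", "b")]
observed = degree_distribution(edges)
background = power_law_background(n1=2, alpha=1, degrees=observed)
print(residual(observed, background))
```

Degrees with a large positive residual are the ones the background model fails to explain, which is where a subgraph of interest would be looked for.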
  • 37. Securing the Cloud - The Lincoln Secure and Resilient Cloud - [Common Big Data Architecture diagram annotated with: Secure and Resilient Communication + Provenance; Secure and Resilient Storage; Secure and Resilient Processing.] • Big Data systems are vulnerable to a variety of attacks • Improve security of cloud systems by researching: – Security in Communication and Provenance – Security in Data Storage – Security in Processing – Security in the underlying architecture
  • 38. Ensuring Privacy - Computing On Masked Data - Use D4M and CMD to protect the 4th V of Big Data – Veracity.
      Big Data Veracity Challenges: Remote Code Injection; Hypervisor Privilege Escalation; Cross VM Side Channels; Data Loss / Exfiltration; Data Integrity Attack.
      Current Approaches: encrypted link, encrypted storage. Vision: Compute on Encrypted Data.
      Step 1: Mask data and ingest into database: >>put(Tedge, Mask(Aedge, maskcode));
      Step 2: Query DB for results with masked queries: >>Aedge_mt = Tedge(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:); >>Atxt_mt = TedgeTxt(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:);
      Step 3: Unmask results: >>Aedge = Unmask(Aedge_mt, maskcode); >>Atxt = Unmask(Atxt_mt, maskcode);
      • Big Data systems are vulnerable to a variety of attacks • Currently encrypt data at rest but data in flight is in the clear • Compute on Encrypted Data: Data is always protected by encryption through the system.
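The mask, query, unmask flow can be illustrated with a stand-in masking scheme. The sketch below swaps in a keyed hash (HMAC-SHA256) with a client-side lookup table for unmasking. That is an assumption chosen only to show the workflow, not the scheme used by Computing on Masked Data, and the `Masker` class and the key are hypothetical.

```python
import hashlib
import hmac


class Masker:
    """Deterministic keyed masking with a client-side reverse table (illustrative only)."""

    def __init__(self, key: bytes):
        self.key = key
        self.reverse = {}  # token -> plaintext, kept by the client, never stored in the DB

    def mask(self, text: str) -> str:
        token = hmac.new(self.key, text.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        self.reverse[token] = text
        return token

    def unmask(self, token: str) -> str:
        return self.reverse[token]


masker = Masker(b"demo-key")  # hypothetical key

# Step 1: mask row and column keys before they reach the database.
tedge = {("row_num|001", "word|bieber"): 1, ("row_num|002", "word|weather"): 1}
masked = {(masker.mask(r), masker.mask(c)): v for (r, c), v in tedge.items()}

# Step 2: query with a masked column key; equal plaintexts map to equal tokens.
query = masker.mask("word|bieber")
hits_masked = {r for (r, c) in masked if c == query}

# Step 3: unmask the results on the client.
hits = {masker.unmask(r) for r in hits_masked}
print(hits)
```

Deterministic masking is what lets the equality query in step 2 run on masked keys; range queries would need a different, order-aware masking, which this sketch does not attempt.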
  • 39. Outline • Introduction • Cloud Computing and Challenges • Innovative Architecture: MIT SuperCloud • Innovative Databases: Apache Accumulo • Innovative Software: D4M • R&D Examples • Conclusions
  • 40. Summary • Lincoln Laboratory missions collect and process vast amounts of data from many sources • MIT Lincoln Laboratory makes use of innovations in system architecture (MIT SuperCloud), database technologies (Apache Accumulo) and software (D4M) to develop technology in support of national security. Mission Areas: Air and Missile Defense; Homeland Protection; Air Traffic Control; Communication Systems; Advanced Technology; Space Control; ISR Systems and Technology; Tactical Systems; Cyber Security; Engineering. Data Sources: Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather. Lincoln Laboratory is always interested in technical exchange with the big data community!
  • 42. Cyber Security and Information Sciences
      Cyber Situational Awareness: Correlation and visualization of cyber alert data makes it possible to detect and understand attacks on large enterprise networks.
      Cyber Testing and Range Development: Lincoln Laboratory builds, supports, and uses cyber ranges to evaluate the performance of cyber security technology.
      Cyber Security Metrics: Metrics are defined and measured to estimate the defensive posture of enterprise-class networks.
      Anti-Tamper Hardware: Physically unclonable functions are used to embed cryptographic key material in a coating around a computing module, permitting detection of tampering.
      Net-Centric Operations: Research and prototyping of Service-Oriented Architectures that enable the dynamic composition of systems involving complex sensors, processing and decision-support elements.
      Human Language Technology: Algorithms are developed and implemented for speech and biometric applications, including language/speaker identification, machine translation, and face comparison.