Collecting and analyzing large amounts of data is a growing challenge within the scientific community. The widening gap between data and users calls for innovative tools that address the challenges posed by big data volume, velocity and variety. The Massachusetts Institute of Technology Lincoln Laboratory (MIT LL) is not immune to these challenges and has developed a set of tools that address many of them.
Big data volume stresses the storage, memory, and compute capacity of a computing system and requires access to a computing cloud. Choosing the right cloud is problem specific. Currently, four multi-billion dollar ecosystems dominate the cloud computing environment: enterprise clouds, big data clouds, SQL database clouds, and supercomputing clouds. Each cloud ecosystem has its own hardware, software, conferences, and business markets. The broad nature of business big data challenges makes it unlikely that one cloud ecosystem can meet every need, so solutions are likely to require tools and techniques from more than one cloud ecosystem. The MIT SuperCloud was developed to address this challenge. To our knowledge, the MIT SuperCloud is the only deployed cloud system that allows all four ecosystems to co-exist without sacrificing performance or functionality.
Big data velocity stresses the rate at which data can be absorbed and meaningful answers produced. Led by the NSA, a Common Big Data Architecture (CBDA) was developed for the U.S. government based on the Google Bigtable NoSQL approach and is now in wide use. MIT LL played a leading role in developing the CBDA and is a leader in adapting the CBDA to a variety of big data challenges.
Big data variety may present the largest challenge and the greatest opportunities. The promise of big data is the ability to correlate diverse and heterogeneous data to form new insights. The centerpiece of the CBDA is the NSA-developed Apache Accumulo database (capable of millions of entries/second) and the MIT LL-developed D4M schema. These technologies allow vast quantities of highly diverse data (text, computer logs, social media data, etc.) to be automatically ingested into a common schema that enables rapid query and correlation of every element.
The talk will concentrate on how we utilize the aforementioned technologies in our mission to apply advanced technology to problems of national security.
Accumulo Summit 2014: Addressing big data challenges through innovative architecture, databases and software
1. Addressing Big Data Challenges through Innovative Architecture, Databases and Software
UNCLASSIFIED
Dr. Vijay Gadepally
vijayg@ll.mit.edu
Accumulo Summit
College Park, MD
June 12, 2014
This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
2. Accumulo Summit
VNG - 2
Acknowledgements
• Bill Arcand
• Bill Bergeron
• David Bestor
• Chansup Byun
• Matt Hubbell
• Jeremy Kepner
• Pete Michaleas
• Julie Mullen
• Andy Prout
• Albert Reuther
• Tony Rosa
• Charles Yee
And many more …
4. Accumulo Summit
Introduction to MIT Lincoln Laboratory
Established 1951
Lincoln Laboratory is a Department of Defense federally funded research and development center (FFRDC) operated by MIT
5. Accumulo Summit
MIT Lincoln Laboratory
Purpose: Technology in Support of National Security
Core Work Areas:
• Sensors
• Information Extraction
• Communications
• Integrated Sensing and Decision Support (Secure – Countermeasure Resistant)
Current Mission Areas:
• Space Control
• Intelligence, Surveillance, and Reconnaissance Systems and Technology
• Tactical Systems
• Air and Missile Defense Technology
• Homeland Protection
• Air Traffic Control
• Communication Systems
• Advanced Technology
• Cyber Security and Information Sciences
• Engineering
6. Accumulo Summit
MIT Lincoln Laboratory
- Focus Areas -
[Figure: focus areas diagram with columns Methodology and Outputs. Labels: Rapid Prototyping; Trusted Government Advisor; University Affiliations; System Analysis; Testing; Technology Prototyping; Assessments to Senior Leadership; Architects and Requirements Definition; Advanced Technology Prototypes; Broad Multi-Mission Technology Strength; Architecture Analysis and Test; Conferences, Workshops; Outreach.]
• Highly instrumented
• Field / operational testing
• Capabilities against existing & future threats
• Rapid development
• Operationally relevant
• Validated by testing
8. Accumulo Summit
Common Big Data Challenge
[Figure: a gap between users (Warfighters, Operators, Analysts) and data (Maritime, Ground, Space, C2, Cyber, Text and Social Media, Air, HUMINT, Weather).]
[Figure: world total information stored (MB) and world total computing capacity (millions of instructions per second, MIPS) by year, 1986 to 2013, on logarithmic axes; annotated values 3 × 10^14 MB and 7 × 10^12 MIPS.]
Source: M. Hilbert and P. López, Science, Vol. 332 (2011) and associated online material
Rapidly increasing
- Data volume
- Data velocity
- Data variety
9. Accumulo Summit
Common Big Data Architecture
[Figure: Common Big Data Architecture. Users: Warfighters, Operators, Analysts. Components: Analytics (A, B, C, D, E), Computing, Web, Files, Scheduler, Ingest & Enrichment, Databases. Data: Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather.]
10. Accumulo Summit
Common Big Data Architecture
- Data Volume: Cloud Computing -
[Figure: Common Big Data Architecture diagram (users, analytics, computing, web, files, scheduler, ingest & enrichment, databases, data sources), shown with the MIT SuperCloud spanning the Enterprise Cloud, Big Data Cloud, Database Cloud, and Compute Cloud.]
MIT SuperCloud merges four clouds
LLSuperCloud: Sharing HPC Systems for Diverse Rapid Prototyping, Reuther et al., IEEE HPEC 2013
11. Accumulo Summit
Common Big Data Architecture
- Data Velocity: Accumulo Database -
[Figure: Common Big Data Architecture diagram (users, analytics, computing, web, files, scheduler, ingest & enrichment, databases, data sources).]
Lincoln benchmarking validated Accumulo performance
12. Accumulo Summit
Common Big Data Architecture
- Data Variety: D4M Schema -
[Figure: Common Big Data Architecture diagram (users, analytics, computing, web, files, scheduler, ingest & enrichment, databases, data sources), with an inset labeled columns, rows, Σ, and raw.]
D4M demonstrated a universal approach to diverse data
intel reports, DNA, health records, publication citations, web logs, social media, building alarms, cyber, … all handled by a common 4 table schema
D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., IEEE HPEC 2013
13. Accumulo Summit
The Cloud within the Common Big Data Architecture
[Figure: Common Big Data Architecture diagram (users, analytics, computing, web, files, scheduler, ingest & enrichment, databases, data sources), with a region labeled The “Cloud”.]
14. Accumulo Summit
Four Ecosystems Dominate Large Scale Cloud Computing
Enterprise: Interactive, On-demand, Virtualization
Big Data: Java, Distributed, Easy admin
Database: Indexing, Search, Atomic
Supercomputing: High performance, Scientific computing, Batch jobs
• Each cloud ecosystem supports many multi-$B industries
• Each cloud ecosystem uses different software and hardware
15. Accumulo Summit
MIT SuperCloud
Enterprise: Interactive, On-demand, Virtualization
Big Data: Java, Distributed, Easy admin
Database: Indexing, Search, Atomic
Supercomputing: High performance, Scientific computing, Batch jobs
• MIT SuperCloud adds virtual machines and additional security
• Combines all four ecosystems without sacrificing performance
16. Accumulo Summit
MIT SuperCloud
Core Technologies
Enterprise (VMware): Interactive, On-demand, Virtualization
Big Data (Hadoop): Java, Distributed, Easy admin
Database (SQL): Indexing, Search, Atomic
Supercomputing (MPI): High performance, Scientific computing, Batch jobs
• VMware is the main enterprise computing virtualization technology
• Message Passing Interface (MPI) is the primary supercomputing API
• Structured Query Language (SQL) is the primary database API
• Hadoop & Accumulo & D4M are core to government big data clouds
D4M = Dynamic Distributed Dimensional Data Model
18. Accumulo Summit
MIT SuperCloud
• Developed to address the challenges associated with big data volume
• Cloud system allows all four ecosystems of the cloud to exist within the same computational architecture
• Key Innovations:
– Shared HPC cloud capabilities
– High performance
– Reliable
• Brings the power of cloud computing to the HPC community
19. Accumulo Summit
MIT SuperCloud
Cloud Ecosystems
Enterprise (VMware): Interactive, On-demand, Virtualization
Big Data (Hadoop): Java, Distributed, Easy admin
Database (SQL): Indexing, Search, Atomic
Supercomputing (MPI): High performance, Scientific computing, Batch jobs
• Allows different architectures to be dynamically combined and tested
20. Accumulo Summit
MIT SuperCloud
[Figure: MIT SuperCloud system diagram (TX-E1). Components: Service Nodes, Compute Nodes, Network Storage, Scheduler, Monitoring System, Cluster Switch, LAN Switch, Project Data. Job types: Interactive Compute Job, Interactive VM Job, Interactive Database Job.]
21. Accumulo Summit
Cloud Computing @ MIT
• The cloud computing infrastructure at Lincoln Laboratory is based on the MIT SuperCloud infrastructure, which allows the different cloud ecosystems to coexist
• MIT SuperCloud architecture addresses the issues of big data volume
• Centerpiece of MIT SuperCloud: Accumulo database
23. Accumulo Summit
Apache Accumulo
• Highest performance open source database
• Contributed to Apache project by the NSA in 2011
• Used extensively for government applications
• Requires a schema for storing and organizing data to obtain full benefits
24. Accumulo Summit
Accumulo and the MIT SuperCloud
• Apache Accumulo is a high performance database used for a variety of purposes
– Helps address the big data velocity challenge
• Accumulo is the centerpiece of the Common Big Data Architecture developed by MIT Lincoln Laboratory
• Key features:
– Open Source
– High Performance
– Widely adopted
– Vibrant Developer Community
• MIT Lincoln Laboratory has developed a set of tools, D4M, to help researchers use Accumulo for novel research
25. Accumulo Summit
Common Big Data Architecture
- Data Velocity: Accumulo Database -
[Figure: Common Big Data Architecture diagram (users, analytics, computing, web, files, scheduler, ingest & enrichment, databases, data sources).]
27. Accumulo Summit
High Level Language: D4M
http://www.mit.edu/~kepner/D4M
[Figure: D4M (Dynamic Distributed Dimensional Data Model) sits between the Accumulo distributed database and associative arrays in a numerical computing environment. A query over Alice, Bob, Cathy, David, and Earl is shown next to a graph on nodes A, B, C, D, E.]
A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
28. Accumulo Summit
D4M
• The Dynamic Distributed Dimensional Data Model
– Supports database and computation systems that deal with Big Data
– Developed at Lincoln Laboratory
• Key Features:
– Applies linear algebra and signal processing techniques to databases through associative arrays
– D4M data schema offers a one-stop solution for most types of data sources and any type of database
– Low barrier to entry: API accessible even to those with minimal database and/or big-data background
29. Accumulo Summit
Associative Arrays
• Extends associative arrays to 2D and mixed data types
A('#a1','#b2') = 'same_tweet'
• Key innovation: 2D is 1-to-1 with triple store
('#a1','#b2','same_tweet')
• Enables composable mathematical operations
A + B    A - B    A & B    A|B    A*B
• Enables composable query operations via array indexing
A('#a1 b2',:)    A('#a1,',:)    A('#a* ',:)
A('#a1: b2',:)    A(1:2,:)    A == #b2
[Figure: the entry same_tweet at row #a1, column #b2.]
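The 1-to-1 correspondence between a 2D associative array and a triple store can be sketched in a few lines of Python. This is a toy illustration only, not D4M itself: the class `Assoc2D` and its method names are hypothetical, and glob matching stands in for the `'#a* '` style of indexing shown on the slide.

```python
from fnmatch import fnmatchcase


class Assoc2D:
    """Toy 2D associative array: (row key, column key) -> value of any type.

    Hypothetical stand-in for illustration; not the D4M implementation.
    """

    def __init__(self, triples=()):
        # Each (row, column, value) triple becomes exactly one array entry.
        self.data = {(row, col): val for row, col, val in triples}

    def triples(self):
        """Return the equivalent triple store, ordered by (row, column)."""
        return [(row, col, self.data[(row, col)]) for row, col in sorted(self.data)]

    def rows(self, pattern):
        """Keep rows whose key matches a glob pattern such as '#a*'."""
        return Assoc2D((r, c, v) for (r, c), v in self.data.items()
                       if fnmatchcase(r, pattern))

    def row_range(self, first, last):
        """Keep rows whose key lies in the inclusive range [first, last]."""
        return Assoc2D((r, c, v) for (r, c), v in self.data.items()
                       if first <= r <= last)

    def where_value(self, value):
        """Keep entries whose stored value equals `value`."""
        return Assoc2D((r, c, v) for (r, c), v in self.data.items() if v == value)


A = Assoc2D([("#a1", "#b2", "same_tweet"),
             ("#a2", "#b2", "same_user"),
             ("#c1", "#a1", 1)])
print(A.rows("#a*").triples())
print(A.where_value("same_tweet").triples())
```

Every query returns another `Assoc2D`, so queries can be chained, and `Assoc2D(A.triples())` rebuilds the same array from its triples.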
30. Accumulo Summit
Data Schema
• A structure described in a language supported by the database management system
• Accumulo supports triples
– How can we represent heterogeneous data types in a common data schema?
– Use the D4M schema
• D4M schema converts structured or unstructured raw data to the 3-tuple representation supported by Accumulo: (row, column) → value
– row is a unique identifier (often some variation of a time stamp)
– column is a unique representation of the data
– value is typically just ‘1’
• Usually use a 4 table representation
– The Edge Table, the Transpose Table, the Degree Table, the Raw Table
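A minimal sketch of how a record might be exploded into triples and spread over four tables, assuming plain Python dictionaries in place of Accumulo tables. The function names `explode` and `ingest`, the `field|value` column format, and the sample records are illustrative assumptions; the slides only state that the row is an identifier, the column represents the data, and the value is typically '1'.

```python
from collections import Counter


def explode(record_id, record):
    """Turn one record into (row, column, value) triples.

    Assumption for illustration: every field/value pair becomes its own
    column named 'field|value', and the stored value is just '1'.
    """
    triples = []
    for field, values in record.items():
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            triples.append((record_id, f"{field}|{value}", "1"))
    return triples


def ingest(records):
    """Build edge, transpose, degree, and raw tables from {record_id: record}."""
    edge, transpose, degree, raw = {}, {}, Counter(), {}
    for record_id, record in records.items():
        raw[record_id] = record                 # raw table: the original record
        for row, col, val in explode(record_id, record):
            edge[(row, col)] = val              # edge table: look up by row
            transpose[(col, row)] = val         # transpose table: look up by column
            degree[col] += 1                    # degree table: entries per column
    return edge, transpose, degree, raw


records = {
    "t1": {"user": "alice", "word": ["love", "oklahoma"]},
    "t2": {"user": "bob", "word": ["oklahoma"]},
}
edge, transpose, degree, raw = ingest(records)
print(degree.most_common(1))
```

The degree table lets a query check how common a column is before touching the larger edge table, and the transpose table makes column lookups as cheap as row lookups.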
32. Accumulo Summit
Composable Associative Arrays
• Key innovation: mathematical closure
– All associative array operations return associative arrays
• Enables composable mathematical operations
A + B    A - B    A & B    A|B    A*B
• Enables composable query operations via array indexing
A('alice bob ',:)    A('alice ',:)    A('al* ',:)
A('alice : bob ',:)    A(1:2,:)    A == 47.0
• Simple to implement in a library (~3500 lines) in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
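Mathematical closure can be illustrated with a small Python class whose operators always return another instance of the same class. This is a sketch under stated assumptions, not the D4M library: `NumericAssoc` is a hypothetical name, values are restricted to numbers, Python's `@` stands in for the slide's `A*B`, and `&` is assumed to yield 1 wherever both arrays have an entry.

```python
class NumericAssoc:
    """Toy numeric associative array keyed by (row, column) strings."""

    def __init__(self, entries=None):
        # Drop zeros so that only non-empty entries are stored (sparse).
        self.data = {k: v for k, v in (entries or {}).items() if v != 0}

    def __add__(self, other):
        keys = self.data.keys() | other.data.keys()
        return NumericAssoc({k: self.data.get(k, 0) + other.data.get(k, 0)
                             for k in keys})

    def __and__(self, other):
        # Assumed semantics: 1 wherever both arrays hold an entry.
        return NumericAssoc({k: 1 for k in self.data.keys() & other.data.keys()})

    def __matmul__(self, other):
        # Sparse matrix multiply: match self's column keys to other's row keys.
        out = {}
        for (i, k), a in self.data.items():
            for (k2, j), b in other.data.items():
                if k == k2:
                    out[(i, j)] = out.get((i, j), 0) + a * b
        return NumericAssoc(out)

    def transpose(self):
        return NumericAssoc({(c, r): v for (r, c), v in self.data.items()})


E = NumericAssoc({("t1", "word|love"): 1,
                  ("t1", "word|oklahoma"): 1,
                  ("t2", "word|oklahoma"): 1})
cooccurrence = E.transpose() @ E   # word-by-word co-occurrence counts
print(sorted(cooccurrence.data.items()))
```

Because every result is again a `NumericAssoc`, expressions such as `(E + E).transpose() @ E` compose without any conversion step, which is the property the slide calls closure.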
33. Accumulo Summit
Using D4M for Advanced Analytics
• D4M allows researchers to harness the versatility of the MIT SuperCloud architecture and speed of Apache Accumulo through the familiarity of high level languages such as MATLAB or GNU Octave.
• D4M schema provides an approach to mitigate challenges associated with big data variety
• D4M is used for a variety of applications across the Department of Defense and Intelligence Community
35. Accumulo Summit
Supporting National Security
-Rapid Solution Prototyping-
336592592584179712 2013-05-20 21:21:42 20798128 kiefpief web 3b77caf94bfc81fe I am sending love to Oklahoma. And actually -- to everyone who may need it. You are loved. And you are not alone. Promise. #PrayforOklahoma
336600956710027264 2013-05-20 21:54:56 35.99894978 -78.90660222 -8783842.7781526 4300476.86376416 22435220 RyanBLeslie Twitter for iPad 348803787 bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe: The devastation in Oklahoma is
…
Step 1: Start an instance of Accumulo and Ingest Data
Step 2: Find all tweets with keyword:
>>A = Tedge(Row(Tedge(:, 'word|#prayforoklahoma,')),:);
Step 3: Filter tweets by location:
>>B = A(:, 'latlon|+-003934,:,latlon|+-003979,');
Step 4: Visualize results:
>>Assoc2KML(B);
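The logic of Steps 2 and 3 can be sketched in Python with a dictionary standing in for the `Tedge` table. The helper names, the sample entries, and the `latlon|...` key values below are made up for illustration; only the query pattern (find rows holding a keyword column, then keep columns inside a lexicographic key range) follows the slide.

```python
def rows_with_column(table, column):
    """Row keys that have an entry in the given column."""
    return {r for (r, c) in table if c == column}


def select_rows(table, rows):
    """All entries belonging to the given set of row keys."""
    return {(r, c): v for (r, c), v in table.items() if r in rows}


def select_column_range(table, first, last):
    """Entries whose column key sorts within the inclusive range [first, last]."""
    return {(r, c): v for (r, c), v in table.items() if first <= c <= last}


# Hypothetical edge table: (tweet id, 'field|value') -> '1'
tedge = {
    ("t1", "word|#prayforoklahoma"): "1",
    ("t1", "latlon|+-003950"): "1",
    ("t2", "word|#prayforoklahoma"): "1",
    ("t2", "latlon|+-004100"): "1",
    ("t3", "word|hello"): "1",
    ("t3", "latlon|+-003960"): "1",
}

# Step 2: all entries of tweets that contain the keyword.
A = select_rows(tedge, rows_with_column(tedge, "word|#prayforoklahoma"))
# Step 3: keep only location columns inside the key range.
B = select_column_range(A, "latlon|+-003934", "latlon|+-003979")
print(B)
```

The range filter works on string order alone, which is why the location is encoded into the column key rather than stored as a numeric value.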
36. Accumulo Summit
Promoting big data discovery
-Domain Agnostic Analytics-
Model Background Data to Extract Signal from Observations
Goal: Find subgraph of interest using background model to identify noise
Example background model: Power Law Graph
[Figure: signal and noise in N-D space.]
[Figure: Degree Distribution, Count versus Degree on logarithmic axes, with dmax marked.]
Observed Data (Signal & Noise) - Background Model of Data (Noise) = Residual Data (Signal)
Big Data Filtering and Sampling
Detecting Subgraphs of Interest from Large Graphs
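The "observed minus background equals residual" idea can be sketched with a power-law background model for the degree distribution. The slide does not say how the background model is fitted, so the log-log least-squares fit below is a simple stand-in chosen for brevity, and all function names are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree d to the number of vertices that have degree d."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def fit_power_law(dist):
    """Fit count ~ c * d**(-alpha) by least squares on (log d, log count).

    A crude estimator used here only for illustration.
    """
    xs = [math.log(d) for d in dist]
    ys = [math.log(n) for n in dist.values()]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return math.exp(mean_y - slope * mean_x), -slope


def residual(observed, c, alpha):
    """Observed count minus the background model's expected count, per degree."""
    return {d: n - c * d ** (-alpha) for d, n in observed.items()}


# Background: counts that follow 100 / d; observation adds a bump at degree 50.
background = {1: 100, 10: 10, 100: 1}
c, alpha = fit_power_law(background)
observed = dict(background)
observed[50] = 7
res = residual(observed, c, alpha)
print(max(res, key=res.get))
```

Degrees where the residual is large are candidates for the signal; degrees that the background model explains leave a residual near zero.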
37. Accumulo Summit
Securing the Cloud
-The Lincoln Secure and Resilient Cloud-
[Figure: Common Big Data Architecture diagram (analytics, computing, web, files, scheduler, ingest & enrichment, databases), annotated with Secure and Resilient Communication + Provenance, Secure and Resilient Storage, and Secure and Resilient Processing.]
• Big Data systems are vulnerable to a variety of attacks
• Improve security of cloud systems by researching:
– Security in Communication and Provenance
– Security in Data Storage
– Security in Processing
– Security in the underlying architecture
38. Accumulo Summit
Ensuring Privacy
-Computing On Masked Data-
Big Data Veracity
[Figure: three panels based on the Common Big Data Architecture diagram.
Challenges: Remote Code Injection; Hypervisor Privilege Escalation; Cross VM Side Channels; Data Loss / Exfiltration; Data Integrity Attack.
Current Approaches: encrypted links and encrypted storage.
Vision: compute on encrypted data.]
Step 1: Mask data and ingest into database
>>put(Tedge, Mask(Aedge, maskcode));
Step 2: Query DB for results with masked queries
>>Aedge_mt = Tedge(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:);
>>Atxt_mt = TedgeTxt(Row(Tedge(:,StrMask('word|bieber ', maskcode))),:);
Step 3: Unmask Results
>>Aedge = Unmask(Aedge_mt, maskcode);
>>Atxt = Unmask(Atxt_mt, maskcode);
Use D4M and Computing on Masked Data (CMD) to protect the 4th V of Big Data: Veracity
• Big Data systems are vulnerable to a variety of attacks
• Currently encrypt data at rest but data in flight is in the clear
• Compute on Encrypted Data: Data is always protected by encryption throughout the system.
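The mask, query, unmask flow can be imitated in Python to show why an equality query still works on masked keys. This is not the CMD scheme from the talk, whose `Mask`, `StrMask`, and `Unmask` internals are not described on the slide: a deterministic keyed hash (HMAC-SHA256) plus a client-side codebook is swapped in as a simple stand-in, and the class names and sample data are hypothetical.

```python
import hashlib
import hmac


class Client:
    """Trusted side: holds the key and the codebook needed to unmask."""

    def __init__(self, key):
        self.key = key
        self.codebook = {}  # masked token -> original string

    def mask(self, text):
        # Deterministic: the same text always maps to the same token,
        # so equality matching still works on the untrusted side.
        token = hmac.new(self.key, text.encode("utf-8"), hashlib.sha256).hexdigest()
        self.codebook[token] = text
        return token

    def unmask(self, token):
        return self.codebook[token]


class MaskedTable:
    """Untrusted side: stores and matches masked tokens only."""

    def __init__(self):
        self.entries = {}

    def put(self, row, col, val):
        self.entries[(row, col)] = val

    def rows_with_column(self, col):
        return {r for (r, c) in self.entries if c == col}

    def select_rows(self, rows):
        return {(r, c): v for (r, c), v in self.entries.items() if r in rows}


client = Client(key=b"demo key, not a real secret")
table = MaskedTable()

# Step 1: mask row and column keys before they reach the table.
for row, col in [("t1", "word|bieber"), ("t1", "user|alice"), ("t2", "word|oklahoma")]:
    table.put(client.mask(row), client.mask(col), "1")

# Step 2: query with a masked column key.
hits = table.select_rows(table.rows_with_column(client.mask("word|bieber")))

# Step 3: unmask the results on the trusted side.
result = {(client.unmask(r), client.unmask(c)): v for (r, c), v in hits.items()}
print(result)
```

The table never sees a plaintext key, yet it can still answer the query, because masking preserves equality. Deterministic masking also leaks which entries are equal, a trade-off any scheme of this kind has to weigh.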
40. Accumulo Summit
Summary
Mission Areas: Air and Missile Defense, Homeland Protection, Air Traffic Control, Communication Systems, Advanced Technology, Space Control, ISR Systems and Technology, Tactical Systems, Cyber Security, Engineering
Data Sources: Maritime, Ground, Space, C2, Cyber, OSINT, Air, HUMINT, Weather
• Lincoln Laboratory missions collect and process vast amounts of data from many sources
• MIT Lincoln Laboratory makes use of innovations in system architecture (MIT SuperCloud), database technologies (Apache Accumulo) and software (D4M) to develop technology in support of national security
Lincoln Laboratory is always interested in technical exchange with the big data community!
42. Accumulo Summit
Cyber Security and Information Sciences
• Cyber Situational Awareness: Correlation and visualization of cyber alert data makes it possible to detect and understand attacks on large enterprise networks.
• Cyber Testing and Range Development: Lincoln Laboratory builds, supports, and uses cyber ranges to evaluate the performance of cyber security technology.
• Cyber Security Metrics: Metrics are defined and measured to estimate the defensive posture of enterprise-class networks.
• Anti-Tamper Hardware: Physically unclonable functions are used to embed cryptographic key material in a coating around a computing module, permitting detection of tampering.
• Net-Centric Operations: Research and prototyping of Service-Oriented Architectures that enable the dynamic composition of systems involving complex sensors, processing and decision-support elements.
• Human Language Technology: Algorithms are developed and implemented for speech and biometric applications, including language/speaker identification, machine translation, and face comparison.