Adam Fuchs' Accumulo Talk at NoSQL Now! 2013
1. Securely explore your data
SQRRL ENTERPRISE + APACHE ACCUMULO:
A secure, scalable, real-time analysis framework
Adam Fuchs, CTO
Sqrrl Data, Inc.
August 21, 2013
2. OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks
© 2013 Sqrrl | All Rights Reserved | Proprietary and Confidential
3. TWO HALVES OF REAL-TIME
Data-Driven Real-Time: reduce event-to-reaction time
Query-Driven Real-Time: reduce ingest-to-query latency
4. Data-Driven + Query-Driven Real-Time Ecosystem
[Diagram: Data, SPE, NoSQL+, Actions, Dashboards, and Interactive Analysis Tools (Discovery + Forensics), connected by numbered flows 1 to 6]
1. SPE queries NoSQL to enrich streaming data
2. SPE persists results in NoSQL for future query
3. SPE takes action automatically
4. SPE issues data-driven alerts
5. Sqrrl provides context for dashboards
6. Analysis tools use Sqrrl to search and manipulate historical data
5. This talk focuses on the database.
[Diagram and legend repeated from slide 4: Data, SPE, NoSQL+, Actions, Dashboards, Interactive Analysis Tools (Discovery + Forensics)]
6. OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks
7. ACCUMULO DATA FORMAT
An Accumulo key is a 5-tuple, consisting of:
- Row: Controls Atomicity
- Column Family: Controls Locality
- Column Qualifier: Controls Uniqueness
- Visibility Label: Controls Access
- Timestamp: Controls Versioning
Accumulo Key/Value Example
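The slide's example image did not survive extraction. As a stand-in, here is a minimal Python sketch of the 5-tuple and its sort order. The `Key` class, the `sort_order` helper, and the sample rows are illustrative assumptions, not the Accumulo client API. The one behavior modeled is that keys sort ascending on row, family, qualifier, and visibility, and descending on timestamp, so the newest version of a cell comes first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Illustrative model of the 5-tuple key (not the Accumulo client class)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_order(key):
    # Ascending on the first four fields, descending on timestamp,
    # so the most recent version of a cell sorts first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


# Hypothetical sample entries: two versions of one cell plus another row.
entries = {
    Key("user42", "profile", "email", "admin", 1): "old@example.com",
    Key("user42", "profile", "email", "admin", 2): "new@example.com",
    Key("user17", "profile", "name", "public", 1): "Ada",
}

ordered = sorted(entries, key=sort_order)
```

Iterating over `ordered` visits `user17` before `user42`, and within the `user42` email cell the timestamp-2 version precedes the timestamp-1 version.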
8. ACCUMULO TABLETS
Collections of KV pairs form Tables
Tables are partitioned into Tablets
Metadata tablets hold info about other tablets, forming a 3-level hierarchy
A Tablet is a unit of work for a Tablet Server
[Diagram: tablet hierarchy]
Well-Known Location (zookeeper)
Root Tablet: -∞ to ∞
Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
Table: Adam’s Table
- Data Tablet: -∞ : thing
- Data Tablet: thing : ∞
Table: Encyclopedia
- Data Tablet: -∞ : Ocelot
- Data Tablet: Ocelot : Yak
- Data Tablet: Yak : ∞
Table: Foo
- Data Tablet: -∞ to ∞
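The lookup implied by this hierarchy can be sketched as a binary search over sorted split points, done once at the metadata level and once at the data level. This is a simplified illustration, not Accumulo's actual metadata encoding: the `table:row` key format simply mirrors the slide's “Encyclopedia:Ocelot” label, and `find_tablet`, `locate`, and the split lists are hypothetical names built from the slide's example. Each tablet is assumed to cover the range from the previous split (exclusive) to its own end row (inclusive).

```python
from bisect import bisect_left


def find_tablet(split_points, row):
    """Index of the tablet containing `row`.

    `split_points` is a sorted list of tablet end rows. Tablet i covers
    (split_points[i-1], split_points[i]]; the last tablet runs to +infinity.
    """
    return bisect_left(split_points, row)


# Split points taken from the slide's example tables.
METADATA_SPLITS = ["Encyclopedia:Ocelot"]
DATA_SPLITS = {
    "Adam's Table": ["thing"],
    "Encyclopedia": ["Ocelot", "Yak"],
    "Foo": [],
}


def locate(table, row):
    """Return (metadata tablet index, data tablet index) for a table row."""
    meta_index = find_tablet(METADATA_SPLITS, f"{table}:{row}")
    data_index = find_tablet(DATA_SPLITS[table], row)
    return meta_index, data_index
```

For example, `locate("Encyclopedia", "Panda")` returns `(1, 1)`: the second metadata tablet and the Ocelot : Yak data tablet.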
12. ITERATOR FRAMEWORK
Iterator Operations:
- File Reads
- Block Caching
- Merging
- Deletion
- Isolation
- Locality Groups
- Range Selection
- Column Selection
- Cell-level Security
- Versioning
- Filtering
- Aggregation
- Partitioned Joins
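Several of these operations compose naturally as a stack of generators over sorted key/value streams. The sketch below shows three of them (merging, versioning, filtering) in plain Python. It is an analogy for how the stack is layered, not the Accumulo iterator interface; the function names, the `(row, column, negated timestamp)` key layout, and the sample data are assumptions made for illustration.

```python
import heapq
from itertools import groupby


def merged(*sources):
    """Merging: combine already-sorted (key, value) sources into one stream."""
    return heapq.merge(*sources)


def newest_version_only(stream):
    """Versioning: keep only the first (newest) entry per (row, column)."""
    for _, versions in groupby(stream, key=lambda kv: kv[0][:2]):
        yield next(versions)


def value_filter(stream, predicate):
    """Filtering: pass through entries whose value satisfies the predicate."""
    return (kv for kv in stream if predicate(kv[1]))


# Keys are (row, column, -timestamp) so that newer versions sort first.
memory_map = [(("a", "x", -3), "v3"), (("b", "x", -1), "b1")]
rfile = [(("a", "x", -1), "v1"), (("c", "x", -2), "c2")]

stack = value_filter(
    newest_version_only(merged(memory_map, rfile)),
    lambda value: value != "b1",
)
result = list(stack)
```

Here `result` holds the newest version of row `a` and the single entry for row `c`; the older `v1` is dropped by versioning and `b1` by the filter.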
14. ACCUMULO THROUGHPUT
Scan: up to 1M entries/s per node
Ingest: up to 500K entries/s per node
[Diagram: Ingesters (Batch Writer) send Input to Tablet Servers; each Tablet Server contains an In-Memory Map, Compaction Iterators, RFiles, and Scan Iterators; Queriers (Scanner/Batch Scanner) receive Output. Latency labels in the figure: ~ms, ms - min]
Read-Modify-Write Latency: ~ms
>1K entries/s challenging with R-M-W
15. SQRRL ENTERPRISE
Built on Apache Accumulo
Bulk Processing Integration
Exploratory / Operational Apps
Graph + Document I/O
Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Server
• Sqrrl proprietary
• Automated indexing
• Custom iterators
• Lucene integration
• Security extensions
Accumulo RPC (Sorted Key/Value I/O)
• Open source (including Sqrrl contributions)
Hadoop RPC (File I/O)
• Open source or commercial distributions
16. OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks
17. DATA-CENTRIC SECURITY
Definition: Data carries with it information that is required to make policy decisions on its releasability.
[Diagram: User 1, User 2, Sqrrl/Accumulo]
18. SECURITY
Example Accumulo Key/Value Pairs
Accumulo is the only NoSQL database with cell-level access controls
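Cell-level access control works by checking each entry's visibility label against the authorizations of the user running the scan. The sketch below is a minimal Python evaluator for labels built from terms, `&`, `|`, and parentheses. It is not Accumulo's implementation: `is_visible` is a hypothetical helper, it does no syntax validation, and it evaluates operators left to right, which is only safe when `&` and `|` are never mixed without parentheses.

```python
import re

# Terms are runs of letters, digits, and a few punctuation characters;
# everything else of interest is an operator or a parenthesis.
TOKEN = re.compile(r"[A-Za-z0-9_\-:./]+|[&|()]")


def is_visible(expression, authorizations):
    """Return True if `authorizations` satisfy the visibility `expression`."""
    if not expression:
        return True  # an empty label is visible to everyone
    tokens = TOKEN.findall(expression)
    pos = 0

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            inner = parse_expr()
            pos += 1  # skip the closing ")"
            return inner
        return token in authorizations

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    return parse_expr()
```

With the label `admin&(audit|ops)`, a user holding `{"admin", "ops"}` sees the cell, while a user holding only `{"ops"}` does not.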
20. OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks
23. FORWARD AND INVERTED INDEX
Table:             Forward Index    Inverted Index
Row:               <UUID>           <Term>
Column Family:     <Type>           <UUID>
Column Qualifier:  <Field>          <Type+Field>
Value:             <Term>           <Digest of Event>
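One reading of this slide is that the forward index maps UUID / Type / Field to the Term, and the inverted index maps Term / UUID / Type+Field to a digest of the event. Under that assumption, the Python sketch below shows how a single event could fan out into both tables. The `index_event` function, the `(row, family, qualifier, value)` tuple layout, and the use of a truncated SHA-1 hash as the "digest" are all hypothetical choices for illustration.

```python
import hashlib
import uuid


def index_event(event_type, fields, event_id=None):
    """Build forward and inverted index entries for one event.

    Each entry is a (row, column family, column qualifier, value) tuple.
    """
    event_id = event_id or str(uuid.uuid4())
    # Assumption: the "digest" is a short hash of the event's fields.
    digest = hashlib.sha1(repr(sorted(fields.items())).encode()).hexdigest()[:8]
    forward, inverted = [], []
    for field, term in fields.items():
        forward.append((event_id, event_type, field, term))
        inverted.append((term, event_id, f"{event_type}+{field}", digest))
    # The inverted table is sorted by term, as it would be on disk.
    return forward, sorted(inverted)


forward, inverted = index_event(
    "login", {"user": "alice", "ip": "10.0.0.1"}, event_id="e1"
)

# Term lookup: scan the inverted index row for "alice" to find event IDs.
matching_ids = [family for row, family, _, _ in inverted if row == "alice"]
```

A term query scans one row of the inverted index to collect UUIDs, then fetches the full events from the forward index by UUID.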
26. D4M 2.0 SCHEMA FOR TWITTER DATA
Table:             Tedge                                TedgeT
Row:               <UUID>                               <value>
Column Family:     “stat”  “time”  “user”  “word”       “stat”  “time”  “user”  “word”
Column Qualifier:  <stat>  <time>  <user>  <word>       <UUID>  <UUID>  <UUID>  <UUID>
Value:             “1”     “1”     “1”     “1”          “1”     “1”     “1”     “1”
27. D4M 2.0 SCHEMA FOR TWITTER DATA
Table:             TedgeDegT                                  Ttext
Row:               <value>                                    <UUID>
Column Family:     “stat”    “time”    “user”    “word”       “text”
Column Qualifier:  “degree”  “degree”  “degree”  “degree”     -
Value:             <count>   <count>   <count>   <count>      <text>
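The four tables on these two slides can be produced from each tweet by "exploding" its fields into columns. The Python sketch below follows the layout as shown on the slides (family = field name, qualifier = field value or UUID); it is not the D4M library itself. The function names, the whitespace tokenizer, and the sample tweets in the usage note are assumptions for illustration.

```python
from collections import Counter


def explode(tweet_id, stat, time, user, text):
    """Return (Tedge, TedgeT, Ttext) entries for one tweet.

    Each entry is a (row, column family, column qualifier, value) tuple.
    """
    columns = [("stat", stat), ("time", time), ("user", user)]
    # Assumption: words are lower-cased, whitespace-separated, de-duplicated.
    columns += [("word", word) for word in sorted(set(text.lower().split()))]
    tedge = [(tweet_id, family, value, "1") for family, value in columns]
    tedge_t = [(value, family, tweet_id, "1") for family, value in columns]
    ttext = [(tweet_id, "text", "", text)]
    return tedge, tedge_t, ttext


def degree_table(tedge_rows):
    """Build TedgeDegT: how many tweets contain each (value, family) pair."""
    counts = Counter((value, family) for _, family, value, _ in tedge_rows)
    return [
        (value, family, "degree", str(count))
        for (value, family), count in sorted(counts.items())
    ]
```

For two hypothetical tweets, `explode("t1", "200", "2013-08-21", "adam", "accumulo is fast")` and `explode("t2", "200", "2013-08-21", "bob", "accumulo scales")`, passing the combined Tedge rows to `degree_table` yields a count of 2 for the word `accumulo` and 1 for the user `adam`. In a live system the degree counts would more likely be maintained incrementally than recomputed.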
28. D4M 2.0 SCHEMA FOR TWITTER DATA
Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013
29. OUTLINE
Two Halves of “Real-Time”
Accumulo and Sqrrl Technology
Data-Centric Security
Table Designs
Performance Benchmarks
30. ACCUMULO WITH D4M 2.0 SCHEMA PERFORMANCE
Maximizing throughput on an 8-node, 192-core cluster:
Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013
31. ACCUMULO SCALABILITY: GRAPH500 BENCHMARK
source: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
32. ATOMIC INCREMENT PERFORMANCE COMPARISON
Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
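The chart for this slide was lost in extraction, but the two approaches it compares can be contrasted in a small Python sketch. Both classes are toy in-memory models with hypothetical names, not HBase or Accumulo code, and they say nothing about the measured numbers. The point is structural: read-modify-write needs a read on every increment, while the combiner approach writes deltas blindly and sums them later at scan or compaction time.

```python
class ReadModifyWriteCounter:
    """Each increment reads the current value before writing the new one."""

    def __init__(self):
        self.store = {}
        self.reads = 0

    def increment(self, key, delta=1):
        self.reads += 1  # a read round trip is needed per increment
        self.store[key] = self.store.get(key, 0) + delta

    def scan(self):
        return dict(self.store)


class CombinerCounter:
    """Increments are blind appends; a summing combiner folds them later."""

    def __init__(self):
        self.log = []
        self.reads = 0  # stays at zero: the write path never reads

    def increment(self, key, delta=1):
        self.log.append((key, delta))

    def scan(self):
        # Scan-time combining: sum all deltas recorded for each key.
        totals = {}
        for key, delta in self.log:
            totals[key] = totals.get(key, 0) + delta
        return totals

    def compact(self):
        # Compaction-time combining: replace deltas with one entry per key.
        self.log = sorted(self.scan().items())
```

Feeding the same increments to both models gives the same totals from `scan()`, but only the read-modify-write model accumulates reads on the write path.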