2. About Me
• James Salter
• Former: PhD, University of Surrey
▫ Resource discovery in peer-to-peer networks
▫ Recommender systems
• Current: Applied Researcher
▫ Data mining algorithms, information fusion
▫ Hadoop
▫ Large graphs
▫ “other interesting things”
3. Outline
• What is Accumulo?
• Comparison with Relational Databases
• Architecture
• Potential Applications
4. Apache Hadoop
• Framework for distributed computing
• Clusters of commodity machines
• MapReduce
▫ Best-known sub-project
▫ Batch processing of bulk data
▫ (Potentially) large files of output
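A minimal sketch of the map, shuffle, and reduce steps in plain Python, to make the batch model concrete. It does not use Hadoop; the order records and the function names (`map_phase`, `reduce_phase`, `run_job`) are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical input records: (customer, item, quantity) order lines.
orders = [
    ("Alice", "Magazine", 1),
    ("Alice", "Ticket", 5),
    ("Bob", "Shirt", 1),
    ("Mary", "DVD", 1),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    customer, _item, quantity = record
    yield customer, quantity


def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


def run_job(records):
    # Shuffle: group mapper output by key, as a framework would between phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    # Reduce each group; sorted() only makes the output order deterministic.
    return [reduce_phase(key, groups[key]) for key in sorted(groups)]


totals = run_job(orders)
```

On a real cluster the map and reduce calls run in parallel on many machines and the results land in output files rather than a Python list.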
5. What is Accumulo?
• A distributed key/value store
▫ Runs in parallel across a Hadoop cluster
• Very scalable
▫ Trillions of records, tens of petabytes of data
• Cell-level security
▫ Every data item has a security label
• Open source implementation of Google’s Bigtable design
▫ Original development by NSA
▫ Now a top-level Apache project
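A rough sketch of what cell-level security means for a reader, in plain Python rather than the Accumulo API. It assumes a simplified label model in which a label is the set of authorizations a reader must hold; Accumulo's real labels allow richer boolean expressions. The label names `hr` and `support` are invented for illustration.

```python
# Each cell is (key, label, value). A label here is the set of
# authorizations a reader must hold: a simplification (AND only).
cells = [
    (("Alice", "Birthday"), frozenset({"hr"}), "12/03/45"),
    (("Alice", "Phone"), frozenset({"hr", "support"}), "794838"),
    (("Alice", "Ticket"), frozenset(), "5"),  # empty label: visible to all
]


def scan(cells, authorizations):
    """Return only the cells whose label is satisfied by the reader."""
    auths = frozenset(authorizations)
    return [(key, value) for key, label, value in cells if label <= auths]


support_view = scan(cells, {"support"})
hr_view = scan(cells, {"hr"})
```

Two readers scanning the same table see different cells, without the application having to filter anything itself.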
6. Relational schema to Accumulo
(- marks an empty cell)

CustName  Birthday  Phone
Alice     12/03/45  794838
Bob       09/09/67  -
Mary      23/04/83  975838

CustName  ItemID  Quantity
Alice     17      1
Alice     89      5
Bob       92      1
Mary      12      1

ItemID  ItemName
12      DVD
17      Magazine
89      Ticket
92      Shirt

CustName  Birthday  Phone   DVD  Magazine  Ticket  Shirt
Alice     12/03/45  794838  -    1         5       -
Bob       09/09/67  -       -    -         -       1
Mary      23/04/83  975838  1    -         -       -
7. Relational schema to Accumulo
Row,Column        Value
{Alice,Birthday}  12/03/45
{Alice,Phone}     794838
{Alice,Magazine}  1
{Alice,Ticket}    5
{Bob,Birthday}    09/09/67
{Bob,Shirt}       1
...               ...
• Nulls are not stored
• Easy to add new columns, e.g. {Bob,Book}

(- marks an empty cell)
CustName  Birthday  Phone   DVD  Magazine  Ticket  Shirt
Alice     12/03/45  794838  -    1         5       -
Bob       09/09/67  -       -    -         -       1
Mary      23/04/83  975838  1    -         -       -
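The mapping on this slide can be sketched in a few lines of plain Python (no Accumulo client involved). The empty cells are placed according to the order tables on the previous slide; the value stored for `{Bob,Book}` is a made-up example.

```python
COLUMNS = ["Birthday", "Phone", "DVD", "Magazine", "Ticket", "Shirt"]

# The wide, sparse table; None stands for an empty (null) cell.
wide_rows = {
    "Alice": ["12/03/45", "794838", None, "1", "5", None],
    "Bob": ["09/09/67", None, None, None, None, "1"],
    "Mary": ["23/04/83", "975838", "1", None, None, None],
}


def to_key_values(rows, columns):
    """Flatten wide rows into {(row, column): value}, skipping nulls."""
    pairs = {}
    for row_id, cells in rows.items():
        for column, value in zip(columns, cells):
            if value is not None:
                pairs[(row_id, column)] = value
    return pairs


table = to_key_values(wide_rows, COLUMNS)

# Adding a new column needs no schema change: just write another pair.
table[("Bob", "Book")] = "1"  # hypothetical value
```

Only 9 of the 18 cells in the wide table hold data, so only 9 pairs are stored before the `{Bob,Book}` entry is added.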
8. Table Structure
• Tables contain key/value pairs sorted by key
• Split into tablets, distributed across a cluster
▫ Tablets reflect a portion of the table’s keyspace
Tablet 1:
Key               Value
{Alice,Birthday}  12/03/45
{Alice,Magazine}  1
{Alice,Phone}     794838
{Alice,Ticket}    5

Tablet 2:
Key               Value
{Bob,Birthday}    09/09/67
{Bob,Shirt}       1
...               ...
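A small sketch of sorting keys and cutting the keyspace into tablets, again in plain Python. The single split point and the rule that a split point is the last row of its tablet are assumptions of this sketch; `tablet_for` is a hypothetical helper.

```python
from bisect import bisect_left

pairs = {
    ("Alice", "Birthday"): "12/03/45",
    ("Alice", "Phone"): "794838",
    ("Alice", "Magazine"): "1",
    ("Alice", "Ticket"): "5",
    ("Bob", "Birthday"): "09/09/67",
    ("Bob", "Shirt"): "1",
}

# Assumed split points: each one is the last row ID held by a tablet.
split_points = ["Alice"]


def tablet_for(row_id, split_points):
    """Index of the tablet whose row range contains row_id."""
    return bisect_left(split_points, row_id)


# Walk the keys in sorted order and drop each into its tablet.
tablets = {}
for key in sorted(pairs):
    tablets.setdefault(tablet_for(key[0], split_points), []).append(key)
```

Because keys are sorted, all of a row's columns sit next to each other, and finding the tablet for a row is a binary search over the split points.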
9. Tablet Server
• Hosts one or more tablets
▫ Not necessarily for the same table
• Tablets store references to ISAM (Indexed Sequential Access Method) files in HDFS
▫ Key/values stored in ISAM files
[Figure: a Tablet Server hosting three tablets (Table A, RowIDs g-n; Table F, RowIDs a-c; Table J, RowIDs x-zz), with three ISAM files in HDFS]
10. Master
• Detects Tablet Server failures
▫ Migrates tablets to other Tablet Servers
• Responsible for load balancing
▫ Assigns tablets to Tablet Servers
▫ Instructs Tablet Servers to migrate tablets
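A toy sketch of the failure-handling idea: when a Tablet Server disappears, its tablets are handed to the survivors. The greedy "least-loaded server first" policy, the server names, and the tablet `A:a-f` are assumptions for illustration; the slides do not describe the actual balancing policy.

```python
def reassign(assignments, failed_server):
    """Move tablets from a failed server to the least-loaded survivors."""
    survivors = {
        server: list(tablets)
        for server, tablets in assignments.items()
        if server != failed_server
    }
    if not survivors:
        raise RuntimeError("no live tablet servers")
    for tablet in assignments.get(failed_server, []):
        # Pick the server with the fewest tablets; ties break by name.
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(tablet)
    return survivors


# Hypothetical cluster state: server name -> hosted tablets.
assignments = {
    "ts1": ["A:g-n", "F:a-c", "J:x-zz"],
    "ts2": ["A:a-f"],
    "ts3": [],
}

after = reassign(assignments, "ts1")
```

Since the key/values live in files in HDFS, moving a tablet in this picture is a change of ownership rather than a copy of the data.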
11. Potential Applications
• Massive datastore
▫ Interactive retrieval of MapReduce results
• Graph database/graph mining
▫ Data input to Google Pregel clones (e.g. Giraph)
• Machine learning/classification
▫ Good for storing sparse feature vectors
• Not good for applications involving JOIN
▫ Limited joins possible – Intersecting Iterator
▫ Combine with Hive, Impala, etc.
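To picture what a "limited join" can look like, here is a sketch that intersects sorted lists of row IDs, in plain Python. It is not Accumulo's Intersecting Iterator, only the general idea of walking several sorted lists in step; the index contents (including the customer "Carol") are invented.

```python
def intersect_sorted(*id_lists):
    """Return the IDs present in every sorted list, via a merge-style walk."""
    if not id_lists:
        return []
    result = []
    positions = [0] * len(id_lists)
    while all(pos < len(ids) for pos, ids in zip(positions, id_lists)):
        current = [ids[pos] for pos, ids in zip(positions, id_lists)]
        highest = max(current)
        if all(value == highest for value in current):
            # Every list agrees on this ID: it is part of the intersection.
            result.append(highest)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance every list that is behind the highest candidate.
            positions = [
                pos + 1 if ids[pos] < highest else pos
                for pos, ids in zip(positions, id_lists)
            ]
    return result


# Hypothetical index: item name -> sorted customers who bought it.
index = {
    "Magazine": ["Alice", "Carol", "Mary"],
    "Ticket": ["Alice", "Bob", "Mary"],
}

both = intersect_sorted(index["Magazine"], index["Ticket"])
```

This works well when the lists are already sorted, which is what a sorted key/value store provides; general-purpose JOINs are better left to tools such as Hive or Impala.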
12. Conclusion
• Accumulo is a key-value datastore
• Data layout very different from Relational DBs
• Distributed architecture on top of Hadoop
• Many uses aside from “just” a simple store