This talk introduces our recent research on column-store databases. Using the Timestamped Binary Association Table (TBAT), write speed in column-store databases can be significantly improved. We also discuss how to further improve read speed on a TBAT after many updates. Finally, we demonstrate how to apply TBAT to big data file systems.
5. Row-Based to Column-Store
YOUNGSTOWN STATE UNIVERSITY
Fig. 1 customer Data in Row-Based and Column-Store (BAT) Format
id name balance
1 Alissa 100.00
2 Bob 200.00
3 Charles 300.00
(a) Row-Based Table customer
... much faster in a column-store database. Another benefit of column-store databases is data compression, which can reach a higher compression rate and higher speed than in traditional row-based databases. One of the major reasons is that the information entropy of the data in a single column is lower than that of row-based data. Optimizing write operations in a column-store ...
oid int
101 1
102 2
103 3
(b) BAT customer id
oid varchar
101 Alissa
102 Bob
103 Charles
(c) BAT customer name
oid float
101 100.00
102 200.00
103 300.00
(d) BAT customer balance
A BUN consists of a pair (oid, value).
Mapping rules: relational data -> column-store
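The mapping in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rows_to_bats`, the `first_oid` parameter, and the use of plain lists of tuples are hypothetical and not taken from the talk. Only the data and the oid values 101 to 103 come from the figure.

```python
def rows_to_bats(table, columns, rows, first_oid=101):
    """Decompose a row-based table into one BAT per column.

    Each BAT is a list of BUNs, i.e. (oid, value) pairs. The same oid
    identifies the same row across all BATs of the table.
    """
    bats = {f"{table}_{col}": [] for col in columns}
    for offset, row in enumerate(rows):
        oid = first_oid + offset  # assumed: oids are assigned sequentially
        for col, value in zip(columns, row):
            bats[f"{table}_{col}"].append((oid, value))
    return bats


columns = ["id", "name", "balance"]
rows = [(1, "Alissa", 100.00), (2, "Bob", 200.00), (3, "Charles", 300.00)]
bats = rows_to_bats("customer", columns, rows)

for name, bat in bats.items():
    print(name, bat)
```

Each row contributes one BUN to every BAT, so a row can be rebuilt by joining the BATs on oid.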
8. Traditional Update on BAT
In a traditional BAT, an update by a given OID
involves two phases:
1. Search for the location in the BAT by OID (time-consuming).
2. Update the value at the target location.
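The two phases can be sketched as follows. The slide does not say how the search is performed, so the linear scan here is an assumption, and `traditional_update` is a hypothetical name used only for illustration.

```python
def traditional_update(bat, oid, new_value):
    """Update a BAT in place: locate the BUN by oid, then overwrite it."""
    # Phase 1: search for the position of the BUN with the given oid.
    # A linear scan is assumed; its cost grows with the size of the BAT.
    for pos, (bun_oid, _) in enumerate(bat):
        if bun_oid == oid:
            # Phase 2: overwrite the value at the target location.
            bat[pos] = (oid, new_value)
            return pos
    raise KeyError(f"oid {oid} not found")


customer_balance = [(101, 100.00), (102, 200.00), (103, 300.00)]
pos = traditional_update(customer_balance, 102, 201.00)
print(pos, customer_balance)
```

The search in phase 1 dominates the cost, which is the step the timestamped approach tries to avoid.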
12. AOC Update Example
Example:
Update query on customer table:
update customer set balance=201.00
where id=2
Current timestamp is time2 (>time1).
The newest TBUN for 201.00 is appended to the end of TBAT customer_balance
... original value to 201.00. Instead of seeking the position of the record with oid=102, the AOC update directly appends a new tuple (time2, 102, 201.00) at the end of the TBAT. The timestamp at which the AOC update is performed is assumed to be time2, and 201.00 is the newly updated value. The TBAT customer_balance after the AOC update is illustrated in Table 3.
Table 3: TBAT customer balance after AOC Update
optime oid float
time1 101 100.00
time1 102 200.00
time1 103 300.00
time2 102 201.00
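A minimal sketch of this append-style update, reproducing Table 3. Several things are assumptions made for illustration: timestamps are modeled as the integers 1 and 2 for time1 and time2, the TBAT is a plain list of (optime, oid, value) tuples, and `read_latest` is a hypothetical helper that does not appear in the slides.

```python
TIME1, TIME2 = 1, 2  # assumed integer stand-ins for time1 < time2


def aoc_update(tbat, optime, oid, new_value):
    """Append a new TBUN (optime, oid, value); no search, no overwrite."""
    tbat.append((optime, oid, new_value))


def read_latest(tbat, oid):
    """Return the value carrying the largest optime for the given oid."""
    versions = [(optime, value) for optime, o, value in tbat if o == oid]
    if not versions:
        raise KeyError(f"oid {oid} not found")
    return max(versions, key=lambda v: v[0])[1]


customer_balance = [
    (TIME1, 101, 100.00),
    (TIME1, 102, 200.00),
    (TIME1, 103, 300.00),
]

# update customer set balance=201.00 where id=2  (id=2 maps to oid=102)
aoc_update(customer_balance, TIME2, 102, 201.00)

print(customer_balance[-1])
print(read_latest(customer_balance, 102))
```

The old tuple for oid 102 stays in place, so a reader has to pick the version with the newest timestamp. That is why read speed becomes a concern after many updates.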
27. TBAT (Timestamped BAT)
TBAT in HDFS:
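struct TBUN {
    TIMESTAMP optime;          /* time at which the operation was performed */
    ROWID oid;                 /* object id of the row */
    USER_DEFINED_TYPE attrv;   /* attribute value */
};
struct TBAT_slip {
    /* one HDFS slip holds at most max_size_per_HDFS_slip TBUNs */
    struct TBUN tbuns[max_size_per_HDFS_slip];
};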
• No need for any global pre-sorting or indexing
• 'attrv' can be any user-defined type, which makes it flexible enough to define arbitrary kinds of schema
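The layout above can be mimicked in memory as a rough sketch. Nothing here touches HDFS. The integer timestamps, the `to_slips` helper, and the slip capacity of 3 are assumptions chosen only to show that TBUNs are cut into slips in arrival order, with no sorting or indexing.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TBUN:
    optime: int  # timestamp of the operation (an integer is assumed here)
    oid: int     # object id of the row
    attrv: Any   # attribute value of any user-defined type


def to_slips(tbuns: List[TBUN], max_size_per_slip: int) -> List[List[TBUN]]:
    """Cut a TBAT into fixed-capacity slips, keeping arrival order."""
    return [
        tbuns[i:i + max_size_per_slip]
        for i in range(0, len(tbuns), max_size_per_slip)
    ]


tbat = [
    TBUN(1, 101, 100.00),
    TBUN(1, 102, 200.00),
    TBUN(1, 103, 300.00),
    TBUN(2, 102, 201.00),  # appended update
]
slips = to_slips(tbat, max_size_per_slip=3)
print([len(s) for s in slips])
```

Because `attrv` is typed as `Any`, the same structure can hold floats, strings, or nested records, in the spirit of the user-defined type on the slide.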
28. AMO Update (logical)
Example:
Update query on customer table:
update customer set balance=201.00 where id=2
Current timestamp is time2 (>time1).
The newest TBUN for 201.00 is appended to the end of TBAT customer_balance
36. New Challenges
• New Index on C-S DBs
  • Local and global
  • Searching
  • Data Cleaning
  • Parallel Processing
• Big Data
  • Searching
  • Data Cleaning
  • Auto Mapping
  • To index or not to index?
• Broader Applications
  • Scientific Data Management
  • Big Data Analytics
  • Machine Learning
  • OLAP
  • OLTP
  • HPC
  • HTC
YOUNGSTOWN STATE UNIVERSITY, OH, USA