SlideShare uma empresa Scribd logo
1 de 14
Baixar para ler offline
Faster Practical Block Compression
for Rank/Select Dictionaries
Yusaku Kaneta | yusaku.kaneta@rakuten.com
Rakuten Institute of Technology, Rakuten, Inc.
2
Background
§ Compressed data structures in Web companies.
• Web companies generate massive amount of logs in text formats.
• Analyzing such huge logs is vital for our decision making.
• Practical improvements of compressed data structures are important.
§ RRR compression [Raman, Raman, Rao, SODA’02]
• Basic building block in many compressed data structures.
• Rank/Select queries on compressed bit strings in constant time:
‣ Rankb(B, i): Number of b’s in B’s prefix of length i.
‣ Selectb(B, i): Position of B’s i-th b.
B is an input bit string
b: a bit in {0, 1}
3
RRR = Block compression + succinct index
§ Represents a block B of w bits into a pair (class(B), offset(B)).
• class(B): Number of ones in B.
• offset(B): Number of preceding blocks of class same as B for some
order (e.g., lexicographical order of bit strings).
§ log w bits for class(B) and log2
w
class(B)
bits for offset(B).
§ Two practical approaches to block compression:
• Blockwise approach [Claude and Navarro, SPIRE’09]
• Bitwise approach [Navarro and Providel, SEA’12]
4
Block compression in practice
Good: O(1) time.
Bad: Low compression ratio.
§The tables limit use of larger w.
§log w bits for class(B) become non-
negligible.
§Ex) 25% overhead for w = 15.
1. Blockwise approach
[Claude and Navarro, SPIRE’09]
2. Bitwise approach
[Navarro and Providel, SEA’12]
Idea: O(2ww)-bit universal tables. Idea: O(w3)-bit binomial coefficients.
Good: High compression ratio.
Bad: O(w) time.
§Count bit strings lexicographically
smaller than block B bit by bit.
§In practice, heuristics of encoding
and decoding blocks with a few
ones in O(1) time can be used.
Less flexible in practice
5
Main result
§ Practical encoder/decoder for block compression
• Generalization of existing blockwise and bitwise approaches.
• Idea: chunkwise processing with multiple universal tables.
• Faster and more stable on artificial data.
Method Encode Decode Space (in bits)
Blockwise [Claude and Navarro, SPIRE’09] O(1) O(1) O(2ww)
Bitwise [Navarro and Provital, SEA’12] O(w) O(w) O(w3)
Chunkwise (This work) O(w/t) O((w/t) log t) O(w3 + 2t t)
This talk uses w and t for block and chunk lengths, respectively.
6
Our algorithm
7
Overview of our algorithm
§ Main idea: Process a block B in a chunkwise fashion.
• Bi: The i-th chunk of length t. (Suppose t divides w.)
‣ Encoded/Decoded in O(1) time using O(2tt)-bit universal tables.
• Efficiently count up blocks X satisfying X < B by using a combination
formula and chunkwise order:
A lexicographical order with:
1. class(Xi) < class(Bi) or
2. class(Xi) = class(Bi) and offset(Xi) < offset(Bi)
t
c
×
n − t
m − c
c m − c
n − tt
Number of ones:
Number of bits:
Block X
Combination formula: Chunkwise order: X < B
8
Block encoding in O(w/t) time
Lemma: Block encoding can be implemented in
O(w/t) time with O(w3+2tt)-bit universal tables.
` 1
oi+1
B0···Bi-1
oi
2
X[0, i) X[i] •••
Blocks X of class same as B
in descending order of offset(X)
from top to bottom.
oi = X 	X0···Xi-1 < B0···Bi-1
ci
ni
class(B)
− ci − class(Bi)
w − ni − t#bits
#ones
Bi Bi+1···Bw/t-1
• w−	ni − t is in {0, t, 2t, …, (w/t)t=w}.
• class(B) − ci − c ranges in [0, w).
• class(Bi) ranges in [0, t).
• Each value can be represented in w bits.
2. class(Xi) = class(Bi) and offset(Xi) < offset(Bi)
Idea:
Multiplication
offset(Bi)×
w − ni − t
class(B) − ci − class(Bi)
1. class(Xi) < class(Bi):
Idea:
Summation
%
t
c
×
w − ni − t
class(B) − ci − c
class(Bi)&1
c = 0
9
Block decoding in O((w/t)log t) time
§ Reverse operation of block encoding.
• class(Bi): O(log t) time by a successor query on a universal table.
• offset(Bi): O(1) time by integer division.
min k	 ∑ t
c
×
w − ni − t
class(B) − ci − c
	≥ offset(B) − oi
k
c = 0
Lemma: Block decoding can be implemented in
O((w/t)log t) time with O(w3+2tt)-bit universal tables.
Idea:
Successor
query
10
Experimental results
11
Experiment 1: Encoding/Decoding
§ Method: Measured average time for block encoding and decoding.
§ Input: 1M random blocks of length w = 64 for each class.
Our chunkwise encoding and decoding:
§ Time: Significantly faster and less sensitive to densities.
§ Space: Comparable (t = 8) and 10 times more (t = 16).
Average time (in microseconds) for encoding and decodiing
Bitwise
Bitwise
Our chunkwise (t = 8)
Our chunkwise (t = 8)
Our chunkwise (t = 16)
Our chunkwise (t = 16)
Class of blocks Class of blocks Class of blocks
Decoding
time
Enoding
time
12
Experiment 2: Rank/Select queries
§ Method: Measured average time for 1M rank/select on RRR.
§ Input: Random bit strings of length 228 with densities 5, 10, and 20 %.
Density 5% 10% 20%
Operation Rank1 Select1 Rank1 Select1 Rank1 Select1
bitwise 0.226 0.276 0.288 0.310 0.375 0.417
chunkwise (t=8) 0.212 0.288 0.279 0.312 0.297 0.321
chunkwise (t=16) 0.187 0.250 0.219 0.254 0.235 0.265
Average time (in microseconds) for rank and select
Our chunkwise approach improved rank/select queries on RRR
although our improvement is smaller than that in Experiment 1.
13
Conclusion
§ Practical block encoding and decoding for RRR
• New time-space tradeoff based on chunkwise processing:
‣ O(w/t) encoding
‣ O((w/t)log t) decoding
‣ O(w3 + 2tt) bits of space.
• Generalize previous blockwise and bitwise approaches.
• Fast and stable on artificial data with various densities.
§ Future work:
• More experimental evaluation on real data.
THANK YOU

Mais conteúdo relacionado

Mais procurados

Tpr star tree
Tpr star treeTpr star tree
Tpr star treeWin Yu
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306Yasuo Tabei
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker DiarizationHONGJOO LEE
 
19 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp0219 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp02Muhammad Aslam
 
A Note on Latent LSTM Allocation
A Note on Latent LSTM AllocationA Note on Latent LSTM Allocation
A Note on Latent LSTM AllocationTomonari Masada
 
ZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsAlex Pruden
 
Homomorphic Encryption
Homomorphic EncryptionHomomorphic Encryption
Homomorphic EncryptionVictor Pereira
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...Alex Pruden
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDing Li
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Gael Varoquaux
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition동호 이
 
Graph Edit Distance: Basics & Trends
Graph Edit Distance: Basics & TrendsGraph Edit Distance: Basics & Trends
Graph Edit Distance: Basics & TrendsLuc Brun
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009Yasuo Tabei
 
Introduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious AlgorithmsIntroduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious AlgorithmsChristopher Gilbert
 
Graph kernels
Graph kernelsGraph kernels
Graph kernelsLuc Brun
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...BigMine
 

Mais procurados (20)

Tpr star tree
Tpr star treeTpr star tree
Tpr star tree
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
 
A Note on TopicRNN
A Note on TopicRNNA Note on TopicRNN
A Note on TopicRNN
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker Diarization
 
19 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp0219 algorithms-and-complexity-110627100203-phpapp02
19 algorithms-and-complexity-110627100203-phpapp02
 
A Note on Latent LSTM Allocation
A Note on Latent LSTM AllocationA Note on Latent LSTM Allocation
A Note on Latent LSTM Allocation
 
ZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their Applications
 
Recursive algorithms
Recursive algorithmsRecursive algorithms
Recursive algorithms
 
Homomorphic Encryption
Homomorphic EncryptionHomomorphic Encryption
Homomorphic Encryption
 
Ch8
Ch8Ch8
Ch8
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition
 
Graph Edit Distance: Basics & Trends
Graph Edit Distance: Basics & TrendsGraph Edit Distance: Basics & Trends
Graph Edit Distance: Basics & Trends
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
 
Introduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious AlgorithmsIntroduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious Algorithms
 
Graph kernels
Graph kernelsGraph kernels
Graph kernels
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...
 

Semelhante a Faster Practical Block Compression for Rank/Select Dictionaries

Computer Networking : Principles, Protocols and Practice - lesson 1
Computer Networking : Principles, Protocols and Practice - lesson 1Computer Networking : Principles, Protocols and Practice - lesson 1
Computer Networking : Principles, Protocols and Practice - lesson 1Olivier Bonaventure
 
Advance computer architecture
Advance computer architectureAdvance computer architecture
Advance computer architecturesuma1991
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Hemant Jha
 
Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...
Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...
Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...Arun Kumar
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Cloudera, Inc.
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5 Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5 Salah Amean
 

Semelhante a Faster Practical Block Compression for Rank/Select Dictionaries (20)

Lecture 25
Lecture 25Lecture 25
Lecture 25
 
SISAP17
SISAP17SISAP17
SISAP17
 
Database Sizing
Database SizingDatabase Sizing
Database Sizing
 
Computer Networking : Principles, Protocols and Practice - lesson 1
Computer Networking : Principles, Protocols and Practice - lesson 1Computer Networking : Principles, Protocols and Practice - lesson 1
Computer Networking : Principles, Protocols and Practice - lesson 1
 
Advance computer architecture
Advance computer architectureAdvance computer architecture
Advance computer architecture
 
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...
Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...
Dynamic memory allocation Dynamic Memory Allocation I. Topics. Basic represen...
 
Algorithms Exam Help
Algorithms Exam HelpAlgorithms Exam Help
Algorithms Exam Help
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
 
Part 1 : reliable transmission
Part 1 : reliable transmissionPart 1 : reliable transmission
Part 1 : reliable transmission
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5 Data Mining:  Concepts and Techniques (3rd ed.)— Chapter 5
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5
 
Editors l21 l24
Editors l21 l24Editors l21 l24
Editors l21 l24
 
Memory caching
Memory cachingMemory caching
Memory caching
 

Mais de Rakuten Group, Inc.

コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話Rakuten Group, Inc.
 
楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のりRakuten Group, Inc.
 
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Rakuten Group, Inc.
 
DataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みDataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みRakuten Group, Inc.
 
大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開Rakuten Group, Inc.
 
楽天における大規模データベースの運用
楽天における大規模データベースの運用楽天における大規模データベースの運用
楽天における大規模データベースの運用Rakuten Group, Inc.
 
楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャーRakuten Group, Inc.
 
楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割Rakuten Group, Inc.
 
Rakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdfRakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdfRakuten Group, Inc.
 
The Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdfThe Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdfRakuten Group, Inc.
 
Supporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfSupporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfRakuten Group, Inc.
 
Making Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfMaking Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfRakuten Group, Inc.
 
How We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfHow We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfRakuten Group, Inc.
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoRakuten Group, Inc.
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoRakuten Group, Inc.
 
Introduction of GORA API Group technology
Introduction of GORA API Group technologyIntroduction of GORA API Group technology
Introduction of GORA API Group technologyRakuten Group, Inc.
 
100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情Rakuten Group, Inc.
 
社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャーRakuten Group, Inc.
 

Mais de Rakuten Group, Inc. (20)

コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
 
楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり
 
What Makes Software Green?
What Makes Software Green?What Makes Software Green?
What Makes Software Green?
 
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
 
DataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みDataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組み
 
大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開
 
楽天における大規模データベースの運用
楽天における大規模データベースの運用楽天における大規模データベースの運用
楽天における大規模データベースの運用
 
楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー
 
楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割
 
Rakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdfRakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdf
 
The Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdfThe Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdf
 
Supporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfSupporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdf
 
Making Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfMaking Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdf
 
How We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfHow We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdf
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech info
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech info
 
OWASPTop10_Introduction
OWASPTop10_IntroductionOWASPTop10_Introduction
OWASPTop10_Introduction
 
Introduction of GORA API Group technology
Introduction of GORA API Group technologyIntroduction of GORA API Group technology
Introduction of GORA API Group technology
 
100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情
 
社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー
 

Último

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Faster Practical Block Compression for Rank/Select Dictionaries

  • 1. Faster Practical Block Compression for Rank/Select Dictionaries Yusaku Kaneta | yusaku.kaneta@rakuten.com Rakuten Institute of Technology, Rakuten, Inc.
  • 2. 2 Background § Compressed data structures in Web companies. • Web companies generate massive amount of logs in text formats. • Analyzing such huge logs is vital for our decision making. • Practical improvements of compressed data structures are important. § RRR compression [Raman, Raman, Rao, SODA’02] • Basic building block in many compressed data structures. • Rank/Select queries on compressed bit strings in constant time: ‣ Rankb(B, i): Number of b’s in B’s prefix of length i. ‣ Selectb(B, i): Position of B’s i-th b. B is an input bit string b: a bit in {0, 1}
  • 3. 3 RRR = Block compression + succinct index § Represents a block B of w bits into a pair (class(B), offset(B)). • class(B): Number of ones in B. • offset(B): Number of preceding blocks of class same as B for some order (e.g., lexicographical order of bit strings). § log w bits for class(B) and log2 w class(B) bits for offset(B). § Two practical approaches to block compression: • Blockwise approach [Claude and Navarro, SPIRE’09] • Bitwise approach [Navarro and Providel, SEA’12]
  • 4. 4 Block compression in practice Good: O(1) time. Bad: Low compression ratio. §The tables limit use of larger w. §log w bits for class(B) become non- negligible. §Ex) 25% overhead for w = 15. 1. Blockwise approach [Claude and Navarro, SPIRE’09] 2. Bitwise approach [Navarro and Providel, SEA’12] Idea: O(2ww)-bit universal tables. Idea: O(w3)-bit binomial coefficients. Good: High compression ratio. Bad: O(w) time. §Count bit strings lexicographically smaller than block B bit by bit. §In practice, heuristics of encoding and decoding blocks with a few ones in O(1) time can be used. Less flexible in practice
  • 5. 5 Main result § Practical encoder/decoder for block compression • Generalization of existing blockwise and bitwise approaches. • Idea: chunkwise processing with multiple universal tables. • Faster and more stable on artificial data. Method Encode Decode Space (in bits) Blockwise [Claude and Navarro, SPIRE’09] O(1) O(1) O(2ww) Bitwise [Navarro and Provital, SEA’12] O(w) O(w) O(w3) Chunkwise (This work) O(w/t) O((w/t) log t) O(w3 + 2t t) This talk uses w and t for block and chunk lengths, respectively.
  • 7. 7 Overview of our algorithm § Main idea: Process a block B in a chunkwise fashion. • Bi: The i-th chunk of length t. (Suppose t divides w.) ‣ Encoded/Decoded in O(1) time using O(2tt)-bit universal tables. • Efficiently count up blocks X satisfying X < B by using a combination formula and chunkwise order: A lexicographical order with: 1. class(Xi) < class(Bi) or 2. class(Xi) = class(Bi) and offset(Xi) < offset(Bi) t c × n − t m − c c m − c n − tt Number of ones: Number of bits: Block X Combination formula: Chunkwise order: X < B
  • 8. 8 Block encoding in O(w/t) time Lemma: Block encoding can be implemented in O(w/t) time with O(w3+2tt)-bit universal tables. ` 1 oi+1 B0···Bi-1 oi 2 X[0, i) X[i] ••• Blocks X of class same as B in descending order of offset(X) from top to bottom. oi = X X0···Xi-1 < B0···Bi-1 ci ni class(B) − ci − class(Bi) w − ni − t#bits #ones Bi Bi+1···Bw/t-1 • w− ni − t is in {0, t, 2t, …, (w/t)t=w}. • class(B) − ci − c ranges in [0, w). • class(Bi) ranges in [0, t). • Each value can be represented in w bits. 2. class(Xi) = class(Bi) and offset(Xi) < offset(Bi) Idea: Multiplication offset(Bi)× w − ni − t class(B) − ci − class(Bi) 1. class(Xi) < class(Bi): Idea: Summation % t c × w − ni − t class(B) − ci − c class(Bi)&1 c = 0
  • 9. 9 Block decoding in O((w/t)log t) time § Reverse operation of block encoding. • class(Bi): O(log t) time by a successor query on a universal table. • offset(Bi): O(1) time by integer division. min k ∑ t c × w − ni − t class(B) − ci − c ≥ offset(B) − oi k c = 0 Lemma: Block decoding can be implemented in O((w/t)log t) time with O(w3+2tt)-bit universal tables. Idea: Successor query
  • 11. 11 Experiment 1: Encoding/Decoding § Method: Measured average time for block encoding and decoding. § Input: 1M random blocks of length w = 64 for each class. Our chunkwise encoding and decoding: § Time: Significantly faster and less sensitive to densities. § Space: Comparable (t = 8) and 10 times more (t = 16). Average time (in microseconds) for encoding and decodiing Bitwise Bitwise Our chunkwise (t = 8) Our chunkwise (t = 8) Our chunkwise (t = 16) Our chunkwise (t = 16) Class of blocks Class of blocks Class of blocks Decoding time Enoding time
  • 12. 12 Experiment 2: Rank/Select queries § Method: Measured average time for 1M rank/select on RRR. § Input: Random bit strings of length 228 with densities 5, 10, and 20 %. Density 5% 10% 20% Operation Rank1 Select1 Rank1 Select1 Rank1 Select1 bitwise 0.226 0.276 0.288 0.310 0.375 0.417 chunkwise (t=8) 0.212 0.288 0.279 0.312 0.297 0.321 chunkwise (t=16) 0.187 0.250 0.219 0.254 0.235 0.265 Average time (in microseconds) for rank and select Our chunkwise approach improved rank/select queries on RRR although our improvement is smaller than that in Experiment 1.
  • 13. 13 Conclusion § Practical block encoding and decoding for RRR • New time-space tradeoff based on chunkwise processing: ‣ O(w/t) encoding ‣ O((w/t)log t) decoding ‣ O(w3 + 2tt) bits of space. • Generalize previous blockwise and bitwise approaches. • Fast and stable on artificial data with various densities. § Future work: • More experimental evaluation on real data.