Efficient Top-k Retrieval on Massive Data
Xixian Han, Jianzhong Li, Member, IEEE, Hong Gao
Abstract—In many applications, top-k query is an important operation to return a set of interesting points from a potentially huge data space. This paper shows that the existing algorithms cannot process top-k query on massive data efficiently, and proposes a novel table-scan-based algorithm, T2S, to compute top-k results on massive data efficiently. T2S first constructs a presorted table whose tuples are arranged in the order of the round-robin retrieval on the sorted lists, and it maintains only a fixed number of tuples to compute the results. The early termination checking for T2S is presented, along with an analysis of the scan depth. Selective retrieval is devised to skip the tuples in the presorted table which are not top-k results, and the theoretical analysis proves that it reduces the number of retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the required structures are also proposed. Extensive experimental results on synthetic and real-life data sets show that T2S has a significant advantage over the existing algorithms.
Index Terms—Massive data, T2S algorithm, Table scan, Early termination, Selective retrieval
1 INTRODUCTION
In many applications, top-k query is an important
operation to return a set of interesting points from a
potentially huge data space. In top-k query, a ranking
function F is provided to determine the score of each
tuple and k tuples with the largest scores are returned.
Due to its practical importance, top-k query has
attracted extensive attention [15]. The existing top-k
algorithms can be classified into three types: index-
based methods [3], [22], view-based methods [4],
[14], [21] and sorted-list-based methods [7], [17], [18].
Index-based methods (or view-based methods) make use of pre-constructed indexes (or views) to process top-k query. However, since a concrete index (or view) is constructed on a specific subset of attributes, a number of indexes (or views) exponential in the attribute number would have to be built to cover the actual queries, which is prohibitively expensive. In practice, the indexes (or views) can only be built on a small and selective set of attribute combinations.
Sorted-list-based methods retrieve the sorted lists in a round-robin fashion, maintain the retrieved tuples, and update their lower-bound and upper-bound scores. When the kth largest lower-bound score is not less than the upper-bound scores of the other candidates, the k candidates with the largest lower-bound scores are the top-k results. Sorted-list-based methods compute top-k results by retrieving only the involved sorted lists and thus naturally support the actual queries. However, it is shown in this paper that the numbers of tuples retrieved and maintained in these methods increase exponentially with the attribute number and polynomially with the tuple number and result size; on massive data they incur high I/O and memory costs.
• The authors are with the School of Computer Science and Technology, Harbin Institute of Technology, China.
E-mail: hanxixian@yeah.net, lijzh@hit.edu.cn, honggao@hit.edu.cn
This paper proposes a novel table-scan-based T2S
algorithm (Top-k by Table Scan) to compute top-k
results on massive data efficiently. Given table T, T2S first presorts T to obtain table PT (Presorted Table), whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. During its execution, T2S maintains only a fixed and small number of tuples to compute the results. It is proved that T2S has the characteristic of early termination: it does not need to examine all tuples in PT to return the results. The analysis of the scan depth of T2S is also developed. Since the result size k is usually small and the vast majority of the tuples retrieved in PT are not top-k results, this paper devises selective retrieval to skip the tuples in PT which are not query results. The theoretical analysis proves that selective retrieval can reduce the number of retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the required data structures are proposed in this paper. Extensive experiments are conducted on synthetic and real-life data sets; the results show that T2S outperforms the existing algorithms significantly.
The contributions of this paper are listed as follows:
• This paper proposes a novel table-scan-based T2S
algorithm to process top-k query on massive data.
• The early termination checking is presented in
this paper, along with the analysis of scan depth.
• This paper devises selective retrieval and analyzes its performance.
• The experimental results show that T2S has a sig-
nificant advantage over the existing algorithms.
The rest of the paper is organized as follows. Section
2 reviews related work. Preliminaries are described in
Section 3. The execution behavior of sorted-list-based
methods is analyzed in Section 4. T2S algorithm is
introduced in Section 5 (the presorting of the table),
Section 6 (early termination checking), Section 7 (selective retrieval) and Section 8 (the columnwise algorithm). The discussion about the data structures is developed
in Section 9. The performance evaluation is provided
in Section 10. Section 11 concludes the paper.
2 RELATED WORK
2.1 Index-based and view-based methods
Xin et al. [22] and Chang et al. [3] propose layered
indexing to organize the tuples into multiple consec-
utive layers. The top-k results can be computed by
at most k layers of tuples. Zou et al. [23] propose
layer-based Pareto-Based Dominant Graph to express
the dominant relationship between records and top-k
query is implemented as a graph traversal problem.
Lee et al. [16] propose a dual-resolution layer structure
(coarse-level layers and fine-level sub-layers). Top-
k query can be processed efficiently by traversing
the dual-resolution layer through the relationships
between tuples. Heo et al. [13] propose the Hybrid-
Layer Index, which integrates layer-level filtering and
list-level filtering to significantly reduce the number
of tuples retrieved in query processing.
Hristidis et al. [14] and Das et al. [4] propose
view-based algorithms to pre-construct the specified
materialized views according to some ranking func-
tions. Given a top-k query, one or more optimal
materialized views are selected to return the top-k
results efficiently. Xie et al. [21] propose LPTA+ to
significantly improve efficiency of the state-of-the-art
LPTA algorithm. With the materialized views cached in memory, LPTA+ reduces the iterative calling of the linear programming sub-procedure, thus greatly improving the efficiency over the LPTA algorithm.
Summary. In practical applications, a concrete in-
dex (or view) is built on a specific subset of attributes.
Due to prohibitively expensive overhead to cover all
attribute combinations, the indexes (or views) can
only be built on a small and selective set of attribute
combinations. If the attribute combinations of top-k
query are fixed, index-based (or view-based) methods
can provide a superior performance. However, on
massive data, users often issue ad-hoc queries, it is
very likely that the indexes (or views) involved in
the ad-hoc queries are not built and the practicability
of these methods is limited greatly. Correspondingly,
T2S only builds presorted table, on which top-k query
on any attribute combination can be dealt with. This
reduces the space overhead significantly compared
with index-based (or view-based) methods, and en-
ables actual practicability for T2S.
2.2 Sorted-list-based methods
Fagin et al. [6] propose TA algorithm to answer top-k
query on sorted lists. During execution, TA reads each
list in a round-robin fashion. For each tuple newly
seen in some list, TA retrieves scores of the tuple
in other lists to determine its complete score. The
threshold of TA is updated in each pass of retrieval
by aggregating the scores of the last seen tuples in
lists. TA terminates when the kth
largest score seen
so far is not less than the threshold, and returns k
tuples with the largest scores. Instead of round-robin
retrieval, Günter et al. [9] propose Quick-Combine to
preferentially access lists that make threshold decline
most. Akbarinia et al. [1] develop best position al-
gorithm to speed up top-k processing by means of
information obtained in the process of random access.
In some cases, random access is limited or impossible; NRA [7] is proposed with sequential access only and performs round-robin retrieval on the sorted lists. For
each retrieved tuple, NRA maintains its lower-bound
and upper-bound scores. NRA terminates when the
kth
largest lower-bound score seen so far is not less
than the upper-bound scores of the other tuples, and
returns k tuples with the largest lower-bound scores.
Günter et al. [10] propose Stream-Combine to improve
NRA by preferentially accessing sorted lists which
make the threshold decline most. Mamoulis et al. [17] divide the execution of NRA into a growing phase and a shrinking phase, and propose LARA to reduce the computation cost. Pang et al. [18] propose the TBB algorithm to
improve disk access efficiency of NRA by sequential
scan. Han et al. [11] develop TKEP algorithm to
perform early pruning on each candidate encountered
in growing phase to reduce the processing cost.
Summary. Sorted-list-based methods can compute
top-k results by retrieving the involved sorted lists
and support actual top-k queries. They usually have
the characteristic of early termination. However, on massive data, TA-style algorithms generate many random seek operations, while NRA-style algorithms retrieve and maintain a large number of tuples; neither can process top-k query efficiently. TKEP can reduce the number of tuples maintained in memory by early pruning. Compared with TKEP, the novelties of T2S are that (1) TKEP is an approximate method while T2S always returns exact results, (2) the number of tuples maintained in TKEP is affected by concrete parameters (such as the tuple number or the attribute number) and the underlying data distribution, while T2S always maintains a fixed and small number of candidates, and (3) during the execution TKEP has to retrieve every tuple, while T2S skips most of the candidates. Of course, compared with sorted-list-based methods, T2S has the overhead of building the presorted table. The description below shows that (1) the overhead is worthwhile because T2S outperforms the existing methods significantly, and (2) for the required structures, incremental-update/batch-processing can be performed to reflect the latest changes of the sorted lists at a relatively small cost.
Of course, in some cases T2S may not be suitable because it needs to presort the table. For example,
in distributed systems, each node may maintain some attribute, and the attribute values can be sorted in different orders according to the user requirement. In that case, it is difficult (if not impossible) to build the required presorted table. In many other cases, however, the data set is stored on a stand-alone computer or horizontally partitioned in a distributed system and the values in the data set have a default order; T2S is very useful here because of its superior performance.
3 PRELIMINARIES
3.1 Top-k query
Given a table T(A1, . . . , AM) with N tuples, ∀t ∈ T, let t.Ai be the value of Ai in t. Without loss of generality, let F = Σ^m_{i=1}(wi × t.Ai) be a ranking function on A1, . . . , Am, where wi is the weight of F on Ai. Usually, F is a monotonic function, i.e., ∀t1, t2 ∈ T, if t1.Ai ≤ t2.Ai for all 1 ≤ i ≤ m, then F(t1) ≤ F(t2). In this paper, we assume that larger scores are preferred.
Top-k query: Given a table T and a ranking function
F, top-k query returns k-subset R of T, in which ∀t1 ∈
R, ∀t2 ∈ (T − R), F(t1) ≥ F(t2).
∀t ∈ T, its positional index (PI) is a if t is the a-th tuple, denoted by T(a) [12]. We denote by T(S) the tuples in T whose PIs are in set S, by T(a, . . . , b) (a ≤ b) the tuples in T whose PIs are between a and b, and by T(a, . . . , b).Ai the set of Ai values in T(a, . . . , b).
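For illustration only (not part of the original paper), the following Python sketch evaluates a linear ranking function F = Σ wi × t.Ai over a small in-memory table and returns the k highest-scoring tuples with a heap; this is the brute-force baseline that the algorithms discussed below try to avoid on massive data. All names are hypothetical.

```python
import heapq

def topk(tuples, weights, k):
    """Return the k tuples with the largest scores under the linear
    ranking function F(t) = sum_i w_i * t[i] (larger scores preferred)."""
    score = lambda t: sum(w * a for w, a in zip(weights, t))
    # heapq.nlargest keeps only k candidates in memory while scanning.
    return heapq.nlargest(k, tuples, key=score)

# Example: top-2 of four 3-attribute tuples under F = A1 + A2 + A3.
rows = [(0.9, 0.1, 0.5), (0.8, 0.8, 0.7), (0.2, 0.9, 0.9), (0.4, 0.4, 0.4)]
print(topk(rows, weights=(1.0, 1.0, 1.0), k=2))
```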
3.2 Sorted-list-based methods
In this paper, we take LARA [17] to discuss sorted-list-
based methods due to its superior performance. We
do not take TKEP here because its execution is similar
to LARA except for early pruning and it is an approx-
imate method. In LARA, T is kept as a set of sorted
lists SL1, . . . , SLM . The schema of SLi(1 ≤ i ≤ M)
is SLi(PI, Ai), here PI is the positional index of the
tuple in T and Ai is the corresponding attribute value.
SLi is sorted in descending order of Ai. Given ranking
function F, LARA performs round-robin retrieval on
SL1, . . . , SLm. Let the tuples retrieved in current pass
be (pi1, a1), . . ., (pim, am), the threshold of LARA is
τ = F(a1, . . . , am). The execution of LARA consists
of growing phase and shrinking phase. In growing
phase, LARA maintains each retrieved candidate t in
set C and updates its lower-bound score F^lb_t. Here, F^lb_t = Σ^m_{i=1}(wi × t.A_{il}), where t.A_{il} = t.Ai if t.Ai has been obtained in SLi, otherwise t.A_{il} = MINi (the minimum value of Ai). A priority queue PQ is used to maintain the k candidates with the largest lower-bound scores seen so far. Let PQ.min be the minimum lower-bound score in PQ. When PQ.min ≥ τ, the growing
phase is over. In shrinking phase, LARA continues
its round-robin retrieval, maintains the lower-bound
and upper-bound scores of the candidates in set C,
updates PQ. For candidate t, its upper-bound score is F^ub_t = Σ^m_{i=1}(wi × t.A_{iu}), where t.A_{iu} = t.Ai if t.Ai has been seen in SLi, otherwise t.A_{iu} = ai (the maximum possible value of t.Ai). When PQ.min ≥ F^ub_t for all t ∈ (C − PQ), LARA terminates and PQ keeps the top-k results.
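As a rough illustration of the bookkeeping described above (an illustrative sketch under simplifying assumptions, not the authors' implementation of LARA), the following Python code performs NRA-style round-robin retrieval on in-memory sorted lists, maintains lower and upper bounds for every candidate seen, and stops when the k-th largest lower bound dominates both the threshold and the other candidates' upper bounds. It assumes non-negative weights and that every positional index appears in every list.

```python
import heapq
from collections import defaultdict

def nra_topk(sorted_lists, weights, k):
    """NRA/LARA-style sketch. sorted_lists[i] is a list of (pi, value) pairs
    sorted by value descending; weights[i] is the weight of attribute i.
    Sequential access only; bounds are recomputed naively each pass."""
    m = len(sorted_lists)
    mins = [lst[-1][1] for lst in sorted_lists]   # MIN_i, for lower bounds
    seen = defaultdict(dict)                      # pi -> {attribute index: value}
    last = [lst[0][1] for lst in sorted_lists]    # last value seen per list
    depth = 0
    while True:
        for i in range(m):                        # one round-robin pass
            pi, val = sorted_lists[i][depth]
            seen[pi][i] = val
            last[i] = val
        depth += 1
        def lb(attrs): return sum(weights[i] * attrs.get(i, mins[i]) for i in range(m))
        def ub(attrs): return sum(weights[i] * attrs.get(i, last[i]) for i in range(m))
        ranked = sorted(seen.items(), key=lambda kv: lb(kv[1]), reverse=True)
        if len(ranked) >= k:
            kth_lb = lb(ranked[k - 1][1])
            threshold = sum(weights[i] * last[i] for i in range(m))
            others_ub = max((ub(a) for _, a in ranked[k:]), default=float("-inf"))
            if kth_lb >= max(threshold, others_ub):
                return [pi for pi, _ in ranked[:k]]
```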
4 BEHAVIOR ANALYSIS OF LARA
This section analyzes execution behavior of LARA. To
facilitate discussion, Assumption 4.1 is made here.
Assumption 4.1: The attributes A1, . . . , AM are dis-
tributed uniformly and independently, the range of
Ai(1 ≤ i ≤ M) is [0, 1], and ∀t ∈ T, the ranking
function is F(t) = Σ^m_{i=1} t.Ai.
Definition 4.1: (Terminating tuple) In round-robin
retrieval on SL1, . . . , SLm, a tuple t is a terminating
tuple, if its attributes A1, . . . , Am have been seen.
Let gdep be the scan depth¹ when k terminating tuples are obtained in the round-robin retrieval on SL1, . . . , SLm; at this point the growing phase ends definitely, since PQ.min ≥ Σ^m_{i=1} SLi(gdep).Ai is satisfied. We take gdep as the scan depth when the growing phase ends. ∀t ∈ T, the probability that (t.PI, t.Ai) is among the first gdep tuples in SLi is gdep/N, and the probability that t is a terminating tuple given scan depth gdep is (gdep/N)^m. ∀t1, t2 ∈ T, given scan depth gdep, the event that t1 is a terminating tuple is independent of the event that t2 is a terminating tuple [12]. The number of terminating tuples therefore follows the binomial distribution BD(N, (gdep/N)^m), whose expectation is E(BD) = N × (gdep/N)^m. Setting E(BD) = k, we have gdep = N × (k/N)^{1/m}.
LARA does not prune candidates in growing phase,
here we analyze the number of candidates maintained
by LARA. ∀t ∈ T, let num_{t,gdep} be the number of attributes A1, . . . , Am of t lying in the first gdep tuples of the corresponding sorted lists. Under Assumption 4.1, num_{t,gdep} follows the binomial distribution BD2(m, gdep/N), and P(num_{t,gdep} = i) = C(m, i) × (gdep/N)^i × (1 − gdep/N)^{m−i}. The number NUM_{gdep} of candidates maintained in the growing phase is Σ^m_{i=1}(N × P(num_{t,gdep} = i)). It is found that NUM_{gdep} increases exponentially with m, and polynomially with N and k. On massive data, LARA maintains a large number of candidates and will incur a high memory cost.
Next, we analyze the maximum scan depth sdep of the shrinking phase. In the shrinking phase, PQ.min ≥ Σ^m_{i=1} SLi(gdep).Ai = m × (1 − gdep/N). Thus, ∀t ∈ (C − PQ), the shrinking phase ends definitely when m × (1 − gdep/N) ≥ F^ub_t. The scan depth sdep, to which the m sorted lists must be retrieved in order to guarantee that t is not a top-k result, corresponds to the case that m − 1 attributes of t have value 1 while the remaining attribute is rather small², i.e., m × (1 − gdep/N) ≥ (m − 1) + (1 − sdep/N). We take sdep = m × gdep as the scan depth when the shrinking phase ends. In terms of this formula, sdep increases exponentially with m, and polynomially with N and k. On massive data, LARA has to retrieve a large number of tuples and incurs a high I/O cost.
1. The positional index of tuples in the sorted lists
2. We only consider A1, . . . , Am when computing score of tuple
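The three closed-form estimates above are easy to evaluate numerically. The short Python sketch below (illustrative, with the uniform/independent Assumption 4.1 baked in) computes gdep, NUM_gdep and sdep for given N, m and k.

```python
from math import comb

def lara_estimates(N, m, k):
    """Section 4 estimates under the uniform/independent assumption:
    growing-phase depth gdep, maintained-candidate count NUM_gdep,
    and shrinking-phase depth sdep = m * gdep."""
    gdep = N * (k / N) ** (1.0 / m)
    p = gdep / N
    num_gdep = sum(N * comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(1, m + 1))
    sdep = m * gdep
    return gdep, num_gdep, sdep

# e.g. 10^8 tuples, 5 ranking attributes, k = 10 (values used in the experiments)
print(lara_estimates(N=10**8, m=5, k=10))
```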
Fig. 1. The illustration of sorted lists
When the number of maintained candidates exceeds the memory limit, LARA has to utilize the disk as exchange space and incurs an even higher I/O cost.
It is noted that Assumption 4.1 is made only to facilitate the discussion; the execution of LARA and T2S does not depend on it. Of course, it is interesting to discuss the case that A1, . . . , Am follow a correlated distribution with correlation coefficient c. If c < 0, they follow a negatively correlated distribution and the values of the attributes tend to move in opposite directions. Let gdep_c be the corresponding scan depth in the growing phase. Obviously gdep_c > gdep, so LARA has to maintain more tuples under a negatively correlated distribution. As |c| increases, gdep_c becomes larger and LARA has to maintain even more candidates. If c > 0, the attributes follow a positively correlated distribution and the converse is true.
Aiming at the performance issues of LARA in high
memory cost and I/O cost, this paper proposes a
table-scan-based T2S algorithm to compute top-k re-
sults on massive data efficiently. In its execution, T2S
selectively retrieves the tuples in presorted table and
maintains the k tuples with the largest exact scores
so far. For each tuple retrieved, T2S checks whether
the top-k results are obtained already. In the following
sections, we will introduce T2S algorithm gradually.
5 THE PRESORTING OF THE TABLE
This section introduces how to generate table PT.
As shown in Figure 1, because sorted list SLi(1 ≤
i ≤ m) only maintains the information of Ai, LARA
has to maintain the retrieved tuple in memory to
compute the lower-bound and upper-bound scores of
the tuple. For any retrieved tuple, if we can obtain the
required attribute values, we can compute its exact
score directly, and only maintain k candidates with
the largest scores seen so far to compute top-k results.
In order to compute exact score, one method is
based on sorted lists, whose process is similar to TA
algorithm. The method has the characteristic of early
termination, but will incur a large number of disk
seek operations. Another method performs sequential
scan on the table and returns the k tuples with the largest
scores when the scan is over. This method has high
disk efficiency, but has to read all tuples in the table.
Fig. 2. The illustration of presorted row table
In this paper, T2S presorts the table in the order
of the round-robin retrieval on sorted lists, which
combines the advantages of two methods above.
The detail of presorting operation is described be-
low. Given sorted list SLi(PI, Ai) (1 ≤ i ≤ M), SLi is first sorted to generate PSLi(PIL, PIT, Ai) in ascending order of PIT; here PIL represents the positional index of the tuple in the sorted list³, and PIT is the attribute PI in the schema of SLi. ∀j (1 ≤
j ≤ N), PSL1(j), . . . , PSLM (j) make up T(j) and
we denote by MPIL(j) the minimum value of PIL
among PSL1(j), . . . , PSLM (j). As shown in Figure
2, we merge PSL1, . . . , PSLM and sort the merged
table in the ascending order of MPIL. The schema
of presorted table PT is PT(MPIL, PIT , A1, . . . , AM ).
It should be noted that, PT does not contain dupli-
cate tuples, since PT in essence is built by sorting
T with MPIL acting as sort key. Of course, given
row table T, we only need to keep attribute PIL in
PSLi in the ascending order of PIT and the merging
on PSL1, . . . , PSLM generates a list of value MPIL,
which is used to sort T to generate PT.
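To make the construction concrete, here is a small in-memory Python sketch of the presorting step (illustrative only; the paper performs it with external sorting on disk, and keeps only the PIL attribute of each PSLi when the row table T is given): for each tuple, MPIL is its best (smallest) rank over the per-attribute descending sorted lists, and PT is T sorted by MPIL.

```python
def build_presorted_table(T):
    """Build PT from row table T (a list of attribute tuples). MPIL(j) is
    tuple j's best (smallest) rank across the per-attribute sorted lists,
    and PT is T sorted in ascending order of MPIL."""
    N, M = len(T), len(T[0])
    mpil = [N] * N
    for i in range(M):
        # Sorted list SL_i: positional indexes ordered by A_i descending.
        sl = sorted(range(N), key=lambda pi: T[pi][i], reverse=True)
        for rank, pi in enumerate(sl, start=1):   # rank plays the role of PIL
            mpil[pi] = min(mpil[pi], rank)
    # PT(MPIL, PIT, A_1, ..., A_M), sorted in ascending order of MPIL
    return sorted(((mpil[pi], pi) + tuple(T[pi]) for pi in range(N)),
                  key=lambda row: row[0])
```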
When presorted table PT is generated, T2S performs
sequential scan on PT and maintains k tuples with
the largest scores seen so far until early termination
condition is satisfied. The k tuples maintained cur-
rently are top-k results. Let pt ∈ PT be the tuple
retrieved currently, Theorem 5.1 proves that the tu-
ples already obtained from PT include all tuples of
SL1(1, . . . , pt.MPIL−1), . . ., SLm(1, . . . , pt.MPIL−1).
Theorem 5.1: During the sequential scan on PT,
let pt be the tuple retrieved currently, the tuples
already obtained from PT include all tuples of
SL1(1, . . . , pt.MPIL−1), . . ., SLm(1, . . . , pt.MPIL−1).
Proof: Let pt ∈ PT be the tuple retrieved currently.
∀sl ∈ SLi(1, . . . , pt.MPIL − 1)(1 ≤ i ≤ m), sl.PIL ≤
pt.MPIL − 1 and MPIL(sl.PI) ≤ pt.MPIL − 1 obvi-
ously. Given that PT is sorted in the ascending order
of MPIL, all tuples, whose values of MPIL are less
than pt.MPIL, have been already obtained. Q.E.D.
6 THE EARLY TERMINATION CHECKING
Since the tuples in PT are arranged in the order of
the round-robin retrieval on sorted lists, T2S naturally
3. PIL can be seen as an implicit attribute in SLi
holds the characteristic of early termination. Let PQ be
the priority queue to maintain k tuples with the largest
scores seen so far, PQ.min be the minimum score in
PQ. This paper proposes a compact data structure
GTS(Gap Threshold Score) to maintain the threshold
scores with the specified gap of positional indexes of
tuples in the sorted lists. Given gap parameter OG, the j-th (1 ≤ j ≤ N/OG) record in structure GTS maintains the attribute values SL1(j × OG).A1, . . . , SLM(j × OG).AM. Let pt be the current tuple retrieved in PT. Theorem 6.1 proves that, if the early termination condition PQ.min ≥ Σ^m_{i=1}(wi × GTS(⌊(pt.MPIL − 1)/OG⌋).Ai) is satisfied, the tuples maintained in PQ currently are the top-k results. Here, we denote by GTS(0).Ai (1 ≤ i ≤ m) the value of MAXi (the maximum value in Ai).
Theorem 6.1: Given ranking function F and priority queue PQ, let pt be the current tuple retrieved in PT. If PQ.min ≥ Σ^m_{i=1}(wi × GTS(⌊(pt.MPIL − 1)/OG⌋).Ai) is satisfied, the tuples in PQ currently are the top-k results.
Proof: Let pt be the current tuple retrieved in PT. According to Theorem 5.1, the tuples already obtained from PT include all tuples of SL1(1, . . . , pt.MPIL − 1), . . . , SLm(1, . . . , pt.MPIL − 1). As in the execution of LARA, the scores of the tuples not yet seen are not larger than Σ^m_{i=1}(wi × SLi(pt.MPIL − 1).Ai). Given ⌊(pt.MPIL − 1)/OG⌋ × OG ≤ pt.MPIL − 1, if PQ.min ≥ Σ^m_{i=1}(wi × GTS(⌊(pt.MPIL − 1)/OG⌋).Ai), then PQ.min ≥ Σ^m_{i=1}(wi × SLi(pt.MPIL − 1).Ai) because of the sortedness of the sorted lists. Q.E.D.
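A compact Python sketch of this scan loop is given below (illustrative; it assumes PT and GTS fit in memory, that the ranking attributes are the first m attributes of PT, and that GTS is addressable by record number j with GTS[0] unused). It maintains the k largest exact scores in a heap and applies the condition of Theorem 6.1 after each retrieved tuple.

```python
import heapq

def t2s_scan(PT, GTS, weights, k, OG, max_vals):
    """Early-terminating scan on PT (Theorem 6.1).
    PT rows: (MPIL, PIT, A_1, ..., A_M); GTS[j] holds the attribute values at
    depth j*OG of the sorted lists (GTS[0] unused); max_vals[i] = MAX_i."""
    m = len(weights)
    pq = []                                   # k largest exact scores seen so far
    for mpil, pit, *attrs in PT:
        score = sum(w * a for w, a in zip(weights, attrs))
        if len(pq) < k:
            heapq.heappush(pq, (score, pit))
        elif score > pq[0][0]:
            heapq.heapreplace(pq, (score, pit))
        j = (mpil - 1) // OG                  # floor((MPIL - 1) / OG)
        bound = max_vals if j == 0 else GTS[j]
        threshold = sum(w * b for w, b in zip(weights, bound))
        if len(pq) == k and pq[0][0] >= threshold:
            break                             # early termination condition holds
    return sorted(pq, reverse=True)           # the top-k results
```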
Next, we will analyze scan depth of T2S. During
the analysis, we take Assumption 4.1 in Section 4.
First, we assume that the tuples in PT only contain the attributes A1, . . . , Am. Given scan depth gdep, k terminating tuples are generated in LARA and the minimum score of the k terminating tuples is not less than the current threshold. According to the construction method of PT, ∪_{1≤i≤m} T(SLi(1, . . . , gdep).PI) are the first NUM_{gdep} tuples in PT. When the first NUM_{gdep} tuples in PT are retrieved, the early termination condition is satisfied and T2S obtains the top-k results. Actually, the tuples in PT contain the attributes A1, . . . , AM. In order to compute an upper bound on the scan depth, we assume that ∪_{m+1≤i≤M} T(SLi(1, . . . , gdep).PI) do not contain top-k results. Thus, T2S needs to retrieve NUM_{gdep} × M/m tuples in PT to return the top-k results. Because the score threshold is computed by structure GTS, the scan depth of T2S can be estimated to be ⌈(NUM_{gdep} × M/m)/OG⌉ × OG. Of course, in the general case that the attributes follow a correlated distribution with correlation coefficient c, the scan depth of T2S can be estimated to be ⌈(NUM_{gdep_c} × M/m)/OG⌉ × OG, and the discussion is similar to that in Section 4.
For gap parameter OG, a greater value generates fewer records in structure GTS but a larger scan depth of T2S, and vice versa. In this paper, we set the value of OG to 10000; the space overhead of structure GTS is then not significant (1/10000 of the original table), and the difference between the actual scan depth and the prolonged scan depth is not greater than 10000.
7 THE SELECTIVE RETRIEVAL STRATEGY
7.1 Basic idea
It is known that the result size k is usually small (for example, in a text search engine, k ≤ 40) [18], so the vast majority of the tuples retrieved in PT are not query results. The monotonicity of the ranking function can be utilized here to determine whether a tuple in PT is not a query result. Of course, we do not need to retrieve the tuples in PT which are not query results.
Let tscore_thres be the lower-bound score of the top-k results, and MAXi be the maximum value of Ai. We present the basic idea of selective retrieval in Table 1.
TABLE 1
Basic idea of selective retrieval
∀pt ∈ PT, if pt.Ai (∃i, 1 ≤ i ≤ m) satisfies the condition
pt.Ai < (1/wi) × (tscore_thres − Σ_{1≤j≠i≤m}(wj × MAXj)),
then pt is not a top-k result.
This can be explained as follows: even if the other m − 1 attributes of pt take their maximum values, pt.Ai is so small that the value of F(pt) is less than tscore_thres. The computation of tscore_thres and the value checking of a given attribute are implemented by the data structures pre-constructed below.
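The check in Table 1 is a one-line test per attribute. The following Python sketch (illustrative; it assumes positive weights wi and that the ranking attributes are indexed 0..m-1) returns True when a PT tuple can be skipped.

```python
def can_skip(pt_attrs, weights, tscore_thres, max_vals):
    """Basic idea of selective retrieval (Table 1): pt cannot be a top-k
    result if some attribute A_i is too small to reach tscore_thres even
    when every other ranking attribute takes its maximum value."""
    m = len(weights)
    for i in range(m):
        rest = sum(weights[j] * max_vals[j] for j in range(m) if j != i)
        if pt_attrs[i] < (tscore_thres - rest) / weights[i]:
            return True
    return False
```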
7.2 The preconstruction of data structure
7.2.1 Attribute pair terminating-tuple set (APTS)
This paper proposes APTS to determine the lower-bound score of the top-k results. Of course, any k tuples can provide a lower bound; the key point is how to determine a bound which is very close to the exact threshold, i.e., the minimum top-k score.
Given that the result size k is usually small, an upper limit Kmax can be specified in the context of the actual applications. With dimensionality M, APTS maintains C(M, 2) files, each of which keeps Kmax tuples of T. ∀1 ≤ a < b ≤ M, APTS_{a,b} represents a file in APTS, which keeps the first Kmax terminating tuples in the round-robin retrieval on SLa and SLb. The total number of tuples maintained in APTS is C(M, 2) × Kmax. For example, given N = 10^9, M = 100 and Kmax = 1000, the total number of tuples in APTS is 4950000, less than 0.5% of the original table.
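The following Python sketch (illustrative, in-memory, with hypothetical list formats) builds one APTS file APTS_{a,b} by round-robin retrieval on two sorted lists and collects the positional indexes of the first Kmax terminating tuples; the actual APTS file would store the corresponding full tuples of T.

```python
def build_apts_ab(SLa, SLb, Kmax):
    """Round-robin over two sorted lists SL_a and SL_b (lists of (PI, value)
    pairs in descending value order) and collect the PIs of the first Kmax
    tuples seen in *both* lists, i.e. the terminating tuples for (a, b)."""
    seen_a, seen_b, result = set(), set(), []
    for (pia, _), (pib, _) in zip(SLa, SLb):
        for pi, seen_here, seen_other in ((pia, seen_a, seen_b),
                                          (pib, seen_b, seen_a)):
            seen_here.add(pi)
            if pi in seen_other:
                result.append(pi)             # both attributes of pi are seen
                if len(result) == Kmax:
                    return result
    return result
```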
7.2.2 Exponential gap information
In this part, we first give the definition of exponential
gap bloom filter table.
Definition 7.1: (Exponential gap bloom filter table)
Given sorted list SLi with N tuples, EGBFTi is
exponential gap bloom filter table for SLi, if EGBFTi
satisfies: (1) |EGBFTi| = log2 N, (2) EGBFTi(j) is a bloom filter constructed on SLi(1, . . . , 2^j).PI.
The size of EGBFTi depends on the tuple number and the false positive rate of the bloom filter. We set the false positive rate to 0.001 in this paper; the size of EGBFTi is then less than 23% of the size of SLi [11]. This paper presents structure EGBIi to maintain the maximum and minimum Ai values of SLi within the range covered by the tuples in EGBFTi. The schema of EGBIi is (bval, eval): EGBIi(j).bval = SLi(1).Ai and EGBIi(j).eval = SLi(2^j).Ai.
T2S builds EGBFT and EGBI on each sorted list.
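A minimal sketch of building EGBFTi and EGBIi is shown below (illustrative; it uses a simple hand-rolled Bloom filter with the standard sizing formulas and a false positive rate of 0.001, and any production Bloom filter implementation could be substituted). SLi is assumed to be an in-memory list of (PI, Ai) pairs in descending Ai order.

```python
import math
from hashlib import blake2b

class BloomFilter:
    """Minimal Bloom filter (illustrative stand-in; fpr defaults to 0.001)."""
    def __init__(self, n, fpr=0.001):
        self.m = max(8, int(-n * math.log(fpr) / (math.log(2) ** 2)))  # bits
        self.k = max(1, int(self.m / n * math.log(2)))                 # hashes
        self.bits = bytearray((self.m + 7) // 8)
    def _hashes(self, item):
        for i in range(self.k):
            h = blake2b(f"{i}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.m
    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)
    def __contains__(self, item):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

def build_egbft_egbi(SLi):
    """EGBFT_i(j): Bloom filter over the PIs of the first 2^j entries of SL_i;
    EGBI_i(j): (bval, eval) boundary A_i values of that prefix."""
    N = len(SLi)
    egbft, egbi = [], []
    for j in range(1, int(math.log2(N)) + 1):
        prefix = SLi[:2 ** j]
        bf = BloomFilter(len(prefix))
        for pi, _ in prefix:
            bf.add(pi)
        egbft.append(bf)
        egbi.append((SLi[0][1], prefix[-1][1]))   # (bval, eval)
    return egbft, egbi
```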
7.2.3 Membership Checking Result (MCR)
This part introduces how to construct structure MCR.
Given presorted table PT, the file MCRi (1 ≤ i ≤ M) consists of log2 N tuples, where MCRi(a) is an N-bit bit-vector representing the checking results of the PT tuples by EGBFTi(a). ∀PT(b) (1 ≤ b ≤ N), if the checking result of PT(b).PIT by EGBFTi(a) is true, the b-th bit in MCRi(a) is set to 1; otherwise, the bit is set to 0.
The size of MCRi(a) is N/8 bytes, and the total size of structure MCR is Σ_{1≤i≤M} Σ_{1≤a≤log2 N} N/8. The ratio of the disk space taken by structure MCR to that taken by PT is (M × log2 N)/(64 × (M + 2)); the space overhead of structure MCR is not high. Besides, structure MCR in practical applications can be stored as a compressed bitmap [20] rather than a literal bit-vector. Under Assumption 4.1, the proportion of 1-bits in MCRi(a) (1 ≤ a ≤ log2 N) is 2^a/N, and most of the tuples in MCRi can be compressed with a large compression ratio. Of course, the emphasis of this part is to introduce selective retrieval, and for the sake of convenience we still take the form of a literal bit-vector to store structure MCR in this paper. Besides, we do not need to maintain all log2 N tuples in MCRi. For example, given N = 10^9, we can generate MCRi from the 16th tuple (29 tuples in the original MCRi), neglecting its first 15 tuples, which cover at most 32768 tuples and are rarely used in actual applications. Any query involving the first 15 tuples of MCRi can utilize MCRi(16) instead.
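Continuing the sketches above, MCRi can be materialized as follows (illustrative; it reuses the BloomFilter objects from the previous sketch and stores each level as a literal bit-vector, whereas the paper also discusses compressed bitmaps).

```python
def build_mcr_i(PT, egbft_i):
    """MCR_i: for each level a, an N-bit vector whose b-th bit is 1 iff the
    PIT of PT's b-th tuple tests positive in EGBFT_i(a)."""
    N = len(PT)
    mcr_i = []
    for bf in egbft_i:                      # one bit-vector per EGBFT level
        bits = bytearray((N + 7) // 8)
        for b, row in enumerate(PT):        # row = (MPIL, PIT, A_1, ..., A_M)
            if row[1] in bf:                # membership check by the Bloom filter
                bits[b // 8] |= 1 << (b % 8)
        mcr_i.append(bytes(bits))
    return mcr_i
```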
7.3 Retrieval process
Given ranking function F on A1, . . . , Am, T2S utilizes the C(m, 2) relevant APTS files APTS_{a,b} (1 ≤ a < b ≤ m) to compute the lower-bound score of the top-k results. Although the dimensionality M may be high, a top-k query usually involves a small number of attributes [5] and the total number of tuples in the relevant APTS files is not large. For example, given m = 4 and Kmax = 1000, the total tuple number in the relevant APTS files is 6000.
During retrieving relevant APTS files, T2S main-
tains in priority queue TSETthres k distinct tuples
with the largest scores. When the relevant APTS files
are retrieved over, tscorethres, i.e. the minimum score
in TSETthres, is lower-bound score of top-k results.
Next, we analyze the effect of the determined
lower-bound. ∀t ∈ T, if t.PI is not obtained in
growing phase, t is called a low-score tuple.
Fig. 3. The illustration of MCR positional index
Let APTS_{a,b}(1, . . . , k) be the first k tuples in APTS_{a,b}, and TSET_k be the k tuples with the largest scores in ∪_{1≤a<b≤m} APTS_{a,b}(1, . . . , k).
Theorem 7.1: TSETk does not contain low-score tu-
ples.
Proof: When scan depth is gdep, k terminating
tuples are obtained, thus ∀1 ≤ a < b ≤ m, the number
of the candidates whose attributes Aa and Ab are seen
is not less than k. ∀t ∈ APTSa,b(1, . . . , k), t.PI ∈
SLa(1, . . . , gdep).PI ∧ t.PI ∈ SLb(1, . . . , gdep).PI, t is
not a low-score tuple. Q.E.D.
It is proved by Theorem 7.1 that TSETk does not
contain any low-score tuple. Considering the fact that:
(1) ranking function F involves a small number of
attributes, (2) the minimum score in TSETthres is not
less than the minimum score in TSETk, tscorethres
computed above is a satisfactory lower-bound score of
top-k results, which is also verified in the experiments.
In the following, T2S utilizes the value of tscore_thres as the score lower bound of the top-k results. According to the basic idea of selective retrieval, we denote by PVmx_i the maximum value of attribute Ai satisfying Ai < (1/wi) × (tscore_thres − Σ_{1≤j≠i≤m}(wj × MAXj)). Given PVmx_i, T2S exploits structure EGBIi to determine the positional indexes of the MCR tuples employed in selective retrieval. The tuples of EGBIi are retrieved from the beginning and, as shown in Figure 3, T2S returns the first positional index bmx_i (1 ≤ bmx_i ≤ log2 N) of EGBIi satisfying the condition
EGBIi(bmx_i).eval < PVmx_i ≤ EGBIi(bmx_i − 1).eval.
Here, we denote by EGBIi(0).eval the value of MAXi, and bmx_i is the required positional index of the MCR tuple. It is known that MCRi(bmx_i) is a bit-vector of the checking results of the values in the PIT attribute of PT by EGBFTi(bmx_i). Before performing the sequential scan on PT, T2S first retrieves data from MCRi(bmx_i) (1 ≤ i ≤ m). In order to improve the efficiency of disk retrieval, T2S retrieves MSZ bytes of data from MCRi(bmx_i) at a time (in this paper, the value of MSZ is set to 1M). Let BUFi be the buffer storing the MCRi(bmx_i) data retrieved currently. T2S computes the bit-vector SRB = ∩_{1≤i≤m} BUFi. After the first MSZ bytes are retrieved from MCRi(bmx_i) (1 ≤ i ≤ m), the computed SRB represents the bit-vector corresponding to PT(1, . . . , 8 × MSZ). If the a-th bit in the current SRB is 0, PT(a) can be skipped directly; otherwise PT(a) should be retrieved. When the current SRB is exhausted, T2S retrieves the following MSZ bytes of data from MCRi(bmx_i) (1 ≤ i ≤ m),
re-computes SRB, which represents the bit-vector
corresponding to PT(8×MSZ+1, . . . , 16×MSZ), and
continues the selective retrieval on PT. The process
proceeds until the top-k results are acquired.
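Putting the pieces together, the following Python sketch (illustrative and fully in-memory; the paper streams MCR in MSZ-byte chunks from disk, and all indexing here is 0-based) chooses one MCR level per ranking attribute via EGBI, ANDs the corresponding bit-vectors into SRB, and yields only the positions of PT whose SRB bit is 1.

```python
def selective_scan_order(PT, MCR, EGBI, weights, tscore_thres, max_vals):
    """Yield the positions of PT that selective retrieval actually reads.
    MCR[i][j]: bit-vector (bytes) for attribute i at level j;
    EGBI[i][j]: (bval, eval) boundary values of the 2^(j+1) prefix of SL_i."""
    m = len(weights)
    levels = []
    for i in range(m):
        pv_mx = (tscore_thres - sum(weights[j] * max_vals[j]
                                    for j in range(m) if j != i)) / weights[i]
        # first level whose prefix no longer reaches down to pv_mx
        # (fallback to the deepest level if none qualifies -- illustrative choice)
        b = next((j for j, (_, ev) in enumerate(EGBI[i]) if ev < pv_mx),
                 len(EGBI[i]) - 1)
        levels.append(b)
    srb = bytearray(MCR[0][levels[0]])
    for i in range(1, m):                          # SRB = intersection of the BUF_i
        other = MCR[i][levels[i]]
        for x in range(len(srb)):
            srb[x] &= other[x]
    for a in range(len(PT)):
        if srb[a // 8] & (1 << (a % 8)):           # bit 1: retrieve PT(a)
            yield a
```

The scores of the yielded tuples would then be computed exactly and fed into the same early-termination loop sketched for Section 6.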
7.4 Theoretical analysis
In this part, we analyze the performance of selective
retrieval under Assumption 4.1.
It is known that TSETk does not contain low-score
tuples and the minimum score in TSETthres is not less
than the minimum score in TSETk. And it is verified
in the experiments that the estimated lower-bound
score of top-k results is very close to minimum top-
k score. In this part, we take Σ^m_{i=1} SLi(gdep).Ai, the threshold when the growing phase ends, as tscore_thres.
Under Assumption 4.1, we have bmx_1 = . . . = bmx_m = bmx. According to the basic idea of selective retrieval, ∀pt ∈ PT, when pt.Ai satisfies the condition pt.Ai < tscore_thres − (m − 1), pt is not a top-k result. Let dep_con be the position in SLi of the maximum possible Ai value satisfying the condition; we have 1 − dep_con/N = tscore_thres − (m − 1), i.e., dep_con = N × (m − tscore_thres). The MCR positional index is bmx = ⌈log2 dep_con⌉.
∀pt = PT(a) (1 ≤ a ≤ N), the probability pru_prob that pt can be skipped is 1 − (2^bmx/N)^m. This formula can be explained as follows: (2^bmx/N)^m is the probability that A1, . . . , Am of PT(a) are located in SL1(1, . . . , 2^bmx), . . . , SLm(1, . . . , 2^bmx) respectively, and 1 − (2^bmx/N)^m is the probability that at least one attribute PT(a).Aj is not located in SLj(1, . . . , 2^bmx). In terms of the construction method of MCR and the execution process of T2S, pru_prob can be treated as the proportion of 0-bits in bit-vector SRB; T2S will skip the vast majority of the candidates by selective retrieval. Of course, in the general case that the attributes follow a correlated distribution with correlation coefficient c, pru_prob is affected by the value of c. For example, if c < 0, the current threshold tscore_{thres,c} = Σ^m_{i=1} SLi(gdep_c).Ai < tscore_thres, the current MCR positional index bmx_c = ⌈log2(N × (m − tscore_{thres,c}))⌉ ≥ bmx, and then the current probability to skip a tuple becomes smaller.
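For a quick sanity check of this estimate, the snippet below (illustrative, uniform/independent case) plugs the formulas into Python; with the experimental setting N = 10^8, m = 5 and k = 10 it yields a skip probability above 0.99.

```python
from math import ceil, log2

def skip_probability(N, m, k):
    """Estimated probability (Section 7.4, uniform/independent data) that a
    PT tuple is skipped by selective retrieval: 1 - (2^bmx / N)^m."""
    gdep = N * (k / N) ** (1.0 / m)
    tscore_thres = m * (1 - gdep / N)      # threshold when the growing phase ends
    depcon = N * (m - tscore_thres)        # = m * gdep
    bmx = ceil(log2(depcon))
    return 1 - (2 ** bmx / N) ** m

print(skip_probability(N=10**8, m=5, k=10))   # well above 0.99 in this setting
```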
8 THE COLUMNWISE T2S ALGORITHM
In the above discussion, the tuples in PT are stored
in row-store model and all of the attributes of a tuple
can be retrieved sequentially. However, its problem
is also obvious, that is, even though the ranking
function only involves attributes A1, . . . , Am, T2S has
to retrieve all of the attributes A1, . . . , AM (m < M)
and incurs extra I/O cost. Intuitively, we can de-
compose table PT, and retrieve the required column
files when computing top-k results. Given the schema
PT(MPIL, PIT , A1, . . . , AM ), we keep PT as a set of
column files {CFL, CFT , CF1, . . . , CFM }, here CFL,
CFT and CFi(1 ≤ i ≤ M) are column files corre-
sponding to MPIL, PIT and Ai respectively. For easy
identification, we denote by T2SR the T2S algorithm
performed on row-store model and by T2SC the T2S
algorithm performed on column-store model.
The naive method, i.e. T2SC obtains single tuple of
PT by retrieving CFL, CFT , CF1, . . . , CFm in round-
robin fashion, has the same execution process as
T2SR. The problem of the naive method is that, T2SC
has to retrieve the values from m + 2 column files to
compute the score of a tuple, every retrieval from one
column file often incurs a disk seek operation.
Here, besides the k tuples with the largest scores
seen so far, T2SC also utilizes a relatively larger buffer
TBUF to reduce the number of disk seek operations.
Let TBUF be large enough to keep BS tuples of
T. During its round-robin retrieval on the involved
column files, T2SC reads BS tuples once in a column
file rather than reads a single tuple once, which is
called to be batch round-robin retrieval in this part, and
maintains the tuples in TBUF. In this way, in every
pass of batch round-robin retrieval, T2SC obtains BS
tuples, which amortizes the cost to obtain every tuple.
Then T2SC reads the tuples in TBUF to compute top-
k results. The similar process in T2SC continues until
the early termination condition is satisfied.
The size BS of TBUF is important. T2SC with BS = 1 is the naive method. A larger value of BS lowers the disk-seek cost per tuple, but raises the cost of maintaining tuples in memory and makes T2SC retrieve more extra tuples (BS tuples are retrieved in each pass). Under Assumption 4.1, the number NUM_ret of tuples retrieved by T2SR is ⌈(NUM_{gdep} × M/m)/OG⌉ × OG × (2^bmx/N)^m. Obviously, given BS = NUM_ret, T2SC has the lowest cost of disk seek operations but may maintain too many tuples. In terms of the application context and the user requirement, this paper sets an upper-bound value BS_mx and a lower-bound value BS_mn for the size of TBUF, and takes the median of NUM_ret, BS_mn and BS_mx as the value of BS. In this paper, we set BS_mn = 100 and BS_mx = 100000. Furthermore, when bit-vector SRB is exhausted, T2SC finishes the current batch round-robin retrieval even though the number of tuples retrieved is fewer than BS. Given the value of BS, T2SC may retrieve at most BS more tuples than T2SR, so the scan depth of T2SC can be estimated to be ⌈(NUM_{gdep} × M/m + BS/(2^bmx/N)^m)/OG⌉ × OG.
Of course, the batch round-robin retrieval above is straightforward, and a new retrieval scheduling method as in [2] may provide better performance for T2SC. However, the emphasis of T2S is to perform top-k query on the presorted table with selective retrieval, and T2SC is just a columnwise version of T2S. Therefore, we adopt the straightforward method in this paper.
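The batch round-robin retrieval can be sketched as follows (illustrative; read_values is a hypothetical column-file reader that returns up to BS decoded values per call and an empty list at end of file, and error handling is omitted).

```python
def batch_round_robin(column_files, BS):
    """Sketch of T2SC's batch round-robin retrieval: read BS values at a time
    from each involved column file and re-assemble them into TBUF, so one
    disk seek per file is amortized over BS tuples."""
    while True:
        columns = [f.read_values(BS) for f in column_files]   # hypothetical reader
        if not columns[0]:
            return
        # TBUF: up to BS re-assembled tuples (MPIL, PIT, A_1, ..., A_m)
        tbuf = list(zip(*columns))
        yield tbuf
```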
9 DISCUSSION ABOUT DATA STRUCTURES
This section discusses the cost of pre-computing the
data structures and the update processing.
The construction of presorted table PT comprises
the major cost of pre-construction operation. Given
sorted list SLi(PI, Ai) (the tuple length is 16 bytes),
the cost of generating PSLi(PIL) is 56N bytes. Then,
PSL1, . . . , PSLM are merged to generate MPIL; its cost is 8MN + 8N bytes. MPIL is used as the sort key to sort T; the cost of sorting is (M + 1) × 8N + (M + 2) × 24N bytes, consisting of the cost to read MPIL and T, and the cost to generate sorted partitions and perform the merging operation. Adding up the costs above, the cost of generating PT is 32N × (3M + 2) bytes.
The structure GTS is built for early termination checking. Its construction cost is 32M × N/OG bytes, consisting of the cost to retrieve data and output data. Note that the costs of generating PT and GTS are not affected by the data distribution.
For selective retrieval, we construct the structures
APTS, EGBFT and MCR, within which only the cost of
APTS construction depends on the data distribution.
Given two sorted lists, under Assumption 4.1, the scan depth to find Kmax terminating tuples is N × (Kmax/N)^{1/2}. Of course, if the attributes follow positive or negative correlation, the actual scan depth is smaller or larger than N × (Kmax/N)^{1/2} respectively. Here, considering that constructing PT is the major cost in pre-construction, we take N × (Kmax/N)^{1/2} as the scan depth to find Kmax terminating tuples, and the cost to build APTS is C(M, 2) × [32N × (Kmax/N)^{1/2} + 16·M·Kmax] bytes; the first part in the bracket is the cost to find the terminating tuples, the second part is the cost to retrieve and output the specified tuples. The cost to generate EGBFT is M × [16 × (2N − 2) + (2N − 2) × log2(1/fpr)/(8 × ln 2)] bytes; the first part in the bracket is the cost to retrieve the sorted lists, the second part is the cost to output the bloom filters. The cost to build EGBI is rather small and we do not discuss it here. The cost to build MCR is M × [8N + (2N − 2) × log2(1/fpr)/(8 × ln 2) + N × log2 N/8] bytes; the first part in the bracket is the cost to retrieve attribute PIT in PT, the second part is the cost to load EGBFT, and the third part is the cost to output the MCR data.
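The cost formulas above can be collected into a small calculator (illustrative transcription in Python; the APTS term uses the uniform/independent assumption as stated).

```python
from math import comb, log, log2, sqrt

def preconstruction_costs(N, M, OG=10000, Kmax=1000, fpr=0.001):
    """I/O cost formulas from this section, in bytes read/written."""
    bf = (2 * N - 2) * log2(1 / fpr) / (8 * log(2))    # bloom-filter output term
    return {
        "PT":    32 * N * (3 * M + 2),
        "GTS":   32 * M * N / OG,
        "APTS":  comb(M, 2) * (32 * N * sqrt(Kmax / N) + 16 * M * Kmax),
        "EGBFT": M * (16 * (2 * N - 2) + bf),
        "MCR":   M * (8 * N + bf + N * log2(N) / 8),
    }

print(preconstruction_costs(N=10**8, M=10))
```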
Obviously, the total cost of pre-construction in T2S
is high on massive data. Next we discuss how to
deal with update operation in T2S. We denote the
new data periodically obtained by Dnew, and the old
data by Told. With the increasing volume and the disk
IO bottleneck, data usually is stored in read/append-
only mode in typical massive data applications [8].
Using the characteristic, we do not need to re-compute
the data structures for every update, but support ef-
ficient incremental-update/batch-processing for T2S.
Note that Told is much larger than Dnew. Taking a similar approach to C-Store [19], we do not update Told and its related data structures if the size of the accumulated new data (stored in Tnew) does not reach a certain proportion of the old data (for example 5%). Due to the read/append-only characteristic, the old data structures are not affected by the new data. T2S performs top-k computation on both the old data and the new data, and returns the final results. Depending on its volume and the performance requirement, Tnew can be stored in row-store form, or in the same data structures as T2S (which is used in this paper). We re-compute the data structures on Tnew when new data arrives periodically; this update cost is relatively low. If the size of Tnew is large enough, we merge Tnew and Told and re-compute the data structures at a proper time (when the system is idle), which amortizes the relatively high cost and meanwhile keeps a good performance.
10 PERFORMANCE EVALUATION
10.1 Experimental settings
To evaluate the performance of T2S, we implement it
in Java with jdk-8u20-windows-x64. The experiments
are executed on LENOVO ThinkCentre M8400 (Intel
(R) Core(TM) i7-3770 CPU @ 3.40GHz (8 CPUs) + 32G
memory + 64 bit windows 7 + Seagate ST1000DM003
(1TB)). T2S evaluated here includes T2SR and T2SC,
executing on PT in row-store model and column-store
model respectively. In the experiments, the perfor-
mance of T2S is evaluated against naive algorithm,
LARA, TA and TKEP. The naive method performs
sequential scan on the table which is kept in column-
store model and only retrieves the required column
files. LARA [17] is a latest sorted-list-based algorithm
and has superior performance. The sorted lists in
LARA are retrieved by Java’s BufferedInputStream
class, which has an internal buffer array and gives LARA the ability to perform batched I/Os as in [18]. In order to
compare T2S with TA-style algorithms with random
access, we implement TA as in [1], [7]. It is found that, due to the poor performance of random access on massive data, TA is orders of magnitude slower than the other algorithms (7315.819s when the tuple number is 10^8 and the attribute number is 5), so we only provide the experimental results of TA in the experiment on real data.
Although TKEP [11] is an approximate method, we
still evaluate it here for more extensive comparison.
TABLE 2
Parameter Settings
Parameter                                  Used values
Tuple number (10^8) (syn)                  1 ∼ 20
Result size (syn)                          10 ∼ 50
Used attribute number (syn)                2 ∼ 6
Total attribute number (syn)               5 ∼ 20
Negative correlation coefficient (syn)     -0.8 ∼ 0
Positive correlation coefficient (syn)     0 ∼ 0.8
Result size (real)                         5 ∼ 25
In the experiments, we evaluate the performance
of T2S in terms of several aspects: tuple number
(N), result size (k), used attribute number (m), total
attribute number (M), correlation coefficient (c).
The experiments are executed on three data sets:
Fig. 4. The result of pre-construction operation: (a) construction time; (b) storage cost (PT, EGBFT, MCR, GTS, APTS, EGBI).
Fig. 5. The result of component comparison: (a) selective retrieval (SR vs. NoSR); (b) different APTS methods (apts, random, single).
two synthetic data sets (independent distribution
and correlated distribution) and a real data set.
The used parameter settings are listed in Table 2.
For the correlated distribution, the first two attributes have the specified correlation coefficient, while the remaining attributes follow the independent distribution. In order to generate two sequences of random numbers with correlation coefficient c, we first generate two sequences of uncorrelated random numbers X1 and X2, then a new sequence Y1 = c × X1 + √(1 − c²) × X2 is generated, and we obtain two sequences X1 and Y1 with the given correlation coefficient c. The real data used is the HIGGS Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/HIGGS#). The default ranking function used is F = Σ^m_{i=1} Ai (the effect of different weights is evaluated in Section 10.10). The maximum limit of candidates in the in-memory hash-table is set to 1 × 10^7.
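The data-generation step described above can be reproduced with a few lines of NumPy (illustrative sketch; variable names and the seed are arbitrary):

```python
import numpy as np

def correlated_pair(n, c, seed=0):
    """Generate two attribute columns with correlation coefficient c,
    as described above: Y1 = c*X1 + sqrt(1 - c^2) * X2."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.random(n), rng.random(n)
    y1 = c * x1 + np.sqrt(1 - c ** 2) * x2
    return x1, y1

x1, y1 = correlated_pair(10**6, c=-0.5)
print(np.corrcoef(x1, y1)[0, 1])   # close to -0.5 up to sampling error
```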
10.2 Construction and component comparison
The result of pre-construction operation is shown in
Figure 4. It is illustrated in Figure 4(a) that the pre-
construction time increases linearly with data volume
and the operation to construct the presorted table
consumes most of the construction time. The storage
cost of the data structures is depicted in Figure 4(b).
The total storage cost of the data structures is around 220GB at N = 20 × 10^8. Figure 5(a) compares T2SR with and without selective retrieval. Although T2S with selective retrieval skips most of the tuples, its performance advantage over T2S without selective retrieval is not very significant due to the relatively poor skip performance in Java. Figure 5(b) compares T2SR with different APTS selection strategies: random selection and the tuples with the largest values in one attribute. T2S with the APTS obtained in this paper shows the better performance.
Fig. 6. The effect of tuple number: (a) execution time; (b) maintained tuple number; (c) lower-bound score; (d) scan depth; (e) the I/O cost; (f) the pruning ratio.
10.3 Exp 1: The effect of tuple number
Given m = 5, M = 10, k = 10 and c = 0, experiment 1 evaluates the performance of T2S on varying tuple numbers. As shown in Figure 6(a), T2SC runs 10.97 times faster than LARA, 4.61 times faster than TKEP and 12.09 times faster than the naive algorithm; T2SR runs 5.22 times faster than LARA, 2.14 times faster than TKEP and 5.59 times faster than the naive algorithm. Due to less data retrieved, T2SC runs faster than T2SR. As the tuple number increases, the speedup ratio of T2S over LARA becomes larger. For example, T2SC runs 6.54 times faster than LARA at N = 1 × 10^8, while it runs 15.27 times faster at N = 20 × 10^8. As
shown in Figure 6(b), T2SC maintains 1203.35 times
fewer tuples than LARA and 11 times fewer tuples
than TKEP, T2SR maintains 12033509.56 times fewer
tuples than LARA and 110238.96 times fewer tuples
than TKEP. Only k tuples are maintained in T2SR
and naive algorithm. And the size of TBUF in T2SC
is set to be 100000 in experiment 1. The estimated
tuple number NUMgdep maintained by LARA is also
shown in Figure 6(b), and the actual value follows
our analysis in Section 4. The effect of lower-bound
score of top-k results, which is computed by structure
APTS, is illustrated in Figure 6(c). Here, we divide
the computed lower-bound score by the minimum
score of top-k results to report the normalized result.
It is seen that the lower-bound score computed in
T2S is very close to exact top-k threshold. Figure 6(d)
reports scan depths of those algorithms. The scan
depth of LARA (TKEP) in growing phase is around
4.7 times fewer than those of T2SC and T2SR, because
LARA (TKEP) retrieves one tuple from each sorted list
respectively (5 sorted lists total) in one pass of round-
robin retrieval. Due to early termination, T2SC and
T2SR only involve around 14% of the table, in which
vast majority of tuples are skipped. The estimated
scan depths are also provided in Figure 6(d), and the
actual values also conform to our analysis in Section
4, 5 and 8. Figure 6(e) depicts the I/O costs of those
algorithms. It is shown that, T2SC incurs 772.81 times
less I/O cost than LARA and 114.81 times less I/O
cost than TKEP, T2SR incurs 347.84 times less I/O cost
than LARA and 53.31 times less I/O cost than TKEP.
The I/O cost involved in LARA gradually exceeds that involved in the naive algorithm. As the tuple number increases, the number of tuples maintained in LARA exceeds the limit of the allocated memory, and the memory-disk exchange operation is performed to complete the growing phase; the exchange number is 2 at N = 1 × 10^8 and rises to 24 at N = 20 × 10^8, incurring a much higher I/O cost, while the I/O cost of the naive algorithm grows linearly with the tuple number. As illustrated in Figure 6(e), the I/O cost incurred in T2SC and T2SR first becomes higher when the tuple number increases from 1 × 10^8 to 15 × 10^8, then decreases at N = 20 × 10^8. This can be explained by Figure 6(f). When the tuple number rises fivefold from 1 × 10^8 to 5 × 10^8, the positional index of MCR used for selective retrieval increases from 24 to 26 and the covered exponential gap expands fourfold, which makes the pruning ratio rise. When the tuple number rises threefold from 5 × 10^8 to 15 × 10^8, the positional index of MCR increases from 26 to 28 and the exponential gap expands fourfold, which makes the pruning ratio decline. At N = 20 × 10^8, the used positional index of MCR remains 28 and the pruning ratio increases. The theoretical pruning ratio also declines at N = 15 × 10^8, and the actual value has a similar trend as the theoretical value.
10.4 Exp 2: The effect of result size
Given m = 5, M = 10, N = 10 × 10^8 and c = 0, experiment 2 evaluates the performance of T2S on varying result sizes. As shown in Figure 7(a), T2SC runs 11.28 times faster than LARA, 4.19 times faster than TKEP and 11.5 times faster than the naive algorithm; T2SR runs 5.94 times faster than LARA, 2.2 times faster than TKEP and 5.95 times faster than the naive algorithm. For T2SC, its execution time has a sudden rise at k = 30 because much more data is retrieved there. As depicted in Figure 7(b), T2SC maintains 1381.13 times fewer tuples than LARA and 13.12 times fewer tuples than TKEP, and T2SR maintains 5871375.63 times fewer tuples than LARA and
56081.46 fewer tuples than TKEP. In experiment 2, the
Fig. 7. The effect of result size. (a) Execution time; (b) maintained tuple number; (c) the I/O cost; (d) the pruning ratio (all plotted against result size).
Fig. 8. The effect of used attribute number. (a) Execution time; (b) maintained tuple number; (c) the I/O cost; (d) the pruning ratio (all plotted against used attribute number).
In experiment 2, the size of TBUF in T2SC is set to 100000, while the number of tuples maintained in LARA grows linearly with the result size. The I/O cost is reported in Figure 7(c): T2SC incurs 467.12 times less I/O cost than LARA and 41.36 times less than TKEP, while T2SR incurs 201.19 times less I/O cost than LARA and 17.41 times less than TKEP. The sudden rise of T2SC and T2SR in Figure 7(c) can be explained by Figure 7(d). When the result size is not greater than 20, the positional index of MCR used for selective retrieval is 27; afterwards the positional index increases by 1 and remains 28. This leads to a drop of the pruning ratio at k = 30, as illustrated in Figure 7(d). The trend of the actual pruning ratio basically conforms to the theoretical results; the earlier decline at k = 20 is an artifact of the actually generated data set.
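The flat curves of T2SC in Figure 7(b) come from this fixed-size buffer: TBUF never holds more than its configured capacity, regardless of k or N. The sketch below shows one simple way such a bounded candidate buffer can be maintained while the presorted table is scanned; the class and method names are illustrative assumptions, not the paper's actual structure.

    import heapq

    # Illustrative bounded candidate buffer: keep at most `capacity` candidates,
    # always evicting the candidate with the smallest score first.
    class BoundedCandidateBuffer:
        def __init__(self, capacity):
            self.capacity = capacity      # e.g. 100000, as TBUF is set in experiment 2
            self.heap = []                # min-heap of (score, tuple_id)

        def offer(self, score, tuple_id):
            if len(self.heap) < self.capacity:
                heapq.heappush(self.heap, (score, tuple_id))
            elif score > self.heap[0][0]:     # better than the current worst candidate
                heapq.heapreplace(self.heap, (score, tuple_id))

        def top_k(self, k):
            return heapq.nlargest(k, self.heap)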
10.5 Exp 3: The effect of used attribute number
Given k = 10, M = 10, N = 10 × 10^8 and c = 0, experiment 3 evaluates the performance of T2S on varying used attribute numbers.
Fig. 9. The effect of total attribute number. (a) Execution time; (b) maintained tuple number; (c) the I/O cost; (d) the pruning ratio (all plotted against table width).
As shown in Figure 8(a), T2SC runs 9.14 times faster than LARA, 4.23 times faster than TKEP and 79.1 times faster than the naive algorithm, while T2SR runs 4.77 times faster than LARA, 2.38 times faster than TKEP and 66.54 times faster than the naive algorithm. When the used attribute number increases from 2 to 6, the execution time of LARA grows exponentially, while the execution time of the naive algorithm grows linearly; at m = 5, the execution time of LARA exceeds that of the naive algorithm. The execution times of T2SC and T2SR also grow rapidly with a larger value of m, due to a greater scan depth and a smaller pruning ratio. As shown in Figure 8(b), T2SC maintains 18109.16 times fewer tuples than LARA and 47.86 times fewer tuples than TKEP, while T2SR maintains 8091748.76 times fewer tuples than LARA and 385764.32 times fewer tuples than TKEP. The size of TBUF in T2SC is limited to between 100 and 100000, while the number of tuples maintained in LARA grows exponentially as more attributes are used in the ranking function. As illustrated in Figure 8(c), T2SC incurs 6429.31 times less I/O cost than LARA and 4464.96 times less than TKEP, while T2SR incurs 2078.97 times less I/O cost than LARA and 1315.83 times less than TKEP. The I/O cost of LARA is much lower than that of the naive algorithm for small values of m, but from m = 5 onwards it exceeds that of the naive algorithm; the memory-disk exchange operation in LARA incurs a high I/O cost. At m = 2 or m = 3, LARA does not need any memory-disk exchange; as more attributes are used, more tuples have to be maintained, and the number of memory-disk exchange operations rises to 26 at m = 6. In experiment 3, the positional index of MCR used in selective retrieval increases with a larger m, which makes the pruning ratio decline, as shown in Figure 8(d). Even at m = 6, however, the pruning effect is still good and 94% of the tuples are skipped.
Fig. 10. The effect of negative correlation. (a) Execution time; (b) maintained tuple number; (c) the I/O cost; (d) the pruning ratio (all plotted against the negative correlation coefficient).
10.6 Exp 4: The effect of total attribute number
Given k = 10, m = 5, N = 10 × 10^8 and c = 0, experiment 4 evaluates the performance of T2S on varying total attribute numbers. As shown in Figure 9(a), T2SC runs 11.51 times faster than LARA, 4.25 times faster than TKEP and 13.7 times faster than the naive algorithm, while T2SR runs 6.74 times faster than LARA, 2.49 times faster than TKEP and 8.03 times faster than the naive algorithm. In experiment 4, the execution times of LARA and the naive algorithm remain unchanged, while the execution times of T2SC and T2SR visibly increase. With a larger value of M, LARA and the naive algorithm only need to retrieve the involved column files, while retrieval on the presorted table (row-oriented or column-oriented) in T2SC and T2SR amounts to the round-robin retrieval on all the sorted lists. As illustrated in Figure 9(b), T2SC maintains 1148.14 times fewer tuples than LARA and 10.74 times fewer tuples than TKEP, while T2SR maintains 11481433.2 times fewer tuples than LARA and 107469 times fewer tuples than TKEP. As depicted in Figure 9(c), T2SC incurs 906.47 times less I/O cost than LARA and 140.88 times less than TKEP, while T2SR incurs 445.55 times less I/O cost than LARA and 69.24 times less than TKEP. Due to the fixed value of m and the independent distribution, the positional index of MCR utilized in selective retrieval is constant (27). The pruning ratio is depicted in Figure 9(d): it increases slightly from 0.99 at M = 5 to 0.9945 at M = 20. With a larger value of M, more tuples from the irrelevant sorted lists are combined into the presorted table, which gives tuples a higher chance of being skipped.
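The dependence on M discussed above follows from the access pattern that defines the presorted table: one pass of round-robin retrieval takes one entry from every sorted list, so every list contributes to the scan no matter how many attributes the query uses. The sketch below only illustrates that access pattern; the in-memory list representation is an assumption made for the example, not the paper's storage layout.

    # Illustrative round-robin retrieval over M sorted lists: each pass takes one
    # (tuple_id, value) entry from every list, so a pass always touches all M lists.
    def round_robin(sorted_lists):
        depth = max(len(lst) for lst in sorted_lists)
        for pos in range(depth):              # pass number corresponds to scan depth
            for lst in sorted_lists:
                if pos < len(lst):
                    yield pos, lst[pos]

    lists = [[("t3", 0.9), ("t1", 0.7)],      # list sorted on attribute A1
             [("t1", 0.8), ("t2", 0.6)],      # list sorted on attribute A2
             [("t2", 0.95), ("t3", 0.5)]]     # list sorted on attribute A3
    for pos, entry in round_robin(lists):
        print(pos, entry)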
10.7 Exp 5: The effect of negative correlation
Given k = 10, M = 10, m = 3 and N = 10 × 10^8, experiment 5 evaluates the performance of T2S on varying negative correlation coefficients.
Fig. 11. The effect of positive correlation. (a) Execution time; (b) maintained tuple number; (c) the I/O cost; (d) the pruning ratio (all plotted against the positive correlation coefficient).
As shown in Figure 10(a), T2SC runs on average 18.71 times faster than LARA and 13.07 times faster than TKEP, while T2SR runs 6.25 times faster than LARA and 4.49 times faster than TKEP. The significantly longer execution time of LARA is due to its scan depth, which grows exponentially. The execution times of T2S and TKEP increase significantly with a larger value of |c| because of a greater scan depth and a worse pruning effect. For the naive algorithm, the data distribution does not affect its execution, so its execution time remains unchanged. As |c| becomes larger, the execution times of the other algorithms gradually exceed that of the naive algorithm. At c = −0.6, the scan depth of LARA in the growing phase is 226200848 and the early pruning of TKEP does not work, so from c = −0.6 onwards TKEP has the same execution as LARA. As illustrated in Figure 10(b), T2SC maintains 15816.32 times fewer candidates than LARA and 2617.06 times fewer candidates than TKEP, while T2SR maintains 37527318 times fewer candidates than LARA and 26109850 times fewer candidates than TKEP. As shown in Figure 10(c), T2SC incurs 492.19 times less I/O cost than LARA and 119.63 times less than TKEP, while T2SR incurs 124.18 times less I/O cost than LARA and 29.99 times less than TKEP. As |c| becomes larger, the I/O costs of the other algorithms gradually exceed that of the naive algorithm. Due to the effect of negative correlation, as depicted in Figure 10(d), the pruning ratio of T2S declines significantly as |c| becomes larger.
10.8 Exp 6: The effect of positive correlation
Given k = 10, M = 10, m = 3 and N = 10 × 10^8, experiment 6 evaluates the performance of T2S on varying positive correlation coefficients. As shown in Figure 11(a), T2SC runs 8 times faster than LARA, 5.74 times faster than TKEP and 413.93 times faster than the naive algorithm, while T2SR runs 8.33 times faster than LARA, 6.23 times faster than TKEP and 454.53 times faster than the naive algorithm.
Fig. 12. The effect of real data. (a) Execution time; (b) maintained tuple number; (c) the I/O cost; (d) the pruning ratio (all plotted against result size; TA is included in (a)-(c)).
With a larger value of |c|, the execution times of these algorithms, except for the naive algorithm, decline. As illustrated in Figure 11(b), when |c| increases, the numbers of tuples maintained in T2S and the naive algorithm remain unchanged, the number of tuples maintained in LARA is reduced due to a smaller scan depth, and the number of tuples maintained in TKEP increases because of a worse pruning effect. T2SC maintains 14878.88 times fewer candidates than LARA and 87.08 times fewer candidates than TKEP, while T2SR maintains 148788.58 times fewer candidates than LARA and 870.84 times fewer candidates than TKEP. As depicted in Figure 11(c), T2SC incurs 22709.58 times less I/O cost than LARA and 24632.18 times less than TKEP, while T2SR incurs 7374.15 times less I/O cost than LARA and 8011.62 times less than TKEP. Although the scan depth of the growing phase declines from 2016625 to 668231, the scan depth of the shrinking phase stays between 3431319 and 4589297 in experiment 6, so the number of bytes retrieved by LARA does not vary much with a larger value of |c|. The I/O costs of T2SC and T2SR show a similar trend, which is determined by the pruning operation and the scan depth. The pruning ratio of selective retrieval is illustrated in Figure 11(d); it is found that the pruning ratio shows a declining trend as the positive correlation coefficient increases.
10.9 Exp 7: The effect of real data
The real data, the HIGGS Data Set, is obtained from the UCI Machine Learning Repository. It contains 11,000,000 tuples with 28 attributes. We select three features (2, 7, 12) with relatively large variances to compute the top-k results. In experiment 7, the performance of T2S is evaluated on varying result sizes. As shown in Figure 12(a), T2SC runs 2.77 times faster than LARA, 2.71 times faster than TKEP and 8.01 times faster than the naive algorithm, while T2SR runs 1.42 times faster than LARA, 1.39 times faster than TKEP and 4.46 times faster than the naive algorithm.
Fig. 13. The effect of different weights. (a) Query time; (b) the pruning ratio (both plotted against the first weight value).
Because the volume of the real data is relatively small, we also evaluate the TA algorithm here, and we find that TA runs orders of magnitude slower than the other algorithms. During its execution, TA keeps the positional indexes of the tuples seen before to avoid the duplicate computation of scores. As illustrated in Figure 12(b), T2SC maintains 3092.91 times fewer tuples than LARA and 27.07 times fewer tuples than TKEP, while T2SR maintains 25629.18 times fewer tuples than LARA and 142.97 times fewer tuples than TKEP. The size of TBUF in T2SC is fixed to 100, while the number of tuples maintained in LARA increases from 217992 to 375220. As depicted in Figure 12(c), T2SC incurs 12227.52 times less I/O cost than LARA and 7300.48 times less than TKEP, while T2SR incurs 1546.11 times less I/O cost than LARA and 831.02 times less than TKEP. When the result size increases from 5 to 25, the positional index of MCR used in selective retrieval first increases from 15 to 16, then remains 16, and finally rises to 17. The pruning ratio, depicted in Figure 12(d), reflects this variation of the positional index of MCR.
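The variance-based choice of scoring attributes used above for the HIGGS data can be reproduced with a few lines; the sketch below uses a random stand-in for the 11,000,000 × 28 table and an equally weighted sum, so the array, its size and the weights are assumptions for illustration only.

    import numpy as np

    # Illustrative sketch: pick the three attributes with the largest variances
    # and score tuples with an equally weighted sum of those attributes.
    rng = np.random.default_rng(0)
    data = rng.random((10000, 28))              # stand-in for the HIGGS table
    variances = data.var(axis=0)
    selected = np.argsort(variances)[-3:]       # the paper selects features 2, 7, 12
    scores = data[:, selected].sum(axis=1)
    k = 10
    top_k_rows = np.argsort(scores)[-k:][::-1]  # row indices of the k largest scores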
10.10 Exp 8: The effect of different weights
Given k = 10, M = 10, m = 5, N = 10 × 10^8 and c = 0, experiment 8 evaluates the performance of T2S on varying weights of the ranking function. The weight of the first attribute is set to 1, 2, 3, 4 and 5, while the other weights are fixed to 1. As shown in Figure 13(a), the execution time of the naive algorithm remains unchanged, while the execution time of LARA increases gradually from 344.094s to 416.611s due to an increased scan depth in the shrinking phase. The execution times of T2S and TKEP decline slightly: the greater weight on the first attribute makes it more important than the other attributes, so the MCR positional index for selective retrieval or early pruning decreases, as depicted in Figure 13(b).
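The weight setting of experiment 8 suggests the usual linear form of the ranking function, a weighted sum of the used attributes; the small sketch below spells that form out with the weight vectors used above (first weight varied, remaining weights 1). Treat the exact aggregate as an assumption, since this section only states the weights.

    # Illustrative linear ranking function: score(t) = sum_i w_i * t[i].
    def score(tuple_values, weights):
        return sum(w * v for w, v in zip(weights, tuple_values))

    m = 5
    t = [0.9, 0.4, 0.7, 0.2, 0.5]                 # example attribute values of one tuple
    for first_weight in (1, 2, 3, 4, 5):          # the weight setting of experiment 8
        weights = [first_weight] + [1] * (m - 1)  # the other attributes keep weight 1
        print(first_weight, score(t, weights))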
10.11 Exp 9: The result of update operation
Given N = 10 × 10^8 (Told), experiment 9 evaluates the performance of the update operation in T2S. The update operation is processed as described in Section 9. Initially the accumulated new data Tnew is empty, and the tuple number of the new data Dnew obtained in each period is 10^6.
Fig. 14. The result of update processing. (a) Update time; (b) query time of incremental versus merged processing (both plotted against the tuple number in Tnew, in units of 10^7).
When the volume of Tnew reaches 5% of Told (every fifty days), Tnew is merged with Told. As shown in Figure 14(a), the time to finish the update operation on Tnew alone is at most 1059.477s. The execution time of running T2SR on Told and Tnew separately to compute the top-k results is at most 1.7 seconds longer than that of running T2SR on the merged data set.
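The update policy evaluated in this experiment (accumulate the periodic Dnew into Tnew, answer queries over Told and Tnew separately, and merge once Tnew reaches 5% of Told) can be summarized by the sketch below; the function names and callbacks are placeholders for the operations of Section 9, not their implementation.

    MERGE_THRESHOLD = 0.05    # merge once |Tnew| reaches 5% of |Told|

    # Illustrative sketch of the update policy; `rebuild` stands in for the
    # re-construction of the presorted table and related structures.
    def ingest(t_old_size, t_new, d_new, rebuild):
        t_new.extend(d_new)                        # accumulate the periodic new data
        if len(t_new) >= MERGE_THRESHOLD * t_old_size:
            rebuild(t_new)                         # merge Tnew into Told (Section 9)
            t_new.clear()

    def answer_top_k(query_old, query_new, k):
        # query Told and Tnew separately, then keep the k best of the combined results
        candidates = query_old(k) + query_new(k)
        return sorted(candidates, reverse=True)[:k]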
10.12 Summary
As illustrated in the experiments, T2S outperforms the existing sorted-list-based methods significantly. It is also noted that, given N = 10 × 10^8 and M = 10, it takes 50186.264 seconds for T2S to build the required data structure, and the consumed disk space is 107GB. The pre-construction cost of T2S is relatively high, but the proposed T2S algorithm is still worth the price. The reasons are that (1) T2S obtains one order of magnitude speedup over the existing algorithms, which is a significant improvement; (2) the sorted-list-based algorithms need 31592.302 seconds to build the sorted lists from the row table and require 149GB of disk space, so by comparison the pre-construction cost of T2S is not unacceptable; and (3) with the rapid growth of disk capacity, it is worth spending more space to speed up query evaluation. On data sets with negative correlation, the performance of T2S degrades greatly and is not even as good as that of the naive approach when |c| is large. However, the naive approach only performs better at a large negative correlation coefficient; in the other experiments, it runs orders of magnitude slower than T2S. In practical applications, T2S therefore shows a better overall performance than the naive algorithm. Besides, the negative correlation affects the existing algorithms even more strongly, so the speedup ratio of T2S over the existing algorithms becomes larger as |c| increases.
For the update operation, when the size of Tnew is less than 5% of Told, the cost of the update operation on Tnew is relatively small. When enough new data is received, Tnew is merged with Told. Although the cost of re-constructing the relevant data structures is relatively high, as illustrated in Section 10.11, the re-construction is performed only every fifty days, which amortizes its relatively high cost. Of course, in some applications the re-construction may still seem unacceptable.
In future work, we intend to explore more efficient update operations.
11 CONCLUSION
This paper proposes a novel T2S algorithm to efficiently return top-k results on massive data by sequentially scanning the presorted table, in which the tuples are arranged in the order of round-robin retrieval on the sorted lists. Only a fixed number of candidates needs to be maintained in T2S. This paper presents the early termination checking and the analysis of the scan depth. Selective retrieval is devised in T2S, and it is analyzed that most of the candidates in the presorted table can be skipped. The experimental results show that T2S significantly outperforms the existing algorithms.
ACKNOWLEDGEMENTS
This work was supported in part by the National Basic Research (973) Program of China under grant no. 2012CB316200, the National Natural Science Foundation of China under grant nos. 61402130, 61272046, 61190115, 61173022, 61033015.
Xixian Han is a lecturer in the School of Computer Science and Technology, Harbin Institute of Technology, China. He received his Master's degree and PhD degree from the School of Computer Science and Technology, Harbin Institute of Technology, in 2006 and 2012 respectively. His main research interests include massive data management and data-intensive computing.
Jianzhong Li is Chair of the Department of Computer Science and Engineering at Harbin Institute of Technology, China. He is also a professor in the School of Computer Science and Technology, the Dean of the School of Computer Science and Technology, and the Dean of the School of Software at Heilongjiang University, China. His current research interests include data-intensive computing, wireless sensor networks and CPS.
Hong Gao is a professor in the School of Computer Science and Technology at Harbin Institute of Technology, China. Prof. Gao is the principal investigator of several National Natural Science Foundation projects and has participated in two National Basic Research (973) Programs. Her research interests include wireless sensor networks, cyber-physical systems, massive data management and data mining.
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 

Efficient top k retrieval on massive data

However, it is analyzed in this paper that the numbers of tuples retrieved and maintained in these methods increase exponentially with the attribute number and polynomially with the tuple number and the result size, so they incur high I/O and memory costs on massive data.

This paper proposes a novel table-scan-based algorithm, T2S (Top-k by Table Scan), to compute top-k results on massive data efficiently. Given a table T, T2S first presorts T to obtain the presorted table PT, whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. During its execution, T2S maintains only a fixed and small number of tuples to compute the results. It is proved that T2S has the characteristic of early termination: it does not need to examine all tuples in PT to return the results. The analysis of the scan depth of T2S is also developed. Because the result size k is usually small and the vast majority of the tuples retrieved in PT are not top-k results, this paper devises selective retrieval to skip the tuples in PT which are not query results. The theoretical analysis proves that selective retrieval can reduce the number of retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the data structures are proposed in this paper. Extensive experiments are conducted on synthetic and real-life data sets; the results show that T2S outperforms the existing algorithms significantly.

The contributions of this paper are as follows:
• This paper proposes a novel table-scan-based T2S algorithm to process top-k queries on massive data.
• The early termination checking is presented, along with the analysis of scan depth.
• This paper devises selective retrieval and its performance analysis.
• The experimental results show that T2S has a significant advantage over the existing algorithms.

The rest of the paper is organized as follows. Section 2 reviews related work. Preliminaries are described in Section 3. The execution behavior of sorted-list-based methods is analyzed in Section 4. The T2S algorithm is introduced in Section 5 (the presorting of the table), Section 6 (early termination checking), Section 7 (selective retrieval) and Section 8 (the columnwise algorithm). The discussion about data structures is developed in Section 9. The performance evaluation is provided in Section 10. Section 11 concludes the paper.

(The authors are with the School of Computer Science and Technology, Harbin Institute of Technology, China. E-mail: hanxixian@yeah.net, lijzh@hit.edu.cn, honggao@hit.edu.cn.)
2 RELATED WORK

2.1 Index-based and view-based methods

Xin et al. [22] and Chang et al. [3] propose layered indexing to organize the tuples into multiple consecutive layers; the top-k results can be computed from at most k layers of tuples. Zou et al. [23] propose the layer-based Pareto-Based Dominant Graph to express the dominant relationship between records, and a top-k query is implemented as a graph traversal problem. Lee et al. [16] propose a dual-resolution layer structure (coarse-level layers and fine-level sub-layers); a top-k query can be processed efficiently by traversing the dual-resolution layer through the relationships between tuples. Heo et al. [13] propose the Hybrid-Layer Index, which integrates layer-level filtering and list-level filtering to significantly reduce the number of tuples retrieved in query processing.

Hristidis et al. [14] and Das et al. [4] propose view-based algorithms that pre-construct specified materialized views according to some ranking functions. Given a top-k query, one or more optimal materialized views are selected to return the top-k results efficiently. Xie et al. [21] propose LPTA+ to significantly improve the efficiency of the state-of-the-art LPTA algorithm: the materialized views are cached in memory, so LPTA+ can reduce the iterative calling of the linear programming sub-procedure and thus greatly improve the efficiency over LPTA.

Summary. In practical applications, a concrete index (or view) is built on a specific subset of attributes. Due to the prohibitively expensive overhead of covering all attribute combinations, the indexes (or views) can only be built on a small and selective set of attribute combinations. If the attribute combinations of top-k queries are fixed, index-based (or view-based) methods can provide superior performance. However, on massive data users often issue ad-hoc queries, and it is very likely that the indexes (or views) involved in an ad-hoc query have not been built, which limits the practicability of these methods greatly. Correspondingly, T2S only builds the presorted table, on which a top-k query on any attribute combination can be processed. This reduces the space overhead significantly compared with index-based (or view-based) methods and gives T2S actual practicability.

2.2 Sorted-list-based methods

Fagin et al. [6] propose the TA algorithm to answer top-k queries on sorted lists. During execution, TA reads each list in a round-robin fashion. For each tuple newly seen in some list, TA retrieves the scores of the tuple in the other lists to determine its complete score. The threshold of TA is updated in each pass of retrieval by aggregating the scores of the last seen tuples in the lists. TA terminates when the k-th largest score seen so far is not less than the threshold, and returns the k tuples with the largest scores. Instead of round-robin retrieval, Günter et al. [9] propose Quick-Combine to preferentially access the lists that make the threshold decline most. Akbarinia et al. [1] develop the best position algorithm to speed up top-k processing by means of the information obtained in the process of random access.

In some cases random access is limited or impossible; NRA [7] is proposed with sequential access only and performs round-robin retrieval on the sorted lists. For each retrieved tuple, NRA maintains its lower-bound and upper-bound scores. NRA terminates when the k-th largest lower-bound score seen so far is not less than the upper-bound scores of the other tuples, and returns the k tuples with the largest lower-bound scores. Günter et al. [10] propose Stream-Combine to improve NRA by preferentially accessing the sorted lists which make the threshold decline most. Mamoulis et al. [17] divide the execution of NRA into a growing phase and a shrinking phase, and propose LARA to reduce the computation cost. Pang et al. [18] propose the TBB algorithm to improve the disk access efficiency of NRA by sequential scan. Han et al. [11] develop the TKEP algorithm, which performs early pruning on each candidate encountered in the growing phase to reduce the processing cost.

Summary. Sorted-list-based methods can compute top-k results by retrieving the involved sorted lists and thus support actual top-k queries; they usually have the characteristic of early termination. However, on massive data, TA-style algorithms generate many random seek operations, while NRA-style algorithms retrieve and maintain a large number of tuples, so they cannot process top-k queries efficiently. TKEP can reduce the number of tuples maintained in memory by early pruning. Compared with TKEP, the novelties of T2S are that (1) TKEP is an approximate method while T2S always returns exact results, (2) the number of tuples maintained in TKEP is affected by concrete parameters (such as tuple number or attribute number) and by the underlying data distribution, while T2S always maintains a fixed and small number of candidates, and (3) during execution TKEP has to retrieve every tuple, while T2S skips most of the candidates. Of course, compared with sorted-list-based methods, T2S has the overhead of building the presorted table. The description below shows that (1) the overhead is worthwhile because T2S outperforms the existing methods significantly, and (2) for the required structures, incremental-update/batch-processing can be performed to reflect the latest changes of the sorted lists at a relatively small cost. Of course, in some cases T2S may not be suitable because it needs to presort the table.
For example, in distributed systems each node maintains some attributes, and the attribute values can be sorted in different orders according to user requirements; in this case it is difficult (if not impossible) to build the required presorted table. In many other cases, however, the data set is stored on a stand-alone computer or is horizontally partitioned in a distributed system, and the values in the data set have a default order; T2S is very useful here because of its superior performance.

3 PRELIMINARIES

3.1 Top-k query

Given a table $T(A_1, \ldots, A_M)$ with N tuples, $\forall t \in T$, let $t.A_i$ be the value of $A_i$ in t. Without loss of generality, let $F = \sum_{i=1}^{m} w_i \times t.A_i$ be a ranking function on $A_1, \ldots, A_m$, where $w_i$ is the weight of F on $A_i$. Usually F is a monotone function, i.e. $\forall t_1, t_2 \in T$, if $t_1.A_i \le t_2.A_i$ for all $1 \le i \le m$, then $F(t_1) \le F(t_2)$. In this paper we assume that larger scores are preferred.

Top-k query: Given a table T and a ranking function F, a top-k query returns a k-subset R of T in which $\forall t_1 \in R, \forall t_2 \in (T - R)$, $F(t_1) \ge F(t_2)$.

$\forall t \in T$, its positional index (PI) is a if t is the a-th tuple, denoted by T(a) [12]. We denote by T(S) the tuples in T whose PIs are in the set S, by $T(a, \ldots, b)$ $(a \le b)$ the tuples in T whose PIs are between a and b, and by $T(a, \ldots, b).A_i$ the set of $A_i$ values in $T(a, \ldots, b)$.

3.2 Sorted-list-based methods

In this paper we take LARA [17] to discuss sorted-list-based methods because of its superior performance. We do not take TKEP here because its execution is similar to LARA except for early pruning, and it is an approximate method. In LARA, T is kept as a set of sorted lists $SL_1, \ldots, SL_M$. The schema of $SL_i$ $(1 \le i \le M)$ is $SL_i(PI, A_i)$, where PI is the positional index of the tuple in T and $A_i$ is the corresponding attribute value; $SL_i$ is sorted in descending order of $A_i$. Given the ranking function F, LARA performs round-robin retrieval on $SL_1, \ldots, SL_m$. Let the tuples retrieved in the current pass be $(pi_1, a_1), \ldots, (pi_m, a_m)$; the threshold of LARA is $\tau = F(a_1, \ldots, a_m)$.

The execution of LARA consists of a growing phase and a shrinking phase. In the growing phase, LARA maintains each retrieved candidate t in a set C and updates its lower-bound score $F^{lb}_t = \sum_{i=1}^{m} w_i \times t.A_i^{l}$, where $t.A_i^{l} = t.A_i$ if $t.A_i$ has been obtained in $SL_i$, and otherwise $t.A_i^{l} = MIN_i$ (the minimum value of $A_i$). A priority queue PQ maintains the k candidates with the largest lower-bound scores seen so far; let PQ.min be the minimum lower-bound score in PQ. When $PQ.min \ge \tau$, the growing phase is over. In the shrinking phase, LARA continues its round-robin retrieval, maintains the lower-bound and upper-bound scores of the candidates in C, and updates PQ. For a candidate t, its upper-bound score is $F^{ub}_t = \sum_{i=1}^{m} w_i \times t.A_i^{u}$, where $t.A_i^{u} = t.A_i$ if $t.A_i$ has been seen in $SL_i$, and otherwise $t.A_i^{u} = a_i$ (the last value seen in $SL_i$, an upper bound on $t.A_i$). When $PQ.min \ge F^{ub}_t$ for every $t \in (C - PQ)$, LARA terminates and PQ keeps the top-k results.
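The bound bookkeeping described above can be illustrated with a small sketch. This is a simplified in-memory illustration of how a partially seen candidate's lower and upper bounds are derived from the values seen so far, the per-attribute minima, and the last values read from each sorted list; the class and variable names are illustrative and are not taken from the paper's implementation.

```java
// Simplified illustration of the lower/upper-bound scores maintained for a
// partially seen candidate in an NRA/LARA-style algorithm (Section 3.2).
public class BoundSketch {

    // Lower bound: use the seen value, or MIN_i for attributes not yet seen.
    static double lowerBound(double[] w, double[] seenVals, boolean[] seen, double[] min) {
        double s = 0;
        for (int i = 0; i < w.length; i++) {
            s += w[i] * (seen[i] ? seenVals[i] : min[i]);
        }
        return s;
    }

    // Upper bound: use the seen value, or the last value a_i read from SL_i
    // (the lists are sorted descending, so an unseen value cannot exceed a_i).
    static double upperBound(double[] w, double[] seenVals, boolean[] seen, double[] lastSeen) {
        double s = 0;
        for (int i = 0; i < w.length; i++) {
            s += w[i] * (seen[i] ? seenVals[i] : lastSeen[i]);
        }
        return s;
    }

    public static void main(String[] args) {
        double[] w = {1, 1, 1};
        double[] vals = {0.9, 0.0, 0.8};        // A_2 not yet seen
        boolean[] seen = {true, false, true};
        double[] min = {0, 0, 0};
        double[] a = {0.7, 0.6, 0.75};          // last values read in each sorted list
        System.out.println(lowerBound(w, vals, seen, min));  // 1.7
        System.out.println(upperBound(w, vals, seen, a));    // 2.3
    }
}
```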
4 BEHAVIOR ANALYSIS OF LARA

This section analyzes the execution behavior of LARA. To facilitate the discussion, Assumption 4.1 is made here.

Assumption 4.1: The attributes $A_1, \ldots, A_M$ are distributed uniformly and independently, the range of $A_i$ $(1 \le i \le M)$ is [0, 1], and $\forall t \in T$ the ranking function is $F(t) = \sum_{i=1}^{m} t.A_i$.

Definition 4.1 (Terminating tuple): In the round-robin retrieval on $SL_1, \ldots, SL_m$, a tuple t is a terminating tuple if all of its attributes $A_1, \ldots, A_m$ have been seen.

Let gdep be the scan depth (the positional index reached in the sorted lists) when k terminating tuples are obtained in the round-robin retrieval on $SL_1, \ldots, SL_m$; at this point the growing phase has certainly ended, since $PQ.min \ge \sum_{i=1}^{m} SL_i(gdep).A_i$ is satisfied. We take gdep as the scan depth at which the growing phase ends. $\forall t \in T$, the probability that $(t.PI, t.A_i)$ is among the first gdep tuples of $SL_i$ is $\frac{gdep}{N}$, and the probability that t is a terminating tuple given scan depth gdep is $(\frac{gdep}{N})^m$. $\forall t_1, t_2 \in T$, given scan depth gdep, the event that $t_1$ is a terminating tuple is independent of the event that $t_2$ is a terminating tuple [12]. The number of terminating tuples therefore follows the binomial distribution $BD(N, (\frac{gdep}{N})^m)$, whose expectation is $E(BD) = N \times (\frac{gdep}{N})^m$. Setting $E(BD) = k$, we obtain $gdep = N \times (\frac{k}{N})^{1/m}$.

LARA does not prune candidates in the growing phase; here we analyze the number of candidates it maintains. $\forall t \in T$, let $num_{t,gdep}$ be the number of attributes among $A_1, \ldots, A_m$ of t lying in the first gdep tuples of the corresponding sorted lists. Under Assumption 4.1, $num_{t,gdep}$ follows the binomial distribution $BD_2(m, \frac{gdep}{N})$, and $P(num_{t,gdep} = i) = \binom{m}{i} \times (\frac{gdep}{N})^i \times (1 - \frac{gdep}{N})^{m-i}$. The number $NUM_{gdep}$ of candidates maintained in the growing phase is $\sum_{i=1}^{m} N \times P(num_{t,gdep} = i)$. It is found that $NUM_{gdep}$ increases exponentially with m and polynomially with N and k. On massive data, LARA maintains a large number of candidates and incurs a high memory cost.

Next, we analyze the maximum scan depth sdep of the shrinking phase. In the shrinking phase, $PQ.min \ge \sum_{i=1}^{m} SL_i(gdep).A_i = m \times (1 - \frac{gdep}{N})$. Thus, $\forall t \in (C - PQ)$, the shrinking phase ends definitely when $m \times (1 - \frac{gdep}{N}) \ge F^{ub}_t$. The scan depth sdep, to which the m sorted lists must be retrieved in order to guarantee that t is not a top-k result, corresponds to the case that m − 1 attributes of t have value 1 while the remaining attribute is rather small (only $A_1, \ldots, A_m$ are considered when computing the score of a tuple), i.e. $m \times (1 - \frac{gdep}{N}) \ge (m - 1) + (1 - \frac{sdep}{N})$. We take $sdep = m \times gdep$ as the scan depth at which the shrinking phase ends. In terms of this formula, sdep increases exponentially with m and polynomially with N and k. On massive data, LARA has to retrieve a large number of tuples and incurs a high I/O cost.
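For quick reference, the three estimates derived above under Assumption 4.1 can be collected in one display (this merely restates the formulas of this section; gdep, NUM_gdep and sdep are as defined above):

```latex
\[
  gdep = N\left(\tfrac{k}{N}\right)^{1/m},\qquad
  NUM_{gdep} = \sum_{i=1}^{m} N\binom{m}{i}
      \left(\tfrac{gdep}{N}\right)^{i}\left(1-\tfrac{gdep}{N}\right)^{m-i},\qquad
  sdep = m \cdot gdep .
\]
```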
When the number of maintained candidates exceeds the memory limit, LARA has to use the disk as exchange space and incurs an even higher I/O cost.

It is noted that Assumption 4.1 is taken only to facilitate the discussion; the execution of LARA and T2S does not depend on this assumption. Of course, it is interesting to discuss the case that $A_1, \ldots, A_m$ follow a correlated distribution with correlation coefficient c. If c < 0, they follow a negatively correlated distribution and the attribute values tend to move in opposite directions. Let $gdep_c$ be the scan depth at which the growing phase ends in this case. Obviously $gdep_c > gdep$, so LARA has to maintain more tuples under a negatively correlated distribution; as |c| increases, $gdep_c$ becomes larger and LARA has to maintain many more candidates. If c > 0, the attributes follow a positively correlated distribution and the converse is true.

Aiming at the performance issues of LARA, namely its high memory cost and I/O cost, this paper proposes the table-scan-based T2S algorithm to compute top-k results on massive data efficiently. In its execution, T2S selectively retrieves the tuples in the presorted table and maintains the k tuples with the largest exact scores seen so far. For each tuple retrieved, T2S checks whether the top-k results have already been obtained. The following sections introduce the T2S algorithm step by step.

5 THE PRESORTING OF THE TABLE

This section introduces how to generate the presorted table PT.

Fig. 1. The illustration of sorted lists.

As shown in Figure 1, because sorted list $SL_i$ $(1 \le i \le m)$ only maintains the information of $A_i$, LARA has to keep a retrieved tuple in memory to compute its lower-bound and upper-bound scores. For any retrieved tuple, if we can obtain the required attribute values, we can compute its exact score directly and only need to maintain the k candidates with the largest scores seen so far to compute the top-k results. To compute exact scores, one method is based on the sorted lists, with a process similar to the TA algorithm; it has the characteristic of early termination, but incurs a large number of disk seek operations. Another method performs a sequential scan on the table and returns the k tuples with the largest scores when the scan is over; it has high disk efficiency, but has to read all tuples in the table.

Fig. 2. The illustration of the presorted row table.

In this paper, T2S presorts the table in the order of the round-robin retrieval on the sorted lists, which combines the advantages of the two methods above. The detail of the presorting operation is as follows. Given sorted list $SL_i(PI, A_i)$ $(1 \le i \le M)$, $SL_i$ is first sorted to generate $PSL_i(PIL, PIT, A_i)$ in ascending order of PIT; here PIL is the positional index of the tuple in the sorted list (PIL can be seen as an implicit attribute of $SL_i$) and PIT is the attribute PI in the schema of $SL_i$. $\forall j$ $(1 \le j \le N)$, $PSL_1(j), \ldots, PSL_M(j)$ make up T(j), and we denote by MPIL(j) the minimum value of PIL among $PSL_1(j), \ldots, PSL_M(j)$.
As shown in Figure 2, we merge $PSL_1, \ldots, PSL_M$ and sort the merged table in ascending order of MPIL. The schema of the presorted table PT is $PT(MPIL, PIT, A_1, \ldots, A_M)$. It should be noted that PT does not contain duplicate tuples, since PT is in essence built by sorting T with MPIL acting as the sort key. Of course, given the row table T, we only need to keep the attribute PIL of each $PSL_i$ in ascending order of PIT; the merging of $PSL_1, \ldots, PSL_M$ then generates a list of MPIL values, which is used to sort T and generate PT.

Once the presorted table PT has been generated, T2S performs a sequential scan on PT and maintains the k tuples with the largest scores seen so far until the early termination condition is satisfied; the k tuples maintained at that moment are the top-k results. Let $pt \in PT$ be the tuple retrieved currently. Theorem 5.1 proves that the tuples already obtained from PT include all tuples of $SL_1(1, \ldots, pt.MPIL - 1), \ldots, SL_m(1, \ldots, pt.MPIL - 1)$.

Theorem 5.1: During the sequential scan on PT, let pt be the tuple retrieved currently; the tuples already obtained from PT include all tuples of $SL_1(1, \ldots, pt.MPIL - 1), \ldots, SL_m(1, \ldots, pt.MPIL - 1)$.

Proof: Let $pt \in PT$ be the tuple retrieved currently. $\forall sl \in SL_i(1, \ldots, pt.MPIL - 1)$ $(1 \le i \le m)$, $sl.PIL \le pt.MPIL - 1$, and obviously $MPIL(sl.PI) \le pt.MPIL - 1$. Given that PT is sorted in ascending order of MPIL, all tuples whose MPIL values are less than pt.MPIL have already been obtained. Q.E.D.
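A minimal in-memory sketch of the ordering that PT realizes: for every tuple, compute the minimum positional index MPIL it attains across the m sorted lists, and order the table by ascending MPIL. This deliberately ignores the external-sort and PSL-merging machinery of the paper and assumes the data fits in memory; it only illustrates the resulting order, and all names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

// In-memory sketch of the presorted-table order (Section 5):
// MPIL(t) = min over attributes of t's 1-based position in the descending
// sorted list of that attribute; PT lists the tuples by ascending MPIL.
public class PresortSketch {

    static int[] buildPresortedOrder(double[][] table) {
        int n = table.length, m = table[0].length;
        int[] mpil = new int[n];
        Arrays.fill(mpil, Integer.MAX_VALUE);

        for (int i = 0; i < m; i++) {
            final int col = i;
            Integer[] byAttr = new Integer[n];
            for (int t = 0; t < n; t++) byAttr[t] = t;
            // positional index of each tuple in SL_i (descending order of A_i)
            Arrays.sort(byAttr, (a, b) -> Double.compare(table[b][col], table[a][col]));
            for (int pos = 0; pos < n; pos++) {
                int t = byAttr[pos];
                mpil[t] = Math.min(mpil[t], pos + 1);   // 1-based PIL
            }
        }
        Integer[] order = new Integer[n];
        for (int t = 0; t < n; t++) order[t] = t;
        Arrays.sort(order, Comparator.comparingInt(t -> mpil[t]));  // PT order
        int[] result = new int[n];
        for (int t = 0; t < n; t++) result[t] = order[t];
        return result;
    }

    public static void main(String[] args) {
        double[][] t = {{0.1, 0.9}, {0.8, 0.2}, {0.5, 0.5}};
        System.out.println(Arrays.toString(buildPresortedOrder(t)));
    }
}
```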
6 THE EARLY TERMINATION CHECKING

Since the tuples in PT are arranged in the order of the round-robin retrieval on the sorted lists, T2S naturally holds the characteristic of early termination. Let PQ be the priority queue that maintains the k tuples with the largest scores seen so far, and let PQ.min be the minimum score in PQ. This paper proposes a compact data structure GTS (Gap Threshold Score) to maintain the threshold scores at a specified gap of positional indexes of the tuples in the sorted lists. Given the gap parameter OG, the j-th record $(1 \le j \le \frac{N}{OG})$ in GTS maintains the attribute values $SL_1(j \times OG).A_1, \ldots, SL_M(j \times OG).A_M$.

Let pt be the current tuple retrieved in PT. Theorem 6.1 proves that if the early termination condition $PQ.min \ge \sum_{i=1}^{m} w_i \times GTS(\lfloor \frac{pt.MPIL - 1}{OG} \rfloor).A_i$ is satisfied, the tuples currently maintained in PQ are the top-k results. Here we denote by $GTS(0).A_i$ $(1 \le i \le m)$ the value $MAX_i$ (the maximum value of $A_i$).

Theorem 6.1: Given the ranking function F and the priority queue PQ, let pt be the current tuple retrieved in PT. If $PQ.min \ge \sum_{i=1}^{m} w_i \times GTS(\lfloor \frac{pt.MPIL - 1}{OG} \rfloor).A_i$ is satisfied, the tuples currently in PQ are the top-k results.

Proof: Let pt be the current tuple retrieved in PT. According to Theorem 5.1, the tuples already obtained from PT include all tuples of $SL_1(1, \ldots, pt.MPIL - 1), \ldots, SL_m(1, \ldots, pt.MPIL - 1)$. In the execution of LARA, the scores of the tuples not yet seen would be no larger than $\sum_{i=1}^{m} w_i \times SL_i(pt.MPIL - 1).A_i$. Given $\lfloor \frac{pt.MPIL - 1}{OG} \rfloor \times OG \le pt.MPIL - 1$, if $PQ.min \ge \sum_{i=1}^{m} w_i \times GTS(\lfloor \frac{pt.MPIL - 1}{OG} \rfloor).A_i$, then $PQ.min \ge \sum_{i=1}^{m} w_i \times SL_i(pt.MPIL - 1).A_i$ because of the sortedness of the sorted lists. Q.E.D.

Next we analyze the scan depth of T2S; the analysis again uses Assumption 4.1. First, assume that the tuples in PT contain only attributes $A_1, \ldots, A_m$. At scan depth gdep, k terminating tuples have been generated in LARA and the minimum score of those k terminating tuples is not less than the current threshold. According to the construction method of PT, $\bigcup_{1 \le i \le m} T(SL_i(1, \ldots, gdep).PI)$ are the first $NUM_{gdep}$ tuples in PT. When the first $NUM_{gdep}$ tuples in PT have been retrieved, the early termination condition is satisfied and T2S obtains the top-k results. Actually, the tuples in PT contain attributes $A_1, \ldots, A_M$. To compute an upper bound of the scan depth, we assume that $\bigcup_{m+1 \le i \le M} T(SL_i(1, \ldots, gdep).PI)$ do not contain top-k results. Thus, T2S needs to retrieve $NUM_{gdep} \times \frac{M}{m}$ tuples in PT to return the top-k results. Because the score threshold is computed through structure GTS, the scan depth of T2S can be estimated as $\lceil \frac{NUM_{gdep} \times M/m}{OG} \rceil \times OG$. In the general case that the attributes follow a correlated distribution with correlation coefficient c, the scan depth of T2S can be estimated as $\lceil \frac{NUM_{gdep_c} \times M/m}{OG} \rceil \times OG$, and the discussion is similar to that in Section 4.

For the gap parameter OG, a greater value generates fewer records in structure GTS but a larger scan depth of T2S, and vice versa. In this paper we set OG to 10000; the space overhead of structure GTS is then not significant ($\frac{1}{10000}$ of the original table), and the difference between the actual scan depth and the prolonged scan depth is not greater than 10000.
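Assuming GTS has been loaded into memory as a small two-dimensional array (one row per gap record, with row 0 holding the per-attribute maxima), the check of Theorem 6.1 takes only a few lines; the array layout and all names below are illustrative, not the paper's on-disk format.

```java
// Sketch of the early-termination test of Theorem 6.1. gts[j][i] holds
// SL_i(j*OG).A_i and gts[0][i] = MAX_i; pqMin is the k-th largest exact score
// seen so far; mpil is pt.MPIL of the tuple just read from PT.
public class TerminationSketch {

    static boolean canTerminate(double pqMin, double[] weights,
                                double[][] gts, long mpil, long og) {
        int j = (int) ((mpil - 1) / og);          // floor((MPIL - 1) / OG)
        double threshold = 0;
        for (int i = 0; i < weights.length; i++) {
            threshold += weights[i] * gts[j][i];
        }
        return pqMin >= threshold;                // unseen tuples cannot beat pqMin
    }

    public static void main(String[] args) {
        double[][] gts = {{1.0, 1.0}, {0.6, 0.7}, {0.4, 0.5}};
        double[] w = {1.0, 1.0};
        System.out.println(canTerminate(1.4, w, gts, 25_000, 10_000)); // true (uses gts[2])
    }
}
```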
7 THE SELECTIVE RETRIEVAL STRATEGY

7.1 Basic idea

It is known that the result size k is usually small (for example, in text search engines $k \le 40$) [18], so the vast majority of tuples retrieved in PT are not query results. The monotonicity of the ranking function can be utilized to determine that a tuple in PT cannot be a query result, and such tuples do not need to be retrieved at all. Let $tscore_{thres}$ be a lower-bound score of the top-k results and $MAX_i$ the maximum value of $A_i$. The basic idea of selective retrieval is given in Table 1.

TABLE 1
Basic idea of selective retrieval
$\forall pt \in PT$: if some attribute $A_i$ $(\exists i, 1 \le i \le m)$ satisfies
$pt.A_i < \frac{1}{w_i} \times (tscore_{thres} - \sum_{1 \le j \ne i \le m} w_j \times MAX_j)$,
then pt is not a top-k result.

This can be explained as follows: even if the other m − 1 attributes of pt take their maximum values, $pt.A_i$ is so small that $F(pt)$ is less than $tscore_{thres}$. The computation of $tscore_{thres}$ and the value checking of the relevant attributes are implemented by the data structures pre-constructed below; a small sketch of the attribute-level test follows.
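The condition of Table 1 can be precomputed as one cut-off value per used attribute; a tuple is then skippable as soon as any of its used attributes falls below its cut-off. A small sketch under the paper's notation (weights $w_i$, maxima $MAX_i$, lower bound $tscore_{thres}$); the method names are illustrative.

```java
// Sketch of the selective-retrieval test (Table 1, Section 7.1): pt can be
// skipped if some used attribute A_i satisfies
//   pt.A_i < (tscoreThres - sum_{j != i} w_j * MAX_j) / w_i.
public class PruneSketch {

    // cut-off PV_i for every used attribute
    static double[] cutoffs(double[] w, double[] max, double tscoreThres) {
        int m = w.length;
        double full = 0;
        for (int j = 0; j < m; j++) full += w[j] * max[j];
        double[] pv = new double[m];
        for (int i = 0; i < m; i++) {
            pv[i] = (tscoreThres - (full - w[i] * max[i])) / w[i];
        }
        return pv;
    }

    static boolean canSkip(double[] tuple, double[] pv) {
        for (int i = 0; i < pv.length; i++) {
            // even maximal values elsewhere cannot lift F(pt) to tscoreThres
            if (tuple[i] < pv[i]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        double[] w = {1, 1, 1}, max = {1, 1, 1};
        double[] pv = cutoffs(w, max, 2.7);                 // each cut-off is 0.7
        System.out.println(canSkip(new double[]{0.9, 0.65, 0.95}, pv)); // true
    }
}
```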
7.2 The pre-construction of data structures

7.2.1 Attribute-pair terminating-tuple set (APTS)

This paper proposes APTS to determine the lower-bound score of the top-k results. Of course, any k tuples can provide a lower bound; the key point is to determine a bound that is very close to the exact threshold, i.e. the minimum top-k score. Given that the result size k is usually small, an upper limit $K_{max}$ can be specified in the context of the actual application. With dimensionality M, APTS maintains $\binom{M}{2}$ files, each of which keeps $K_{max}$ tuples of T. $\forall 1 \le a < b \le M$, $APTS_{a,b}$ is the file in APTS which keeps the first $K_{max}$ terminating tuples of the round-robin retrieval on $SL_a$ and $SL_b$. The total number of tuples maintained in APTS is $\binom{M}{2} \times K_{max}$. For example, given $N = 10^9$, $M = 100$ and $K_{max} = 1000$, the total number of tuples in APTS is 4,950,000, less than 0.5% of the original table.

7.2.2 Exponential gap information

We first give the definition of the exponential gap bloom filter table.

Definition 7.1 (Exponential gap bloom filter table): Given sorted list $SL_i$ with N tuples, $EGBFT_i$ is the exponential gap bloom filter table for $SL_i$ if it satisfies: (1) $|EGBFT_i| = \log_2 N$, and (2) $EGBFT_i(j)$ is a bloom filter constructed on $SL_i(1, \ldots, 2^j).PI$.

The size of $EGBFT_i$ depends on the tuple number and the false positive rate of the bloom filters. We set the false positive rate to 0.001 in this paper; the size of $EGBFT_i$ is then less than 23% of the size of $SL_i$ [11]. This paper also presents structure $EGBI_i$ to maintain the maximum and minimum $A_i$ values of $SL_i$ within the range covered by the tuples of each $EGBFT_i$ entry. The schema of $EGBI_i$ is (bval, eval), with $EGBI_i(j).bval = SL_i(1).A_i$ and $EGBI_i(j).eval = SL_i(2^j).A_i$. T2S builds EGBFT and EGBI on each sorted list.

7.2.3 Membership checking result (MCR)

This part introduces how to construct structure MCR. Given the presorted table PT, the file $MCR_i$ $(1 \le i \le M)$ consists of $\log_2 N$ tuples, where $MCR_i(a)$ is an N-bit bit-vector representing the checking results of the PT tuples against $EGBFT_i(a)$. $\forall PT(b)$ $(1 \le b \le N)$, if the checking result of PT(b).PIT by $EGBFT_i(a)$ is true, the b-th bit in $MCR_i(a)$ is set to 1; otherwise it is set to 0. The size of $MCR_i(a)$ is $\frac{N}{8}$ bytes, and the total size of structure MCR is $\sum_{1 \le i \le M} \sum_{1 \le a \le \log_2 N} \frac{N}{8}$ bytes. The ratio of the disk space taken by structure MCR to that taken by PT is $\frac{M \times \log_2 N}{64 \times (M + 2)}$, so the space overhead of structure MCR is not high. Besides, in practical applications structure MCR can be stored as compressed bitmaps [20] rather than literal bit-vectors: under Assumption 4.1 the proportion of 1-bits in $MCR_i(a)$ $(1 \le a \le \log_2 N)$ is $\frac{2^a}{N}$, so most tuples of $MCR_i$ can be compressed with a large compression ratio. Since the emphasis of this part is to introduce selective retrieval, for convenience we still store structure MCR as literal bit-vectors in this paper. Moreover, we do not need to maintain all $\log_2 N$ tuples in $MCR_i$. For example, given $N = 10^9$, we can generate $MCR_i$ from the 16th tuple onward (29 tuples in the original $MCR_i$), neglecting its first 15 tuples, which cover at most 32768 tuples and are rarely used in actual applications; any query involving the first 15 tuples of $MCR_i$ can utilize $MCR_i(16)$ instead.
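A much-simplified sketch of the exponential-gap Bloom filter table of Definition 7.1, using a toy Bloom filter (a java.util.BitSet with two multiplicative hashes) in place of a tuned implementation; the sizing rule, hash constants and all names are illustrative only, and the filters answer "might contain" with possible false positives but no false negatives.

```java
import java.util.BitSet;

// Simplified sketch of EGBFT_i (Definition 7.1): filter j summarizes the PI
// values of the first 2^j entries of sorted list SL_i.
public class EgbftSketch {

    static class ToyBloomFilter {
        private final BitSet bits;
        private final int size;
        ToyBloomFilter(int expected) {
            this.size = Math.max(64, expected * 10);   // ~10 bits per element (toy choice)
            this.bits = new BitSet(size);
        }
        private int h1(long x) { return (int) Math.floorMod(x * 0x9E3779B97F4A7C15L, (long) size); }
        private int h2(long x) { return (int) Math.floorMod(x * 0xC2B2AE3D27D4EB4FL + 0x165667B1L, (long) size); }
        void add(long x)             { bits.set(h1(x)); bits.set(h2(x)); }
        boolean mightContain(long x) { return bits.get(h1(x)) && bits.get(h2(x)); }
    }

    // sortedListPI[pos] = PI of the tuple at position pos (0-based) of SL_i.
    static ToyBloomFilter[] buildEgbft(long[] sortedListPI) {
        int n = sortedListPI.length;
        int levels = 32 - Integer.numberOfLeadingZeros(Math.max(1, n - 1)); // ~log2(N)
        ToyBloomFilter[] egbft = new ToyBloomFilter[levels + 1];
        for (int j = 1; j <= levels; j++) {
            int prefix = (int) Math.min(n, 1L << j);
            egbft[j] = new ToyBloomFilter(prefix);
            for (int pos = 0; pos < prefix; pos++) egbft[j].add(sortedListPI[pos]);
        }
        return egbft;
    }

    public static void main(String[] args) {
        long[] pi = {42, 7, 19, 3, 88, 5, 61, 23};
        ToyBloomFilter[] egbft = buildEgbft(pi);
        System.out.println(egbft[2].mightContain(19)); // true: 19 is among the first 4 entries
        System.out.println(egbft[1].mightContain(88)); // very likely false
    }
}
```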
7.3 Retrieval process

Given the ranking function F on $A_1, \ldots, A_m$, T2S utilizes the $\binom{m}{2}$ relevant APTS files $APTS_{a,b}$ $(1 \le a < b \le m)$ to compute the lower-bound score of the top-k results. Although the dimensionality M may be high, a top-k query usually involves a small number of attributes [5], so the total number of tuples in the relevant APTS files is not large; for example, given m = 4 and $K_{max} = 1000$, the total tuple number in the relevant APTS files is 6000. While retrieving the relevant APTS files, T2S maintains in priority queue $TSET_{thres}$ the k distinct tuples with the largest scores. When the relevant APTS files have been retrieved, $tscore_{thres}$, i.e. the minimum score in $TSET_{thres}$, is the lower-bound score of the top-k results.

Fig. 3. The illustration of the MCR positional index.

Next, we analyze the quality of the determined lower bound. $\forall t \in T$, if t.PI is not obtained in the growing phase, t is called a low-score tuple. Let $APTS_{a,b}(1, \ldots, k)$ be the first k tuples in $APTS_{a,b}$, and let $TSET_k$ be the k tuples with the largest scores in $\bigcup_{1 \le a < b \le m} APTS_{a,b}(1, \ldots, k)$.

Theorem 7.1: $TSET_k$ does not contain low-score tuples.

Proof: When the scan depth is gdep, k terminating tuples have been obtained; thus $\forall 1 \le a < b \le m$, the number of candidates whose attributes $A_a$ and $A_b$ have both been seen is not less than k. $\forall t \in APTS_{a,b}(1, \ldots, k)$, $t.PI \in SL_a(1, \ldots, gdep).PI \wedge t.PI \in SL_b(1, \ldots, gdep).PI$, so t is not a low-score tuple. Q.E.D.

Theorem 7.1 proves that $TSET_k$ does not contain any low-score tuple. Considering that (1) the ranking function F involves a small number of attributes and (2) the minimum score in $TSET_{thres}$ is not less than the minimum score in $TSET_k$, the $tscore_{thres}$ computed above is a satisfactory lower-bound score of the top-k results, which is also verified in the experiments. In the following, T2S uses $tscore_{thres}$ as the score lower bound of the top-k results.

According to the basic idea of selective retrieval, we denote by $PV^{mx}_i$ the maximum value of attribute $A_i$ satisfying $A_i < \frac{1}{w_i} \times (tscore_{thres} - \sum_{1 \le j \ne i \le m} w_j \times MAX_j)$. Given $PV^{mx}_i$, T2S exploits structure $EGBI_i$ to determine the positional indexes of the MCR tuples employed in selective retrieval. The tuples of $EGBI_i$ are retrieved from the beginning and, as shown in Figure 3, T2S returns the first positional index $b^{mx}_i$ $(1 \le b^{mx}_i \le \log_2 N)$ of $EGBI_i$ satisfying the condition $EGBI_i(b^{mx}_i).eval < PV^{mx}_i \le EGBI_i(b^{mx}_i - 1).eval$. Here we denote by $EGBI_i(0).eval$ the value $MAX_i$, and $b^{mx}_i$ is the required positional index of the MCR tuple. Recall that $MCR_i(b^{mx}_i)$ is the bit-vector of checking results of the PIT values of PT against $EGBFT_i(b^{mx}_i)$.

Before performing the sequential scan on PT, T2S first retrieves data from $MCR_i(b^{mx}_i)$ $(1 \le i \le m)$. To improve the efficiency of disk retrieval, T2S retrieves MSZ bytes from each $MCR_i(b^{mx}_i)$ at a time (in this paper MSZ is set to 1MB). Let $BUF_i$ be the buffer storing the $MCR_i(b^{mx}_i)$ data retrieved currently; T2S computes the bit-vector $SRB = \bigcap_{1 \le i \le m} BUF_i$. After the first MSZ bytes have been retrieved from each $MCR_i(b^{mx}_i)$ $(1 \le i \le m)$, the computed SRB represents the bit-vector corresponding to $PT(1, \ldots, 8 \times MSZ)$. If the a-th bit in the current SRB is 0, PT(a) can be skipped directly; otherwise PT(a) should be retrieved. When the current SRB is exhausted, T2S retrieves the next MSZ bytes from each $MCR_i(b^{mx}_i)$ $(1 \le i \le m)$, re-computes SRB, which now represents the bit-vector corresponding to $PT(8 \times MSZ + 1, \ldots, 16 \times MSZ)$, and continues the selective retrieval on PT. The process proceeds until the top-k results are acquired.
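A sketch of how the per-attribute membership bit-vectors are intersected into SRB and used to skip PT positions, using java.util.BitSet in place of the on-disk MCR chunks; the buffering in MSZ-byte blocks is omitted, and the names are illustrative.

```java
import java.util.BitSet;

// Sketch of the selective scan of Section 7.3: SRB is the bitwise AND of the
// chosen MCR_i(b_i^mx) bit-vectors, and only positions whose SRB bit is 1 are
// read from PT.
public class SelectiveScanSketch {

    static BitSet intersect(BitSet[] mcrRows, int n) {
        BitSet srb = new BitSet(n);
        srb.set(0, n);                        // start with all ones
        for (BitSet row : mcrRows) srb.and(row);
        return srb;
    }

    // visits only the PT positions that survive the intersection
    static void scan(BitSet srb, java.util.function.IntConsumer readTuple) {
        for (int pos = srb.nextSetBit(0); pos >= 0; pos = srb.nextSetBit(pos + 1)) {
            readTuple.accept(pos);            // retrieve PT(pos); all other positions are skipped
        }
    }

    public static void main(String[] args) {
        int n = 10;
        BitSet a = new BitSet(n), b = new BitSet(n);
        a.set(1); a.set(3); a.set(7);
        b.set(3); b.set(7); b.set(9);
        BitSet srb = intersect(new BitSet[]{a, b}, n);
        scan(srb, pos -> System.out.println("read PT(" + pos + ")"));  // positions 3 and 7
    }
}
```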
7.4 Theoretical analysis

In this part, we analyze the performance of selective retrieval under Assumption 4.1. It is known that $TSET_k$ does not contain low-score tuples and that the minimum score in $TSET_{thres}$ is not less than the minimum score in $TSET_k$; it is also verified in the experiments that the estimated lower-bound score of the top-k results is very close to the minimum top-k score. In this part we therefore take $\sum_{i=1}^{m} SL_i(gdep).A_i$, the threshold when the growing phase ends, as $tscore_{thres}$.

Under Assumption 4.1, we have $b^{mx}_1 = \ldots = b^{mx}_m = b^{mx}$. According to the basic idea of selective retrieval, $\forall pt \in PT$, when some $pt.A_i$ satisfies the condition $pt.A_i < tscore_{thres} - (m - 1)$, pt is not a top-k result. Let $dep_{con}$ be the position in $SL_i$ of the maximum possible $A_i$ value satisfying this condition; then $1 - \frac{dep_{con}}{N} = tscore_{thres} - (m - 1)$, i.e. $dep_{con} = N \times (m - tscore_{thres})$. The MCR positional index is $b^{mx} = \lceil \log_2 dep_{con} \rceil$. $\forall pt = PT(a)$ $(1 \le a \le N)$, the probability $pru_{prob}$ that pt can be skipped is $1 - (\frac{2^{b^{mx}}}{N})^m$. This formula can be explained as follows: $(\frac{2^{b^{mx}}}{N})^m$ is the probability that $A_1, \ldots, A_m$ of PT(a) are located in $SL_1(1, \ldots, 2^{b^{mx}}), \ldots, SL_m(1, \ldots, 2^{b^{mx}})$ respectively, so $1 - (\frac{2^{b^{mx}}}{N})^m$ is the probability that at least one attribute $PT(a).A_j$ is not located in $SL_j(1, \ldots, 2^{b^{mx}})$. In terms of the construction method of MCR and the execution process of T2S, $pru_{prob}$ can be treated as the proportion of 0-bits in the bit-vector SRB; T2S will skip the vast majority of the candidates by selective retrieval.

Of course, in the general case that the attributes follow a correlated distribution with correlation coefficient c, $pru_{prob}$ is affected by the value of c. For example, if c < 0, the threshold becomes $tscore_{thres,c} = \sum_{i=1}^{m} SL_i(gdep_c).A_i < tscore_{thres}$, the MCR positional index becomes $b^{mx,c} = \lceil \log_2(N \times (m - tscore_{thres,c})) \rceil \ge b^{mx}$, and the probability of skipping a tuple becomes smaller.
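The skip-probability estimate above is simple enough to evaluate directly; a sketch under Assumption 4.1 (uniform attributes in [0, 1], unit weights), with illustrative parameter values.

```java
// Sketch of the skip-probability estimate of Section 7.4 under Assumption 4.1:
//   dep_con = N * (m - tscoreThres),  b_mx = ceil(log2(dep_con)),
//   pruProb = 1 - (2^b_mx / N)^m.
public class PruneProbabilitySketch {

    static double pruneProbability(double n, int m, double tscoreThres) {
        double depCon = n * (m - tscoreThres);
        int bMx = (int) Math.ceil(Math.log(depCon) / Math.log(2));
        return 1.0 - Math.pow(Math.pow(2, bMx) / n, m);
    }

    public static void main(String[] args) {
        // e.g. N = 10^9, m = 5, and a lower bound close to the ideal threshold
        System.out.println(pruneProbability(1e9, 5, 4.95));  // ~0.999999
    }
}
```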
8 THE COLUMNWISE T2S ALGORITHM

In the discussion above, the tuples in PT are stored in the row-store model and all attributes of a tuple are retrieved sequentially. The problem is obvious: even though the ranking function only involves attributes $A_1, \ldots, A_m$, T2S has to retrieve all attributes $A_1, \ldots, A_M$ $(m < M)$ and incurs extra I/O cost. Intuitively, we can decompose table PT and retrieve only the required column files when computing the top-k results. Given the schema $PT(MPIL, PIT, A_1, \ldots, A_M)$, we keep PT as a set of column files $\{CF_L, CF_T, CF_1, \ldots, CF_M\}$, where $CF_L$, $CF_T$ and $CF_i$ $(1 \le i \le M)$ are the column files corresponding to MPIL, PIT and $A_i$ respectively. For easy identification, we denote by T2SR the T2S algorithm on the row-store model and by T2SC the T2S algorithm on the column-store model.

The naive method, in which T2SC obtains a single tuple of PT by retrieving $CF_L, CF_T, CF_1, \ldots, CF_m$ in round-robin fashion, has the same execution process as T2SR. Its problem is that T2SC has to retrieve values from m + 2 column files to compute the score of one tuple, and every retrieval from a column file often incurs a disk seek operation. Therefore, besides the k tuples with the largest scores seen so far, T2SC also utilizes a relatively large buffer TBUF to reduce the number of disk seek operations. Let TBUF be large enough to keep BS tuples of T. During its round-robin retrieval on the involved column files, T2SC reads BS tuples at a time from a column file rather than a single tuple, which we call batch round-robin retrieval, and maintains the tuples in TBUF. In this way, every pass of batch round-robin retrieval obtains BS tuples, which amortizes the cost of obtaining each tuple. T2SC then reads the tuples in TBUF to compute the top-k results. This process continues until the early termination condition is satisfied.

The size BS of TBUF is important; T2SC with BS = 1 is the naive method. A larger BS lowers the disk-seek cost per tuple, but raises the cost of maintaining tuples in memory and makes T2SC retrieve more extra tuples (BS tuples are retrieved in each pass). Under Assumption 4.1, the number $NUM_{ret}$ of tuples retrieved by T2SR is $\lceil \frac{NUM_{gdep} \times M/m}{OG} \rceil \times OG \times (\frac{2^{b^{mx}}}{N})^m$. Obviously, with $BS = NUM_{ret}$, T2SC has the lowest cost of disk seek operations but may maintain too many tuples. In terms of the application context and the user requirement, this paper sets an upper-bound value $BS_{mx}$ and a lower-bound value $BS_{mn}$ for the size of TBUF, and takes the median of $NUM_{ret}$, $BS_{mn}$ and $BS_{mx}$ as the value of BS; in this paper we set $BS_{mn} = 100$ and $BS_{mx} = 100000$. Furthermore, when the bit-vector SRB is exhausted, T2SC finishes the current batch round-robin retrieval even if the number of tuples retrieved is fewer than BS. Given the value of BS, T2SC may retrieve at most BS more tuples than T2SR, so the scan depth of T2SC can be estimated as $\lceil \frac{NUM_{gdep} \times M/m + BS / (2^{b^{mx}}/N)^m}{OG} \rceil \times OG$.

Of course, the batch round-robin retrieval above is straightforward, and a new retrieval scheduling method as in [2] may provide a better performance for T2SC. However, the emphasis of T2S is to perform top-k queries on the presorted table with selective retrieval, and T2SC is just a columnwise version of T2S, so we adopt the straightforward method in this paper.
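The buffer-size rule described above (the median of the estimated retrieved-tuple count and the two bounds) amounts to a simple clamp; a few illustrative lines, with the paper's default bounds used as example arguments.

```java
// Sketch of choosing the TBUF size in T2SC (Section 8): BS is the median of
// NUMret, BSmn and BSmx, i.e. NUMret clamped to [BSmn, BSmx].
public class BufferSizeSketch {

    static long chooseBS(long numRet, long bsMin, long bsMax) {
        return Math.max(bsMin, Math.min(numRet, bsMax));
    }

    public static void main(String[] args) {
        System.out.println(chooseBS(37_500, 100, 100_000));    // 37500
        System.out.println(chooseBS(12, 100, 100_000));        // 100
        System.out.println(chooseBS(2_000_000, 100, 100_000)); // 100000
    }
}
```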
9 DISCUSSION ABOUT DATA STRUCTURES

This section discusses the cost of pre-computing the data structures and the update processing.

The construction of the presorted table PT comprises the major cost of the pre-construction operation. Given sorted list $SL_i(PI, A_i)$ (tuple length 16 bytes), the cost of generating $PSL_i(PIL)$ is 56N bytes. Then $PSL_1, \ldots, PSL_M$ are merged to generate MPIL, at a cost of 8MN + 8N bytes. MPIL is used as the sort key to sort T; the cost of sorting is $(M + 1) \times 8N + (M + 2) \times 24N$ bytes, consisting of the cost of reading MPIL and T, generating the sorted partitions and performing the merging operation. Adding up the costs above, the cost of generating PT is $32N \times (3M + 2)$ bytes.

The structure GTS is built for early termination checking. Its construction cost is $32M \times \frac{N}{OG}$ bytes, consisting of the cost of retrieving data and outputting data. Note that the costs of generating PT and GTS are not affected by the data distribution.

For selective retrieval, we construct the structures APTS, EGBFT and MCR, of which only the cost of APTS construction depends on the data distribution. Given two sorted lists, under Assumption 4.1 the scan depth required to find $K_{max}$ terminating tuples is $N \times (\frac{K_{max}}{N})^{1/2}$; if the attributes follow positive or negative correlation, the actual scan depth is smaller or larger, respectively. Since constructing PT is the major cost in pre-construction, we take $N \times (\frac{K_{max}}{N})^{1/2}$ as the scan depth to find $K_{max}$ terminating tuples, and the cost to build APTS is $\binom{M}{2} \times [32N \times (\frac{K_{max}}{N})^{1/2} + 16 M K_{max}]$ bytes, where the first term in the bracket is the cost of finding the terminating tuples and the second is the cost of retrieving and outputting the specified tuples. The cost of generating EGBFT is $M \times [16 \times (2N - 2) + \frac{(2N - 2) \times \log_2(1/fpr)}{8 \ln 2}]$ bytes, where the first term is the cost of retrieving the sorted lists and the second is the cost of outputting the bloom filters. The cost of building EGBI is rather small and is not discussed here. The cost of building MCR is $M \times [8N + \frac{(2N - 2) \times \log_2(1/fpr)}{8 \ln 2} + \frac{N \times \log_2 N}{8}]$ bytes, where the first term is the cost of retrieving attribute PIT in PT, the second the cost of loading EGBFT and the third the cost of outputting the MCR data. Obviously, the total cost of pre-construction in T2S is high on massive data.
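The byte-count formulas above can be plugged in directly to gauge the pre-construction volume for a given configuration. The sketch below simply evaluates them; the constants come from the formulas in this section, and the parameter values in main are illustrative.

```java
// Sketch evaluating the pre-construction cost formulas of Section 9 (in bytes).
// fpr is the Bloom-filter false-positive rate (0.001 in the paper).
public class PreconstructionCostSketch {

    static double ptCost(double n, double m)             { return 32 * n * (3 * m + 2); }
    static double gtsCost(double n, double m, double og) { return 32 * m * n / og; }
    static double aptsCost(double n, double m, double kMax) {
        double pairs = m * (m - 1) / 2;
        return pairs * (32 * n * Math.sqrt(kMax / n) + 16 * m * kMax);
    }
    static double egbftCost(double n, double m, double fpr) {
        double bloomBits = (2 * n - 2) * (Math.log(1 / fpr) / Math.log(2)) / (8 * Math.log(2));
        return m * (16 * (2 * n - 2) + bloomBits);
    }
    static double mcrCost(double n, double m, double fpr) {
        double bloomBits = (2 * n - 2) * (Math.log(1 / fpr) / Math.log(2)) / (8 * Math.log(2));
        return m * (8 * n + bloomBits + n * (Math.log(n) / Math.log(2)) / 8);
    }

    public static void main(String[] args) {
        double n = 1e9, m = 10, og = 10_000, kMax = 1000, fpr = 0.001;
        System.out.printf("PT: %.3e  GTS: %.3e  APTS: %.3e  EGBFT: %.3e  MCR: %.3e%n",
                ptCost(n, m), gtsCost(n, m, og), aptsCost(n, m, kMax),
                egbftCost(n, m, fpr), mcrCost(n, m, fpr));
    }
}
```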
Next we discuss how to deal with update operations in T2S. We denote the new data obtained periodically by $D_{new}$ and the old data by $T_{old}$. With increasing volume and the disk I/O bottleneck, data is usually stored in read/append-only mode in typical massive data applications [8]. Using this characteristic, we do not need to re-compute the data structures for every update, but can support efficient incremental-update/batch-processing for T2S. Note that $T_{old}$ is much larger than $D_{new}$. Taking an approach similar to C-Store [19], we do not update $T_{old}$ and its related data structures as long as the size of the accumulated new data (stored in $T_{new}$) does not reach a certain proportion of the old data (for example 5%). Due to the read/append-only characteristic, the old data structures are not affected by the new data; T2S performs top-k computation on both the old data and the new data, and returns the final results. Depending on its volume and the performance requirement, $T_{new}$ can be stored in row-store form or in the same data structures as T2S (the latter is used in this paper). We re-compute the data structures on $T_{new}$ when new data arrives periodically; this update cost is relatively low. If the size of $T_{new}$ becomes large enough, we merge $T_{new}$ and $T_{old}$ and re-compute the data structures at a proper time (when the system is idle), which amortizes the relatively high cost while keeping a good performance.

10 PERFORMANCE EVALUATION

10.1 Experimental settings

To evaluate the performance of T2S, we implement it in Java with jdk-8u20-windows-x64. The experiments are executed on a LENOVO ThinkCentre M8400 (Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (8 CPUs), 32GB memory, 64-bit Windows 7, Seagate ST1000DM003 (1TB)). T2S evaluated here includes T2SR and T2SC, executing on PT in the row-store model and the column-store model respectively. In the experiments, the performance of T2S is evaluated against the naive algorithm, LARA, TA and TKEP. The naive method performs a sequential scan on the table, which is kept in the column-store model, and only retrieves the required column files. LARA [17] is a recent sorted-list-based algorithm with superior performance; the sorted lists in LARA are retrieved through Java's BufferedInputStream class, which has an internal buffer array and gives LARA the ability of batched I/O as in [18]. In order to compare T2S with TA-style algorithms using random access, we implement TA as in [1], [7]. Due to the poor performance of random access on massive data, TA is orders of magnitude slower than the other algorithms (7315.819s when the tuple number is $10^8$ and the attribute number is 5), so we only provide the experimental results of TA in the experiment on real data. Although TKEP [11] is an approximate method, we still evaluate it here for a more extensive comparison.

TABLE 2
Parameter settings
  Parameter                                      Used values
  Tuple number ($10^8$, synthetic)               1 ∼ 20
  Result size (synthetic)                        10 ∼ 50
  Used attribute number (synthetic)              2 ∼ 6
  Total attribute number (synthetic)             5 ∼ 20
  Negative correlation coefficient (synthetic)   -0.8 ∼ 0
  Positive correlation coefficient (synthetic)   0 ∼ 0.8
  Result size (real)                             5 ∼ 25

In the experiments, we evaluate the performance of T2S in terms of several aspects: tuple number (N), result size (k), used attribute number (m), total attribute number (M) and correlation coefficient (c).
The experiments are executed on three data sets: two synthetic data sets (independent distribution and correlated distribution) and a real data set. The used parameter settings are listed in Table 2. For the correlated distribution, the first two attributes have the specified correlation coefficient, while the remaining attributes follow the independent distribution. In order to generate two sequences of random numbers with correlation coefficient c, we first generate two sequences of uncorrelated random numbers $X_1$ and $X_2$, then a new sequence $Y_1 = c \times X_1 + \sqrt{1 - c^2} \times X_2$ is generated, and $X_1$ and $Y_1$ are two sequences with the given correlation coefficient c. The real data set used is the HIGGS Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/HIGGS#). The default ranking function is $F = \sum_{i=1}^{m} A_i$ (the effect of different weights is evaluated in Section 10.10). The maximum number of candidates in the in-memory hash table is set to $1 \times 10^7$.
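The correlated synthetic data described above is derived from two uncorrelated sequences; a sketch of that transformation ($Y_1 = c \cdot X_1 + \sqrt{1 - c^2} \cdot X_2$), which yields a Pearson correlation of roughly c between $X_1$ and $Y_1$. The paper does not specify the source distribution; Gaussian sources are used here for simplicity, and all names are illustrative.

```java
import java.util.Random;

// Sketch of the correlated-attribute generation in Section 10.1: given two
// uncorrelated sequences X1, X2 with equal variance, Y1 = c*X1 + sqrt(1-c^2)*X2
// has (approximately) Pearson correlation c with X1.
public class CorrelatedDataSketch {

    static double[][] generate(int n, double c, long seed) {
        Random rnd = new Random(seed);
        double[] x1 = new double[n], y1 = new double[n];
        for (int i = 0; i < n; i++) {
            double a = rnd.nextGaussian();   // uncorrelated source values
            double b = rnd.nextGaussian();
            x1[i] = a;
            y1[i] = c * a + Math.sqrt(1 - c * c) * b;
        }
        return new double[][]{x1, y1};
    }

    public static void main(String[] args) {
        double[][] data = generate(1_000_000, -0.8, 42L);
        System.out.println("generated " + data[0].length + " correlated pairs");
    }
}
```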
10.2 Construction and component comparison

Fig. 4. The result of the pre-construction operation: (a) construction time; (b) storage cost.

Fig. 5. The result of component comparison: (a) selective retrieval; (b) different APTS methods.

The result of the pre-construction operation is shown in Figure 4. Figure 4(a) illustrates that the pre-construction time increases linearly with the data volume and that constructing the presorted table consumes most of the construction time. The storage cost of the data structures is depicted in Figure 4(b); the total storage cost is around 220GB at $N = 20 \times 10^8$. Figure 5(a) compares T2SR with and without selective retrieval: although T2S with selective retrieval skips most of the tuples, its performance advantage over T2S without selective retrieval is not very significant, due to the relatively poor skip performance in Java. Figure 5(b) compares T2SR under different APTS selection strategies: random selection and the tuples with the largest values in one attribute. T2S with the APTS obtained in this paper shows the better performance.

10.3 Exp 1: The effect of tuple number

Fig. 6. The effect of tuple number: (a) execution time; (b) maintained tuple number; (c) lower-bound score; (d) scan depth; (e) I/O cost; (f) pruning ratio.

Given m = 5, M = 10, k = 10 and c = 0, experiment 1 evaluates the performance of T2S on varying tuple numbers. As shown in Figure 6(a), T2SC runs 10.97 times faster than LARA, 4.61 times faster than TKEP and 12.09 times faster than the naive algorithm; T2SR runs 5.22 times faster than LARA, 2.14 times faster than TKEP and 5.59 times faster than the naive algorithm. Because less data is retrieved, T2SC runs faster than T2SR. As the tuple number increases, the speedup of T2S over LARA becomes larger; for example, T2SC runs 6.54 times faster than LARA at $N = 1 \times 10^8$ and 15.27 times faster at $N = 20 \times 10^8$. As shown in Figure 6(b), T2SC maintains 1203.35 times fewer tuples than LARA and 11 times fewer tuples than TKEP, while T2SR maintains 12033509.56 times fewer tuples than LARA and 110238.96 times fewer tuples than TKEP. Only k tuples are maintained in T2SR and the naive algorithm, and the size of TBUF in T2SC is set to 100000 in experiment 1. The estimated tuple number $NUM_{gdep}$ maintained by LARA is also shown in Figure 6(b), and the actual value follows our analysis in Section 4. The effect of the lower-bound score of the top-k results, which is computed by structure APTS, is illustrated in Figure 6(c); here we divide the computed lower-bound score by the minimum score of the top-k results to report the normalized result.
The lower-bound score of the top-k results, which is computed from the APTS structure, is illustrated in Figure 6(c); here, the computed lower-bound score is divided by the minimum score of the top-k results to report a normalized value. It can be seen that the lower-bound score computed in T2S is very close to the exact top-k threshold. Figure 6(d) reports the scan depths of the algorithms. The scan depth of LARA (TKEP) in the growing phase is around 4.7 times smaller than those of T2SC and T2SR, because LARA (TKEP) retrieves one tuple from each sorted list (5 sorted lists in total) in one pass of round-robin retrieval. Due to early termination, T2SC and T2SR only involve around 14% of the table, and within this portion the vast majority of tuples are skipped. The estimated scan depths are also provided in Figure 6(d); the actual values conform to our analysis in Sections 4, 5 and 8. Figure 6(e) depicts the I/O costs of the algorithms. T2SC incurs 772.81 times less I/O cost than LARA and 114.81 times less I/O cost than TKEP, while T2SR incurs 347.84 times less I/O cost than LARA and 53.31 times less I/O cost than TKEP. The I/O cost of LARA gradually exceeds that of the naive algorithm: as the tuple number increases, the number of tuples maintained in LARA exceeds the limit of allocated memory and memory-disk exchange operations are performed to complete the growing phase; the exchange number is 2 at N = 1 × 10^8 and rises to 24 at N = 20 × 10^8, incurring a much higher I/O cost, while the I/O cost of the naive algorithm grows linearly with the tuple number. As illustrated in Figure 6(e), the I/O cost incurred by T2SC and T2SR first becomes higher as the tuple number increases from 1 × 10^8 to 15 × 10^8, and then decreases at N = 20 × 10^8. This can be explained by Figure 6(f). When the tuple number rises fivefold from 1 × 10^8 to 5 × 10^8, the positional index of MCR used for selective retrieval increases from 24 to 26 and the covered exponential gap expands fourfold, which makes the pruning ratio rise. When the tuple number rises threefold from 5 × 10^8 to 15 × 10^8, the positional index of MCR increases from 26 to 28 and the exponential gap again expands fourfold, which makes the pruning ratio decline. At N = 20 × 10^8, the positional index of MCR remains 28 and the pruning ratio increases. The theoretical pruning ratio also declines at N = 15 × 10^8, and the actual value follows a similar trend to the theoretical value.

Fig. 7. The effect of result size: (a) execution time, (b) maintained tuple number, (c) I/O cost and (d) pruning ratio, varying the result size.

10.4 Exp 2: The effect of result size
Given m = 5, M = 10, N = 10 × 10^8 and c = 0, experiment 2 evaluates the performance of T2S on varying result sizes. As shown in Figure 7(a), T2SC runs 11.28 times faster than LARA, 4.19 times faster than TKEP and 11.5 times faster than the naive algorithm, while T2SR runs 5.94 times faster than LARA, 2.2 times faster than TKEP and 5.95 times faster than the naive algorithm. For T2SC, the execution time has a sudden rise at k = 30 because much more data is retrieved there. As depicted in Figure 7(b), T2SC maintains 1381.13 times fewer tuples than LARA and 13.12 times fewer tuples than TKEP, while T2SR maintains 5871375.63 times fewer tuples than LARA and 56081.46 times fewer tuples than TKEP.
In experiment 2, the size of TBUF in T2SC is set to 100000, while the number of tuples maintained in LARA grows linearly with the result size. The I/O cost is reported in Figure 7(c): T2SC incurs 467.12 times less I/O cost than LARA and 41.36 times less I/O cost than TKEP, while T2SR incurs 201.19 times less I/O cost than LARA and 17.41 times less I/O cost than TKEP. There is a sudden rise for T2SC and T2SR in Figure 7(c), which can be explained by Figure 7(d). When the result size is not greater than 20, the positional index of MCR used for selective retrieval is 27; afterwards the positional index increases by 1 and remains 28. This leads to a drop of the pruning ratio at k = 30, as illustrated in Figure 7(d). The trend of the actual pruning ratio basically conforms to the theoretical results; the early decline at k = 20 is an artifact of the particular data set generated.

Fig. 8. The effect of used attribute number: (a) execution time, (b) maintained tuple number, (c) I/O cost and (d) pruning ratio, varying the used attribute number.

10.5 Exp 3: The effect of used attribute number
Given k = 10, M = 10, N = 10 × 10^8 and c = 0, experiment 3 evaluates the performance of T2S on varying used attribute numbers.
As shown in Figure 8(a), T2SC runs 9.14 times faster than LARA, 4.23 times faster than TKEP and 79.1 times faster than the naive algorithm, while T2SR runs 4.77 times faster than LARA, 2.38 times faster than TKEP and 66.54 times faster than the naive algorithm. When the used attribute number increases from 2 to 6, the execution time of LARA grows exponentially, while the execution time of the naive algorithm grows linearly; at m = 5, the execution time of LARA exceeds that of the naive algorithm. The execution times of T2SC and T2SR also grow rapidly with a larger value of m, due to a greater scan depth and a smaller pruning ratio. As shown in Figure 8(b), T2SC maintains 18109.16 times fewer tuples than LARA and 47.86 times fewer tuples than TKEP, while T2SR maintains 8091748.76 times fewer tuples than LARA and 385764.32 times fewer tuples than TKEP. The size of TBUF in T2SC is limited to between 100 and 100000, while the number of tuples maintained in LARA grows exponentially as more attributes are used in the ranking function. As illustrated in Figure 8(c), T2SC incurs 6429.31 times less I/O cost than LARA and 4464.96 times less I/O cost than TKEP, while T2SR incurs 2078.97 times less I/O cost than LARA and 1315.83 times less I/O cost than TKEP. For LARA, the I/O cost is much lower than that of the naive algorithm for small values of m, but from m = 5 onwards it exceeds that of the naive algorithm. The memory-disk exchange operations in LARA incur a high I/O cost: at m = 2 or m = 3, LARA needs no memory-disk exchange operation, but as more attributes are used, more tuples have to be maintained and the number of memory-disk exchange operations rises to 26 at m = 6. In experiment 3, the positional index of MCR used in selective retrieval increases with a larger m, which makes the pruning ratio decline, as shown in Figure 8(d). Even at m = 6, however, the pruning effect is still good and 94% of the tuples are skipped.
Fig. 9. The effect of total attribute number: (a) execution time, (b) maintained tuple number, (c) I/O cost and (d) pruning ratio, varying the table width.

10.6 Exp 4: The effect of total attribute number
Given k = 10, m = 5, N = 10 × 10^8 and c = 0, experiment 4 evaluates the performance of T2S on varying total attribute numbers. As shown in Figure 9(a), T2SC runs 11.51 times faster than LARA, 4.25 times faster than TKEP and 13.7 times faster than the naive algorithm, while T2SR runs 6.74 times faster than LARA, 2.49 times faster than TKEP and 8.03 times faster than the naive algorithm. In experiment 4, the execution times of LARA and the naive algorithm remain unchanged, while the execution times of T2SC and T2SR increase visibly. With a larger value of M, LARA and the naive algorithm only need to retrieve the involved column files, whereas retrieval on the presorted table (row-oriented or column-oriented) in T2SC and T2SR amounts to the round-robin retrieval on all the sorted lists. As illustrated in Figure 9(b), T2SC maintains 1148.14 times fewer tuples than LARA and 10.74 times fewer tuples than TKEP, while T2SR maintains 11481433.2 times fewer tuples than LARA and 107469 times fewer tuples than TKEP. As depicted in Figure 9(c), T2SC incurs 906.47 times less I/O cost than LARA and 140.88 times less I/O cost than TKEP, while T2SR incurs 445.55 times less I/O cost than LARA and 69.24 times less I/O cost than TKEP. Due to the fixed value of m and the independent distribution, the positional index of MCR utilized in selective retrieval is constant (27). The pruning ratio is depicted in Figure 9(d); it increases slightly from 0.99 at M = 5 to 0.9945 at M = 20. With a larger value of M, more tuples from the irrelevant sorted lists are combined into the presorted table, which gives tuples a higher chance of being skipped.

Fig. 10. The effect of negative correlation: (a) execution time, (b) maintained tuple number, (c) I/O cost and (d) pruning ratio, varying the negative correlation coefficient.

10.7 Exp 5: The effect of negative correlation
Given k = 10, M = 10, m = 3 and N = 10 × 10^8, experiment 5 evaluates the performance of T2S on varying negative correlation coefficients.
As shown in Figure 10(a), T2SC runs 18.71 times faster than LARA and 13.07 times faster than TKEP on average, while T2SR runs 6.25 times faster than LARA and 4.49 times faster than TKEP. The significantly longer execution time of LARA is due to a scan depth that grows exponentially. For T2S and TKEP, the execution times increase significantly with a larger value of |c| because of a greater scan depth and a worse pruning effect. For the naive algorithm, the data distribution does not affect its execution and its execution time remains unchanged. As |c| becomes larger, the execution times of the other algorithms gradually exceed that of the naive algorithm. At c = −0.6, the scan depth of LARA in the growing phase is 226200848 and the early pruning of TKEP does not work; thus from c = −0.6 onwards, TKEP has the same execution as LARA. As illustrated in Figure 10(b), T2SC maintains 15816.32 times fewer candidates than LARA and 2617.06 times fewer candidates than TKEP, while T2SR maintains 37527318 times fewer candidates than LARA and 26109850 times fewer candidates than TKEP. As shown in Figure 10(c), T2SC incurs 492.19 times less I/O cost than LARA and 119.63 times less I/O cost than TKEP, while T2SR incurs 124.18 times less I/O cost than LARA and 29.99 times less I/O cost than TKEP. As |c| becomes larger, the I/O costs of the other algorithms gradually exceed that of the naive algorithm. Due to the effect of negative correlation, as depicted in Figure 10(d), the pruning ratio of T2S declines significantly as |c| becomes larger.

Fig. 11. The effect of positive correlation: (a) execution time, (b) maintained tuple number, (c) I/O cost and (d) pruning ratio, varying the positive correlation coefficient.

10.8 Exp 6: The effect of positive correlation
Given k = 10, M = 10, m = 3 and N = 10 × 10^8, experiment 6 evaluates the performance of T2S on varying positive correlation coefficients.
As shown in Figure 11(a), T2SC runs 8 times faster than LARA, 5.74 times faster than TKEP and 413.93 times faster than the naive algorithm, while T2SR runs 8.33 times faster than LARA, 6.23 times faster than TKEP and 454.53 times faster than the naive algorithm. With a larger value of |c|, the execution times of these algorithms, except for the naive algorithm, decline. As illustrated in Figure 11(b), when |c| increases, the numbers of tuples maintained in T2S and in the naive algorithm remain unchanged, the number of tuples maintained in LARA is reduced due to a smaller scan depth, while the number of tuples maintained in TKEP increases because of a worse pruning effect. T2SC maintains 14878.88 times fewer candidates than LARA and 87.08 times fewer candidates than TKEP, while T2SR maintains 148788.58 times fewer candidates than LARA and 870.84 times fewer candidates than TKEP. As depicted in Figure 11(c), T2SC incurs 22709.58 times less I/O cost than LARA and 24632.18 times less I/O cost than TKEP, while T2SR incurs 7374.15 times less I/O cost than LARA and 8011.62 times less I/O cost than TKEP. Although the scan depth of the growing phase declines from 2016625 to 668231, the scan depth of the shrinking phase stays between 3431319 and 4589297 in experiment 6, so the number of bytes retrieved by LARA does not vary much with a larger value of |c|. The I/O costs of T2SC and T2SR show a similar trend, which is affected by the pruning operation and the scan depth. The pruning ratio of selective retrieval is illustrated in Figure 11(d); as the positive correlation coefficient increases, the pruning ratio shows a declining trend.

Fig. 12. The effect of real data: (a) execution time, (b) maintained tuple number, (c) I/O cost and (d) pruning ratio, varying the result size.

10.9 Exp 7: The effect of real data
The real data, the HIGGS Data Set, is obtained from the UCI Machine Learning Repository. It contains 11,000,000 tuples with 28 attributes. We select three features (2, 7, 12) with relatively large variances to compute top-k results. In experiment 7, the performance of T2S is evaluated on varying result sizes. As shown in Figure 12(a), T2SC runs 2.77 times faster than LARA, 2.71 times faster than TKEP and 8.01 times faster than the naive algorithm, while T2SR runs 1.42 times faster than LARA, 1.39 times faster than TKEP and 4.46 times faster than the naive algorithm.
Because the volume of the real data is relatively small, we also evaluate the TA algorithm here. We find that the TA algorithm runs orders of magnitude slower than the other algorithms. As illustrated in Figure 12(b), T2SC maintains 3092.91 times fewer tuples than LARA and 27.07 times fewer tuples than TKEP, while T2SR maintains 25629.18 times fewer tuples than LARA and 142.97 times fewer tuples than TKEP. During its execution, TA keeps the positional indexes of the tuples seen before to avoid duplicate computation of scores. The size of TBUF in T2SC is fixed to 100, while the number of tuples maintained in LARA increases from 217992 to 375220. As depicted in Figure 12(c), T2SC incurs 12227.52 times less I/O cost than LARA and 7300.48 times less I/O cost than TKEP, while T2SR incurs 1546.11 times less I/O cost than LARA and 831.02 times less I/O cost than TKEP. When the result size increases from 5 to 25, the positional index of MCR used in selective retrieval first increases from 15 to 16, then remains 16, and finally rises to 17. The pruning ratio, depicted in Figure 12(d), reflects this variation of the positional index of MCR.

Fig. 13. The effect of different weights: (a) query time and (b) pruning ratio, varying the first weight value.

10.10 Exp 8: The effect of different weights
Given k = 10, M = 10, m = 5, N = 10 × 10^8 and c = 0, experiment 8 evaluates the performance of T2S on varying weights of the ranking function. The weight of the first attribute is set to 1, 2, 3, 4 and 5, while the other weights are kept at 1. As shown in Figure 13(a), the execution time of the naive algorithm remains unchanged, while the execution time of LARA increases gradually from 344.094s to 416.611s due to an increase of the scan depth in the shrinking phase. The execution times of T2S and TKEP decline slightly, because the greater weight on the first attribute makes it more important than the other attributes, so the MCR positional index for selective retrieval or early pruning decreases, as depicted in Figure 13(b).

Fig. 14. The result of update processing: (a) update time and (b) query time (incremental vs. merged), varying the tuple number in Tnew (×10^7).

10.11 Exp 9: The result of update operation
Given N = 10 × 10^8 (Told), experiment 9 evaluates the performance of the update operation in T2S. The update operation is processed as in Section 9. Initially the accumulated new data Tnew is empty, and the number of tuples in the periodically obtained new data Dnew is 10^6. When the volume of Tnew reaches 5% of Told (every fifty days), Tnew is merged with Told; a minimal sketch of this accumulate-and-merge scheme is given below. As shown in Figure 14(a), the time to finish the update operation on Tnew alone is 1059.477s at most.
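The following Java sketch illustrates the accumulate-and-merge processing flow under our own assumptions: the partition-level routines topKOnTold and topKOnTnew are hypothetical placeholders (the experiments run T2SR on Told and on Tnew, respectively), and only the 5% merge threshold is taken from the setting above.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: accumulate new batches into Tnew, answer queries over Told and Tnew
// separately, and rebuild the presorted structures once Tnew reaches 5% of Told.
public class IncrementalTopK {
    static final double MERGE_THRESHOLD = 0.05;      // merge when |Tnew| >= 5% of |Told|

    long oldSize;                                     // |Told|
    List<double[]> tnew = new ArrayList<>();          // accumulated new tuples, [score, id]

    IncrementalTopK(long oldSize) { this.oldSize = oldSize; }

    // Append a periodically arriving batch Dnew; merge when the threshold is reached.
    void append(List<double[]> dnew) {
        tnew.addAll(dnew);
        if (tnew.size() >= MERGE_THRESHOLD * oldSize) {
            mergeIntoTold();                          // rebuild presorted table and structures
        }
    }

    // Answer a query before the merge: top-k on each partition, then keep the best k overall.
    List<double[]> query(int k) {
        List<double[]> combined = new ArrayList<>();
        combined.addAll(topKOnTold(k));               // e.g. T2SR over the structures of Told
        combined.addAll(topKOnTnew(k));               // plain scan of the small Tnew
        combined.sort(Comparator.comparingDouble((double[] t) -> t[0]).reversed());
        return combined.subList(0, Math.min(k, combined.size()));
    }

    // Placeholders for the partition-level routines and the rebuild step.
    List<double[]> topKOnTold(int k) { return new ArrayList<>(); }
    List<double[]> topKOnTnew(int k) { return new ArrayList<>(); }
    void mergeIntoTold() { oldSize += tnew.size(); tnew.clear(); }
}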
The execution time of T2SR run on Told and Tnew separately to compute the top-k results is at most 1.7 seconds longer than that of T2SR on the merged data set.

10.12 Summary
As illustrated in the experiments, T2S outperforms the existing sorted-list-based methods significantly. It should also be noted that, given N = 10 × 10^8 and M = 10, it takes 50186.264 seconds for T2S to build the required data structures and the consumed disk space is 107GB. The pre-construction cost of T2S is relatively high, but the proposed T2S algorithm is still worth the price. The reasons are that (1) T2S obtains one order of magnitude speedup over the existing algorithms, which is a significant improvement; (2) for the sorted-list-based algorithms, it takes 31592.302 seconds to build the sorted lists from the row table and the required disk space is 149GB, so by comparison the pre-construction cost of T2S is not unacceptable; and (3) with the rapid growth of disk capacity, it is worth spending more space to speed up query evaluation. On data sets with negative correlation, the performance of the T2S algorithm degrades greatly and is not even as good as the naive approach when |c| is large. However, the naive approach only performs better at a large negative correlation coefficient; in the other experiments, it runs orders of magnitude slower than T2S. In practical applications, T2S shows a better overall performance than the naive algorithm. Moreover, negative correlation affects the existing algorithms even more severely, so the speedup of T2S over them becomes larger as |c| increases. For the update operation, when the size of Tnew is less than 5% of Told, the cost of the update operation on Tnew is relatively small. When enough new data has been received, Tnew is merged with Told. Although the cost of re-constructing the relevant data structures is relatively high, as illustrated in Section 10.11 the re-construction is performed every fifty days, which amortizes its cost. Of course, in some applications the re-construction may seem
unacceptable. In future work, we intend to explore more efficient update operations.

11 CONCLUSION
This paper proposes a novel T2S algorithm to efficiently return top-k results on massive data by sequentially scanning the presorted table, in which the tuples are arranged in the order of round-robin retrieval on the sorted lists. Only a fixed number of candidates need to be maintained in T2S. This paper proposes early termination checking and analyzes the scan depth. Selective retrieval is devised in T2S, and it is analyzed that most of the candidates in the presorted table can be skipped. The experimental results show that T2S significantly outperforms the existing algorithms.

ACKNOWLEDGEMENTS
This work was supported in part by the National Basic Research (973) Program of China under grant no. 2012CB316200, and the National Natural Science Foundation of China under grant nos. 61402130, 61272046, 61190115, 61173022 and 61033015.

REFERENCES
[1] R. Akbarinia, E. Pacitti, and P. Valduriez. Best position algorithms for top-k queries. In Proc. 33rd Int'l Conf. on Very Large Data Bases, VLDB '07, pages 495–506, 2007.
[2] H. Bast, D. Majumdar, R. Schenkel, et al. IO-Top-k: Index-access optimized top-k query processing. In Proc. 32nd Int'l Conf. on Very Large Data Bases, VLDB '06, pages 475–486, 2006.
[3] Y. Chang, L. Bergman, V. Castelli, et al. The onion technique: Indexing for linear optimization queries. In Proc. ACM SIGMOD Int'l Conf. on Management of Data, SIGMOD '00, pages 391–402, 2000.
[4] G. Das, D. Gunopulos, N. Koudas, and D. Tsirogiannis. Answering top-k queries using views. In Proc. 32nd Int'l Conf. on Very Large Data Bases, VLDB '06, pages 451–462, 2006.
[5] R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In Proc. ACM SIGMOD Int'l Conf. on Management of Data, SIGMOD '03, pages 301–312, 2003.
[6] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proc. 20th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, PODS '01, pages 102–113, 2001.
[7] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003.
[8] R. Fernandez, P. Pietzuch, J. Koshy, et al. Liquid: Unifying nearline and offline big data integration. In Biennial Conf. on Innovative Data Systems Research, CIDR '15, 2015.
[9] U. Güntzer, W. Balke, and W. Kießling. Optimizing multi-feature queries for image databases. In Proc. 26th Int'l Conf. on Very Large Data Bases, VLDB '00, pages 419–428, 2000.
[10] U. Güntzer, W. Balke, and W. Kießling. Towards efficient multi-feature queries in heterogeneous environments. In Proc. Int'l Conf. on Information Technology: Coding and Computing, ITCC '01, pages 622–628, 2001.
[11] X. Han, J. Li, and D. Yang. Supporting early pruning in top-k query processing on massive data. Inf. Process. Lett., 111(11):524–532, 2011.
[12] X. Han, J. Li, D. Yang, and J. Wang. Efficient skyline computation on big data. IEEE Trans. on Knowl. and Data Eng., 25(11):2521–2535, 2013.
[13] J. Heo, J. Cho, and K. Whang. Subspace top-k query processing using the hybrid-layer index with a tight bound. Data Knowl. Eng., 83:1–19, 2013.
[14] V. Hristidis and Y. Papakonstantinou. Algorithms and applications for answering ranked queries using ranked views. The VLDB J., 13(1):49–70, 2004.
[15] I. Ilyas, G. Beskales, and M. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4):11:1–11:58, 2008.
[16] J. Lee, H. Cho, and S. Hwang. Efficient dual-resolution layer indexing for top-k queries. In Proc. 28th Int'l Conf. on Data Engineering, ICDE '12, pages 1084–1095, 2012.
[17] N. Mamoulis, M. Yiu, K. Cheng, and D. Cheung. Efficient top-k aggregation of ranked inputs. ACM Trans. Database Syst., 32(3), 2007.
[18] H. Pang, X. Ding, and B. Zheng. Efficient processing of exact top-k queries over disk-resident sorted lists. The VLDB J., 19(3):437–456, 2010.
[19] M. Stonebraker, D. Abadi, A. Batkin, et al. C-Store: A column-oriented DBMS. In Proc. 31st Int'l Conf. on Very Large Data Bases, VLDB '05, pages 553–564, 2005.
[20] K. Wu, A. Shoshani, and K. Stockinger. Analyses of multi-level and multi-component compressed bitmap indexes. ACM Trans. Database Syst., 35(1):2:1–2:52, 2008.
[21] M. Xie, L. Lakshmanan, and P. Wood. Efficient top-k query answering using cached views. In Proc. 16th Int'l Conf. on Extending Database Technology, EDBT '13, pages 489–500, 2013.
[22] D. Xin, C. Chen, and J. Han. Towards robust indexing for ranked queries. In Proc. 32nd Int'l Conf. on Very Large Data Bases, VLDB '06, pages 235–246, 2006.
[23] L. Zou and L. Chen. Pareto-based dominant graph: An efficient indexing structure to answer top-k queries. IEEE Trans. on Knowl. and Data Eng., 23(5):727–741, 2011.

Xixian Han is a lecturer in the School of Computer Science and Technology, Harbin Institute of Technology, China. He received his Master's degree and PhD degree from the School of Computer Science and Technology, Harbin Institute of Technology, in 2006 and 2012, respectively. His main research interests include massive data management and data-intensive computing.

Jianzhong Li is Chair of the Department of Computer Science and Engineering at Harbin Institute of Technology, China. He is also a professor in the School of Computer Science and Technology, the Dean of the School of Computer Science and Technology, and the Dean of the School of Software at Heilongjiang University, China. His current research interests include data-intensive computing, wireless sensor networks and CPS.

Hong Gao is a professor in the School of Computer Science and Technology at Harbin Institute of Technology, China. Prof. Gao is the principal investigator of several National Natural Science Foundation projects and has participated in two National Basic Research (973) Program projects. Her research interests include wireless sensor networks, cyber-physical systems, massive data management and data mining.