For Details, Contact TSYS Academic Projects.
Ph: 9841103123, 044-42607879, Website: http://www.tsys.co.in/
Mail Id: tsysglobalsolutions2014@gmail.com.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
A Simple Message-Optimal Algorithm for Random Sampling from a Distributed Stream
Abstract - We present a simple, message-optimal algorithm for maintaining a random sample
from a large data stream whose input elements are distributed across multiple sites that
communicate via a central coordinator. At any point in time, the set of elements held by the
coordinator represents a uniform random sample of all the elements observed so far.
When compared with prior work, our algorithms asymptotically improve the total number of
messages sent in the system. We present a matching lower bound, showing that our protocol
sends the optimal number of messages up to a constant factor with large probability. We also
consider the important case when the distribution of elements across different sites is non-
uniform, and show that for such inputs, our algorithm significantly outperforms prior solutions.
IEEE Transactions on Knowledge and Data Engineering (June 1 2016)
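For context, the single-site baseline that this line of work generalizes is classic reservoir sampling (Algorithm R); the following is a minimal sketch of that baseline, not the paper's distributed, message-optimal protocol, and the function name is mine:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k from a single stream (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # item i+1 survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

The distributed setting is harder precisely because naively forwarding every element (or every coin flip) to the coordinator wastes messages; the paper's contribution is doing this with asymptotically optimal communication.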
Online Learning from Trapezoidal Data Streams
Abstract - In this paper, we study a new problem of continuous learning from doubly-streaming
data where both data volume and feature space increase over time. We refer to the doubly-
streaming data as trapezoidal data streams and the corresponding learning problem as online
learning from trapezoidal data streams. The problem is challenging because both data volume
and data dimension increase over time, and existing online learning [1] [2], online feature
selection [3], and streaming feature selection algorithms [4] [5] are inapplicable. We propose a
new Online Learning with Streaming Features algorithm (OLSF for short) and its two variants
that combine online learning [1] [2] and streaming feature selection [4] [5] to enable learning
from trapezoidal data streams with infinite training instances and features. Specifically, when a
new training instance carrying new features arrives, a classifier updates the existing features by
following the passive-aggressive update rule [2] and updates the new features by following the
structural risk minimization principle. Then, feature sparsity is introduced by using the projected
truncation technique. We derive performance bounds of the OLSF algorithm and its variants. We
also conduct experiments on real-world data sets to show the performance of the proposed
algorithms.
IEEE Transactions on Knowledge and Data Engineering (May 2016)
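The passive-aggressive rule cited as [2] above has a compact closed form; the sketch below shows the standard PA update for binary classification only, as an illustration of the cited rule rather than of the full OLSF algorithm (which additionally handles newly arriving features):

```python
import numpy as np

def pa_update(w, x, y):
    """One passive-aggressive update: stay passive on zero hinge loss,
    otherwise move w just enough to classify (x, y) with margin 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on the new instance
    if loss > 0:
        tau = loss / np.dot(x, x)             # minimal step achieving zero loss
        w = w + tau * y * x
    return w
```

After the update, the new instance is classified with margin at least 1, which is the "aggressive" half of the rule.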
Quality-Aware Subgraph Matching Over Inconsistent Probabilistic Graph Databases
Abstract - Resource Description Framework (RDF) has been widely used in the Semantic Web to
describe resources and their relationships. The RDF graph is one of the most commonly used
representations for RDF data. However, in many real applications such as the data
extraction/integration, RDF graphs integrated from different data sources may often contain
uncertain and inconsistent information (e.g., uncertain labels or labels that violate facts/rules), due to the unreliability of data sources. In this paper, we formalize RDF data as inconsistent probabilistic RDF graphs, which contain both inconsistencies and uncertainty. With such a
probabilistic graph model, we focus on an important problem, quality-aware subgraph matching
over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves subgraphs from
inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and with high
quality scores (considering both consistency and uncertainty). In order to efficiently answer QA-
gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and
quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an
effective index to facilitate our proposed pruning methods, and propose an efficient approach for
processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our
proposed approaches through extensive experiments.
IEEE Transactions on Knowledge and Data Engineering (June 1 2016)
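As a rough illustration of label-based pruning in general (the paper's adaptive label pruning is considerably more sophisticated), a candidate region can be discarded whenever it cannot supply every label the query graph requires; the function name here is hypothetical:

```python
def may_match(query_labels, candidate_labels):
    """Necessary condition for a subgraph match: every label required by the
    query must occur in the candidate region; if not, prune the candidate."""
    return set(query_labels) <= set(candidate_labels)
```

Filters of this kind are cheap necessary conditions: passing the test does not guarantee a match, but failing it safely eliminates the candidate without running isomorphism checks.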
CavSimBase: A Database for Large Scale Comparison of Protein Binding Sites
Abstract - CavBase is a database containing information about the three-dimensional geometry
and the physicochemical properties of putative protein binding sites. Analyzing CavBase data
typically involves computing the similarity of pairs of binding sites. In contrast to sequence
alignment, however, a structural comparison of protein binding sites is a computationally
challenging problem, making large scale studies difficult or even infeasible. One possibility to
overcome this obstacle is to precompute pairwise similarities in an all-against-all comparison,
and to make these similarities subsequently accessible to data analysis methods. Pairwise
similarities, once being computed, can also be used to equip CavBase with a neighborhood
structure. Taking advantage of this structure, methods for problems such as similarity retrieval
can be implemented efficiently. In this paper, we tackle the problem of performing an all-
against-all comparison using CavBase, consisting of more than 200,000 protein cavities, by
means of parallel computation and cloud computing techniques. We present the conceptual
design and technical realization of a large-scale study to create a similarity database called
CavSimBase. We illustrate how CavSimBase is constructed, accessed, and used to answer
biological questions by data analysis and similarity retrieval.
IEEE Transactions on Knowledge and Data Engineering (June 1 2016)
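The all-against-all precomputation is, in miniature, a similarity map over all n(n-1)/2 unordered pairs; a sketch with a placeholder similarity function (the actual study distributes exactly this loop across cloud workers, since structural comparison of binding sites is expensive):

```python
from itertools import combinations

def all_against_all(items, sim):
    """Precompute the similarity of every unordered pair of items."""
    return {(a, b): sim(a, b) for a, b in combinations(items, 2)}
```

For the more than 200,000 cavities in CavBase this is on the order of 2 x 10^10 pairs, which is why a serial computation is infeasible and parallelization is required.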
Online Subgraph Skyline Analysis over Knowledge Graphs
Abstract - Subgraph search is very useful in many real-world applications. However, users may
be overwhelmed by the masses of matches. In this paper, we propose a subgraph skyline analysis
problem, denoted as S²A, to support more complicated analysis over graph data. Specifically, given a large graph G and a query graph q, we want to find all the subgraphs g in G such that g is graph-isomorphic to q and not dominated by any other subgraph.
In order to improve the efficiency, we devise a hybrid feature encoding incorporating both
structural and numeric features based on a partitioning strategy, and discuss how to optimize the
space partitioning. We also present a skylayer index to facilitate the dynamic subgraph skyline
computation. Moreover, an attribute cluster-based method is proposed to deal with the curse of
dimensionality. Extensive experiments over real datasets confirm the effectiveness and
efficiency of our algorithm.
IEEE Transactions on Knowledge and Data Engineering (July 1 2016)
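The dominance test at the heart of any skyline computation is simple to state; below is a naive O(n²) sketch over tuples of numeric attributes, assuming smaller values are better (the paper's skylayer index and feature encoding exist precisely to avoid this brute force over all matching subgraphs):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every attribute and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Naive skyline: keep each point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```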
K Nearest Neighbour Joins for Big Data on MapReduce: a Theoretical and Experimental
Analysis
Abstract - Given a point p and a set of points S, the kNN operation finds the k closest points to p
in S. It is a computational intensive task with a large range of applications such as knowledge
discovery or data mining. However, as the volume and the dimension of data increase, only
distributed approaches can perform such a costly operation in a reasonable time. Recent works
have focused on implementing efficient solutions using the MapReduce programming model
because it is suitable for distributed large scale data processing. Although these works provide
different solutions to the same problem, each one has particular constraints and properties. In this
paper, we compare the different existing approaches for computing kNN on MapReduce, first
theoretically, and then by performing an extensive experimental evaluation. To be able to
compare solutions, we identify three generic steps for kNN computation on MapReduce: data
pre-processing, data partitioning, and computation. We then analyze each step in terms of load balancing, accuracy, and complexity. Experiments in this paper use a variety of datasets,
and analyze the impact of data volume, data dimension and the value of k from many
perspectives such as time and space complexity, and accuracy. The experiments reveal advantages and shortcomings that are discussed for each algorithm. To the best of our
knowledge, this is the first paper that compares kNN computing methods on MapReduce both
theoretically and experimentally with the same setting. Overall, this paper can be used as a guide
to tackle kNN-based practical problems in the context of big data.
IEEE Transactions on Knowledge and Data Engineering (May 2016)
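The kNN operation itself is easy to state on a single machine; the sketch below is that baseline (the paper's subject is how to distribute this over MapReduce when S is too large or too high-dimensional for one node):

```python
import heapq

def knn(p, S, k):
    """Return the k points of S closest to p under squared Euclidean distance."""
    dist = lambda s: sum((a - b) ** 2 for a, b in zip(p, s))
    return heapq.nsmallest(k, S, key=dist)
```

The three generic steps identified in the paper (pre-processing, partitioning, computation) all exist to replace this single loop over S with work that many mappers and reducers can share.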
ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
Abstract - We propose an algorithm for detecting patterns exhibited by anomalous clusters in
high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect
individual anomalies, our proposed method detects groups (clusters) of anomalies; i.e. sets of
points which collectively exhibit abnormal patterns. In many applications this can lead to better
understanding of the nature of the atypical behavior and to identifying the sources of the
anomalies. Moreover, we consider the case where the atypical patterns appear in only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and
techniques that detect anomalies using all the features typically fail to detect such anomalies, but
our method can detect such instances collectively, discover the shared anomalous patterns
exhibited by them, and identify the subsets of salient features. In this paper, we focus on
detecting anomalous topics in a batch of text documents, developing our algorithm based on
topic models. Results of our experiments show that our method can accurately detect anomalous
topics and salient features (words) under each such topic in a synthetic data set and two real-
world text corpora and achieves better performance compared to both standard group AD and
individual AD techniques. All required code to reproduce our experiments is available from
https://github.com/hsoleimani/ATD.
IEEE Transactions on Knowledge and Data Engineering (May 2016)
Multilabel Classification via Co-evolutionary Multilabel Hypernetwork
Abstract - Multilabel classification is prevalent in many real-world applications where data
instances may be associated with multiple labels simultaneously. In multilabel classification,
exploiting label correlations is an essential but nontrivial task. Most of the existing multilabel
learning algorithms are either ineffective or computationally demanding and less scalable in
exploiting label correlations. In this paper, we propose a co-evolutionary multilabel
hypernetwork (Co-MLHN) as an attempt to exploit label correlations in an effective and efficient
way. To this end, we first convert the traditional hypernetwork into a multilabel hypernetwork
(MLHN) where label correlations are explicitly represented. We then propose a co-evolutionary
learning algorithm to learn an integrated classification model for all labels. The proposed Co-
MLHN exploits arbitrary order label correlations and has linear computational complexity with
respect to the number of labels. Empirical studies on a broad range of multilabel data sets
demonstrate that Co-MLHN achieves competitive results against state-of-the-art multilabel
learning algorithms, in terms of both classification performance and scalability with respect to
the number of labels.
IEEE Transactions on Knowledge and Data Engineering (May 2016)
Learning to Find Topic Experts in Twitter via Different Relations
Abstract - Expert finding has become a hot topic along with the flourishing of social networks,
such as micro-blogging services like Twitter. Finding experts in Twitter is an important problem
because tweets from experts are valuable sources that carry rich information (e.g., trends) in
various domains. However, previous methods cannot be directly applied to the Twitter expert
finding problem. Recently, several attempts have used the relations among users and Twitter Lists for
expert finding. Nevertheless, these approaches only partially utilize such relations. To this end,
we develop a probabilistic method to jointly exploit three types of relations (i.e., follower
relation, user-list relation, and list-list relation) for finding experts. Specifically, we propose a
Semi-Supervised Graph-based Ranking approach (SSGR) to calculate the global authority of users offline. In SSGR, we employ a normalized Laplacian regularization term to jointly explore the three relations, subject to supervised information derived from Twitter crowds. We then compute online the local relevance between users and the given query.
By leveraging the global authority and local relevance of users, we rank all users and find the top-N users with the highest ranking scores. Experiments on real-world data demonstrate the effectiveness of our proposed approach for topic-specific expert finding in Twitter.
IEEE Transactions on Knowledge and Data Engineering (July 2016)
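For reference, the normalized Laplacian regularizer mentioned above is built from the symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2); a sketch for a single adjacency matrix (SSGR combines three relation graphs, which this sketch does not attempt):

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian of adjacency matrix A,
    commonly used as a smoothness regularizer over a graph."""
    d = A.sum(axis=1)                                    # node degrees
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)  # D^{-1/2}, guarding isolated nodes
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```

Penalizing x^T L x encourages connected users to receive similar authority scores, which is the intuition behind graph-based ranking.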
Analytic Queries over Geospatial Time-Series Data Using Distributed Hash Tables
Abstract - As remote sensing equipment and networked observational devices continue to
proliferate, their corresponding data volumes have surpassed the storage and processing
capabilities of commodity computing hardware. This trend has led to the development of
distributed storage frameworks that incrementally scale out by assimilating resources as
necessary. While challenging in its own right, storing and managing voluminous datasets is only
the precursor to a broader field of research: extracting insights, relationships, and models from
the underlying datasets. The focus of this study is twofold: exploratory and predictive analytics
over voluminous, multidimensional datasets in a distributed environment. Both of these types of
analysis represent a higher-level abstraction over standard query semantics; rather than indexing
every discrete value for subsequent retrieval, our framework autonomously learns the
relationships and interactions between dimensions in the dataset and makes the information
readily available to users. This functionality includes statistical synopses, correlation analysis,
hypothesis testing, probabilistic structures, and predictive models that not only enable the
discovery of nuanced relationships between dimensions, but also allow future events and trends
to be predicted. The algorithms presented in this work were evaluated empirically on a real-
world geospatial time-series dataset in a production environment, and are broadly applicable
across other storage frameworks.
IEEE Transactions on Knowledge and Data Engineering (June 2016)
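One of the statistical synopses mentioned above, a running mean and variance, can be maintained online per storage node without retaining raw observations, for example via Welford's algorithm; this is a minimal sketch, and the framework's actual synopses are richer:

```python
class RunningStats:
    """Welford's online mean/variance: a constant-space statistical synopsis."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n            # incremental mean
        self.m2 += d * (x - self.mean)     # accumulates sum of squared deviations
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

Because each node only keeps (n, mean, m2), synopses from many nodes can also be merged, which suits the distributed hash table setting.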
RSkNN: kNN Search on Road Networks by Incorporating Social Influence
Abstract - Although kNN search on a road network, i.e., finding the k nearest objects to a query user, has been extensively studied, existing works neglected the fact that a user's social information can play an important role in this kNN query. Many real-world applications, such as location-based social networking services, require such a query. In this paper, we study a new problem: kNN search on road networks by incorporating social influence (RSkNN). Specifically, the state-of-the-art Independent Cascade (IC) model in social networks is applied to define social influence.
One critical challenge of the problem is to speed up the computation of the social influence over
large road and social networks. To address this challenge, we propose three efficient index-based
search algorithms, i.e., road network-based (RN-based), social network-based (SN-based), and
hybrid indexing algorithms. In the RN-based algorithm, we employ a filtering-and-verification
framework for tackling the hard problem of computing social influence. In the SN-based
algorithm, we embed social cuts into the index to speed up the query. In the hybrid
algorithm, we propose an index, summarizing the road and social networks, based on which we
can obtain query answers efficiently. Finally, we use real road and social network data to
empirically verify the efficiency and efficacy of our solutions.
IEEE Transactions on Knowledge and Data Engineering (June 2016)
Unsupervised Visual Hashing with Semantic Assistant for Content-based Image Retrieval
Abstract - As an emerging technology to support scalable content-based image retrieval (CBIR),
hashing has recently received great attention and become a very active research domain. In
this study, we propose a novel unsupervised visual hashing approach called semantic-assisted
visual hashing (SAVH). Distinguished from semi-supervised and supervised visual hashing, its
core idea is to effectively extract the rich semantics latently embedded in auxiliary texts of
images to boost the effectiveness of visual hashing without any explicit semantic labels. To
achieve the target, a unified unsupervised framework is developed to learn hash codes by
simultaneously preserving the visual similarities of images, integrating semantic assistance from auxiliary texts to model high-order inter-image relationships, and characterizing the
correlations between images and shared topics. Our performance study on three publicly
available image collections: Wiki, MIR Flickr, and NUS-WIDE indicates that SAVH can
achieve superior performance over several state-of-the-art techniques.
IEEE Transactions on Knowledge and Data Engineering (May 2016)
A Scalable Data Chunk Similarity based Compression Approach for Efficient Big Sensing
Data Processing on Cloud
Abstract - Big sensing data is prevalent in both industry and scientific research applications
where the data is generated with high volume and velocity. Cloud computing provides a
promising platform for big sensing data processing and storage as it provides a flexible stack of
massive computing, storage, and software services in a scalable manner. Current big sensing data
processing on Cloud has adopted some data compression techniques. However, due to the high
volume and velocity of big sensing data, traditional data compression techniques lack sufficient
efficiency and scalability for data processing. Based on specific on-Cloud data compression
requirements, we propose a novel scalable data compression approach based on calculating
similarity among the partitioned data chunks. Instead of compressing basic data units, compression is conducted over partitioned data chunks. To restore the original data sets, restoration functions and predictions are designed. MapReduce is used for algorithm
implementation to achieve extra scalability on Cloud. With real world meteorological big
sensing data experiments on U-Cloud platform, we demonstrate that the proposed scalable
compression approach based on data chunk similarity can significantly improve data
compression efficiency with affordable data accuracy loss.
IEEE Transactions on Knowledge and Data Engineering (February 2016)
Network Motif Discovery: A GPU Approach
Abstract - The identification of network motifs has important applications in numerous domains,
such as pattern detection in biological networks and graph analysis in digital circuits. However,
mining network motifs is computationally challenging, as it requires enumerating subgraphs
from a real-life graph, and computing the frequency of each subgraph in a large number of
random graphs. In particular, existing solutions often require days to derive network motifs from
biological networks with only a few thousand vertices. To address this problem, this paper
presents a novel study on network motif discovery using Graphical Processing Units (GPUs).
The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in
computing subgraph frequencies from random graphs, so as to reduce the overall computation
time of network motif discovery. We explore the design space of GPU-based subgraph matching
algorithms, with careful analysis of several crucial factors (such as branch divergences and
memory coalescing) that affect the performance of GPU programs. Based on our analysis, we
develop a GPU-based solution that (i) considerably differs from existing CPU-based methods in
how it enumerates subgraphs, and (ii) exploits the strengths of GPUs in terms of parallelism
while mitigating their limitations in terms of the computation power per GPU core. With
extensive experiments on a variety of biological networks, we show that our solution is up to two
orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-
effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.
IEEE Transactions on Knowledge and Data Engineering (May 2016)
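At the core of motif discovery is counting the frequency of small subgraphs in the input graph and in many random graphs; a brute-force CPU sketch for the simplest motif, the triangle, is shown below (the paper's contribution is parallelizing far more general subgraph matching on GPUs, which this sketch does not represent):

```python
from itertools import combinations

def triangle_count(edges):
    """Brute-force count of triangles (the simplest 3-node motif) in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])
```

A motif is then deemed significant when its count in the real graph greatly exceeds its typical count in degree-matched random graphs, which is why thousands of such counts must be computed.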
Crowdsourced Data Management: A Survey
Abstract - Some important data management and analytics tasks cannot be completely addressed
by automated processes. These “computer-hard” tasks such as entity resolution, sentiment
analysis, and image recognition, can be enhanced through the use of human cognitive ability.
Human Computation is an effective way to address such tasks by harnessing the capabilities of
crowd workers (i.e., the crowd). Thus, crowdsourced data management has become an area of
increasing interest in research and industry. There are three important problems in crowdsourced
data management. (1) Quality Control: Workers may return noisy results and effective
techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost
control aims to reduce the monetary cost; (3) Latency Control: The human workers can be slow,
particularly in contrast to computing time scales, so latency-control techniques are required.
There has been significant work addressing these three factors for designing crowdsourced tasks,
developing crowdsourced data manipulation operators, and optimizing plans of multiple
operators. In this paper, we survey and synthesize a wide spectrum of existing studies on
crowdsourced data management. Based on this analysis we then outline key factors that need to
be considered to improve crowdsourced data management.
IEEE Transactions on Knowledge and Data Engineering (February 2016)
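The simplest quality-control technique in this literature is redundant assignment plus majority voting over worker answers; a minimal sketch (real systems typically weight workers by estimated accuracy rather than counting votes equally):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate redundant worker answers for one task by taking the most frequent one."""
    return Counter(answers).most_common(1)[0][0]
```

Assigning each task to, say, three or five workers and voting trades extra monetary cost for higher answer quality, which is exactly the quality/cost tension the survey discusses.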
Resolving Multi-Party Privacy Conflicts in Social Media
Abstract - Items shared through Social Media may affect more than one user's privacy—e.g.,
photos that depict multiple users, comments that mention multiple users, events in which
multiple users are invited, etc. The lack of multi-party privacy management support in current
mainstream Social Media infrastructures leaves users unable to appropriately control with whom these items are actually shared. Computational mechanisms that are able to merge the
privacy preferences of multiple users into a single policy for an item can help solve this problem.
However, merging multiple users’ privacy preferences is not an easy task, because privacy
preferences may conflict, so methods to resolve conflicts are needed. Moreover, these methods
need to consider how users would actually reach an agreement about a solution to the conflict in order to propose solutions that are acceptable to all of the users affected by the item to be
shared. Current approaches are either too demanding or only consider fixed ways of aggregating
privacy preferences. In this paper, we propose the first computational mechanism to resolve
conflicts for multi-party privacy management in Social Media that is able to adapt to different
situations by modelling the concessions that users make to reach a solution to the conflicts. We
also present results of a user study in which our proposed mechanism outperformed other
existing approaches in terms of how many times each approach matched users’ behaviour.
IEEE Transactions on Knowledge and Data Engineering (July 2016)
Interactive Visualization of Large Data Sets
Abstract - Visualization provides a powerful means for data analysis. But to be practical, visual analytics tools must support the smooth and flexible use of visualizations at interactive rates. This becomes
increasingly onerous with the ever-increasing size of real-world datasets. First, large databases
make interaction more difficult once query response time exceeds several seconds. Second, any
attempt to show all data points will overload the visualization, resulting in chaos that will only
confuse the user. Over the last few years substantial effort has been put into addressing both of
these issues and many innovative solutions have been proposed. Indeed, data visualization is a
topic that is too large to be addressed in a single survey paper. Thus, we restrict our attention
here to interactive visualization of large data sets. Our focus is thus naturally skewed towards the query processing problem, handled by an underlying database system, rather than the actual data visualization problem.
IEEE Transactions on Knowledge and Data Engineering (April 2016)
Improving Construction of Conditional Probability Tables for Ranked Nodes in Bayesian
Networks
Abstract - This paper elaborates on the ranked nodes method (RNM) that is used for constructing
conditional probability tables (CPTs) for Bayesian networks consisting of a class of nodes called
ranked nodes. Such nodes typically represent continuous quantities that lack well-established
interval scales and are hence expressed by ordinal scales. Based on expert elicitation, the CPT of
a child node is generated in RNM by aggregating weighted states of parent nodes with a weight
expression. RNM is also applied to nodes that are expressed by interval scales. However, the use
of the method in this way may be ineffective due to challenges which are not addressed in the
existing literature but are demonstrated through an illustrative example in this paper. To
overcome the challenges, the paper introduces a novel approach that facilitates the use of RNM.
It consists of guidelines concerning the discretization of the interval scales into ordinal ones and
the determination of a weight expression and weights based on assessments of the expert about
the mode of the child node. The determination is premised on interpretations and feasibility
conditions of the weights derived in the paper. The utilization of the approach is demonstrated
with the illustrative example throughout the paper.
IEEE Transactions on Knowledge and Data Engineering (July 2016)
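In RNM, one standard weight expression is the weighted mean (WMEAN) of parent node states mapped onto the [0, 1] scale; the sketch below shows that aggregation step alone (generating the full CPT additionally involves a truncated normal distribution centered on this mean, which is omitted here):

```python
def wmean(parent_states, weights):
    """WMEAN aggregation: weighted average of parent states on the [0, 1] scale."""
    return sum(w * s for w, s in zip(weights, parent_states)) / sum(weights)
```

The paper's guidelines concern exactly the choice of such a weight expression and its weights from expert assessments of the child node's mode.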
Clearing Contamination in Large Networks
Abstract - In this work, we study the problem of clearing contamination spreading through a
large network where we model the problem as a graph searching game. The problem can be
summarized as constructing a search strategy that will leave the graph clear of any contamination
at the end of the searching process in as few steps as possible. We show that this problem is NP-
hard even on directed acyclic graphs and provide an efficient approximation algorithm. We
experimentally observe the performance of our approximation algorithm in relation to the lower
bound on several large online networks including Slashdot, Epinions, and Twitter.
IEEE Transactions on Knowledge and Data Engineering (June 2016)
Private Over-threshold Aggregation Protocols over Distributed Databases
Abstract - In this paper, we revisit the private over-threshold data aggregation problem. We
formally define the problem’s security requirements as both data and user privacy goals. To
achieve both goals, and to strike a balance between efficiency and functionality, we devise an
efficient cryptographic construction and its proxy-based variant. Both schemes are provably
secure in the semi-honest model. Our key idea for the constructions and their malicious variants
is to compose two encryption functions tightly coupled in a way that the two functions are
commutative and one public-key encryption has an additive homomorphism. We call this double encryption. We analyze the computational and communication complexities of our construction,
and show that it is much more efficient than the existing protocols in the literature. Specifically,
our protocol has linear complexity in computation and communication with respect to the
number of users. Its round complexity is also linear in the number of users. Finally, we show that
our basic protocol is efficiently transformed into a stronger protocol secure in the presence of
malicious adversaries, and provide the resulting protocol’s performance and security analysis.
IEEE Transactions on Knowledge and Data Engineering (May 2016)
Challenges in Data Crowdsourcing
Abstract - Crowdsourcing refers to solving large problems by involving human workers that
solve component sub-problems or tasks. In data crowdsourcing, the problem involves data
acquisition, management, and analysis. In this paper, we provide an overview of data
crowdsourcing, giving examples of problems that the authors have tackled, and presenting the
key design steps involved in implementing a crowdsourced solution. We also discuss some of the
open challenges that remain to be solved.
IEEE Transactions on Knowledge and Data Engineering (April 2016)
Efficient R-Tree Based Indexing Scheme for Server-Centric Cloud Storage System
Abstract - Cloud storage systems pose new challenges for supporting efficient concurrent querying in various data-intensive applications, where indices play an important role. In this paper, we explore a practical method to construct a two-layer indexing scheme for multi-dimensional data in diverse server-centric cloud storage systems. We
first propose RT-HCN, an indexing scheme integrating R-tree based indexing structure and
HCN-based routing protocol. RT-HCN organizes storage and compute nodes into an HCN
overlay, one of the newly proposed server-centric data center topologies. Based on the properties
of HCN, we design a specific index mapping technique to maintain layered global indices and
corresponding query processing algorithms to support efficient query tasks. Then, we expand the
idea of RT-HCN onto another server-centric data center topology DCell, discovering a potential
generalized and feasible way of deploying two-layer indexing schemes on other server-centric
networks. Furthermore, we prove theoretically that RT-HCN is both space-efficient and query-efficient: each node maintains a tolerable number of global indices while highly concurrent queries can be processed with acceptable overhead. We finally conduct targeted
experiments on Amazon's EC2 platforms, comparing our design with RT-CAN, a similar
indexing scheme for traditional P2P networks. The results validate the query efficiency, especially the point-query speedup of RT-HCN, demonstrating its potential applicability in future data
centers.
IEEE Transactions on Knowledge and Data Engineering (June 2016)
SUPPORT OFFERED TO REGISTERED STUDENTS:
1. IEEE base paper.
2. Review material as per the individual university's guidelines.
3. Future enhancement.
4. Assistance in answering all critical questions.
5. Training on the programming language.
6. Complete source code.
7. Final report / document.
8. International conference / international journal publication on your project.
FOLLOW US ON FACEBOOK @ TSYS Academic Projects