Relevance Feature Discovery for Text Mining
Yuefeng Li, Abdulmohsen Algarni, Mubarak Albathan, Yan Shen, and Moch Arif Bijaksana
Abstract—It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user
preferences, because of the large scale of terms and data patterns. Most existing popular text mining and classification methods have
adopted term-based approaches. However, they all suffer from the problems of polysemy and synonymy. Over the years, the
hypothesis that pattern-based methods should perform better than term-based ones in describing user preferences has often been
held; yet how to effectively use large-scale patterns remains a hard problem in text mining. To make a breakthrough in this
challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both positive and negative
patterns in text documents as higher level features and deploys them over low-level features (terms). It also classifies terms into
categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this
model on RCV1, TREC topics and Reuters-21578 show that the proposed model significantly outperforms both the state-of-the-art
term-based methods and the pattern-based methods.
Index Terms—Text mining, text feature extraction, text classification
1 INTRODUCTION
The objective of relevance feature discovery (RFD) is to
find the useful features available in text documents,
including both relevant and irrelevant ones, for describing
text mining results. This is a particularly challenging task in
modern information analysis, from both an empirical and
theoretical perspective [33], [36]. This problem is also of central interest in many Web personalized applications, and has received attention from researchers in the Data Mining, Machine Learning, Information Retrieval and Web Intelligence communities [32].
There are two challenging issues in using pattern mining techniques for finding relevance features in both relevant and irrelevant documents [32]. The first is the low-support problem. Given a topic, long patterns are usually more specific to the topic, but they usually appear in documents with low support or frequency. If the minimum support is decreased, a lot of noisy patterns can be discovered. The second issue is the misinterpretation problem: the measures used in pattern mining (e.g., "support" and "confidence") turn out to be unsuitable when patterns are applied to solve problems. For example, a highly frequent pattern (normally a short pattern) may be a general pattern, since it can be frequently used in both relevant and irrelevant documents. Hence, the difficult problem is how to use discovered patterns to accurately weight useful features.
There are several existing methods for addressing these two challenging issues in text mining. Pattern taxonomy mining (PTM) models have been proposed [59], [60], [70], which mine closed sequential patterns in text paragraphs and deploy them over a term space to weight useful features. The concept-based model (CBM) [50], [51] has also been proposed to discover concepts by using natural language processing (NLP) techniques; it uses verb-argument structures to find concepts in sentences. These pattern-based (or concept-based) approaches have shown an important improvement in effectiveness [70]. However, the improvements over the best term-based methods are not significant, because how to effectively integrate patterns in both relevant and irrelevant documents is still an open problem.
Over the years, people have developed many mature term-based techniques for ranking documents, information filtering and text classification [37], [39], [44]. Recently, several hybrid approaches were proposed for text classification. To learn term features from only relevant documents and unlabelled documents, the approach in [27] used two term-based models: in the first stage, it utilized a Rocchio classifier to extract a set of reliable irrelevant documents from the unlabelled set; in the second stage, it built an SVM classifier to classify text documents. A two-stage model was also proposed in [34], [35], which showed that the integration of the rough analysis (a term-based model) and pattern taxonomy mining is the best way to design a two-stage model for information filtering systems.
For many years, we have observed that many terms with larger weights are more general because they are likely to be frequently used in both relevant and irrelevant documents [32]. For example, the word "LIB" may be more frequently used than the word "JDK"; but "JDK" is more specific than "LIB" for describing "Java Programming Languages"; and "LIB" is more general than "JDK" because "LIB" is also frequently used in other programming languages like C or C++. Therefore, we recommend considering both terms' distributions and their specificity for relevance feature discovery.

• Y. Li, A. Algarni, Y. Shen, and M. Bijaksana are with the School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia. E-mail: y2.li@qut.edu.au, {algarni.abdulmohsen, arifbijaksana}@gmail.com, y1.shen@student.qut.edu.au.
• M. Albathan is with the School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia, and the Al Imam Mohammad Ibn Saud Islamic University, P.O. Box 5701, Riyadh 11432, Saudi Arabia. E-mail: mubarak.albathan@student.qut.edu.au.

Manuscript received 2 May 2013; revised 1 Nov. 2014; accepted 4 Nov. 2014. Date of publication 23 Nov. 2014; date of current version 27 Apr. 2015. Recommended for acceptance by P. G. Ipeirotis. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2014.2373357

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015
1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Given a topic, a term’s specificity describes the extent to
which the term focuses on the topic that users want [33].
However, it is very difficult to measure the specificity of terms because a term's specificity depends on users' perspectives of their information needs [55]. We proposed the first definition of specificity in [30], [31], which calculated the specificity score of a term based on its appearance in discovered positive and negative patterns. However, this definition required an iterative algorithm (three loops) in order to weight terms accurately.
In order to make a breakthrough in relation to the two
challenging issues, we proposed the first version of the RFD
model in [32]. In accordance with the distributions of terms in a training set, it provided a new definition for the specificity function and used two empirical parameters to group terms into three categories: "positive specific terms", "general terms", and "negative specific terms". Based on these definitions, the RFD model can accurately evaluate term weights according to both their specificity and their distributions in the higher level features, where the higher level features include both positive and negative patterns.
The term classification method proposed in [32] requires
manually setting two empirical parameters according to
testing sets. In this paper, we continue to develop the RFD model, and experimentally prove that the proposed specificity function is reasonable and that the term classification can be effectively approximated by a feature clustering method. We also design a comprehensive approach for evaluating the proposed models. In addition, we conducted some new experiments using six new sliding windows to adaptively update the training sets, and also applied the RFD model to binary text classification to test the robustness of the proposed model.
This paper proposes an innovative technique for finding and classifying low-level terms based on both their appearances in the higher-level features (patterns) and their specificity in a training set. It also introduces a method to select irrelevant documents (so-called offenders) that are close to the extracted features in the relevant documents in order to effectively revise term weights. Compared with other methods, the advantages of the proposed model include:
• effective use of both relevant and irrelevant feedback to find useful features; and
• integration of both term and pattern features together, rather than using them in two separate stages.
To justify these claims for the proposed approach, we conducted substantial experiments on standard data collections, namely, the Reuters Corpus Volume 1 (RCV1), TREC filtering assessor topics, the Library of Congress Subject Headings (LCSH) ontology and Reuters-21578. We also used five measures and the t-test to evaluate these experiments. The results show that the proposed specificity function is adequate, the clustering method is effective and the proposed model is robust. The results also show that the proposed model significantly outperforms both the state-of-the-art term-based methods, underpinned by Okapi BM25, Rocchio, language models and SVM, and the pattern-based methods on most measures.
The remainder of this paper is organized as follows.
Section 2 introduces a detailed overview of the related work. Section 3 reviews the concept of features in text
documents. Section 4 discusses the RFD model. Section 5
proposes a new feature clustering method based on the
specificity function. To evaluate the performance of the proposed model, we conduct substantial experiments on
LCSH, RCV1, TREC filtering topics and Reuters-21578. The
empirical results and discussion are reported in Section 6,
followed by concluding remarks in the last section.
2 RELATED WORK
Feature selection is a technique that selects a subset of features from data for modeling systems (see http://en.wikipedia.org/wiki/Feature_selection). Over the years, a variety of feature selection methods (e.g., Filter, Wrapper, Embedded and Hybrid approaches, and unsupervised or semi-supervised methods) have been proposed in various fields [6], [9], [17], [54], [69]. Feature selection is also one of the important steps for text classification and information filtering [1], [5], [47], which is the task of assigning documents to predefined classes. To date, many classifiers, such as Naive Bayes, Rocchio, kNN, SVM and Lasso regression [16], [26], [27], [28], [37], [62], [66], have been developed; in addition, many believe that SVM is a promising classifier [13].
The classification problems include the single-class and the multi-class problems. The most common solution [71] to the multi-class problem is to decompose it into several independent binary classifiers, where each binary classifier assigns a document to one of two predefined classes (e.g., the relevant category or the irrelevant category). Most traditional text feature selection methods used the bag of words to select a set of features for the multi-class problem [13]. There are several feature selection criteria for text categorization, including document frequency (DF), the global IDF, information gain, mutual information (MI), Chi-square (χ²) and term strength [1], [29], [37], [45], [67].
In this paper, we focus on relevant feature selection in text documents. Relevance is a big research issue [25], [32], [65] for Web search, which concerns a document's relevance to a user or a query. However, the traditional feature selection methods are not effective for selecting text features to solve the relevance issue, because relevance is a single-class problem [13]. The efficient way of selecting features for relevance is based on a feature weighting function. A feature weighting function indicates the degree of information represented by the feature's occurrences in a document and reflects the relevance of the feature. The popular term-based ranking models include tf*idf based techniques, the Rocchio algorithm, probabilistic models and Okapi BM25 [4], [24], [37], [44].
Recently, one of the important issues for multimedia data is the identification of the optimal feature set without any redundancy [69]; however, the challenging issue for feature selection in text documents is the identification of which format the relevant features take, or where they are located in a text document, because of the large amount of noisy information in the document [2]. Text features can be simple structures (words), complex linguistic structures or statistical structures. We mainly discuss three complex structures below for selecting relevant features: n-grams, concepts and patterns.
n-grams (or phrases) are more discriminative and carry more "semantics" than words. They have been useful for building good ranking functions [20], [47], [53]. In [49], a phrase-based text representation for Web document management was also proposed that used rule-based Natural Language Processing and Context Free Grammar techniques. Language models were proposed to calculate weights for n-grams, and are often approximated by Unigram, Bigram or Trigram models to account for word dependencies [8], [39], [53], [58]. A concept-based model [50], [51] was also presented to find concepts in text documents by using NLP techniques, which analyzed terms' associations based on the semantic structure of sentences. This model included three components: the first analyzed the semantic structure of sentences; the second then constructed a conceptual ontological graph (COG) to represent the semantic structures; and the last found the top concepts according to the first two components in order to generate feature vectors using the standard vector space model.
Pattern mining has been extensively studied in the data mining community for many years. A variety of efficient algorithms, such as Apriori-like algorithms, PrefixSpan, FP-tree, SPADE, SLPMiner and GST, have been proposed [18], [19], [40], [48], [68]. Pattern post-processing methods were also proposed to compress or group patterns into clusters [64]. However, interpreting useful patterns for text mining remains an open problem [32]. Typically, text mining discusses terms' associations at a broad-spectrum level, paying little attention to labeled information and duplications of terms [33], [34]. Usually, the existing text mining techniques return numerous patterns (sets of terms) in text documents. Not surprisingly, many patterns are redundant or noisy. Therefore, the challenging issue is how to effectively deal with a very large set of patterns and terms containing a lot of redundant or noisy information [32].
To reduce the quantity of redundant information, closed patterns have turned out to be a good alternative to phrases [21], [60]. To effectively use closed patterns for weighting terms, a pattern deploying method was proposed in [59] to map closed patterns into a term vector that includes a set of terms and a term-weight distribution. This method has also shown encouraging improvements in effectiveness compared with traditional IR models [3], [32], [34].
The big obstacle for pattern mining based approaches to text mining is how to effectively use both relevant and irrelevant feedback. In [70], a pattern deploying method was proposed to update positive patterns; however, the improvement in effectiveness was not significant. In regard to the aforementioned problem of redundancy and noise, another challenging issue for pattern-based methods is how to deal with low-frequency patterns [32]. By way of illustration, a short pattern (normally one with large support, also called a highly frequent pattern) is usually a general pattern, while a large pattern (a low-frequency pattern with small support) could be a specific one. Recently, a clustering-based feature subset selection method was presented that groups features into clusters to reduce dimensionality [54]. Another interesting idea is to identify interesting features in LDA topics [14].
In summary, the existing methods for finding relevance features can be grouped into three approaches [32]. The first approach tries to diminish the weights of terms that appear in both relevant documents and irrelevant documents (e.g., Rocchio-based models [41]). This heuristic is obvious if we assume that terms are isolated atoms. The second is based on how often features appear or do not appear in relevant and irrelevant documents (e.g., probabilistic models [61] or BM25 [43], [44]). The third is based on finding features through positive patterns [32], [59], [60]. The proposed model further develops the third approach by grouping features into three categories: "positive specific features", "general features", and "negative specific features".
3 DEFINITIONS
For a given topic, the goal of relevance feature discovery in text documents is to find a set of useful features, including patterns, terms and their weights, in a training set D, which consists of a set of relevant documents, D+, and a set of irrelevant documents, D−. In this paper, we assume that all text documents, d, are split into paragraphs, PS(d). In this section, we introduce the basic definitions about patterns and the deploying method. These definitions can also be found in [32], [34], [59].
3.1 Frequent and Closed Patterns
Let T1 = {t1, t2, ..., tm} be a set of terms (or words) extracted from D+, and let a termset X be a set of terms. For a given document d, coverset(X) is called the covering set of X in d, which includes all paragraphs dp ∈ PS(d) such that X ⊆ dp, i.e., coverset(X) = {dp | dp ∈ PS(d), X ⊆ dp}. The absolute support of X is the number of occurrences of X in PS(d), that is, sup_a(X) = |coverset(X)|. Its relative support is the fraction of the paragraphs that contain the pattern, that is, sup_r(X) = |coverset(X)| / |PS(d)|. A termset X is called a frequent pattern if its sup_a (or sup_r) ≥ min_sup, a given minimum support.

It is obvious that a termset X can be mapped to a set of paragraphs coverset(X). We can also map a set of paragraphs Y ⊆ PS(d) to a termset, which satisfies

    termset(Y) = {t | ∀dp ∈ Y ⇒ t ∈ dp}.

A pattern X (also a termset) is called closed if and only if X = termset(coverset(X)).

Let X be a closed pattern. We have

    sup_a(X1) < sup_a(X)    (1)

for all patterns X1 ⊃ X.

All closed patterns can be structured into a pattern taxonomy by using the subset (or is-a) relation [59].
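To make the definitions concrete, here is a toy sketch in Python (our own illustration, not code from the paper); paragraphs are modelled as sets of terms, and all data is made up:

```python
# Toy sketch of coverset, absolute/relative support, and the closedness
# test from Section 3.1. Paragraphs are modelled as sets of terms.

def coverset(X, paragraphs):
    """All paragraphs of PS(d) that contain every term of X."""
    X = set(X)
    return [dp for dp in paragraphs if X <= dp]

def sup_a(X, paragraphs):
    """Absolute support: |coverset(X)|."""
    return len(coverset(X, paragraphs))

def sup_r(X, paragraphs):
    """Relative support: |coverset(X)| / |PS(d)|."""
    return sup_a(X, paragraphs) / len(paragraphs)

def termset(Y):
    """Terms shared by every paragraph in Y."""
    Y = [set(dp) for dp in Y]
    return set.intersection(*Y) if Y else set()

def is_closed(X, paragraphs):
    """X is closed iff X == termset(coverset(X))."""
    return set(X) == termset(coverset(X, paragraphs))

paragraphs = [{"java", "jdk", "lib"}, {"java", "jdk"}, {"java", "lib"}]
print(sup_a({"java", "jdk"}, paragraphs))      # 2
print(is_closed({"java", "jdk"}, paragraphs))  # True
print(is_closed({"jdk"}, paragraphs))          # False: every paragraph with "jdk" also has "java"
```

The closedness test mirrors Eq. (1): a non-closed termset such as {"jdk"} can be extended (here to {"java", "jdk"}) without losing support.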
3.2 Closed Sequential Patterns
A sequential pattern s = ⟨t1, ..., tr⟩ (ti ∈ T1) is an ordered list of terms. A sequence s1 = ⟨x1, ..., xi⟩ is called a sub-sequence of another sequence s2 = ⟨y1, ..., yj⟩, denoted by s1 ⊑ s2, iff there exist j1, ..., ji such that 1 ≤ j1 < j2 < ... < ji ≤ j and x1 = y_{j1}, x2 = y_{j2}, ..., xi = y_{ji}. Given s1 ⊑ s2, we call s1 a sub-pattern of s2, and s2 a super-pattern of s1. In the following, we refer to sequential patterns simply as patterns.
Given a sequential pattern X in document d, coverset(X) is still used to describe the covering set of X, which includes all paragraphs ps ∈ PS(d) such that X ⊑ ps, i.e., coverset(X) = {ps | ps ∈ PS(d), X ⊑ ps}. Its absolute support and relative support are defined in the same way as for normal patterns. A sequential pattern X is called a frequent pattern if its relative support ≥ min_sup. The property of closed patterns (see Eq. (1)) is used to define closed sequential patterns. A frequent sequential pattern X is closed if sup_a(X1) ≠ sup_a(X) for any super-pattern X1 of X.
3.3 Deploying Higher Level Patterns on Low-Level Terms
For term-based approaches, weighting the usefulness of a given term is based on its appearance in documents. However, for pattern-based approaches, weighting the usefulness of a given term is based on its appearance in discovered patterns.

To improve the efficiency of pattern taxonomy mining, an algorithm, SPMining(D+, min_sup) [60], was proposed (and also used in [34], [59]) to find closed sequential patterns for all documents in D+, which used the well-known Apriori property to reduce the search space. For all relevant documents di ∈ D+, the SPMining algorithm discovers all closed sequential patterns, SPi, based on a given min_sup. We do not repeat this algorithm here because it is not the particular focus of this study.
Let SP1, SP2, ..., SP_{|D+|} be the sets of discovered closed sequential patterns for all documents di ∈ D+ (i = 1, ..., n), where n = |D+|. For a given term t, its d_support (deploying support, called weight in this paper) in the discovered patterns can be described as follows:

    d_sup(t, D+) = Σ_{i=1}^{n} sup_i(t) = Σ_{i=1}^{n} |{p | p ∈ SPi, t ∈ p}| / Σ_{p ∈ SPi} |p|,    (2)

where |p| is the number of terms in p.

After the deploying supports of terms have been computed from the training set, let w(t) = d_sup(t, D+); the following rank function is used to decide the relevance of document d:

    rank(d) = Σ_{t ∈ T} w(t) τ(t, d),    (3)

where τ(t, d) = 1 if t ∈ d; otherwise τ(t, d) = 0.
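The deploying step above can be sketched as follows (a minimal illustration assuming each relevant document's closed sequential patterns have already been mined; the pattern lists are made-up toy data):

```python
# Toy sketch of the deploying support (Eq. (2)) and rank (Eq. (3)).
# SP holds one list of closed sequential patterns per relevant document.

def d_sup(t, SP):
    """d_sup(t, D+) = sum over documents of |{p in SP_i : t in p}| / sum of |p|."""
    total = 0.0
    for SP_i in SP:
        size = sum(len(p) for p in SP_i)   # total pattern length in document i
        if size:
            total += sum(1 for p in SP_i if t in p) / size
    return total

def rank(doc_terms, weights):
    """rank(d) = sum of w(t) for terms t appearing in d (tau(t, d) = 1)."""
    return sum(w for t, w in weights.items() if t in doc_terms)

SP = [[("java", "jdk"), ("jdk",)],   # closed patterns mined from document 1
      [("java", "lib")]]             # closed patterns mined from document 2
w = {t: d_sup(t, SP) for t in ("java", "jdk", "lib")}
print(rank({"java", "jdk", "xml"}, w))   # w(java) + w(jdk) = 5/6 + 2/3 = 1.5
```

Note that each document contributes a normalized share, so a term occurring in many of a document's patterns (like "jdk" above) receives a larger deployed weight.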
4 RFD MODEL
In this section, we introduce the RFD model for relevance feature discovery, which describes the relevant features in relation to three groups: positive specific terms, general terms and negative specific terms, based on their appearances in a training set. We first discuss the concept of "specificity" in terms of the relative "specificity" in training datasets and the absolute "specificity" in a domain ontology. We also present a way to understand whether the proposed relative "specificity" is reasonable in terms of the absolute "specificity". Finally, we introduce the term weighting method in the RFD model.
4.1 Specificity Function
In the RFD model, a term's specificity (referred to as relative specificity in this paper) is defined [32] according to its appearance in a given training set. Let T2 be the set of terms extracted from D−, and let T = T1 ∪ T2. Given a term t ∈ T, its coverage+ is the set of relevant documents that contain t, and its coverage− is the set of irrelevant documents that contain t. We assume that the terms frequently used in both relevant documents and irrelevant documents are general terms. Therefore, we want to classify the terms that are more frequently used in the relevant documents into the positive specific category; the terms that are more frequently used in the irrelevant documents are classified into the negative specific category.

Based on the above analysis, we defined the specificity of a given term t in the training set D = D+ ∪ D− as follows:

    spe(t) = (|coverage+(t)| − |coverage−(t)|) / n,    (4)

where coverage+(t) = {d ∈ D+ | t ∈ d}, coverage−(t) = {d ∈ D− | t ∈ d}, and n = |D+|. spe(t) > 0 means that term t is used more frequently in relevant documents than in irrelevant documents.
Based on the spe function, we have the following classification rules for determining the general terms G, positive specific terms T+ and negative specific terms T−:

    G = {t ∈ T | θ1 ≤ spe(t) ≤ θ2},
    T+ = {t ∈ T | spe(t) > θ2},
    T− = {t ∈ T | spe(t) < θ1},

where θ2 is an experimental coefficient, the maximum boundary of the specificity for the general terms, and θ1 is also an experimental coefficient, the minimum boundary of the specificity for the general terms. We assume that θ2 > 0 and θ2 ≥ θ1. It is easy to verify that G, T+ and T− are pairwise disjoint. Therefore, {G, T+, T−} is a partition of all terms.
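Equation (4) and the classification rules above can be sketched in a few lines (a toy illustration; the documents, terms and thresholds are made up):

```python
# Toy sketch of the specificity function (Eq. (4)) and the
# threshold-based term classification; documents are sets of terms,
# theta1/theta2 are the empirical coefficients.

def spe(t, D_pos, D_neg):
    """(|coverage+(t)| - |coverage-(t)|) / n, with n = |D+|."""
    cov_pos = sum(1 for d in D_pos if t in d)
    cov_neg = sum(1 for d in D_neg if t in d)
    return (cov_pos - cov_neg) / len(D_pos)

def classify(terms, D_pos, D_neg, theta1, theta2):
    """Partition terms into (T+, G, T-) by their spe scores."""
    T_pos, G, T_neg = set(), set(), set()
    for t in terms:
        s = spe(t, D_pos, D_neg)
        if s > theta2:
            T_pos.add(t)
        elif s < theta1:
            T_neg.add(t)
        else:
            G.add(t)
    return T_pos, G, T_neg

D_pos = [{"jdk", "java"}, {"jdk", "lib"}, {"java", "lib"}]
D_neg = [{"lib", "c"}]
T_pos, G, T_neg = classify({"jdk", "java", "lib", "c"}, D_pos, D_neg, -0.1, 0.5)
print(T_pos, G, T_neg)   # T+ = {jdk, java}, G = {lib}, T- = {c}
```

Here "lib" lands in G because it appears on both sides of the training set, matching the intuition that cross-category terms are general.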
A term’s relative specificity describes the extent to which
the term focuses on the topic that users want. It is very diffi-
cult to measure the relative specificity of terms because a
term’s specificity depends on users’ perspectives of their
information needs [55]. For example, “knowledge discov-
ery” will be a general term in the data mining community;
however, it may be a specific term when we talk about infor-
mation technology.
In this paper, we propose a way to understand whether
the proposed relative “specificity” is reasonable in term of
the absolute “specificity” in domain ontology, where
“absolute” means the specificity is independent to any train-
ing dataset. Normally, people consider terms to be more
general if they are frequently used in a very large domain
ontology; otherwise, they are more specific. Therefore, we
define the absolute specificity of a term in the ontology as
follows: speontoðtÞ ¼ 1
jcoverageðtÞj, where coverageðtÞ denotes the
set of concepts of subjects that use term t for describing their
meaning.
To clearly illustrate the spe values between 0 and 1, we normalize the above equation as follows:

    spe_onto(t) = log10(N / |coverage(t)|) / log10(N / M),    (5)

where N is the total number of subjects and M is the maximum of |coverage(t)| for all t ∈ T.
We call a relative spe function reasonable if the average absolute specificity of its positive specific terms (T+) is greater than the average absolute specificity of its general terms (G).
4.2 Weighting Features
To describe relevance features for a given topic, we normally believe that specific terms are very useful for distinguishing the topic from other topics. However, our experiments (see Section 6.6.2) show that using only specific terms is not good enough to improve the performance of relevance feature discovery, because user information needs cannot simply be covered by documents that contain only the specific terms. Therefore, the best way is to use the specific terms mixed with some of the general terms. We discuss this issue in the evaluation section.
To improve effectiveness, RFD uses irrelevant documents in the training set in order to remove noise. The first issue in using irrelevant documents is how to select a suitable set of them, since a very large set of negative samples is typically obtained. For example, a Google search can return millions of documents; however, only a few of those documents may be of interest to a Web user. Obviously, it is not efficient to use all of the irrelevant documents.

Most models can rank documents (see the ranking function in Equation (3)) using a set of extracted features. If an irrelevant document gets a high rank, the document is called an offender [33] because it is a false discovery. The offenders are normally defined as the top-K ranked irrelevant documents. The basic hypothesis in this paper is that relevance features are used to describe relevant documents, and irrelevant documents are used to ensure the discrimination of extracted features. Therefore, RFD only selects some offenders (i.e., the top-K ranked irrelevant documents) rather than using all irrelevant documents. In Section 6.6.1 we discuss the performance of using different K values, where K = n/2 obtained the best performance.
Once we select the top-K irrelevant documents, the set of irrelevant documents D− is reduced to include only the K offenders; therefore, we have |D+| ≥ 2|D−| if K = n/2. The spe function reaches its maximum value, 1, if there is a term t such that coverage−(t) = ∅; and its minimum value, −1/2, if there is a term t such that coverage+(t) = ∅. Let 0 ≤ θ2 ≤ 1; then we can easily verify that −1/2 ≤ θ1 ≤ θ2 ≤ 1 if K = n/2.
The calculation of the original RFD term weighting function [32] includes two steps: initial weight calculation and weight revision. Based on Equation (2), in this paper we integrate the two steps into the following equation:

    w(t) = d_sup(t, D+) (1 + spe(t))      if t ∈ T+,
           d_sup(t, D+)                   if t ∈ G,
           d_sup(t, D+) (1 − |spe(t)|)    if t ∈ T1,
           −d_sup(t, D−) (1 + |spe(t)|)   otherwise,

where the d_sup function is defined in Equation (2).
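The integrated weighting can be sketched as below, assuming the deployed supports d_sup(t, D+), d_sup(t, D−) and the spe(t) value have been precomputed (all names and numbers are illustrative):

```python
# Toy sketch of the integrated weighting function. The cases are checked
# in order, so the third branch covers negative specific terms that
# still appear in D+ (i.e., t in T1 but not in T+ or G).

def weight(t, T_pos, G, T1, d_sup_pos, d_sup_neg, spe_t):
    if t in T_pos:                          # positive specific term
        return d_sup_pos * (1 + spe_t)
    if t in G:                              # general term
        return d_sup_pos
    if t in T1:                             # negative specific, but seen in D+
        return d_sup_pos * (1 - abs(spe_t))
    return -d_sup_neg * (1 + abs(spe_t))    # negative specific, only in D-

T_pos, G, T1 = {"jdk"}, {"java"}, {"jdk", "java", "lib"}
print(weight("jdk", T_pos, G, T1, 0.5, 0.0, 0.4))   # 0.5 * (1 + 0.4)
print(weight("c", T_pos, G, T1, 0.0, 0.1, -0.5))    # -0.1 * (1 + 0.5)
```

Positive specific terms are boosted, terms in T1 that lean negative are damped, and terms seen only in the offender documents receive a negative weight.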
5 TERM CLASSIFICATION
RFD uses both specific features (i.e., T+ and T−) and general features (i.e., G). Therefore, the key research question is how to find the best partition (T+, G, T−) to effectively classify relevant documents and irrelevant documents. For a given set of features, however, this question is an NP-hard problem because of the large number of possible combinations of groups of features. In this section we propose an approximation approach and efficient algorithms to refine the RFD model.
5.1 An Approximation Approach
The best partition (T+, G, T−) is used to clearly distinguish irrelevant documents from relevant ones. Assume that we have two characteristic functions, f1 and f2, on all terms, such that f1(t) is the approximate average weight of t for all relevant documents, and f2(t) is the approximate average weight of t for all irrelevant documents. Therefore, the best partition (T+, G, T−) maximizes the following integral:

    ∫_{t1}^{tn} (f1(t) − f2(t)) dt.

The above discussion motivates us to find adequate θ1 and θ2 to move positive specific features far away from negative specific features. If we view the terms that have the same specificity score as a cluster and use the spe function as the distance function, the new solution is to find three groups that clearly divide the terms into three categories.
Based on the above analysis, we can develop a clustering method to group terms into three categories automatically for each topic by using the specificity function. In the beginning, we assign the terms that appear only in irrelevant documents to the negative specific category T−. For the remaining terms, we initially view each term ti as a single cluster ci. We also represent each cluster ci using an interval [minspe(ci), maxspe(ci)], where minspe(ci) is the smallest spe value of the elements in ci, and maxspe(ci) is the largest spe value of the elements in ci.

Let ci and cj be two clusters. The difference between the two clusters is defined as follows:

    dif(ci, cj) = min{|maxspe(ci) − minspe(cj)|, |maxspe(cj) − minspe(ci)|}.

A bottom-up approach is used to merge two clusters if they have the minimum difference. Let ck be the merged cluster of ci and cj; then we have ck = ci ∪ cj, minspe(ck) = min{minspe(ci), minspe(cj)} and maxspe(ck) = max{maxspe(ci), maxspe(cj)}.

The merging operation continues until three clusters are left, provided the number of initial clusters is greater than three. The distances between two adjacent clusters in the retained three clusters should be greater than or equal to any other distances between two adjacent clusters. The cluster that has the biggest minspe is determined as T+, the cluster that has the second biggest minspe forms category G, and the remainder becomes part of T−.
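The bottom-up merging described above can be sketched as follows (a simplified illustration that builds one interval cluster per distinct spe value; it assumes the terms appearing only in irrelevant documents have already been assigned to T−, and the spe scores are made-up):

```python
# Simplified sketch of the bottom-up interval merging. Each cluster is
# [minspe, maxspe, terms]; the closest pair of adjacent clusters is
# merged until three clusters remain.

def dif(ci, cj):
    """min{|maxspe(ci) - minspe(cj)|, |maxspe(cj) - minspe(ci)|}."""
    return min(abs(ci[1] - cj[0]), abs(cj[1] - ci[0]))

def cluster_terms(spe_values, k=3):
    """spe_values maps term -> spe score; returns k clusters, highest spe first."""
    values = sorted(set(spe_values.values()), reverse=True)
    clusters = [[v, v, [t for t, s in spe_values.items() if s == v]]
                for v in values]                  # one cluster per spe value
    while len(clusters) > k:
        # index of the closest pair of adjacent clusters
        i = min(range(len(clusters) - 1),
                key=lambda j: dif(clusters[j], clusters[j + 1]))
        a, b = clusters[i], clusters.pop(i + 1)   # merge b into a
        a[0] = min(a[0], b[0])
        a[1] = max(a[1], b[1])
        a[2] = a[2] + b[2]
    return clusters

spe_values = {"jdk": 0.9, "api": 0.85, "java": 0.3, "code": 0.25, "lib": -0.4}
c_pos, c_gen, c_rest = cluster_terms(spe_values)
print(c_pos[2], c_gen[2], c_rest[2])   # T+ terms, G terms, extra T- terms
```

Because clusters are kept sorted by spe, the first retained cluster supplies T+, the second supplies G, and the last is added to T−, matching the rule stated above.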
5.2 Efficient Algorithms
Algorithm FClustering describes the process of feature clustering, where DP+ is the set of discovered patterns of D+ and DP− is the set of discovered patterns of D−. Steps 1 to 4 initialize the three categories: all terms that are not elements of positive patterns are assigned to category T−. Each of the remaining m terms is viewed as a single cluster in the beginning (Steps 5 to 7). The algorithm then sorts these clusters in C based on their minspe values (Step 9). Steps 10 to 21 describe the iterative process of merging clusters until only three clusters are left. The merging process first finds the closest pair of adjacent clusters, ck and ck+1 (Steps 11 to 14). It then merges the two clusters into one, denoted ck (Steps 15 to 19), and deletes ck+1 from C (Steps 20 and 21). In the last step, it chooses the first cluster as T+, the second cluster as G (if it exists) and the last cluster as a part of T− (if it exists).

1660 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015
In the initialization, the algorithm spends the most time (O(|T|^2)) finding the initial value of T−. The initialization can also be implemented in O(|T|) if a hash function is used for the containment test. Before the merging process, it takes O(m log m) to sort C, where m = |C| and m ≤ |T|. In the while loop, it takes O(m) time to merge two clusters and O(m^2) time to move the clusters in C. Therefore, the time complexity is O(|T| + m log m + m^2) = O(|T| + m^2) = O(|T|^2).
FClustering()
Input: Discovered features <T, DP+, DP−> and function spe.
Output: Three categories of terms T+, G and T−.
Method:
1: G = ∅; T+ = ∅; T− = ∅;
2: foreach ti ∈ T do
3:   if ti ∉ {t | t ∈ P, P ∈ DP+}
4:   then T− = T− ∪ {ti};
5: foreach ti ∈ T − T− do {
6:   ci = {ti};
7:   maxspe(ci) = minspe(ci) = spe(ti); }
8: let m = |T − T−|;
9: let C = {c1, c2, ..., cm} with minspe(c1) ≥ ... ≥ minspe(cm);
10: while (|C| > 3) { // start merging process
11:   let k = 1 and mind = dif(c1, c2);
12:   for i = 2 to m − 1 do
13:     if dif(ci, ci+1) < mind
14:     then { k = i; mind = dif(ci, ci+1); }
15:   let ck = ck ∪ ck+1;
16:   if minspe(ck+1) < minspe(ck)
17:   then minspe(ck) = minspe(ck+1);
18:   if maxspe(ck+1) > maxspe(ck)
19:   then maxspe(ck) = maxspe(ck+1);
20:   for i = k + 1 to m − 1 do // delete ck+1 from C
21:     let ci = ci+1; }
22: if |C| = 1 then T+ = c1
23: else if |C| = 2 then { T+ = c1; G = c2 }
24: else { T+ = c1; G = c2; T− = T− ∪ c3 };
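A minimal Python sketch of Algorithm FClustering, assuming spe is a dict mapping each term to its spe value and pos_terms is the set of terms occurring in some positive pattern; these names and the cluster representation are ours, not the paper's.

```python
def f_clustering(terms, pos_terms, spe):
    # Steps 1-4: terms outside all positive patterns go straight to T-.
    t_neg = {t for t in terms if t not in pos_terms}
    # Steps 5-9: one singleton cluster per remaining term, sorted by spe (descending).
    clusters = [{'terms': {t}, 'min_spe': spe[t], 'max_spe': spe[t]}
                for t in terms - t_neg]
    clusters.sort(key=lambda c: -c['min_spe'])

    def dif(ci, cj):
        return min(abs(ci['max_spe'] - cj['min_spe']),
                   abs(cj['max_spe'] - ci['min_spe']))

    # Steps 10-21: repeatedly merge the closest pair of adjacent clusters.
    while len(clusters) > 3:
        k = min(range(len(clusters) - 1),
                key=lambda i: dif(clusters[i], clusters[i + 1]))
        ck, cn = clusters[k], clusters.pop(k + 1)
        ck['terms'] |= cn['terms']
        ck['min_spe'] = min(ck['min_spe'], cn['min_spe'])
        ck['max_spe'] = max(ck['max_spe'], cn['max_spe'])

    # Steps 22-24: first cluster is T+, second is G, third joins T-.
    t_pos = clusters[0]['terms'] if clusters else set()
    g = clusters[1]['terms'] if len(clusters) > 1 else set()
    if len(clusters) > 2:
        t_neg |= clusters[2]['terms']
    return t_pos, g, t_neg
```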
Algorithm WFeature is applied to calculate term weights after terms are classified using Algorithm FClustering. It first calculates the sup function and the spe function (Steps 1 to 8). For each term t, it takes O(n × |p|) to calculate the d_sup function if an inverted index is utilized, where |p| is the average size of a pattern and |p| ≤ |d|. For each term t, it also takes O(n × |d|) to calculate the spe function. Therefore, the time complexity of this part is O((|T| × n × |p|) + (|T| × n × |d|)) = O(|T| × |d| × n). The algorithm then uses Algorithm FClustering (Step 9) to classify the terms into the three categories T+, G and T−. Finally, it calculates the weights of terms using the w function defined in Section 4.2.
WFeature()
Input: An updated training set {D+, D−};
       extracted features <T, DP+, DP−>; and
       the initial term weight function w.
Output: A term weight function.
Method:
1: let n = |D+|;
2: T1 = {t | t ∈ p, p ∈ DP+};
3: foreach t ∈ T do
4:   if t ∈ T1
5:   then sup(t) = d_sup(t, D+);
6:   else sup(t) = −d_sup(t, D−);
7: foreach t ∈ T do
8:   spe(t) = (|{d | d ∈ D+, t ∈ d}| − |{d | d ∈ D−, t ∈ d}|) / n;
9: let (T+, G, T−) = FClustering(T, DP+, DP−, spe());
10: foreach t ∈ T+ do
11:   w(t) = sup(t) × (1 + spe(t));
12: foreach t ∈ T− do
13:   w(t) = sup(t) − |sup(t) × spe(t)|;
Based on the above analysis, the time complexity of Algorithm WFeature is O(|T| × |d| × n + |T|^2), where |d| is the average size of the documents and n is the number of relevant documents in the training set. In our experiments, the size of the set of selected terms is less than 300, i.e., |T| ≤ |d|; so Algorithm WFeature is efficient.
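The weighting steps of Algorithm WFeature can be sketched as below, assuming d_sup is the deployed-support function of Section 4 (supplied by the caller), documents are represented as sets of terms, and clusterer is an FClustering-style function; all names are our own.

```python
def w_feature(terms, pos_terms, d_pos, d_neg, d_sup, clusterer, w):
    n = len(d_pos)
    # Steps 3-6: positive-pattern terms get positive support, others negative.
    sup = {t: d_sup(t, d_pos) if t in pos_terms else -d_sup(t, d_neg)
           for t in terms}
    # Steps 7-8: spe(t) = (df+(t) - df-(t)) / n.
    spe = {t: (sum(1 for d in d_pos if t in d)
               - sum(1 for d in d_neg if t in d)) / n
           for t in terms}
    t_pos, g, t_neg = clusterer(terms, pos_terms, spe)   # Step 9
    for t in t_pos:                                       # Steps 10-11
        w[t] = sup[t] * (1 + spe[t])
    for t in t_neg:                                       # Steps 12-13
        w[t] = sup[t] - abs(sup[t] * spe[t])
    return w
```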
6 EVALUATION
This section describes the testing environment and reports the experimental results and discussion. It also provides recommendations on offender selection and on the use of specific terms and general terms for describing user information needs. The proposed model is a supervised approach that needs a training set including both relevant documents and irrelevant documents.
6.1 Data
We used two popular data sets to test the proposed model: Reuters Corpus Volume 1 (RCV1), a very large data collection, and Reuters-21578, a small one. RCV1 includes 806,791 documents that cover a broad spectrum of issues and topics. TREC (2002) developed and provided 50 reliable assessor topics [44] for RCV1, aiming to test robust information filtering systems. These topics were evaluated by human assessors at the National Institute of Standards and Technology (NIST) [52]. For each topic, a subset of RCV1 documents is divided into a training set and a testing set. RCV1 is a standard data collection, and the TREC 50 topics are stable and sufficient for high-quality experiments [55].
The Reuters-21578 corpus is a widely used collection for text mining. The data was originally collected and labelled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.1 In this experiment, we selected the set of 10 classes. Following Sebastiani's convention [11], this set is also called "R8" because two of the classes, corn and wheat, are intimately related to the class grain and were appended to it.
1. Reuters-21578, http://www.daviddlewis.com/resources/
LI ET AL.: RELEVANCE FEATURE DISCOVERY FOR TEXT MINING 1661
Documents in both RCV1 and Reuters-21578 are described in XML. To avoid bias in the experiments, all meta-data information was ignored. All documents were treated as plain text in a preprocessing step that included removing stop-words according to a given stop-word list and stemming terms with the Porter stemming algorithm.
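A minimal sketch of such a preprocessing step: lowercase, tokenize, drop stop-words, then apply a stemmer. In practice the stemmer would be the Porter algorithm (e.g. nltk.stem.PorterStemmer); here the default is a no-op, and the stop-word list is an illustrative subset, not the one used in the paper.

```python
import re

STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'to', 'in', 'is'}  # illustrative subset

def preprocess(text, stem=lambda w: w):
    """Tokenize, remove stop-words, and stem each remaining term."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```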
We also used the Library of Congress Subject Headings2 to understand the definition of the spe function with respect to a domain ontology. LCSH is a very large taxonomic knowledge classification, developed by librarians for organizing large library collections and for retrieving information from them [55]. LCSH covers 394,070 concepts or subjects.
6.2 Baseline Models and Setting
We grouped the baseline models into two categories [32]. The first category included the up-to-date pattern-based methods (frequent patterns, frequent closed patterns, sequential patterns, and sequential closed patterns), language models (n-grams) and a concept-based model. The second category included well-known term feature selection models (also called term-based models): Rocchio, BM25, SVM, mutual information, chi-square and Lasso regression.
We divided our approach into two stages. In the first stage, we used only positive patterns in the training sets. The model, called PTM, discovers sequential closed patterns from relevant documents, deploys the discovered patterns on their terms using Equation (2) and ranks documents using Equation (3). In the second stage, we used both positive and negative patterns, as described in Sections 4 and 5.
We set min_sup = 0.2 (as suggested by [59]) for all models that use patterns.
Different from sequential patterns, n-grams extract sequential patterns with a specified number of words and with no gaps between the words [42]. n-grams are usually selected based on the sliding-window technique, and the probability of an n-gram w1 w2 ... wn is calculated using the following equation:

P(w1 w2 ... wn) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn−1).
In the experiments, we used three language models [37]: Unigram, Bigram and Trigram. Unigram uses 1-grams only, Bigram uses both 2-grams and 1-grams, and Trigram uses 3-grams, 2-grams and 1-grams. The probability of an n-gram is calculated in a training set D as follows:

P(n-gram) = tf(n-gram, D+) / tf(n-gram, D),   (6)

where tf(n-gram, D+) is the number of appearances of the n-gram in D+, tf(n-gram, D) is the number of appearances of the n-gram in D, and n = 1, 2 or 3.
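Eq. (6) can be sketched as follows, assuming documents are lists of tokens, an n-gram is a tuple of tokens, and tf counts sliding-window occurrences; function names are ours.

```python
def tf(ngram, docs):
    """Count occurrences of the n-gram over sliding windows of each document."""
    n = len(ngram)
    return sum(1 for doc in docs
               for i in range(len(doc) - n + 1)
               if tuple(doc[i:i + n]) == ngram)

def p_ngram(ngram, d_pos, d_neg):
    """P(n-gram) = tf(n-gram, D+) / tf(n-gram, D), with D = D+ ∪ D-."""
    total = tf(ngram, d_pos) + tf(ngram, d_neg)
    return tf(ngram, d_pos) / total if total else 0.0
```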
The concept-based model (CBM) was presented in [50], [51]. CBM was also used as a baseline model in [70] for information filtering. The Rocchio model and BM25 are well-known models for representing relevant information. We used the recommended experimental parameters (suggested by [59], [60], [70]) in our experiments (please note that the term frequency is the total number of term appearances in all relevant documents).
The linear SVM has proven very effective for text categorization and filtering [47]. Most SVMs are designed for making binary decisions rather than ranking documents. In this paper, we use SVM-Light3 for ranking documents. The optimization algorithms used in SVM-Light are described in [23].
Mutual information (MI) and chi-square (χ2) are popular methods for feature selection [37]. More details about MI and χ2 can be found in Chapter 13 of the book [37].
Lasso (least absolute shrinkage and selection operator) is another method for feature selection [57], and several extensions have appeared in recent years [56], [63]. The Lasso [57] estimate is defined by

(â, β̂) = argmin Σ_{i=1}^{n} ( y_i − a − Σ_{j=1}^{p} β_j x_ij )^2  subject to  d_i^T β ≤ t,

where d_i^T (i = 1, 2, ..., 2^p) are the p-tuples of the form (±1, ±1, ..., ±1). In our implementation, y_j = 1 if d_j ∈ D+; otherwise y_j = −|D+| / |D−| in order to make â = ȳ = 0; and d_i^T = sign(β). We also let x_ij = 1 if term t_i occurs in document d_j, and x_ij = 0 otherwise, for information filtering (this assumption is the same as for the other models). We use tf*idf weights to find β = {β_j}, and let β_j = w(t_j) − w̄ + Δ. The initial β^0 is assigned when Δ = 0 in order to make β̄^0 = 0, where w(t_j) is the tf*idf weight and Δ is a parameter used to test the positive and negative directions [16].
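The data construction described above can be sketched as follows, assuming documents are sets of terms; it builds the binary features x_ij and labels y_j chosen so that the mean label is zero. The function name is ours, and the actual Lasso fit (by any standard solver, e.g. scikit-learn's Lasso) is outside this sketch.

```python
def lasso_data(terms, d_pos, d_neg):
    """Build (X, y): x_ij = 1 iff term t_i is in document d_j;
    y_j = 1 for relevant documents, -|D+|/|D-| otherwise (so mean(y) = 0)."""
    docs = d_pos + d_neg
    neg_label = -len(d_pos) / len(d_neg)
    y = [1.0] * len(d_pos) + [neg_label] * len(d_neg)
    X = [[1.0 if t in d else 0.0 for t in terms] for d in docs]
    return X, y
```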
6.3 Evaluation Metrics
The effectiveness of a model is usually measured by the following means [32], [59]: the average precision of the top-20 documents, the F1 measure, mean average precision (MAP), the break-even point (b/p), and interpolated average precision (IAP) at 11 points. These are widely accepted and well-established evaluation metrics. Each metric focuses on a different aspect of model performance, as described below.
The F-beta (Fβ) measure is a function of both recall (R) and precision (P), together with a parameter β. We used β = 1 in this paper, which means that precision and recall are weighed equally. Therefore, Fβ becomes: F1 = 2PR / (P + R).
MAP measures the precision at each relevant document first, and then obtains the average precision over all topics. It combines precision, relevance ranking and overall recall to measure the performance of the models.
B/P is the value of the recall (or precision) at which the P/R curve intersects the precision = recall line. The larger the value, the better the model performs.
11-points has also been adopted in several research works [66]. It measures the performance of different models by averaging the precisions at 11 standard recall levels (recall = 0.0, 0.1, ..., 1.0, where "0.0" means cut-off = 1 in this paper). We also used a statistical method, the paired two-tailed t-test, to analyze the experimental results.
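Two of these metrics can be sketched as follows: Fβ=1 and the per-topic average precision that MAP averages over topics. Here rels is a list of 0/1 relevance judgements in ranked order; the function names are ours.

```python
def f1(precision, recall):
    """F1 = 2PR / (P + R), with the degenerate case mapped to 0."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def average_precision(rels):
    """Mean of the precisions measured at each relevant document."""
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i   # precision at rank i
    return total / hits if hits else 0.0
```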
2. LCSH Web page, http://classificationweb.net/
3. SVM-Light URL: http://svmlight.joachims.org/
6.4 Hypotheses
The proposed model is called the relevance feature discov-
ery model, and consists of three major steps: feature discov-
ery and deploying, term classification and term weighting.
It first finds positive and negative patterns and terms in the training set. It then classifies terms into three categories using either parameters (θ1 and θ2) or Algorithm FClustering. Finally, it works out the term weights using Algorithm WFeature.
In our experiments, we developed two versions of the RFD model. Both versions use negative feedback to improve the quality of the features extracted from positive feedback. The features extracted from both positive and negative feedback are classified into three categories, namely T+, G and T−. The first version, called RFD1, uses two empirical parameters (θ1 and θ2, see [32]) to group the low-level terms into three groups. This model can achieve satisfactory performance, but the two parameters have to be decided manually according to their actual performance on the testing sets. The second version, called RFD2, uses the proposed FClustering algorithm to automatically determine the three categories T+, G and T− based on the training sets.
To conduct a comprehensive investigation of the pro-
posed model and the ways in which the term classification
could help to improve the performance, the proposed
model is discussed in terms of the following hypotheses:
• The RFD model classifies terms into three categories (positive specific terms, general terms and negative specific terms) by using the spe function.
Hypothesis H1. The spe function is reasonable for describing terms' specificity for most topics.
Hypothesis H2. The positive specific terms are the most interesting in relation to what users want, but general terms are necessary information for describing what users want. The use of the three categories together can generate the best performance.
• RFD1 is a state-of-the-art model for information filtering. It can achieve satisfactory performance for a given testing set. However, it is a parameterized method, and its two empirical parameters are sensitive to changes in the testing sets.
Hypothesis H3. RFD2 overcomes the limitation of RFD1 by using a clustering method to classify the terms into three categories directly. It can achieve a similar performance to RFD1. The RFD2 model also shows remarkable performance compared with the state-of-the-art models.
6.5 Results
In this section, we first compare RFD2 and RFD1, and expect the performance of RFD2 to approximate that of RFD1. We then compare the RFD2 model with language models (n-grams) and other pattern-based models, especially PTM, which is the best of the existing pattern-based models. In addition, RFD2 is compared with the state-of-the-art term-based methods underpinned by Rocchio, BM25, SVM, MI, χ2 and Lasso for each of the measures top-20, B/P, MAP, IAP and Fβ=1 on both datasets.
6.5.1 Understanding Specificity on the LCSH Ontology
For each topic, let RFD-SPE be the set of positive specific terms determined by RFD2, and RFD-G be the set of general terms determined by RFD2. Fig. 1 shows the average spe_onto values of terms in both RFD-SPE and RFD-G for all 50 topics on RCV1. It is obvious that most topics (90 percent) obtain larger spe_onto values for the RFD2 positive specific terms. This result supports Hypothesis H1.
6.5.2 RFD2 vs RFD1
RFD1 uses both θ1 and θ2 to group the low-level terms into three categories. To achieve satisfactory performance, we conducted cross validation for the two parameters on the testing sets, and finally set θ1 = 0.2 and θ2 = 0.3 for RFD1 manually.
RFD2 uses Algorithm FClustering to automatically group terms into the three categories T+, G and T− for each topic. Table 1 shows the average results of the five measures over all 50 assessing topics, where %chg denotes the percentage change of RFD2 over RFD1.
As shown in Table 1, RFD2 produces essentially the same performance as RFD1. In addition, a small improvement on four measures (top-20, B/P, IAP and Fβ=1) was observed. These results support Hypothesis H3.
6.5.3 RFD2 vs Pattern-Based Models and n-Grams
The results on data collection RCV1 for all models in the first category (RFD2, language models (n-grams), CBM and other pattern-based models) are presented in Table 2, where %chg denotes the percentage change of RFD2 over PTM. As noted earlier, pattern-based methods struggle on some topics because too much noise is generated in the discovery of positive patterns. The most important findings revealed in this table are that closed sequential patterns (Closed Seq Ptns) perform better than other patterns, and that the PTM deploying method largely outperforms closed sequential patterns. The results also support the superiority of using closed sequential patterns in text mining and highlight the importance of adopting proper pattern deploying methods on terms for using discovered patterns in text documents.
In terms of n-grams, the Trigram model outperforms the Bigram and Unigram models. The performance of the Trigram model is very good, with results similar to PTM.
In order to see the effectiveness of using both positive and negative patterns for relevance feature discovery, we also compare RFD2 with the best pattern-based model, PTM, which uses positive patterns only, on Reuters-21578 (see Table 3).
Both tables show that RFD2 achieves excellent performance, with a 10.35 percent average percentage change for RCV1 (with a maximum of 13.10 percent and a minimum of 6.99 percent) and a 9.71 percent average percentage change for Reuters-21578 (with a maximum of 11.52 percent and a minimum of 6.73 percent).

Fig. 1. spe_onto(t) for all t ∈ RFD-SPE vs spe_onto(t) for all t ∈ RFD-G.
6.5.4 RFD2 vs Term Feature Selection Models
The proposed RFD2 method was also compared with popular feature selection models, including Rocchio, BM25, SVM, MI, χ2 and Lasso. The experimental results on RCV1 for all 50 assessing topics are reported in Table 4. In the table, RFD2 is compared with Lasso (which is the second best term-based model on RCV1) and the percentage change is calculated.
As shown in Table 4, the proposed new model RFD2 achieved the best performance for the assessor topics. The average percentage of improvement over the standard measures is 7.90 percent, with a maximum of 10.87 percent and a minimum of 5.62 percent.
The experimental results on Reuters-21578 (R8) are reported in Table 5, where RFD2 is compared with SVM (which is the second best term-based model on Reuters-21578) and the percentage change is calculated. As shown in Table 5, RFD2 also achieved the best performance. Compared to SVM, RFD2 has the same top-20 precision, and it is better than SVM on the other four measures. The maximum percentage of improvement, on the Fβ=1 measure, is 7.72 percent.
Finally, statistical significance tests are presented in Table 6, comparing the proposed model with the other high-performance models on all data collections. The results show that the improvements of the proposed model are significant, as all p-values are less than 0.05.
6.5.5 Robustness
In this paper, robustness describes a model's capacity to perform effectively while its training sets are altered or its application environment is changed; we call a model robust if it still provides satisfactory performance under such changes. For this evaluation, we only use RCV1, because Reuters-21578's testing set would become too small if we increased the training sets.
For altered training sets, we ran six loops for each topic, and each loop used a sliding window to increase the training set, where each sliding window included 25 documents that were randomly selected from the testing set. The 25 documents were then removed from the corresponding testing set.
TABLE 1
Comparison Results of RFD1 and RFD2 Models
in All Assessing Topics on RCV1
Model top-20 b/p MAP Fβ=1 IAP
RFD1 0.5570 0.4724 0.4932 0.4696 0.5125
RFD2 0.5610 0.4729 0.4930 0.4699 0.5136
%chg 0.71% 0.11% -0.04% 0.06% 0.21%
TABLE 2
Comparison of All Pattern (Phrase) Based Methods on RCV1
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.561 0.473 0.493 0.470 0.513
PTM 0.496 0.430 0.444 0.439 0.464
Seq Patterns 0.401 0.343 0.361 0.385 0.384
Closed Seq Ptns 0.406 0.353 0.364 0.390 0.392
Freq Patterns 0.412 0.352 0.361 0.386 0.384
Freq Closed Ptns 0.428 0.346 0.361 0.385 0.387
Unigram 0.417 0.386 0.388 0.404 0.411
Bigram 0.477 0.420 0.435 0.436 0.458
Trigram 0.499 0.420 0.439 0.438 0.460
CBM 0.448 0.409 0.415 0.423 0.440
%chg +13.10 +9.87 +11.14 +6.99 +10.66
TABLE 3
Comparison of the Proposed Model with the Best Pattern
Based Model PTM on Reuters-21578(R8)
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.794 0.704 0.747 0.601 0.748
PTM 0.731 0.633 0.661 0.564 0.664
%chg +6.73 +10.24 +10.66 +9.40 +11.52
TABLE 4
Comparison Results of All Models on RCV1
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.561 0.473 0.493 0.470 0.513
Rocchio 0.501 0.424 0.440 0.433 0.459
BM25 0.445 0.407 0.407 0.414 0.428
SVM 0.453 0.408 0.409 0.421 0.435
MI 0.316 0.311 0.312 0.347 0.337
χ2 0.322 0.326 0.319 0.355 0.345
Lasso 0.506 0.434 0.460 0.445 0.480
%chg +10.87% +8.99% +7.17% +5.62% +6.88%
TABLE 5
Comparison of All Models on Reuters-21578(R8)
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.794 0.699 0.745 0.600 0.746
Rocchio 0.706 0.594 0.633 0.527 0.632
BM25 0.675 0.556 0.582 0.508 0.590
SVM 0.794 0.693 0.729 0.557 0.709
MI 0.275 0.261 0.219 0.269 0.251
χ2 0.263 0.245 0.211 0.260 0.242
Lasso 0.719 0.627 0.657 0.536 0.651
%chg 0.0% +0.87% +2.19% +7.72% +5.22%
TABLE 6
p-Values for RFD2 vs Other High Performance Models
Model top-20 b/p MAP Fβ=1 IAP
Lasso 0.01100 0.01976 0.03850 0.02509 0.03040
PTM 0.00101 0.00070 0.00003 0.00002 0.00001
SVM 0.00030 0.00688 0.00160 0.00058 0.00113
Rocchio 0.00463 0.00436 0.00570 0.00496 0.00405
The Rocchio model is a popular robust model for filtering. Table 7 shows the results of the Rocchio model under these settings.
Tables 8 and 9 show the experimental results for RFD1 and RFD2, respectively. It is clear that both RFD models achieved better results when using more training documents. The performance of the RFD models is satisfactory. The comparison between RFD1 and RFD2 is shown in Table 10. The difference between RFD1 and RFD2 is not significant, as the p-values are clearly greater than 0.05.
For the altered application environment, we use the proposed model for text classification. RFD models can easily be used for ranking documents by applying the rank function defined in Eq. (3) to the term weights w(t) defined in Section 4.2. To apply Eq. (3) to binary text classification, we require a threshold τ to determine relevance (rank(d) ≥ τ) and non-relevance (rank(d) < τ). We call this kind of classifier RFDτ. Let τ+ = min{rank(d) | d ∈ D+}, τ− = max{rank(d) | d ∈ D−}, and τ = min{τ+, τ−}. To avoid bias, we use balanced testing sets for each topic by randomly selecting five equivalent negative subsets to match the positive set. We also use two other well-known classifiers, SVM and SMO (Sequential Minimal Optimization for training SVM), with their LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), and selected the best result of each classifier for this comparison. Table 11 shows the results, where Acc_m and Acc_M are the micro accuracy and macro accuracy, respectively.
These experiments show that the performance of the proposed model is satisfactory for both altered training sets and a changed application environment. These results also support Hypothesis H3.
6.6 Discussion
The proposed model has three major steps: feature discovery and deploying, term classification, and term weighting. Offender selection plays an important role in using negative feedback in the process of feature discovery and deploying. In this section, we first discuss the issue of offender selection. We then discuss other issues for the proposed model, such as term classification and specificity.
6.6.1 Offender Selection
We believe that positive feedback is more constructive than negative feedback, since the objective of relevance feature discovery is to find relevant knowledge. However, we also believe that negative feedback contains some useful information that can help to identify the boundary between relevant and irrelevant information, improving the effectiveness of relevance feature discovery. The obvious problem with using irrelevant documents is that most of them are not close to the given topic, because of the very large amount of negative information. Therefore, it is necessary to choose some useful irrelevant documents (offenders) to decide the groups of terms for the three categories [32].
TABLE 7
Results of Rocchio Model on Six Sliding Windows
for All Assessor Topics
Model top-20 b/p MAP Fβ=1 IAP
Rocchio-1 0.525 0.444474 0.458249 0.448621 0.476696
Rocchio-2 0.495 0.444119 0.454437 0.448007 0.474435
Rocchio-3 0.505 0.455495 0.463649 0.449008 0.485906
Rocchio-4 0.497 0.450539 0.460866 0.448778 0.483619
Rocchio-5 0.497 0.441519 0.449421 0.441622 0.472068
Rocchio-6 0.479 0.428432 0.443400 0.439774 0.466213
AVG 0.500 0.444096 0.455004 0.445968 0.476490
Rocchio 0.501 0.4240 0.4400 0.4333 0.4590
TABLE 8
Results of RFD1 Model on Six Sliding Windows
for All Assessor Topics
Model top-20 b/p MAP Fβ=1 IAP
RFD1-1 0.585 0.495 0.513 0.483 0.532
RFD1-2 0.565 0.491 0.512 0.485 0.529
RFD1-3 0.581 0.486 0.507 0.479 0.528
RFD1-4 0.575 0.499 0.518 0.484 0.540
RFD1-5 0.558 0.476 0.497 0.470 0.518
RFD1-6 0.547 0.475 0.498 0.473 0.519
AVG 0.569 0.487 0.508 0.479 0.528
RFD1 0.557 0.4724 0.493 0.4696 0.5125
TABLE 9
Results of RFD2 Model on Six Sliding Windows
for All Assessor Topics
Model top-20 b/p MAP Fβ=1 IAP
RFD2-1 0.582 0.497 0.513 0.484 0.533
RFD2-2 0.563 0.496 0.513 0.486 0.530
RFD2-3 0.577 0.483 0.504 0.478 0.525
RFD2-4 0.569 0.493 0.516 0.483 0.537
RFD2-5 0.555 0.476 0.494 0.468 0.514
RFD2-6 0.556 0.478 0.499 0.473 0.520
AVG 0.567 0.487 0.507 0.479 0.527
RFD2 0.561 0.473 0.493 0.470 0.514
TABLE 10
p-Values for Comparing the RFD1 and RFD2 Models
Loop top-20 b/p MAP Fβ=1 IAP
Loop-1 0.472 0.465 0.766 0.526 0.509
Loop-2 0.569 0.113 0.622 0.404 0.695
Loop-3 0.522 0.349 0.384 0.546 0.369
Loop-4 0.224 0.096 0.243 0.426 0.283
Loop-5 0.411 0.993 0.137 0.182 0.061
Loop-6 0.083 0.196 0.628 0.911 0.678
AVG 0.380 0.369 0.463 0.499 0.433
TABLE 11
Results of RFD Based Classifier with Threshold t
Model Macro-Average Acc_M Micro-Average Acc_m
RFDt 0.682 0.701
SVM linear 0.611 0.656
SMO polynomial 0.616 0.661
%chg +11.62% +6.05%
Table 12 shows the performance for different K values, where K = |D+|/2 obtained the best performance. Table 12a shows the average numbers of relevant documents, irrelevant documents, offenders and extracted terms in the training sets (where |D−| > |D+| > |D+|/2); Table 12b shows the performance and the average term weights for the three categories. The results illustrate that the larger the value of K, the lower the performance and the term weights of T+ and G. Another advantage of offender selection is that it reduces the space of negative relevance feedback. Table 12a clearly shows that only 15.8% = 6.54/41.3 of the irrelevant documents are selected as offenders for the best performance.
In summary, the experimental results support the strat-
egy of offender selection used in the proposed model. We
therefore conclude that the proposed method for offender
selection in RFD meets the design objectives.
6.6.2 Term Classification and Specificity
Terms can be grouped based on the spe function together with either the classification rules or the feature clustering method. RFD1 uses two thresholds to decide the categories of terms. It obtained satisfactory performance; however, it requires prior knowledge and more effort to set the right values for the parameters. RFD2 uses the feature clustering technique to group terms into three categories adaptively for each topic. In this section, we mainly discuss the results of using RFD2.
Table 13 shows the statistical information for both RFD2 and PTM. The average number of terms that PTM extracted was 156.9, and all of those terms were used as a single group. RFD2 groups terms into three categories, and the number of terms in the positive specific category and the general category together is reduced to 46.64 = 24.24 + 22.4; that is, only 29.73 percent were retained in RFD, and about 70.27% = 100% − 29.73% of the extracted PTM terms are possibly noisy terms. The percentage of general terms is 48.03% = 22.4 / (22.4 + 24.24) (see Table 13a). General terms frequently appear not only in relevant documents, but also in some irrelevant documents. To further reduce the side effects of using general terms, RFD2 adds some negative specific terms (T−).
We believe that positive specific terms (with large specificity values) are more interesting than general terms (with small specificity values) for a given topic. As shown in Table 13a, PTM assigned 66.92% = 2.5952 / (1.28273 + 2.5952) of the weight to positive specific terms, and 33.08 percent to general terms. RFD2 increased the weights of positive specific terms: it assigned 25.55% = 1.28273 / (1.28273 + 3.73828) to general terms and 74.45 percent to positive specific terms (see Table 13a).
Fig. 2 shows that the use of only positive specific terms (T+) achieves much better results than the use of only general terms (G). It is also recommended to use both positive specific terms and general terms (T+ ∪ G), which can significantly improve effectiveness. This recommendation is also supported by the SAGE model [12], where a topic model explicitly considers the background signal (like the neutral (G) cluster).
In summary, the use of negative feedback is significant
for RFD models. It can balance the percentages of positive
TABLE 12
Statistical Information for RFD2 with Different Values of K

K        Average number of training documents     Average number of extracted terms
         Relevant  Irrelevant  Offenders          T+     G      T−
|D+|/2   12.78     41.3        6.54               24.24  22.4   231.04
|D+|     12.78     41.3        10.08              28.94  24.68  267.38
|D−|     12.78     41.3        38.92              31.78  8.46   521.64
(a)

K        Average weight of extracted terms        top-20  MAP    Fβ=1
         w(T+)   w(G)    w(T−)
|D+|/2   3.7383  1.2827  -0.3328                  0.561   0.493  0.470
|D+|     3.3044  1.2227  -3.1947                  0.542   0.463  0.451
|D−|     2.6307  0.4602  -69.9437                 0.274   0.278  0.295
(b)
TABLE 13
Statistical Information for Both RFD2 and PTM

Average number of extracted terms used in RFD    Average weight(t) in PTM
T+     G     T−                                  w(T+)   w(G)     w(T−)
24.24  22.4  231.04                              2.5952  1.28273  0.68486
(a)

Average weight(t) in RFD                         Terms extracted from D+ used in PTM
w(T+)    w(G)     w(T−)                          T      w(T)
3.73828  1.28273  -0.33275                       156.9  1.45210
(b)
specific terms and general terms, largely reducing noise. The experimental results demonstrate that we can roughly choose the same number of positive specific terms and general terms, and assign larger weights to the positive specific terms. These results support Hypothesis H2.
7 CONCLUSION
This research proposes an alternative approach for relevance feature discovery in text documents. It presents a method to find and classify low-level features based on both their appearances in higher-level patterns and their specificity. It also introduces a method to select irrelevant documents for weighting features. In this paper, we continued to develop the RFD model and experimentally showed that the proposed specificity function is reasonable and that the term classification can be effectively approximated by a feature clustering method.
The first RFD model uses two empirical parameters to set the boundaries between the categories. It achieves the expected performance, but it requires manual testing of a large number of different parameter values. The new model uses a feature clustering technique to automatically group terms into the three categories. Compared with the first model, the new model is much more efficient and achieves satisfactory performance as well.
This paper also includes a set of experiments on RCV1 (TREC topics), Reuters-21578 and the LCSH ontology. These experiments illustrate that the proposed model achieves the best performance in comparison with both term-based and pattern-based baseline models. The results also show that the term classification can be effectively approximated by the proposed feature clustering method, that the proposed spe function is reasonable, and that the proposed models are robust.
The proposed model was thoroughly tested, and the results show that its improvements are statistically significant. They also show that irrelevance feedback significantly improves the performance of relevance feature discovery models. This work provides a promising methodology for developing effective text mining models for relevance feature discovery based on both positive and negative feedback.
ACKNOWLEDGMENTS
This paper was partially supported by Grant DP140103157
from the Australian Research Council (ARC Discovery
Project). Y. Li is the corresponding author.
REFERENCES
[1] M. Aghdam, N. Ghasem-Aghaee, and M. Basiri, “Text feature selection using ant colony optimization,” Expert Syst. Appl., vol. 36, pp. 6843–6853, 2009.
[2] A. Algarni and Y. Li, “Mining specific features for acquiring user
information needs,” in Proc. Pacific Asia Knowl. Discovery Data
Mining, 2013, pp. 532–543.
[3] A. Algarni, Y. Li, and Y. Xu, “Selected new training documents to
update user profile,” in Proc. Int. Conf. Inf. Knowl. Manage., 2010,
pp. 799–808.
[4] N. Azam and J. Yao, “Comparison of term frequency and doc-
ument frequency based feature selection metrics in text cate-
gorization,” Expert Syst. Appl., vol. 39, no. 5, pp. 4760–4768,
2012.
[5] R. Bekkerman and M. Gavish, “High-precision phrase-based doc-
ument classification on a modern scale,” in Proc. 11th ACM
SIGKDD Knowl. Discovery Data Mining, 2011, pp. 231–239.
[6] A. Blum and P. Langley, “Selection of relevant features and exam-
ples in machine learning,” Artif. Intell., vol. 97, nos. 1/2, pp. 245–
271, 1997.
[7] C. Buckley, G. Salton, and J. Allan, “The effect of adding relevance
information in a relevance feedback environment,” in Proc. Annu.
Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 1994, pp. 292–300.
[8] G. Cao, J.-Y. Nie, J. Gao, and S. Robertson, “Selecting good expan-
sion terms for pseudo-relevance feedback,” in Proc. Annu. Int.
ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 243–250.
[9] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Comput. Electr. Eng., vol. 40, pp. 16–28, 2014.
[10] B. Croft, D. Metzler, and T. Strohman, Search Engines: Information
Retrieval in Practice. Reading, MA, USA: Addison-Wesley, 2009.
[11] F. Debole and F. Sebastiani, “An analysis of the relative hardness
of Reuters-21578 subsets,” J. Amer. Soc. Inf. Sci. Technol., vol. 56,
no. 6, pp. 584–596, 2005.
[12] J. Eisenstein, A. Ahmed, and E. P. Xing, “Sparse additive genera-
tive models of text,” in Proc. Annu. Int. Conf. Mach. Learn., 2011,
pp. 274–281.
[13] G. Forman, “An extensive empirical study of feature selection metrics for text classification,” J. Mach. Learn. Res., vol. 3, pp. 1289–1305, 2003.
[14] Y. Gao, Y. Xu, and Y. Li, “Topical pattern based document model-
ling and relevance ranking,” in Proc. 15th Int. Conf. Web Inf. Syst.
Eng., 2014, pp. 186–201.
[15] X. Geng, T.-Y. Liu, T. Qin, A. Arnold, H. Li, and H.-Y. Shum,
“Query dependent ranking using k-nearest neighbor,” in Proc.
Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008,
pp. 115–122.
[16] A. Genkin, D. D. Lewis, and D. Madigan, “Large-scale Bayesian
logistic regression for text categorization,” Technometrics, vol. 49,
no. 3, pp. 291–304, 2007.
[17] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, no. 1, pp. 1157–1182, 2003.
[18] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without can-
didate generation,” in Proc. ACM SIGMOD Int. Conf. Manage.
Data, 2000, pp. 1–12.
[19] Y.-F. Huang and S.-Y. Lin, “Mining sequential patterns using
graph search techniques,” in Proc. Annu. Int. Conf. Comput. Softw.
Appl., 2003, pp. 4–9.
[20] G. Ifrim, G. Bakir, and G. Weikum, “Fast logistic regression for
text categorization with variable-length n-grams,” in Proc. ACM
SIGKDD Knowl. Discovery Data Mining, 2008, pp. 354–362.
[21] N. Jindal and B. Liu, “Identifying comparative sentences in text
documents,” in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf.
Retrieval, 2006, pp. 244–251.
[22] T. Joachims, “Transductive inference for text classification using
support vector machines,” in Proc. Annu. Int. Conf. Mach. Learn.,
1999, pp. 200–209.
[23] T. Joachims, “Optimizing search engines using clickthrough
data,” in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2002,
pp. 133–142.
Fig. 2. Comparison of different combinations of term categories for RFD2.
[24] K. Sparck Jones, S. Walker, and S. E. Robertson, “A probabilistic model of information retrieval: Development and comparative experiments,” Inf. Process. Manage., vol. 36, no. 6, pp. 779–808, 2000.
[25] R. Lau, P. Bruza, and D. Song, “Towards a belief-revision-based
adaptive and context-sensitive information retrieval system,”
ACM Trans. Inf. Syst., vol. 26, no. 2, pp. 1–38, 2008.
[26] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A new benchmark collection for text categorization research,” J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004.
[27] X. Li and B. Liu, “Learning to classify texts using positive and
unlabeled data,” in Proc. 18th Int. Joint Conf. Artif. Intell., 2003,
pp. 587–592.
[28] X.-L. Li, B. Liu, and S.-K. Ng, “Learning to classify documents
with only a small positive training set,” in Proc. 18th Eur. Conf.
Mach. Learn., 2007, pp. 201–213.
[29] Y. Li, D. F. Hsu, and S. M. Chung, “Combination of multiple feature selection methods for text categorization by using combinational fusion analysis and rank-score characteristic,” Int. J. Artif. Intell. Tools, vol. 22, no. 2, p. 1350001, 2013.
[30] Y. Li, A. Algarni, S.-T. Wu, and Y. Xue, “Mining negative rele-
vance feedback for information filtering,” in Proc. Web Intell. Intell.
Agent Technol., 2009, pp. 606–613.
[31] Y. Li, A. Algarni, and Y. Xu, “A pattern mining approach for information filtering systems,” Inf. Retrieval, vol. 14, pp. 237–256, 2011.
[32] Y. Li, A. Algarni, and N. Zhong, “Mining positive and negative
patterns for relevance feature discovery,” in Proc. ACM SIGKDD
Knowl. Discovery Data Mining, 2010, pp. 753–762.
[33] Y. Li and N. Zhong, “Mining ontology for automatically acquiring web user information needs,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 4, pp. 554–568, Apr. 2006.
[34] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Y. Lau, “A two-stage text
mining model for information filtering,” in Proc. 17th ACM Conf.
Inf. Knowl. Manage., 2008, pp. 1023–1032.
[35] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Y. Lau, “Two-stage decision
model for information filtering,” Decision Support Syst., vol. 52,
no. 3, pp. 706–716, 2012.
[36] X. Ling, Q. Mei, C. Zhai, and B. Schatz, “Mining multi-faceted
overviews of arbitrary topics in a text collection,” in Proc. 14th
ACM SIGKDD Knowl. Discovery Data Mining, 2008, pp. 497–505.
[37] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2009.
[38] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[39] D. Metzler and W. B. Croft, “Latent concept expansion using Mar-
kov random fields,” in Proc. Annu. Int. ACM SIGIR Conf. Res.
Develop. Inf. Retrieval, 2007, pp. 311–318.
[40] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and
M.-C. Hsu, “PrefixSpan: Mining sequential patterns efficiently by
prefix-projected pattern growth,” in Proc. Int. Conf. Data Eng.,
2001, pp. 215–224.
[41] R. K. Pon, A. F. Cardenas, D. Buttler, and T. Critchlow, “Tracking
multiple topics for finding interesting articles,” in Proc. ACM
SIGKDD Knowl. Discovery Data Mining, 2007, pp. 560–569.
[42] S. Quiniou, P. Cellier, T. Charnois, and D. Legallois, “What about
sequential data mining techniques to identify linguistic patterns
for stylistics?” in Computational Linguistics and Intelligent Text Proc-
essing. New York, NY, USA: Springer, 2012, pp. 166–177.
[43] S. Robertson, H. Zaragoza, and M. Taylor, “Simple BM25 extension to multiple weighted fields,” in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2004, pp. 42–49.
[44] S. E. Robertson and I. Soboroff, “The TREC 2002 filtering track
report,” in Proc. 11th Text Retrieval Conf., 2002.
[45] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[46] S. Scott and S. Matwin, “Feature engineering for text classi-
fication,” in Proc. Annu. Int. Conf. Mach. Learn., 1999, pp. 379–388.
[47] F. Sebastiani, “Machine learning in automated text catego-
rization,” ACM Comput. Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[48] M. Seno and G. Karypis, “SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint,” in Proc. 2nd IEEE Conf. Data Mining, 2002, pp. 418–425.
[49] R. Sharma and S. Raman, “Phrase-based text representation for
managing the web documents,” in Proc. Int. Conf. Inf. Technol.:
Coding Comput., 2003, pp. 165–169.
[50] S. Shehata, F. Karray, and M. Kamel, “Enhancing text clustering using concept-based mining model,” in Proc. 6th IEEE Conf. Data Mining, 2006, pp. 1043–1048.
[51] S. Shehata, F. Karray, and M. Kamel, “A concept-based model for
enhancing text categorization,” in Proc. ACM SIGKDD Knowl. Dis-
covery Data Mining, 2007, pp. 629–637.
[52] I. Soboroff and S. Robertson, “Building a filtering test collection
for TREC 2002,” in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop.
Inf. Retrieval, 2003, pp. 243–250.
[53] F. Song and W. B. Croft, “A general language model for informa-
tion retrieval,” in Proc. ACM Conf. Inf. Knowl. Manage., 1999,
pp. 316–321.
[54] Q. Song, J. Ni, and G. Wang, “A fast clustering-based feature subset selection algorithm for high-dimensional data,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 1, pp. 1–14, Jan. 2013.
[55] X. Tao, Y. Li, and N. Zhong, “A personalized ontology model for web information gathering,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 4, pp. 496–511, Apr. 2011.
[56] R. Tibshirani, “Regression shrinkage and selection via the Lasso: A retrospective,” J. Royal Stat. Soc. B, vol. 73, pp. 273–282, 2011.
[57] R. Tibshirani, “Regression shrinkage and selection via the Lasso,”
J. Royal Stat. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
[58] X. Wang, H. Fang, and C. Zhai, “A study of methods for negative
relevance feedback,” in Proc. Annu. Int. ACM SIGIR Conf. Res.
Develop. Inf. Retrieval, 2008, pp. 219–226.
[59] S.-T. Wu, Y. Li, and Y. Xu, “Deploying approaches for pattern
refinement in text mining,” in Proc. IEEE Conf. Data Mining, 2006,
pp. 1157–1161.
[60] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, “Automatic pattern-
taxonomy extraction for web mining,” in Proc. Int. Conf. Web
Intell., 2004, pp. 242–248.
[61] Z. Xu and R. Akella, “Active relevance feedback for difficult queries,” in Proc. ACM Conf. Inf. Knowl. Manage., 2008, pp. 459–468.
[62] G.-R. Xue, D. Xing, Q. Yang, and Y. Yu, “Deep classification in
large-scale text hierarchies,” in Proc. Annu. Int. ACM SIGIR Conf.
Res. Develop. Inf. Retrieval, 2008, pp. 619–626.
[63] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, and M. Sugiyama,
“High-dimensional feature selection by feature-wise kernelized
Lasso,” Neural Comput., vol. 26, no. 1, pp. 185–207, 2014.
[64] X. Yan, H. Cheng, J. Han, and D. Xin, “Summarizing itemset pat-
terns: A profile-based approach,” in Proc. ACM SIGKDD Knowl.
Discovery Data Mining, 2005, pp. 314–323.
[65] C. C. Yang, “Search engines: Information retrieval in practice,” J. Amer. Soc. Inf. Sci. Technol., vol. 61, pp. 430–430, 2010.
[66] Y. Yang, “An evaluation of statistical approaches to text categorization,” Inf. Retrieval, vol. 1, pp. 69–90, 1999.
[67] Y. Yang and J. O. Pedersen, “A comparative study on feature
selection in text categorization,” in Proc. Annu. Int. Conf. Mach.
Learn., 1997, pp. 412–420.
[68] M. J. Zaki, “SPADE: An efficient algorithm for mining frequent sequences,” Mach. Learn., vol. 42, pp. 31–60, 2001.
[69] Z. Zhao, L. Wang, H. Liu, and J. Ye, “On similarity preserving feature selection,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 3, pp. 619–632, Mar. 2013.
[70] N. Zhong, Y. Li, and S.-T. Wu, “Effective pattern discovery for text mining,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 1, pp. 30–44, Jan. 2012.
[71] S. Zhu, X. Ji, W. Xu, and Y. Gong, “Multi-labelled classification
using maximum entropy method,” in Proc. Annu. Int. ACM SIGIR
Conf. Res. Develop. Inf. Retrieval, 2005, pp. 1041–1048.
Yuefeng Li is a full professor in the School of
Electrical Engineering and Computer Science,
Queensland University of Technology, Australia.
He has published more than 150 refereed papers
(including 43 journal papers). He has demonstra-
ble experience in leading large-scale research
projects and has achieved many established
research outcomes that have been published
and highly cited in top data mining journals and
conferences (Highest citation per paper ¼ 188).
He is the managing editor of Web Intelligence
and Agent Systems and an associate editor of the International Journal
of Pattern Recognition and Artificial Intelligence.
Abdulmohsen Algarni received the PhD degree
from Queensland University of Technology, Aus-
tralia, in 2012. He was a research associate in the
School of Electrical Engineering and Computer
Science, Queensland University of Technology,
Australia, in 2012. He is currently an assistant pro-
fessor of the College of Computer Science, King
Khalid University. His research interest includes
text mining and information filtering.
Mubarak Albathan received the MSc degree in
network computing from Monash University,
Australia, in 2009. He is currently working toward
the PhD degree in the School of Electrical
Engineering and Computer Science, Queensland
University of Technology, Brisbane, Australia.
His research interests include feature selection
and Web intelligence.
Yan Shen received the PhD degree from the
Queensland University of Technology, Australia,
in 2013. He is a research associate in the School
of Electrical Engineering and Computer Science,
Queensland University of Technology, Australia.
His research interest includes ontology learning
and text mining.
Moch Arif Bijaksana received the master’s
degree from RMIT University, Australia. He is
currently working toward the PhD degree in the
School of Electrical Engineering and Computer
Science, Queensland University of Technology,
Australia. He is working at Telkom University,
Indonesia. His research interest includes text
classification and knowledge discovery.
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageredpel dot com
 
Bayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centersBayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centersredpel dot com
 
Architecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog networkArchitecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog networkredpel dot com
 
Analysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computingAnalysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computingredpel dot com
 
An anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computingAn anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computingredpel dot com
 
A tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataA tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataredpel dot com
 
A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...redpel dot com
 
A mobile offloading game against smart attacks
A mobile offloading game against smart attacksA mobile offloading game against smart attacks
A mobile offloading game against smart attacksredpel dot com
 

Mais de redpel dot com (20)

An efficient tree based self-organizing protocol for internet of things
An efficient tree based self-organizing protocol for internet of thingsAn efficient tree based self-organizing protocol for internet of things
An efficient tree based self-organizing protocol for internet of things
 
Validation of pervasive cloud task migration with colored petri net
Validation of pervasive cloud task migration with colored petri netValidation of pervasive cloud task migration with colored petri net
Validation of pervasive cloud task migration with colored petri net
 
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
 
Towards a virtual domain based authentication on mapreduce
Towards a virtual domain based authentication on mapreduceTowards a virtual domain based authentication on mapreduce
Towards a virtual domain based authentication on mapreduce
 
Toward a real time framework in cloudlet-based architecture
Toward a real time framework in cloudlet-based architectureToward a real time framework in cloudlet-based architecture
Toward a real time framework in cloudlet-based architecture
 
Protection of big data privacy
Protection of big data privacyProtection of big data privacy
Protection of big data privacy
 
Privacy preserving and delegated access control for cloud applications
Privacy preserving and delegated access control for cloud applicationsPrivacy preserving and delegated access control for cloud applications
Privacy preserving and delegated access control for cloud applications
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...
 
Frequency and similarity aware partitioning for cloud storage based on space ...
Frequency and similarity aware partitioning for cloud storage based on space ...Frequency and similarity aware partitioning for cloud storage based on space ...
Frequency and similarity aware partitioning for cloud storage based on space ...
 
Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...
 
Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...
 
Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
 
Bayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centersBayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centers
 
Architecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog networkArchitecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog network
 
Analysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computingAnalysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computing
 
An anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computingAn anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computing
 
A tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataA tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big data
 
A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...
 
A mobile offloading game against smart attacks
A mobile offloading game against smart attacksA mobile offloading game against smart attacks
A mobile offloading game against smart attacks
 

Último

ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 

Último (20)

ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 


gence communities [32].
There are two challenging issues in using pattern mining techniques for finding relevance features in both relevant and irrelevant documents [32]. The first is the low-support problem: given a topic, long patterns are usually more specific to the topic, but they typically appear in documents with low support (frequency). If the minimum support is decreased, many noisy patterns are discovered. The second is the misinterpretation problem: the measures used in pattern mining (e.g., "support" and "confidence") turn out to be unsuitable when patterns are applied to solve problems. For example, a highly frequent pattern (normally a short pattern) may be a general pattern, since it can be frequently used in both relevant and irrelevant documents. Hence, the difficult problem is how to use discovered patterns to accurately weight useful features.

Several existing methods address these two challenging issues in text mining. Pattern taxonomy mining (PTM) models [59], [60], [70] mine closed sequential patterns in text paragraphs and deploy them over a term space to weight useful features. The concept-based model (CBM) [50], [51] discovers concepts by using natural language processing (NLP) techniques; it uses verb-argument structures to find concepts in sentences. These pattern-based (or concept-based) approaches have shown an important improvement in effectiveness [70]. However, the gains over the best term-based methods remain modest, because how to effectively integrate patterns in both relevant and irrelevant documents is still an open problem.

Over the years, many mature term-based techniques have been developed for ranking documents, information filtering and text classification [37], [39], [44]. Recently, several hybrid approaches were proposed for text classification.
To learn term features from only relevant and unlabelled documents, [27] used two term-based models: in the first stage, it utilized a Rocchio classifier to extract a set of reliable irrelevant documents from the unlabelled set; in the second stage, it built an SVM classifier to classify text documents. A two-stage model was also proposed in [34], [35], which showed that integrating rough analysis (a term-based model) with pattern taxonomy mining is the best way to design a two-stage model for information filtering systems.

For many years, we have observed that many terms with larger weights are more general, because they are likely to be used frequently in both relevant and irrelevant documents [32]. For example, the word "LIB" may be used more frequently than the word "JDK"; but "JDK" is more specific than "LIB" for describing "Java Programming Languages", and "LIB" is more general than "JDK" because "LIB" is also frequently used in other programming languages such as C or

Y. Li, A. Algarni, Y. Shen, and M. Bijaksana are with the School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia. E-mail: y2.li@qut.edu.au, {algarni.abdulmohsen, arifbijaksana}@gmail.com, y1.shen@student.qut.edu.au.
M. Albathan is with the School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia, and the Al Imam Mohammad Ibn Saud Islamic University, P.O. Box 5701, Riyadh 11432, Saudi Arabia. E-mail: mubarak.albathan@student.qut.edu.au.
Manuscript received 2 May 2013; revised 1 Nov. 2014; accepted 4 Nov. 2014. Date of publication 23 Nov. 2014; date of current version 27 Apr. 2015. Recommended for acceptance by P. G. Ipeirotis. For information on obtaining reprints of this article, please send e-mail to reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2014.2373357.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015. 1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
C++. Therefore, we recommend considering both a term's distribution and its specificity for relevance feature discovery. Given a topic, a term's specificity describes the extent to which the term focuses on the topic that users want [33]. However, it is very difficult to measure the specificity of terms, because a term's specificity depends on users' perspectives of their information needs [55]. We proposed the first definition of specificity in [30], [31], which calculated the specificity score of a term based on its appearance in discovered positive and negative patterns. However, that definition required an iterative algorithm (three nested loops) in order to weight terms accurately.

To make a breakthrough on the two challenging issues, we proposed the first version of the RFD model in [32]. Based on the distributions of terms in a training set, it provided a new definition of the specificity function and used two empirical parameters to group terms into three categories: "positive specific terms", "general terms", and "negative specific terms". With these definitions, the RFD model can accurately evaluate term weights according to both their specificity and their distributions in the higher-level features, where the higher-level features include both positive and negative patterns.

The term classification method proposed in [32] requires manually setting two empirical parameters according to test sets. In this paper, we continue to develop the RFD model, and experimentally show that the proposed specificity function is reasonable and that the term classification can be effectively approximated by a feature clustering method. We also design a comprehensive approach for evaluating the proposed models.
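To make the three-way grouping concrete, here is a minimal sketch. The specificity score below is a deliberately simplified stand-in (the difference between the fractions of relevant and irrelevant documents containing the term), not the paper's actual specificity function, and the two thresholds play the role of the empirical parameters mentioned above; all names and values are illustrative:

```python
def specificity(term, rel_docs, irr_docs):
    """Toy specificity: fraction of relevant documents containing the term
    minus the fraction of irrelevant documents containing it.
    (A simplified stand-in, NOT the paper's exact specificity function.)"""
    r = sum(term in d for d in rel_docs) / len(rel_docs)
    i = sum(term in d for d in irr_docs) / len(irr_docs)
    return r - i

def classify_terms(terms, rel_docs, irr_docs, theta1=0.3, theta2=-0.1):
    """Group terms into the three RFD categories using two thresholds
    (theta1, theta2 stand in for the two empirical parameters)."""
    groups = {"positive_specific": [], "general": [], "negative_specific": []}
    for t in terms:
        s = specificity(t, rel_docs, irr_docs)
        if s > theta1:
            groups["positive_specific"].append(t)
        elif s < theta2:
            groups["negative_specific"].append(t)
        else:
            groups["general"].append(t)
    return groups

# Toy training set: documents as sets of terms.
rel = [{"jdk", "java"}, {"jdk", "code"}, {"java", "lib"}]
irr = [{"lib", "c"}, {"lib", "java"}]
```

With this toy training set, "jdk" ends up positive specific, "lib" negative specific, and "java" general, mirroring the LIB/JDK example above.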
In addition, we conducted new experiments that use six sliding windows to adaptively update the training sets, and applied the RFD model to binary text classification to test the robustness of the proposed model.

This paper proposes an innovative technique for finding and classifying low-level terms based on both their appearances in the higher-level features (patterns) and their specificity in a training set. It also introduces a method to select irrelevant documents (so-called offenders) that are close to the extracted features in the relevant documents, in order to effectively revise term weights. Compared with other methods, the advantages of the proposed model are: (1) effective use of both relevant and irrelevant feedback to find useful features; and (2) integration of term and pattern features together, rather than using them in two separate stages.

To justify these claims, we conducted substantial experiments on standard data collections, namely the Reuters Corpus Volume 1 (RCV1), TREC filtering assessor topics, the Library of Congress Subject Headings (LCSH) ontology and Reuters-21578. We also used five measures and the t-test to evaluate these experiments. The results show that the proposed specificity function is adequate, the clustering method is effective and the proposed model is robust. The results also show that the proposed model significantly outperforms, on most measures, both the state-of-the-art term-based methods underpinned by Okapi BM25, Rocchio, language models and SVM, and the pattern-based methods.

The remainder of this paper is organized as follows. Section 2 gives a detailed overview of related work. Section 3 reviews the concept of features in text documents. Section 4 discusses the RFD model. Section 5 proposes a new feature clustering method based on the specificity function.
To evaluate the performance of the proposed model, we conduct substantial experiments on LCSH, RCV1, TREC filtering topics and Reuters-21578. The empirical results and discussion are reported in Section 6, followed by concluding remarks in the last section.

2 RELATED WORK

Feature selection is a technique that selects a subset of features from data for modeling systems (see http://en.wikipedia.org/wiki/Feature_selection). Over the years, a variety of feature selection methods (e.g., filter, wrapper, embedded and hybrid approaches, and unsupervised or semi-supervised methods) have been proposed in various fields [6], [9], [17], [54], [69]. Feature selection is also one of the important steps for text classification and information filtering [1], [5], [47], the task of assigning documents to predefined classes. To date, many classifiers have been developed, such as Naive Bayes, Rocchio, kNN, SVM and Lasso regression [16], [26], [27], [28], [37], [62], [66]; in addition, many believe that SVM is a promising classifier [13]. Classification problems include single-class and multi-class problems. The most common solution [71] to the multi-class problem is to decompose it into several independent binary classifiers, where a binary classifier assigns a document to one of two predefined classes (e.g., the relevant or the irrelevant category). Most traditional text feature selection methods used the bag-of-words model to select a set of features for the multi-class problem [13]. Several feature selection criteria exist for text categorization, including document frequency (DF), global IDF, information gain, mutual information (MI), chi-square (χ²) and term strength [1], [29], [37], [45], [67].

In this paper we focus on relevant feature selection in text documents. Relevance is a major research issue [25], [32], [65] for Web search, concerning a document's relevance to a user or a query.
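As a concrete example of one of the criteria listed above, the chi-square (χ²) statistic for a term can be computed from its 2×2 contingency table over the two classes; a minimal sketch (the variable names are ours):

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term from its 2x2 contingency counts:
    a = relevant docs containing the term, b = irrelevant docs containing it,
    c = relevant docs without it,          d = irrelevant docs without it."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0
```

A term that occurs only in one class gets a high score (e.g., chi_square(10, 0, 0, 10) is 20.0), while a term spread evenly over both classes scores zero.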
However, traditional feature selection methods are not effective at selecting text features for the relevance issue, because relevance is a single-class problem [13]. An efficient way to select features for relevance is to use a feature weighting function. A feature weighting function indicates the degree of information represented by a feature's occurrences in a document, and reflects the relevance of the feature. Popular term-based ranking models include tf*idf-based techniques, the Rocchio algorithm, probabilistic models and Okapi BM25 [4], [24], [37], [44].

Recently, one of the important issues for multimedia data has been identifying the optimal feature set without any redundancy [69]. For text feature selection, however, the challenging issue is identifying in which form, and where, the relevant features occur in a text document, because of the large amount of noisy information in the document [2]. Text features can be simple structures (words),
complex linguistic structures or statistical structures. We mainly discuss three complex structures below for selecting relevant features: n-grams, concepts and patterns.

n-grams (or phrases) are more discriminative and carry more "semantics" than words, and have been useful for building good ranking functions [20], [47], [53]. In [49], a phrase-based text representation for Web document management was proposed that used rule-based natural language processing and context-free grammar techniques. Language models were proposed to calculate weights for n-grams; they are often approximated by unigram, bigram or trigram models in order to account for word dependencies [8], [39], [53], [58]. A concept-based model [50], [51] was also presented to find concepts in text documents by using NLP techniques, analysing term associations based on the semantic structure of sentences. This model includes three components: the first analyses the semantic structure of sentences; the second constructs a conceptual ontological graph (COG) to represent the semantic structures; and the last finds top concepts, according to the first two components, to generate feature vectors using the standard vector space model.

Pattern mining has been studied extensively in the data mining community for many years. A variety of efficient algorithms, such as Apriori-like algorithms, PrefixSpan, FP-tree, SPADE, SLPMiner and GST, have been proposed [18], [19], [40], [48], [68]. Pattern post-processing methods have also been proposed to compress or group patterns into clusters [64]. However, interpreting useful patterns for text mining remains an open problem [32]. Typically, text mining discusses term associations at a broad-spectrum level, paying little attention to labeled information and duplications of terms [33], [34]. Usually, existing text mining techniques return numerous patterns (sets of terms) in text documents.
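The scale of that output is easy to see with a naive enumeration of frequent termsets over paragraphs, a sketch of the result Apriori-like algorithms compute, without their candidate pruning (function names and data are ours, for illustration only):

```python
from itertools import combinations

def frequent_patterns(paragraphs, min_sup):
    """Naively enumerate all frequent termsets in a list of paragraph
    term-sets, returning {pattern: absolute support}. Stops growing the
    pattern size once no frequent pattern of that size exists."""
    terms = sorted(set().union(*paragraphs))
    frequent = {}
    for size in range(1, len(terms) + 1):
        found = False
        for cand in combinations(terms, size):
            sup = sum(set(cand) <= dp for dp in paragraphs)  # paragraphs covering cand
            if sup >= min_sup:
                frequent[cand] = sup
                found = True
        if not found:  # anti-monotonicity: no larger pattern can be frequent
            break
    return frequent

# Toy document: four paragraphs over three terms.
PS = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
```

Even in this four-paragraph toy, three terms already yield six frequent patterns at min_sup = 2, and the output grows combinatorially on real text.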
Not surprisingly, many patterns are redundant or noisy. Therefore, the challenging issue is how to effectively deal with a very large set of patterns and terms carrying a lot of redundant or noisy information [32].

To reduce the quantity of redundant information, closed patterns have turned out to be a good alternative to phrases [21], [60]. To effectively use closed patterns for weighting terms, a pattern deploying method was proposed in [59] to map closed patterns into a term vector that includes a set of terms and a term-weight distribution. This method has also shown encouraging improvements in effectiveness in comparison with traditional IR models [3], [32], [34]. The big obstacle for pattern mining based approaches to text mining is how to effectively use both relevant and irrelevant feedback. In [70], a pattern deploying method was proposed to update positive patterns; however, the improvement in effectiveness was not significant.

In regard to the aforementioned problem of redundancy and noise, another challenging issue for pattern-based methods is how to deal with low frequency patterns [32]. By way of illustration, a short pattern (normally with large support, also called a highly frequent pattern) is usually a general pattern, whereas a large pattern (a low frequency pattern with small support) could be a specific one. Recently, a clustering-based feature subset selection method has been presented that views features as clusters to reduce dimensionality [54]. Another interesting idea is to identify interesting features in LDA topics [14].

In summary, the existing methods for finding relevance features can be grouped into three approaches [32]. The first approach tries to diminish the weights of terms that appear in both relevant documents and irrelevant documents (e.g., Rocchio-based models [41]). This heuristic is obvious if we assume that terms are isolated atoms.
The second one is based on how often features appear or do not appear in relevant and irrelevant documents (e.g., probabilistic models [61] or BM25 [43], [44]). The third one is based on finding features through positive patterns [32], [59], [60]. The proposed model further develops the third approach by grouping features into three categories: "positive specific features", "general features", and "negative specific features".

3 DEFINITIONS

For a given topic, the goal of relevance feature discovery in text documents is to find a set of useful features, including patterns, terms and their weights, in a training set D, which consists of a set of relevant documents, D+, and a set of irrelevant documents, D−. In this paper, we assume that all text documents d are split into paragraphs, PS(d). In this section, we introduce the basic definitions about patterns and the deploying method. These definitions can also be found in [32], [34], [59].

3.1 Frequent and Closed Patterns

Let T1 = {t1, t2, ..., tm} be a set of terms (or words) extracted from D+, and let a termset X be a set of terms. For a given document d, coverset(X) is the covering set of X in d, which includes all paragraphs dp ∈ PS(d) such that X ⊆ dp, i.e., coverset(X) = {dp | dp ∈ PS(d), X ⊆ dp}. Its absolute support is the number of occurrences of X in PS(d), that is, sup_a(X) = |coverset(X)|. Its relative support is the fraction of the paragraphs that contain the pattern, that is, sup_r(X) = |coverset(X)| / |PS(d)|. A termset X is called a frequent pattern if its sup_a (or sup_r) ≥ min_sup, a given minimum support.

It is obvious that a termset X can be mapped to a set of paragraphs coverset(X). We can also map a set of paragraphs Y ⊆ PS(d) to a termset, which satisfies

  termset(Y) = {t | ∀dp ∈ Y ⇒ t ∈ dp}.

A pattern X (also a termset) is called closed if and only if X = termset(coverset(X)). Let X be a closed pattern. We have

  sup_a(X1) < sup_a(X)    (1)

for all patterns X1 ⊃ X.
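The definitions above can be sketched directly in code. The following is a minimal illustration (our own function names and toy paragraphs, not from the paper), treating each paragraph of a document as a set of terms:

```python
def coverset(X, paragraphs):
    """Paragraphs of PS(d) that contain every term of termset X."""
    return [dp for dp in paragraphs if X <= dp]

def sup_a(X, paragraphs):
    """Absolute support: |coverset(X)|."""
    return len(coverset(X, paragraphs))

def sup_r(X, paragraphs):
    """Relative support: fraction of paragraphs containing X."""
    return sup_a(X, paragraphs) / len(paragraphs)

def termset(Y):
    """Terms shared by every paragraph in Y."""
    return set.intersection(*Y) if Y else set()

def is_closed(X, paragraphs):
    """X is closed iff X == termset(coverset(X))."""
    return X == termset(coverset(X, paragraphs))

paras = [{"mine", "text", "pattern"}, {"mine", "pattern"}, {"text", "rank"}]
print(sup_a({"mine", "pattern"}, paras))      # 2
print(is_closed({"mine"}, paras))             # False: "pattern" always co-occurs
print(is_closed({"mine", "pattern"}, paras))  # True
```

The last two calls show the point of closedness: {"mine"} is not closed because every paragraph containing it also contains "pattern", so {"mine", "pattern"} has the same support and subsumes it.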
All closed patterns can be structured into a pattern taxonomy by using the subset (also called is-a) relation [59].

3.2 Closed Sequential Patterns

A sequential pattern s = <t1, ..., tr> (ti ∈ T1) is an ordered list of terms. A sequence s1 = <x1, ..., xi> is called a sub-sequence of another sequence s2 = <y1, ..., yj>, denoted by s1 ⊑ s2, iff there exist j1, ..., ji such that 1 ≤ j1 < j2 < ... < ji ≤ j and x1 = y_{j1}, x2 = y_{j2}, ..., xi = y_{ji}. Given s1 ⊑ s2, we call s1 a sub-pattern of s2, and s2 a super-pattern of s1. In the following, we refer to sequential patterns simply as patterns.

1658 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015
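The sub-sequence relation s1 ⊑ s2 can be tested with a greedy left-to-right scan, sketched below (an illustration under our own naming, not the paper's code):

```python
def is_subsequence(s1, s2):
    """True iff s1 is an ordered (not necessarily contiguous) sub-sequence of s2."""
    it = iter(s2)
    # 'x in it' advances the iterator past each match,
    # so the matched positions j1 < j2 < ... stay strictly increasing
    return all(x in it for x in s1)

print(is_subsequence(["text", "mining"], ["text", "data", "mining"]))  # True
print(is_subsequence(["mining", "text"], ["text", "data", "mining"]))  # False
```

The second call is False because the sub-sequence test respects term order, which is what distinguishes sequential patterns from plain termsets.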
Given a sequential pattern X in document d, coverset(X) is still used to describe the covering set of X, which includes all paragraphs ps ∈ PS(d) such that X ⊑ ps, i.e., coverset(X) = {ps | ps ∈ PS(d), X ⊑ ps}. Its absolute support and relative support are defined in the same way as for normal patterns. A sequential pattern X is called a frequent pattern if its relative support ≥ min_sup. The property of closed patterns (see Eq. (1)) is used to define closed sequential patterns: a frequent sequential pattern X is closed if sup_a(X1) ≠ sup_a(X) for any super-pattern X1 of X.

3.3 Deploying Higher Level Patterns on Low-Level Terms

For term-based approaches, weighting the usefulness of a given term is based on its appearance in documents. For pattern-based approaches, however, weighting the usefulness of a given term is based on its appearance in discovered patterns. To improve the efficiency of pattern taxonomy mining, an algorithm, SPMining(D+, min_sup) [60], was proposed (also used in [34], [59]) to find closed sequential patterns for all documents in D+; it uses the well-known Apriori property to reduce the search space. For each relevant document di ∈ D+, the SPMining algorithm discovers all closed sequential patterns, SPi, based on a given min_sup. We do not repeat this algorithm here because it is not the particular focus of this study.

Let SP1, SP2, ..., SP|D+| be the sets of discovered closed sequential patterns for all documents di ∈ D+ (i = 1, ..., n), where n = |D+|. For a given term t, its d_support (deploying support, called weight in this paper) in the discovered patterns is defined as follows:

  d_sup(t, D+) = Σ_{i=1}^{n} sup_i(t) = Σ_{i=1}^{n} |{p | p ∈ SPi, t ∈ p}| / Σ_{p ∈ SPi} |p|,    (2)

where |p| is the number of terms in p.
After the deploying supports of terms have been computed from the training set, let w(t) = d_sup(t, D+); the following rank function is used to decide the relevance of a document d:

  rank(d) = Σ_{t ∈ T} w(t) τ(t, d),    (3)

where τ(t, d) = 1 if t ∈ d; otherwise τ(t, d) = 0.

4 RFD MODEL

In this section, we introduce the RFD model for relevance feature discovery, which describes the relevant features in relation to three groups: positive specific terms, general terms and negative specific terms, based on their appearances in a training set. We first discuss the concept of "specificity" in terms of the relative specificity in training datasets and the absolute specificity in a domain ontology. We also present a way to assess whether the proposed relative specificity is reasonable in terms of the absolute specificity. Finally, we introduce the term weighting method in the RFD model.

4.1 Specificity Function

In the RFD model, a term's specificity (referred to as relative specificity in this paper) is defined [32] according to its appearance in a given training set. Let T2 be a set of terms extracted from D− and T = T1 ∪ T2. Given a term t ∈ T, its coverage+ is the set of relevant documents that contain t, and its coverage− is the set of irrelevant documents that contain t. We assume that the terms frequently used in both relevant and irrelevant documents are general terms. Therefore, we want to classify the terms that are more frequently used in the relevant documents into the positive specific category; the terms that are more frequently used in the irrelevant documents are classified into the negative specific category. Based on the above analysis, we define the specificity of a given term t in the training set D = D+ ∪ D− as follows:

  spe(t) = (|coverage+(t)| − |coverage−(t)|) / n,    (4)

where coverage+(t) = {d ∈ D+ | t ∈ d}, coverage−(t) = {d ∈ D− | t ∈ d}, and n = |D+|.
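Equations (2) and (3) can be sketched as follows (our own names: `SPs` stands for the list [SP_1, ..., SP_n] of per-document closed sequential pattern sets, and the toy data is illustrative):

```python
def d_sup(t, SPs):
    """Eq. (2): sum over documents of (#patterns containing t) / (total pattern length)."""
    total = 0.0
    for SP_i in SPs:
        length = sum(len(p) for p in SP_i)        # Σ_{p∈SP_i} |p|
        if length:
            total += sum(1 for p in SP_i if t in p) / length
    return total

def rank(doc_terms, w):
    """Eq. (3): sum of weights of the terms that occur in the document."""
    return sum(weight for t, weight in w.items() if t in doc_terms)

# two documents' closed sequential patterns, each pattern a tuple of terms
SPs = [[("text", "mining"), ("mining",)], [("text",)]]
w = {t: d_sup(t, SPs) for t in {"text", "mining"}}
print(rank({"text", "rank"}, w))  # w("text") = 1/3 + 1/1 = 4/3
```

Only "text" occurs in the ranked document, so its rank is exactly the deployed weight of "text".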
spe(t) > 0 means that term t is used more frequently in relevant documents than in irrelevant documents. Based on the spe function, we have the following classification rules for determining the general terms G, the positive specific terms T+ and the negative specific terms T−:

  G = {t ∈ T | θ1 ≤ spe(t) ≤ θ2},
  T+ = {t ∈ T | spe(t) > θ2}, and
  T− = {t ∈ T | spe(t) < θ1},

where θ2, an experimental coefficient, is the maximum boundary of the specificity for the general terms, and θ1, also an experimental coefficient, is the minimum boundary of the specificity for the general terms. We assume that θ2 > 0 and θ2 ≥ θ1. It is easy to verify that G ∩ T+ = T+ ∩ T− = G ∩ T− = ∅. Therefore, {G, T+, T−} is a partition of all terms.

A term's relative specificity describes the extent to which the term focuses on the topic that users want. It is very difficult to measure the relative specificity of terms because a term's specificity depends on users' perspectives of their information needs [55]. For example, "knowledge discovery" is a general term in the data mining community; however, it may be a specific term when we talk about information technology.

In this paper, we propose a way to assess whether the proposed relative specificity is reasonable in terms of the absolute specificity in a domain ontology, where "absolute" means the specificity is independent of any training dataset. Normally, people consider terms to be more general if they are frequently used in a very large domain ontology; otherwise, they are more specific. Therefore, we define the absolute specificity of a term in the ontology as spe_onto(t) = 1 / |coverage(t)|, where coverage(t) denotes the set of concepts or subjects that use term t to describe their meaning. To scale the spe_onto values into the range between 0 and 1, we normalize the above equation as follows:

  spe_onto(t) = log10(N / |coverage(t)|) / log10(N / M),    (5)

where N is the total number of subjects and M is the maximum of |coverage(t)| for all t ∈ T.
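Equations (4) and (5) and the θ1/θ2 classification rules can be sketched as follows (our own helper names, thresholds and toy documents; they are not the paper's experimental settings):

```python
import math

def spe(t, D_pos, D_neg):
    """Eq. (4): (|coverage+(t)| - |coverage-(t)|) / n with n = |D+|."""
    n = len(D_pos)
    cov_pos = sum(1 for d in D_pos if t in d)
    cov_neg = sum(1 for d in D_neg if t in d)
    return (cov_pos - cov_neg) / n

def classify(terms, D_pos, D_neg, theta1, theta2):
    """Partition terms into (T+, G, T-) by the theta1/theta2 rules."""
    T_pos, G, T_neg = set(), set(), set()
    for t in terms:
        s = spe(t, D_pos, D_neg)
        if s > theta2:
            T_pos.add(t)
        elif s >= theta1:          # theta1 <= s <= theta2
            G.add(t)
        else:
            T_neg.add(t)
    return T_pos, G, T_neg

def spe_onto(coverage_size, N, M):
    """Eq. (5): normalized absolute specificity in an ontology."""
    return math.log10(N / coverage_size) / math.log10(N / M)

D_pos = [{"mining", "pattern"}, {"mining", "text"}]
D_neg = [{"text", "web"}]
print(classify({"mining", "text", "web"}, D_pos, D_neg, 0.0, 0.5))
print(spe_onto(100, 394070, 100))  # the most widely used term scores 1.0
```

In the toy data, "mining" appears only in relevant documents (spe = 1), "text" appears once on each side (spe = 0), and "web" appears only in irrelevant documents (spe = −0.5), so the three terms land in T+, G and T− respectively.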
We call a relative spe function reasonable if the average absolute specificity of its positive specific terms (T+) is greater than the average absolute specificity of its general terms (G).

4.2 Weighting Features

To describe relevance features for a given topic, we normally believe that specific terms are very useful for distinguishing the topic from other topics. However, our experiments (see Section 6.6.2) show that using only specific terms is not good enough to improve the performance of relevance feature discovery, because user information needs cannot simply be covered by documents that contain only the specific terms. Therefore, the best way is to use the specific terms mixed with some of the general terms. We discuss this issue in the evaluation section.

To improve the effectiveness, RFD uses irrelevant documents in the training set in order to remove noise. The first issue in using irrelevant documents is how to select a suitable subset, since a very large set of negative samples is typically available. For example, a Google search can return millions of documents; however, only a few of those documents may be of interest to a Web user. Obviously, it is not efficient to use all of the irrelevant documents.

Most models can rank documents (see the ranking function in Equation (3)) using a set of extracted features. If an irrelevant document gets a high rank, the document is called an offender [33] because it is a false discovery. The offenders are normally defined as the top-K ranked irrelevant documents. The basic hypothesis in this paper is that relevance features are used to describe relevant documents, while irrelevant documents are used to assure the discrimination of the extracted features. Therefore, RFD only selects some offenders (i.e., the top-K ranked irrelevant documents) rather than using all irrelevant documents.
In Section 6.6.1 we discuss the performance of using different K values, where K = n/2 obtained the best performance. Once we select the top-K irrelevant documents, the set of irrelevant documents D− is reduced to include only the K offenders; therefore, we have |D+| ≥ 2|D−| if K = n/2. The spe function reaches its maximum value, 1, if there is a term t such that coverage−(t) = ∅, and its minimum value, −1/2, if there is a term t such that coverage+(t) = ∅. Let 0 < θ2 < 1; then we can easily verify that −1/2 ≤ θ1 < θ2 < 1 if K = n/2.

The calculation of the original RFD term weighting function [32] includes two steps: initial weight calculation and weight revision. Based on Equation (2), in this paper we integrate the two steps into the following equation:

  w(t) = d_sup(t, D+)(1 + spe(t))      if t ∈ T+
         d_sup(t, D+)                  if t ∈ G
         d_sup(t, D+)(1 − |spe(t)|)    if t ∈ T1
         −d_sup(t, D−)(1 + |spe(t)|)   otherwise,

where the cases are evaluated top-down and the d_sup function is defined in Equation (2).

5 TERM CLASSIFICATION

RFD uses both specific features (e.g., T+ and T−) and general features (e.g., G). Therefore, the key research question is how to find the best partition (T+, G, T−) to effectively classify relevant documents and irrelevant documents. For a given set of features, however, this question is an NP-hard problem because of the large number of possible combinations of groups of features. In this section we propose an approximation approach and efficient algorithms to refine the RFD model.

5.1 An Approximation Approach

The best partition (T+, G, T−) is used to clearly distinguish irrelevant documents from relevant ones. Assume that we have two characteristic functions, f1 and f2, on all terms, such that f1(t) is the approximate average weight of t for all relevant documents, and f2(t) is the approximate average weight of t for all irrelevant documents. The best partition (T+, G, T−) is then the one that maximizes the following integration: ∫_{t1}^{tn} (f1(t) − f2(t)) dt.
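The integrated weighting function can be sketched as follows (our own names and toy values; since the cases are checked top-down, the third branch captures negative specific terms that still occur in positive patterns, i.e., t ∈ T− ∩ T1):

```python
def weight(t, T_pos, G, T1, spe, d_sup_pos, d_sup_neg):
    """Integrated RFD term weight, cases evaluated top-down as in Section 4.2."""
    s = spe[t]
    if t in T_pos:
        return d_sup_pos[t] * (1 + s)            # boosted positive specific term
    if t in G:
        return d_sup_pos[t]                      # general term, weight unchanged
    if t in T1:                                  # negative specific, but in D+ patterns
        return d_sup_pos[t] * (1 - abs(s))       # discounted positive weight
    return -d_sup_neg[t] * (1 + abs(s))          # purely negative term

# toy setting: T1 holds all terms seen in positive patterns
T_pos, G, T1 = {"mining"}, {"text"}, {"mining", "text", "web"}
spe_v = {"mining": 0.8, "text": 0.1, "web": -0.4, "spam": -0.9}
dpos = {"mining": 1.2, "text": 0.9, "web": 0.3}
dneg = {"spam": 0.7}
print(weight("mining", T_pos, G, T1, spe_v, dpos, dneg))  # 1.2 * 1.8
print(weight("spam", T_pos, G, T1, spe_v, dpos, dneg))    # -0.7 * 1.9
```

Note how the last branch is the only one producing a negative weight, so terms outside the positive patterns actively push a document's rank down.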
The above discussion motivates us to find adequate θ1 and θ2 that move the positive specific features far away from the negative specific features. If we view the terms that have the same specificity score as a cluster and use the spe function as the distance function, the new solution is to find three groups that clearly divide the terms into three categories.

Based on the above analysis, we can develop a clustering method to group terms into three categories automatically for each topic by using the specificity function. In the beginning, we assign the terms that appear only in irrelevant documents to the negative specific category T−. For the remaining terms, we initially view each term ti as a single cluster ci. We also represent each cluster ci as an interval [minspe(ci), maxspe(ci)], where minspe(ci) is the smallest spe value of the elements in ci, and maxspe(ci) is the largest spe value of the elements in ci. Let ci and cj be two clusters. The difference between the two clusters is defined as follows:

  dif(ci, cj) = min{|maxspe(ci) − minspe(cj)|, |maxspe(cj) − minspe(ci)|}.

A bottom-up approach is used to merge the two clusters that have the minimum difference. Let ck be the merged cluster of ci and cj; then we have ck = ci ∪ cj, minspe(ck) = min{minspe(ci), minspe(cj)} and maxspe(ck) = max{maxspe(ci), maxspe(cj)}. The merging operation continues until three clusters are left, provided the number of initial clusters is greater than three. The distances between two adjacent clusters in the retained three clusters should be greater than or equal to any other distances between two adjacent clusters. The cluster that has the biggest minspe is designated as T+, the cluster with the second biggest minspe forms category G, and the remainder becomes part of T−.

5.2 Efficient Algorithms

Algorithm FClustering describes the process of feature clustering, where DP+ is the set of discovered patterns of D+ and DP− is the set of discovered patterns of D−.
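The bottom-up merging described above can be sketched as a simplified agglomerative procedure (our reading of Section 5.1, not a line-by-line port of Algorithm FClustering; it assumes terms appearing only in irrelevant documents were already removed, and it pads with empty groups if fewer than three clusters exist):

```python
def cluster_terms(spe_scores):
    """Merge adjacent singleton clusters (sorted by descending spe)
    until three remain; return (T+, G, T-)."""
    C = [[(t, s)] for t, s in sorted(spe_scores.items(), key=lambda x: -x[1])]

    def dif(ci, cj):
        # clusters keep descending order, so [0] is maxspe and [-1] is minspe
        max_i, min_i = ci[0][1], ci[-1][1]
        max_j, min_j = cj[0][1], cj[-1][1]
        return min(abs(max_i - min_j), abs(max_j - min_i))

    while len(C) > 3:
        k = min(range(len(C) - 1), key=lambda i: dif(C[i], C[i + 1]))
        C[k] = C[k] + C.pop(k + 1)   # merge the closest adjacent pair
    groups = [{t for t, _ in c} for c in C]
    while len(groups) < 3:
        groups.append(set())
    return tuple(groups)             # (T+, G, T-)

print(cluster_terms({"a": 0.9, "b": 0.85, "c": 0.3, "d": 0.25, "e": -0.4}))
```

On the toy scores, the two narrow gaps (a–b and c–d) are merged first, leaving T+ = {a, b}, G = {c, d} and T− = {e}, which matches the intuition that the three retained clusters are separated by the largest gaps in spe.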
Steps 1 to 4 initialize the three categories. All terms that are not elements of positive patterns are assigned to category T−. For the remaining m terms, each is viewed as a single
cluster in the beginning (Steps 5 to 7). The algorithm also sorts the clusters in C based on their minspe values in Step 9. Steps 10 to 21 describe the iterative process of merging clusters until only three clusters are left. The merging process first finds the closest two adjacent clusters (Steps 11 to 14), ck and ck+1. It then merges the two clusters into one, denoted as ck (Steps 15 to 19), and deletes ck+1 from C (Steps 20 and 21). In the last step, it chooses the first cluster as T+, the second cluster as G (if it exists) and the last cluster as a part of T− (if it exists).

In the initialization, the algorithm spends the most time (O(|T|^2)) finding the initial value of T−. The initialization can also be implemented in O(|T|) if a hash function is used for the containment test. Before the merging process, it takes O(m log m) to sort C, where m = |C| and m ≤ |T|. In the while loop, it uses O(m) time to merge two clusters and takes O(m^2) to move the clusters in C. Therefore, the time complexity is O(|T| + m log m + m^2) = O(|T| + m^2) = O(|T|^2).

FClustering()
Input: Discovered features T; DP+; DP−; and function spe.
Output: Three categories of terms T+, G and T−.
Method:
 1: G = ∅; T+ = ∅; T− = ∅;
 2: foreach ti ∈ T do
 3:   if ti ∉ {t | t ∈ P, P ∈ DP+}
 4:   then T− = T− ∪ {ti};
 5: foreach ti ∈ T − T− do {
 6:   ci = {ti};
 7:   maxspe(ci) = minspe(ci) = spe(ti); }
 8: let m = |T − T−|;
 9: let C = {c1, c2, ..., cm} and minspe(c1) ≥ ... ≥ minspe(cm);
10: while (|C| > 3) { // start merging process
11:   let k = 1 and mind = dif(c1, c2);
12:   for i = 2 to m − 1 do
13:     if dif(ci, ci+1) < mind
14:     then { k = i; mind = dif(ci, ci+1); }
15:   let ck = ck ∪ ck+1;
16:   if minspe(ck+1) < minspe(ck)
17:   then minspe(ck) = minspe(ck+1);
18:   if maxspe(ck+1) > maxspe(ck)
19:   then maxspe(ck) = maxspe(ck+1);
20:   for i = k + 1 to m − 1 do // delete ck+1 from C
21:     let ci = ci+1; }
22: if |C| = 1 then T+ = c1
23: else if |C| = 2 then { T+ = c1; G = c2 }
24: else { T+ = c1; G = c2; T− = T− ∪ c3 };

Algorithm WFeature is applied to calculate term weights after the terms are classified using Algorithm FClustering. It first calculates the sup function and the spe function (Steps 1 to 8). For each term t, it takes O(n × |p|) to calculate the d_sup function if an inverted index is utilized, where |p| is the average size of a pattern and |p| ≤ |d|. For each term t, it also takes O(n × |d|) to calculate the spe function. Therefore, the time complexity of this part is O((|T| × n × |p|) + (|T| × n × |d|)) = O(|T| × |d| × n). It also uses Algorithm FClustering (Step 9) to classify the terms into the three categories T+, G and T−. Finally, it calculates the weights of terms using the w function defined in Section 4.2.

WFeature()
Input: An updated training set, {D+, D−}; extracted features T; DP+; DP−; and the initial term weight function w.
Output: A term weight function.
Method:
 1: let n = |D+|;
 2: T1 = {t | t ∈ p, p ∈ DP+};
 3: foreach t ∈ T do
 4:   if t ∈ T1
 5:   then sup(t) = d_sup(t, D+);
 6:   else sup(t) = −d_sup(t, D−);
 7: foreach t ∈ T do
 8:   spe(t) = (|{d | d ∈ D+, t ∈ d}| − |{d | d ∈ D−, t ∈ d}|) / n;
 9: let (T+, G, T−) = FClustering(T, DP+, DP−, spe());
10: foreach t ∈ T+ do
11:   w(t) = sup(t) × (1 + spe(t));
12: foreach t ∈ T− do
13:   w(t) = sup(t) − |sup(t) × spe(t)|;

Based on the above analysis, the time complexity of Algorithm WFeature is O(|T| × |d| × n + |T|^2), where |d| is the average size of the documents and n is the number of relevant documents in the training set.
In our experiments, the size of the set of selected terms is less than 300, i.e., |T| ≤ |d|; so Algorithm WFeature is efficient.

6 EVALUATION

This section describes the testing environment and reports the experimental results and discussions. It also provides recommendations for offender selection and for the use of specific terms and general terms to describe user information needs. The proposed model is a supervised approach that needs a training set including both relevant documents and irrelevant documents.

6.1 Data

We used two popular data sets to test the proposed model: Reuters Corpus Volume 1 (RCV1), a very large data collection; and Reuters-21578, a small one. RCV1 includes 806,791 documents that cover a broad spectrum of issues or topics. TREC (2002) developed and provided 50 reliable assessor topics [44] for RCV1, aiming at testing robust information filtering systems. These topics were evaluated by human assessors at the National Institute of Standards and Technology (NIST) [52]. For each topic, a subset of the RCV1 documents is divided into a training set and a testing set. RCV1 is a standard data collection, and the TREC 50 topics are stable and sufficient for high quality experiments [55].

The Reuters-21578 corpus is a widely used collection for text mining. The data was originally collected and labelled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.1 In this experiment, we picked the set of 10 classes. Following Sebastiani's convention [11], it is also called "R8" because the two classes corn and wheat are intimately related to the class grain and were appended to the class grain.

1. Reuters-21578, http://www.daviddlewis.com/resources/
Documents in both RCV1 and Reuters-21578 are described in XML. To avoid bias in the experiments, all of the meta-data information was ignored. All documents were treated as plain text documents in a preprocessing step that included removing stop-words according to a given stop-words list and stemming terms by applying the Porter stemming algorithm.

We also used the Library of Congress Subject Headings2 to understand the definition of the spe function in a domain ontology. LCSH is a very large taxonomic knowledge classification, which was developed by librarians for organizing the large volume of library collections and for retrieving information from these collections [55]. LCSH covers 394,070 concepts or subjects.

6.2 Baseline Models and Setting

We grouped the baseline models into two categories [32]. The first category included the up-to-date pattern-based methods (frequent patterns, frequent closed patterns, sequential patterns, and sequential closed patterns), language models (n-grams) and a concept-based model. The second category included well-known term feature selection models (also called term-based models): Rocchio, BM25, SVM, mutual information, chi-square and Lasso regression.

We divided our approach into two stages. In the first stage, we used only positive patterns in the training sets. The model, called PTM, discovers sequential closed patterns from relevant documents, deploys the discovered patterns on their terms using Equation (2) and ranks documents using Equation (3). In the second stage, we use both positive and negative patterns, as described in Sections 4 and 5. We set min_sup_r = 0.2 (as suggested by [59]) for all models that use patterns.

Different from sequential patterns, n-grams extract sequential patterns with a specified number of words and with no gaps between the words [42]. n-grams are usually selected based on the sliding window technique, and the probability of an n-gram w1 w2 ...
wn is calculated using the chain rule:

  P(w1 w2 ... wn) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn−1).

In the experiments, we used three language models [37]: Unigram, Bigram and Trigram. Unigram uses 1-grams only; Bigram uses both 2-grams and 1-grams; and Trigram uses 3-grams, 2-grams and 1-grams. The probability of an n-gram is calculated in a training set D as follows:

  P(n-gram) = tf(n-gram, D+) / tf(n-gram, D),    (6)

where tf(n-gram, D+) is the number of appearances of the n-gram in D+, tf(n-gram, D) is the number of appearances of the n-gram in D, and n = 1, 2 or 3.

The concept-based model was presented in [50], [51]. CBM was also used as a baseline model in [70] for information filtering. The Rocchio model and BM25 are well-known models for representing relevant information. We used the recommended experimental parameters (suggested by [59], [60], [70]) in our experiments (please note that the term frequency is the total number of term appearances in all relevant documents).

The linear SVM has been proven very effective for text categorization and filtering [47]. Most SVMs are designed for making a binary decision rather than ranking documents. In this paper, we use SVM-Light3 for ranking documents. The optimization algorithms used in SVM-Light are described in [23]. Mutual information (MI) and chi-square (χ2) are popular methods for feature selection [37]. More details about MI and χ2 can be found in Chapter 13 of the book [37].

Lasso (least absolute shrinkage and selection operator) is another method for feature selection [57], and there have been some extensions in recent years [56], [63]. The Lasso [57] estimate is defined by

  (â, β̂) = argmin Σ_{i=1}^{n} (y_i − a − Σ_{j=1}^{p} β_j x_ij)^2  subject to  δ_i^T β ≤ t,

where δ_i^T (i = 1, 2, ..., 2^p) are p-tuples of the form (±1, ±1, ..., ±1). In our implementation, y_j = 1 if d_j ∈ D+; otherwise y_j = −|D+|/|D−| in order to make â = ȳ = 0; and δ_i^T = sign(β).
We also let x_ij = 1 if term t_i is in document d_j, and x_ij = 0 otherwise, for information filtering (this assumption is the same as in the other models). We use tf*idf weights to find β = {β_j}, and let β_j = w(t_j) − w̄ + Δ. The initial β_0 is assigned when Δ = 0 in order to make β_0 = 0, where w(t_j) is the tf*idf weight and Δ is a parameter to try the positive direction and negative direction test [16].

6.3 Evaluation Metrics

The effectiveness of a model is usually measured by the following means [32], [59]: the average precision of the top-20 documents, the F1 measure, mean average precision (MAP), the break-even point (b/p), and interpolated average precision (IAP) on 11 points. These are widely accepted and well-established evaluation metrics. Each metric focuses on a different aspect of a model's performance, as described below.

The F-beta (Fβ) measure is a function of both recall (R) and precision (P), together with a parameter β. The parameter β = 1 was used in this paper, which means that precision and recall were weighted equally. Therefore, Fβ is given by: F1 = 2PR / (P + R).

MAP measures the precision at each relevant document first, then obtains the average precision over all topics. It combines precision, relevance ranking and overall recall together to measure the performance of the models.

B/P is the value of the recall (or precision) at which the P/R curve intersects the precision = recall line. The larger the value, the better the model performs.

The 11-points measure has also been adopted in several research works [66]. It measures the performance of different models by averaging the precisions at 11 standard recall levels (recall = 0.0, 0.1, ..., 1.0, where "0.0" means cut-off = 1 in this paper). We also used a statistical method, the paired two-tailed t-test, to analyze the experimental results.

2. LCSH Web page, http://classificationweb.net/
3. SVM-Light URL: http://svmlight.joachims.org/
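Two of these metrics can be sketched for a single ranked list as follows (our own helper names and toy data; averaging `average_precision` over all topics gives MAP):

```python
def f1_at_k(ranking, rels, k):
    """F1 at a fixed cutoff k: harmonic mean of precision and recall."""
    retrieved = ranking[:k]
    tp = sum(1 for d in retrieved if d in rels)
    P = tp / k
    R = tp / len(rels)
    return 2 * P * R / (P + R) if P + R else 0.0

def average_precision(ranking, rels):
    """Average of the precision values measured at each relevant document."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in rels:
            hits += 1
            total += hits / i      # precision at this relevant document
    return total / len(rels)

ranking = ["d1", "d4", "d2", "d5"]
rels = {"d1", "d2"}
print(f1_at_k(ranking, rels, 2))         # P = R = 0.5 -> 0.5
print(average_precision(ranking, rels))  # (1/1 + 2/3) / 2
```

The break-even point and 11-point IAP are computed from the same precision/recall trajectory, just sampled at different places.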
6.4 Hypotheses

The proposed model is called the relevance feature discovery model and consists of three major steps: feature discovery and deploying, term classification, and term weighting. It first finds positive and negative patterns and terms in the training set. It then classifies terms into three categories by using parameters (θ1 and θ2) or Algorithm FClustering. Finally, it works out the term weights by using Algorithm WFeature.

In our experiments, we developed two versions of the RFD model. Both versions use negative feedback to improve the quality of the features extracted from positive feedback. The features extracted from both positive and negative feedback are classified into three categories, namely, T+, G and T−. The first version, called RFD1, uses two empirical parameters (θ1 and θ2, see [32]) to group the low-level terms into three groups. This model can achieve satisfactory performance, but the two parameters have to be decided manually according to their real performance on the testing sets. The second version, called RFD2, uses the proposed FClustering algorithm to automatically determine the three categories T+, G and T− based on the training sets.

To conduct a comprehensive investigation of the proposed model and of the ways in which term classification could help to improve the performance, the proposed model is discussed in terms of the following hypotheses. The RFD model classifies terms into three categories (positive specific terms, general terms and negative specific terms) by using the spe function.

Hypothesis H1. The spe function is reasonable for describing terms' specificity for most topics.

Hypothesis H2. The positive specific terms are the most interesting in relation to what users want, but general terms are necessary information for describing what users want. Using the three categories together can generate the best performance.

RFD1 is the state-of-the-art model for information filtering.
It can achieve satisfactory performance for a given testing set. However, it is a parameterized method, and the two empirical parameters are sensitive to changes in the testing sets.

Hypothesis H3. RFD2 overcomes the limitation of RFD1 by using a clustering method to classify the terms into three categories directly. It can achieve a similar performance to RFD1. The RFD2 model also shows remarkable performance compared with the state-of-the-art models.

6.5 Results

In this section, we first compare RFD2 and RFD1, and expect that the performance of RFD2 approximates the performance of RFD1. We also compare the RFD2 model with language models (n-grams) and other pattern-based models, especially PTM, which is the best of the existing pattern-based models. In addition, RFD2 is compared with the state-of-the-art term-based methods underpinned by Rocchio, BM25, SVM, MI, χ2 and Lasso, for each of the measures top-20, B/P, MAP, IAP and Fβ=1 on both datasets.

6.5.1 Understanding of Specificity on the LCSH Ontology

For each topic, let RFD-SPE be the set of positive specific terms determined by RFD2, and RFD-G be the set of general terms determined by RFD2. Fig. 1 shows the average spe_onto values of the terms in both RFD-SPE and RFD-G for all 50 topics on RCV1. It is obvious that most topics (90 percent) obtain larger spe_onto values for the RFD2 positive specific terms. This result supports Hypothesis H1.

6.5.2 RFD2 vs RFD1

RFD1 uses both θ1 and θ2 to group the low-level terms into three categories. To achieve satisfactory performance, we conducted cross validation for the two parameters on the testing sets, and we finally set θ1 = 0.2 and θ2 = 0.3 for RFD1 manually. RFD2 uses Algorithm FClustering to automatically group terms into the three categories T+, G and T− for each topic. Table 1 shows the average results of the five measures on all 50 assessing topics, where %chg denotes the percentage change of RFD2 over RFD1.
As shown in Table 1, RFD2 can produce the same performance as RFD1. In addition, a small improvement in four measures (top-20, B/P, IAP and Fβ=1) was observed. These results support Hypothesis H3.

6.5.3 RFD2 vs Pattern-Based Models and n-Grams

The results on data collection RCV1 for all models in the first category (RFD2, language models (n-grams), CBM and other pattern-based models) are presented in Table 2, where %chg means the percentage change of RFD2 over PTM. As noted earlier, pattern-based methods struggle on some topics, as too much noise is generated in the discovery of positive patterns. The most important findings revealed in this table are that closed sequential patterns (Closed Seq Ptns) perform better than other patterns, and that the PTM deploying method largely outperforms closed sequential patterns. The result also supports the superiority of using closed sequential patterns in text mining and highlights the importance of adopting proper pattern deploying methods on terms when using discovered patterns in text documents.

In terms of n-grams, the Trigram model outperforms the Bigram and Unigram models. The performance of the Trigram model is very good, with results similar to PTM. In order to see the effectiveness of using both positive and negative patterns for relevance feature discovery, we also compare RFD2 with the best pattern-based model, PTM, which uses positive patterns only, on Reuters-21578 (see Table 3).

Fig. 1. spe_onto(t) for all t ∈ RFD-SPE vs. spe_onto(t) for all t ∈ RFD-G.

Both tables show that RFD2 achieves excellent performance, with a 10.35 percent change on average
  • 9. for RCV1 (with a maximum of 13:10 percent and minimum of 6:99 percent) and 9:71 percent in percentage change on average for Reuters-21578 (with a maximum of 11:52 per- cent and minimum of 6:73 percent). 6.5.4 RFD2 vs Term Feature Selection Models The proposed method using RFD2 was also compared with popular feature selection models including Rocchio, BM25, SVM, MI, x2 and Lasso. The experimental results on RCV1 for all 50 assessing topics are reported in Table 4. In the table, RFD2 is also compared with Rocchio (which is the best model for feature selection) and the percentage change is calculated. As shown in Table 4, the proposed new model RFD2 achieved the best performance for the assessor topics, where RFD2 is compared with Lasso (which is the second best term-based model on RCV1). The average percentage of improvement over the standard measures is 7:90 per- cent with a maximum of 10:87 percent and minimum of 5:62 percent. The experimental results on Reuters-21578(R8) are reported in Table 5, where RFD2 is compared with SVM (which is the second best term-based model on Reuters- 21578) and the percentage change is calculated. As shown in Table 5, RFD2 also achieved the best performance. Compared to SVM, RFD2 has the same top-20 precision as SVM, and it is better than SVM for other four measures. The maximum percentage of improvement over the Fb¼1 mea- sure is 7:72 percent. At last, the statistical significance tests are illustrated in Table 6 to compare the proposed model with other high performance models on all data collections. The results show that the proposed model is significant as all p-values are less than 0.05. 6.5.5 Robustness In this paper, the robustness is used to discuss the characteris- tics of a model for describing its capacity to effectively per- form while its training sets are altered or the application environment is changed. 
We call a model robust if it still provides satisfactory performance regardless of having its training sets altered or the application environment changed. For this evaluation, we only use RCV1, because the Reuters-21578 testing set would become too small if we increased the training sets.

For altered training sets, we used six loops for each topic, and each loop used a sliding window to increase the training sets, where each sliding window included 25 documents that were randomly selected from the testing set. The 25 documents were also removed from the corresponding testing set.

TABLE 1
Comparison Results of RFD1 and RFD2 Models on All Assessing Topics on RCV1

Model  top-20  b/p     MAP     Fβ=1    IAP
RFD1   0.5570  0.4724  0.4932  0.4696  0.5125
RFD2   0.5610  0.4729  0.4930  0.4699  0.5136
%chg   +0.71%  +0.11%  -0.04%  +0.06%  +0.21%

TABLE 2
Comparison of All Pattern (Phrase) Based Methods on RCV1

Model             top-20  b/p     MAP     Fβ=1   IAP
RFD2              0.561   0.473   0.493   0.470  0.513
PTM               0.496   0.430   0.444   0.439  0.464
Seq Patterns      0.401   0.343   0.361   0.385  0.384
Closed Seq Ptns   0.406   0.353   0.364   0.390  0.392
Freq Patterns     0.412   0.352   0.361   0.386  0.384
Freq Closed Ptns  0.428   0.346   0.361   0.385  0.387
Unigram           0.417   0.386   0.388   0.404  0.411
Bigram            0.477   0.420   0.435   0.436  0.458
Trigram           0.499   0.420   0.439   0.438  0.460
CBM               0.448   0.409   0.415   0.423  0.440
%chg              +13.10  +9.87   +11.14  +6.99  +10.66

TABLE 3
Comparison of the Proposed Model with the Best Pattern-Based Model PTM on Reuters-21578(R8)

Model  top-20  b/p     MAP     Fβ=1   IAP
RFD2   0.794   0.704   0.747   0.601  0.748
PTM    0.731   0.633   0.661   0.564  0.664
%chg   +6.73   +10.24  +10.66  +9.40  +11.52

TABLE 4
Comparison Results of All Models on RCV1

Model    top-20   b/p     MAP     Fβ=1    IAP
RFD2     0.561    0.473   0.493   0.470   0.513
Rocchio  0.501    0.424   0.440   0.433   0.459
BM25     0.445    0.407   0.407   0.414   0.428
SVM      0.453    0.408   0.409   0.421   0.435
MI       0.316    0.311   0.312   0.347   0.337
χ²       0.322    0.326   0.319   0.355   0.345
Lasso    0.506    0.434   0.460   0.445   0.480
%chg     +10.87%  +8.99%  +7.17%  +5.62%  +6.88%

TABLE 5
Comparison of All Models on Reuters-21578(R8)

Model    top-20  b/p     MAP     Fβ=1    IAP
RFD2     0.794   0.699   0.745   0.600   0.746
Rocchio  0.706   0.594   0.633   0.527   0.632
BM25     0.675   0.556   0.582   0.508   0.590
SVM      0.794   0.693   0.729   0.557   0.709
MI       0.275   0.261   0.219   0.269   0.251
χ²       0.263   0.245   0.211   0.260   0.242
Lasso    0.719   0.627   0.657   0.536   0.651
%chg     0.0%    +0.87%  +2.19%  +7.72%  +5.22%

TABLE 6
p-Values for RFD2 vs Other High-Performance Models

Model    top-20   b/p      MAP      Fβ=1     IAP
Lasso    0.01100  0.01976  0.03850  0.02509  0.03040
PTM      0.00101  0.00070  0.00003  0.00002  0.00001
SVM      0.00030  0.00688  0.00160  0.00058  0.00113
Rocchio  0.00463  0.00436  0.00570  0.00496  0.00405

1664 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015
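The sliding-window protocol described above (six loops, each moving 25 randomly selected documents from the testing set into the training set) can be sketched as follows. This is a minimal reading of the protocol under stated assumptions: documents are opaque IDs, and the function name and random seed are illustrative, not from the paper.

```python
import random

def sliding_window_runs(train, test, loops=6, window=25, seed=0):
    """Sketch of the robustness protocol: in each loop, move a random
    window of documents from the testing set into the training set and
    snapshot the resulting (train, test) split for evaluation."""
    rng = random.Random(seed)
    train, test = list(train), list(test)
    runs = []
    for _ in range(loops):
        moved = rng.sample(test, window)          # 25 random test documents
        train += moved                            # grow the training set
        test = [d for d in test if d not in moved]  # remove them from testing
        runs.append((list(train), list(test)))
    return runs
```

Each snapshot in `runs` corresponds to one row (e.g. RFD2-1 … RFD2-6) of the sliding-window tables.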
The Rocchio model is a popular robust model for filtering. Table 7 shows the results of the Rocchio model under these settings. Tables 8 and 9 show the experimental results for RFD1 and RFD2, respectively. It is clear that both RFD models achieved better results when using more training documents. The performance of the RFD models is satisfactory. The comparison between RFD1 and RFD2 is shown in Table 10. The difference between RFD1 and RFD2 is not significant, as the p-values are clearly greater than 0.05.

For the altered application environment, we use the proposed model for text classification. RFD models can easily be used for ranking documents by using the rank function defined in Eq. (3) with the term weight w(t) defined in Section 4.2. To apply Eq. (3) to binary text classification, we require a threshold τ to determine relevance (rank(d) ≥ τ) and non-relevance (rank(d) < τ). We call this kind of classifier RFDτ. Let τ+ = min{rank(d) | d ∈ D+}, τ− = max{rank(d) | d ∈ D−}, and τ = min{τ+, τ−}. To avoid bias, we use balanced testing sets for each topic by randomly selecting five equivalent negative subsets to match the positive set. We also use two other well-known classifiers, SVM and SMO (Sequential Minimal Optimization, for training SVM), with their LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), and selected the best result for each classifier for this comparison. Table 11 shows the results, where Accm and AccM are the micro accuracy and macro accuracy, respectively.

These experiments show that the performance of the proposed model is satisfactory for both altered training sets and an altered application environment. These results also support Hypothesis H3.

6.6 Discussion

The proposed model has three major steps: feature discovery and deploying, term classification, and term weighting. Offender selection plays an important role in using negative feedback in the process of feature discovery and deploying.
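The RFDτ thresholding used in the classification experiment above can be sketched as follows. The τ+, τ− and τ definitions follow the text; for illustration we assume rank(d) simply sums the weights w(t) of the terms appearing in d (the paper's actual Eq. (3) is defined elsewhere), and all function names are hypothetical.

```python
def rank(doc_terms, w):
    """Rank a document by summing the weights of its terms
    (an illustrative stand-in for the rank function of Eq. (3))."""
    return sum(w.get(t, 0.0) for t in doc_terms)

def fit_threshold(pos_docs, neg_docs, w):
    """tau+ = min rank over relevant docs, tau- = max rank over irrelevant
    docs, and tau = min(tau+, tau-), as defined for the RFD-tau classifier."""
    tau_pos = min(rank(d, w) for d in pos_docs)
    tau_neg = max(rank(d, w) for d in neg_docs)
    return min(tau_pos, tau_neg)

def classify(doc_terms, w, tau):
    """rank(d) >= tau means relevant; rank(d) < tau means non-relevant."""
    return rank(doc_terms, w) >= tau
```

Taking τ = min{τ+, τ−} makes the decision boundary conservative: every training-set relevant document, and every irrelevant document scoring below the worst offender, is classified consistently with its feedback label.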
In this section, we first discuss the issue of offender selection. We then discuss other issues for the proposed model, such as term classification and specificity.

6.6.1 Offender Selection

We believe that positive feedback is more constructive than negative feedback, since the objective of relevance feature discovery is to find relevant knowledge. However, we also believe that negative feedback contains some useful information that can help to identify the boundary between relevant and irrelevant information, improving the effectiveness of relevance feature discovery. The obvious problem with using irrelevant documents is that most of them are not close to the given topic, because of the very large amount of negative information. Therefore, it is necessary to choose some useful irrelevant documents (offenders) to decide the groups of terms for the three categories [32].

TABLE 7
Results of Rocchio Model on Six Sliding Windows for All Assessor Topics

Model      top-20  b/p       MAP       Fβ=1      IAP
Rocchio-1  0.525   0.444474  0.458249  0.448621  0.476696
Rocchio-2  0.495   0.444119  0.454437  0.448007  0.474435
Rocchio-3  0.505   0.455495  0.463649  0.449008  0.485906
Rocchio-4  0.497   0.450539  0.460866  0.448778  0.483619
Rocchio-5  0.497   0.441519  0.449421  0.441622  0.472068
Rocchio-6  0.479   0.428432  0.443400  0.439774  0.466213
AVG        0.500   0.444096  0.455004  0.445968  0.476490
Rocchio    0.501   0.4240    0.4400    0.4333    0.4590

TABLE 8
Results of RFD1 Model on Six Sliding Windows for All Assessor Topics

Model   top-20  b/p     MAP    Fβ=1    IAP
RFD1-1  0.585   0.495   0.513  0.483   0.532
RFD1-2  0.565   0.491   0.512  0.485   0.529
RFD1-3  0.581   0.486   0.507  0.479   0.528
RFD1-4  0.575   0.499   0.518  0.484   0.540
RFD1-5  0.558   0.476   0.497  0.470   0.518
RFD1-6  0.547   0.475   0.498  0.473   0.519
AVG     0.569   0.487   0.508  0.479   0.528
RFD1    0.557   0.4724  0.493  0.4696  0.5125

TABLE 9
Results of RFD2 Model on Six Sliding Windows for All Assessor Topics

Model   top-20  b/p    MAP    Fβ=1   IAP
RFD2-1  0.582   0.497  0.513  0.484  0.533
RFD2-2  0.563   0.496  0.513  0.486  0.530
RFD2-3  0.577   0.483  0.504  0.478  0.525
RFD2-4  0.569   0.493  0.516  0.483  0.537
RFD2-5  0.555   0.476  0.494  0.468  0.514
RFD2-6  0.556   0.478  0.499  0.473  0.520
AVG     0.567   0.487  0.507  0.479  0.527
RFD2    0.561   0.473  0.493  0.470  0.514

TABLE 10
p-Values for Comparing RFD1 and RFD2

Loop    top-20  b/p    MAP    Fβ=1   IAP
Loop-1  0.472   0.465  0.766  0.526  0.509
Loop-2  0.569   0.113  0.622  0.404  0.695
Loop-3  0.522   0.349  0.384  0.546  0.369
Loop-4  0.224   0.096  0.243  0.426  0.283
Loop-5  0.411   0.993  0.137  0.182  0.061
Loop-6  0.083   0.196  0.628  0.911  0.678
AVG     0.380   0.369  0.463  0.499  0.433

TABLE 11
Results of RFD-Based Classifier with Threshold τ

Model           Macro-Average AccM  Micro-Average Accm
RFDτ            0.682               0.701
SVM linear      0.611               0.656
SMO polynomial  0.616               0.661
%chg            +11.62%             +6.05%
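The offender-selection idea above — rank the irrelevant documents and keep only the most topic-like ones — can be sketched as follows. This is an illustrative sketch, not the paper's algorithm: the scoring of an irrelevant document by its overlap with positive feature weights, and all names, are assumptions.

```python
def select_offenders(neg_docs, pos_weights, k):
    """Pick the k irrelevant documents that most strongly match the positive
    features (hypothetical scoring); these 'offenders' are the negative
    documents closest to the topic boundary."""
    def score(doc):
        # how much positive-feature weight the irrelevant document carries
        return sum(pos_weights.get(t, 0.0) for t in doc)
    return sorted(neg_docs, key=score, reverse=True)[:k]
```

With k = |D+| // 2, as in the best-performing setting reported below, only a small fraction of the negative feedback needs to be processed.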
Table 12 shows the performance for different K values, where K = |D+|/2 obtained the best performance. Table 12a shows the average number of relevant documents, irrelevant documents, offenders and extracted terms in the training sets (where |D−| > |D+| > |D+|/2); Table 12b shows the performance and the average term weight for the three categories. The results illustrate that the larger the value of K, the lower the performance and the term weights of T+ and G.

Another advantage of offender selection is that it reduces the space of negative relevance feedback. Table 12a clearly shows that only 15.8% = 6.54/41.3 of the irrelevant documents are selected as offenders for the best performance. In summary, the experimental results support the strategy of offender selection used in the proposed model. We therefore conclude that the proposed method for offender selection in RFD meets the design objectives.

6.6.2 Term Classification and Specificity

Terms can be grouped based on the spe function and either the classification rules or the feature clustering method. RFD1 uses two thresholds to decide the categories of terms. It obtained satisfactory performance; however, it requires prior knowledge and more effort to set the right values for the parameters. RFD2 uses the feature clustering technique to group terms into three categories adaptively for each topic. In this section, we mainly discuss the results of using RFD2.

Table 13 shows the statistical information for both RFD2 and PTM. The average number of terms that PTM extracted was 156.9, and all those terms were used as a single group. RFD2 groups terms into three categories, and the number of terms in the positive specific and general categories together is reduced to 46.64 = 24.24 + 22.4; that is, only 29.73 percent were retained in RFD, so about 70.27% = 100% − 29.73% of the extracted PTM terms are possibly noisy terms.
The percentage of general terms is 48.03% = 22.4/(22.4 + 24.24) (see Table 13a). General terms frequently appear not only in relevant documents but also in some irrelevant documents. To further reduce the side effects of using general terms, RFD2 adds some negative specific terms (T−).

We believe that positive specific terms (with large specificity values) are more interesting than general terms (with small specificity values) for a given topic. As shown in Table 13a, PTM assigned 66.92% = 2.5952/(1.28273 + 2.5952) of the weight to positive specific terms and 33.08 percent to general terms. RFD2 increased the weights of positive specific terms: it assigned 25.55% = 1.28273/(1.28273 + 3.73828) to general terms and 74.45 percent to positive specific terms (see Table 13b).

Fig. 2 shows that using only positive specific terms (T+) achieves a much better result than using only general terms (G). It is also recommended to use both the positive specific terms and the general terms (T+ ∪ G), which can significantly improve effectiveness. This recommendation is also suggested by the SAGE model [12], where a topic model explicitly considers the background signal (like the neutral (G) cluster).

In summary, the use of negative feedback is significant for RFD models. It can balance the percentages of positive specific terms and general terms, largely reducing noise. The experimental results demonstrate that we can roughly choose the same number of positive specific terms and general terms, and assign large weights to the positive specific terms. These results support Hypothesis H2.

TABLE 12
Statistical Information for RFD2 with Different Values of K

(a)
         Average number of training documents   Average number of extracted terms
K        Relevant  Irrelevant  Offenders        T+     G      T−
|D+|/2   12.78     41.3        6.54             24.24  22.4   231.04
|D+|     12.78     41.3        10.08            28.94  24.68  267.38
|D−|     12.78     41.3        38.92            31.78  8.46   521.64

(b)
         Average weight of extracted terms
K        w(T+)   w(G)    w(T−)     top-20  MAP    Fβ=1
|D+|/2   3.7383  1.2827  -0.3328   0.561   0.493  0.470
|D+|     3.3044  1.2227  -3.1947   0.542   0.463  0.451
|D−|     2.6307  0.4602  -69.9437  0.274   0.278  0.295

TABLE 13
Statistical Information for Both RFD2 and PTM

(a)
Average number of extracted terms used in RFD   Average weight(t) in PTM
T+     G     T−        w(T+)   w(G)     w(T−)
24.24  22.4  231.04    2.5952  1.28273  0.68486

(b)
Average weight(t) in RFD           Terms extracted from D+ used in PTM
w(T+)    w(G)     w(T−)            |T|     w(T)
3.73828  1.28273  -0.33275         156.9   1.45210

7 CONCLUSION

This research proposes an alternative approach for relevance feature discovery in text documents. It presents a method to find and classify low-level features based on both their appearances in higher-level patterns and their specificity. It also introduces a method to select irrelevant documents for weighting features. In this paper, we continued to develop the RFD model and experimentally showed that the proposed specificity function is reasonable and that the term classification can be effectively approximated by a feature clustering method.

The first RFD model uses two empirical parameters to set the boundaries between the categories. It achieves the expected performance, but it requires manually testing a large number of different parameter values. The new model uses a feature clustering technique to automatically group terms into the three categories. Compared with the first model, the new model is much more efficient and achieves satisfactory performance as well.

This paper also includes a set of experiments on RCV1 (TREC topics), Reuters-21578 and the LCSH ontology. These experiments illustrate that the proposed model achieves the best performance compared with both term-based and pattern-based baseline models. The results also show that the term classification can be effectively approximated by the proposed feature clustering method, that the proposed spe function is reasonable, and that the proposed models are robust. The proposed model was thoroughly tested, and the results show that its improvements are statistically significant.
The paper also shows that the use of irrelevance feedback is significant for improving the performance of relevance feature discovery models. It provides a promising methodology for developing effective text mining models for relevance feature discovery based on both positive and negative feedback.

ACKNOWLEDGMENTS

This paper was partially supported by Grant DP140103157 from the Australian Research Council (ARC Discovery Project). Y. Li is the corresponding author.

REFERENCES

[1] M. Aghdam, N. Ghasem-Aghaee, and M. Basiri, "Text feature selection using ant colony optimization," Expert Syst. Appl., vol. 36, pp. 6843–6853, 2009.
[2] A. Algarni and Y. Li, "Mining specific features for acquiring user information needs," in Proc. Pacific Asia Knowl. Discovery Data Mining, 2013, pp. 532–543.
[3] A. Algarni, Y. Li, and Y. Xu, "Selected new training documents to update user profile," in Proc. Int. Conf. Inf. Knowl. Manage., 2010, pp. 799–808.
[4] N. Azam and J. Yao, "Comparison of term frequency and document frequency based feature selection metrics in text categorization," Expert Syst. Appl., vol. 39, no. 5, pp. 4760–4768, 2012.
[5] R. Bekkerman and M. Gavish, "High-precision phrase-based document classification on a modern scale," in Proc. 11th ACM SIGKDD Knowl. Discovery Data Mining, 2011, pp. 231–239.
[6] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, nos. 1/2, pp. 245–271, 1997.
[7] C. Buckley, G. Salton, and J. Allan, "The effect of adding relevance information in a relevance feedback environment," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 1994, pp. 292–300.
[8] G. Cao, J.-Y. Nie, J. Gao, and S. Robertson, "Selecting good expansion terms for pseudo-relevance feedback," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 243–250.
[9] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Comput. Electr. Eng., vol.
40, pp. 16–28, 2014.
[10] B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Reading, MA, USA: Addison-Wesley, 2009.
[11] F. Debole and F. Sebastiani, "An analysis of the relative hardness of Reuters-21578 subsets," J. Amer. Soc. Inf. Sci. Technol., vol. 56, no. 6, pp. 584–596, 2005.
[12] J. Eisenstein, A. Ahmed, and E. P. Xing, "Sparse additive generative models of text," in Proc. Annu. Int. Conf. Mach. Learn., 2011, pp. 274–281.
[13] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289–1305, 2003.
[14] Y. Gao, Y. Xu, and Y. Li, "Topical pattern based document modelling and relevance ranking," in Proc. 15th Int. Conf. Web Inf. Syst. Eng., 2014, pp. 186–201.
[15] X. Geng, T.-Y. Liu, T. Qin, A. Arnold, H. Li, and H.-Y. Shum, "Query dependent ranking using k-nearest neighbor," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 115–122.
[16] A. Genkin, D. D. Lewis, and D. Madigan, "Large-scale Bayesian logistic regression for text categorization," Technometrics, vol. 49, no. 3, pp. 291–304, 2007.
[17] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, no. 1, pp. 1157–1182, 2003.
[18] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2000, pp. 1–12.
[19] Y.-F. Huang and S.-Y. Lin, "Mining sequential patterns using graph search techniques," in Proc. Annu. Int. Conf. Comput. Softw. Appl., 2003, pp. 4–9.
[20] G. Ifrim, G. Bakir, and G. Weikum, "Fast logistic regression for text categorization with variable-length n-grams," in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2008, pp. 354–362.
[21] N. Jindal and B. Liu, "Identifying comparative sentences in text documents," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2006, pp. 244–251.
[22] T.
Joachims, "Transductive inference for text classification using support vector machines," in Proc. Annu. Int. Conf. Mach. Learn., 1999, pp. 200–209.
[23] T. Joachims, "Optimizing search engines using clickthrough data," in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2002, pp. 133–142.

Fig. 2. Comparison of different combinations of categories of terms for RFD2.
[24] K. Sparck Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: Development and comparative experiments," Inf. Process. Manage., vol. 36, no. 6, pp. 779–808, 2000.
[25] R. Lau, P. Bruza, and D. Song, "Towards a belief-revision-based adaptive and context-sensitive information retrieval system," ACM Trans. Inf. Syst., vol. 26, no. 2, pp. 1–38, 2008.
[26] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004.
[27] X. Li and B. Liu, "Learning to classify texts using positive and unlabeled data," in Proc. 18th Int. Joint Conf. Artif. Intell., 2003, pp. 587–592.
[28] X.-L. Li, B. Liu, and S.-K. Ng, "Learning to classify documents with only a small positive training set," in Proc. 18th Eur. Conf. Mach. Learn., 2007, pp. 201–213.
[29] Y. Li, D. F. Hsu, and S. M. Chung, "Combination of multiple feature selection methods for text categorization by using combinational fusion analysis and rank-score characteristic," Int. J. Artif. Intell. Tools, vol. 22, no. 2, p. 1350001, 2013.
[30] Y. Li, A. Algarni, S.-T. Wu, and Y. Xue, "Mining negative relevance feedback for information filtering," in Proc. Web Intell. Intell. Agent Technol., 2009, pp. 606–613.
[31] Y. Li, A. Algarni, and Y. Xu, "A pattern mining approach for information filtering systems," Inf. Retrieval, vol. 14, pp. 237–256, 2011.
[32] Y. Li, A. Algarni, and N. Zhong, "Mining positive and negative patterns for relevance feature discovery," in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2010, pp. 753–762.
[33] Y. Li and N. Zhong, "Mining ontology for automatically acquiring web user information needs," IEEE Trans. Knowl. Data Eng., vol. 18, no. 4, pp. 554–568, Apr. 2006.
[34] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Y. Lau, "A two-stage text mining model for information filtering," in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2008, pp.
1023–1032.
[35] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Y. Lau, "Two-stage decision model for information filtering," Decision Support Syst., vol. 52, no. 3, pp. 706–716, 2012.
[36] X. Ling, Q. Mei, C. Zhai, and B. Schatz, "Mining multi-faceted overviews of arbitrary topics in a text collection," in Proc. 14th ACM SIGKDD Knowl. Discovery Data Mining, 2008, pp. 497–505.
[37] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2009.
[38] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[39] D. Metzler and W. B. Croft, "Latent concept expansion using Markov random fields," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2007, pp. 311–318.
[40] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth," in Proc. Int. Conf. Data Eng., 2001, pp. 215–224.
[41] R. K. Pon, A. F. Cardenas, D. Buttler, and T. Critchlow, "Tracking multiple topics for finding interesting articles," in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2007, pp. 560–569.
[42] S. Quiniou, P. Cellier, T. Charnois, and D. Legallois, "What about sequential data mining techniques to identify linguistic patterns for stylistics?" in Computational Linguistics and Intelligent Text Processing. New York, NY, USA: Springer, 2012, pp. 166–177.
[43] S. Robertson, H. Zaragoza, and M. Taylor, "Simple BM25 extension to multiple weighted fields," in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2004, pp. 42–49.
[44] S. E. Robertson and I. Soboroff, "The TREC 2002 filtering track report," in Proc. 11th Text Retrieval Conf., 2002.
[45] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[46] S. Scott and S.
Matwin, "Feature engineering for text classification," in Proc. Annu. Int. Conf. Mach. Learn., 1999, pp. 379–388.
[47] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[48] M. Seno and G. Karypis, "SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint," in Proc. 2nd IEEE Conf. Data Mining, 2002, pp. 418–425.
[49] R. Sharma and S. Raman, "Phrase-based text representation for managing the web documents," in Proc. Int. Conf. Inf. Technol.: Coding Comput., 2003, pp. 165–169.
[50] S. Shehata, F. Karray, and M. Kamel, "Enhancing text clustering using concept-based mining model," in Proc. 2nd IEEE Conf. Data Mining, 2006, pp. 1043–1048.
[51] S. Shehata, F. Karray, and M. Kamel, "A concept-based model for enhancing text categorization," in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2007, pp. 629–637.
[52] I. Soboroff and S. Robertson, "Building a filtering test collection for TREC 2002," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2003, pp. 243–250.
[53] F. Song and W. B. Croft, "A general language model for information retrieval," in Proc. ACM Conf. Inf. Knowl. Manage., 1999, pp. 316–321.
[54] Q. Song, J. Ni, and G. Wang, "A fast clustering-based feature subset selection algorithm for high-dimensional data," IEEE Trans. Knowl. Data Eng., vol. 25, no. 1, pp. 1–14, Jan. 2013.
[55] X. Tao, Y. Li, and N. Zhong, "A personalized ontology model for web information gathering," IEEE Trans. Knowl. Data Eng., vol. 23, no. 4, pp. 496–511, Apr. 2011.
[56] R. Tibshirani, "Regression shrinkage and selection via the Lasso: A retrospective," J. Royal Stat. Soc. B, vol. 73, pp. 273–282, 2011.
[57] R. Tibshirani, "Regression shrinkage and selection via the Lasso," J. Royal Stat. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
[58] X. Wang, H. Fang, and C. Zhai, "A study of methods for negative relevance feedback," in Proc.
Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 219–226.
[59] S.-T. Wu, Y. Li, and Y. Xu, "Deploying approaches for pattern refinement in text mining," in Proc. IEEE Conf. Data Mining, 2006, pp. 1157–1161.
[60] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic pattern-taxonomy extraction for web mining," in Proc. Int. Conf. Web Intell., 2004, pp. 242–248.
[61] Z. Xu and R. Akella, "Active relevance feedback for difficult queries," in Proc. ACM Conf. Inf. Knowl. Manage., 2008, pp. 459–468.
[62] G.-R. Xue, D. Xing, Q. Yang, and Y. Yu, "Deep classification in large-scale text hierarchies," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 619–626.
[63] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, and M. Sugiyama, "High-dimensional feature selection by feature-wise kernelized Lasso," Neural Comput., vol. 26, no. 1, pp. 185–207, 2014.
[64] X. Yan, H. Cheng, J. Han, and D. Xin, "Summarizing itemset patterns: A profile-based approach," in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2005, pp. 314–323.
[65] C. C. Yang, "Search engines information retrieval in practice," J. Amer. Soc. Inf. Sci. Technol., vol. 61, pp. 430–430, 2010.
[66] Y. Yang, "An evaluation of statistical approaches to text categorization," Inf. Retrieval, vol. 1, pp. 69–90, 1999.
[67] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proc. Annu. Int. Conf. Mach. Learn., 1997, pp. 412–420.
[68] M. J. Zaki, "SPADE: An efficient algorithm for mining frequent sequences," Mach. Learn., vol. 42, pp. 31–60, 2001.
[69] Z. Zhao, L. Wang, H. Liu, and J. Ye, "On similarity preserving feature selection," IEEE Trans. Knowl. Data Eng., vol. 25, no. 3, pp. 619–632, Mar. 2013.
[70] N. Zhong, Y. Li, and S.-T. Wu, "Effective pattern discovery for text mining," IEEE Trans. Knowl. Data Eng., vol. 24, no. 1, pp. 30–44, Jan. 2012.
[71] S. Zhu, X. Ji, W.
Xu, and Y. Gong, "Multi-labelled classification using maximum entropy method," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2005, pp. 1041–1048.
Yuefeng Li is a full professor in the School of Electrical Engineering and Computer Science, Queensland University of Technology, Australia. He has published more than 150 refereed papers (including 43 journal papers). He has demonstrable experience in leading large-scale research projects and has achieved many established research outcomes that have been published and highly cited in top data mining journals and conferences (highest citations per paper = 188). He is the managing editor of Web Intelligence and Agent Systems and an associate editor of the International Journal of Pattern Recognition and Artificial Intelligence.

Abdulmohsen Algarni received the PhD degree from Queensland University of Technology, Australia, in 2012. He was a research associate in the School of Electrical Engineering and Computer Science, Queensland University of Technology, Australia, in 2012. He is currently an assistant professor in the College of Computer Science, King Khalid University. His research interests include text mining and information filtering.

Mubarak Albathan received the MSc degree in network computing from Monash University, Australia, in 2009. He is currently working toward the PhD degree in the School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia. His research interests include feature selection and Web intelligence.

Yan Shen received the PhD degree from the Queensland University of Technology, Australia, in 2013. He is a research associate in the School of Electrical Engineering and Computer Science, Queensland University of Technology, Australia. His research interests include ontology learning and text mining.

Moch Arif Bijaksana received the master's degree from RMIT University, Australia. He is currently working toward the PhD degree in the School of Electrical Engineering and Computer Science, Queensland University of Technology, Australia. He is working at Telkom University, Indonesia.
His research interests include text classification and knowledge discovery.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.