Relevance Feature Discovery for Text Mining
Yuefeng Li, Abdulmohsen Algarni, Mubarak Albathan, Yan Shen, and Moch Arif Bijaksana
Abstract—It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences because of the large numbers of terms and patterns involved. Most existing popular text mining and classification methods have adopted term-based approaches. However, they all suffer from the problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-based methods should perform better than term-based ones in describing user preferences; yet how to effectively use large-scale patterns remains a hard problem in text mining. To make a breakthrough on this challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both positive and negative patterns in text documents as higher-level features and deploys them over low-level features (terms). It also classifies terms into categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods and the pattern-based methods.
Index Terms—Text mining, text feature extraction, text classification
1 INTRODUCTION
The objective of relevance feature discovery (RFD) is to find the useful features available in text documents, including both relevant and irrelevant ones, for describing text mining results. This is a particularly challenging task in modern information analysis, from both an empirical and a theoretical perspective [33], [36]. This problem is also of central interest in many Web personalized applications, and has received attention from researchers in the Data Mining, Machine Learning, Information Retrieval and Web Intelligence communities [32].
There are two challenging issues in using pattern mining techniques for finding relevance features in both relevant and irrelevant documents [32]. The first is the low-support problem. Given a topic, long patterns are usually more specific to the topic, but they usually appear in documents with low support or frequency. If the minimum support is decreased, a lot of noisy patterns can be discovered. The second issue is the misinterpretation problem: the measures used in pattern mining (e.g., "support" and "confidence") turn out to be unsuitable when using patterns to solve problems. For example, a highly frequent pattern (normally a short pattern) may be a general pattern, since it can be frequently used in both relevant and irrelevant documents. Hence, the difficult problem is how to use discovered patterns to accurately weight useful features.
There are several existing methods for addressing these two challenging issues in text mining. Pattern taxonomy mining (PTM) models have been proposed [59], [60], [70], which mine closed sequential patterns in text paragraphs and deploy them over a term space to weight useful features. A concept-based model (CBM) [50], [51] has also been proposed to discover concepts by using natural language processing (NLP) techniques; it uses verb-argument structures to find concepts in sentences. These pattern-based (or concept-based) approaches have shown an important improvement in effectiveness [70]. However, the improvements over the best term-based methods are less significant because how to effectively integrate patterns in both relevant and irrelevant documents is still an open problem.
Over the years, people have developed many mature term-based techniques for ranking documents, information filtering and text classification [37], [39], [44]. Recently, several hybrid approaches were proposed for text classification. To learn term features from only relevant documents and unlabelled documents, [27] used two term-based models. In the first stage, it utilized a Rocchio classifier to extract a set of reliable irrelevant documents from the unlabeled set. In the second stage, it built an SVM classifier to classify text documents. A two-stage model was also proposed in [34], [35], which showed that the integration of rough analysis (a term-based model) and pattern taxonomy mining is an effective way to design a two-stage model for information filtering systems.
For many years, we have observed that many terms with larger weights are more general because they are likely to be frequently used in both relevant and irrelevant documents [32]. For example, the word "LIB" may be more frequently used than the word "JDK"; but "JDK" is more specific than "LIB" for describing "Java Programming Languages"; and "LIB" is more general than "JDK" because "LIB" is also frequently used in other programming languages like C or
Y. Li, A. Algarni, Y. Shen, and M. Bijaksana are with the School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia. E-mail: y2.li@qut.edu.au, {algarni.abdulmohsen, arifbijaksana}@gmail.com, y1.shen@student.qut.edu.au.
M. Albathan is with the School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, QLD 4001, Australia, and the Al Imam Mohammad Ibn Saud Islamic University, P.O. Box 5701, Riyadh 11432, Saudi Arabia. E-mail: mubarak.albathan@student.qut.edu.au.
Manuscript received 2 May 2013; revised 1 Nov. 2014; accepted 4 Nov. 2014. Date of publication 23 Nov. 2014; date of current version 27 Apr. 2015.
Recommended for acceptance by P. G. Ipeirotis.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TKDE.2014.2373357
1656 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015
1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
C++. Therefore, we recommend considering both terms' distributions and specificities for relevance feature discovery.
Given a topic, a term's specificity describes the extent to which the term focuses on the topic that users want [33]. However, it is very difficult to measure the specificity of terms because a term's specificity depends on users' perspectives of their information needs [55]. We proposed the first definition of the specificity in [30], [31], which calculated the specificity score of a term based on its appearance in discovered positive and negative patterns. However, this definition required an iterative algorithm (three loops) in order to weight terms accurately.
In order to make a breakthrough in relation to the two challenging issues, we proposed the first version of the RFD model in [32]. In accordance with the distributions of terms in a training set, it provided a new definition for the specificity function and used two empirical parameters to group terms into three categories: "positive specific terms", "general terms", and "negative specific terms". Based on these definitions, the RFD model can accurately evaluate term weights according to both their specificity and their distributions in the higher-level features, where the higher-level features include both positive and negative patterns.

The term classification method proposed in [32] requires manually setting two empirical parameters according to testing sets. In this paper, we continue to develop the RFD model, and experimentally show that the proposed specificity function is reasonable and that the term classification can be effectively approximated by a feature clustering method. We also design a comprehensive approach for evaluating the proposed models. In addition, we conducted new experiments using six new sliding windows to adaptively update the training sets, and also applied the RFD model to binary text classification to test the robustness of the proposed model.
This paper proposes an innovative technique for finding and classifying low-level terms based on both their appearances in the higher-level features (patterns) and their specificity in a training set. It also introduces a method to select irrelevant documents (so-called offenders) that are close to the extracted features in the relevant documents, in order to effectively revise term weights. Compared with other methods, the advantages of the proposed model include:
- effective use of both relevant and irrelevant feedback to find useful features; and
- integration of term and pattern features together rather than using them in two separate stages.
To justify these claims for the proposed approach, we conducted substantial experiments on standard data collections, namely the Reuters Corpus Volume 1 (RCV1), TREC filtering assessor topics, the Library of Congress Subject Headings (LCSH) ontology and Reuters-21578. We also used five measures and the t-test to evaluate these experiments. The results show that the proposed specificity function is adequate, the clustering method is effective and the proposed model is robust. The results also show that, on most measures, the proposed model significantly outperforms both the state-of-the-art term-based methods (underpinned by Okapi BM25, Rocchio, language models and SVM) and the pattern-based methods.
The remainder of this paper is organized as follows. Section 2 gives a detailed overview of the related work. Section 3 reviews the concept of features in text documents. Section 4 discusses the RFD model. Section 5 proposes a new feature clustering method based on the specificity function. To evaluate the performance of the proposed model, we conduct substantial experiments on LCSH, RCV1, TREC filtering topics and Reuters-21578. The empirical results and discussion are reported in Section 6, followed by concluding remarks in the last section.
2 RELATED WORK
Feature selection is a technique that selects a subset of features from data for modeling systems (see http://en.wikipedia.org/wiki/Feature_selection). Over the years, a variety of feature selection methods (e.g., Filter, Wrapper, Embedded and Hybrid approaches, and unsupervised or semi-supervised methods) have been proposed in various fields [6], [9], [17], [54], [69]. Feature selection is also one of the important steps for text classification and information filtering [1], [5], [47], which is the task of assigning documents to predefined classes. To date, many classifiers, such as Naive Bayes, Rocchio, kNN, SVM and Lasso regression [16], [26], [27], [28], [37], [62], [66], have been developed; in addition, many believe that SVM is a promising classifier [13]. Classification problems include the single-class and the multi-class problem. The most common solution [71] to the multi-class problem is to decompose it into several independent binary classifiers, where each binary classifier assigns a document to one of two predefined classes (e.g., the relevant or the irrelevant category). Most traditional text feature selection methods use the bag-of-words representation to select a set of features for the multi-class problem [13]. There are several feature selection criteria for text categorization, including document frequency (DF), the global IDF, information gain, mutual information (MI), Chi-square (χ²) and term strength [1], [29], [37], [45], [67].
In this paper we focus on relevant feature selection in text documents. Relevance is a big research issue [25], [32], [65] for Web search, which concerns a document's relevance to a user or a query. However, the traditional feature selection methods are not effective at selecting text features for the relevance problem because relevance is a single-class problem [13]. An efficient way of selecting features for relevance is based on a feature weighting function. A feature weighting function indicates the degree of information represented by the feature's occurrences in a document and reflects the relevance of the feature. Popular term-based ranking models include tf*idf-based techniques, the Rocchio algorithm, probabilistic models and Okapi BM25 [4], [24], [37], [44].
Recently, one of the important issues for multimedia data has been the identification of the optimal feature set without any redundancy [69]; however, the challenging issue for feature selection in text documents is to identify in which form, and where, the relevant features appear in a document, because of the large amount of noisy information in the document [2]. Text features can be simple structures (words), complex linguistic structures or statistical structures. We mainly discuss three complex structures below for selecting relevant features: n-grams, concepts and patterns.
n-grams (or phrases) are more discriminative and carry more "semantics" than words. They have been useful for building good ranking functions [20], [47], [53]. In [49], a phrase-based text representation for Web document management was also proposed that used rule-based Natural Language Processing and Context Free Grammar techniques. Language models were proposed to calculate weights for n-grams, which are often approximated by Unigram, Bigram or Trigram models for considering word dependencies [8], [39], [53], [58]. A concept-based model [50], [51] was also presented to find concepts in text documents by using NLP techniques, which analyzed terms' associations based on the semantic structure of sentences. This model included three components. The first one analyzed the semantic structure of sentences; the second one then constructed a conceptual ontological graph (COG) to represent the semantic structures; and the last one found top concepts according to the first two components to generate feature vectors by using the standard vector space model.
Pattern mining has been extensively studied in the data mining community for many years. A variety of efficient algorithms, such as Apriori-like algorithms, PrefixSpan, FP-tree, SPADE, SLPMiner and GST, have been proposed [18], [19], [40], [48], [68]. Pattern post-processing methods were also proposed to compress or group patterns into clusters [64]. However, interpreting useful patterns for text mining remains an open problem [32]. Typically, text mining discusses terms' associations at a broad-spectrum level, paying little attention to labeled information and duplications of terms [33], [34]. Usually, the existing text mining techniques return numerous patterns (sets of terms) in text documents. Not surprisingly, many patterns are redundant or noisy. Therefore, the challenging issue is how to effectively deal with the very large set of patterns and terms with a lot of redundant or noisy information [32].
To reduce the quantity of redundant information, closed patterns have turned out to be a good alternative to phrases [21], [60]. To effectively use closed patterns for weighting terms, a pattern deploying method was proposed in [59] to map closed patterns into a term vector that includes a set of terms and a term-weight distribution. This method has also shown encouraging improvements in effectiveness in comparison with traditional IR models [3], [32], [34].
The big obstacle for pattern-mining-based approaches to text mining is how to effectively use both relevant and irrelevant feedback. In [70], a pattern deploying method was proposed to update positive patterns; however, the improvement in effectiveness was not significant. In regard to the aforementioned problem of redundancy and noise, another challenging issue for pattern-based methods is how to deal with low-frequency patterns [32]. By way of illustration, a short pattern (normally one with large support, also called a highly frequent pattern) is usually a general pattern, while a long pattern (a low-frequency pattern with small support) could be a specific one. Recently, a clustering-based feature subset selection method has been presented to group features into clusters to reduce dimensionality [54]. Another interesting idea is to identify interesting features in LDA topics [14].
In summary, the existing methods for finding relevance features can be grouped into three approaches [32]. The first approach tries to diminish the weights of terms that appear in both relevant documents and irrelevant documents (e.g., Rocchio-based models [41]). This heuristic is obvious if we assume that terms are isolated atoms. The second one is based on how often features appear or do not appear in relevant and irrelevant documents (e.g., probabilistic models [61] or BM25 [43], [44]). The third one is based on finding features through positive patterns [32], [59], [60]. The proposed model further develops the third approach by grouping features into three categories: "positive specific features", "general features", and "negative specific features".
3 DEFINITIONS
For a given topic, the goal of relevance feature discovery in text documents is to find a set of useful features, including patterns, terms and their weights, in a training set D, which consists of a set of relevant documents, D+, and a set of irrelevant documents, D−. In this paper, we assume that all text documents d are split into paragraphs, PS(d). In this section, we introduce the basic definitions of patterns and the deploying method. These definitions can also be found in [32], [34], [59].
3.1 Frequent and Closed Patterns
Let T1 = {t1, t2, ..., tm} be a set of terms (or words) extracted from D+, and let a termset X be a set of terms. For a given document d, coverset(X) is called the covering set of X in d, which includes all paragraphs dp ∈ PS(d) such that X ⊆ dp, i.e., coverset(X) = {dp | dp ∈ PS(d), X ⊆ dp}. Its absolute support is the number of occurrences of X in PS(d), that is, sup_a(X) = |coverset(X)|. Its relative support is the fraction of the paragraphs that contain the pattern, that is, sup_r(X) = |coverset(X)| / |PS(d)|. A termset X is called a frequent pattern if its sup_a (or sup_r) ≥ min_sup, a given minimum support.

It is obvious that a termset X can be mapped to a set of paragraphs coverset(X). We can also map a set of paragraphs Y ⊆ PS(d) to a termset, which satisfies

  termset(Y) = {t | ∀dp ∈ Y ⇒ t ∈ dp}.

A pattern X (also a termset) is called closed if and only if X = termset(coverset(X)).

Let X be a closed pattern. We have

  sup_a(X1) < sup_a(X)   (1)

for all patterns X1 ⊃ X.

All closed patterns can be structured into a pattern taxonomy by using the subset (or is-a) relation [59].
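As a toy illustration of these definitions, the following Python sketch computes covering sets, absolute and relative supports, and the closedness test over an invented four-paragraph document; all terms and paragraphs are made-up examples, not data from the paper.

```python
# Toy illustration of coverset, support and closedness (Section 3.1).

def coverset(X, paragraphs):
    """All paragraphs (term sets) that contain every term of termset X."""
    return [dp for dp in paragraphs if X <= dp]

def sup_a(X, paragraphs):
    """Absolute support: number of covering paragraphs."""
    return len(coverset(X, paragraphs))

def sup_r(X, paragraphs):
    """Relative support: fraction of paragraphs covered."""
    return sup_a(X, paragraphs) / len(paragraphs)

def termset(Y):
    """Terms shared by every paragraph in Y."""
    return set.intersection(*Y) if Y else set()

def is_closed(X, paragraphs):
    return X == termset(coverset(X, paragraphs))

# PS(d): an invented document split into four paragraphs
PS = [{"java", "jdk", "lib"}, {"java", "jdk"}, {"java", "lib"}, {"c", "lib"}]

assert sup_a({"java", "jdk"}, PS) == 2
assert sup_r({"java"}, PS) == 0.75
# {"jdk"} is not closed: every paragraph containing "jdk" also contains "java"
assert not is_closed({"jdk"}, PS)
assert is_closed({"java", "jdk"}, PS)
```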
3.2 Closed Sequential Patterns
A sequential pattern s = ⟨t1, ..., tr⟩ (ti ∈ T1) is an ordered list of terms. A sequence s1 = ⟨x1, ..., xi⟩ is called a sub-sequence of another sequence s2 = ⟨y1, ..., yj⟩, denoted by s1 ⊑ s2, iff there exist j1, ..., ji such that 1 ≤ j1 < j2 < ... < ji ≤ j and x1 = y_{j1}, x2 = y_{j2}, ..., xi = y_{ji}. Given s1 ⊑ s2, we call s1 a sub-pattern of s2, and s2 a super-pattern of s1. In the following, we refer to sequential patterns as patterns.
Given a sequential pattern X in document d, coverset(X) is still used to describe the covering set of X, which includes all paragraphs ps ∈ PS(d) such that X ⊑ ps, i.e., coverset(X) = {ps | ps ∈ PS(d), X ⊑ ps}. Its absolute support and relative support are defined the same as for the normal patterns. A sequential pattern X is called a frequent pattern if its relative support ≥ min_sup. The property of closed patterns (see Eq. (1)) is used to define closed sequential patterns: a frequent sequential pattern X is closed if sup_a(X1) ≠ sup_a(X) for any super-pattern X1 of X.
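The sub-sequence relation can be checked with a short sketch: each term of s1 must occur in s2 in the same order, with gaps allowed. The sequences below are invented examples.

```python
# Minimal check of the sub-sequence relation s1 ⊑ s2 (Section 3.2).

def is_subsequence(s1, s2):
    it = iter(s2)
    # each term of s1 must be found in what remains of s2, in order
    return all(t in it for t in s1)

def coverset_seq(X, paragraphs):
    """Paragraphs (as ordered term lists) of which X is a sub-sequence."""
    return [ps for ps in paragraphs if is_subsequence(X, ps)]

assert is_subsequence(["global", "economy"],
                      ["global", "financial", "economy"])  # gaps allowed
assert not is_subsequence(["economy", "global"],
                          ["global", "economy"])           # order matters

PS = [["global", "financial", "economy"], ["economy", "news"]]
assert len(coverset_seq(["global", "economy"], PS)) == 1
```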
3.3 Deploying Higher Level Patterns on Low-Level Terms
For term-based approaches, weighting the usefulness of a given term is based on its appearance in documents. However, for pattern-based approaches, weighting the usefulness of a given term is based on its appearance in discovered patterns.

To improve the efficiency of pattern taxonomy mining, an algorithm, SPMining(D+, min_sup) [60], was proposed (also used in [34], [59]) to find closed sequential patterns for all documents in D+; it uses the well-known Apriori property to reduce the search space. For each relevant document di ∈ D+, the SPMining algorithm discovers all closed sequential patterns, SPi, based on a given min_sup. We do not repeat this algorithm here because it is not the particular focus of this study.

Let SP1, SP2, ..., SP|D+| be the sets of discovered closed sequential patterns for all documents di ∈ D+ (i = 1, ..., n), where n = |D+|. For a given term t, its d_support (deploying support, called weight in this paper) in the discovered patterns is defined as follows:

  d_sup(t, D+) = Σ_{i=1..n} sup_i(t) = Σ_{i=1..n} ( |{p | p ∈ SP_i, t ∈ p}| / Σ_{p ∈ SP_i} |p| ),   (2)

where |p| is the number of terms in p.
After the deploying supports of terms have been computed from the training set, let w(t) = d_sup(t, D+); the following rank function is used to decide the relevance of document d:

  rank(d) = Σ_{t ∈ T} w(t) · τ(t, d),   (3)

where τ(t, d) = 1 if t ∈ d; otherwise τ(t, d) = 0.
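A minimal sketch of Eqs. (2) and (3), assuming two invented documents with hand-picked closed sequential patterns; the pattern sets and the test document are illustrative, not mined output.

```python
# Toy sketch of deploying support (Eq. (2)) and ranking (Eq. (3)).

def d_sup(t, SP_list):
    """Eq. (2): deploying support of term t over the pattern sets SP_i."""
    total = 0.0
    for SPi in SP_list:
        denom = sum(len(p) for p in SPi)        # sum of pattern lengths
        total += sum(1 for p in SPi if t in p) / denom
    return total

def rank(doc_terms, w):
    """Eq. (3): tau(t, d) = 1 iff t occurs in d."""
    return sum(weight for t, weight in w.items() if t in doc_terms)

SP = [[("jdk", "java"), ("java",)],     # closed patterns from document d1
      [("jdk",), ("lib", "java")]]      # closed patterns from document d2
w = {t: d_sup(t, SP) for t in {"jdk", "java", "lib"}}
assert abs(w["java"] - 1.0) < 1e-9      # 2/3 from d1 + 1/3 from d2
assert abs(rank({"java", "lib"}, w) - (1.0 + 1/3)) < 1e-9
```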
4 RFD MODEL
In this section, we introduce the RFD model for relevance feature discovery, which describes the relevant features in relation to three groups: positive specific terms, general terms and negative specific terms, based on their appearances in a training set. We first discuss the concept of "specificity" in terms of the relative "specificity" in training datasets and the absolute "specificity" in a domain ontology. We also present a way to check whether the proposed relative "specificity" is reasonable in terms of the absolute "specificity". Finally, we introduce the term weighting method in the RFD model.
4.1 Specificity Function
In the RFD model, a term's specificity (referred to as relative specificity in this paper) is defined [32] according to its appearance in a given training set. Let T2 be a set of terms extracted from D−, and let T = T1 ∪ T2. Given a term t ∈ T, its coverage+ is the set of relevant documents that contain t, and its coverage− is the set of irrelevant documents that contain t. We assume that the terms frequently used in both relevant documents and irrelevant documents are general terms. Therefore, we want to classify the terms that are more frequently used in the relevant documents into the positive specific category; the terms that are more frequently used in the irrelevant documents are classified into the negative specific category.

Based on the above analysis, we defined the specificity of a given term t in the training set D = D+ ∪ D− as follows:

  spe(t) = ( |coverage+(t)| − |coverage−(t)| ) / n,   (4)
where coverage+(t) = {d ∈ D+ | t ∈ d}, coverage−(t) = {d ∈ D− | t ∈ d}, and n = |D+|. spe(t) > 0 means that term t is used more frequently in relevant documents than in irrelevant documents.

Based on the spe function, we have the following classification rules for determining the general terms G, the positive specific terms T+ and the negative specific terms T−: G = {t ∈ T | θ1 < spe(t) < θ2}, T+ = {t ∈ T | spe(t) ≥ θ2}, and T− = {t ∈ T | spe(t) ≤ θ1}, where θ2 is an experimental coefficient, the maximum boundary of the specificity for the general terms, and θ1 is also an experimental coefficient, the minimum boundary of the specificity for the general terms. We assume that θ2 > 0 and θ2 ≥ θ1. It is easy to verify that G, T+ and T− are pairwise disjoint. Therefore, {G, T+, T−} is a partition of all terms.
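The spe function and the threshold rules can be sketched as follows; the training documents and the values of θ1 and θ2 are illustrative assumptions, not the settings used in the paper's experiments.

```python
# Toy sketch of Eq. (4) and the threshold-based term classification.

def spe(t, D_pos, D_neg):
    cov_pos = sum(1 for d in D_pos if t in d)   # |coverage+(t)|
    cov_neg = sum(1 for d in D_neg if t in d)   # |coverage-(t)|
    return (cov_pos - cov_neg) / len(D_pos)     # n = |D+|

def classify(t, D_pos, D_neg, theta1=-0.1, theta2=0.3):
    s = spe(t, D_pos, D_neg)
    if s >= theta2:
        return "T+"                             # positive specific
    if s <= theta1:
        return "T-"                             # negative specific
    return "G"                                  # general

D_pos = [{"java", "jdk", "code"}, {"java", "lib", "code"}, {"jdk"}, {"java"}]
D_neg = [{"lib", "c", "code"}, {"lib"}]
assert classify("jdk", D_pos, D_neg) == "T+"    # spe = (2-0)/4 = 0.5
assert classify("lib", D_pos, D_neg) == "T-"    # spe = (1-2)/4 = -0.25
assert classify("code", D_pos, D_neg) == "G"    # spe = (2-1)/4 = 0.25
```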
A term's relative specificity describes the extent to which the term focuses on the topic that users want. It is very difficult to measure the relative specificity of terms because a term's specificity depends on users' perspectives of their information needs [55]. For example, "knowledge discovery" will be a general term in the data mining community; however, it may be a specific term when we talk about information technology.
In this paper, we propose a way to check whether the proposed relative "specificity" is reasonable in terms of the absolute "specificity" in a domain ontology, where "absolute" means the specificity is independent of any training dataset. Normally, people consider terms to be more general if they are frequently used in a very large domain ontology; otherwise, they are more specific. Therefore, we define the absolute specificity of a term in the ontology as follows: spe_onto(t) = 1 / |coverage(t)|, where coverage(t) denotes the set of concepts of subjects that use term t for describing their meaning.
To clearly illustrate the spe values between 0 and 1, we normalize the above equation as follows:

  spe_onto(t) = log10( N / |coverage(t)| ) / log10( N / M ),   (5)

where N is the total number of subjects and M is the maximum of |coverage(t)| for all t ∈ T.
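A toy computation of Eq. (5); N and M here are invented values rather than statistics of the LCSH ontology used later in the paper. The key property illustrated is monotonicity: the more subjects a term covers, the smaller (more general) its absolute specificity score.

```python
import math

# Toy sketch of the normalised absolute specificity (Eq. (5)).

def spe_onto(coverage_t, N, M):
    """N: total number of subjects; M: maximum coverage over all terms."""
    return math.log10(N / coverage_t) / math.log10(N / M)

N, M = 100000, 5000          # invented ontology statistics
# a term covering fewer subjects is more specific, hence scores higher
assert spe_onto(10, N, M) > spe_onto(500, N, M) > spe_onto(M, N, M)
```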
We call a relative spe function reasonable if the average absolute specificity of its positive specific terms (T+) is greater than the average absolute specificity of its general terms (G).
4.2 Weighting Features
To describe relevance features for a given topic, normally we believe that specific terms are very useful for distinguishing the topic from other topics. However, our experiments (see Section 6.6.2) show that using only specific terms is not good enough to improve the performance of relevance feature discovery, because user information needs cannot simply be covered by documents that contain only the specific terms. Therefore, the best way is to use the specific terms mixed with some of the general terms. We discuss this issue in the evaluation section.
To improve the effectiveness, RFD uses irrelevant documents in the training set in order to remove noise. The first issue in using irrelevant documents is how to select a suitable subset of them, since a very large set of negative samples is typically obtained. For example, a Google search can return millions of documents; however, only a few of those documents may be of interest to a Web user. Obviously, it is not efficient to use all of the irrelevant documents.
Most models can rank documents (see the ranking function in Eq. (3)) using a set of extracted features. If an irrelevant document gets a high rank, the document is called an offender [33] because it is a false discovery. The offenders are normally defined as the top-K ranked irrelevant documents. The basic hypothesis in this paper is that relevance features are used to describe relevant documents, and irrelevant documents are used to ensure the discrimination of extracted features. Therefore, RFD selects only some offenders (i.e., the top-K ranked irrelevant documents) rather than using all irrelevant documents. In Section 6.6.1 we discuss the performance of using different K values, where K = n/2 obtained the best performance.
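Offender selection can be sketched as a ranking step over the irrelevant documents; the weights, documents and K below are invented values for illustration.

```python
# Toy sketch of offender selection: rank the irrelevant documents with
# the current term weights (Eq. (3) style) and keep only the top-K.

def rank(doc_terms, w):
    return sum(weight for t, weight in w.items() if t in doc_terms)

def select_offenders(D_neg, w, K):
    ranked = sorted(D_neg, key=lambda d: rank(d, w), reverse=True)
    return ranked[:K]

w = {"java": 1.0, "jdk": 0.8, "lib": 0.3}       # invented weights
D_neg = [{"lib"}, {"java", "lib"}, {"c"}, {"jdk"}]
offenders = select_offenders(D_neg, w, K=2)     # K = n/2 in the paper
assert offenders[0] == {"java", "lib"}          # highest-ranked offender
assert offenders[1] == {"jdk"}
```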
Once we select the top-K irrelevant documents, the set of irrelevant documents D− is reduced to include only the K offenders; therefore, we have |D+| ≥ 2|D−| if K = n/2. The spe function reaches its maximum value, 1, if there is a term t such that coverage−(t) = ∅; and its minimum value, −1/2, if there is a term t such that coverage+(t) = ∅. Let 0 < θ2 < 1; then we can easily verify that −1/2 ≤ θ1 ≤ θ2 < 1 if K = n/2.
The calculation of the original RFD term weighting function [32] includes two steps: initial weight calculation and weight revision. Based on Eq. (2), in this paper we integrate the two steps into the following equation:

  w(t) = d_sup(t, D+) · (1 + spe(t))     if t ∈ T+,
         d_sup(t, D+)                    if t ∈ G,
         d_sup(t, D+) · (1 − |spe(t)|)   if t ∈ T1,
         −d_sup(t, D−) · (1 + |spe(t)|)  otherwise,

where the d_sup function is defined in Eq. (2).
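The integrated weighting equation can be sketched as below, assuming the category, spe score and deploying supports are already computed; all values here are invented, and the cases are evaluated in the order given above (so the last two branches apply only to negative specific terms).

```python
# Toy sketch of the integrated weighting equation of Section 4.2.

def weight(t, cat, spe_t, dsup_pos, dsup_neg, T1):
    if cat == "T+":
        return dsup_pos * (1 + spe_t)
    if cat == "G":
        return dsup_pos
    # remaining terms are negative specific (T-)
    if t in T1:                                  # also extracted from D+
        return dsup_pos * (1 - abs(spe_t))
    return -dsup_neg * (1 + abs(spe_t))          # only seen in D-

T1 = {"jdk", "java", "lib"}                      # terms extracted from D+
assert abs(weight("jdk", "T+", 0.5, 0.6, 0.0, T1) - 0.9) < 1e-9
assert abs(weight("java", "G", 0.1, 0.4, 0.0, T1) - 0.4) < 1e-9
assert abs(weight("lib", "T-", -0.25, 0.4, 0.0, T1) - 0.3) < 1e-9
assert abs(weight("c", "T-", -0.5, 0.0, 0.2, T1) - (-0.3)) < 1e-9
```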
5 TERM CLASSIFICATION
RFD uses both specific features (e.g., T+ and T−) and general features (e.g., G). Therefore, the key research question is how to find the best partition (T+, G, T−) to effectively classify relevant documents and irrelevant documents. For a given set of features, however, this question is an NP-hard problem because of the large number of possible combinations of groups of features. In this section we propose an approximation approach and efficient algorithms to refine the RFD model.
5.1 An Approximation Approach
The best partition (T+, G, T−) is the one that most clearly distinguishes irrelevant documents from relevant ones. Assume that we have two characteristic functions, f1 and f2, on all terms, such that f1(t) is the approximate average weight of t for all relevant documents, and f2(t) is the approximate average weight of t for all irrelevant documents. Then the best partition (T+, G, T−) maximizes the following integration:

  ∫_{t1}^{tn} ( f1(t) − f2(t) ) dt.

The above discussion motivates us to find adequate θ1 and θ2 to move the positive specific features far away from the negative specific features. If we view the terms that have the same specificity score as a cluster and use the spe function as the distance function, the new solution is to find three groups that clearly divide the terms into three categories.
Based on the above analysis, we can develop a clustering method to group terms into three categories automatically for each topic by using the specificity function. In the beginning, we assign terms that appear only in irrelevant documents to the negative specific category T−. For the remaining terms, we initially view each term ti as a single cluster ci. We also represent each cluster ci using an interval [minspe(ci), maxspe(ci)], where minspe(ci) is the smallest spe value of the elements in ci, and maxspe(ci) is the largest spe value of the elements in ci.

Let ci and cj be two clusters. The difference between the two clusters is defined as follows:

  dif(ci, cj) = min{ |maxspe(ci) − minspe(cj)|, |maxspe(cj) − minspe(ci)| }.

A bottom-up approach is used to merge the two clusters that have the minimum difference. Let ck be the merged cluster of ci and cj; then we have ck = ci ∪ cj, minspe(ck) = min{minspe(ci), minspe(cj)} and maxspe(ck) = max{maxspe(ci), maxspe(cj)}.
The merging operation continues until only three clusters are left (if the number of initial clusters is greater than three). The distances between adjacent clusters among the retained three clusters should be greater than or equal to any other distances between two adjacent clusters. The cluster that has the biggest minspe is determined as T+, the cluster that has the second biggest minspe forms category G, and the remainder becomes part of T−.
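A minimal sketch of this bottom-up procedure, with invented spe scores (the full FClustering algorithm, including its T− initialization from the negative patterns, is given in Section 5.2):

```python
# Toy sketch of the bottom-up merging of Section 5.1. Each cluster is an
# interval [minspe, maxspe] plus its terms; the two adjacent clusters
# with the smallest dif are merged until three remain.

def dif(ci, cj):
    return min(abs(ci["max"] - cj["min"]), abs(cj["max"] - ci["min"]))

def cluster_terms(spe_scores, k=3):
    # one singleton cluster per term, sorted by descending spe
    clusters = [{"min": s, "max": s, "terms": [t]}
                for t, s in sorted(spe_scores.items(), key=lambda x: -x[1])]
    while len(clusters) > k:
        # find the adjacent pair with the smallest difference
        i = min(range(len(clusters) - 1),
                key=lambda j: dif(clusters[j], clusters[j + 1]))
        a, b = clusters[i], clusters.pop(i + 1)
        a["min"] = min(a["min"], b["min"])
        a["max"] = max(a["max"], b["max"])
        a["terms"] += b["terms"]
    return clusters       # ordered by descending minspe: T+, then G, T-

spe_scores = {"jdk": 0.9, "java": 0.8, "code": 0.3,
              "program": 0.25, "lib": -0.4, "c": -0.45}
T_pos, G, T_neg = cluster_terms(spe_scores)
assert set(T_pos["terms"]) == {"jdk", "java"}
assert set(G["terms"]) == {"code", "program"}
assert set(T_neg["terms"]) == {"lib", "c"}
```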
5.2 Efficient Algorithms
Algorithm FClustering describes the process of feature clus-
tering, where DPþ
is the set of discovered patterns of Dþ
and DPÀ
is the set of discovered patterns of DÀ
. Step 1 to
Step 4 initialize the three categories. All terms that are not
the elements of positive patterns are assigned to category
TÀ
. For the remaining m terms, each is viewed as a single
1660 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015

cluster in the beginning (Steps 5 to 7). It also sorts these
clusters in C based on their minspe values in Step 9. Steps 10 to 21 describe the iterative process of merging clusters until there are three clusters left. The merging process first finds the closest pair of adjacent clusters (Steps 11 to 14), ck and ck+1. It then merges the two clusters into one, denoted as ck (Steps 15 to 19), and deletes ck+1 from C (Steps 20 and 21). In the last step, it chooses the first cluster as T+, the second cluster as G (if it exists) and the last cluster as a part of T− (if it exists).
In the initialization, the algorithm spends the most time (O(|T|^2)) finding the initial value of T−. The initialization can also be implemented in O(|T|) if a hash function is used for the containment test. Before the merging process, it takes O(m log m) to sort C, where m = |C| and m ≤ |T|. In the while loop, it takes O(m) time to merge two clusters and O(m^2) to move the clusters in C. Therefore, the time complexity is O(|T| + m log m + m^2) = O(|T| + m^2) = O(|T|^2).
FClustering()
Input: Discovered features T; DP+; DP−; and function spe.
Output: Three categories of terms T+, G and T−.
Method:
1: G = ∅; T+ = ∅; T− = ∅;
2: foreach ti ∈ T do
3:   if ti ∉ {t | t ∈ P, P ∈ DP+}
4:   then T− = T− ∪ {ti};
5: foreach ti ∈ T − T− do {
6:   ci = {ti};
7:   maxspe(ci) = minspe(ci) = spe(ti); }
8: let m = |T − T−|;
9: let C = {c1, c2, ..., cm} with minspe(c1) ≥ ... ≥ minspe(cm);
10: while (|C| > 3) { // start merging process
11:   let k = 1 and mind = dif(c1, c2);
12:   for i = 2 to m − 1 do
13:     if dif(ci, ci+1) < mind
14:     then { k = i; mind = dif(ci, ci+1); }
15:   let ck = ck ∪ ck+1;
16:   if minspe(ck+1) < minspe(ck)
17:   then minspe(ck) = minspe(ck+1);
18:   if maxspe(ck+1) > maxspe(ck)
19:   then maxspe(ck) = maxspe(ck+1);
20:   for i = k + 1 to m − 1 do // delete ck+1 from C
21:     let ci = ci+1; }
22: if |C| = 1 then T+ = c1
23: else if |C| = 2 then { T+ = c1; G = c2 }
24: else { T+ = c1; G = c2; T− = T− ∪ c3 };
Algorithm WFeature is applied to calculate term weights after terms are classified using Algorithm FClustering. It first calculates the sup function and the spe function (Steps 1 to 8). For each term t, it takes O(n × |p|) to calculate the d_sup function if an inverted index is utilized, where |p| is the average size of a pattern and |p| ≤ |d|. For each term t, it also takes O(n × |d|) to calculate the spe function. Therefore, the time complexity is O((|T| × n × |p|) + (|T| × n × |d|)) = O(|T| × |d| × n). It then uses Algorithm FClustering (Step 9) to classify the terms into the three categories T+, G and T−. Finally, it calculates the weights of terms using the w function defined in Section 4.2.
WFeature()
Input: An updated training set {D+, D−};
       extracted features T; DP+; DP−; and
       the initial term weight function w.
Output: A term weight function.
Method:
1: let n = |D+|;
2: T1 = {t | t ∈ p, p ∈ DP+};
3: foreach t ∈ T do
4:   if t ∈ T1
5:   then sup(t) = d_sup(t, D+);
6:   else sup(t) = −d_sup(t, D−);
7: foreach t ∈ T do
8:   spe(t) = ( |{d | d ∈ D+, t ∈ d}| − |{d | d ∈ D−, t ∈ d}| ) / n;
9: let (T+, G, T−) = FClustering(T, DP+, DP−, spe());
10: foreach t ∈ T+ do
11:   w(t) = sup(t) × (1 + spe(t));
12: foreach t ∈ T− do
13:   w(t) = sup(t) − |sup(t) × spe(t)|;
Based on the above analysis, the time complexity of Algorithm WFeature is O(|T| × |d| × n + |T|^2), where |d| is the average size of the documents and n is the number of relevant documents in the training set. In our experiments, the size of the set of selected terms is less than 300, i.e., |T| ≤ |d|; so Algorithm WFeature is efficient.
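The weighting procedure can also be sketched compactly. In the sketch below, the deploying-based support function d_sup is defined earlier in the paper (Section 4) and is therefore taken as a given argument, as is the clustering routine; all function names are ours. Per the listing, terms in G keep their initial weights.

```python
def wfeature(d_pos, d_neg, terms, pos_pattern_terms, d_sup, fclustering):
    """Sketch of Algorithm WFeature: compute updated term weights.

    d_pos / d_neg: relevant / irrelevant training documents (as term sets)
    d_sup: the deploying-based support function from Section 4 (given)
    fclustering: any routine returning the three categories (T+, G, T-)
    """
    n = len(d_pos)
    sup, spe = {}, {}
    for t in terms:
        if t in pos_pattern_terms:                 # Steps 3-6
            sup[t] = d_sup(t, d_pos)
        else:
            sup[t] = -d_sup(t, d_neg)
        # Steps 7-8: specificity = (doc freq in D+ minus doc freq in D-) / n
        spe[t] = (sum(t in d for d in d_pos)
                  - sum(t in d for d in d_neg)) / n
    t_pos, g, t_neg = fclustering(terms, pos_pattern_terms, spe)  # Step 9
    w = {}
    for t in t_pos:                                # Steps 10-11
        w[t] = sup[t] * (1 + spe[t])
    for t in t_neg:                                # Steps 12-13
        w[t] = sup[t] - abs(sup[t] * spe[t])
    # Terms in G keep their initial weights, per the listing above.
    return w
```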
6 EVALUATION
This section discusses the testing environment, and reports
the experimental results and the discussions. It also
provides recommendations for offender selection and the
use of specific terms and general terms for describing user
information needs. The proposed model is a supervised
approach that needs a training set including both relevant
documents and irrelevant documents.
6.1 Data
We used two popular data sets to test the proposed model:
Reuters Corpus Volume 1, a very large data collection; and
Reuters-21578, a small one. RCV1 includes 806,791 docu-
ments that cover a broad spectrum of issues or topics. TREC
(2002) has developed and provided 50 reliable assessor
topics [44] for RCV1, aiming at testing robust information
filtering systems. These topics were evaluated by human
assessors at the National Institute of Standards and Technology (NIST) [52]. For each topic, a subset of RCV1 documents is divided into a training set and a testing set. RCV1 is a standard data collection, and the TREC 50 topics are stable and sufficient for high-quality experiments [55].
The Reuters-21578 corpus is a widely used collection for text mining. The data was originally collected and labelled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.1 In this experiment, we selected the set of 10 classes. Following Sebastiani's convention [11], it is also called "R8" because the two classes corn and wheat are intimately related to the class grain and were merged into it.
1. Reuters-21578, http://www.daviddlewis.com/resources/
LI ET AL.: RELEVANCE FEATURE DISCOVERY FOR TEXT MINING 1661
Documents in both RCV1 and Reuters-21578 are
described in XML. To avoid bias in experiments, all of the
information about the meta-data was ignored. All documents were treated as plain text and preprocessed by removing stop-words according to a given stop-word list and stemming terms with the Porter stemming algorithm.
We also used the Library of Congress Subject Headings2 to understand the definition of the spe function in a domain ontology. LCSH is a very large taxonomic knowledge classification, developed by librarians for organizing large library collections and for retrieving information from them [55]. LCSH covers 394,070 concepts or subjects.
6.2 Baseline Models and Setting
We grouped baseline models into two categories [32]. The
first category included the up-to-date pattern based meth-
ods (frequent patterns, frequent closed patterns, sequential
patterns, and sequential closed patterns), language models
(n-grams) and a concept-based model. The second category
included well-known term feature selection models (or
called term-based models): Rocchio, BM25, SVM, mutual
information, chi-square and Lasso regression.
We divided our approach into two stages. In the first stage, we used only positive patterns in the training sets. The model, called PTM, discovers sequential closed patterns from relevant documents, deploys discovered patterns on their terms using Equation (2), and ranks documents using Equation (3). In the second stage, we used both positive and negative patterns, as described in Sections 4 and 5.
We set min_sup_r = 0.2 (as suggested by [59]) for all models that use patterns.
Different from sequential patterns, n-grams are sequential patterns with a specified number of words and no gaps between the words [42]. n-grams are usually selected using the sliding-window technique, and the probability of an n-gram = w1 w2 ... wn is calculated using the chain rule:

P(w1 w2 ... wn) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn−1).
In the experiments, we used three language models [37]: Unigram, Bigram and Trigram. Unigram uses 1-grams only, Bigram uses both 2-grams and 1-grams, and Trigram uses 3-grams, 2-grams and 1-grams. The probability of an n-gram is estimated on a training set D as follows:

P(n-gram) = tf(n-gram, D+) / tf(n-gram, D),    (6)

where tf(n-gram, D+) is the number of appearances of the n-gram in D+, tf(n-gram, D) is the number of appearances of the n-gram in D, and n = 1, 2 or 3.
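As a concrete illustration of Eq. (6), the following sketch counts sliding-window n-grams over tokenized documents and estimates P(n-gram) as the ratio of its frequency in D+ to its frequency in D. The function names are ours, not from the paper.

```python
from collections import Counter

def ngram_counts(docs, n):
    """Count n-grams (no gaps, sliding window) over tokenized documents."""
    counts = Counter()
    for tokens in docs:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def ngram_probability(gram, d_pos, d_all):
    """Eq. (6): P(n-gram) = tf(n-gram, D+) / tf(n-gram, D)."""
    n = len(gram)
    tf_pos = ngram_counts(d_pos, n)[gram]
    tf_all = ngram_counts(d_all, n)[gram]
    return tf_pos / tf_all if tf_all else 0.0
```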
The concept-based model was presented in [50], [51].
CBM was also used as a baseline model in [70] for infor-
mation filtering. The Rocchio model and BM25 are the
well-known models for representing relevant information.
We used the recommended experimental parameters (suggested by [59], [60], [70]) in our experiments (note that the term frequency is the total number of term appearances in all relevant documents).
The linear SVM has been proven very effective for text
categorization and filtering [47]. Most SVMs are designed
for making a binary decision rather than ranking docu-
ments. In this paper, we use SVM-Light3
for ranking docu-
ments. The optimization algorithms used in SVM-Light are
described in [23].
Mutual information and chi-square (χ²) are popular methods for feature selection [37]. More details about MI and χ² can be found in Chapter 13 of [37].
Lasso (least absolute shrinkage and selection operator) is another method for feature selection [57], and there have been some extensions in recent years [56], [63]. The Lasso [57] estimate is defined by

(α̂, β̂) = argmin Σ_{i=1..n} ( yi − α − Σ_{j=1..p} βj xij )²   subject to di^T β ≤ t,

where di (i = 1, 2, ..., 2^p) are the p-tuples of the form (±1, ±1, ..., ±1). In our implementation, yj = 1 if dj ∈ D+; otherwise yj = −|D+|/|D−| in order to make α̂ = ȳ = 0; and di^T = sign(β). We also let xij = 1 if term ti is in document dj, and xij = 0 otherwise, for information filtering (this assumption is the same as for other models). We use tf*idf weights to find β = {βj}, and let βj = w(tj) − w̄ + Δ. The initial β⁰ is assigned with Δ = 0 in order to make β̄⁰ = 0, where w(tj) is the tf*idf weight and Δ is a parameter used to test the positive and negative directions [16].
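The lasso variant above has a bespoke initialization and sign constraint. As a generic, self-contained illustration of lasso-style shrinkage on a (binary) document-term matrix, here is a plain coordinate-descent lasso using only NumPy; it is a sketch of the standard technique, not the paper's implementation, and all names are ours.

```python
import numpy as np

def soft_threshold(rho, lam):
    # Soft-thresholding operator used by coordinate descent for the lasso.
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    X: (n_docs, n_terms) design matrix, e.g. binary term indicators
    y: document labels; b: term coefficients (zeros drop the term)
    """
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j's current contribution.
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            b[j] = soft_threshold(rho, lam) / z if z else 0.0
    return b
```

Terms whose coefficients are shrunk exactly to zero are discarded, which is what makes the lasso usable as a feature selection baseline.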
6.3 Evaluation Metrics
The effectiveness of a model is usually measured by the
following means [32], [59]: the average precision of the
top-20 documents, F1 measure, mean average precision (MAP), the break-even point (b/p), and interpolated average precision (IAP) at 11 points. These are widely
accepted and well-established evaluation metrics. Each
metric focuses on a different aspect of the model’s perfor-
mance, as described below.
The F-beta (Fβ) measure combines recall (R) and precision (P), together with a parameter β. We used β = 1 in this paper, which means precision and recall were weighted equally. Therefore, Fβ becomes F1 = 2PR / (P + R).
MAP measures the precision at each relevant document first, then obtains the average precision over all topics. It combines precision, relevance ranking and overall recall to measure the performance of the models.
B/P is the value of recall (or precision) at which the P/R curve intersects the precision = recall line. The larger the value, the better the model performs.
The 11-points measure has also been adopted in several research works [66]. It measures the performance of different models by averaging the precisions at 11 standard recall levels (recall = 0.0, 0.1, ..., 1.0, where "0.0" means cut-off = 1 in this paper). We also used a statistical method, the paired two-tailed t-test, to analyze the experimental results.
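To make the ranking metrics concrete, here is a minimal sketch of the precision/recall bookkeeping, average precision (the per-topic ingredient of MAP) and F1; the helper names are ours.

```python
def precision_recall_points(ranked_relevance):
    """(precision, recall) after each rank position.

    ranked_relevance: list of 0/1 relevance flags for a ranked document list.
    """
    total_rel = sum(ranked_relevance)
    points, hits = [], 0
    for i, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / i, hits / total_rel))
    return points

def average_precision(ranked_relevance):
    """Mean of the precisions measured at each relevant document."""
    pts = precision_recall_points(ranked_relevance)
    rel_precisions = [p for (p, _), rel in zip(pts, ranked_relevance) if rel]
    return sum(rel_precisions) / len(rel_precisions)

def f1(precision, recall):
    """F-beta with beta = 1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Averaging `average_precision` over all topics gives MAP; interpolating the (precision, recall) points at the 11 standard recall levels gives IAP.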
2. LCSH Web page, http://classificationweb.net/
3. SVM-Light URL: http://svmlight.joachims.org/
6.4 Hypotheses
The proposed model is called the relevance feature discov-
ery model, and consists of three major steps: feature discov-
ery and deploying, term classification and term weighting.
It first finds positive and negative patterns and terms in the training set. It then classifies terms into three categories by using parameters (θ1 and θ2) or Algorithm FClustering. Finally, it works out the term weights by using Algorithm WFeature.
In our experiments, we developed two versions of the RFD model. Both versions use negative feedback to improve the quality of the features extracted from positive feedback. The features extracted from both positive and negative feedback are classified into three categories, namely T+, G and T−. The first version, called RFD1, uses two empirical parameters (θ1 and θ2, see [32]) to group the low-level terms into three groups. This model can achieve satisfactory performance, but the two parameters must be decided manually according to their real performance on testing sets. The second version, called RFD2, uses the proposed FClustering algorithm to automatically determine the three categories T+, G and T− based on the training sets.
To conduct a comprehensive investigation of the pro-
posed model and the ways in which the term classification
could help to improve the performance, the proposed
model is discussed in terms of the following hypotheses:
The RFD model classifies terms into three categories
(positive specific terms, general terms and negative
specific terms) by using the spe function.
Hypothesis H1. The spe function is reasonable for
describing terms’ specificity for most topics.
Hypothesis H2. The positive specific terms are the most interesting in relation to what users want, while general terms provide the necessary information for describing what users want. The use of the three categories together can generate the best performance.
RFD1 is the state-of-the-art model for information
filtering. It can achieve satisfactory performance for
a given testing set. However, it is a parameterized
method, and the two empirical parameters are sensi-
tive to the change of testing sets.
Hypothesis H3. RFD2 overcomes the limitation of
RFD1 by using a clustering method to classify the
terms into three categories directly. It can achieve a
similar performance as RFD1. The RFD2 model also
shows remarkable performance compared with the
state-of-the-art models.
6.5 Results
In this section, we first compare RFD2 and RFD1, and expect the performance of RFD2 to be close to that of RFD1. We also compare the RFD2 model with language models (n-grams) and other pattern-based models, especially PTM, which is the best one
of the existing pattern-based models. In addition, RFD2 is compared with the state-of-the-art term-based methods underpinned by Rocchio, BM25, SVM, MI, χ² and Lasso for each of the measures top-20, B/P, MAP, IAP and Fβ=1 on both datasets.
6.5.1 Understanding Specificity on the LCSH Ontology
For each topic, let RFD-SPE be the set of positive specific
terms determined by RFD2, and RFD-G be the set of general
terms determined by RFD2. Fig. 1 shows the average spe_onto values of terms in both RFD-SPE and RFD-G for all 50 topics on RCV1. Most topics (90 percent) obtain larger spe_onto values for the RFD2 positive specific terms. This result supports Hypothesis H1.
6.5.2 RFD2 vs RFD1
RFD1 uses both θ1 and θ2 to group the low-level terms into three categories. To achieve satisfactory performance, we conducted cross validation for the two parameters on the testing sets, and finally set θ1 = 0.2 and θ2 = 0.3 for RFD1 manually.
RFD2 uses Algorithm FClustering to automatically group terms into the three categories T+, G and T− for each topic. Table 1 shows the average results of the five measures on all 50 assessing topics, where %chg denotes the percentage change of RFD2 over RFD1.
As shown in Table 1, RFD2 produces essentially the same performance as RFD1, with a small improvement on four measures (top-20, B/P, IAP and Fβ=1). These results support Hypothesis H3.
6.5.3 RFD2 vs Pattern-Based Models and n-Grams
The results on data collection RCV1 for all models in the first category (RFD2, language models (n-grams), CBM and other pattern-based models) are presented in Table 2, where %chg means the percentage change of RFD2 over PTM. As noted earlier, pattern-based methods struggle on some topics because too much noise is generated in the discovery of positive patterns. The most important findings revealed in this table are that closed sequential patterns (Closed Seq Ptns) perform better than other patterns, and that the PTM deploying method largely outperforms closed sequential patterns. The result also supports the superiority of using closed sequential patterns in text mining and highlights the importance of adopting proper pattern deploying methods on terms when using discovered patterns in text documents.
In terms of n-grams, the Trigram model outperforms the Bigram and Unigram models. The performance of the Trigram model is very good, with results similar to PTM.
To see the effectiveness of using both positive and negative patterns for relevance feature discovery, we also compare RFD2 with the best pattern-based model, PTM (which uses positive patterns only), on Reuters-21578 (see Table 3).
Both tables show that RFD2 achieves excellent performance, with an average percentage change of 10.35 percent for RCV1 (maximum 13.10 percent, minimum 6.99 percent) and 9.71 percent for Reuters-21578 (maximum 11.52 percent, minimum 6.73 percent).

Fig. 1. spe_onto(t) for all t ∈ RFD-SPE vs. spe_onto(t) for all t ∈ RFD-G.
6.5.4 RFD2 vs Term Feature Selection Models
The proposed method using RFD2 was also compared with popular feature selection models including Rocchio, BM25, SVM, MI, χ² and Lasso. The experimental results on RCV1 for all 50 assessing topics are reported in Table 4. As shown in Table 4, the proposed new model RFD2 achieved the best performance for the assessor topics; the percentage change in the table is calculated against Lasso (the second best model on RCV1, after RFD2). The average percentage of improvement over the standard measures is 7.90 percent, with a maximum of 10.87 percent and a minimum of 5.62 percent.
The experimental results on Reuters-21578 (R8) are reported in Table 5, where RFD2 is compared with SVM (the second best term-based model on Reuters-21578) and the percentage change is calculated. As shown in Table 5, RFD2 also achieved the best performance. Compared to SVM, RFD2 has the same top-20 precision and is better on the other four measures. The maximum percentage of improvement, over the Fβ=1 measure, is 7.72 percent.
Finally, the statistical significance tests comparing the proposed model with the other high-performance models on all data collections are reported in Table 6. The results show that the improvements of the proposed model are significant, as all p-values are less than 0.05.
6.5.5 Robustness
In this paper, robustness describes a model's capacity to perform effectively while its training sets are altered or the application environment is changed. We call a model robust if it still provides satisfactory performance under such changes. For this evaluation, we only use RCV1, because the Reuters-21578 testing set would become too small if we increased the training sets.
For altered training sets, we used six loops for each
topic and each loop used a sliding window to increase the
training sets, where each sliding window included 25
documents that were randomly selected from the testing
set. The 25 documents were also removed from the corre-
sponding testing set.
TABLE 1
Comparison Results of RFD1 and RFD2 Models
in All Assessing Topics on RCV1
Model top-20 b/p MAP Fβ=1 IAP
RFD1 0.5570 0.4724 0.4932 0.4696 0.5125
RFD2 0.5610 0.4729 0.4930 0.4699 0.5136
%chg 0.71% 0.11% -0.04% 0.06% 0.21%
TABLE 2
Comparison of All Pattern (Phrase) Based Methods on RCV1
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.561 0.473 0.493 0.470 0.513
PTM 0.496 0.430 0.444 0.439 0.464
Seq Patterns 0.401 0.343 0.361 0.385 0.384
Closed Seq Ptns 0.406 0.353 0.364 0.390 0.392
Freq Patterns 0.412 0.352 0.361 0.386 0.384
Freq Closed Ptns 0.428 0.346 0.361 0.385 0.387
Unigram 0.417 0.386 0.388 0.404 0.411
Bigram 0.477 0.420 0.435 0.436 0.458
Trigram 0.499 0.420 0.439 0.438 0.460
CBM 0.448 0.409 0.415 0.423 0.440
%chg +13.10 +9.87 +11.14 +6.99 +10.66
TABLE 3
Comparison of the Proposed Model with the Best Pattern
Based Model PTM on Reuters-21578(R8)
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.794 0.704 0.747 0.601 0.748
PTM 0.731 0.633 0.661 0.564 0.664
%chg +6.73 +10.24 +10.66 +9.40 +11.52
TABLE 4
Comparison Results of All Models on RCV1
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.561 0.473 0.493 0.470 0.513
Rocchio 0.501 0.424 0.440 0.433 0.459
BM25 0.445 0.407 0.407 0.414 0.428
SVM 0.453 0.408 0.409 0.421 0.435
MI 0.316 0.311 0.312 0.347 0.337
χ² 0.322 0.326 0.319 0.355 0.345
Lasso 0.506 0.434 0.460 0.445 0.480
%chg +10.87% +8.99% +7.17% +5.62% +6.88%
TABLE 5
Comparison of All Models on Reuters-21578(R8)
Model top-20 b/p MAP Fβ=1 IAP
RFD2 0.794 0.699 0.745 0.600 0.746
Rocchio 0.706 0.594 0.633 0.527 0.632
BM25 0.675 0.556 0.582 0.508 0.590
SVM 0.794 0.693 0.729 0.557 0.709
MI 0.275 0.261 0.219 0.269 0.251
χ² 0.263 0.245 0.211 0.260 0.242
Lasso 0.719 0.627 0.657 0.536 0.651
%chg 0.0% +0.87% +2.19% +7.72% +5.22%
TABLE 6
p-Values for RFD2 vs Other High Performance Models
Model top-20 b/p MAP Fβ=1 IAP
Lasso 0.01100 0.01976 0.03850 0.02509 0.03040
PTM 0.00101 0.00070 0.00003 0.00002 0.00001
SVM 0.00030 0.00688 0.00160 0.00058 0.00113
Rocchio 0.00463 0.00436 0.00570 0.00496 0.00405
The Rocchio model is a popular robust model for filtering.
Table 7 shows the results of Rocchio model based on these
settings.
Tables 8 and 9 show the experimental results for both
RFD1 and RFD2, respectively. It is clear that both RFD
models achieved better results when using more training
documents. The performance of RFD models is satisfactory.
The comparison between RFD1 and RFD2 is shown in
Table 10. The difference between RFD1 and RFD2 is not sig-
nificant as the p-values are obviously greater than 0.05.
For the altered application environment, we use the proposed model for text classification. RFD models can easily be used for ranking documents by applying the rank function defined in Eq. (3) with the term weight w(t) defined in Section 4.2. To apply Eq. (3) to binary text classification, we require a threshold τ to determine relevance (rank(d) ≥ τ) and non-relevance (rank(d) < τ). We call this kind of classifier RFDτ. Let τ+ = min{rank(d) | d ∈ D+}, τ− = max{rank(d) | d ∈ D−}, and τ = min{τ+, τ−}. To avoid bias, we use balanced testing sets for each topic by randomly selecting five equivalent negative subsets to match the positive set. We also use two other well-known classifiers, SVM and SMO (Sequential Minimal Optimization for training SVM), with their LibSVM implementation (http://www.csie.ntu.edu.tw/cjlin/libsvm/), and selected the best configuration of each classifier for this comparison. Table 11 shows the results, where Acc_m and Acc_M are the micro accuracy and macro accuracy, respectively.
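The threshold construction for RFDτ follows directly from the definitions above; in this sketch, the scoring function `rank` stands for Eq. (3), which is not reproduced here, and the function names are ours.

```python
def rfd_threshold(rank, d_pos, d_neg):
    """Compute the RFD-tau decision threshold from the training split.

    rank: document scoring function (Eq. (3) in the paper)
    tau+ = min rank over relevant docs, tau- = max rank over irrelevant docs,
    tau  = min(tau+, tau-).
    """
    tau_pos = min(rank(d) for d in d_pos)
    tau_neg = max(rank(d) for d in d_neg)
    return min(tau_pos, tau_neg)

def classify(rank, tau, doc):
    # Relevant iff rank(doc) >= tau; non-relevant otherwise.
    return rank(doc) >= tau
```

Taking the minimum of τ+ and τ− errs on the side of recall: any test document scoring at least as high as the weakest relevant training document (or the strongest irrelevant one, whichever is lower) is accepted.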
These experiments show that the performance of the pro-
posed model is satisfactory for both altered training sets
and the application environment. These results also support
Hypothesis H3.
6.6 Discussion
The proposed model has three major steps: feature discov-
ery and deploying, term classification, and term weighing.
Offender selection plays an important role for using nega-
tive feedback in the process of feature discovery and
deploying. In this section, we first discuss the issue of
offender selection. We also discuss other issues for the pro-
posed model such as term classification and specificities.
6.6.1 Offender Selection
We believe that positive feedback is more constructive than negative feedback, since the objective of relevance feature discovery is to find relevant knowledge. However, negative feedback contains some useful information that can help to identify the boundary between relevant and irrelevant information, improving the effectiveness of relevance feature discovery. The obvious problem with using irrelevant documents is that most of them are not close to the given topic because of the very large amount of negative information. Therefore, it is necessary to choose some useful irrelevant documents (offenders) to decide the groups of terms for the three categories [32].
TABLE 7
Results of Rocchio Model on Six Sliding Windows
for All Assessor Topics
Model top-20 b/p MAP Fβ=1 IAP
Rocchio-1 0.525 0.444474 0.458249 0.448621 0.476696
Rocchio-2 0.495 0.444119 0.454437 0.448007 0.474435
Rocchio-3 0.505 0.455495 0.463649 0.449008 0.485906
Rocchio-4 0.497 0.450539 0.460866 0.448778 0.483619
Rocchio-5 0.497 0.441519 0.449421 0.441622 0.472068
Rocchio-6 0.479 0.428432 0.443400 0.439774 0.466213
AVG 0.500 0.444096 0.455004 0.445968 0.476490
Rocchio 0.501 0.4240 0.4400 0.4333 0.4590
TABLE 8
Results of RFD1 Model on Six Sliding Windows
for All Assessor Topics
Model top-20 b/p MAP Fβ=1 IAP
RFD1-1 0.585 0.495 0.513 0.483 0.532
RFD1-2 0.565 0.491 0.512 0.485 0.529
RFD1-3 0.581 0.486 0.507 0.479 0.528
RFD1-4 0.575 0.499 0.518 0.484 0.540
RFD1-5 0.558 0.476 0.497 0.470 0.518
RFD1-6 0.547 0.475 0.498 0.473 0.519
AVG 0.569 0.487 0.508 0.479 0.528
RFD1 0.557 0.4724 0.493 0.4696 0.5125
TABLE 9
Results of RFD2 Model on Six Sliding Windows
for All Assessor Topics
Model top-20 b/p MAP Fβ=1 IAP
RFD2-1 0.582 0.497 0.513 0.484 0.533
RFD2-2 0.563 0.496 0.513 0.486 0.530
RFD2-3 0.577 0.483 0.504 0.478 0.525
RFD2-4 0.569 0.493 0.516 0.483 0.537
RFD2-5 0.555 0.476 0.494 0.468 0.514
RFD2-6 0.556 0.478 0.499 0.473 0.520
AVG 0.567 0.487 0.507 0.479 0.527
RFD2 0.561 0.473 0.493 0.470 0.514
TABLE 10
p-Values for Comparing the RFD1 and RFD2 Models
Loop top-20 b/p MAP Fβ=1 IAP
Loop-1 0.472 0.465 0.766 0.526 0.509
Loop-2 0.569 0.113 0.622 0.404 0.695
Loop-3 0.522 0.349 0.384 0.546 0.369
Loop-4 0.224 0.096 0.243 0.426 0.283
Loop-5 0.411 0.993 0.137 0.182 0.061
Loop-6 0.083 0.196 0.628 0.911 0.678
AVG 0.380 0.369 0.463 0.499 0.433
TABLE 11
Results of RFD Based Classifier with Threshold t
Model Macro-Average Acc_M Micro-Average Acc_m
RFDt 0.682 0.701
SVM linear 0.611 0.656
SMO polynomial 0.616 0.661
%chg +11.62% +6.05%
Table 12 shows the performance for different K values, where K = |D+|/2 obtained the best performance. Table 12a shows the average number of relevant documents, irrelevant documents, offenders and extracted terms in the training sets (where |D−| ≥ |D+| ≥ |D+|/2); and Table 12b shows the performance and the average term weight for the three categories. The results illustrate that the larger the value of K, the lower the performance and the term weights of T+ and G. Another advantage of offender selection is that it reduces the space of negative relevance feedback. Table 12a clearly shows that only 15.8% (= 6.54/41.3) of the irrelevant documents are selected as offenders for the best performance.
In summary, the experimental results support the strat-
egy of offender selection used in the proposed model. We
therefore conclude that the proposed method for offender
selection in RFD meets the design objectives.
6.6.2 Term Classification and Specificity
Terms can be grouped based on the spe function and either the classification rules or the feature clustering method. RFD1 uses two thresholds to decide the categories of terms. It obtained satisfactory performance; however, it requires prior knowledge and more effort to set the right values for the parameters. RFD2 uses the feature clustering technique to group terms into three categories adaptively for each topic. In this section, we mainly discuss the results of using RFD2.
Table 13 shows the statistical information for both RFD2 and PTM. The average number of terms that PTM extracted was 156.9, and all those terms were used as a single group. RFD2 groups terms into three categories, and the number of terms in the positive specific category and the general category together is reduced to 46.64 (= 24.24 + 22.4); that is, only 29.73 percent of the extracted PTM terms are retained in RFD, and about 70.27% (= 100% − 29.73%) of them are possibly noisy terms. The percentage of general terms is 48.03% (= 22.4 / (22.4 + 24.24); see Table 13a). General terms frequently appear not only in relevant documents, but also in some irrelevant documents. To further reduce the side effects of using general terms, RFD2 adds some negative specific terms (T−).
We believe that positive specific terms (with large specificity values) are more interesting than general terms (with small specificity values) for a given topic. As shown in Table 13, PTM assigned 66.92% (= 2.5952 / (1.28273 + 2.5952)) of the weight to positive specific terms and 33.08 percent to general terms. RFD2 increased the weights of positive specific terms: it assigned 25.55% (= 1.28273 / (1.28273 + 3.73828)) to general terms and 74.45 percent to positive specific terms (see Table 13).
Fig. 2 shows that using only positive specific terms (T+) achieves much better results than using only general terms (G). It is also recommended to use both positive specific terms and general terms (T+ ∪ G), which can significantly improve the effectiveness. This recommendation is also suggested by the SAGE model [12], where a topic model explicitly considers the background signal (like the neutral (G) cluster).
In summary, the use of negative feedback is significant
for RFD models. It can balance the percentages of positive
TABLE 12
Statistical Information for RFD2 with Different Values of K

(a) Average number of training documents and of extracted terms
K        Relevant  Irrelevant  Offenders  T+      G      T−
|D+|/2   12.78     41.3        6.54       24.24   22.4   231.04
|D+|     12.78     41.3        10.08      28.94   24.68  267.38
|D−|     12.78     41.3        38.92      31.78   8.46   521.64

(b) Average weight of extracted terms, and performance
K        w(T+)    w(G)     w(T−)      top-20  MAP    Fβ=1
|D+|/2   3.7383   1.2827   -0.3328    0.561   0.493  0.470
|D+|     3.3044   1.2227   -3.1947    0.542   0.463  0.451
|D−|     2.6307   0.4602   -69.9437   0.274   0.278  0.295
TABLE 13
Statistical Information for Both RFD2 and PTM

(a) Average number of extracted terms used in RFD, and average weight(t) in PTM
T+      G      T−       w(T+)    w(G)      w(T−)
24.24   22.4   231.04   2.5952   1.28273   0.68486

(b) Average weight(t) in RFD, and terms extracted from D+ used in PTM
w(T+)     w(G)      w(T−)      T       w(T)
3.73828   1.28273   -0.33275   156.9   1.45210
specific terms and general terms, largely reducing noise. The experimental results demonstrate that we can roughly choose the same number of positive specific terms and general terms, and assign large weights to the positive specific terms. These results support Hypothesis H2.
7 CONCLUSION
The research proposes an alternative approach for relevance
feature discovery in text documents. It presents a method to
find and classify low-level features based on both their
appearances in the higher-level patterns and their specificity. It also introduces a method to select irrelevant documents for weighting features. In this paper, we continued to develop the RFD model and experimentally showed that the proposed specificity function is reasonable and that the term classification can be effectively approximated by a feature clustering method.
The first RFD model uses two empirical parameters to set the boundaries between the categories. It achieves the expected performance, but it requires manually testing a large number of different parameter values. The new model instead uses a feature clustering technique to automatically group terms into the three categories. Compared with the first model, the new model is much more efficient and achieves satisfactory performance as well.
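The clustering alternative can be sketched as follows: instead of hand-tuned boundary parameters, a one-dimensional k-means with k = 3 groups term scores into three categories automatically. This is a toy illustration with invented scores, not the paper's actual feature clustering method.

```python
# Toy sketch: replace hand-tuned category boundaries with 1-D k-means.

def kmeans_1d(values, k=3, iters=50):
    """Tiny 1-D k-means; returns a cluster label for each value."""
    vs = sorted(values)
    # spread the k initial centers across the sorted value range
    centers = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return [min(range(k), key=lambda i: abs(v - centers[i]))
            for v in values]

# Invented term scores: three loose bands (high / near-zero / low)
scores = [0.9, 0.8, 0.85, 0.1, 0.0, -0.05, -0.7, -0.9]
labels = kmeans_1d(scores)
# terms with similar scores land in the same category, with no
# manually chosen thresholds
```

This matches the motivation stated above: the boundaries emerge from the score distribution itself rather than from repeated manual tuning.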
This paper also includes a set of experiments on RCV1 (TREC topics), Reuters-21578 and the LCSH ontology. These experiments illustrate that the proposed model achieves the best performance compared with both term-based and pattern-based baseline models. The results also show that the term classification can be effectively approximated by the proposed feature clustering method, that the proposed specificity (spe) function is reasonable, and that the proposed models are robust.
The proposed model was thoroughly tested, and the results show that its improvements over the baselines are statistically significant. The results also show that the use of irrelevance feedback significantly improves the performance of relevance feature discovery models. This work provides a promising methodology for developing effective text mining models for relevance feature discovery based on both positive and negative feedback.
ACKNOWLEDGMENTS
This work was partially supported by Grant DP140103157 from the Australian Research Council (ARC Discovery Project). Y. Li is the corresponding author.
REFERENCES
[1] M. Aghdam, N. Ghasem-Aghaee, and M. Basiri, “Text feature selection using ant colony optimization,” Expert Syst. Appl., vol. 36, pp. 6843–6853, 2009.
[2] A. Algarni and Y. Li, “Mining specific features for acquiring user
information needs,” in Proc. Pacific Asia Knowl. Discovery Data
Mining, 2013, pp. 532–543.
[3] A. Algarni, Y. Li, and Y. Xu, “Selected new training documents to
update user profile,” in Proc. Int. Conf. Inf. Knowl. Manage., 2010,
pp. 799–808.
[4] N. Azam and J. Yao, “Comparison of term frequency and doc-
ument frequency based feature selection metrics in text cate-
gorization,” Expert Syst. Appl., vol. 39, no. 5, pp. 4760–4768,
2012.
[5] R. Bekkerman and M. Gavish, “High-precision phrase-based doc-
ument classification on a modern scale,” in Proc. 11th ACM
SIGKDD Knowl. Discovery Data Mining, 2011, pp. 231–239.
[6] A. Blum and P. Langley, “Selection of relevant features and exam-
ples in machine learning,” Artif. Intell., vol. 97, nos. 1/2, pp. 245–
271, 1997.
[7] C. Buckley, G. Salton, and J. Allan, “The effect of adding relevance
information in a relevance feedback environment,” in Proc. Annu.
Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 1994, pp. 292–300.
[8] G. Cao, J.-Y. Nie, J. Gao, and S. Robertson, “Selecting good expan-
sion terms for pseudo-relevance feedback,” in Proc. Annu. Int.
ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 243–250.
[9] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Comput. Electr. Eng., vol. 40, pp. 16–28, 2014.
[10] B. Croft, D. Metzler, and T. Strohman, Search Engines: Information
Retrieval in Practice. Reading, MA, USA: Addison-Wesley, 2009.
[11] F. Debole and F. Sebastiani, “An analysis of the relative hardness
of Reuters-21578 subsets,” J. Amer. Soc. Inf. Sci. Technol., vol. 56,
no. 6, pp. 584–596, 2005.
[12] J. Eisenstein, A. Ahmed, and E. P. Xing, “Sparse additive genera-
tive models of text,” in Proc. Annu. Int. Conf. Mach. Learn., 2011,
pp. 274–281.
[13] G. Forman, “An extensive empirical study of feature selection metrics for text classification,” J. Mach. Learn. Res., vol. 3, pp. 1289–1305, 2003.
[14] Y. Gao, Y. Xu, and Y. Li, “Topical pattern based document model-
ling and relevance ranking,” in Proc. 15th Int. Conf. Web Inf. Syst.
Eng., 2014, pp. 186–201.
[15] X. Geng, T.-Y. Liu, T. Qin, A. Arnold, H. Li, and H.-Y. Shum,
“Query dependent ranking using k-nearest neighbor,” in Proc.
Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008,
pp. 115–122.
[16] A. Genkin, D. D. Lewis, and D. Madigan, “Large-scale Bayesian
logistic regression for text categorization,” Technometrics, vol. 49,
no. 3, pp. 291–304, 2007.
[17] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, no. 1, pp. 1157–1182, 2003.
[18] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without can-
didate generation,” in Proc. ACM SIGMOD Int. Conf. Manage.
Data, 2000, pp. 1–12.
[19] Y.-F. Huang and S.-Y. Lin, “Mining sequential patterns using
graph search techniques,” in Proc. Annu. Int. Conf. Comput. Softw.
Appl., 2003, pp. 4–9.
[20] G. Ifrim, G. Bakir, and G. Weikum, “Fast logistic regression for
text categorization with variable-length n-grams,” in Proc. ACM
SIGKDD Knowl. Discovery Data Mining, 2008, pp. 354–362.
[21] N. Jindal and B. Liu, “Identifying comparative sentences in text
documents,” in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf.
Retrieval, 2006, pp. 244–251.
[22] T. Joachims, “Transductive inference for text classification using
support vector machines,” in Proc. Annu. Int. Conf. Mach. Learn.,
1999, pp. 200–209.
[23] T. Joachims, “Optimizing search engines using clickthrough
data,” in Proc. ACM SIGKDD Knowl. Discovery Data Mining, 2002,
pp. 133–142.
Fig. 2. Comparison of different combinations of term categories for RFD2.
LI ET AL.: RELEVANCE FEATURE DISCOVERY FOR TEXT MINING 1667
[24] K. Spärck Jones, S. Walker, and S. E. Robertson, “A probabilistic model of information retrieval: Development and comparative experiments,” Inf. Process. Manage., vol. 36, no. 6, pp. 779–808, 2000.
[25] R. Lau, P. Bruza, and D. Song, “Towards a belief-revision-based
adaptive and context-sensitive information retrieval system,”
ACM Trans. Inf. Syst., vol. 26, no. 2, pp. 1–38, 2008.
[26] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A new benchmark collection for text categorization research,” J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004.
[27] X. Li and B. Liu, “Learning to classify texts using positive and
unlabeled data,” in Proc. 18th Int. Joint Conf. Artif. Intell., 2003,
pp. 587–592.
[28] X.-L. Li, B. Liu, and S.-K. Ng, “Learning to classify documents
with only a small positive training set,” in Proc. 18th Eur. Conf.
Mach. Learn., 2007, pp. 201–213.
[29] Y. Li, D. F. Hsu, and S. M. Chung, “Combination of multiple feature selection methods for text categorization by using combinational fusion analysis and rank-score characteristic,” Int. J. Artif. Intell. Tools, vol. 22, no. 2, p. 1350001, 2013.
[30] Y. Li, A. Algarni, S.-T. Wu, and Y. Xue, “Mining negative rele-
vance feedback for information filtering,” in Proc. Web Intell. Intell.
Agent Technol., 2009, pp. 606–613.
[31] Y. Li, A. Algarni, and Y. Xu, “A pattern mining approach for information filtering systems,” Inf. Retrieval, vol. 14, pp. 237–256, 2011.
[32] Y. Li, A. Algarni, and N. Zhong, “Mining positive and negative
patterns for relevance feature discovery,” in Proc. ACM SIGKDD
Knowl. Discovery Data Mining, 2010, pp. 753–762.
[33] Y. Li and N. Zhong, “Mining ontology for automatically acquiring web user information needs,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 4, pp. 554–568, Apr. 2006.
[34] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Y. Lau, “A two-stage text
mining model for information filtering,” in Proc. 17th ACM Conf.
Inf. Knowl. Manage., 2008, pp. 1023–1032.
[35] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R. Y. Lau, “Two-stage decision
model for information filtering,” Decision Support Syst., vol. 52,
no. 3, pp. 706–716, 2012.
[36] X. Ling, Q. Mei, C. Zhai, and B. Schatz, “Mining multi-faceted
overviews of arbitrary topics in a text collection,” in Proc. 14th
ACM SIGKDD Knowl. Discovery Data Mining, 2008, pp. 497–505.
[37] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2009.
[38] C. D. Manning and H. Sch€utze, Foundations of Statistical Natural
Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[39] D. Metzler and W. B. Croft, “Latent concept expansion using Mar-
kov random fields,” in Proc. Annu. Int. ACM SIGIR Conf. Res.
Develop. Inf. Retrieval, 2007, pp. 311–318.
[40] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth,” in Proc. Int. Conf. Data Eng., 2001, pp. 215–224.
[41] R. K. Pon, A. F. Cardenas, D. Buttler, and T. Critchlow, “Tracking
multiple topics for finding interesting articles,” in Proc. ACM
SIGKDD Knowl. Discovery Data Mining, 2007, pp. 560–569.
[42] S. Quiniou, P. Cellier, T. Charnois, and D. Legallois, “What about
sequential data mining techniques to identify linguistic patterns
for stylistics?” in Computational Linguistics and Intelligent Text Proc-
essing. New York, NY, USA: Springer, 2012, pp. 166–177.
[43] S. Robertson, H. Zaragoza, and M. Taylor, “Simple BM25 extension to multiple weighted fields,” in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2004, pp. 42–49.
[44] S. E. Robertson and I. Soboroff, “The TREC 2002 filtering track
report,” in Proc. 11th Text Retrieval Conf., 2002.
[45] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988.
[46] S. Scott and S. Matwin, “Feature engineering for text classi-
fication,” in Proc. Annu. Int. Conf. Mach. Learn., 1999, pp. 379–388.
[47] F. Sebastiani, “Machine learning in automated text catego-
rization,” ACM Comput. Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[48] M. Seno and G. Karypis, “SLPMiner: An algorithm for finding frequent sequential patterns using length-decreasing support constraint,” in Proc. 2nd IEEE Int. Conf. Data Mining, 2002, pp. 418–425.
[49] R. Sharma and S. Raman, “Phrase-based text representation for
managing the web documents,” in Proc. Int. Conf. Inf. Technol.:
Coding Comput., 2003, pp. 165–169.
[50] S. Shehata, F. Karray, and M. Kamel, “Enhancing text clustering using concept-based mining model,” in Proc. IEEE Int. Conf. Data Mining, 2006, pp. 1043–1048.
[51] S. Shehata, F. Karray, and M. Kamel, “A concept-based model for
enhancing text categorization,” in Proc. ACM SIGKDD Knowl. Dis-
covery Data Mining, 2007, pp. 629–637.
[52] I. Soboroff and S. Robertson, “Building a filtering test collection
for TREC 2002,” in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop.
Inf. Retrieval, 2003, pp. 243–250.
[53] F. Song and W. B. Croft, “A general language model for informa-
tion retrieval,” in Proc. ACM Conf. Inf. Knowl. Manage., 1999,
pp. 316–321.
[54] Q. Song, J. Ni, and G. Wang, “A fast clustering-based feature subset selection algorithm for high-dimensional data,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 1, pp. 1–14, Jan. 2013.
[55] X. Tao, Y. Li, and N. Zhong, “A personalized ontology model for web information gathering,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 4, pp. 496–511, Apr. 2011.
[56] R. Tibshirani, “Regression shrinkage and selection via the Lasso: A retrospective,” J. Royal Stat. Soc. B, vol. 73, pp. 273–282, 2011.
[57] R. Tibshirani, “Regression shrinkage and selection via the Lasso,”
J. Royal Stat. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
[58] X. Wang, H. Fang, and C. Zhai, “A study of methods for negative
relevance feedback,” in Proc. Annu. Int. ACM SIGIR Conf. Res.
Develop. Inf. Retrieval, 2008, pp. 219–226.
[59] S.-T. Wu, Y. Li, and Y. Xu, “Deploying approaches for pattern
refinement in text mining,” in Proc. IEEE Conf. Data Mining, 2006,
pp. 1157–1161.
[60] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, “Automatic pattern-
taxonomy extraction for web mining,” in Proc. Int. Conf. Web
Intell., 2004, pp. 242–248.
[61] Z. Xu and R. Akella, “Active relevance feedback for difficult queries,” in Proc. ACM Conf. Inf. Knowl. Manage., 2008, pp. 459–468.
[62] G.-R. Xue, D. Xing, Q. Yang, and Y. Yu, “Deep classification in
large-scale text hierarchies,” in Proc. Annu. Int. ACM SIGIR Conf.
Res. Develop. Inf. Retrieval, 2008, pp. 619–626.
[63] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, and M. Sugiyama,
“High-dimensional feature selection by feature-wise kernelized
Lasso,” Neural Comput., vol. 26, no. 1, pp. 185–207, 2014.
[64] X. Yan, H. Cheng, J. Han, and D. Xin, “Summarizing itemset pat-
terns: A profile-based approach,” in Proc. ACM SIGKDD Knowl.
Discovery Data Mining, 2005, pp. 314–323.
[65] C. C. Yang, “Search engines information retrieval in practice,” J. Amer. Soc. Inf. Sci. Technol., vol. 61, pp. 430–430, 2010.
[66] Y. Yang, “An evaluation of statistical approaches to text categorization,” Inf. Retrieval, vol. 1, pp. 69–90, 1999.
[67] Y. Yang and J. O. Pedersen, “A comparative study on feature
selection in text categorization,” in Proc. Annu. Int. Conf. Mach.
Learn., 1997, pp. 412–420.
[68] M. J. Zaki, “SPADE: An efficient algorithm for mining frequent sequences,” Mach. Learn., vol. 42, pp. 31–60, 2001.
[69] Z. Zhao, L. Wang, H. Liu, and J. Ye, “On similarity preserving feature selection,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 3, pp. 619–632, Mar. 2013.
[70] N. Zhong, Y. Li, and S.-T. Wu, “Effective pattern discovery for text mining,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 1, pp. 30–44, Jan. 2012.
[71] S. Zhu, X. Ji, W. Xu, and Y. Gong, “Multi-labelled classification
using maximum entropy method,” in Proc. Annu. Int. ACM SIGIR
Conf. Res. Develop. Inf. Retrieval, 2005, pp. 1041–1048.
Yuefeng Li is a full professor in the School of Electrical Engineering and Computer Science, Queensland University of Technology, Australia. He has published more than 150 refereed papers (including 43 journal papers). He has demonstrable experience in leading large-scale research projects and has achieved many established research outcomes that have been published and highly cited in top data mining journals and conferences (highest citations per paper = 188). He is the managing editor of Web Intelligence and Agent Systems and an associate editor of the International Journal of Pattern Recognition and Artificial Intelligence.
Abdulmohsen Algarni received the PhD degree
from Queensland University of Technology, Aus-
tralia, in 2012. He was a research associate in the
School of Electrical Engineering and Computer
Science, Queensland University of Technology,
Australia, in 2012. He is currently an assistant professor in the College of Computer Science, King Khalid University. His research interests include text mining and information filtering.
Mubarak Albathan received the MSc degree in
network computing from Monash University,
Australia, in 2009. He is currently working toward
the PhD degree in the School of Electrical
Engineering and Computer Science, Queensland
University of Technology, Brisbane, Australia.
His research interests include feature selection
and Web intelligence.
Yan Shen received the PhD degree from the
Queensland University of Technology, Australia,
in 2013. He is a research associate in the School
of Electrical Engineering and Computer Science,
Queensland University of Technology, Australia.
His research interests include ontology learning and text mining.
Moch Arif Bijaksana received the master’s
degree from RMIT University, Australia. He is
currently working toward the PhD degree in the
School of Electrical Engineering and Computer
Science, Queensland University of Technology,
Australia. He is working at Telkom University,
Indonesia. His research interests include text classification and knowledge discovery.
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.