arXiv:cs/0312018v1 [cs.IR] 11 Dec 2003
Mapping Subsets of Scholarly Information
Paul Ginsparg*, Paul Houle, Thorsten Joachims, and Jae-Hoon Sul
(Cornell University, Ithaca, NY 14853, USA)
Abstract
We illustrate the use of machine learning techniques to analyze, structure, maintain,
and evolve a large online corpus of academic literature. An emerging field of research
can be identified as part of an existing corpus, permitting the implementation of a
more coherent community structure for its practitioners.
1 Background
The arXiv^1 is an automated repository of over 250,000 full-text research articles^2 in physics
and related disciplines, going back over a decade and growing at a rate of 40,000 new
submissions per year. It serves over 10 million requests per month [Ginsparg, 2003], including
tens of thousands of search queries per day, and over 20 million full text downloads during
calendar year '02. It is a significant example of a Web-based service that has changed the
practice of research in a major scientific discipline. It provides nearly comprehensive
coverage of large areas of physics, and serves as an on-line seminar system for those areas. It also
provides a significant resource for model building and algorithmic experiments in mapping
scholarly domains. Together with the SLAC SPIRES-HEP database^3, it provides a public
resource of full-text articles and associated citation tree of many millions of links, with a
focused disciplinary coverage, and rich usage data.
In what follows, we will use arXiv data to illustrate how machine learning methods can be
used to analyze, structure, maintain, and evolve a large online corpus of academic literature.
The specific application will be to train a support vector machine text classifier to extract
an emerging research area from a larger-scale resource. The automated detection of such
subunits can play an important role in disentangling other sub-networks and associated sub-
communities from the global network. Our notion of “mapping” here is in the mathematical
sense of associating attributes to objects, rather than in the sense of visualization tools.
While the former underlies the latter, we expect the two will increasingly go hand-in-hand
(for a recent review, see [Börner et al., 2003]).
* Presenter of invited talk at Arthur M. Sackler Colloquium on "Mapping Knowledge Domains", 9–11
May 2003, Beckman Center of the National Academy of Sciences, Irvine, CA
^1 See http://arXiv.org/ . For general background, see [Ginsparg, 2001].
^2 as of mid-Oct 2003
^3 The Stanford Linear Accelerator Center SPIRES-HEP database has comprehensively catalogued the
High Energy Particle Physics (HEP) literature online since 1974, and indexes more than 500,000 high-energy
physics related articles including their full citation tree (see [O'Connell, 2002]).
[Figure 1 shows a sample newsgroup posting (From: xxx@sciences.sdsu.edu, Newsgroups: comp.graphics, Subject: Need specs on Apple QT, asking for QuickTime specs usable on a Unix or MS-DOS system) alongside its term-frequency feature vector over lexicon words such as "graphics" (0), "baseball" (3), "specs" (0), "references" (1), "hockey" (0), "car" (0), "clinton" (0), "unix" (1), "space" (0), "quicktime" (2), "computer" (0), ...]
Figure 1: Representing text as a feature vector.
2 Text Classification
The goal of text classification is the automatic assignment of documents to a fixed number
of semantic categories. In the "multi-label" setting, each document can belong to zero,
one, or more categories. Efficient automated techniques are essential to avoid tedious and expensive
manual category assignment for large document sets. A "knowledge engineering" approach,
involving hand-crafting accurate text classification rules, is surprisingly difficult and time-
consuming [Hayes and Weinstein, 1990]. We therefore take a machine learning approach to
generating text classification rules automatically from examples.
The machine learning approach can be phrased as a supervised learning problem. The
learning task is represented by the training sample of n documents

    S_n = ((x_1, y_1), (x_2, y_2), . . . , (x_n, y_n))   (1)

where x_i represents the document content. In the multi-label setting,
each category label is treated as a separate binary classification problem. For each such
binary task, yi ∈ {−1, +1} indicates whether a document belongs to a particular class. The
task of the learning algorithm L is to find a decision rule h_L : x → {−1, +1} based on S_n
that classifies new documents x as accurately as possible.
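The one-versus-rest reduction above can be sketched as follows; the toy corpus and category names here are purely illustrative:

```python
# Reduce a multi-label corpus to one binary task per category:
# y_i = +1 if the document carries the category label, -1 otherwise.
docs = [
    ("protein folding dynamics", {"q-bio", "cond-mat"}),
    ("spin glass relaxation",    {"cond-mat"}),
    ("gene network inference",   {"q-bio"}),
]

def binary_task(category):
    """Return (x_i, y_i) training pairs, y_i in {-1, +1}, for one category."""
    return [(text, +1 if category in labels else -1)
            for text, labels in docs]
```

Each such binary sample is then handed to the learning algorithm independently.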
Documents need to be transformed into a representation suitable for the learning algo-
rithm and the classification task. Information Retrieval research suggests that words work
well as representation units, and that for many tasks their ordering can be ignored without
losing too much information. This type of representation is commonly called the “bag-of-
words” model, an attribute-value representation of text. Each text document is represented
by a vector in the lexicon space, i.e., by a “term frequency” feature vector TF(wi, x), with
component values equal to the number of times each distinct word wi in the corpus occurs
in the document x. Fig. 1 shows an example feature vector for a particular document.
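A minimal sketch of this bag-of-words mapping, using an illustrative two-document corpus:

```python
# Build the lexicon from the corpus, then map a document to its
# term-frequency vector TF(w_i, x), as in Fig. 1.
from collections import Counter

corpus = [
    "need the specs for quicktime on a unix system",
    "hockey and baseball scores",
]
lexicon = sorted({w for doc in corpus for w in doc.split()})

def tf_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in lexicon]  # one component per lexicon word

vec = tf_vector(corpus[0])
```

Word order is discarded; only per-word counts survive in the vector.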
This basic representation is ordinarily refined in a few ways:
TF×IDF Weighting: Scaling the components of the feature vector with their inverse doc-
ument frequency IDF(wi) [Salton and Buckley, 1988] often leads to improved perfor-
mance. In general, IDF(w_i) is some decreasing function of the document frequency DF(w_i),
equal to the number of documents in the corpus that contain the word w_i. For example,

    IDF(w_i) = log( n / DF(w_i) )   (2)

where n is the total number of documents. Intuitively, the inverse document frequency
assumes that rarer terms have more significance for classification purposes, and hence
gives them greater weight. To compensate for the effect of different document lengths,
each document feature vector xi is normalized to unit length: ||xi|| = 1.
Stemming: Instead of treating each occurrence form of a word as a different feature, stem-
ming is used to project the different forms of a word onto a single feature, the word
stem, by removing inflection information [Porter, 1980]. For example “computes”,
“computing”, and “computer” are all mapped to the same stem “comput”. The terms
“word” and “word stem” will be used synonymously in the following.
Stopword Removal: For many classification tasks, common words like “the”, “and”, or
“he” do not help discriminate between document classes. Stopword removal describes
the process of eliminating such words from the document by matching against a pre-
defined list of stop-words. We use a standard stoplist of roughly 300 words.
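The three refinements above can be sketched together; the tiny corpus and stoplist are illustrative stand-ins for the real lexicon and ~300-word stoplist:

```python
# TF×IDF weighting per eq. (2), stopword removal, and unit-length
# normalization of each document vector.
import math
from collections import Counter

stopwords = {"the", "and", "he", "a", "of"}
corpus = [
    "the structure of the protein",
    "the spin glass and the polymer",
    "protein folding dynamics",
]

def tokens(doc):
    return [w for w in doc.split() if w not in stopwords]

n = len(corpus)
df = Counter(w for doc in corpus for w in set(tokens(doc)))  # DF(w_i)

def tfidf(doc):
    tf = Counter(tokens(doc))
    vec = {w: c * math.log(n / df[w]) for w, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}  # ||x_i|| = 1

v = tfidf(corpus[0])
```

Stemming is omitted here for brevity; in practice each token would first be mapped to its Porter stem.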
3 Support Vector Machines
SVMs [Vapnik, 1998] were developed by V. Vapnik et al. based on the structural risk mini-
mization principle from statistical learning theory. They have proven to be a highly effective
method for learning text classification rules, achieving state-of-the-art performance on a
broad range of tasks [Joachims, 1998, Dumais et al., 1998]. Two main advantages of using
SVMs for text classification lie in their ability to handle the high dimensional feature spaces
arising from the bag-of-words representation. From a statistical perspective, they are robust
to overfitting and are well suited for the statistical properties of text. From a computational
perspective, they can be trained efficiently despite the large number of features. A detailed
overview of the SVM approach to text classification, with more details on the notation used
below, is given in [Joachims, 2002].
In their basic form, SVMs learn linear decision rules
    h(x) = sgn( w · x + b ) ,   (3)

described by a weight vector w and a threshold b, from an input sample of n training examples
S_n = ((x_1, y_1), . . . , (x_n, y_n)), with x_i ∈ R^N and y_i ∈ {−1, +1}. For a linearly separable S_n, the SVM
finds the hyperplane with maximum Euclidean distance to the closest training examples.
This distance is called the margin δ, as depicted in Fig. 2. Geometrically, the hyperplane
is defined by its normal vector, w, and its distance from the origin, −b. For non-separable
training sets, the amount of training error is measured using slack variables ξi.
Computing the position of the hyperplane is equivalent to solving the following convex
quadratic optimization problem [Vapnik, 1998]:
Figure 2: A binary classification problem (+ vs. −) in two dimensions. The hyperplane h∗
separates positive and negative training examples with maximum margin δ. The examples
closest to the hyperplane are called support vectors (marked with circles).
Optimization Problem 1 (SVM (primal))

    minimize:   V(w, b, ξ) = (1/2) w · w + C Σ_{i=1}^{n} ξ_i   (4)
    subj. to:   ∀ i = 1, ..., n :  y_i [w · x_i + b] ≥ 1 − ξ_i   (5)
                ∀ i = 1, ..., n :  ξ_i ≥ 0   (6)
The margin of the resulting hyperplane is δ = 1/||w||.
The constraints (5) require that all training examples are classified correctly up to some
slack ξ_i. If a training example lies on the "wrong" side of the hyperplane, we have the
corresponding ξ_i ≥ 1, and thus Σ_{i=1}^{n} ξ_i is an upper bound on the number of training errors.
The factor C in (4) is a parameter that allows trading off training error vs. model complexity.
The optimal value of this parameter depends on the particular classification task and must be
chosen via cross-validation or by some other model selection strategy. For text classification,
however, the default value of C = 1/max_i ||x_i||^2 = 1 has proven to be effective over a large
range of tasks [Joachims, 2002].
OP1 has an equivalent dual formulation:
Optimization Problem 2 (SVM (dual))

    maximize:   W(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} y_i y_j α_i α_j (x_i · x_j)   (7)
    subj. to:   Σ_{i=1}^{n} y_i α_i = 0   (8)
                ∀ i ∈ [1..n] :  0 ≤ α_i ≤ C   (9)

From the solution of the dual, the classification rule solution can be constructed as

    w = Σ_{i=1}^{n} α_i y_i x_i   and   b = y_usv − w · x_usv ,   (10)
where (x_usv, y_usv) is some training example with 0 < α_usv < C. For the experiments in this
paper, SVMLight [Joachims, 2002] is used for solving the dual optimization problem.^4 More
detailed introductions to SVMs can be found in [Burges, 1998, Cristianini and Shawe-Taylor, 2000].
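The construction in eq. (10) can be illustrated on a toy 2-D problem. The dual variables below are hand-picked to satisfy the dual constraints for this particular data (they are not the output of a real solver):

```python
# Given a dual solution alpha for a linearly separable toy set, recover
# w and b via eq. (10), then classify with h(x) = sgn(w . x + b), eq. (3).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 1.0)]
y = [+1, +1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]   # nonzero only on the support vectors
C = 1.0

# w = sum_i alpha_i y_i x_i
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

# b = y_usv - w . x_usv for some example with 0 < alpha_usv < C
usv = next(i for i, a in enumerate(alpha) if 0 < a < C)
b = y[usv] - sum(wd * xd for wd, xd in zip(w, X[usv]))

def h(x):
    return 1 if sum(wd * xd for wd, xd in zip(w, x)) + b >= 0 else -1
```

Here w = (0.5, 0.5) and b = −1, so the separating hyperplane is x1 + x2 = 2 with margin δ = 1/||w|| ≈ 1.41.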
An alternative to the approach here would be to use Latent Semantic Analysis (LSA/LSI)
[Landauer and Dumais, 1997] to generate the feature vectors. LSA can potentially give bet-
ter recall by capturing partial synonymy in the word representation, i.e., by bundling related
words into a single feature. The reduction in dimension can also be computationally efficient,
facilitating capture of the full text content of papers, including their citation information.
On the other hand, the SVM already scales well to a large feature space, and performs well
at determining relevant content through suitable choices of weights. On a sufficiently large
training set, the SVM might even benefit from access to fine distinctions between words
potentially obscured by the LSA bundling. It is thus an open question, well worth further
investigation, as to whether LSA applied to full text could improve performance of the SVM
in this application, without loss of computational efficiency. Generative models for text
classification provide yet another alternative approach, in which each document is viewed as
a mixture of topics, as in the statistical inference algorithm for Latent Dirichlet Allocation
used in [Griffiths and Steyvers, 2003]. This approach can in principle provide significant ad-
ditional information about document content, but does not scale well to a large document
corpus, so the real-time applications intended here would not yet be computationally feasible
in that framework.
4 arXiv Benchmarks
Before using the machine learning framework to identify new subject area content, we first
assessed its performance on the existing (author-provided) category classifications. Roughly
180,000 titles and abstracts were fed to model building software which constructed a lexi-
con of roughly 100,000 distinct words and produced training files containing the TF×IDF
document vectors for SVMLight. (While the SVM machinery could easily be used to analyze
the full document content, previous experiments [Joachims, 2002] suggest that well-written
titles and abstracts provide a highly focused characterization of content at least as effective
for our document classification purposes.) The set of support vectors and weight parameters
output by SVMLight was converted into a form specific to the linear SVM, eq. (3): a weight
vector wc and a threshold bc, where c is an index over the categories.
As seen in Fig. 3, the success of the SVM in classifying documents improves as the
size of a category increases. The SVM is remarkably successful at identifying documents in
large (> 10,000 documents) categories and less successful on smaller subject areas (< 500
documents). A cutoff was imposed to exclude subject areas with fewer than 100 documents.^5
In experiments with small categories (N < 1000), the SVM consistently chose thresholds
that resulted in unacceptably low recall (i.e., missed documents relevant to the category).
This is in part because the null hypothesis, that no documents are in the category, describes
^4 SVMLight is available at http://svmlight.joachims.org/
^5 Some of the smaller subject areas are known to be less topically focused, so the difficulty in recall, based
solely on title/abstract terminology, was expected.
[Figure 3 plots Recall (left panel) and Precision (right panel), each from 0 to 1, against the number of documents in a category on a logarithmic scale from 100 to 10,000.]
Figure 3: Recall and Precision as functions of category size for 78 arXiv major categories
and minor subject classes. 2/3 of the sample was used as a training set and 1/3 as a test
set. The four largest categories, each with over 30,000 documents are cond-mat, astro-ph,
hep-th, and hep-ph.
the data well for a small category: e.g., if only 1000 of 100,000 documents are in a category,
the null hypothesis has a 99% accuracy.^6 To counteract the bias against small categories,
10% of the training data was used as a validation set to fit a sigmoid probability distribution
1/(1+exp(Af+B)) [Platt, 1999]. This converts the dot product xi · wc between document
vector and weight vector into a probability P(i ∈ c | xi) that document i is a member of
category c, given its feature vector xi. Use of the probability in place of the uncalibrated
signed distance from the hyperplane output by the SVM permitted greater levels of recall
for small categories.
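The calibration step can be sketched as follows. Platt's sigmoid 1/(1 + exp(A f + B)) is fit on held-out (margin, label) pairs by minimizing cross-entropy; the validation data and learning rate below are illustrative:

```python
# Fit the Platt-scaling sigmoid on a validation set of
# (signed distance f = x . w_c + b_c, in-category indicator) pairs,
# by batch gradient descent on the cross-entropy loss.
import math

val = [(-2.1, 0), (-0.7, 0), (-0.2, 1), (0.4, 1), (1.3, 1)]

A, B = -1.0, 0.0
for _ in range(2000):
    gA = gB = 0.0
    for f, t in val:
        p = 1.0 / (1.0 + math.exp(A * f + B))
        gA += (p - t) * (-f)    # d(cross-entropy)/dA
        gB += (p - t) * (-1.0)  # d(cross-entropy)/dB
    A -= 0.1 * gA
    B -= 0.1 * gB

def prob(f):
    """P(document in category | margin f)."""
    return 1.0 / (1.0 + math.exp(A * f + B))
```

Thresholding prob(f) rather than the raw signed distance is what permits the higher recall on small categories.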
Other experiments showed that the use of TF×IDF weighting as in eq. (2) improved
accuracy consistently over pure TF weighting, so TF×IDF weighting was used in the exper-
iments to follow. We also used a document frequency threshold to exclude rare words from
the lexicon, but found little difference in accuracy between using a document occurrence
threshold of two and five.^7 Increasing the weight of title words with respect to abstract
words, on the other hand, consistently worsened accuracy, indicating that words in a well-
written abstract carry at least as much classification-relevant content as words in the title.
Changes in the word tokenization and stemming algorithms did not have a significant impact
on overall accuracy. The default value of C = 1 in eq. (9) was preferred.
^6 "Accuracy" is the percentage of documents correctly classified. We also employ other common terminology
from information retrieval: "precision" is the fraction of the documents retrieved that are relevant to
a query, and "recall" is the fraction of all relevant documents in the collection that are retrieved.
^7 Words that appeared in fewer than two documents constituted roughly 50% of the lexicon, and those
that appeared in fewer than five documents roughly 70%. Ignoring rare and consequently uninformative
words hence reduces the computational needs.
5 q-bio Extraction
There has been recent anecdotal evidence of an intellectual trend among physicists towards
work in biology, ranging from biomolecules, molecular pathways and networks, gene expres-
sion, cellular and multicellular systems to population dynamics and evolution. This work
has appeared in separate parts of the archive, particularly under "Soft Condensed Matter",
"Statistical Mechanics", "Disordered Systems and Neural Networks", "Biophysics", and
"Adaptive and Self-Organizing Systems" (abbreviated cond-mat.soft, cond-mat.stat-mech,
cond-mat.dis-nn, physics.bio-ph, and nlin.AO). A more coherent forum for the exchange
of these ideas was requested, under the nomenclature “Quantitative Biology” (abbreviated
“q-bio”).
To identify first whether there was indeed a real trend to nurture and amplify, and to
create a training set, volunteers were enlisted to identify the q-bio content from the above
subject areas in which it was most highly focused. Of 5565 such articles received from Jan
2002 through Mar 2003, 466 (8.4%) were found to have one of the above biological topics as
their primary focus. The total number of distinct words in these titles, abstracts, plus author
names, was 23558, of which 7984 were above the DF = 2 document frequency threshold.
(Author names were included in the analysis since they have potential “semantic” content
in this context, i.e., are potentially useful for document classification. The SVM algorithm
will automatically determine whether or not to use the information by choosing suitable
weights.)
A data-cleaning procedure was employed, in which SVMLight was first run with C = 10.
We recall from eqs. (4) and (9) that larger C penalizes training errors and requires
larger α’s to fit the data. Inspecting the “outlier” documents [Guyon et al., 1996] with the
largest |α_i| then permitted manual cleaning of the training set. Ten documents were moved
into q-bio and 15 moved out, for a net movement to 461 q-bio documents (8.3%) of the 5565
total. Some of the others flagged involved word confusions, e.g., "genetic algorithms" typically
involved programming rather than biological applications. Other "q-bio" words with frequent
non-biological senses were "epidemic" (used for rumor propagation), "evolution" (used also for
dynamics of sandpiles), "survival probabilities", and "extinction". "Scale-free networks" were
sometimes used for biological applications, and sometimes not. To help resolve some of these
ambiguities, the vocabulary was enlarged to include a list of the most frequently used two-word
phrases with semantic content different from that of their constituent words.
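Collecting frequent two-word phrases as single vocabulary features can be sketched as follows; the corpus and the frequency cutoff of two are illustrative:

```python
# Count adjacent word pairs across the corpus and keep the frequent
# ones as phrase features (e.g. "scale free" as one vocabulary entry).
from collections import Counter

abstracts = [
    "scale free networks in protein interaction maps",
    "scale free networks of citations",
    "genetic algorithms for scheduling",
]

bigrams = Counter()
for text in abstracts:
    words = text.split()
    bigrams.update(zip(words, words[1:]))

phrases = {" ".join(b) for b, c in bigrams.items() if c >= 2}
```

A phrase feature lets the classifier weight "scale free networks" differently from the standalone words "scale" and "free".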
With a training set fully representative of the categories in question, it was then possible
to run the classifier on the entirety of the same subject area content received from 1992
through Apr 2003, a total of 28,830 documents. The results are shown in Fig. 4. A total
of 1649 q-bio documents was identified, and the trend towards an increasing percentage of
q-bio activity among these arXiv users is evident: individual authors can be tracked as they
migrate into the domain. Visibility of the new domain can be further enhanced by referring
submitters in real time, via the automated classifier running behind the submission interface.
Some components of the weight vector generated by the SVM for this training set are
shown in Fig. 5. Since the distance of a document to the classifying hyperplane is determined
by taking the dot product with the normal vector, its component values can be interpreted
as the classifying weight for the associated words. The approach here illustrates one of the
[Figure 4 plots #submissions/year, from 0 to 4500, for cond-mat.(soft,stat-mech,dis-nn) + physics.bio-ph + nlin.AO, with the q-bio fraction marked per year: 0.7% (1992), 1.1% (1993), 1.4% (1994), 1.6% (1995), 2.6% (1996), 5.1% (1997), 5.5% (1998), 7.0% (1999), 6.5% (2000), 7.5% (2001), 8.0% (2002), 8.8% (2003).]
Figure 4: The number of submissions per year from 1992 through April 2003 in particular
subsets of arXiv subject areas of cond-mat, physics, and nlin most likely to have “quantitative
biology” content. The percentage of q-bio content in these areas grew from roughly 1% to
nearly 10% during this timeframe, suggesting a change in intellectual activity among arXiv-
using members of these communities.
[Figure 5 data, components of the q-bio classifying weight vector:
Most positive: protein +8.57, dna +7.08, biological +5.06, neuron +5.05, rna +3.30, gene +3.21, mutation +3.15, population +3.11, epidemic +3.05, biology +3.02, disease +2.93, cell +2.90, neural +2.89, brain +2.83, ecosystem +2.56, tissue +2.52, sequence +2.51, genetic +2.51, bacterial +2.48, blood +2.43, genome +2.37, peptide +2.37, infection +2.34, ...
Near zero: forward, minimalist, region, confinement, implies, 96, y togashi, n wingreen, mean free, narrower (+0.00); shot, repton, kyoto, regular, generalisation, d saakian, conformity, aware, even though, practitioner, permittivity (−0.00), ...
Most negative: point −1.04, equation −1.05, boundary −1.06, social −1.06, n −1.09, relaxation −1.14, fluid −1.15, indian −1.15, spin −1.17, spin glass −1.17, traffic −1.18, system −1.30, polymer −1.33, class −1.35, emerge −1.36, gradient −1.39, quantum −1.43, surface −1.43, synchronization −1.45, market −1.47, particle −1.52, polyelectrolyte −1.53, world −1.57.]
Figure 5: Shown above are the most positive, a few intermediate, and most negative com-
ponents of the q-bio classifying weight vector.
major lessons of the past decade: the surprising power of simple algorithms operating on
large datasets.
While implemented as a passive dissemination system, the arXiv has also played a social
engineering role, with active research users developing an affinity to the system and adjusting
their behavior accordingly. They scan new submissions on a daily basis, assume others in
their field do so and are consequently aware of anything relevant that has appeared there
(while anything that hasn't appeared there may as well not exist), and use it to stake intellectual priority
claims in advance of journal publication. We see further that machine learning tools can
characterize a subdomain and thereby help accelerate its growth, via the interaction of an
information resource with the social system of its practitioners.^8
Postscript: The q-bio extraction described in section 5 was not just a thought experiment,
but a prelude to an engineering experiment. The new category went on-line in mid-September
2003,^9 at which time past submitters flagged by the system were asked to confirm the q-bio
content of their submissions, and to send future submissions to the new category. The activity
levels at the outset corresponded precisely to the predictions of the SVM text classifier, and
later began to show indications of growth catalyzed by the public existence of the new
category.
References
[Börner et al., 2003] Börner, K., Chen, C., and Boyack, K. (2003). Visualizing knowledge
domains. Annual Review of Information Science & Technology, 37:179–255.
[Burges, 1998] Burges, C. (1998). A tutorial on support vector machines for pattern recog-
nition. Data Mining and Knowledge Discovery, 2(2):121–167.
[Cristianini and Shawe-Taylor, 2000] Cristianini, N. and Shawe-Taylor, J. (2000). An Intro-
duction to Support Vector Machines and Other Kernel-Based Learning Methods. Cam-
bridge University Press.
[Dumais et al., 1998] Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998). In-
ductive learning algorithms and representations for text categorization. In Proceedings of
ACM-CIKM98.
[Ginsparg, 2001] Ginsparg, P. (2001). Creating a global knowledge network. In Elec-
tronic Publishing in Science II (proceedings of joint ICSU Press/UNESCO confer-
ence, Paris, France). ICSU Press. http://users.ox.ac.uk/~icsuinfo/ginspargfin.htm (copy
at http://arXiv.org/blurb/pg01unesco.html).
[Ginsparg, 2003] Ginsparg, P. (2003). Can peer review be better focused? Science & Tech-
nology Libraries, pages to appear, copy at http://arXiv.org/blurb/pg02pr.html.
^8 Some other implications of these methods, including potential modification of the peer review system,
are considered elsewhere [Ginsparg, 2003].
^9 See http://arXiv.org/new/q-bio.html
[Griffiths and Steyvers, 2003] Griffiths, T. and Steyvers, M. (2003). Finding scientific topics.
In Arthur M. Sackler Colloquium on "Mapping Knowledge Domains". Proceedings of the
National Academy of Sciences, in press.
[Guyon et al., 1996] Guyon, I., Matic, N., and Vapnik, V. (1996). Discovering informative
patterns and data cleaning. In Advances in Knowledge Discovery and Data Mining, pages
181–203.
[Hayes and Weinstein, 1990] Hayes, P. and Weinstein, S. (1990). CONSTRUE/TIS: a sys-
tem for content-based indexing of a database of news stories. In Annual Conference on
Innovative Applications of AI.
[Joachims, 1998] Joachims, T. (1998). Text categorization with support vector machines:
Learning with many relevant features. In Proceedings of the European Conference on
Machine Learning, pages 137 – 142, Berlin. Springer.
[Joachims, 2002] Joachims, T. (2002). Learning to Classify Text Using Support Vector Ma-
chines – Methods, Theory, and Algorithms. Kluwer.
[Landauer and Dumais, 1997] Landauer, T. K. and Dumais, S. T. (1997). A solution to
Plato’s problem: The latent semantic analysis theory of acquisition, induction and repre-
sentation of knowledge. Psychological Review, 104:211–240.
[O’Connell, 2002] O’Connell, H. B. (2002). Physicists thriving with paperless publishing.
HEP Lib. Web., 6:3, arXiv:physics/0007040.
[Platt, 1999] Platt, J. (1999). Probabilities for SV machines. In Advances in Large Margin
Classifiers, pages 61–74. MIT Press.
[Porter, 1980] Porter, M. (1980). An algorithm for suffix stripping. Program (Automated
Library and Information Systems), 14(3):130–137.
[Salton and Buckley, 1988] Salton, G. and Buckley, C. (1988). Term weighting approaches
in automatic text retrieval. Information Processing and Management, 24(5):513–523.
[Vapnik, 1998] Vapnik, V. (1998). Statistical Learning Theory. Wiley, Chichester, GB.
Comparision of methods for combination of multiple classifiers that predict b...IJERA Editor
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
 
Fuzzy formal concept analysis: Approaches, applications and issues
Fuzzy formal concept analysis: Approaches, applications and issuesFuzzy formal concept analysis: Approaches, applications and issues
Fuzzy formal concept analysis: Approaches, applications and issuesCSITiaesprime
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...irjes
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextFulvio Rotella
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextUniversity of Bari (Italy)
 
MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...
MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...
MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...sherinmm
 
Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...
Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...
Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...sherinmm
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSijseajournal
 
Max stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problemMax stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problemnooriasukmaningtyas
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...Sherin Mathews
 

Semelhante a Mapping Emerging Research Areas Using Machine Learning (20)

TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
Text categorization
Text categorizationText categorization
Text categorization
 
LEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfLEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdf
 
Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...Comparision of methods for combination of multiple classifiers that predict b...
Comparision of methods for combination of multiple classifiers that predict b...
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
Fuzzy formal concept analysis: Approaches, applications and issues
Fuzzy formal concept analysis: Approaches, applications and issuesFuzzy formal concept analysis: Approaches, applications and issues
Fuzzy formal concept analysis: Approaches, applications and issues
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 
MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...
MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...
MAXIMUM CORRENTROPY BASED DICTIONARY LEARNING FOR PHYSICAL ACTIVITY RECOGNITI...
 
Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...
Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...
Maximum Correntropy Based Dictionary Learning Framework for Physical Activity...
 
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
 
Max stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problemMax stable set problem to found the initial centroids in clustering problem
Max stable set problem to found the initial centroids in clustering problem
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
 

Mais de Paul Houle

Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Paul Houle
 
Estimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development ProcessEstimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development ProcessPaul Houle
 
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...Paul Houle
 
Fixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI SystemFixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI SystemPaul Houle
 
Cisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart DataCisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart DataPaul Houle
 
Making the semantic web work
Making the semantic web workMaking the semantic web work
Making the semantic web workPaul Houle
 
Ontology2 platform
Ontology2 platformOntology2 platform
Ontology2 platformPaul Houle
 
Ontology2 Platform Evolution
Ontology2 Platform EvolutionOntology2 Platform Evolution
Ontology2 Platform EvolutionPaul Houle
 
Paul houle the supermen
Paul houle   the supermenPaul houle   the supermen
Paul houle the supermenPaul Houle
 
Paul houle what ails enterprise search
Paul houle   what ails enterprise search Paul houle   what ails enterprise search
Paul houle what ails enterprise search Paul Houle
 
Subjective Importance Smackdown
Subjective Importance SmackdownSubjective Importance Smackdown
Subjective Importance SmackdownPaul Houle
 
Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Paul Houle
 
Dropping unique constraints in sql server
Dropping unique constraints in sql serverDropping unique constraints in sql server
Dropping unique constraints in sql serverPaul Houle
 
Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#Paul Houle
 
Paul houle resume
Paul houle resumePaul houle resume
Paul houle resumePaul Houle
 
Keeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacksKeeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacksPaul Houle
 
Embrace dynamic PHP
Embrace dynamic PHPEmbrace dynamic PHP
Embrace dynamic PHPPaul Houle
 
Once asynchronous, always asynchronous
Once asynchronous, always asynchronousOnce asynchronous, always asynchronous
Once asynchronous, always asynchronousPaul Houle
 
What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?Paul Houle
 
Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Paul Houle
 

Mais de Paul Houle (20)

Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6
 
Estimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development ProcessEstimating the Software Product Value during the Development Process
Estimating the Software Product Value during the Development Process
 
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
Universal Standards for LEI and other Corporate Reference Data: Enabling risk...
 
Fixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI SystemFixing a leaky bucket; Observations on the Global LEI System
Fixing a leaky bucket; Observations on the Global LEI System
 
Cisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart DataCisco Fog Strategy For Big and Smart Data
Cisco Fog Strategy For Big and Smart Data
 
Making the semantic web work
Making the semantic web workMaking the semantic web work
Making the semantic web work
 
Ontology2 platform
Ontology2 platformOntology2 platform
Ontology2 platform
 
Ontology2 Platform Evolution
Ontology2 Platform EvolutionOntology2 Platform Evolution
Ontology2 Platform Evolution
 
Paul houle the supermen
Paul houle   the supermenPaul houle   the supermen
Paul houle the supermen
 
Paul houle what ails enterprise search
Paul houle   what ails enterprise search Paul houle   what ails enterprise search
Paul houle what ails enterprise search
 
Subjective Importance Smackdown
Subjective Importance SmackdownSubjective Importance Smackdown
Subjective Importance Smackdown
 
Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#
 
Dropping unique constraints in sql server
Dropping unique constraints in sql serverDropping unique constraints in sql server
Dropping unique constraints in sql server
 
Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#Prefix casting versus as-casting in c#
Prefix casting versus as-casting in c#
 
Paul houle resume
Paul houle resumePaul houle resume
Paul houle resume
 
Keeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacksKeeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacks
 
Embrace dynamic PHP
Embrace dynamic PHPEmbrace dynamic PHP
Embrace dynamic PHP
 
Once asynchronous, always asynchronous
Once asynchronous, always asynchronousOnce asynchronous, always asynchronous
Once asynchronous, always asynchronous
 
What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?
 
Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#
 

Último

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Último (20)

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

communities from the global network.
Our notion of "mapping" here is in the mathematical sense of associating attributes to objects, rather than in the sense of visualization tools. While the former underlies the latter, we expect the two will increasingly go hand in hand (for a recent review, see [Börner et al., 2003]).

∗ Presenter of invited talk at the Arthur M. Sackler Colloquium on "Mapping Knowledge Domains", 9–11 May 2003, Beckman Center of the National Academy of Sciences, Irvine, CA.
1 See http://arXiv.org/ . For general background, see [Ginsparg, 2001].
2 As of mid-October 2003.
3 The Stanford Linear Accelerator Center SPIRES-HEP database has comprehensively catalogued the High Energy Particle Physics (HEP) literature online since 1974, and indexes more than 500,000 high-energy physics related articles including their full citation tree (see [O'Connell, 2002]).
[Figure 1: Representing text as a feature vector. A Usenet post from comp.graphics requesting QuickTime specs is mapped to word counts over features such as "graphics", "specs", "references", "unix", and "computer".]

2 Text Classification

The goal of text classification is the automatic assignment of documents to a fixed number of semantic categories. In the "multi-label" setting, each document can be in zero, one, or more categories. Efficient automated techniques are essential to avoid tedious and expensive manual category assignment for large document sets. A "knowledge engineering" approach, involving hand-crafting accurate text classification rules, is surprisingly difficult and time-consuming [Hayes and Weinstein, 1990]. We therefore take a machine learning approach to generating text classification rules automatically from examples.

The machine learning approach can be phrased as a supervised learning problem. The learning task is represented by the training sample

    S_n = ((x_1, y_1), (x_2, y_2), . . . , (x_n, y_n))    (1)

of n documents, where x_i represents the document content. In the multi-label setting, each category label is treated as a separate binary classification problem. For each such binary task, y_i ∈ {−1, +1} indicates whether a document belongs to a particular class. The task of the learning algorithm L is to find a decision rule h_L : x → {−1, +1} based on S_n that classifies new documents x as accurately as possible.

Documents need to be transformed into a representation suitable for the learning algorithm and the classification task.
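The representation step of Fig. 1 can be illustrated in a few lines of standard-library Python. This is a minimal sketch of the idea, not the paper's actual tooling; the function name and the sparse dict representation are our own choices:

```python
from collections import Counter

def term_frequency_vector(document):
    """Map a document (string) to its bag-of-words term-frequency
    vector, represented sparsely as {word: count}, as in Fig. 1."""
    return Counter(document.lower().split())
```

For the post in Fig. 1, `term_frequency_vector("Need specs on Apple QT I need to get the specs")` counts "need" and "specs" twice each and the remaining words once, ignoring word order entirely.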
Information Retrieval research suggests that words work well as representation units, and that for many tasks their ordering can be ignored without losing too much information. This type of representation is commonly called the "bag-of-words" model, an attribute-value representation of text. Each text document is represented by a vector in the lexicon space, i.e., by a "term frequency" feature vector TF(w_i, x), with component values equal to the number of times each distinct word w_i in the corpus occurs in the document x. Fig. 1 shows an example feature vector for a particular document.

This basic representation is ordinarily refined in a few ways:

TF×IDF Weighting: Scaling the components of the feature vector with their inverse document frequency IDF(w_i) [Salton and Buckley, 1988] often leads to improved performance. In general, IDF(w_i) is some decreasing function of the word frequency DF(w_i),
equal to the number of documents in the corpus which contain the word w_i. For example,

    IDF(w_i) = log( n / DF(w_i) )    (2)

where n is the total number of documents. Intuitively, the inverse document frequency assumes that rarer terms have more significance for classification purposes, and hence gives them greater weight. To compensate for the effect of different document lengths, each document feature vector x_i is normalized to unit length: ||x_i|| = 1.

Stemming: Instead of treating each occurrence form of a word as a different feature, stemming is used to project the different forms of a word onto a single feature, the word stem, by removing inflection information [Porter, 1980]. For example, "computes", "computing", and "computer" are all mapped to the same stem "comput". The terms "word" and "word stem" will be used synonymously in the following.

Stopword Removal: For many classification tasks, common words like "the", "and", or "he" do not help discriminate between document classes. Stopword removal describes the process of eliminating such words from the document by matching against a predefined list of stop-words. We use a standard stoplist of roughly 300 words.

3 Support Vector Machines

SVMs [Vapnik, 1998] were developed by V. Vapnik et al. based on the structural risk minimization principle from statistical learning theory. They have proven to be a highly effective method for learning text classification rules, achieving state-of-the-art performance on a broad range of tasks [Joachims, 1998, Dumais et al., 1998]. Two main advantages of using SVMs for text classification lie in their ability to handle the high-dimensional feature spaces arising from the bag-of-words representation. From a statistical perspective, they are robust to overfitting and are well suited for the statistical properties of text. From a computational perspective, they can be trained efficiently despite the large number of features.
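The three refinements above (TF×IDF weighting with unit-length normalization as in eq. (2), stemming, and stopword removal) can be sketched together in standard-library Python. The stemmer here is a deliberately crude suffix-stripper standing in for the Porter algorithm, and the stoplist is a tiny sample of a real ~300-word list; both are our own illustrative simplifications:

```python
import math
from collections import Counter

STOPWORDS = {"the", "and", "he", "a", "an", "of", "to", "in", "on"}  # tiny sample list

def crude_stem(word):
    """Very rough suffix stripping, a toy stand-in for the Porter stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stopwords, and stem the remaining tokens."""
    return [crude_stem(w) for w in text.lower().split() if w not in STOPWORDS]

def tfidf_vectors(token_docs):
    """Unit-normalized TF*IDF vectors ({word: weight}) over a corpus of
    tokenized documents, with IDF(w) = log(n / DF(w)) as in eq. (2)."""
    n = len(token_docs)
    df = Counter()
    for doc in token_docs:
        df.update(set(doc))          # document frequency counts each doc once
    vectors = []
    for doc in token_docs:
        tf = Counter(doc)
        vec = {w: tf[w] * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm else vec)
    return vectors
```

Note that a word occurring in every document gets IDF = log(1) = 0 and drops out of the representation, matching the intuition that ubiquitous terms carry no classification signal.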
A detailed overview of the SVM approach to text classification, with more details on the notation used below, is given in [Joachims, 2002]. In their basic form, SVMs learn linear decision rules

    h(x) = sgn{w · x + b} ,    (3)

described by a weight vector w and a threshold b, from an input sample of n training examples S_n = ((x_1, y_1), . . . , (x_n, y_n)), x_i ∈ R^N, y_i ∈ {−1, +1}. For a linearly separable S_n, the SVM finds the hyperplane with maximum Euclidean distance to the closest training examples. This distance is called the margin δ, as depicted in Fig. 2. Geometrically, the hyperplane is defined by its normal vector, w, and its distance from the origin, −b/||w||. For non-separable training sets, the amount of training error is measured using slack variables ξ_i.

Computing the position of the hyperplane is equivalent to solving the following convex quadratic optimization problem [Vapnik, 1998]:
[Figure 2: A binary classification problem (+ vs. −) in two dimensions. The hyperplane h∗ separates positive and negative training examples with maximum margin δ. The examples closest to the hyperplane are called support vectors (marked with circles).]

Optimization Problem 1 (SVM (primal))

    minimize:  V(w, b, ξ) = (1/2) w · w + C Σ_{i=1}^n ξ_i    (4)
    subj. to:  ∀ i = 1..n :  y_i [w · x_i + b] ≥ 1 − ξ_i    (5)
               ∀ i = 1..n :  ξ_i ≥ 0    (6)

The margin of the resulting hyperplane is δ = 1/||w||. The constraints (5) require that all training examples are classified correctly up to some slack ξ_i. If a training example lies on the "wrong" side of the hyperplane, we have the corresponding ξ_i ≥ 1, and thus Σ_{i=1}^n ξ_i is an upper bound on the number of training errors. The factor C in (4) is a parameter that allows trading off training error vs. model complexity. The optimal value of this parameter depends on the particular classification task and must be chosen via cross-validation or by some other model selection strategy. For text classification, however, the default value of C = 1/max_i ||x_i||² = 1 (the equality holding here since the feature vectors are normalized to unit length) has proven to be effective over a large range of tasks [Joachims, 2002]. OP1 has an equivalent dual formulation:

Optimization Problem 2 (SVM (dual))

    maximize:  W(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n y_i y_j α_i α_j (x_i · x_j)    (7)
    subj. to:  Σ_{i=1}^n y_i α_i = 0    (8)
               ∀ i ∈ [1..n] :  0 ≤ α_i ≤ C    (9)

From the solution of the dual, the classification rule solution can be constructed as

    w = Σ_{i=1}^n α_i y_i x_i   and   b = y_usv − w · x_usv ,    (10)
where (x_usv, y_usv) is some training example with 0 < α_usv < C. For the experiments in this paper, SVMLight [Joachims, 2002] is used for solving the dual optimization problem.4 More detailed introductions to SVMs can be found in [Burges, 1998, Cristianini and Shawe-Taylor, 2000].

An alternative to the approach here would be to use Latent Semantic Analysis (LSA/LSI) [Landauer and Dumais, 1997] to generate the feature vectors. LSA can potentially give better recall by capturing partial synonymy in the word representation, i.e., by bundling related words into a single feature. The reduction in dimension can also be computationally efficient, facilitating capture of the full text content of papers, including their citation information. On the other hand, the SVM already scales well to a large feature space, and performs well at determining relevant content through suitable choices of weights. On a sufficiently large training set, the SVM might even benefit from access to fine distinctions between words potentially obscured by the LSA bundling. It is thus an open question, well worth further investigation, as to whether LSA applied to full text could improve performance of the SVM in this application, without loss of computational efficiency. Generative models for text classification provide yet another alternative approach, in which each document is viewed as a mixture of topics, as in the statistical inference algorithm for Latent Dirichlet Allocation used in [Griffiths and Steyvers, 2003]. This approach can in principle provide significant additional information about document content, but does not scale well to a large document corpus, so the real-time applications intended here would not yet be computationally feasible in that framework.

4 arXiv Benchmarks

Before using the machine learning framework to identify new subject area content, we first assessed its performance on the existing (author-provided) category classifications.
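To make the decision rule of eq. (3) and the soft-margin objective of eqs. (4)–(6) concrete, here is a toy trainer using a Pegasos-style stochastic sub-gradient method. This is a didactic sketch on our own toy data, not the quadratic-programming approach of SVMLight used in the paper; the bias b is folded in as a constant extra feature (so it also gets regularized, a common simplification):

```python
def classify(w, x):
    """Linear decision rule h(x) = sgn(w . x + b) of eq. (3); the last
    component of w plays the role of the threshold b."""
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1

def train_linear_svm(X, y, lam=0.01, epochs=100):
    """Pegasos-style sub-gradient training of a linear soft-margin SVM,
    cycling through the training sample (x_i, y_i), y_i in {-1, +1}."""
    w = [0.0] * (len(X[0]) + 1)          # +1: bias folded in as a feature
    t = 0
    for _ in range(epochs):
        for i in range(len(X)):
            t += 1
            eta = 1.0 / (lam * t)        # decreasing learning rate
            xi = list(X[i]) + [1.0]
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink: regularizer term
            if margin < 1:                             # hinge-loss violation
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
        if all(classify(w, x) == yi for x, yi in zip(X, y)):
            break                        # zero training error on separable data
    return w
```

On a small linearly separable sample such as X = [(−2,−1), (−1,−2), (2,1), (1,2)] with labels [−1, −1, 1, 1], the trained rule classifies all training points correctly; the shrink step plays the role of the (1/2) w·w term in eq. (4), and the conditional update is the sub-gradient of the slack penalty.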
Roughly 180,000 titles and abstracts were fed to model-building software, which constructed a lexicon of roughly 100,000 distinct words and produced training files containing the TF×IDF document vectors for SVMLight. (While the SVM machinery could easily be used to analyze the full document content, previous experiments [Joachims, 2002] suggest that well-written titles and abstracts provide a highly focused characterization of content at least as effective for our document classification purposes.) The set of support vectors and weight parameters output by SVMLight was converted into a form specific to the linear SVM, eq. (3): a weight vector w_c and a threshold b_c, where c is an index over the categories.

4 SVMLight is available at http://svmlight.joachims.org/
5 Some of the smaller subject areas are known to be less topically focused, so the difficulty in recall, based solely on title/abstract terminology, was expected.

As seen in Fig. 3, the success of the SVM in classifying documents improves as the size of a category increases. The SVM is remarkably successful at identifying documents in large (> 10,000 documents) categories and less successful on smaller subject areas (< 500 documents). A cutoff was imposed to exclude subject areas with fewer than 100 documents.5 In experiments with small categories (N < 1000), the SVM consistently chose thresholds that resulted in unacceptably low recall (i.e., missed documents relevant to the category). This is in part because the null hypothesis, that no documents are in the category, describes
the data well for a small category: e.g., if only 1000 of 100,000 documents are in a category, the null hypothesis has a 99% accuracy.6 To counteract the bias against small categories, 10% of the training data was used as a validation set to fit a sigmoid probability distribution 1/(1 + exp(Af + B)) [Platt, 1999]. This converts the dot product x_i · w_c between document vector and weight vector into a probability P(i ∈ c | x_i) that document i is a member of category c, given its feature vector x_i. Use of the probability, in place of the uncalibrated signed distance from the hyperplane output by the SVM, permitted greater levels of recall for small categories.

[Figure 3: Recall and precision as functions of category size for 78 arXiv major categories and minor subject classes. Two thirds of the sample was used as a training set and one third as a test set. The four largest categories, each with over 30,000 documents, are cond-mat, astro-ph, hep-th, and hep-ph.]

Other experiments showed that the use of TF×IDF weighting as in eq. (2) improved accuracy consistently over pure TF weighting, so TF×IDF weighting was used in the experiments to follow. We also used a document frequency threshold to exclude rare words from the lexicon, but found little difference in accuracy between using a document occurrence threshold of two and five.7 Increasing the weight of title words with respect to abstract words, on the other hand, consistently worsened accuracy, indicating that words in a well-written abstract contain as much or more content classification import as words in the title. Changes in the word tokenization and stemming algorithms did not have a significant impact on overall accuracy. The default value of C = 1 in eq. (9) was preferred.

6 "Accuracy" is the percentage of documents correctly classified.
We also employ other common terminology from information retrieval: "precision" is the fraction of those documents retrieved that are relevant to a query, and "recall" is the fraction of all relevant documents in the collection that are retrieved.

7 Words that appeared in fewer than two documents constituted roughly 50% of the lexicon, and those that appeared in fewer than five documents roughly 70%. Ignoring rare, and consequently uninformative, words hence reduces the computational needs.
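The sigmoid calibration and the retrieval metrics just defined can both be sketched in standard-library Python. The sigmoid fit below uses plain gradient descent on the cross-entropy; Platt's original procedure uses a more careful regularized Newton iteration, and all names here are our own:

```python
import math

def fit_sigmoid(scores, labels, lr=0.01, steps=5000):
    """Fit P(y=1 | f) = 1 / (1 + exp(A*f + B)) to classifier output
    scores f by gradient descent on the cross-entropy; labels are 0/1."""
    A, B = -1.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for f, yy in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * f + B))
            gA += (yy - p) * f           # d(cross-entropy)/dA
            gB += (yy - p)               # d(cross-entropy)/dB
        A -= lr * gA
        B -= lr * gB
    return A, B

def precision_recall_accuracy(predicted, relevant, total):
    """The three metrics used in the text, over sets of document ids:
    `predicted` = retrieved set, `relevant` = true set, `total` = corpus size."""
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    tn = total - len(predicted | relevant)   # correctly rejected documents
    return precision, recall, (tp + tn) / total
```

The accuracy term illustrates the null-hypothesis point above: predicting the empty set against 1000 relevant documents in a 100,000-document corpus already yields 99% accuracy, with zero recall.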
5 q-bio Extraction

There has been recent anecdotal evidence of an intellectual trend among physicists towards work in biology, ranging from biomolecules, molecular pathways and networks, gene expression, cellular and multicellular systems, to population dynamics and evolution. This work has appeared in separate parts of the archive, particularly under "Soft Condensed Matter", "Statistical Mechanics", "Disordered Systems and Neural Networks", "Biophysics", and "Adaptive and Self-Organizing Systems" (abbreviated cond-mat.soft, cond-mat.stat-mech, cond-mat.dis-nn, physics.bio-ph, and nlin.AO). A more coherent forum for the exchange of these ideas was requested, under the nomenclature "Quantitative Biology" (abbreviated "q-bio").

To identify first whether there was indeed a real trend to nurture and amplify, and to create a training set, volunteers were enlisted to identify the q-bio content from the above subject areas, in which it was most highly focused. Of 5565 such articles received from Jan 2002 through Mar 2003, 466 (8.4%) were found to have one of the above biological topics as their primary focus. The total number of distinct words in these titles, abstracts, plus author names was 23,558, of which 7984 were above the DF = 2 document frequency threshold. (Author names were included in the analysis since they have potential "semantic" content in this context, i.e., are potentially useful for document classification. The SVM algorithm will automatically determine whether or not to use the information by choosing suitable weights.)

A data-cleaning procedure was employed, in which SVMLight was first run with C = 10. We recall from eqs. (4) and (9) that larger C penalizes training errors and requires larger α's to fit the data. Inspecting the "outlier" documents [Guyon et al., 1996] with the largest |α_i| then permitted manual cleaning of the training set. Ten documents were moved into q-bio and fifteen were moved out, for a net movement to 461 q-bio (8.3%) of the 5565 total.
Some of the other flagged documents involved word confusions, e.g., "genetic algorithms" typically involved programming rather than biological applications. Other "q-bio" words with frequent non-biological senses were "epidemic" (used for rumor propagation), "evolution" (used also for the dynamics of sandpiles), "survival probabilities", and "extinction". "Scale-free networks" were sometimes used for biological applications, and sometimes not. To help resolve some of these ambiguities, the vocabulary was enlarged to include a list of the most frequently used two-word phrases with semantic content different from their constituent words.

With a training set fully representative of the categories in question, it was then possible to run the classifier on the entirety of the same subject area content received from 1992 through Apr 2003, a total of 28,830 documents. The results are shown in Fig. 4. A total of 1649 q-bio documents was identified, and the trend towards an increasing percentage of q-bio activity among these arXiv users is evident: individual authors can be tracked as they migrate into the domain. Visibility of the new domain can be further enhanced by referring submitters in real time, via the automated classifier running behind the submission interface.

Some components of the weight vector generated by the SVM for this training set are shown in Fig. 5. Since the distance of a document to the classifying hyperplane is determined by taking the dot product with the normal vector, its component values can be interpreted as the classifying weights for the associated words. The approach here illustrates one of the
[Figure 4: The number of submissions per year from 1992 through April 2003 in particular subsets of arXiv subject areas of cond-mat, physics, and nlin most likely to have "quantitative biology" content (cond-mat.soft, cond-mat.stat-mech, cond-mat.dis-nn, physics.bio-ph, nlin.AO). The percentage of q-bio content in these areas grew from roughly 1% (0.7% in 1992) to nearly 10% (8.8% in 2003) during this timeframe, suggesting a change in intellectual activity among arXiv-using members of these communities.]

[Figure 5: The most positive, a few intermediate, and most negative components of the q-bio classifying weight vector. The most positive weights include "protein" (+8.57), "dna" (+7.08), "biological" (+5.06), and "neuron" (+5.05); the most negative include "world" (−1.57), "polyelectrolyte" (−1.53), "particle" (−1.52), and "market" (−1.47).]
major lessons of the past decade: the surprising power of simple algorithms operating on large datasets.

While implemented as a passive dissemination system, the arXiv has also played a social engineering role, with active research users developing an affinity to the system and adjusting their behavior accordingly. They scan new submissions on a daily basis, assume others in their field do so and are consequently aware of anything relevant that has appeared there (while anything that doesn't may as well not exist), and use it to stake intellectual priority claims in advance of journal publication. We see further that machine learning tools can characterize a subdomain and thereby help accelerate its growth, via the interaction of an information resource with the social system of its practitioners.8

Postscript: The q-bio extraction described in section 5 was not just a thought experiment, but a prelude to an engineering experiment. The new category went on-line in mid-September 2003,9 at which time past submitters flagged by the system were asked to confirm the q-bio content of their submissions, and to send future submissions to the new category. The activity levels at the outset corresponded precisely to the predictions of the SVM text classifier, and later began to show indications of growth catalyzed by the public existence of the new category.

References

[Börner et al., 2003] Börner, K., Chen, C., and Boyack, K. (2003). Visualizing knowledge domains. Annual Review of Information Science & Technology, 37:179–255.

[Burges, 1998] Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167.

[Cristianini and Shawe-Taylor, 2000] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

[Dumais et al., 1998] Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998).
Inductive learning algorithms and representations for text categorization. In Proceedings of ACM-CIKM98.

[Ginsparg, 2001] Ginsparg, P. (2001). Creating a global knowledge network. In Electronic Publishing in Science II (proceedings of joint ICSU Press/UNESCO conference, Paris, France). ICSU Press. http://users.ox.ac.uk/~icsuinfo/ginspargfin.htm (copy at http://arXiv.org/blurb/pg01unesco.html).

[Ginsparg, 2003] Ginsparg, P. (2003). Can peer review be better focused? Science & Technology Libraries, to appear; copy at http://arXiv.org/blurb/pg02pr.html.

8 Some other implications of these methods, including potential modification of the peer review system, are considered elsewhere [Ginsparg, 2003].
9 See http://arXiv.org/new/q-bio.html
[Griffiths and Steyvers, 2003] Griffiths, T. and Steyvers, M. (2003). Finding scientific topics. In Arthur M. Sackler Colloquium on "Mapping Knowledge Domains". Proceedings of the National Academy of Sciences, in press.

[Guyon et al., 1996] Guyon, I., Matic, N., and Vapnik, V. (1996). Discovering informative patterns and data cleaning. In Advances in Knowledge Discovery and Data Mining, pages 181–203.

[Hayes and Weinstein, 1990] Hayes, P. and Weinstein, S. (1990). CONSTRUE/TIS: a system for content-based indexing of a database of news stories. In Annual Conference on Innovative Applications of AI.

[Joachims, 1998] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137–142, Berlin. Springer.

[Joachims, 2002] Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines – Methods, Theory, and Algorithms. Kluwer.

[Landauer and Dumais, 1997] Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104:211–240.

[O'Connell, 2002] O'Connell, H. B. (2002). Physicists thriving with paperless publishing. HEP Lib. Web., 6:3, arXiv:physics/0007040.

[Platt, 1999] Platt, J. (1999). Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–74. MIT Press.

[Porter, 1980] Porter, M. (1980). An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14(3):130–137.

[Salton and Buckley, 1988] Salton, G. and Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523.

[Vapnik, 1998] Vapnik, V. (1998). Statistical Learning Theory. Wiley, Chichester, GB.