Presented by C. Lee Giles, Pennsylvania State University - See complete conference videos - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Cyberinfrastructure, or e-science, has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source integrated system for building a search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formula search, and table indexing. We propose the open source SeerSuite architecture, a modular, extensible system built on successful open source projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples of specialized search engines that we have built, all using Solr/Lucene: CiteSeerX (computer science), ChemXSeer (chemistry), ArchSeer (archaeology), AckSeer (acknowledgements), RefSeer (reference recommendation), CollabSeer (collaboration recommendation), and others. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual information based feature selection, and sequence mining, are critical for performance.
1. Using Lucene/Solr to Build CiteSeerX and Friends
Dr. C. Lee Giles
Information Sciences and Technology
Computer Science and Engineering
The Pennsylvania State University
University Park, PA, USA
giles@ist.psu.edu
http://clgiles.ist.psu.edu
2. Prof. C. Lee Giles
http://clgiles.ist.psu.edu
• Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
– Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
– Large heterogeneous data and information systems
– Specialty search engines and portals for knowledge integration
• CiteSeerX (computer and information science)
• ChemXSeer (e-chemistry portal)
• GrantSeer (grant search)
• RefSeer (recommendation of paper references)
• Scalable intelligent tools/agents/methods/algorithms
– Information, knowledge and data integration
– Information and metadata extraction; entity disambiguation
– Unique search, knowledge discovery, information integration, and data mining algorithms
– Web 2.0 methods
• Automated tagging for search and information retrieval
• Social network analysis
3. SeerSuite Contributors/Collaborators: recent past and present (incomplete list)
Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …
• P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, and others.
• Current funding: NSF, Dow Chemical
4. Outline
• Motivation
– Data science; cyberinfrastructure
– Vast growth in domain science data and documents
• SeerSuite
– Tool for creating Seers
– Specialized data and document search and recommendations
• Tables, formulae, figures, references …
– Use of Solr/Lucene
• Disciplinary sciences, indexes & information extraction (the Seers)
– Computer science
– Chemistry
– Briefly, other Seers
• Opportunities for research
• Conclusions and directions
5. The Evolution of Science - the 4th Paradigm
Jim Gray's paradigm
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions
• Computational Science
– Simulate analytical model
– Validate model and make predictions
• Data Driven Science
– Data captured from the web, by instruments, or from documents
– Data generated by simulation
– Placed in data structures / files
– Scientist(s) analyze(s) data
– Access & search crucial
6. Data Access Varies with Discipline, or Small vs Big Science
• Small vs Big science
– "Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science."
• 'Lost in a Sea of Science Data', S. Carlson, The Chronicle of Higher Education (23/06/2006)
– Data is local
– Data will not be shared
• At some point there will be needed
– indices to control search
– parallel data search and analysis
• Cyberinfrastructure can help
– If you can't move the data around, take the analysis to the data! (Bandwidth of a van loaded with disks)
– Do all data manipulations locally
• Build custom procedures and functions locally
7. SeerSuite
• Open source search engine and digital library toolkit used to build search engines and digital libraries
– CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
• Supports research in
– Indexing and search
– Digital libraries
– Data mining & structures
– Information and knowledge extraction
– Social networks
– Scientometrics/infometrics
– Systems engineering, user design
– Software engineering and management
– Web crawling
• Trains students in search and software systems
– Educational tool for search engine creation
– Students highly sought in industry and government
8. SeerSuite - properties
• Modular, scalable, extensible, robust design
– Extensible to many problems and disciplines
• Integrated features
– Focused crawler - Heritrix
– Indexer - Solr/Lucene
– Metadata extraction - modular
– Ranked results
• Builds on experience with other domain engines and OS tools
– Lucene and Solr
– The MySQL database and InnoDB storage engine
– Apache Tomcat
– Spring Framework
– Acegi Security
– ActiveMQ
– ActiveBPEL Open Source Engine
– Apache Commons libraries
– SVMlight support vector machine package
– CRF++ conditional random field package
• Hardware independent; Linux
• Reuse, not reinvent
9. Data Mining & Information Extraction in Seers
• Data acquisition
• SeerSuite systems often crawl the public web for new data
• Many data types available
• Richness of data offers unique data mining features
• CiteSeerX as testbed/sandbox
• Large scale data resources
• Millions of documents, authors, etc.
• Some common features/metadata
• Commercial grade indexer (Solr/Lucene)
• Scalable to G’s of documents and M’s of users
• “Watson”
• Modular design
• Cloudable
• State of the art algorithms (machine learning) for large scale
unique metadata (information) extraction & mining
• Unique parsers and indexing
• Quality of extraction
• Precision/recall
• Ranking
• Architecture/integration
10. Seer Friends
• In various stages of the system lifecycle, with various data resources and indexes:
– Mature and developing, code released
• CiteSeer, now CiteSeerX
• ChemXSeer
• TableSeer
• YouSeer
– New, future TBD, not all aspects public
• ArchSeer
• AlgoSeer
• CollabSeer
• RefSeer
• SeerSeer
• GrantSeer
– Dead or limping by (could be revived)
• AckSeer (acknowledgement indexing) (revived!)
• BizSeer
• BotSeer
– Proposed, but do not exist
• BrainSeer
• CensorSeer
• ArXivSeer
11. Why Solr/Lucene?
• Only open source considered - cost
• Competitors:
– Indri
– Wumpus
– Terrier
– Others?
• Must scale for both number of documents and users
• Easily integrable and customizable
– Other indexes, crawlers, ingestion, metadata extractors
• Well used (Watson)
• Active community of support
– Enterprise platform a plus
• Easy to transition to government/industry/academia
– Apache license
12. Next Generation CiteSeer, CiteSeerX
• 2 M documents
• 40 M citations
• 2 to 5 M authors
• 2 to 4 M hits/day
• 800K individual users
• entire data shared
• Index - 50 G
http://citeseerx.ist.psu.edu
13. History: CiteSeer (aka ResearchIndex)
Project at NEC Research Institute, Princeton
1st academic document search engine
Very popular with computer science
Hosted at NEC from 1997 - 2004. Moved to Penn State as collaborators left.
Provided a broad range of unique services including automatic citation indexing, reference linking, full text indexing, similar document listing, automated metadata extraction, and several other pioneering features.
Refactored and redesigned as CiteSeerX; released 2008; Lucene based indexing.
CiteSeer continuously running for 15 years!
[Photos: C. Lee Giles, Kurt Bollacker, Steve Lawrence]
14. SeerSuite/CiteSeerX Architecture
• Web Application
• Focused Crawler
• Document Conversion and
Extraction
• Document Ingestion
• Data Storage
• Maintenance Services
• Federated Services
Teregowda, USENIX ‘10
15. 4 systems:
• Production
• Crawling
• Staging
• Research
All or some
can be
cloudized
Teregowda, USENIX 2010
16. CiteSeerX Services
CiteSeerX is a very automated system:
Full OAI metadata if available
Full text indexing (many different indexes)
- Documents
- Citations
- Tables
- More forthcoming (algorithms, figures, acknowledgements)
Citation graph
- Ranking based on citations
- Linking documents: co-citations, citing documents
Author disambiguation
- Distinguish between authors with similar names
- Profiles and publication information for each author
Automatic crawling from lists and submissions
Personalization
- Login based access to features on CiteSeerX
- Corrections to metadata
- Storage of queries
- Collections of papers
- Follows document metadata changes
17. Focused Crawling
• Maintain a list of parent URLs where documents were previously found
– Parent URLs are usually academic homepages.
• 300,000 unique parent URLs, as of summer 2011
– Parent URLs are stored in a database table with two additional fields for scheduling:
• Last time changed, i.e., new documents were obtained from the page.
• Estimated change rate according to previous crawls of this page.
• The crawling process starts with the scheduler selecting the 1000 parent URLs that have the highest probability of having new documents available.
– Assume a Poisson process for the change behavior of a parent page.
• Suppose a parent page P's last observed change occurred at time t1, and its estimated change rate is R; then at time t2 (t2 = t1 + Δ), the probability that it has changed again since t1 is 1 - exp(-R*Δ)
• Larger R or larger Δ gives a larger probability.
• After each crawl, the change rate of the scheduled parent URL is recalculated.
• Crawling runs incrementally daily (invoked by a Linux cron job at 12 am)
– Most discovered documents have been crawled before.
• Use hash table comparison to detect new documents
• Normally retrieve a few thousand NEW documents per day, sometimes less than 1k.
• Moved to whitelist vs blacklist
Zheng, CIKM'09
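The scheduling rule above can be sketched in a few lines (a minimal sketch, not the production scheduler; the `parents` record layout is an assumption):

```python
import math

def change_probability(rate, elapsed):
    """Poisson change model: P(page changed again) = 1 - exp(-R * delta)."""
    return 1.0 - math.exp(-rate * elapsed)

def schedule(parents, now, k=1000):
    """Pick the k parent URLs most likely to have new documents,
    using each page's estimated change rate and last observed change."""
    scored = [(change_probability(p["rate"], now - p["last_changed"]), p["url"])
              for p in parents]
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]
```

After each crawl, the selected page's `rate` would be re-estimated from its observed change history, so frequently changing homepages float to the top of the next day's schedule.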
18. Documents from Crawled URLs
[Chart: cumulative coverage of documents and citations over crawled sites]
- 90% of all citations come from the first 550 sites
- 90% of all documents come from the first 1250 sites
19. How will we get metadata for fields?
[Cartoon: "Now... that should clear up a few things around here"]
20. Metadata Extraction
• Documents are converted from PDF/PS to text using converters.
– Converters include TET, PDFBox, pdftotext, gs.
• Documents are filtered, checking for the existence of references and for duplication (checksum).
• Use tools or build your own
– The metadata extraction system uses machine learning methods like SVM (Header Parser) and CRF (ParsCit) to extract various entities from the document.
• Rule based templates are applied before extraction.
21. Automatically Created DB of a Paper in CSX
[Example record: "Tensor Decompositions and Applications", SIAM REVIEW, 2009, pp 455-500]
Fields include: id (e.g. 10.1.1.130.782), title, abstract, year, publisher (SIAM), venue (SIAM REVIEW), venueType (JOURNAL), version, cluster (9248987), n-cites (34), selfCites (6), public flag, repositoryID, crawldate (12/30/2008); each field is assigned by the extractor, by inference, or by the user.
22. 3 Tier Architecture
[Diagram: user requests pass through a load balancer to replicated web applications (Web 1, Web 2); the web application sends queries through a second load balancer to the index (full text and tables), the database, and the repository; crawler, ingestion, and extraction feed the storage tier.]
23. CiteSeerX Software Overview
• Ingestion process: responsible for obtaining and preparing a document and the related metadata.
– Process the document
• Submitted by the user or crawler
– Extract metadata
• Header
• Citations
• Acknowledgements
– Store the metadata and documents.
• Citation matching
– Identifying the underlying graph structure - documents citing this document and the relationship between documents and citations
• Inference matching and graph generation
– User corrections (version maintenance)
– Determine and accept valid user corrections
– Regular notification mechanisms
– Ensure that the user is notified when new documents are added to the collection
• Linked to MyCiteSeer.
• Update and maintenance
– Update and validate the full text index and various statistics.
– Statistics
– Index updates
24. CiteSeerX Search
Enabling search
Fulltext fields created:
- Title
- Authors
- Citations
- Venue
- Keywords
- Abstract
- Range (publication)
25. Field Schema
Field                | Type    | Indexed/Stored
DOI                  | String  | Y/Y - unique
Citation/Document    | String  | Y/Y
Title                | Text    | Y/Y
Author               | A Text  | Y/Y
Authors Normalized   | A Text  | Y/N
ncites (# cited by)  | Integer | Y/Y
URL                  | String  | Y/Y
cites                | Tokens  | Y/N
citedby              | Tokens  | Y/N
Timestamp            | Date    | Y/Y
* - A Text is a Text field which does not have a stopword filter or stemming
^ - Tokens are a Text field with only duplicate removal and a whitespace tokenizer
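The schema above might be declared in a Solr schema.xml along these lines (a sketch only; the field and type names beyond the table are assumptions, not the actual CiteSeerX schema):

```xml
<schema name="citeseerx-sketch" version="1.5">
  <types>
    <!-- "A Text": tokenized, but no stopword filter or stemming -->
    <fieldType name="atext" class="solr.TextField">
      <analyzer><tokenizer class="solr.StandardTokenizerFactory"/></analyzer>
    </fieldType>
    <!-- "Tokens": whitespace tokenization plus duplicate removal only -->
    <fieldType name="tokens" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="int" class="solr.TrieIntField"/>
    <fieldType name="date" class="solr.TrieDateField"/>
  </types>
  <fields>
    <field name="doi" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="atext" indexed="true" stored="true"/>
    <field name="author" type="atext" indexed="true" stored="true"/>
    <field name="authorNorms" type="atext" indexed="true" stored="false"/>
    <field name="ncites" type="int" indexed="true" stored="true"/>
    <field name="url" type="string" indexed="true" stored="true"/>
    <field name="cites" type="tokens" indexed="true" stored="false"/>
    <field name="citedby" type="tokens" indexed="true" stored="false"/>
    <field name="timestamp" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>doi</uniqueKey>
</schema>
```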
26. CiteSeerX Search Results
Results sorting:
Relevance (default)
- Based on dismax query handling with boosting.
Citations
- Citations received by the document in the collection, plus default relevance.
Year
- Publication date.
Recency
- Date of acquisition.
27. CiteSeerX Citation Graph Relationships
[Diagram: documents A-E connected by "cites" and "cited by" edges]
- Store "cited by" and "cites" in the index
- Build the document graph by querying the index for relationships.
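Reassembling the graph from per-document link fields can be sketched as follows (a toy sketch; it assumes each indexed record carries its `cites` tokens, as in the field schema):

```python
from collections import defaultdict

def build_graph(docs):
    """Build cites / cited-by adjacency from per-document token fields,
    mirroring how the citation graph can be rebuilt by querying the index."""
    cites = {d["doi"]: set(d.get("cites", [])) for d in docs}
    cited_by = defaultdict(set)
    for doi, targets in cites.items():
        for t in targets:
            cited_by[t].add(doi)
    return cites, dict(cited_by)
```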
28. Adding Documents
Ingest documents from new crawls:
- Add metadata to the collection
- Add full text to the system
- Link metadata in the collection
Run maintenance scripts:
- Poll updates (fulltext, metadata, relationships) and post to Solr.
Challenge: maintain data freshness.
29. Query Response
• Query forwarded to Solr from the presentation layer (JSP web interface)
• Solr generates a ranked response in JSON
• Each record is built in XML with the database (database fields such as Abstract are added)
• The presentation layer (JSP) formats records based on ranking.
30. Ranking with Boosting (Relevance)
Use of boost function, minimum match, and query fields:
Boost function - the effect of citations
- Map number of citations > 1 to 500
Minimum match - 2
Query fields (boosts)
- Text (1)
- Title (4)
- Abstract (2)
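In Solr's dismax syntax, settings like these might look as follows (a sketch; the field names and the particular boost function used to cap citation influence are assumptions, not the exact CiteSeerX configuration):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- query fields with boosts: text (1), title (4), abstract (2) -->
    <str name="qf">text^1.0 title^4.0 abstract^2.0</str>
    <!-- minimum match: at least 2 query terms must match -->
    <str name="mm">2</str>
    <!-- additive boost on the citation count field, capped so very
         highly cited papers do not swamp text relevance -->
    <str name="bf">min(ncites,500)</str>
  </lst>
</requestHandler>
```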
31. Query Response (flow)
[Diagram of the query flow:]
- Query Q entered at the web interface (JSP)
- Handed over to the web application (Java/Spring)
- Handed over to Solr; ranked response R returned from Solr as JSON (parsed into a HashMap)
- Response unwrapped; more details included with information from the DB
- Response presented at the interface (JSP)
32. Name Disambiguation
• Name disambiguation (NER)
– A person can be referred to in different ways, with different attributes, in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
• Three types of name ambiguities:
– Aliases - one person with multiple aliases, name variations, or a changed name, e.g. CL Giles & Lee Giles, Superman & Clark Kent
– Common names - more than one person shares a common name, e.g. Jian Huang - 103 papers in DBLP
– Typography errors - resulting from human input or automatic extraction
• Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.
33. Efficient Large Scale Entity Disambiguation
Testbed: CiteSeerX and PubMedSeer
Huang, et al. PKDD 2006; Treeratpituk, et al. JCDL 2009
• Entity disambiguation problem
– Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, and information from crawling such as host server, etc.
– Entity normalization
• Motivation
– Enhance search functionalities for digital repositories
• Fielded search by author name
– Improve metadata quality
– Improved social network analysis
– Government and business intelligence
• E.g. census data and credit records
• Key features
– LASVM distance function
• Active learning
– Simpler and more accurate model
– Better generalization power
• Online learning
– Expandable to new training data
– DBSCAN clustering
• Ameliorates labeling inconsistency (transitivity problem)
• Efficient solution to find name clusters
• N logN scaling
• Challenges
– Accuracy
– Scalability
– Expandability
[Pipeline diagram: metadata extraction module → similarity module (Jaccard, soft-TFIDF similarity; learned SVM distance function; online SVM with active learning and an annotator) → blocking module (candidate classes) → DBSCAN clustering over authors and papers]
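The record-pair similarity features feeding the learned distance function can be sketched as follows (a minimal sketch; the record layout and the particular features are illustrative assumptions — the real system learns a combined distance with LASVM over many such features):

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (e.g. coauthor lists)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def pair_features(rec1, rec2):
    """Feature vector for one pair of author records; a learned distance
    function would be trained on vectors like this."""
    return [
        jaccard(rec1["coauthors"], rec2["coauthors"]),
        jaccard(rec1["affiliation"].lower().split(),
                rec2["affiliation"].lower().split()),
        1.0 if rec1["email"] == rec2["email"] else 0.0,
    ]
```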
34. Author Disambiguation Field
• Currently uses author fields
– For author search (both for author mentions and for disambiguated authors)
• Future direction
– Use the Lucene index for blocking in author disambiguation - creating a candidate set of author mentions that could belong to the same cluster
35. Author Disambiguation
• Random Forest (RF)
– Use random feature selection + bootstrap sampling to construct multiple decision trees from one training set
– Aggregate the votes of the collection of decision trees as the final decision
– The more independent each tree is, the better the improvement over a single decision tree
• Author disambiguation with Random Forest
– Various metadata is used as features in the Random Forest to determine whether two author names from two papers refer to the same person
• E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc.
– Multiple distance functions are used for each type of metadata
• E.g. TFIDF, Jaccard distance for comparing affiliations
• Compared with the previous SVM-based approach
– Shown to provide higher accuracy than SVM in the pairwise author disambiguation task
– Easy parameterization in the training phase (only the number of trees and the randomness at each node; no decision on a kernel function needed), and performance is not sensitive to the parameters chosen
– Provides a measure of the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for an SVM with a non-linear kernel
– Training time & classification time are linear in the number of trees and the data size
• Also provides higher disambiguation accuracy when compared with other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree)
Treeratpituk, Giles, JCDL09
36. Data and Publications in the Field of Chemistry
Chemistry
• is not physics - no arXiv - or computer science - no CiteSeer
• Legacy of early information access - Chem Abstracts
• Cheminformatics is not bioinformatics
Chemistry has until recently been a data poor field
Data sharing traditions are just being established
Data creation is exploding - local (small science)
Journals and societies sensitive to their IP issues dominate the field
Unsubstantiated IP claims, such as that the data in the paper belongs to the publisher
Discourage online versions of publications - ACS
Large powerful international companies have a vested interest in research
Chemical information extraction tools are easily monetized
Standards exist - CML, InChI
"Fixing the past so we can fix the future." Jeremy Frey
Chemistry is an old discipline with publications going back 100 years
Chemistry is compound centric, not algorithm centric
Search is about the compound!
Compounds have a rich data environment
3D graph structure, energies, etc.
37. ChemXSeer Architecture
Integrate and implement well-used open source tools
Use CiteSeerX tools when possible
Integrate into SeerSuite
Search
- Unique chemical formula search
- Table search
- Figure search
More data (grey literature) than documents
• Automated information extraction modules based on machine learning methods
• Lucene/Solr indices for extracted fields
• Relational databases for datasets
Work closely with chemists to understand their needs
Tools for data conversion
Provide a public portal and repository for easy use
User access controls
Integrated visualization tools like JMOL for Gaussian data residing in our repository
APIs for users for extracted data
Data and document standards, de facto: XML, PDF, etc.
39. ChemXSeer Formula Search
• Extraction and search of chemical formulae in scientific
documents has been shown to be very useful.
• Intersection of two research areas:
• Information retrieval
• Chemoinformatics
• Formulae cannot be treated as text.
• Domain knowledge (formula identification)
• Structural knowledge (substructure finding and search)
B. Sun, WWW’07, WWW’08, TOIS’11
D. Yuan, ICDE’12
40. Challenges in Formula Search
How to identify a formula in scientific documents?
Non-Formula
“… This work was funded under NIH grants …”
“ … YSI 5301, Yellow Springs, OH, USA …”
“… action and disease. He has published over …”
Formula
“… such as hydroxyl radical OH, superoxide O2- …”
“ and the other He emissions scarcely changed …”
Machine learning algorithms (SVM + CRF) yield high
accuracies for correct formula identification.
43. Chemical Entity Extraction and Tagging
• Name tagging
– Each chemical name can be a phrase
– Examples
• "... Determination of lactic acid and ..."
• "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
• Formula tagging
– Each formula is a single term
– Example
• "... such as hydroxyl radical OH, superoxide ..."
– Non-formula example
• "... YSI 5301, Yellow Springs, OH, USA ..."
• Tagging examples
– Name tagging: "... of <name-type>lactic acid</name-type> and ..."
– Formula tagging: "... radical <formula-type>OH</formula-type>, superoxide ..."
44. Textual Chemical Molecule Information Indexing and Search
• Index schemes:
– Which tokens to index?
– Indexing all subsequences generates a very large index
• Segmentation-based index scheme
– Used for indexing chemical names
– First segment a chemical name hierarchically and then index the substrings at each node, e.g.:
  methylethyl → methyl | ethyl → meth | yl | eth | yl → me | th | …
• Frequency-and-discrimination-based index scheme
– Used for indexing chemical formulas
– Sequentially select frequent and discriminative subsequences of a formula, from the shortest to the longest
45. Features for Formula Indexing
• Formula
– A sequence of chemical elements or partial formulas with corresponding frequencies
– E.g. CH3(CH2)2OH
• Partial formula
– Partial formula: a subsequence of a formula
– E.g. C, H, O, CH3, CH2, OH, CH3(CH)2, H3(CH)2, CH3(CH)2O, etc.
• Index construction
– Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc.
– Too many partial formulas; need feature selection
46. Criteria of Feature Selection
• Criteria of feature selection
– Frequent features (Freq_s ≥ Freq_min)
– Discriminative features (α_s ≥ α_min)
• If a sequence's selected subsequences are enough to distinguish the formulas containing them from other formulas, this sequence is redundant.
• Discrimination score
  α_s = | ∩_{s'∈F, s'≺s} D_s' | / | D_s |
where F is the selected feature set, and D_s is the set of formulas containing s.
47. An Example for Formula Indexing
• Data set:
– 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH
• Parameters:
– Freq_min = 2, α_min = 1.1
• Steps:
– Length = 1: Candidates = {C, H, O}; F = {C, H, O}
– Length = 2: Candidates = {CH3, H3C, CO, OO, OH, CH2}; Frequent candidates = {CH3, CO, OO, OH, CH2}
  α_CH3 = | {1,2,3}_C ∩ {1,2,3}_H | / | {1,2,3}_CH3 | = 1
  α_CO = | {1,2,3}_C ∩ {1,2,3}_O | / | {1,3}_CO | = 1.5
  Frequent & discriminative candidates = {CO, OO, CH2}; F = {C, H, O, CO, OO, CH2}
– Length = 3, …
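The discrimination score from this example can be computed directly from posting sets (a sketch; `D` maps each candidate subsequence to the set of formula ids containing it, taken from the slide's data set):

```python
def discrimination_score(candidate, selected, D):
    """alpha_s = |intersection of D_{s'} over selected sub-features s' of s| / |D_s|.
    `selected` lists the already-selected sub-features of `candidate`."""
    inter = set.intersection(*(D[s] for s in selected))
    return len(inter) / len(D[candidate])

# Posting sets for the data set 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH:
D = {
    "C": {1, 2, 3}, "H": {1, 2, 3}, "O": {1, 2, 3},
    "CH3": {1, 2, 3},   # all three formulas contain CH3
    "CO": {1, 3},       # only 1 and 3 contain an adjacent C-O
}
```

With α_min = 1.1, CH3 scores 1.0 and is discarded as redundant, while CO scores 1.5 and is kept, matching the slide's worked steps.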
48. Formula Search
• SF.IEF: Subsequence Frequency & Inverse Entity Frequency
  SF(s,e) = Freq(s,e) / |e|,  IEF(s) = log( |C| / |{e | s ≺ e}| )
• Exact formula search
– Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2.
• Frequency formula search
– Full frequency search: search for formulas with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2.
– Partial frequency search: similar, but allows unspecified elements. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well.
– Ranking function
  score(q,e) = Σ_{s∈q} SF(s,e)·IEF(s)² / ( |e| × Σ_{s∈q} IEF(s)² )
49. Formula Search - Substructure
• Substructure formula search
– Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
– Ranking function
  score(s,e) = W_match(s,e) · SF(s,e) · IEF(s) / |e|
where W_match is the weight for exact match, reverse match, and parsed match
• Similarity formula search
– Search for formulas with a structure similar to the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
– Ranking function
  score(q,e) = Σ_{s≺q} W_match(q,e) · W(s) · SF(s,q) · SF(s,e) · IEF(s) / |e|
• Conjunctive search of the basic types of formula searches
– E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH.
• Document query rewriting
– E.g. the document query "atom formula:=CH4" is rewritten to "atom (CH4 OR CD4)" if formula search of =CH4 matches CH4 and CD4.
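The SF·IEF scoring above can be sketched over a toy collection (an illustrative sketch, not the production ranker; subsequence containment is simplified here to substring containment, and `|e|` is taken as the string length of the formula):

```python
import math

def sf(s, e):
    """Subsequence frequency: occurrences of partial formula s in entity e,
    normalized by the entity's length."""
    return e.count(s) / len(e)

def ief(s, collection):
    """Inverse entity frequency: log(|C| / #entities containing s)."""
    containing = sum(1 for e in collection if s in e)
    return math.log(len(collection) / containing) if containing else 0.0

def score(query_parts, e, collection):
    """score(q,e) = sum_s SF(s,e)*IEF(s)^2 / (|e| * sum_s IEF(s)^2)."""
    num = sum(sf(s, e) * ief(s, collection) ** 2 for s in query_parts)
    den = len(e) * sum(ief(s, collection) ** 2 for s in query_parts) or 1.0
    return num / den
```

As expected, a formula actually containing the query's partial formulas outranks one that does not, and partial formulas appearing in every entity contribute nothing (IEF = 0).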
50. Formula Search - Query Models
Many models are possible, from exact to semantic
Models are discriminated by their matching algorithms
• Exact search
– Search for exact representations
– E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2
• Frequency searches
– Full frequency search: search for formulae with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements
– E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2
– Partial frequency search: similar, but allows unspecified elements
– E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well
• Substructure search
– Search for formulae that may have a substructure
– E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
• Similarity search
– Search for formulae with a structure similar to the query formula. Feature-based approach using partial formula matching.
– E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
51. Ranking Formulae
• Ranking formulae has to depend on need and importance
• Focus on structural methods and frequency
• Importance can be introduced by citation rank or PageRank or others
• SF.IFF
– Substructure frequency and inverse formula frequency
• Frequency searches
–  score(q,f) = Σ_{e∈q} SF(e,f)·IFF(e)² / ( |f| × Σ_{e∈q} IFF(e)² )
– where |f| is the total frequency of elements
• Substructure search
–  score(q,f) = W_match(q,f) · SF(q,f) · IFF(q) / |f|
– where W_match(q,f) is the weight for exact match, reverse match, and parsed match
• Similarity search
–  score(q,f) = Σ_{s≺q} W_match(q,f) · W(s) · SF(s,q) · SF(s,f) · IFF(s) / |f|
52. Chemical Compounds as Graphs
• Chemical compound modeled as a semantic graph with properties
- Atom: vertex/node in the graph
- Bond: edge in the graph
- Dimensions: 3 or 4
[Above figures are copied from eMolecules.com]
53. What's Chemical Structure Search?
• Substructure search
– Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure.
• Superstructure search
– Given an input chemical structure sketch, find all the important descriptors (substructures / functional groups) contained in the input.
• Similarity search
– Given an input chemical structure sketch, find all the chemical compounds "similar" to the input.
54. Table Search
Tables are widely used to present experimental results or statistical
data in scientific documents; some data only exists in these tables.
Current search engines treat tabular data as regular text
• Structural information and semantics not preserved.
Goal: automatically identify tables, extract table metadata from pdf
documents into xml and rank data
Table Metadata Representation:
• Environment metadata: (document specifics: type, title,…)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (Caption, footnote, …)
• Layout metadata: (number of rows, columns, headers,…)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)
Y. Liu AAA’07, JCDL’07.
55. Tables
• A history that pre-dates that of sentential text
– Cuneiform clay tablets
• Not received the same level of formal characterization
enjoyed by sentential text
• Varying and irregular formats
• Different intuitive understanding of what a “table” is.
– Is the Periodic Table of the Elements a table?
– Tables vs. Lists?
– Tables vs. Forms?
– Tables vs. Figures?
– Genuine table vs. non-genuine table? [12]
• Our definition: scientific genuine table
– Caption + tabular structure
– Ruling lines are not required
58. Page Box-Cutting Algorithm
• Improves table detection performance by excluding more than 93.6% of document content at the beginning
59. Sample Table Metadata Extracted File
• <Table>
  <DocumentOrigin>Analyst</DocumentOrigin>
  <DocumentName>b006011i.pdf</DocumentName>
  <Year>2001</Year>
  <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors</DocumentTitle>
  <Author>Sang Hyun Park, a Young-Chan Son, a Brenda R. Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd - a Department of Chemistry, U-60, University of Connecticut, Storrs, CT 06269-3060</Author>
  <TheNumOfCiters></TheNumOfCiters>
  <Citers></Citers>
  <TableCaption>Table 1 Temperature effect on resistance change (ΔR) and response time of tin oxide thin film with 1% CCl4</TableCaption>
  <TableColumnHeading>Temperature/°C; ΔR a (R,O2); ΔR (%); Response time; Reproducibility</TableColumnHeading>
  <TableContent>100 223 5 ~22 min Yes; 200 270 9 ~7-8 min Yes; 300 1027 21 &lt;20 s Yes; 400 993 31 ~10 s No</TableContent>
  <TableFootnote>a ΔR = (R, CCl4) - (R, O2).</TableFootnote>
  <ColumnNum>5</ColumnNum>
  <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1 …</TableReferenceText>
  <PageNumOfTable>3</PageNumOfTable>
  <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
</Table>
60. TableRank
• Rank tables by rating the <query, table> pairs instead of the <query, document> pairs, preventing many of the false positive hits for table search that frequently occur in current web search engines
• The similarity between a <table, query> pair: the cosine of the angle between vectors
• Tailored term vector space => table vectors: query vectors and table vectors, instead of document vectors
61. Table Index
Index:
- Captions
- Footnotes
- Reference text
Boosting:
- Captions (2)
Function:
- Inversely (recip) proportional to #cites.
62. Term Weighting for Tables
– TTF-ITTF: Table Term Frequency - Inverse Table Term Frequency
– TLB: Table Level Boost factors (e.g., table frequency)
– DLB: Document Level Boost factors (e.g., journal/proceeding order, document citation)
63. Table Term Ranking
• A term occurring in a few tables is likely to be a better discriminator than a term appearing in most or all tables
• Similar to a document abstract, table metadata and table queries should be treated as semi-structured text
• Not complete sentences; they express a summary
• P = 0.5 (G. Salton 1988)
• b is the total number of tables
• IDF(ijk): the number of tables in which term t(i) occurs in the metadata m(k)
64. Table Level Boost and Document Level Boost
Btbf is the boost value of the table frequency
Btrt is the boost value of the table reference text (e.g., the normalized length), and
Btp is the boost value of the table position. r is a parameter, which is 1 if users
specify the table position in the query. Otherwise, r = 0.
IVj: document Importance Value (IV). If a table comes from a document with
a high IV , all the table terms of this document should get a high document
level boost
ICj: the inherited citation value (ICj)
DOj: source value (the rank of the journal/conference proceeding)
DFj: document freshness
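The named boost factors can be combined, for illustration, as simple sums; the slide defines the factors and the role of r but not the aggregation function, so the additive form below is an assumption:

```python
def table_level_boost(btbf, btrt, btp, r):
    """Table Level Boost from the slide's factors: table frequency boost
    (btbf), table reference text boost (btrt, e.g., normalized length), and
    table position boost (btp). r = 1 only if the user specified a table
    position in the query, else 0, so btp contributes only on request.
    The additive combination is an illustrative assumption."""
    return btbf + btrt + r * btp

def document_level_boost(ic, do, df):
    """Document-level Importance Value aggregated from the inherited citation
    value (ICj), the source value (DOj, venue rank), and document freshness
    (DFj) -- again a simple sum as an illustrative combination."""
    return ic + do + df
```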
65. Table citation network
• Similar to the PageRank network
  – Documents construct a network from the citations
  – The "incoming links" are the documents that cite the document in which the table is located
  – Exponential decay used to deal with the impact of the propagated importance
• Unlike the PageRank network
  – Directed Acyclic Graph
  – Importance Value (IV) of a document not decreased as the number of citations increases
  – IV not divided by the number of outbound links
• A document may have multiple, one, or no tables
• Each table consists of a set of metadata
• Same keywords may appear in different metadata in different tables
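A minimal sketch of IV propagation with exponential decay over the citation DAG; the decay rate and depth cutoff are illustrative assumptions, not values from the talk:

```python
def importance(doc, citers, depth=0, decay=0.5, max_depth=3):
    """Illustrative Importance Value (IV) propagation on a citation DAG.

    citers maps a document id to the ids of documents that cite it (the
    "incoming links"). Unlike PageRank, IV is not divided by the number of
    outbound links and never decreases as citations grow: each citing
    document adds exponentially decayed credit. decay and max_depth are
    assumed parameters for the sketch.
    """
    if depth > max_depth:
        return 0.0
    iv = 1.0  # base importance of the document itself
    for citing in citers.get(doc, ()):
        iv += decay * importance(citing, citers, depth + 1, decay, max_depth)
    return iv
```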
66. Table Search Summary
• A novel, first table ranking algorithm -- TableRank
• A tailored table term vector space
• A table term weighting scheme – TTF-ITTF
  – Aggregating impact factors from three levels: the term, the table, and the document
• Index table reference texts, term locations, and document backgrounds
• Design and implement the first table search engine, TableSeer, to evaluate TableRank and compare with popular web search engines
• Code released
• Currently implemented in CiteSeerX - millions of tables
• Improving extraction – Dow Chemical support
67. Automated Figure Data Extraction and Search
• Large amounts of results in digital documents are recorded in figures, time series, and experimental results (e.g., NMR spectra, income growth), and these are often the only record of the data
• Extraction for purposes of:
  – Further modeling using presented data
  – Indexing, metadata creation for storage & search on figures for data reuse
• Current extraction done manually!!
[Pipeline diagram: documents → plot extraction → extracted plot + extracted info → document index and plot index → merged index → digital library → user]
68. Seer Figure/Plot Data Extraction and Search
Numerical data in scientific publications are often found in figures. Tools that automate the data extraction from figures provide the following:
• Increases our understanding of key concepts of papers
• Provides data for automatic comparative analyses.
• Enables regeneration of figures in different contexts.
• Enables search for documents with figures containing
specific experiment results.
X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08
69. Metadata & data to extract: 2-Dimensional Plot
[Figure: snapshot of a document page and the extracted 2D plot, annotated with Y-axis labels, legend, data points, ticks, axis units, and X-axis label]
70. Our Approach to Plot Data Extraction
• Identify and extract figures from digital documents
• ASCII and image extraction (xpdf)
• OCR - bitmap, raster PDFs
• Identify figures as images of 2D plots using SVM (only for bitmap images)
• Hough transform
• Wavelet coefficients of the image
• Surrounding text features
• Binarization of the 2D plots identified for preprocessing (No
need for Vectorized Images)
• Adaptive Thresholding
• Image segmentation to identify regions
• Profiling or Image Signature
• Text block detection
• Nearest Neighbor
• Data point detection
• K-means Filtering
• Data point disambiguation for overlapping points
• Simulated Annealing
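The binarization step above can be sketched as a simple mean-based adaptive threshold; this is a minimal illustration (a production system would use an optimized implementation), and the window and offset values are assumptions:

```python
def adaptive_threshold(img, window=15, offset=10):
    """Adaptive-thresholding binarization of a grayscale plot image.

    img is a 2D list of 0-255 grayscale values. Each pixel is compared
    against the mean of its local window; pixels darker than the local mean
    minus an offset are marked as foreground ink. window and offset are
    illustrative parameters.
    """
    h, w = len(img), len(img[0])
    half = window // 2
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - half), min(h, y + half + 1))
            xs = range(max(0, x - half), min(w, x + half + 1))
            vals = [img[yy][xx] for yy in ys for xx in xs]
            out[y][x] = img[y][x] < sum(vals) / len(vals) - offset
    return out
```

The resulting binary mask then feeds the segmentation, text block detection, and data point detection stages.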
71. Future Directions
• System integration within ChemXSeer or CiteSeerX
  – XML data generation
  – Open source tool in Lucene/SOLR
• Extension to other figures (3D, …)
[Example: extracted 3D surface plot]
72. ChemXSeer Highlights
• Portal for academic researchers in environmental chemistry which integrates the scientific
literature with experimental, analytical and simulation results and tools
• Provides unique metadata extraction, indexing and searching pertinent to the chemical
literature by using heuristics combined with machine learning
• Chemical formulae and names
• Tables
• Figures
• Publication functions as in CiteSeerX
• Interoperability ORE-Chem development
• Novel ranking required
• After extraction, data stored as API-accessible XML for users
• Hybrid repository (Not fully open): Serves as a federated information interoperational system
• Scientific papers crawled and indexed from the web
• User submitted papers and datasets (e.g. excel worksheets, Gaussian and CHARMM
toolkit outputs)
• Scientific documents and metadata from publishers (e.g. Royal Society of Chemistry)
• Access control for publisher-provided content and user-submitted experiment data
• Takes advantage of developments in other funded cyberinfrastructure and open source
projects
• CiteSeerX, PlanetLab, Lucene/Solr, ORE, others
• Some released open source
74. Collaboration recommendation
• Metadata of authors and coauthors and topics of interest (similar to expert recommendation)
• Use social network and topics to recommend collaborators of collaborators (FOF)
• Devise SN index and ranking scheme
• Explore models of vertex similarity
• Built on SeerSuite
• Other recommendations?
  – Experimental methods
  – Chemicals?
Gou JCDL'10, Gou MIR'10, Chen JCDL'11, SAC'12
79. Integration of Vertex Similarity and Textual Similarity
– S: vertex similarity
– SC.O.T.: collaborator's contribution to a specified topic
– Use the product of exponential functions to avoid a zero vertex similarity score or a zero contribution (textual similarity) score turning the whole measure into zero
• Other measures?
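The "product of exponential functions" idea can be sketched as follows; the exact formula is not on the slide, so the form below is an assumption that only illustrates why neither factor can zero out the combined measure:

```python
import math

def combined_score(vertex_sim, topic_contrib):
    """Combine vertex similarity S and a collaborator's contribution to a
    specified topic (SC.O.T.) via a product of exponentials. Because
    exp(0) = 1, a zero in either component merely fails to boost the score
    instead of collapsing the whole product to zero. The exponents and
    normalization are illustrative assumptions.
    """
    return math.exp(vertex_sim) * math.exp(topic_contrib)
```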
80. RefSeerX: recommend citations for papers
• Uses a paper's citations; the authors may be unaware of related work they do not know they are looking for, and RefSeerX recommends related citations
• Based on
  – Existing citations
  – Citation context
  – Venue and importance
  – Contemporary vs seminal
83. Expert Search
• Expert search for authors, currently in alpha
84. Expert Search
• Expert search for authors, currently in alpha
85. Keyphrase Extraction for experts
• Section Parser: parse the text document into sections with regular expressions
• Candidate Extractor: use DBLP data/statistics to extract keyphrase candidates
• Random Forest: trained on labeled training data to classify & rank whether a phrase is a keyphrase
• Output: top keyphrases
Treeratpituk, P., Teregowda, P., Huang, J. and Giles, CL. SEERLAB: A System for Extracting Keyphrases from
Scholarly Documents, Semeval-2010 task 5: Automatic keyphrase extraction from scientific article. ACL workshop
on Semantic Evaluations (SemEval 2010), Sweden, July 2010.
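The classify-and-rank step can be sketched with an off-the-shelf random forest. The features here (frequency in the document, occurrence in DBLP titles, phrase length) are hypothetical stand-ins for SEERLAB's actual feature set, and the function name is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

def rank_candidates(train_feats, train_labels, cand_feats, candidates, top_k=5):
    """Train a random forest on labeled phrase features, then rank candidate
    phrases by the predicted probability of being a keyphrase."""
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(train_feats, train_labels)
    scores = rf.predict_proba(cand_feats)[:, 1]  # P(phrase is a keyphrase)
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [cand for cand, _ in ranked[:top_k]]
```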
86. GrantSeer
• Prototype search engine for PI profiles and their grant information to assist funding agencies, deans of research, foundations
• Link PIs with their
  – Grants
  – Publications
  – Citations
  – Organization
  – Expertise
  – Others?
• Data that can be shared
  – CiteSeerX or Google Scholar data
  – Database of funded research
Funded by NSF – Julia Lane
92. Metadata extraction
• Extract
• Pseudo-codes and their metadata
• Captions
• Reference sentences
• Synopsis
• Etc.
• Index metadata using Solr to make the pseudo-
codes searchable
• Each search result has a pointer to the page in the
document where the pseudo-code appears
93. Index Fields
id <string>
caption <text>
reftext <text> (Reference Sentences)
synopsis <text> (Summarizing Text)
page <sint> (Page Number)
paperid <string> (Document ID)
year <sint> (Year of Publication)
ncites <sint> (Number of Citations)
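The fields above map naturally onto a Solr JSON update call. A minimal sketch using only the standard library, assuming a local Solr instance with a core named `pseudocode`; the URL, core name, and field values are illustrative, not from the talk:

```python
import json
import urllib.request

def index_pseudocode(doc, solr_url="http://localhost:8983/solr/pseudocode"):
    """Post one extracted pseudo-code record to Solr's JSON update endpoint
    and commit, making it immediately searchable."""
    payload = json.dumps([doc]).encode("utf-8")
    req = urllib.request.Request(
        solr_url + "/update?commit=true",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# Example record shaped like the index fields on the slide
# (all values here are hypothetical).
doc = {
    "id": "doc123-algo1",
    "caption": "Algorithm 1: Greedy set cover",
    "reftext": "As shown in Algorithm 1 ...",
    "synopsis": "Greedy selection of the largest uncovered subset",
    "page": 4,
    "paperid": "doc123",
    "year": 2010,
    "ncites": 12,
}
```

Each indexed record carries `paperid` and `page`, so a search hit can point back to the page in the document where the pseudo-code appears.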
96. Funding Agency Impact
• Funding agency impact based on acknowledgement indexing
  – # of acknowledgements
  – total citations
  – #Citations / #acknowledgements (C/A) metric
• Based on acknowledgment entities extracted from 150K acknowledgements in CiteSeer
• New system available this spring: AckSeer
Funding agencies (name, # acknowledgements, total citations, C/A metric):
  National Science Foundation: 12287, 144643, 11.77
  Defense Advanced Research Projects Agency: 4712, 80659, 17.12
  Office of Naval Research: 3080, 48873, 15.87
  Deutsche Forschungsgemeinschaft: 2780, 9782, 3.52
  National Aeronautics and Space Administration: 2408, 21242, 8.82
  Engineering and Physical Science Research Council: 2007, 16582, 8.26
  Air Force Office of Scientific Research: 1657, 16850, 10.17
  National Sciences and Engineering Research Council of Canada: 1422, 12050, 8.47
  Department of Energy: 1054, 5562, 5.28
  Australian Research Council: 1010, 5464, 5.41
  European Union Information Technologies Program: 825, 9594, 11.63
  National Institutes of Health: 709, 7279, 10.27
  Army Research Office: 666, 7709, 11.58
  Netherlands Organization for Scientific Research: 646, 2843, 4.4
  Science and Engineering Research Council: 489, 6976, 14.27
Companies:
  International Business Machines: 1380, 23948, 17.35
  Intel Corporation: 962, 14441, 15.01
Educational institutions: Carnegie Mellon University, Massachusetts Institute of Technology, California Institute of Technology, Santa Fe Institute, French National Institute for Research in Computer Science, Stanford University, University of California at Berkeley, National Center for Supercomputing Applications, International Computer Science Institute, Cornell University, University of Illinois Urbana-Champaign, USC Information Sciences Institute, University of California Los Angeles, McGill University, Australian National University
Individuals: Olivier Danvy, Oded Goldreich
Giles, PNAS, 2004
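The C/A metric in these acknowledgement tables is total citations divided by the number of acknowledgements:

```python
def ca_metric(total_citations, num_acknowledgements):
    """Citations-per-acknowledgement impact metric: the total citations of
    the acknowledging papers divided by the number of acknowledgements of
    the entity, rounded to two decimals as on the slides."""
    return round(total_citations / num_acknowledgements, 2)
```

For example, the National Science Foundation row (144643 citations over 12287 acknowledgements) yields 11.77.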
97. Most Acknowledged Authors and Impact Factor
• Who is most acknowledged? Mom or dad? Theorists or experimentalists? Who has a better metric?
• Olivier Danvy was interviewed by Nature as to why he was the most acknowledged computer scientist
Author, citations, acknowledgements, C/A metric:
  Olivier Danvy: 847, 268, 29.85
  Oded Goldreich: 3277, 259, 17.82
  Luca Cardelli: 3847, 247, 43.91
  Tom Mitchell: 3336, 226, 24.31
  Martin Abadi: 3507, 222, 43.46
  Phil Wadler: 3780, 181, 40.07
  Moshe Vardi: 3786, 180, 33.86
  Peter Lee: 1790, 167, 53.54
  Avi Wigderson: 2566, 160, 18.13
  Matthias Felleisen: 1622, 154, 30.55
  Benjamin Pierce: 1484, 152, 30.53
  Noga Alon: 2640, 152, 15.71
  John Ousterhout: 3693, 152, 41.9
  Frank Pfenning: 1639, 148, 13.84
  Andrew Appel: 2064, 144, 52.99
98. Clouding CiteSeerX
• Hosting cloud CiteSeerX instances
• Economic issues
• Cost of hosting
• Cost of refactoring the source to be hosted in the cloud.
• Computational/technical issues
• What workflow to cloudize
• Component modification for efficient operation
• VM size: storage, memory and CPU sizing as a function of
needs
• Establishing computational needs and availability clusters
• Appropriate load balancing across multiple sites.
• Security of data stored including metadata and user data.
• Policy issues
• Privacy of user data
• Copyright issues.
Teregowda Cloud’10 USENIX’10
99. SeerSuite Research/Development Opportunities
• Old Seers
  – Improve or revive old systems and port them into competitive SeerX space
    • eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX
• New Seers
  – New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes)
  – MyCiteSeerX
• Better features
  – Parsing
  – Entity disambiguation
  – Citation analysis
  – Ranking; ranking, ranking
• New features
  – New parsing, indexing, ranking
    • Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc.
  – Homepage linking
  – ORE search and data integration
  – Collaborative spaces
  – API/web services
  – Integration with DL such as Fedora
  – New clusters
    • Topics, venues, affiliations
  – Recommender systems
  – SNA analysis
  – Others
Collaborations welcomed! Data and software available
100. Research SeerSuite supports
• Many uses as a research testbed and support structure
  – Scaling of algorithms for IR, IE, data mining, social networks, ...
  – NLP methods on large text collections
  – ML methods to automatically extract data
  – Novel indexing and ranking
  – Federated search
  – Collaborative and social networks
  – Focused crawling – new data resources
  – Interface design and integration
  – Systems analysis
• Many development and applied research issues
  – Integration with other DLs
  – Automated feature development
  – Transfer to nontechnical use
  – Cloud based delivery
101. Summary
• Propose an infrastructure for academic and scientific search engine/digital library creation - SeerSuite
  – Modular, scalable, extensible, robust
  – Based on commercial grade open source (Solr/Lucene); easy to use
  – Easy to apply to other domains (separable indexes and projects - integration)
• Allows scalable data mining and information extraction for actual systems
  – Unique information extraction plugins
  – Focus on unique scalable extraction/data mining methods
    • Most methods less than N² complexity
  – Automatically populates databases or data structures
• Demonstrate with beta systems in
  – Computer science, Archaeology, Chemistry, Robots.txt, PubMed, YouSeer, Tables, Figures, Maps, References, Collaborations, Disambiguation
  – Personal features
• Systems are reasonably easy to build; issues are
  – Data collection or data access
  – Information extraction, indexing, ranking
• Many uses as a research testbed
  – Data sharing models
• Want to find a Seer? Search Google or use my homepage.
102. Opportunities
• Science is being flooded with data
  – Simulations, sensors, web
• Digital humanities is right behind
• Needs in
  – Large scale data management (tera to peta)
    • NoSQL databases: graphs, documents, floating point
  – Large scale
    • data mining
    • information extraction
    • search
• Domain expertise crucial
• Reuse, not reinvent (much is out there)
• Solr/Lucene is great for demos, production, and research.
103. "Human attention is the scarce resource, not information." Herbert A. Simon, Nobel Laureate, 1997.
For more information
• clgiles.ist.psu.edu
• giles@ist.psu.edu
• SourceForge.com