Presented by C. Lee Giles, Pennsylvania State University - See complete conference videos - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Cyberinfrastructure, or e-science, has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source integrated system for building a search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formula search, and table indexing. We propose the open source SeerSuite architecture, a modular, extensible system built on successful open source projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples of specialized search engines that we have built, all using Solr/Lucene: CiteSeerX (computer science), ChemXSeer (chemistry), ArchSeer (archaeology), AckSeer (acknowledgements), RefSeer (reference recommendation), CollabSeer (collaboration recommendation), and others. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual information based feature selection, and sequence mining, are critical for performance.
1. Using Lucene/Solr to Build CiteSeerX and Friends
Dr. C. Lee Giles
Information Sciences and Technology
Computer Science and Engineering
The Pennsylvania State University
University Park, PA, USA
giles@ist.psu.edu
http://clgiles.ist.psu.edu
2. Prof. C. Lee Giles
http://clgiles.ist.psu.edu
• Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
– Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
– Large heterogeneous data and information systems
– Specialty search engines and portals for knowledge integration
• CiteSeerX (computer and information science)
• ChemXSeer (e-chemistry portal)
• GrantSeer (grant search)
• RefSeer (recommendation of paper references)
• Scalable intelligent tools/agents/methods/algorithms
– Information, knowledge and data integration
– Information and metadata extraction; entity disambiguation
– Unique search, knowledge discovery, information integration, and data mining algorithms
– Web 2.0 methods
• Automated tagging for search and information retrieval
• Social network analysis
3. SeerSuite Contributors/Collaborators: recent past and present (incomplete list)
Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …
• P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, and others.
• Current funding: NSF, Dow Chemical
4. Outline
• Motivation
– Data science; cyberinfrastructure
– Vast growth in domain science data and documents
• SeerSuite
– Tool for creating Seers
– Specialized data and document search and recommendations
• Tables, formulae, figures, references …
– Use of Solr/Lucene
• Disciplinary sciences, indexes & information extraction (the Seers)
– Computer science
– Chemistry
– Briefly, other Seers
• Opportunities for research
• Conclusions and directions
5. The Evolution of Science - the 4th Paradigm
Jim Gray's paradigm
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions
• Computational Science
– Simulate analytical model
– Validate model and make predictions
• Data Driven Science
– Data captured from the web, by instruments, or from documents
– Data generated by simulation
– Placed in data structures / files
– Scientist(s) analyze(s) data
– Access & search crucial
6. Data Access Varies with Discipline, or Small vs Big Science
• Small vs Big science
– "Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science."
• 'Lost in a Sea of Science Data', S. Carlson, The Chronicle of Higher Education (23/06/2006)
– Data is local
– Data will not be shared
• At some point there will be needed
– indices to control search
– parallel data search and analysis
• Cyberinfrastructure can help
– If you can't move the data around, take the analysis to the data! (Bandwidth of a van loaded with disks)
– Do all data manipulations locally
• Build custom procedures and functions locally
7. SeerSuite
• Open source search engine and digital library toolkit used to build search engines and digital libraries
– CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
• Supports research in
– Indexing and search
– Digital libraries
– Data mining & structures
– Information and knowledge extraction
– Social networks
– Scientometrics/infometrics
– Systems engineering, user design
– Software engineering and management
– Web crawling
• Trains students in search and software systems
– Educational tool for search engine creation
– Students highly sought in industry and government
8. SeerSuite - properties
• Modular, scalable, extensible, robust design
– Extensible to many problems and disciplines
• Integrated features
– Focused crawler - Heritrix
– Indexer - Solr/Lucene
– Metadata extraction - modular
– Ranked results
• Builds on experience with other domain engines and OS tools
– Lucene and Solr
– The MySQL database and InnoDB storage engine
– Apache Tomcat
– Spring Framework
– Acegi Security
– ActiveMQ
– ActiveBPEL Open Source Engine
– Apache Commons libraries
– SVMlight support vector machine package
– CRF++ conditional random field package
• Hardware independent; Linux
• Reuse, not reinvent
9. Data Mining & Information Extraction in Seers
• Data acquisition
• SeerSuite systems often crawl the public web for new data
• Many data types available
• Richness of data offers unique data mining features
• CiteSeerX as testbed/sandbox
• Large scale data resources
• Millions of documents, authors, etc.
• Some common features/metadata
• Commercial grade indexer (Solr/Lucene)
• Scalable to G’s of documents and M’s of users
• “Watson”
• Modular design
• Cloudable
• State of the art algorithms (machine learning) for large scale
unique metadata (information) extraction & mining
• Unique parsers and indexing
• Quality of extraction
• Precision/recall
• Ranking
• Architecture/integration
10. Seer Friends
• In various stages of the system lifecycle, with various data resources and indexes:
– Mature and developing, code released
• CiteSeer, now CiteSeerX
• ChemXSeer
• TableSeer
• YouSeer
– New, future TBD, not all aspects public
• ArchSeer
• AlgoSeer
• CollabSeer
• RefSeer
• SeerSeer
• GrantSeer
– Dead or limping by (could be revived)
• AckSeer (acknowledgement indexing) (revived!)
• BizSeer
• BotSeer
– Proposed, but do not exist
• BrainSeer
• CensorSeer
• ArXivSeer
11. Why Solr/Lucene?
• Only open source considered - cost
• Competitors:
– Indri
– Wumpus
– Terrier
– Others?
• Must scale for both number of documents and users
• Easily integrable and customizable
– Other indexes, crawlers, ingestion, metadata extractors
• Well used (Watson)
• Active community of support
– Enterprise platform a plus
• Easy to transition to government/industry/academia
– Apache license
12. Next Generation CiteSeer, CiteSeerX
• 2 M documents
• 40 M citations
• 2 to 5 M authors
• 2 to 4 M hits/day
• 800K individual users
• entire data shared
• Index - 50 G
http://citeseerx.ist.psu.edu
13. History: CiteSeer (aka ResearchIndex)
Project at NEC Research Institute, Princeton
1st academic document search engine
Very popular with computer science
Hosted at NEC from 1997 - 2004. Moved to Penn State as collaborators left.
Provided a broad range of unique services including automatic citation indexing, reference linking, full text indexing, similar document listing, automated metadata extraction, and several other pioneering features.
Refactored and redesigned as CiteSeerX; released 2008; Lucene based indexing.
CiteSeer continuously running for 15 years!
[Photos: C. Lee Giles, Kurt Bollacker, Steve Lawrence]
14. SeerSuite/CiteSeerX Architecture
• Web Application
• Focused Crawler
• Document Conversion and
Extraction
• Document Ingestion
• Data Storage
• Maintenance Services
• Federated Services
Teregowda, USENIX ‘10
15. 4 systems:
• Production
• Crawling
• Staging
• Research
All or some
can be
cloudized
Teregowda, USENIX 2010
16. CiteSeerX Services
CiteSeerX is a very automated system:
Full OAI metadata if available
Full text indexing (many different indexes)
- Documents
- Citations
- Tables
- More forthcoming (algorithms, figures, acknowledgements)
Citation graph
- Ranking based on citations
- Linking documents: co-citations, citing documents
Author disambiguation
- Distinguish between authors with similar names
- Profiles and publication information for each author
Automatic crawling from lists and submissions
Personalization
- Login based access to features on CiteSeerX
- Corrections to metadata
- Storage of queries
- Collections of papers
- Follows document metadata changes
17. Focused Crawling
• Maintain a list of parent URLs where documents were previously found
– Parent URLs are usually academic homepages.
• 300,000 unique parent URLs, as of summer 2011
– Parent URLs are stored in a database table with two additional fields for scheduling:
• Last time changed, i.e., new documents were obtained from the page.
• Estimated change rate according to previous crawls of this page.
• The crawling process starts with the scheduler selecting the 1000 parent URLs that have the highest probability of having new documents available.
– Assume a Poisson process for the change behavior of a parent page.
• Suppose a parent page P's last observed change occurred at time t1, and its estimated change rate is R; then at time t2 (t2 = t1 + Δ), the probability that it has changed again since t1 is 1 - exp(-R*Δ)
• Larger R or larger Δ gives a larger probability.
• After each crawl, the change rate of the scheduled parent URL is recalculated.
• Crawling runs incrementally daily (invoked by a Linux cron job at 12 am)
– Most discovered documents have been crawled before.
• Use hash table comparison to detect new documents
• Normally retrieve a few thousand NEW documents per day, sometimes less than 1k.
• Moved to whitelist vs blacklist
Zheng, CIKM'09
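The scheduling rule above can be sketched in a few lines (a minimal sketch, not the production scheduler; the `parents` record layout is an assumption):

```python
import math

def change_probability(rate, elapsed):
    """Poisson change model: P(page changed again) = 1 - exp(-R * delta)."""
    return 1.0 - math.exp(-rate * elapsed)

def schedule(parents, now, k=1000):
    """Pick the k parent URLs most likely to have new documents,
    using each page's estimated change rate and last observed change."""
    scored = [(change_probability(p["rate"], now - p["last_changed"]), p["url"])
              for p in parents]
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]
```

After each crawl, the selected page's `rate` would be re-estimated from its observed change history, so frequently changing homepages float to the top of the next day's schedule.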
18. Documents from Crawled URLs
[Chart: cumulative coverage of documents and citations over crawled sites]
- 90% of all citations come from the first 550 sites
- 90% of all documents come from the first 1250 sites
19. How will we get metadata for fields?
[Cartoon: "Now... that should clear up a few things around here"]
20. Metadata Extraction
• Documents are converted from PDF/PS to text using converters.
– Converters include TET, PDFBox, pdftotext, gs.
• Documents are filtered, checking for the existence of references and for duplication (checksum).
• Use tools or build your own
– The metadata extraction system uses machine learning methods like SVM (Header Parser) and CRF (ParsCit) to extract various entities from the document.
• Rule based templates are applied before extraction.
21. Automatically Created DB of a Paper in CSX
[Example record: "Tensor Decompositions and Applications", SIAM REVIEW, 2009, pp 455-500]
Fields include: id (e.g. 10.1.1.130.782), title, abstract, year, publisher (SIAM), venue (SIAM REVIEW), venueType (JOURNAL), version, cluster (9248987), n-cites (34), selfCites (6), public flag, repositoryID, crawldate (12/30/2008); each field is assigned by the extractor, by inference, or by the user.
22. 3 Tier Architecture
[Diagram: user requests pass through a load balancer to replicated web applications (Web 1, Web 2); the web application sends queries through a second load balancer to the index (full text and tables), the database, and the repository; crawler, ingestion, and extraction feed the storage tier.]
23. CiteSeerX Software Overview
• Ingestion process: responsible for obtaining and preparing a document and the related metadata.
– Process the document
• Submitted by the user or crawler
– Extract metadata
• Header
• Citations
• Acknowledgements
– Store the metadata and documents.
• Citation matching
– Identifying the underlying graph structure - documents citing this document and the relationship between documents and citations
• Inference matching and graph generation
– User corrections (version maintenance)
– Determine and accept valid user corrections
– Regular notification mechanisms
– Ensure that the user is notified when new documents are added to the collection
• Linked to MyCiteSeer.
• Update and maintenance
– Update and validate the full text index and various statistics.
– Statistics
– Index updates
24. CiteSeerX Search
Enabling search
Fulltext fields created:
- Title
- Authors
- Citations
- Venue
- Keywords
- Abstract
- Range (publication)
25. Field Schema
Field                | Type    | Indexed/Stored
DOI                  | String  | Y/Y - unique
Citation/Document    | String  | Y/Y
Title                | Text    | Y/Y
Author               | A Text  | Y/Y
Authors Normalized   | A Text  | Y/N
ncites (# cited by)  | Integer | Y/Y
URL                  | String  | Y/Y
cites                | Tokens  | Y/N
citedby              | Tokens  | Y/N
Timestamp            | Date    | Y/Y
* - A Text is a Text field which does not have a stopword filter or stemming
^ - Tokens are a Text field with only duplicate removal and a whitespace tokenizer
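The schema above might be declared in a Solr schema.xml along these lines (a sketch only; the field and type names beyond the table are assumptions, not the actual CiteSeerX schema):

```xml
<schema name="citeseerx-sketch" version="1.5">
  <types>
    <!-- "A Text": tokenized, but no stopword filter or stemming -->
    <fieldType name="atext" class="solr.TextField">
      <analyzer><tokenizer class="solr.StandardTokenizerFactory"/></analyzer>
    </fieldType>
    <!-- "Tokens": whitespace tokenization plus duplicate removal only -->
    <fieldType name="tokens" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="int" class="solr.TrieIntField"/>
    <fieldType name="date" class="solr.TrieDateField"/>
  </types>
  <fields>
    <field name="doi" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="atext" indexed="true" stored="true"/>
    <field name="author" type="atext" indexed="true" stored="true"/>
    <field name="authorNorms" type="atext" indexed="true" stored="false"/>
    <field name="ncites" type="int" indexed="true" stored="true"/>
    <field name="url" type="string" indexed="true" stored="true"/>
    <field name="cites" type="tokens" indexed="true" stored="false"/>
    <field name="citedby" type="tokens" indexed="true" stored="false"/>
    <field name="timestamp" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>doi</uniqueKey>
</schema>
```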
26. CiteSeerX Search Results
Results sorting:
Relevance (default)
- Based on dismax query handling with boosting.
Citations
- Citations received by the document in the collection, plus default relevance.
Year
- Publication date.
Recency
- Date of acquisition.
27. CiteSeerX Citation Graph Relationships
[Diagram: documents A-E connected by "cites" and "cited by" edges]
- Store "cited by" and "cites" in the index
- Build the document graph by querying the index for relationships.
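Reassembling the graph from per-document link fields can be sketched as follows (a toy sketch; it assumes each indexed record carries its `cites` tokens, as in the field schema):

```python
from collections import defaultdict

def build_graph(docs):
    """Build cites / cited-by adjacency from per-document token fields,
    mirroring how the citation graph can be rebuilt by querying the index."""
    cites = {d["doi"]: set(d.get("cites", [])) for d in docs}
    cited_by = defaultdict(set)
    for doi, targets in cites.items():
        for t in targets:
            cited_by[t].add(doi)
    return cites, dict(cited_by)
```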
28. Adding Documents
Ingest documents from new crawls:
- Add metadata to the collection
- Add full text to the system
- Link metadata in the collection
Run maintenance scripts:
- Poll updates (fulltext, metadata, relationships) and post to Solr.
Challenge: maintain data freshness.
29. Query Response
• Query forwarded to Solr from the presentation layer (JSP web interface)
• Solr generates a ranked response in JSON
• Each record is built in XML with the database (database fields such as Abstract are added)
• The presentation layer (JSP) formats records based on ranking.
30. Ranking with Boosting (Relevance)
Use of boost function, minimum match, and query fields:
Boost function - the effect of citations
- Map number of citations > 1 to 500
Minimum match - 2
Query fields (boosts)
- Text (1)
- Title (4)
- Abstract (2)
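In Solr's dismax syntax, settings like these might look as follows (a sketch; the field names and the particular boost function used to cap citation influence are assumptions, not the exact CiteSeerX configuration):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- query fields with boosts: text (1), title (4), abstract (2) -->
    <str name="qf">text^1.0 title^4.0 abstract^2.0</str>
    <!-- minimum match: at least 2 query terms must match -->
    <str name="mm">2</str>
    <!-- additive boost on the citation count field, capped so very
         highly cited papers do not swamp text relevance -->
    <str name="bf">min(ncites,500)</str>
  </lst>
</requestHandler>
```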
31. Query Response (flow)
[Diagram of the query flow:]
- Query Q entered at the web interface (JSP)
- Handed over to the web application (Java/Spring)
- Handed over to Solr; ranked response R returned from Solr as JSON (parsed into a HashMap)
- Response unwrapped; more details included with information from the DB
- Response presented at the interface (JSP)
32. Name Disambiguation
• Name disambiguation (NER)
– A person can be referred to in different ways, with different attributes, in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
• Three types of name ambiguities:
– Aliases - one person with multiple aliases, name variations, or a changed name, e.g. CL Giles & Lee Giles, Superman & Clark Kent
– Common names - more than one person shares a common name, e.g. Jian Huang - 103 papers in DBLP
– Typography errors - resulting from human input or automatic extraction
• Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.
33. Efficient Large Scale Entity Disambiguation
Testbed: CiteSeerX and PubMedSeer
Huang, et al. PKDD 2006; Treeratpituk, et al. JCDL 2009
• Entity disambiguation problem
– Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, and information from crawling such as host server, etc.
– Entity normalization
• Motivation
– Enhance search functionalities for digital repositories
• Fielded search by author name
– Improve metadata quality
– Improved social network analysis
– Government and business intelligence
• E.g. census data and credit records
• Key features
– LASVM distance function
• Active learning
– Simpler and more accurate model
– Better generalization power
• Online learning
– Expandable to new training data
– DBSCAN clustering
• Ameliorates labeling inconsistency (transitivity problem)
• Efficient solution to find name clusters
• N logN scaling
• Challenges
– Accuracy
– Scalability
– Expandability
[Pipeline diagram: metadata extraction module → similarity module (Jaccard, soft-TFIDF similarity; learned SVM distance function; online SVM with active learning and an annotator) → blocking module (candidate classes) → DBSCAN clustering over authors and papers]
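The record-pair similarity features feeding the learned distance function can be sketched as follows (a minimal sketch; the record layout and the particular features are illustrative assumptions — the real system learns a combined distance with LASVM over many such features):

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (e.g. coauthor lists)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def pair_features(rec1, rec2):
    """Feature vector for one pair of author records; a learned distance
    function would be trained on vectors like this."""
    return [
        jaccard(rec1["coauthors"], rec2["coauthors"]),
        jaccard(rec1["affiliation"].lower().split(),
                rec2["affiliation"].lower().split()),
        1.0 if rec1["email"] == rec2["email"] else 0.0,
    ]
```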
34. Author Disambiguation Field
• Currently uses author fields
– For author search (both for author mentions and for disambiguated authors)
• Future direction
– Use the Lucene index for blocking in author disambiguation - creating a candidate set of author mentions that could belong to the same cluster
35. Author Disambiguation
• Random Forest (RF)
– Use random feature selection + bootstrap sampling to construct multiple decision trees from one training set
– Aggregate the votes of the collection of decision trees as the final decision
– The more independent each tree is, the better the improvement over a single decision tree
• Author disambiguation with Random Forest
– Various metadata is used as features in the Random Forest to determine whether two author names from two papers refer to the same person
• E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc.
– Multiple distance functions are used for each type of metadata
• E.g. TFIDF, Jaccard distance for comparing affiliations
• Compared with the previous SVM-based approach
– Shown to provide higher accuracy than SVM in the pairwise author disambiguation task
– Easy parameterization in the training phase (only the number of trees and the randomness at each node; no decision on a kernel function needed), and performance is not sensitive to the parameters chosen
– Provides a measure of the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for an SVM with a non-linear kernel
– Training time & classification time are linear in the number of trees and the data size
• Also provides higher disambiguation accuracy when compared with other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree)
Treeratpituk, Giles, JCDL09
36. Data and Publications in the Field of Chemistry
Chemistry
• is not physics - no arXiv - or computer science - no CiteSeer
• Legacy of early information access - Chem Abstracts
• Cheminformatics is not bioinformatics
Chemistry has until recently been a data poor field
Data sharing traditions are just being established
Data creation is exploding - local (small science)
Journals and societies sensitive to their IP issues dominate the field
Unsubstantiated IP claims, such as that the data in the paper belongs to the publisher
Discourage online versions of publications - ACS
Large powerful international companies have a vested interest in research
Chemical information extraction tools are easily monetized
Standards exist - CML, InChI
"Fixing the past so we can fix the future." Jeremy Frey
Chemistry is an old discipline with publications going back 100 years
Chemistry is compound centric, not algorithm centric
Search is about the compound!
Compounds have a rich data environment
3D graph structure, energies, etc.
37. ChemXSeer Architecture
Integrate and implement well-used open source tools
Use CiteSeerX tools when possible
Integrate into SeerSuite
Search
- Unique chemical formula search
- Table search
- Figure search
More data (grey literature) than documents
• Automated information extraction modules based on machine learning methods
• Lucene/Solr indices for extracted fields
• Relational databases for datasets
Work closely with chemists to understand their needs
Tools for data conversion
Provide a public portal and repository for easy use
User access controls
Integrated visualization tools like JMOL for Gaussian data residing in our repository
APIs for users for extracted data
Data and document standards, de facto: XML, PDF, etc.
39. ChemXSeer Formula Search
• Extraction and search of chemical formulae in scientific
documents has been shown to be very useful.
• Intersection of two research areas:
• Information retrieval
• Chemoinformatics
• Formulae cannot be treated as text.
• Domain knowledge (formula identification)
• Structural knowledge (substructure finding and search)
B. Sun, WWW’07, WWW’08, TOIS’11
D. Yuan, ICDE’12
40. Challenges in Formula Search
How to identify a formula in scientific documents?
Non-Formula
“… This work was funded under NIH grants …”
“ … YSI 5301, Yellow Springs, OH, USA …”
“… action and disease. He has published over …”
Formula
“… such as hydroxyl radical OH, superoxide O2- …”
“ and the other He emissions scarcely changed …”
Machine learning algorithms (SVM + CRF) yield high
accuracies for correct formula identification.
43. Chemical Entity Extraction and Tagging
• Name tagging
– Each chemical name can be a phrase
– Examples
• "... Determination of lactic acid and ..."
• "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
• Formula tagging
– Each formula is a single term
– Example
• "... such as hydroxyl radical OH, superoxide ..."
– Non-formula example
• "... YSI 5301, Yellow Springs, OH, USA ..."
• Tagging examples
– Name tagging: "... of <name-type>lactic acid</name-type> and ..."
– Formula tagging: "... radical <formula-type>OH</formula-type>, superoxide ..."
44. Textual Chemical Molecule Information Indexing and Search
• Index schemes:
– Which tokens to index?
– Indexing all subsequences generates a very large index
• Segmentation-based index scheme
– Used for indexing chemical names
– First segment a chemical name hierarchically and then index the substrings at each node, e.g.:
  methylethyl → methyl | ethyl → meth | yl | eth | yl → me | th | …
• Frequency-and-discrimination-based index scheme
– Used for indexing chemical formulas
– Sequentially select frequent and discriminative subsequences of a formula, from the shortest to the longest
45. Features for Formula Indexing
• Formula
– A sequence of chemical elements or partial formulas with corresponding frequencies
– E.g. CH3(CH2)2OH
• Partial formula
– Partial formula: a subsequence of a formula
– E.g. C, H, O, CH3, CH2, OH, CH3(CH)2, H3(CH)2, CH3(CH)2O, etc.
• Index construction
– Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc.
– Too many partial formulas; need feature selection
46. Criteria of Feature Selection
• Criteria of feature selection
– Frequent features (Freq_s ≥ Freq_min)
– Discriminative features (α_s ≥ α_min)
• If a sequence's selected subsequences are enough to distinguish the formulas containing them from other formulas, this sequence is redundant.
• Discrimination score
  α_s = | ∩_{s'∈F, s'≺s} D_s' | / | D_s |
where F is the selected feature set, and D_s is the set of formulas containing s.
47. An Example for Formula Indexing
• Data set:
– 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH
• Parameters:
– Freq_min = 2, α_min = 1.1
• Steps:
– Length = 1: Candidates = {C, H, O}; F = {C, H, O}
– Length = 2: Candidates = {CH3, H3C, CO, OO, OH, CH2}; Frequent candidates = {CH3, CO, OO, OH, CH2}
  α_CH3 = | {1,2,3}_C ∩ {1,2,3}_H | / | {1,2,3}_CH3 | = 1
  α_CO = | {1,2,3}_C ∩ {1,2,3}_O | / | {1,3}_CO | = 1.5
  Frequent & discriminative candidates = {CO, OO, CH2}; F = {C, H, O, CO, OO, CH2}
– Length = 3, …
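The discrimination score from this example can be computed directly from posting sets (a sketch; `D` maps each candidate subsequence to the set of formula ids containing it, taken from the slide's data set):

```python
def discrimination_score(candidate, selected, D):
    """alpha_s = |intersection of D_{s'} over selected sub-features s' of s| / |D_s|.
    `selected` lists the already-selected sub-features of `candidate`."""
    inter = set.intersection(*(D[s] for s in selected))
    return len(inter) / len(D[candidate])

# Posting sets for the data set 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH:
D = {
    "C": {1, 2, 3}, "H": {1, 2, 3}, "O": {1, 2, 3},
    "CH3": {1, 2, 3},   # all three formulas contain CH3
    "CO": {1, 3},       # only 1 and 3 contain an adjacent C-O
}
```

With α_min = 1.1, CH3 scores 1.0 and is discarded as redundant, while CO scores 1.5 and is kept, matching the slide's worked steps.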
48. Formula Search
• SF.IEF: Subsequence Frequency & Inverse Entity Frequency
  SF(s,e) = Freq(s,e) / |e|,  IEF(s) = log( |C| / |{e | s ≺ e}| )
• Exact formula search
– Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2.
• Frequency formula search
– Full frequency search: search for formulas with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2.
– Partial frequency search: similar, but allows unspecified elements. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well.
– Ranking function
  score(q,e) = Σ_{s∈q} SF(s,e)·IEF(s)² / ( |e| × Σ_{s∈q} IEF(s)² )
49. Formula Search - Substructure
• Substructure formula search
– Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
– Ranking function
  score(s,e) = W_match(s,e) · SF(s,e) · IEF(s) / |e|
where W_match is the weight for exact match, reverse match, and parsed match
• Similarity formula search
– Search for formulas with a structure similar to the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
– Ranking function
  score(q,e) = Σ_{s≺q} W_match(q,e) · W(s) · SF(s,q) · SF(s,e) · IEF(s) / |e|
• Conjunctive search of the basic types of formula searches
– E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH.
• Document query rewriting
– E.g. the document query "atom formula:=CH4" is rewritten to "atom (CH4 OR CD4)" if formula search of =CH4 matches CH4 and CD4.
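The SF·IEF scoring above can be sketched over a toy collection (an illustrative sketch, not the production ranker; subsequence containment is simplified here to substring containment, and `|e|` is taken as the string length of the formula):

```python
import math

def sf(s, e):
    """Subsequence frequency: occurrences of partial formula s in entity e,
    normalized by the entity's length."""
    return e.count(s) / len(e)

def ief(s, collection):
    """Inverse entity frequency: log(|C| / #entities containing s)."""
    containing = sum(1 for e in collection if s in e)
    return math.log(len(collection) / containing) if containing else 0.0

def score(query_parts, e, collection):
    """score(q,e) = sum_s SF(s,e)*IEF(s)^2 / (|e| * sum_s IEF(s)^2)."""
    num = sum(sf(s, e) * ief(s, collection) ** 2 for s in query_parts)
    den = len(e) * sum(ief(s, collection) ** 2 for s in query_parts) or 1.0
    return num / den
```

As expected, a formula actually containing the query's partial formulas outranks one that does not, and partial formulas appearing in every entity contribute nothing (IEF = 0).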
50. Formula Search - Query Models
Many models are possible, from exact to semantic
Models are discriminated by their matching algorithms
• Exact search
– Search for exact representations
– E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2
• Frequency searches
– Full frequency search: search for formulae with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements
– E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2
– Partial frequency search: similar, but allows unspecified elements
– E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well
• Substructure search
– Search for formulae that may have a substructure
– E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
• Similarity search
– Search for formulae with a structure similar to the query formula. Feature-based approach using partial formula matching.
– E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
51. Ranking Formulae
• Ranking formulae has to depend on need and importance
• Focus on structural methods and frequency
• Importance can be introduced by citation rank or PageRank or others
• SF.IFF
– Substructure frequency and inverse formula frequency
• Frequency searches
–  score(q,f) = Σ_{e∈q} SF(e,f)·IFF(e)² / ( |f| × Σ_{e∈q} IFF(e)² )
– where |f| is the total frequency of elements
• Substructure search
–  score(q,f) = W_match(q,f) · SF(q,f) · IFF(q) / |f|
– where W_match(q,f) is the weight for exact match, reverse match, and parsed match
• Similarity search
–  score(q,f) = Σ_{s≺q} W_match(q,f) · W(s) · SF(s,q) · SF(s,f) · IFF(s) / |f|
52. Chemical Compounds as Graphs
• Chemical compound modeled as a semantic graph with properties
- Atom: vertex/node in the graph
- Bond: edge in the graph
- Dimensions: 3 or 4
[Above figures are copied from eMolecules.com]
53. What's Chemical Structure Search?
• Substructure search
– Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure.
• Superstructure search
– Given an input chemical structure sketch, find all the important descriptors (substructures / functional groups) contained in the input.
• Similarity search
– Given an input chemical structure sketch, find all the chemical compounds "similar" to the input.
54. Table Search
Tables are widely used to present experimental results or statistical
data in scientific documents; some data only exists in these tables.
Current search engines treat tabular data as regular text
• Structural information and semantics not preserved.
Goal: automatically identify tables, extract table metadata from pdf
documents into xml and rank data
Table Metadata Representation:
• Environment metadata: (document specifics: type, title,…)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (Caption, footnote, …)
• Layout metadata: (number of rows, columns, headers,…)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)
Y. Liu AAA’07, JCDL’07.
55. Tables
• A history that pre-dates that of sentential text
– Cuneiform clay tablets
• Not received the same level of formal characterization
enjoyed by sentential text
• Varying and irregular formats
• Different intuitive understanding of what a “table” is.
– Is the Periodic Table of the Elements a table?
– Tables vs. Lists?
– Tables vs. Forms?
– Tables vs. Figures?
– Genuine table vs. non-genuine table? [12]
• Our definition: scientific genuine table
– Caption + tabular structure
– Ruling lines are not required
58. Page Box-Cutting Algorithm
• Improves table detection performance by excluding more than 93.6% of document content at the beginning
59. Sample Table Metadata Extracted File
• <Table>
  <DocumentOrigin>Analyst</DocumentOrigin>
  <DocumentName>b006011i.pdf</DocumentName>
  <Year>2001</Year>
  <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors</DocumentTitle>
  <Author>Sang Hyun Park, a Young-Chan Son, a Brenda R. Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd - a Department of Chemistry, U-60, University of Connecticut, Storrs, CT 06269-3060</Author>
  <TheNumOfCiters></TheNumOfCiters>
  <Citers></Citers>
  <TableCaption>Table 1 Temperature effect on resistance change (ΔR) and response time of tin oxide thin film with 1% CCl4</TableCaption>
  <TableColumnHeading>Temperature/°C; ΔR a (R,O2); ΔR (%); Response time; Reproducibility</TableColumnHeading>
  <TableContent>100 223 5 ~22 min Yes; 200 270 9 ~7-8 min Yes; 300 1027 21 &lt;20 s Yes; 400 993 31 ~10 s No</TableContent>
  <TableFootnote>a ΔR = (R, CCl4) - (R, O2).</TableFootnote>
  <ColumnNum>5</ColumnNum>
  <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1 …</TableReferenceText>
  <PageNumOfTable>3</PageNumOfTable>
  <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
</Table>
60. TableRank
• Rank tables by rating the <query, table> pairs instead of the <query, document> pairs, preventing many of the false positive hits for table search that frequently occur in current web search engines
• The similarity between a <table, query> pair: the cosine of the angle between vectors
• Tailored term vector space => table vectors: query vectors and table vectors, instead of document vectors
61. Table Index
Index:
- Captions
- Footnotes
- Reference text
Boosting:
- Captions (2)
Function:
- Inversely (recip) proportional to #cites.
62. Term Weighting for Tables
– TTF-ITTF: Table Term Frequency - Inverse Table Term Frequency
– TLB: Table Level Boost factors (e.g., table frequency)
– DLB: Document Level Boost factors (e.g., journal/proceeding order, document citation)
63. Table Term Ranking
• A term occurring in a few tables is likely to be a better discriminator than a term appearing in most or all tables
• Similar to a document abstract, table metadata and table queries should be treated as semi-structured text
• Not complete sentences; they express a summary
• P = 0.5 (G. Salton 1988)
• b is the total number of tables
• IDF(ijk): the number of tables in which term t(i) occurs in the metadata m(k)
64. Table Level Boost and Document Level Boost
Btbf is the boost value of the table frequency
Btrt is the boost value of the table reference text (e.g., the normalized length), and
Btp is the boost value of the table position. r is a parameter, which is 1 if users
specify the table position in the query. Otherwise, r = 0.
IVj: document Importance Value (IV). If a table comes from a document with
a high IV , all the table terms of this document should get a high document
level boost
ICj: the inherited citation value (ICj)
DOj: source value (the rank of the journal/conference proceeding)
DFj: document freshness
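The named boost factors can be combined, for illustration, as simple sums; the slide defines the factors and the role of r but not the aggregation function, so the additive form below is an assumption:

```python
def table_level_boost(btbf, btrt, btp, r):
    """Table Level Boost from the slide's factors: table frequency boost
    (btbf), table reference text boost (btrt, e.g., normalized length), and
    table position boost (btp). r = 1 only if the user specified a table
    position in the query, else 0, so btp contributes only on request.
    The additive combination is an illustrative assumption."""
    return btbf + btrt + r * btp

def document_level_boost(ic, do, df):
    """Document-level Importance Value aggregated from the inherited citation
    value (ICj), the source value (DOj, venue rank), and document freshness
    (DFj) -- again a simple sum as an illustrative combination."""
    return ic + do + df
```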
65. Table citation network
• Similar to the PageRank network
  – Documents construct a network from the citations
  – The "incoming links" are the documents that cite the document in which the table is located
  – Exponential decay used to deal with the impact of the propagated importance
• Unlike the PageRank network
  – Directed Acyclic Graph
  – Importance Value (IV) of a document not decreased as the number of citations increases
  – IV not divided by the number of outbound links
• A document may have multiple, one, or no tables
• Each table consists of a set of metadata
• Same keywords may appear in different metadata in different tables
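A minimal sketch of IV propagation with exponential decay over the citation DAG; the decay rate and depth cutoff are illustrative assumptions, not values from the talk:

```python
def importance(doc, citers, depth=0, decay=0.5, max_depth=3):
    """Illustrative Importance Value (IV) propagation on a citation DAG.

    citers maps a document id to the ids of documents that cite it (the
    "incoming links"). Unlike PageRank, IV is not divided by the number of
    outbound links and never decreases as citations grow: each citing
    document adds exponentially decayed credit. decay and max_depth are
    assumed parameters for the sketch.
    """
    if depth > max_depth:
        return 0.0
    iv = 1.0  # base importance of the document itself
    for citing in citers.get(doc, ()):
        iv += decay * importance(citing, citers, depth + 1, decay, max_depth)
    return iv
```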
66. Table Search Summary
• A novel, first table ranking algorithm -- TableRank
• A tailored table term vector space
• A table term weighting scheme – TTF-ITTF
  – Aggregating impact factors from three levels: the term, the table, and the document
• Index table reference texts, term locations, and document backgrounds
• Design and implement the first table search engine, TableSeer, to evaluate TableRank and compare with popular web search engines
• Code released
• Currently implemented in CiteSeerX - millions of tables
• Improving extraction – Dow Chemical support
67. Automated Figure Data Extraction and Search
• Large amounts of results in digital documents are recorded in figures, time series, and experimental results (e.g., NMR spectra, income growth), and these are often the only record of the data
• Extraction for purposes of:
  – Further modeling using presented data
  – Indexing, metadata creation for storage & search on figures for data reuse
• Current extraction done manually!!
[Pipeline diagram: documents → plot extraction → extracted plot + extracted info → document index and plot index → merged index → digital library → user]
68. Seer Figure/Plot Data Extraction and Search
Numerical data in scientific publications are often found in figures. Tools that automate the data extraction from figures provide the following:
• Increases our understanding of key concepts of papers
• Provides data for automatic comparative analyses.
• Enables regeneration of figures in different contexts.
• Enables search for documents with figures containing
specific experiment results.
X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08
69. Metadata & data to extract: 2-Dimensional Plot
[Figure: snapshot of a document page and the extracted 2D plot, annotated with Y-axis labels, legend, data points, ticks, axis units, and X-axis label]
70. Our Approach to Plot Data Extraction
• Identify and extract figures from digital documents
• ASCII and image extraction (xpdf)
• OCR - bitmap, raster PDFs
• Identify figures as images of 2D plots using SVM (only for bitmap images)
• Hough transform
• Wavelet coefficients of the image
• Surrounding text features
• Binarization of the 2D plots identified for preprocessing (No
need for Vectorized Images)
• Adaptive Thresholding
• Image segmentation to identify regions
• Profiling or Image Signature
• Text block detection
• Nearest Neighbor
• Data point detection
• K-means Filtering
• Data point disambiguation for overlapping points
• Simulated Annealing
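The binarization step above can be sketched as a simple mean-based adaptive threshold; this is a minimal illustration (a production system would use an optimized implementation), and the window and offset values are assumptions:

```python
def adaptive_threshold(img, window=15, offset=10):
    """Adaptive-thresholding binarization of a grayscale plot image.

    img is a 2D list of 0-255 grayscale values. Each pixel is compared
    against the mean of its local window; pixels darker than the local mean
    minus an offset are marked as foreground ink. window and offset are
    illustrative parameters.
    """
    h, w = len(img), len(img[0])
    half = window // 2
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - half), min(h, y + half + 1))
            xs = range(max(0, x - half), min(w, x + half + 1))
            vals = [img[yy][xx] for yy in ys for xx in xs]
            out[y][x] = img[y][x] < sum(vals) / len(vals) - offset
    return out
```

The resulting binary mask then feeds the segmentation, text block detection, and data point detection stages.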
71. Future Directions
• System integration within ChemXSeer or CiteSeerX
  – XML data generation
  – Open source tool in Lucene/SOLR
• Extension to other figures (3D, …)
[Example: extracted 3D surface plot]
72. ChemXSeer Highlights
• Portal for academic researchers in environmental chemistry which integrates the scientific
literature with experimental, analytical and simulation results and tools
• Provides unique metadata extraction, indexing and searching pertinent to the chemical
literature by using heuristics combined with machine learning
• Chemical formulae and names
• Tables
• Figures
• Publication functions as in CiteSeerX
• Interoperability ORE-Chem development
• Novel ranking required
• After extraction, data stored as API-accessible XML for users
• Hybrid repository (Not fully open): Serves as a federated information interoperational system
• Scientific papers crawled and indexed from the web
• User submitted papers and datasets (e.g. excel worksheets, Gaussian and CHARMM
toolkit outputs)
• Scientific documents and metadata from publishers (e.g. Royal Society of Chemistry)
• Access control for publisher-provided content and user-submitted experiment data
• Takes advantage of developments in other funded cyberinfrastructure and open source
projects
• CiteSeerX, PlanetLab, Lucene/Solr, ORE, others
• Some released open source
74. Collaboration recommendation
• Metadata of authors and coauthors and topics of interest (similar to expert recommendation)
• Use social network and topics to recommend collaborators of collaborators (FOF)
• Devise SN index and ranking scheme
• Explore models of vertex similarity
• Built on SeerSuite
• Other recommendations?
  – Experimental methods
  – Chemicals?
Gou JCDL'10, Gou MIR'10, Chen JCDL'11, SAC'12
79. Integration of Vertex Similarity and Textual Similarity
– S: vertex similarity
– SC.O.T.: collaborator's contribution to a specified topic
– Use the product of exponential functions to avoid a zero vertex similarity score or a zero contribution (textual similarity) score turning the whole measure into zero
• Other measures?
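The "product of exponential functions" idea can be sketched as follows; the exact formula is not on the slide, so the form below is an assumption that only illustrates why neither factor can zero out the combined measure:

```python
import math

def combined_score(vertex_sim, topic_contrib):
    """Combine vertex similarity S and a collaborator's contribution to a
    specified topic (SC.O.T.) via a product of exponentials. Because
    exp(0) = 1, a zero in either component merely fails to boost the score
    instead of collapsing the whole product to zero. The exponents and
    normalization are illustrative assumptions.
    """
    return math.exp(vertex_sim) * math.exp(topic_contrib)
```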
80. RefSeerX: recommend citations for papers
• Uses a paper's citations; the authors may be unaware of related work they do not know they are looking for, and RefSeerX recommends related citations
• Based on
  – Existing citations
  – Citation context
  – Venue and importance
  – Contemporary vs seminal
83. Expert Search
• Expert search for authors, currently in alpha
84. Expert Search
• Expert search for authors, currently in alpha
85. Keyphrase Extraction for experts
• Section Parser: parse the text document into sections with regular expressions
• Candidate Extractor: use DBLP data/statistics to extract keyphrase candidates
• Random Forest: trained on labeled training data to classify & rank whether a phrase is a keyphrase
• Output: top keyphrases
Treeratpituk, P., Teregowda, P., Huang, J. and Giles, CL. SEERLAB: A System for Extracting Keyphrases from
Scholarly Documents, Semeval-2010 task 5: Automatic keyphrase extraction from scientific article. ACL workshop
on Semantic Evaluations (SemEval 2010), Sweden, July 2010.
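The classify-and-rank step can be sketched with an off-the-shelf random forest. The features here (frequency in the document, occurrence in DBLP titles, phrase length) are hypothetical stand-ins for SEERLAB's actual feature set, and the function name is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

def rank_candidates(train_feats, train_labels, cand_feats, candidates, top_k=5):
    """Train a random forest on labeled phrase features, then rank candidate
    phrases by the predicted probability of being a keyphrase."""
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(train_feats, train_labels)
    scores = rf.predict_proba(cand_feats)[:, 1]  # P(phrase is a keyphrase)
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [cand for cand, _ in ranked[:top_k]]
```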
86. GrantSeer
• Prototype search engine for PI profiles and their grant information to assist funding agencies, deans of research, foundations
• Link PIs with their
  – Grants
  – Publications
  – Citations
  – Organization
  – Expertise
  – Others?
• Data that can be shared
  – CiteSeerX or Google Scholar data
  – Database of funded research
Funded by NSF – Julia Lane
92. Metadata extraction
• Extract
• Pseudo-codes and their metadata
• Captions
• Reference sentences
• Synopsis
• Etc.
• Index metadata using Solr to make the pseudo-
codes searchable
• Each search result has a pointer to the page in the
document where the pseudo-code appears
93. Index Fields
id <string>
caption <text>
reftext <text> (Reference Sentences)
synopsis <text> (Summarizing Text)
page <sint> (Page Number)
paperid <string> (Document ID)
year <sint> (Year of Publication)
ncites <sint> (Number of Citations)
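The fields above map naturally onto a Solr JSON update call. A minimal sketch using only the standard library, assuming a local Solr instance with a core named `pseudocode`; the URL, core name, and field values are illustrative, not from the talk:

```python
import json
import urllib.request

def index_pseudocode(doc, solr_url="http://localhost:8983/solr/pseudocode"):
    """Post one extracted pseudo-code record to Solr's JSON update endpoint
    and commit, making it immediately searchable."""
    payload = json.dumps([doc]).encode("utf-8")
    req = urllib.request.Request(
        solr_url + "/update?commit=true",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# Example record shaped like the index fields on the slide
# (all values here are hypothetical).
doc = {
    "id": "doc123-algo1",
    "caption": "Algorithm 1: Greedy set cover",
    "reftext": "As shown in Algorithm 1 ...",
    "synopsis": "Greedy selection of the largest uncovered subset",
    "page": 4,
    "paperid": "doc123",
    "year": 2010,
    "ncites": 12,
}
```

Each indexed record carries `paperid` and `page`, so a search hit can point back to the page in the document where the pseudo-code appears.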
96. Funding Agency Impact
• Funding agency impact based on acknowledgement indexing
  – # of acknowledgements
  – total citations
  – #Citations / #acknowledgements (C/A) metric
• Based on acknowledgment entities extracted from 150K acknowledgements in CiteSeer
• New system available this spring: AckSeer
Funding agencies (name, # acknowledgements, total citations, C/A metric):
  National Science Foundation: 12287, 144643, 11.77
  Defense Advanced Research Projects Agency: 4712, 80659, 17.12
  Office of Naval Research: 3080, 48873, 15.87
  Deutsche Forschungsgemeinschaft: 2780, 9782, 3.52
  National Aeronautics and Space Administration: 2408, 21242, 8.82
  Engineering and Physical Science Research Council: 2007, 16582, 8.26
  Air Force Office of Scientific Research: 1657, 16850, 10.17
  National Sciences and Engineering Research Council of Canada: 1422, 12050, 8.47
  Department of Energy: 1054, 5562, 5.28
  Australian Research Council: 1010, 5464, 5.41
  European Union Information Technologies Program: 825, 9594, 11.63
  National Institutes of Health: 709, 7279, 10.27
  Army Research Office: 666, 7709, 11.58
  Netherlands Organization for Scientific Research: 646, 2843, 4.4
  Science and Engineering Research Council: 489, 6976, 14.27
Companies:
  International Business Machines: 1380, 23948, 17.35
  Intel Corporation: 962, 14441, 15.01
Educational institutions: Carnegie Mellon University, Massachusetts Institute of Technology, California Institute of Technology, Santa Fe Institute, French National Institute for Research in Computer Science, Stanford University, University of California at Berkeley, National Center for Supercomputing Applications, International Computer Science Institute, Cornell University, University of Illinois Urbana-Champaign, USC Information Sciences Institute, University of California Los Angeles, McGill University, Australian National University
Individuals: Olivier Danvy, Oded Goldreich
Giles, PNAS, 2004
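The C/A metric in these acknowledgement tables is total citations divided by the number of acknowledgements:

```python
def ca_metric(total_citations, num_acknowledgements):
    """Citations-per-acknowledgement impact metric: the total citations of
    the acknowledging papers divided by the number of acknowledgements of
    the entity, rounded to two decimals as on the slides."""
    return round(total_citations / num_acknowledgements, 2)
```

For example, the National Science Foundation row (144643 citations over 12287 acknowledgements) yields 11.77.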
97. Most Acknowledged Authors and Impact Factor
• Who is most acknowledged? Mom or dad? Theorists or experimentalists? Who has a better metric?
• Olivier Danvy was interviewed by Nature as to why he was the most acknowledged computer scientist
Author, citations, acknowledgements, C/A metric:
  Olivier Danvy: 847, 268, 29.85
  Oded Goldreich: 3277, 259, 17.82
  Luca Cardelli: 3847, 247, 43.91
  Tom Mitchell: 3336, 226, 24.31
  Martin Abadi: 3507, 222, 43.46
  Phil Wadler: 3780, 181, 40.07
  Moshe Vardi: 3786, 180, 33.86
  Peter Lee: 1790, 167, 53.54
  Avi Wigderson: 2566, 160, 18.13
  Matthias Felleisen: 1622, 154, 30.55
  Benjamin Pierce: 1484, 152, 30.53
  Noga Alon: 2640, 152, 15.71
  John Ousterhout: 3693, 152, 41.9
  Frank Pfenning: 1639, 148, 13.84
  Andrew Appel: 2064, 144, 52.99
98. Clouding CiteSeerX
• Hosting cloud CiteSeerX instances
• Economic issues
• Cost of hosting
• Cost of refactoring the source to be hosted in the cloud.
• Computational/technical issues
• What workflow to cloudize
• Component modification for efficient operation
• VM size: storage, memory and CPU sizing as a function of
needs
• Establishing computational needs and availability clusters
• Appropriate load balancing across multiple sites.
• Security of data stored including metadata and user data.
• Policy issues
• Privacy of user data
• Copyright issues.
Teregowda Cloud’10 USENIX’10
99. SeerSuite Research/Development Opportunities
• Old Seers
  – Improve or revive old systems and port them into competitive SeerX space
    • eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX
• New Seers
  – New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes)
  – MyCiteSeerX
• Better features
  – Parsing
  – Entity disambiguation
  – Citation analysis
  – Ranking; ranking, ranking
• New features
  – New parsing, indexing, ranking
    • Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc.
  – Homepage linking
  – ORE search and data integration
  – Collaborative spaces
  – API/web services
  – Integration with DL such as Fedora
  – New clusters
    • Topics, venues, affiliations
  – Recommender systems
  – SNA analysis
  – Others
Collaborations welcomed! Data and software available
100. Research SeerSuite supports
• Many uses as a research testbed and support structure
  – Scaling of algorithms for IR, IE, data mining, social networks, ...
  – NLP methods on large text collections
  – ML methods to automatically extract data
  – Novel indexing and ranking
  – Federated search
  – Collaborative and social networks
  – Focused crawling – new data resources
  – Interface design and integration
  – Systems analysis
• Many development and applied research issues
  – Integration with other DLs
  – Automated feature development
  – Transfer to nontechnical use
  – Cloud based delivery
101. Summary
• Propose an infrastructure for academic and scientific search engine/digital library creation - SeerSuite
  – Modular, scalable, extensible, robust
  – Based on commercial grade open source (Solr/Lucene); easy to use
  – Easy to apply to other domains (separable indexes and projects - integration)
• Allows scalable data mining and information extraction for actual systems
  – Unique information extraction plugins
  – Focus on unique scalable extraction/data mining methods
    • Most methods less than N² complexity
  – Automatically populates databases or data structures
• Demonstrate with beta systems in
  – Computer science, Archaeology, Chemistry, Robots.txt, PubMed, YouSeer, Tables, Figures, Maps, References, Collaborations, Disambiguation
  – Personal features
• Systems are reasonably easy to build; issues are
  – Data collection or data access
  – Information extraction, indexing, ranking
• Many uses as a research testbed
  – Data sharing models
• Want to find a Seer? Search Google or use my homepage.
102. Opportunities
• Science is being flooded with data
  – Simulations, sensors, web
• Digital humanities is right behind
• Needs in
  – Large scale data management (tera to peta)
    • NoSQL databases: graphs, documents, floating point
  – Large scale
    • data mining
    • information extraction
    • search
• Domain expertise crucial
• Reuse, not reinvent (much is out there)
• Solr/Lucene is great for demos, production, and research.
103. "Human attention is the scarce resource, not information." Herbert A. Simon, Nobel Laureate, 1997.
For more information
• clgiles.ist.psu.edu
• giles@ist.psu.edu
• SourceForge.com