University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature
1. Extraction and Representation of in silico Biological Methods from the Literature
Geraint Duck
Supervisors: Robert Stevens, Goran Nenadic and David Robertson
Advisor: Joshua Knowles
School of Computer Science, University of Manchester
2. Importance of Method in Science
• Understanding
– Key part of research, central to science
– Reproducibility and replication
– What? Why? Where? How? When?
– Extension
• Advise/evaluate
– “Current Approach”
– “Best Practice”
3. Background
• In silico: performed on a computer, or through computer simulation
• Bioinformatics is a resource-focused domain
– Numerous resources appearing
– Literature is growing rapidly
• Resource availability and usage are central to biological research
• Current attempts are often manually curated and/or incomplete
4. The Method to Obtain a Method
1. Extraction
– Automatically extract resource and task mentions from the bioinformatics literature
• This presentation focuses on this step
2. Representation and Analysis
– Evaluate the extracted mentions for patterns of representation
3. Exploration
– Provide a means of exploring the methods extracted to aid other research/researchers
5. Key Hypothesis: Resource ordering implies method
• An analogy: baking a cake
– Ingredients: butter, eggs, flour, sugar, etc…
– Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30 mins…
7. Example: Lagerström et al. (2006)
… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used…
… divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 …
The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package …
[excerpt removed]
… branch lengths were estimated in TreePuzzle using the following parameters …
… constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, …
All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.
10. Example: Lagerström et al. (2006)
Key: Resource, task
• GenBank
• BLAT, aligned
• BLAST, searched
• ClustalW, aligned
• SEQBOOT, bootstrapped (Phylip)
• TreePuzzle, estimated
• ClustalW, aligned
• infoalign, scored (EMBOSS)
• MiniTab, statistics
• MS Excel, graphs plotted
• MiniTab, graphs plotted
Stage labels in the figure: Tree Construction; Sequence and Tree Analysis; Result Visualisation; Sequence Alignment
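One possible way to hold the resource/task pairs from this example in code is an ordered list of small records. This is only a sketch of a representation, not the format used in the work presented; the names `Step`, `resource_sequence` and `tasks_for` are hypothetical.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    """One resource mention paired with the task it was used for."""
    resource: str
    task: str


# Pairs transcribed from the slide. GenBank is left out because the
# slide shows no task next to it.
method: List[Step] = [
    Step("BLAT", "aligned"),
    Step("BLAST", "searched"),
    Step("ClustalW", "aligned"),
    Step("SEQBOOT", "bootstrapped"),
    Step("TreePuzzle", "estimated"),
    Step("ClustalW", "aligned"),
    Step("infoalign", "scored"),
    Step("MiniTab", "statistics"),
    Step("MS Excel", "graphs plotted"),
    Step("MiniTab", "graphs plotted"),
]


def resource_sequence(steps: List[Step]) -> List[str]:
    """Resources in order of mention, repeats kept."""
    return [s.resource for s in steps]


def tasks_for(steps: List[Step], resource: str) -> List[str]:
    """Every task attached to one resource, in order of mention."""
    return [s.task for s in steps if s.resource == resource]


print(resource_sequence(method))
print(tasks_for(method, "MiniTab"))
```

Keeping repeats in the sequence preserves the ordering that the key hypothesis relies on; whether mention order reflects order of use is a separate question.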
11. Example…
• Multiple methods
– Usage counts
– Recentness of use
– “best-practice”
12. Challenges - Ambiguity
• leg
• white
• cab
• HIV
– Human immunodeficiency virus
– Human immunovirus
• analysis
• Network
• graph
• DIP
– distal interphalangeal
– Database of Interacting Proteins
13. Challenges - Variability
• Orthographics
– Swiss Prot
– SWISS-PROT
– SwissProt
• Misspellings and typos
– One paper, same resource, spelt 3 different ways
• Abbreviations
– Different authors can use different acronyms for the same thing
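A minimal sketch of how the orthographic variants above could be collapsed to one lookup key. It is an illustration only, not the approach taken by bioNerDS; `normalise_name` is a hypothetical helper.

```python
import re


def normalise_name(name: str) -> str:
    """Drop spaces, hyphens and underscores, then lowercase."""
    # \u2010 and \u2011 are Unicode hyphens that often survive PDF extraction.
    return re.sub(r"[\s\-_\u2010\u2011]+", "", name).lower()


variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]
keys = {normalise_name(v) for v in variants}
print(keys)
```

Normalising this aggressively trades variability for ambiguity: short names such as DIP become indistinguishable from the ordinary word once case is removed.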
14. Name Composition
• Majority are single nouns
– includes acronyms
• 6% lowercase common nouns
– affy, bioconductor
• A few contained numbers
– S4, t2prhd
• A few misclassified as verbs
– …each query protein is first BLASTed with…
– …held near their equilibrium values using SHAKE.
– …graphical representations were achieved using dot v1.10…
15. Name Composition
• Longest Names (most tokens)
– Corpus: 5 tokens (Gene Expression Profile Analysis Suite)
– Dictionary: 12 tokens (Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences)
• Evaluated token frequencies within our dictionary
– Long-tail curve
– 87% of tokens used only once
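The token-frequency evaluation can be sketched as follows, assuming whitespace tokenisation and case-insensitive counting (both assumptions; the slides do not say how tokens were defined). The four names are a toy dictionary for illustration only and `singleton_fraction` is a hypothetical helper.

```python
from collections import Counter
from typing import List

# Toy dictionary, not the one used in the study.
names = [
    "Gene Expression Profile Analysis Suite",
    "Gene Ontology",
    "Protein Data Bank",
    "Gene Expression Omnibus",
]


def singleton_fraction(names: List[str]) -> float:
    """Fraction of distinct tokens that occur exactly once."""
    counts = Counter(tok.lower() for name in names for tok in name.split())
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


print(f"{singleton_fraction(names):.0%} of tokens occur once")
```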
17. Named Entity Recognition (NER)
• Variety of NER uses
– Species
– Gene/protein names
– Chemical names
• Variety of NER accuracy
– 95% F-score species (LINNAEUS)
– 73% F-score (strict) gene name (ABNER)
– Over 70% F-score chemical names (OSCAR3)
18. bioNerDS
• Automatically matches database and software names in the literature
– Uses dictionary, rules and clues
• F-scores between 63 and 91%
– Mixed results depending on corpus
– Issues of multiple mentions of a single resource in one paper
– Ambiguity and variability…
http://bionerds.sourceforge.net/
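To make "dictionary, rules and clues" concrete, here is a toy matcher that combines a dictionary lookup with one contextual clue (a capitalised token followed by a word such as "package"). It is a sketch under stated assumptions and does not reproduce bioNerDS; the dictionary, the clue words and `find_mentions` are all hypothetical.

```python
import re
from typing import List, Tuple

DICTIONARY = {"blast", "clustalw", "genbank"}  # hypothetical toy dictionary
CLUE_WORDS = {"database", "software", "package", "program", "server"}  # hypothetical clues


def find_mentions(sentence: str) -> List[Tuple[str, str]]:
    """Return (token, reason) pairs for tokens that look like resource names."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9\-]*", sentence)
    mentions = []
    for i, tok in enumerate(tokens):
        following = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if tok.lower() in DICTIONARY:
            mentions.append((tok, "dictionary"))
        elif following in CLUE_WORDS and tok[0].isupper():
            # Rule: a capitalised token directly before a clue word.
            mentions.append((tok, "clue"))
    return mentions


example = "Sequences were aligned using ClustalW and scored with the EMBOSS package."
print(find_mentions(example))
```

The clue rule lets the matcher pick up a name that is missing from the dictionary, at the cost of false positives whenever a capitalised ordinary word precedes a clue word.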
20. Preliminary Analysis of Resource Usage
• Used bioNerDS to extract name mentions from two journals:
– Genome Biology
– BMC Bioinformatics
• Analysed differences
21. bioNerDS: Results
• Over 36,000 mentions in BMC Bioinformatics
• Over 15,000 mentions in Genome Biology
• 78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention
• The most-mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB
• The general trend across both journals has most major resources declining in usage
23. bioNerDS: Full PMC Set
• Run on full open-access PMC set
– ~230,000 full-text articles
– ~1,000 different journals
– Extracted ~1.8M mentions
• Method?
• Method fingerprints
• Trying to extract (data-mine):
– Ordering
– Patterns
– Co-occurrence
– Relationships
– Association rules
– Frequent subsets
– “Networks”
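Of the items listed, co-occurrence is the simplest to sketch: count how many papers mention each pair of resources together. The paper identifiers and mention lists below are invented for illustration, and `cooccurrence_counts` is a hypothetical helper rather than part of bioNerDS.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List

# Hypothetical per-paper mention lists.
papers: Dict[str, List[str]] = {
    "paper_a": ["BLAST", "ClustalW", "Phylip"],
    "paper_b": ["BLAST", "ClustalW"],
    "paper_c": ["BLAST", "GenBank"],
}


def cooccurrence_counts(papers: Dict[str, List[str]]) -> Counter:
    """Count papers in which each unordered pair of resources co-occurs."""
    counts: Counter = Counter()
    for mentions in papers.values():
        # set() so repeated mentions in one paper count once;
        # sorted() so each pair has a single canonical order.
        for pair in combinations(sorted(set(mentions)), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(papers)
print(counts.most_common(1))
```

Pair counts like these are the usual starting point for association rules and frequent-subset mining, both of which extend the same idea to larger sets of resources.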
24. Method Analysis and Exploration
• Mining “best-practice”: Metrics
– Most common
– Newest
– Who uses it
– What resources it comprises
• Challenges
– Scientific discourse – provenance information
– Mention order does not imply order of use
• Clustering and associations
• Fingerprints
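Two of the metrics above, "most common" and "newest", reduce to a usage count and a latest year per resource. The sketch below assumes mentions are available as (resource, publication year) pairs; the data are invented and `usage_summary` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Hypothetical (resource, publication year) mentions.
mentions: List[Tuple[str, int]] = [
    ("ClustalW", 2006),
    ("ClustalW", 2009),
    ("BLAT", 2010),
    ("ClustalW", 2004),
    ("BLAT", 2011),
]


def usage_summary(mentions: List[Tuple[str, int]]) -> Dict[str, Dict[str, int]]:
    """Per resource: number of mentions and most recent year seen."""
    summary: Dict[str, Dict[str, int]] = {}
    for resource, year in mentions:
        entry = summary.setdefault(resource, {"count": 0, "latest": year})
        entry["count"] += 1
        entry["latest"] = max(entry["latest"], year)
    return summary


summary = usage_summary(mentions)
for resource, stats in sorted(summary.items(), key=lambda kv: -kv[1]["count"]):
    print(resource, stats)
```

The two metrics can disagree, as in this toy data where the most-used resource is not the most recently used one, which is one reason a single "best-practice" ranking is hard to define.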
25. Conclusion
• Literature mining bioinformatics in silico methods
• Developed bioNerDS: automated resource name extraction
• Extracting and analysing patterns of resource usage
– Full PMC corpus
• Provided a way to extract method for any resource-based domain
– Applied this to bioinformatics
27. Resource Mentions per Journal
Journal | Total Articles | Total Mentions | Ratio (mentions per article)
Nucleic Acids Research | 7,192 | 200,339 | 27.8558
PLoS One | 15,791 | 168,624 | 10.6785
BMC Bioinformatics | 3,982 | 149,668 | 37.5861
BMC Genomics | 3,203 | 90,396 | 28.2223
Genome Biology | 2,321 | 48,976 | 21.1012
Acta Crystallographica. Section E, Structure Reports Online | 11,834 | 41,383 | 3.497
BMC Evolutionary Biology | 1,570 | 31,222 | 19.8866
PLoS Computational Biology | 1,613 | 30,185 | 18.7136
PLoS Genetics | 1,876 | 29,734 | 15.8497
PLoS Pathogens | 1,691 | 20,661 | 12.2182
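The ratio column is total mentions divided by total articles. A short check using three rows from the table (`mentions_per_article` is a hypothetical helper name):

```python
def mentions_per_article(mentions: int, articles: int) -> float:
    """Average number of resource mentions per article."""
    return mentions / articles


# (journal, total articles, total mentions), taken from the table.
rows = [
    ("Nucleic Acids Research", 7192, 200339),
    ("PLoS One", 15791, 168624),
    ("BMC Bioinformatics", 3982, 149668),
]

for journal, articles, mentions in rows:
    print(f"{journal}: {mentions_per_article(mentions, articles):.4f}")
```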
29. Named Entity Recognition (NER)
• Evaluating NER
– True positives, false positives, false negatives
• tp: Correct
• fp: Returned incorrect
• fn: Missed
– Precision: tp / (tp + fp)
• How accurate are the results we obtained
– Recall: tp / (tp + fn)
• How many of the total correct results did we obtain
– F-score: 2 × P × R / (P + R)
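The three measures can be computed directly from the counts. The counts in the example are invented for illustration; `precision_recall_f` is a hypothetical helper name, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 80 correct, 20 returned incorrectly, 40 missed.
p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```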
30. Named Entity Recognition (NER)
• Evaluating NER
– True positives, false positives, false negatives
– Precision: tp / (tp + fp)
– Recall: tp / (tp + fn)
– F-score: 2 × P × R / (P + R)
• Variety of NER accuracy
– 95% F-score species (LINNAEUS)
– 73% F-score (strict) gene name (ABNER)
– Over 70% F-score chemical names (OSCAR3)