Extraction and Representation of in silico Biological Methods from the Literature
Geraint Duck
Supervisors: Robert Stevens, Goran Nenadic and David Robertson
Advisor: Joshua Knowles
School of Computer Science, University of Manchester

Importance of Method in Science
• Understanding
  – Key part of research, central to science
  – Reproducibility and replication
  – What? Why? Where? How? When?
  – Extension
• Advise/evaluate
  – “Current Approach”
  – “Best Practice”

Background
• In silico: performed on a computer, or through computer simulation
• Bioinformatics is a resource-focused domain
  – Numerous resources appearing
  – Literature is growing rapidly
• Resource availability and usage is central to biological research
• Current attempts are often manually curated and/or incomplete

The Method to Obtain a Method
1. Extraction
  – Automatically extract resource and task mentions from the bioinformatics literature
    • This presentation focuses on this step
2. Representation and Analysis
  – Evaluate the extracted mentions for patterns of representation
3. Exploration
  – Provide a means of exploring the methods extracted to aid other research/researchers

Key Hypothesis: Resource ordering implies method
• An analogy – baking a cake:
  – Ingredients: butter, eggs, flour, sugar, etc…
  – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30 mins…

Example: Lagerström et al. (2006)
… all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used…
… divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 …
The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package …
[excerpt removed]
… branch lengths were estimated in TreePuzzle using the following parameters …
… constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, …
All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.

Example: Lagerström et al. (2006)
Key: Resource, task
• GenBank
• BLAT, aligned
• BLAST, searched
• ClustalW, aligned
• SEQBOOT (Phylip), bootstrapped
• TreePuzzle, estimated
• ClustalW, aligned
• infoalign (EMBOSS), scored
• MiniTab, statistics
• MS Excel, graphs plotted
• MiniTab, graphs plotted
Stage labels in the diagram: Sequence Alignment; Tree Construction; Sequence and Tree Analysis; Result Visualisation

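The resource and task pairs from this example can be held as an ordered sequence, which is the form the key hypothesis relies on. The sketch below is illustrative only: the `Step` type and the `resources_in_order` helper are hypothetical names, not part of bioNerDS, and the pairs are transcribed by hand from the slide.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    """One extracted mention: a resource and the task it was used for."""
    resource: str
    task: Optional[str] = None      # None when the slide gives no task
    package: Optional[str] = None   # parent package, if the slide names one


# Pairs transcribed by hand from the Lagerström et al. (2006) example,
# in the order they are mentioned in the text.
METHOD = [
    Step("GenBank"),
    Step("BLAT", "aligned"),
    Step("BLAST", "searched"),
    Step("ClustalW", "aligned"),
    Step("SEQBOOT", "bootstrapped", package="Phylip"),
    Step("TreePuzzle", "estimated"),
    Step("ClustalW", "aligned"),
    Step("infoalign", "scored", package="EMBOSS"),
    Step("MiniTab", "statistics"),
    Step("MS Excel", "graphs plotted"),
    Step("MiniTab", "graphs plotted"),
]


def resources_in_order(steps):
    """Return each resource once, keeping the order of first mention."""
    seen = []
    for step in steps:
        if step.resource not in seen:
            seen.append(step.resource)
    return seen


if __name__ == "__main__":
    print(" -> ".join(resources_in_order(METHOD)))
```

Keeping the list in mention order, rather than as a set, preserves the ordering that the hypothesis treats as a signal of method.
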
Example…
• Multiple methods
  – Usage counts
  – Recentness of use
  – “best-practice”

Challenges - Ambiguity
• leg
• white
• cab
• HIV
  – Human immunodeficiency virus
  – Human immunovirus
• analysis
• Network
• graph
• DIP
  – distal interphalangeal
  – Database of Interacting Proteins

Challenges - Variability
• Orthographics
  – Swiss Prot
  – SWISS-PROT
  – SwissProt
• Misspellings and typos
  – One paper, same resource, spelt 3 different ways
• Abbreviations
  – Different authors can use different acronyms for the same thing

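One simple way to handle the orthographic variants listed above is to compare names on a normalised key. This is a minimal sketch under that assumption; the `normalise` function is a hypothetical helper, not the approach bioNerDS is stated to use, and it does nothing for misspellings or differing acronyms.

```python
import re


def normalise(name: str) -> str:
    """Collapse case, spaces, hyphens and underscores so variants compare equal."""
    return re.sub(r"[\s\-_]+", "", name).lower()


variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]
keys = {normalise(v) for v in variants}
print(keys)  # all three variants are expected to share one key
```

Typos and abbreviations would need something beyond this, for example approximate string matching or an acronym table.
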
Name Composition
• Majority are single nouns
  – includes acronyms
• 6% lowercase common nouns
  – affy, bioconductor
• A few contained numbers
  – S4, t2prhd
• A few misclassified as verbs
  – …each query protein is first BLASTed with…
  – …held near their equilibrium values using SHAKE.
  – …graphical representations were achieved using dot v1.10…

Name Composition
• Longest Names (most tokens)
  – Corpus: 5 tokens (Gene Expression Profile Analysis Suite)
  – Dictionary: 12 tokens (Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences)
• Evaluated token frequencies within our dictionary
  – Long-tail curve
  – 87% used only once

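A token-frequency count of the kind described above can be sketched as follows. The `names` list is a hypothetical stand-in for the dictionary, which is not reproduced in these slides, so the figures it prints are not the talk's results.

```python
from collections import Counter

# Hypothetical stand-in for a resource-name dictionary; the real dictionary
# used in the talk is not reproduced here.
names = [
    "Gene Expression Profile Analysis Suite",
    "Gene Ontology",
    "Protein Data Bank",
    "Database of Interacting Proteins",
    "BLAST",
]

# Count how often each lowercased token occurs across all names.
token_counts = Counter(tok.lower() for name in names for tok in name.split())

# Tokens that occur exactly once form the "long tail".
singletons = [tok for tok, n in token_counts.items() if n == 1]
share_once = len(singletons) / len(token_counts)

print(token_counts.most_common(3))
print(f"{share_once:.0%} of distinct tokens occur only once")
```
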
[Figure: token frequency within the dictionary (database and software names). Tokens shown: Genome, Protein, Gene, Sequence, Human. Axes: token frequency; names.]

Named Entity Recognition (NER)
• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)

bioNerDS
• Automatically matches database and software names in the literature
  – Uses dictionary, rules and clues
• F-scores between 63 and 91%
  – Mixed results depending on corpus
  – Issues of multiple mentions of a single resource in one paper
  – Ambiguity and variability…
http://bionerds.sourceforge.net/

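The dictionary part of such a matcher can be illustrated with a small longest-match lookup. This is a sketch under stated assumptions, not bioNerDS itself: `DICTIONARY`, `MAX_TOKENS`, `normalise` and `find_mentions` are hypothetical names, and the rules and clues mentioned above are not modelled.

```python
import re

# Hypothetical mini-dictionary mapping a normalised key to a canonical name.
DICTIONARY = {
    "blast": "BLAST",
    "clustalw": "ClustalW",
    "genbank": "GenBank",
    "swissprot": "Swiss-Prot",
}
MAX_TOKENS = 2  # longest name, in tokens, that the matcher will try


def normalise(text):
    """Lowercase and drop spaces, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", text).lower()


def find_mentions(sentence):
    """Return (surface form, canonical name) pairs, preferring longer matches."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    mentions, i = [], 0
    while i < len(tokens):
        for width in range(MAX_TOKENS, 0, -1):
            surface = " ".join(tokens[i:i + width])
            canonical = DICTIONARY.get(normalise(surface))
            if canonical:
                mentions.append((surface, canonical))
                i += width
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return mentions


print(find_mentions("Hits from Swiss Prot were aligned using ClustalW 1.82."))
```

A pure dictionary lookup like this would also match ambiguous names such as "analysis" or "graph" wherever they occur, which is presumably where the rules and clues come in.
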
System Overview
[Figure: bioNerDS pipeline. Recoverable component labels: Stanford parser; libstemmer; bioNerDS dictionary; dictionary look-up; find unknown mentions; filter likely FPs; combine the scores; second pass with mentions above the threshold.]

Preliminary Analysis of Resource Usage
• Used bioNerDS to extract name mentions from two journals:
  – Genome Biology
  – BMC Bioinformatics
• Analysed differences

bioNerDS: Results
• Over 36,000 mentions in BMC Bioinformatics
• Over 15,000 mentions in Genome Biology
• 78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention
• The most-mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB
• The general trend across both journals has most major resources declining in usage

Relative Usage within the Top 50
[Figure: relative usage per year, 2001 to 2011, in Genome Biology and BMC Bioinformatics, for BLAST, Bioconductor, ClustalW, Ensembl, GenBank, Gene Ontology, R and Swiss-Prot.]

bioNerDS: Full PMC Set
• Run on full open-access PMC set
  – ~230,000 full-text articles
  – ~1000 different journals
  – Extracted ~1.8M mentions
• Method?
• Method fingerprints
• Trying to extract (data-mine):
  – Ordering
  – Patterns
  – Co-occurrence
  – Relationships
  – Association rules
  – Frequent subsets
  – “Networks”

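Co-occurrence and frequent-subset mining can be illustrated at its simplest by counting how many papers mention each pair of resources. The sketch below uses made-up per-paper sets; `papers` and `min_support` are hypothetical, and it covers only pairs rather than full association-rule mining.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-paper resource sets; real input would come from the
# extracted mentions.
papers = [
    {"BLAST", "ClustalW", "GenBank"},
    {"BLAST", "GenBank"},
    {"R", "Bioconductor"},
    {"BLAST", "ClustalW"},
]

pair_counts = Counter()
for resources in papers:
    # sorted() gives each unordered pair one canonical spelling
    pair_counts.update(combinations(sorted(resources), 2))

min_support = 2  # keep pairs seen in at least this many papers
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)
```

Using sets per paper sidesteps the issue of multiple mentions of one resource in a single paper, at the cost of discarding mention order.
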
Method Analysis and Exploration
• Mining “best-practice”: Metrics
  – Most common
  – Newest
  – Who uses it
  – What resources it comprises
• Challenges
  – Scientific discourse – provenance information
  – Mention order does not imply order of use
• Clustering and associations
• Fingerprints

Conclusion
• Literature mining bioinformatics in silico methods
• Developed bioNerDS: automated resource name extraction
• Extracting and analysing patterns of resource usage
  – Full PMC corpus
• Provided a way to extract method for any resource-based domain
  – Applied this to bioinformatics

Thank-you
• Acknowledgements
  – Supervisors:
    • Robert Stevens
    • Goran Nenadic
    • David Robertson
  – Funding:

Resource Mentions per Journal
Journal                                                      Total Articles  Total Mentions  Ratio (mentions per article)
Nucleic Acids Research                                                7,192         200,339  27.8558
PLoS One                                                             15,791         168,624  10.6785
BMC Bioinformatics                                                    3,982         149,668  37.5861
BMC Genomics                                                          3,203          90,396  28.2223
Genome Biology                                                        2,321          48,976  21.1012
Acta Crystallographica. Section E, Structure Reports Online          11,834          41,383  3.497
BMC Evolutionary Biology                                              1,570          31,222  19.8866
PLoS Computational Biology                                            1,613          30,185  18.7136
PLoS Genetics                                                         1,876          29,734  15.8497
PLoS Pathogens                                                        1,691          20,661  12.2182

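The ratio column is consistent with total mentions divided by total articles. A short check on three rows, with the counts copied from the table and the variable names chosen here for illustration:

```python
# Counts copied from three rows of the "Resource Mentions per Journal" table.
rows = [
    ("Nucleic Acids Research", 7192, 200339),
    ("PLoS One", 15791, 168624),
    ("BMC Bioinformatics", 3982, 149668),
]

# Mean number of resource mentions per article, keyed by journal.
ratios = {journal: mentions / articles for journal, articles, mentions in rows}

for journal, ratio in ratios.items():
    print(f"{journal:<25} {ratio:.4f}")
```
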
Named Entity Recognition (NER)
• Evaluating NER
  – True positives, false positives, false negatives
    • tp: Correct
    • fp: Returned incorrect
    • fn: Missed
  – Precision: tp / (tp + fp)
    • How accurate are the results we obtained
  – Recall: tp / (tp + fn)
    • How many of the total correct results did we obtain
  – F-score: 2 × P × R / (P + R)

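The three measures can be computed directly from the counts. This is a minimal sketch; the function name and the example counts are hypothetical and chosen only to illustrate the arithmetic, and returning 0.0 for empty denominators is a convention assumed here, not something the slides specify.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```
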

CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 

University of Manchester Symposium 2012: Extraction and Representation of in silico Biological Methods from the Literature

  • 1. Extraction and Representation of in silico Biological Methods from the Literature Geraint Duck Supervisors: Robert Stevens, Goran Nenadic and David Robertson Advisor: Joshua Knowles School of Computer Science, University of Manchester
  • 2. Importance of Method in Science • Understanding – Key part of research, central to science – Reproducibility and replication – What? Why? Where? How? When? – Extension • Advise/evaluate – “Current Approach” – “Best Practice”
  • 3. Background • In silico: performed on a computer, or through computer simulation • Bioinformatics is a resource-focused domain – Numerous resources appearing – Literature is growing rapidly • Resource availability and usage is central to biological research • Current attempts often manually curated and/or incomplete
  • 4. The Method to Obtain a Method 1. Extraction – Automatically extract resource and task mentions from the bioinformatics literature • This presentation focuses on this step 2. Representation and Analysis – Evaluate the extracted mentions for patterns of representation 3. Exploration – Provide a means of exploring the methods extracted to aid other research/researchers
  • 5. Key Hypothesis: Resource ordering implies method • An analogy – baking a cake: – Ingredients: butter, eggs, flour, sugar, etc… – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30mins…
  • 6. Key Hypothesis: Resource ordering implies method • An analogy – baking a cake: – Ingredients: butter, eggs, flour, sugar, etc… – Recipe/method: Set oven to 180°C, mix in a bowl the butter and sugar… Divide between tins, cook in oven for 30mins… Key: Resource; Task
  • 7. Example: Lagerström et al. (2006) … all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab.
  • 8. Example: Lagerström et al. (2006) … all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters … … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab. Key: Resource; Task; Potential Challenge
  • 9. Example: Lagerström et al. (2006) … all sequences were aligned … using … BLAT 3.0 … in which case the GenBank sequence was used… … divided … by BLAST searches … were combined into a FASTA file and aligned using … ClustalW 1.82 … The alignment was bootstrapped … using SEQBOOT from the … Phylip 3.6 package … [excerpt removed] … branch lengths were estimated in TreePuzzle using the following parameters. … constructed and scored automatically using a bash-script that utilized ClustalW as alignment engine and infoalign from the EMBOSS 2.8.0 package for scoring, … All statistical analysis was performed using MiniTab. Graphs were plotted using Microsoft Excel and MiniTab. Key: Resource; Task; Potential Challenge
  • 10. Example: Lagerström et al. (2006) Key: Resource; Task • Resource and task pairs, in order of mention: GenBank – BLAT, aligned – BLAST, searched – ClustalW, aligned – SEQBOOT (Phylip), bootstrapped – TreePuzzle, estimated – ClustalW, aligned – infoalign (EMBOSS), scored – MiniTab, statistics – MS Excel, graphs plotted – MiniTab, graphs plotted • Stage labels: Sequence Alignment; Tree Construction; Sequence and Tree Analysis; Result Visualisation
  • 11. Example… • Multiple methods – Usage counts – Recentness of use – “best-practice”
  • 12. Challenges - Ambiguity • leg • white • cab • HIV – Human immunodeficiency virus – Human immunovirus • analysis • Network • graph • DIP – distal interphalangeal – Database of Interacting Proteins
  • 13. Challenges - Variability • Orthographics – Swiss Prot – SWISS-PROT – SwissProt • Misspellings and typos – One paper, same resource, spelt 3 different ways • Abbreviations – Different authors can use different acronyms for the same thing
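
One simple way to handle the orthographic variants listed on slide 13 is to compare names through a normalised key. The following is a minimal sketch under that assumption; `normalise_name` is a hypothetical helper, not part of bioNerDS, and it does not address misspellings or differing acronyms.

```python
import re


def normalise_name(name: str) -> str:
    """Drop whitespace, hyphens and underscores, then lowercase (hypothetical helper)."""
    return re.sub(r"[\s\-_]+", "", name).lower()


# The three spellings shown on the slide.
variants = ["Swiss Prot", "SWISS-PROT", "SwissProt"]

# All three variants collapse to a single key.
keys = {normalise_name(v) for v in variants}
print(keys)
```

A key of this kind trades precision for recall: it merges spelling variants, but it can also merge distinct names that differ only in case or punctuation.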
  • 14. Name Composition • Majority are single nouns – includes acronyms • 6% lowercase common nouns – affy, bioconductor • A few contained numbers – S4, t2prhd • A few misclassified as verbs – …each query protein is first BLASTed with… – …held near their equilibrium values using SHAKE. – …graphical representations were achieved using dot v1.10…
  • 15. Name Composition • Longest Names (most tokens) – Corpus: 5 – Gene Expression Profile Analysis Suite – Dictionary: 12 – Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences • Evaluated token frequencies within our dictionary – Long-tail curve – 87% used only once
  • 16. [Figure: bar chart of token frequency within the dictionary of database and software names; top tokens shown include genome, protein, gene, sequence and human; axes: Token Frequency and Top Ten Tokens (Words)]
  • 17. Named Entity Recognition (NER) • Variety of NER uses – Species – Gene/protein names – Chemical names • Variety of NER accuracy – 95% F-score species (LINNAEUS) – 73% F-score (strict) gene name (ABNER) – Over 70% F-score chemical names (OSCAR3)
  • 18. bioNerDS • Automatically matches database and software names in the literature – Uses dictionary, rules and clues • F-scores between 63 and 91% – Mixed results depending on corpus – Issues of multiple mentions of a single resource in one paper – Ambiguity and variability… http://bionerds.sourceforge.net/
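
The dictionary part of the matching described on slide 18 can be illustrated with a toy example. This is only a sketch of plain dictionary lookup, not bioNerDS's actual implementation: the four-entry `DICTIONARY` and the function `find_mentions` are assumptions made for illustration, and the rules and clues the tool also uses are not modelled.

```python
import re

# Hypothetical toy dictionary; the real bioNerDS dictionary is not shown in the slides.
DICTIONARY = {"BLAST", "ClustalW", "GenBank", "EMBOSS"}


def find_mentions(text, dictionary=DICTIONARY):
    """Return (offset, name) pairs for exact, case-sensitive whole-word matches."""
    mentions = []
    for name in dictionary:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            mentions.append((match.start(), name))
    # Sort by offset so that mention order in the text is preserved.
    return sorted(mentions)


sentence = "Sequences were divided by BLAST searches and aligned using ClustalW 1.82."
mentions = find_mentions(sentence)
print([name for _, name in mentions])
```

Exact matching like this misses the variants and ambiguous names discussed on slides 12 and 13, which is presumably why a dictionary alone is not sufficient.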
  • 19. System Overview [Figure: system overview diagram]
  • 20. Preliminary Analysis of Resource Usage • Used bioNerDS to extract name mentions from two journals: – Genome Biology – BMC Bioinformatics • Analysed differences
  • 21. bioNerDS: Results • Over 36,000 mentions in BMC Bioinformatics • Over 15,000 mentions in Genome Biology • 78% of Genome Biology and 98% of BMC Bioinformatics papers contained at least one resource mention • The top 5 mentioned resources were: R, BLAST, GO, GenBank, GEO and PDB • The general trend across both journals is for most major resources to decline in usage
  • 22. Relative Usage within the Top 50 [Figure: two panels, Genome Biology and BMC Bioinformatics; x-axis: years 2001 to 2011; series: BLAST, Bioconductor, ClustalW, Ensembl, GenBank, Gene Ontology, R, Swiss-Prot]
  • 23. bioNerDS: Full PMC Set • Run on full open-access PMC set – ~230,000 full-text articles – ~1000 different journals – Extracted ~1.8M mentions • Method? • Method fingerprints • Trying to extract (data-mine): – Ordering – Patterns – Co-occurrence – Relationships – Association rules – Frequent subsets – “Networks”
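
Of the mining targets listed on slide 23, co-occurrence is the simplest to sketch: count how often two resources are mentioned in the same paper. The per-paper mention lists below are made-up illustrative data, and `cooccurrence_counts` is a hypothetical name, not a function from the project.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-paper resource mentions (illustrative data only).
papers = [
    ["BLAST", "ClustalW", "Phylip"],
    ["BLAST", "GenBank"],
    ["BLAST", "ClustalW"],
]


def cooccurrence_counts(papers):
    """Count, for each unordered pair of resources, the papers mentioning both."""
    counts = Counter()
    for mentions in papers:
        # De-duplicate within a paper and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(mentions)), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(papers)
print(counts.most_common(3))
```

Counts like these discard mention order, so they would need to be combined with ordering information before they could say anything about the sequence of steps in a method.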
  • 24. Method Analysis and Exploration • Mining “best-practice”: Metrics – Most common – Newest – Who uses it – What resources is it comprised of • Challenges – Scientific discourse – provenance information – Mention order does not imply order of use • Clustering and associations • Fingerprints
  • 25. Conclusion • Literature mining bioinformatics in silico methods • Developed bioNerDS: automated resource name extraction • Extracting and analysing patterns of resource usage – Full PMC corpus • Provided a way to extract method for any resource-based domain – Applied this to bioinformatics
  • 26. Thank-you • Acknowledgements – Supervisors: • Robert Stevens • Goran Nenadic • David Robertson – Funding:
  • 27. Resource Mentions per Journal
      Journal                                                       Total Articles   Total Mentions   Ratio
      Nucleic Acids Research                                        7,192            200,339          27.8558
      PLoS One                                                      15,791           168,624          10.6785
      BMC Bioinformatics                                            3,982            149,668          37.5861
      BMC Genomics                                                  3,203            90,396           28.2223
      Genome Biology                                                2,321            48,976           21.1012
      Acta Crystallographica. Section E, Structure Reports Online   11,834           41,383           3.497
      BMC Evolutionary Biology                                      1,570            31,222           19.8866
      PLoS Computational Biology                                    1,613            30,185           18.7136
      PLoS Genetics                                                 1,876            29,734           15.8497
      PLoS Pathology                                                1,691            20,661           12.2182
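
The Ratio column on slide 27 is total mentions divided by total articles. A short check on three rows copied from that table, with `mentions_per_article` as a hypothetical helper name:

```python
# (journal, total articles, total mentions), copied from the slide's table.
rows = [
    ("Nucleic Acids Research", 7192, 200339),
    ("BMC Bioinformatics", 3982, 149668),
    ("Acta Crystallographica. Section E, Structure Reports Online", 11834, 41383),
]


def mentions_per_article(articles: int, mentions: int) -> float:
    """Average number of resource mentions per article."""
    return mentions / articles


ratios = {name: mentions_per_article(a, m) for name, a, m in rows}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```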
  • 28. Named Entity Recognition (NER) • Variety of NER uses – Species – Gene/protein names – Chemical names • Evaluating NER – True positives, false positives, false negatives – Precision: – Recall: – F-score:
  • 29. Named Entity Recognition (NER) • Evaluating NER – True positives, false positives, false negatives • tp: Correct • fp: Returned incorrect • fn: Missed – Precision: tp / ( tp + fp ) • How accurate are the results we obtained – Recall: tp / ( tp + fn ) • How many of the total correct results did we obtain – F-score: 2 x P x R / ( P + R )
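
The three formulas on slide 29 translate directly into code. A minimal sketch follows; the counts in the example call are invented for illustration and are not results from the presentation, and the zero-division guards are an added assumption.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Compute precision, recall and F-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # tp / (tp + fp)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # tp / (tp + fn)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # 2 x P x R / (P + R)
    return precision, recall, f_score


# Invented counts: 80 correct mentions, 20 incorrect ones returned, 10 missed.
p, r, f = precision_recall_f(tp=80, fp=20, fn=10)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```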
  • 30. Named Entity Recognition (NER) • Evaluating NER – True positives, false positives, false negatives – Precision: tp / ( tp + fp ) – Recall: tp / ( tp + fn ) – F-score: 2 x P x R / ( P + R ) • Variety of NER accuracy – 95% F-score species (LINNAEUS) – 73% F-score (strict) gene name (ABNER) – Over 70% F-score chemical names (OSCAR3)