This document discusses how bioinformatics research can benefit from techniques in information retrieval. It provides background on bioinformatics, information retrieval, and how the fields intersect. Specifically, it describes how indexing, searching, filtering, mining and categorizing large amounts of bioinformatics data and publications can help with tasks like acquiring, analyzing, organizing and storing biological information. The document also presents several case studies of specific tools and systems that apply IR techniques in bioinformatics.
Bioinformatics Meets Information Retrieval
1. Bioinformatics Meets Information Retrieval
State of the Art and a Case Study
Eloisa Vargiu
Intelligent Agents and Soft-Computing Group
Dept. of Electrical and Electronic Engineering
University of Cagliari, Italy
February 16, 2011 – Valencia (Spain) email: vargiu@diee.unica.it
2. My Background
2000 – 2004: Automatic planning
Classic domains: HW[]
Dynamic domains: HIPE
2004 – 2009: Bioinformatics
Protein secondary structure prediction: MASSP3 and GAME/SSP
2000 – …: Multiagent systems
A Personalized Adaptive and Cooperative Multiagent System: PACMAS
A generic architecture to perform information retrieval tasks: X.MAS
2006 – …: Information Retrieval
Hierarchical text categorization: PF and TSA
Recommender systems and contextual advertising: ConCA
3. Outline
Context and Mission
Why Bioinformatics Needs Information Retrieval
Bioinformatics Meets Information Retrieval
Case Study: Retrieving and Filtering Bioinformatics Publications
Conclusions
5. Web Evolution
Web 1.0 1993
Source of information
Personal homepages
Web 2.0 2004
Social networks
(Micro)Blogging
Web 3.0 2005
Semantic web
Web composition
6. Web Evolution and Bioinformatics
A long time ago...
Data was stored in local DBs
Data was shared as flat files
Biologists worked alone or in small groups
7. Web Evolution and Bioinformatics
Today...
Online repositories
The major sources of nucleotide sequences are the members of the
International Nucleotide Sequence Database Collaboration
DDBJ (DNA DataBank of Japan)
EMBL (European Molecular Biology Laboratory)
GenBank (NIH genetic sequence database)
Web services
Basic bioinformatics services are
classified by the EBI into three categories
SSS (Sequence Search Services)
MSA (Multiple Sequence Alignment)
BSA (Biological Sequence Analysis)
8. Web Evolution and Scientific Publications
A long time ago...
Publications were consulted at the library
Just two or three relevant available journals
Manual selection of relevant publications
9. Web Evolution and Scientific Publications
Today...
Online journals
Online conference proceedings
Publications are often available for free
Manual selection of relevant publications becomes infeasible
10. As a Consequence...
Unstructured information
Information overload
Personalized information selection and input imbalance
11. Our Mission
To cope with
Unstructured information, classifying documents according to a
given taxonomy
Information overload, filtering information to reduce redundancy
Personalized information selection and input imbalance, filtering
information according to user preferences
Case study
Retrieving and filtering bioinformatics publications
12. Research Topics
Information Retrieval
Bioinformatics
13. Information Retrieval
Information Retrieval (IR) deals with the representation,
storage, organization of, and access to information items.
The user must first translate this information need into a query
which can be processed by an IR system.
Given the user query, the key goal of an IR system is to retrieve
information which might be useful or relevant to the user.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
New York: Addison-Wesley, 1999.
14. Main IR Topics
Indexing
Search and Web Search
Information Filtering
Text Mining
Text Categorization and Hierarchical Text Categorization
15. Bioinformatics
Bioinformatics is the field of science in which biology,
computer science, and information technology merge to form a
single discipline.
The ultimate goal of the field is to enable the discovery of new
biological insights as well as to create a global perspective from
which unifying principles in biology can be discerned.
National Center for Biotechnology Information (NCBI),
http://www.ncbi.nlm.nih.gov/.
16. Main Bioinformatics Research Areas
Sequence analysis
Genome annotation
Computational evolutionary biology
Analysis of gene expression
Analysis of protein expression
Analysis of mutations in cancer
Comparative genomics
Modelling biological systems
Prediction of protein structure
Molecular interaction
17. Why Bioinformatics Needs Information Retrieval
18. Does Bioinformatics Need IR?
Bioinformatics is concerned with researching, developing and
applying tools and methods to acquire, analyse, organize and
store biological and medical data
Indexing and search techniques may help in the task of acquiring data
Information filtering, text mining and text categorization
techniques may be useful for the analysis of data
Text categorization, with particular reference to hierarchical text
categorization, may be used in the organization and storage tasks
19. Bioinformatics Data
A huge amount of data to be
Indexed
Searched for in large databases or on the web
Filtered according to users' preferences
Text mined
Categorized according to its textual content
20. DB Indexing
Why
Data types are relegated to blob and unstructured text fields
Few results in building persistent access paths to support fast
retrieval methods
Genomic datasets in public repositories are annotated with free-text
fields describing the pathological state of the studied sample
Annotations are not mapped to concepts in any ontology
21. DB Indexing
Who
MoBIoS – Molecular Biological Information System
What
A specialized database management system
The storage manager is based on metric-space indexing
Query language entails biological data types
Where
Sequence homology: local alignment and mutations
D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to
Support Biological Discovery. Proceedings of the International
Conference on Scientific and Statistical Database Management
Systems, 2003.
22. DB Indexing
Who
--
What
Ontology-driven indexing of public datasets for translational
bioinformatics
Methods to map text annotations of gene expression datasets to
concepts in the UMLS
Where
Gene Expression Omnibus
Stanford Tissue Microarray Database
N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A.
Musen. Ontology-driven indexing of public datasets for translational
bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
23. Web Indexing
Why
Most often sequence retrieval tools and sequence analysis tools are
separated
The usage of sequence DBs is often general and limited to
keyword searching and entry retrieval
Discovering and accessing the appropriate bioinformatics resource
for a specific task has become increasingly important
24. Web Indexing
Who
SIRW – A Web Server for the Simple Indexing and Retrieval System
What
A WWW interface to the Simple Indexing and Retrieval (SIR)
system to parse and index flat file DBs
A framework for doing sequence analysis for selected biological
sequences
Where
Sequence analysis: motif pattern searches
C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval
System that combines sequence motif searches with keyword searches.
Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.
25. Web Indexing
Who
BIRI - BIoinformatics Resource Inventory
What
An approach for automatically discovering and indexing public
bioinformatics resources
Where
The scientific literature
G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V.
Maojo. BIRI: a new approach for automatically discovering and
indexing available public bioinformatics resources from the literature.
BMC Bioinformatics, Oct 7;10:320, 2009.
26. DB Search
Why
A wealth of bioinformatics tools and databases has been created
over the last decade and most are freely available
Often it is desired to visualize the database hits stacked according
to the query sequence
There is no inventory presenting an up-to-date and easily
searchable index of all these resources
27. DB Search
Who
MView – Multiple alignment Viewer
What
A tool for converting the result of a sequence database search into
the form of a coloured multiple alignment of hits stacked against
the query
Where
Multiple alignment
N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible
database search or multiple alignment viewer. Bioinformatics, 14(4), pp.
380-381, 1998.
28. DB Search
Who
BioWareDB
What
An extensive and current catalog of software and DBs of relevance
to researchers in the field of biology and medicine
Where
Current and available biomedical computing resources
M.W. Matthiessen. BioWareDB: the biomedical software and database
search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.
29. Web Search
Why
Today, scientists can easily post their research findings on the Web
or compare their discoveries with previous work
Manually maintaining a wrapper library will not scale to
accommodate the growth of genomics data sources on the Web
30. Web Search
Who
---
What
An automated system able to find, classify, and wrap new sources
without constant human intervention
Where
Distributed genomics data sources
D. Rocco and T. Critchlow. Automatic discovery and classification of
bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933,
2003.
31. Web Search
Who
GoPubMed
What
An ontology-based literature search applied to Gene Ontology
(GO) and PubMed
Where
Scientific literature
R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed:
ontology-based literature search applied to gene ontology and PubMed.
In Proceedings of German Bioinformatics Conference, pp. 169–178,
2004.
32. Information Filtering
Why
In the Web 2.0 scenario, users look for collaborative environments,
in which they can meet further users with similar preferences and
needs
Researchers need to search for and/or generate specialized datasets
that meet specific requirements
33. Information Filtering
Who
ProDaMa-C Protein Dataset Management – Collaborative
What
A web application aimed at
Generating specialized protein structure datasets
Favouring the collaboration among researchers
Where
Protein structures
G. Armano and A. Manconi. A Collaborative Web Application for
Supporting Researchers in the Task of Generating Protein Datasets.
Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A.
Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
34. Information Filtering
Who
Gene Recommender
What
An algorithm that ranks genes according to how strongly they
correlate with a set of query genes
Where
Analysis of gene expression
A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene
recommender algorithm to identify coexpressed genes. Genome
Research, Aug;13(8), pp. 1828-37, 2003.
35. Text Mining
Why
Web-based tools capable of filtering public DBs are increasingly
required
Interesting and useful information, relevant to the researcher, could
appear in documents (e.g., papers) they have not read and therefore
be missed entirely
Of paramount importance to DB search methods is a reliable
means of distinguishing true hits from false hits
Biologists construct a pathway by reading a large number of
articles and interpreting them as a consistent network, but the link to
the original articles is lost
36. Text Mining
Who
MedMiner
What
An Internet text mining tool that filters the literature and presents
the most relevant portions in a well-organized way that facilitates
understanding
Where
Gene expression profiling
L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N.
Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical
Information, with Application to Gene Expression Profiling.
Biotechniques, Dec;27(6), pp. 1210-4, 1999.
37. Text Mining
Who
BioRAT
What
A research assistant that, given a query,
autonomously finds a set of papers
reads them
highlights the most relevant facts in each
Where
Scientific literature
D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones.
BioRAT: Extracting biological information from full-length papers.
Bioinformatics, 20(17), pp. 3206–3213, 2004.
38. Text Mining
Who
SAWTED – Structure Assignment With Text Description
What
An automated system for filtering DB hits
Where
Annotation of homologues
R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure
assignment with text description-enhanced detection of remote
homologues with automated SWISS-PROT annotation comparisons.
Bioinformatics, Feb;16(2), pp. 125-9, 2000.
39. Text Mining
Who
PathText
What
A system to integrate a pathway visualizer, text mining systems
and annotation tools into a seamless environment
Where
Pathway visualizations
B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S.
Ananiadou, and J. Tsujii. PathText: a text mining integrator for
biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381,
2010.
40. Text Categorization
Why
Information in text form, such as MEDLINE records, is a greatly
underutilized source of biological information
Individual researchers find it difficult to keep up with all the new,
relevant information
Systems that extract structured information from natural language
passages have been highly successful in specialized domains
The time is ripe for developing such applications for molecular biology
and genomics
41. Text Categorization
Who
--
What
Constructing biological knowledge bases by extracting information
from text sources
Where
MEDLINE
M. Craven and J. Kumlien. Constructing Biological Knowledge Bases
by Extracting Information from Text Sources. In Proceedings of the 7th
International Conference on Intelligent Systems for Molecular Biology,
1999.
42. Text Categorization
Who
Genies
What
A natural-language processing system for the extraction of
molecular pathways
Where
Scientific publications
C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies:
a natural-language processing system for the extraction of molecular
pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.
43. Hierarchical Text Categorization
Why
A great deal of genomics information accumulated over the years is
available in online text repositories (such as MEDLINE)
These resources still do not provide adequate mechanisms for
retrieving the required information
Traditional filtering techniques based on keyword search are often
inadequate to express what the user is really searching for
Web repositories, such as Medical Subject Headings (MeSH) in
MEDLINE, encompass an underlying taxonomy
44. Hierarchical Text Categorization
Who
--
What
A tool for assisting biologists with literature search for the task of
associating genes with Gene Ontology codes
Where
MEDLINE
S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text
categorization as a tool of associating genes with gene ontology codes.
In 2nd European Workshop on Data Mining and Text Mining for
Bioinformatics, pp. 26–30, 2004.
45. Hierarchical Text Categorization
Who
Pub.MAS
What
A multiagent system for retrieving and classifying publications
Where
BMC Bioinformatics
PubMed Central
G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
Retrieving Bioinformatics Publications from Web Sources. IEEE
Transactions on Nanobioscience, Special Session on GRID, Web
Services, Software Agents and Ontology Applications for Life Science,
6(2), pp. 104-109, 2007.
46. Case Study: Retrieving and Filtering Bioinformatics Publications
47. An IR Task
Online Repositories
Information Extraction: Wrapping Information Sources
Extracted Data/Information
Text Categorization: Taxonomic Classification of Items
Selected Data/Information
User's Feedback: Adaptive Behavior
48. Information Extraction
Essential to retrieve documents provided by heterogeneous and
distributed sources
A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira.
A brief survey of web data extraction tools. SIGMOD Record, 31(2), pp.
84–93, 2002.
49. Text Categorization
It is the task of determining and assigning topical labels to
content
Typical approaches to text categorization
Statistical
Semantic
In recent years several researchers have investigated the use of
hierarchies for text categorization
F. Sebastiani. A tutorial on automated text categorisation. Proceedings
of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-35,
1999.
50. Users' Feedback
It is aimed at dealing with any feedback provided by the user
In semiautomated classification and adaptive filtering we may
expect the user of a classifier to provide feedback on how test
documents have been classified
In this case further training may be performed during the
operating phase
51. Hierarchical Text Categorization
Hierarchical Text Categorization (HTC) deals with problems
where categories are organized in the form of a hierarchy.
D. Koller, M. Sahami. Hierarchically classifying documents using very
few words. Proceedings of 14th International Conference on Machine
Learning, pp. 170–178, 1997.
52. HTC at a Glance
HTC studies how to improve the performances provided by
classical text categorization techniques by exploiting the
knowledge of the taxonomic relationships among classes
53. Motivations
People organize large collections of documents in hierarchies of
topics, or arrange a large body of knowledge in ontologies
The main goal of automatic text categorization is to deal with
underlying taxonomies
A hierarchical approach can give benefits in real-world
scenarios, characterized by information overload and
imbalanced data
54. HTC Approaches
Pachinko machine
At each level of the hierarchy
The classifier selects the one most probable category
It goes down the hierarchy inspecting only the children of the selected
nodes
Probabilistic hierarchical local approach
At each level of the hierarchy
The classifier makes probabilistic decisions
It selects the leaf categories on the most probable paths
S. Kiritchenko. Hierarchical text categorization and its application to
bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
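As a rough illustration of the pachinko-machine strategy listed above, the following sketch descends a toy taxonomy and keeps only the best-scoring child at each level. The taxonomy, the keyword sets, and the names pachinko_classify and keyword_scorer are hypothetical and are not taken from the cited thesis; a real system would plug in trained classifiers instead of keyword counts.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy (parent -> children); not from the slides.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
}

# Hypothetical keyword sets standing in for trained per-category classifiers.
KEYWORDS: Dict[str, set] = {
    "biology": {"gene", "protein", "genome"},
    "computing": {"index", "query", "algorithm"},
    "genomics": {"gene", "genome"},
    "proteomics": {"protein"},
    "databases": {"index", "query"},
    "algorithms": {"algorithm"},
}


def keyword_scorer(doc: str, category: str) -> float:
    """Score a document by counting tokens that match the category keywords."""
    return float(sum(1 for tok in doc.lower().split() if tok in KEYWORDS[category]))


def pachinko_classify(
    doc: str,
    scorer: Callable[[str, str], float],
    taxonomy: Dict[str, List[str]],
    root: str = "root",
) -> List[str]:
    """Greedy top-down descent: at each level keep only the best-scoring child."""
    path: List[str] = []
    node = root
    while taxonomy.get(node):  # stop when the current node has no children
        node = max(taxonomy[node], key=lambda child: scorer(doc, child))
        path.append(node)
    return path


result = pachinko_classify("protein structure and protein folding", keyword_scorer, TAXONOMY)
print(result)
```

Because only one branch is followed, an error made near the root cannot be recovered further down, which is the usual motivation for the probabilistic variant.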
55. HTC Approaches
Local classifier per node
Each classifier decides whether to forward the document to its children
Local classifier per parent node
Each classifier decides to which subtree(s) the document should be
sent
Local classifier per level
The number of outputs per level grows while going down through
the taxonomy
Global classifier
One classifier is trained, able to discriminate among all categories
C.J. Silla and A. Freitas. A survey on hierarchical classification across
different application domains. Journal of Data Mining and Knowledge
Discovery, 2(1-2), pp. 31-72, 2010.
56. Progressive Filtering
Progressive Filtering (PF) is a simple categorization technique
that operates on hierarchically structured categories
A way to implement PF consists of decomposing a given rooted
taxonomy into pipelines, one for each path that exists between
the root and each node of the taxonomy
Each node is a binary classifier able to recognize whether or not
an input belongs to the corresponding class
A threshold selection algorithm (TSA) can be run to identify an
optimal, or sub-optimal, combination of thresholds for each
pipeline
A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to
Perform Hierarchical Text Categorization in Presence of Input
Imbalance. Proceedings of International Conference on Knowledge
Discovery and Information Retrieval, pp. 14-23, 2010.
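The decomposition into pipelines can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the root is a dummy node without a classifier, and the function names pipelines and accepts, the toy taxonomy, and the score/threshold values are all hypothetical.

```python
from typing import Dict, List


def pipelines(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return one pipeline (path from below the root to a node) per non-root node."""
    result: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        for child in taxonomy.get(node, []):
            child_path = path + [child]
            result.append(child_path)
            walk(child, child_path)

    walk(root, [])
    return result


def accepts(pipeline: List[str], scores: Dict[str, float], thresholds: List[float]) -> bool:
    """A document passes a pipeline only if every binary classifier in it fires.

    thresholds[i] is the threshold used by the i-th classifier of this pipeline,
    so the same classifier may use different thresholds in different pipelines.
    """
    return all(scores[name] >= th for name, th in zip(pipeline, thresholds))


# Hypothetical taxonomy and classifier scores for a single document.
toy_taxonomy = {"root": ["A", "B"], "A": ["A1", "A2"]}
toy_scores = {"A": 0.8, "A1": 0.4, "A2": 0.1, "B": 0.2}

print(pipelines(toy_taxonomy, "root"))
print(accepts(["A", "A1"], toy_scores, [0.5, 0.3]))
```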
57. PF at a Glance
Starting from the root, each input traverses the taxonomy as a
“token”
58. Classifiers in PF
Partitioning the taxonomy in pipelines gives rise to a set of new
classifiers, each represented by a pipeline
60. Classifiers in PF
The same classifier may have different behaviours, depending on
the pipeline in which it is embedded
Each pipeline can be considered in isolation from the others
61. Threshold Selection in PF
A relevant problem is how to calibrate the thresholds of the
binary classifiers embedded in each pipeline in order to
optimize the pipeline behaviour
Searching for an optimal or sub-optimal combination of
thresholds in a pipeline can be viewed as the problem of
finding a maximum of a utility function F that depends on the
corresponding threshold vector θ
62. TSA
For each pipeline the best combination of thresholds is
calculated according to a bottom-up algorithm that uses two
functions
Repair, which increases/decreases (↑ / ↓) the threshold until the
utility function reaches a maximum
Calibrate, which recursively operates downward from the given
classifier by repeatedly calling repair (↑ / ↓)
A. Addis, G. Armano, E. Vargiu. A comparative experimental
assessment of a threshold selection algorithm in hierarchical text
categorization. In: Advances in Information Retrieval. The 33rd
European Conference on Information Retrieval (ECIR 2011), 2011
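The repair/calibrate idea can be sketched as a greedy coordinate search over the threshold vector of one pipeline. This is a much simplified reading of the description above and should not be taken as the published TSA: the step size, the starting point, the [0, 1] threshold range, and the toy utility function are all assumptions made for illustration.

```python
from typing import Callable, List

Utility = Callable[[List[float]], float]


def repair(theta: List[float], k: int, direction: int, utility: Utility,
           step: float = 0.05) -> List[float]:
    """Move theta[k] up (+1) or down (-1) while the utility strictly improves."""
    best = utility(theta)
    while True:
        candidate = theta[k] + direction * step
        if not 0.0 <= candidate <= 1.0:  # assumed threshold range
            break
        trial = theta[:k] + [candidate] + theta[k + 1:]
        value = utility(trial)
        if value <= best:
            break
        theta, best = trial, value
    return theta


def calibrate(theta: List[float], k: int, utility: Utility,
              step: float = 0.05) -> List[float]:
    """Tune classifier k, then operate downward on the classifiers below it."""
    theta = repair(theta, k, +1, utility, step)
    theta = repair(theta, k, -1, utility, step)
    if k + 1 < len(theta):
        theta = calibrate(theta, k + 1, utility, step)
    return theta


def tsa(n: int, utility: Utility, start: float = 0.0, step: float = 0.05) -> List[float]:
    """Bottom-up pass: begin at the deepest classifier and move toward the root."""
    theta = [start] * n
    for k in reversed(range(n)):
        theta = calibrate(theta, k, utility, step)
    return theta


# Hypothetical utility with a known maximum at thresholds (0.5, 0.3).
targets = [0.5, 0.3]


def toy_utility(theta: List[float]) -> float:
    return -sum((t - g) ** 2 for t, g in zip(theta, targets))


best = tsa(2, toy_utility)
print(best)
```

In practice F would be a performance measure (for example F1 on a validation set) evaluated on the pipeline's output rather than a closed-form function.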
64. The Prototype
MultiAgent Architecture
X.MAS
Agent Framework
JADE
A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent
Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6,
Sixth International Workshop, From Agent Theory to Agent
Implementation, pp. 3–9, 2008.
F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent
Systems with JADE (Wiley Series in Agent Technology). John Wiley
and Sons, 2007.
65. X.MAS at a Glance
Macro-architecture
66. X.MAS at a Glance
Micro-architecture
Information Agent: Scheduler, Source
Middle Agent: Scheduler, Dispatcher
Filter Agent: Scheduler, Actuator
Middle Agent: Scheduler, Dispatcher
Task Agent: Scheduler, Actuator
Middle Agent: Scheduler, Dispatcher
Interface Agent: Scheduler
68. Pub.MAS
G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
Retrieving Bioinformatics Publications from Web Sources. IEEE
Transactions on Nanobioscience, Special Session on GRID, Web
Services, Software Agents and Ontology Applications for Life Science,
6(2), pp. 104-109, 2007.
69. Information Extraction
It is supported by a set of agents explicitly devoted to
wrapping the selected information sources
encoding the extracted documents
An information agent wraps the BMC Bioinformatics web site
HTML wrapper
An information agent wraps the PubMed Central digital archive
Web service wrapper
70. Hierarchical Text Categorization
The PF approach previously described has been implemented
Each document is encoded to
remove all non-informative words
remove the most common morphological and inflexional suffixes
select the relevant features
generate a feature vector for each document
Classification is performed by wkNN classifiers
the score is assigned using non-parametric density estimation of the
“a posteriori” probability
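The encoding and scoring steps above might look roughly like the sketch below. It is only an illustration under several assumptions: the stop-word list and suffix stripper are crude stand-ins for a real stop list and stemmer, feature selection is omitted, and the similarity-weighted vote is one common way to weight kNN neighbours, not necessarily the weighting used in the prototype. All names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer


def encode(text: str) -> Counter:
    """Drop stop words, strip common suffixes, and build a term-count vector."""
    terms: List[str] = []
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if not token or token in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return Counter(terms)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def wknn_score(doc: Counter, training: List[Tuple[Counter, bool]], k: int = 3) -> float:
    """Similarity-weighted share of positive examples among the k nearest neighbours."""
    neighbours = sorted(
        ((cosine(doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0.0
    return sum(sim for sim, label in neighbours if label) / total


# Hypothetical training set for one binary classifier (e.g. a "protein" category).
training = [
    (encode("protein structure prediction"), True),
    (encode("protein folding"), True),
    (encode("database indexing queries"), False),
]
print(wknn_score(encode("predicting protein structures"), training))
```

The returned value lies in [0, 1] and can be compared against the pipeline threshold of the corresponding classifier.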
71. The Adopted Taxonomy
P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A.
Brass. An ontology for bioinformatics applications, Bioinformatics,
15(6), pp. 510–520, 1999.
74. Users' Feedback
The system handles any feedback provided
by the user
Two solutions have been tried
training an ANN
using a kNN classifier
75. Experiments
Different kinds of tests have been performed, each aimed at
highlighting a specific issue
we estimated the (normalized) confusion matrix for each classifier
belonging to the highest level of the taxonomy
we studied the impact of taking into account pipelines of
classifiers, also trying to assess whether a residual independence
was in fact present
we assessed the solution devised for implementing user’s feedback,
based on the k-NN technique
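For the first kind of test, a normalized confusion matrix for a single binary classifier can be computed as in the sketch below. Row normalization (each row of true labels sums to 1) is an assumption, since the slides do not say how the matrix was normalized; the function name and the sample labels are hypothetical.

```python
from typing import Dict, Iterable


def normalized_confusion(actual: Iterable[bool],
                         predicted: Iterable[bool]) -> Dict[bool, Dict[bool, float]]:
    """Return matrix[true_label][predicted_label] as a fraction of each true class."""
    counts = {True: {True: 0, False: 0}, False: {True: 0, False: 0}}
    for a, p in zip(actual, predicted):
        counts[bool(a)][bool(p)] += 1
    matrix: Dict[bool, Dict[bool, float]] = {}
    for label, row in counts.items():
        total = sum(row.values())
        matrix[label] = {p: (n / total if total else 0.0) for p, n in row.items()}
    return matrix


# Hypothetical expert labels and classifier outputs for five test documents.
actual = [True, True, True, False, False]
predicted = [True, True, False, False, True]
matrix = normalized_confusion(actual, predicted)
print(matrix)
```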
76. Experiments
Tests have been performed using selected publications extracted
from the BMC Bioinformatics site and from the PubMed Central
digital archive
Publications have been classified by an expert of the domain
according to the proposed taxonomy
For each item of the taxonomy, a set of about 100-150 articles
has been selected to train the corresponding wk-NN classifier,
and 300-400 articles have been used to test it
78. Conclusions
Bioinformatics needs suitable, automated, and “intelligent”
solutions to acquire, analyse, organize, and store biological data
IR can be very useful in tackling bioinformatics problems
Currently, only a few IR techniques have been adopted to solve
bioinformatics tasks
A system aimed at retrieving and filtering bioinformatics
publications has been presented as a case study
We argue that further investigations and experiments could be
made to exploit IR in bioinformatics
79. Acknowledgments
This work was partially supported by the Italian Ministry of
Education – Investment funds for basic research, under the
project ITALBIONET – Italian Network of Bioinformatics
I wish to thank all the IASC Group members for their valuable
help
IASC Group members are:
G. Armano – head
A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc
A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students
S. Curatti – collaborator, programmer
I also wish to thank Andrea Manconi for his suggestions
80. Thanks for your attention!
Contact: Eloisa Vargiu vargiu@diee.unica.it