Bioinformatics Meets
               Information Retrieval
   State of the Art and a Case Study
                                              Eloisa Vargiu



                           Intelligent Agents and Soft-Computing Group
                          Dept. of Electrical and Electronic Engineering
                                       University of Cagliari, Italy
February 16, 2011 – Valencia (Spain)   email: vargiu@diee.unica.it
My Background

  2000 – 2004                                 2004 – 2009
     Automatic planning                          Bioinformatics
             Classic domains: HW[]                  Protein secondary structure
             Dynamic domains: HIPE                   prediction: MASSP3 and
                                                      GAME/SSP
  2000 - …
                                               2006 - …
     Multiage s te
              nt ys ms
                                                  Information Retrieval
             A Personalized Adaptive and
              Cooperative Multiagent                 Hierarchical text
              System: PACMAS                          categorization: PF and TSA
             A generic architecture to              Recommender systems and
              perform information retrieval           contextual advertising: ConCA
              tasks: X.MAS


Outline

    Context and Mission
    Why Bioinformatics Needs Information Retrieval
    Bioinformatics Meets Information Retrieval
    Case Study: Retrieving and Filtering Bioinformatics Publications
    Conclusions




Context and Mission




Web Evolution

  Web 1.0                             1993

    Source of information
    Personal homepages
  Web 2.0                             2004
    Social networks
    (Micro)Blogging
  Web 3.0                             2005

    Semantic web
    Web composition




Web Evolution and Bioinformatics

  A long time ago...
     Data was stored in local DBs
     Data was shared as flat files
     Biologists worked alone or in small groups




Web Evolution and Bioinformatics

  Today...
     Online repositories
             The major sources of nucleotide sequence data are the members of the
              International Nucleotide Sequence Database Collaboration
                DDBJ (DNA DataBank of Japan)

                EMBL (European Molecular Biology Laboratory)

                GenBank (NIH genetic sequence database)

       Web services
         Basic bioinformatics services are
          classified by the EBI into three categories
            SSS (Sequence Search Services)

            MSA (Multiple Sequence Alignment)

            BSA (Biological Sequence Analysis)



Web Evolution and Scientific
Publications
  A long time ago...
     Publications were consulted at the library
     Only two or three relevant journals were available
     Manual selection of relevant publications




Web Evolution and Scientific
Publications
  Today...
     Online journals
     Online conference proceedings
     Publications are often available for free
     Manual selection of relevant publications
      becomes unfeasible




As a Consequence...

  Unstructured information
  Information overload
  Personalized information selection and input imbalance




Our Mission

  To cope with
     Unstructured information, classifying documents according to a
      given taxonomy
     Information overload, filtering information to reduce redundancy
     Personalized information selection and input imbalance, filtering
      information according to user preferences
  Case study
     Retrieving and filtering bioinformatics publications




Research Topics

  Information Retrieval
  Bioinformatics




Information Retrieval

 Information Retrieval (IR) deals with the representation,
 storage, organization of, and access to information items.

 The user must first translate this information need into a query
 which can be processed by an IR system.

 Given the user query, the key goal of an IR system is to retrieve
 information which might be useful or relevant to the user.


                R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
                New York: Addison-Wesley, 1999.

Main IR Topics

    Indexing
    Search and Web Search
    Information Filtering
    Text Mining
    Text Categorization and Hierarchical Text Categorization




Bioinformatics

 Bioinformatics is the field of science in which biology,
 computer science, and information technology merge to form a
 single discipline.

 The ultimate goal of the field is to enable the discovery of new
 biological insights as well as to create a global perspective from
 which unifying principles in biology can be discerned.



                National Center for Biotechnology Information (NCBI),
                http://www.ncbi.nlm.nih.gov/.

Main Bioinformatics Research Areas

    Sequence analysis
    Genome annotation
    Computational evolutionary biology
    Analysis of gene expression
    Analysis of protein expression
    Analysis of mutations in cancer
    Comparative genomics
    Modelling biological systems
    Prediction of protein structure
    Molecular interaction

Why Bioinformatics
                       Needs
                Information Retrieval




Does Bioinformatics Need IR?

  Bioinformatics is concerned with researching, developing and
    applying tools and methods to acquire, analyse, organize and
    store biological and medical data

  Indexing and search techniques may help in the task of acquiring data
  Information filtering, text mining and text categorization
   techniques may be useful for the analysis of data
  Text categorization, with particular reference to hierarchical text
   categorization, may be used in the organization and storage tasks



Bioinformatics Data

  A huge amount of data to be
     Indexed
     Searched for in large databases or on the web
     Filtered according to users' preferences
     Text mined
     Categorized according to its textual content




DB Indexing

  Why
    Data types are relegated to blob and unstructured text fields
    Few results in building persistent access paths to support fast
     retrieval methods
    Genomic datasets in public repositories are annotated with free-text
     fields describing the pathological state of the studied sample
    Annotations are not mapped to concepts in any ontology




DB Indexing

  Who
    MoBIoS – Molecular Biological Information System
  What
    A specialized database management system
    The storage manager is based on metric-space indexing
    Query language entails biological data types
  Where
    Sequence homology: local alignment and mutations


                D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to
                Support Biological Discovery. Proceedings of the International
                Conference on Scientific and Statistical Database Management
                Systems, 2003.
DB Indexing

 Who
   --
 What
   Ontology-driven indexing of public datasets for translational
    bioinformatics
   Methods to map text annotations of gene expression datasets to
    concepts in the UMLS
 Where
   Gene Expression Omnibus
   Stanford Tissue Microarray Database

                N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A.
                Musen. Ontology-driven indexing of public datasets for translational
                bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
Web Indexing

  Why
    Most often sequence retrieval tools and sequence analysis tools are
     separated
    The usage of sequence DBs is often general and limited to
     keyword searching and entry retrieval
    Discovering and accessing the appropriate bioinformatics resource
     for a specific task has become increasingly important




Web Indexing

  Who
    SIRW – A Web Server for Simple Indexing and Retrieval System
  What
    A WWW interface to the Simple Indexing and Retrieval (SIR)
     system to parse and index flat file DBs
    A framework for doing sequence analysis for selected biological
     sequences
  Where
    Sequence analysis: motif pattern searches

                C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval
                System that combines sequence motif searches with keyword searches.
                Nucleic Acids Research, 31(13). pp. 3771-3774, 2003.

Web Indexing

  Who
    BIRI - BIoinformatics Resource Inventory
  What
    An approach for automatically discovering and indexing public
     bioinformatics resources
  Where
    The scientific literature




                G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V.
                Maojo. BIRI: a new approach for automatically discovering and
                indexing available public bioinformatics resources from the literature.
                BMC Bioinformatics, Oct 7;10:320, 2009.
DB Search

  Why
    A wealth of bioinformatics tools and databases has been created
     over the last decade and most are freely available
    Often it is desired to visualize the database hits stacked according
     to the query sequence
    There is no inventory presenting an up-to-date and easily
     searchable index of all these resources




DB Search

  Who
    MView – Multiple alignment Viewer
  What
    A tool for converting the result of a sequence database search into
     the form of a coloured multiple alignment of hits stacked against
     the query
  Where
    Multiple alignment


                N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible
                database search or multiple alignment viewer. Bioinformatics, 14(4), pp.
                380-381, 1998.

DB Search

  Who
    BioWareDB
  What
    An extensive and current catalog of software and DBs of relevance
     to researchers in the field of biology and medicine
  Where
    Current and available biomedical computing resources




                M.W. Matthiessen. BioWareDB: the biomedical software and database
                search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.


Web Search

  Why
    Today, scientists can easily post their research findings on the Web
     or compare their discoveries with previous work
    Manually maintaining a wrapper library will not scale to
     accommodate the growth of genomics data sources on the Web




Web Search

  Who
    ---
  What
    An automated system able to find, classify, and wrap new sources
     without constant human intervention
  Where
    Distributed genomics data sources




                D. Rocco and T. Critchlow. Automatic discovery and classification of
                bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933,
                2003.

Web Search

  Who
    GoPubMed
  What
    An ontology-based literature search applied to Gene Ontology
     (GO) and PubMed
  Where
    Scientific literature



                R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed:
                ontology-based literature search applied to gene ontology and PubMed.
                In Proceedings of German Bioinformatics Conference, pp. 169–178,
                2004.
Information Filtering

  Why
    In the Web 2.0 scenario, users look for collaborative environments,
     in which they can meet further users with similar preferences and
     needs
    Researchers need to search for and/or generate specialized datasets
     that meet specific requirements




Information Filtering

  Who
    ProDaMa-C Protein Dataset Management – Collaborative
  What
    A web application aimed at
             Generating specialized protein structure datasets
             Favouring the collaboration among researchers
  Where
    Protein structures


                G. Armano and A. Manconi. A Collaborative Web Application for
                Supporting Researchers in the Task of Generating Protein Datasets.
                Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A.
                Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
Information Filtering

  Who
    Gene Recommender
  What
    An algorithm that ranks genes according to how strongly they
     correlate with a set of query genes
  Where
    Analysis of gene expression




                A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene
                recommender algorithm to identify coexpressed genes. Genome
                Research, Aug;13(8), pp. 1828-37, 2003.

Text Mining

  Why
    Web-based tools capable of filtering public DBs are more and more
     required
    Interesting and useful information, relevant to the researcher, could
     appear in documents (e.g., papers) they have not read and therefore
     be missed entirely
    Of paramount importance to DB search methods is a reliable
     means of distinguishing true hits from false hits
    Biologists construct a pathway by reading a large number of
     articles and interpreting them as a consistent network, but the link to
     the original articles is lost


Text Mining

  Who
    MedMiner
  What
    An Internet text mining tool that filters the literature and presents
     the most relevant portions in a well-organized way that facilitates
     understanding
  Where
    Gene expression profiling

                L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N.
                Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical
                Information, with Application to Gene Expression Profiling.
                Biotechniques, Dec;27(6), pp. 1210-4, 1999.
Text Mining

  Who
    BioRAT
  What
    A research assistant that, given a query,
             autonomously finds a set of papers
             reads them
             highlights the most relevant facts in each
  Where
    Scientific literature

                D.P.A. Corney, B.F. Buxton, W.B. Langdon, and D.T. Jones.
                BioRAT: Extracting biological information from full-length papers.
                Bioinformatics, 20(17), pp. 3206–3213, 2004.

Text Mining

  Who
    SAWTED – Structure Assignment With Text Description
  What
    An automated system for filtering DB hits
  Where
    Homologues annotation




                R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure
                assignment with text description-enhanced detection of remote
                homologues with automated SWISS-PROT annotation comparisons.
                Bioinformatics, Feb;16(2), pp. 125-9, 2000.
Text Mining

  Who
    PathText
  What
    A system to integrate a pathway visualizer, text mining systems
     and annotation tools into a seamless environment
  Where
    Pathway visualizations



                B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S.
                Ananiadou, and J. Tsujii. PathText: a text mining integrator for
                biological pathway visualizations. Bioinformatics, 26(12), pp. i374-
                i381, 2010.
Text Categorization

  Why
    Information in text form, such as MEDLINE records, is a greatly
     underutilized source of biological information
    Individual researchers find it difficult to keep up with all the new,
     relevant information
    Systems that extract structured information from natural language
     passages have been highly successful in specialized domains
    Time is ripe for developing such applications for molecular biology
     and genomics




Text Categorization

  Who
    --
  What
    Constructing biological knowledge bases by extracting information
     from text sources
  Where
    MEDLINE



                M. Craven and J. Kumlien. Constructing Biological Knowledge Bases
                by Extracting Information from Text Sources. In Proceedings of the 7th
                International Conference on Intelligent Systems for Molecular Biology,
                1999.
Text Categorization

  Who
    Genies
  What
    A natural-language processing system for the extraction of
     molecular pathways
  Where
    Scientific publications



                C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies:
                a natural-language processing system for the extraction of molecular
                pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.

Hierarchical Text Categorization

  Why
    A great deal of genomics information accumulated over the years is
     available in online text repositories (such as MEDLINE)
    These resources still do not provide adequate mechanisms for
     retrieving the required information
    Traditional filtering techniques based on keyword search are often
     inadequate to express what the user is really searching for
    Web repositories, such as Medical Subject Headings (MeSH) in
     MEDLINE, encompass an underlying taxonomy




Hierarchical Text Categorization

  Who
    --
  What
    A tool for assisting biologists with literature search for the task of
     associating genes with Gene Ontology codes
  Where
    MEDLINE



                S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text
                categorization as a tool of associating genes with gene ontology codes.
                In 2nd European Workshop on Data Mining and Text Mining for
                Bioinformatics, pp. 26–30, 2004.
Hierarchical Text Categorization

  Who
    Pub.MAS
  What
    A multiagent system for retrieving and classifying publications
  Where
    BMC Bioinformatics
    PubMed Central


                G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
                Retrieving Bioinformatics Publications from Web Sources. IEEE
                Transactions on Nanobioscience, Special Session on GRID, Web
                Services, Software Agents and Ontology Applications for Life Science,
                6(2), pp. 104-109, 2007.
Case Study:
    Retrieving and Filtering
   Bioinformatics Publications




An IR Task

  [Figure: stages of the IR task]
     Online Repositories
     Information Extraction: Wrapping Information Sources
     Extracted Data/Information
     Text Categorization: Taxonomic Classification of Items
     Selected Data/Information
     User's Feedback: Adaptive Behavior



Information Extraction

  Essential to retrieve documents provided by heterogeneous and
    distributed sources




                A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira (2002) :
                A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp.
                84–93.
Text Categorization

  It is the task of determining and assigning topical labels to
   content
  Typical approaches to text categorization
       Statistical
       Semantic
  In recent years several researchers have investigated the use of
    hierarchies for text categorization


                F. Sebastiani. A tutorial on automated text categorisation. Proceedings
                of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-
                35, 1999.

Users' Feedback

  It is aimed at dealing with any feedback provided by the user
  In semiautomated classification and adaptive filtering we may
   expect the user of a classifier to provide feedback on how test
   documents have been classified
  In this case further training may be performed during the
   operating phase




Hierarchical Text Categorization

 Hierarchical Text Categorization (HTC) deals with problems
 where categories are organized in the form of a hierarchy.




                D. Koller, M. Sahami. Hierarchically classifying documents using very
                few words. Proceedings of 14th International Conference on Machine
                Learning, pp. 170– 178, 1997.

HTC at a Glance

  HTC studies how to improve the performance of
    classical text categorization techniques by exploiting the
    knowledge of the taxonomic relationships among classes




Motivations

  People organize large collections of documents in hierarchies of
   topics, or arrange a large body of knowledge in ontologies
  The main goal of automatic text categorization is to deal with
   underlying taxonomies
  A hierarchical approach can give benefits in real-world scenarios,
   characterized by information overload and imbalanced data




HTC Approaches

  Pachinko machine
     At each level of the hierarchy
             The classifier selects the single most probable category
             It goes down the hierarchy inspecting only the children of the selected
              node
  Probabilistic hierarchical local approach
     At each level of the hierarchy
             The classifier makes probabilistic decisions
             It selects the leaf categories on the most probable paths



                S. Kiritchenko. Hierarchical text categorization and its application to
                bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
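
The two strategies on this slide can be contrasted on a toy example. The sketch below is only illustrative: the taxonomy, the fixed local scores in TOY_SCORES, and the function names are all hypothetical, and a real system would obtain the scores from trained classifiers.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy taxonomy: each node maps to the list of its children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
    "genomics": [], "proteomics": [], "databases": [], "algorithms": [],
}

# Hypothetical local probabilities of each category given its parent.
TOY_SCORES = {
    "biology": 0.55, "computing": 0.45,
    "genomics": 0.6, "proteomics": 0.4,
    "databases": 0.9, "algorithms": 0.1,
}


def toy_score(doc: str, category: str) -> float:
    """Stand-in for a trained local classifier; ignores the document."""
    return TOY_SCORES[category]


def pachinko(doc: str, score: Callable[[str, str], float],
             taxonomy: Dict[str, List[str]] = TAXONOMY,
             root: str = "root") -> List[str]:
    """Greedy descent: keep only the best-scoring child at each level."""
    node, path = root, []
    while taxonomy[node]:
        node = max(taxonomy[node], key=lambda child: score(doc, child))
        path.append(node)
    return path


def probabilistic(doc: str, score: Callable[[str, str], float],
                  taxonomy: Dict[str, List[str]] = TAXONOMY,
                  root: str = "root") -> Tuple[List[str], float]:
    """Score every root-to-leaf path by the product of its local
    probabilities and return the most probable one."""
    best_path: List[str] = []
    best_p = -1.0
    stack = [(root, [], 1.0)]
    while stack:
        node, path, p = stack.pop()
        if not taxonomy[node]:          # leaf reached: compare whole path
            if p > best_p:
                best_path, best_p = path, p
            continue
        for child in taxonomy[node]:
            stack.append((child, path + [child], p * score(doc, child)))
    return best_path, best_p
```

With these scores the greedy descent commits to biology at the first level and ends in genomics, while the probabilistic variant prefers computing/databases, because 0.45 × 0.9 is larger than 0.55 × 0.6.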
HTC Approaches

 Local classifier per node
    Each classifier decides whether to forward the document to its children
 Local classifier per parent node
    Each classifier decides to which subtree(s) the document should be
     sent
 Local classifier per level
    The number of outputs per level grows while going down through
     the taxonomy
 Global classifier
    One classifier is trained, able to discriminate among all categories

                C.J. Silla and A. Freitas. A survey on hierarchical classification across
                different application domains. Journal of Data Mining and Knowledge
                Discovery, 2(1-2), pp. 31-72, 2010.
Progressive Filtering

 Progressive Filtering (PF) is a simple categorization technique
  that operates on hierarchically structured categories
 A way to implement PF consists of decomposing a given rooted
  taxonomy into pipelines, one for each path that exists between
  the root and each node of the taxonomy
 Each node is a binary classifier able to recognize whether or not
  an input belongs to the corresponding class
 A threshold selection algorithm (TSA) can be run to identify an
  optimal, or sub-optimal, combination of thresholds for each
  pipeline
                A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to
                Perform Hierarchical Text Categorization in Presence of Input
                Imbalance. Proceedings of International Conference on Knowledge
                Discovery and Information Retrieval, pp. 14-23, 2010.
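
The decomposition of a taxonomy into pipelines can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the taxonomy and the function name are hypothetical, and it assumes that the root stands for the whole domain and therefore does not appear as a classifier inside a pipeline.

```python
from typing import Dict, List


def pipelines(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return one pipeline (the path from the root to the node, root
    excluded) for every non-root node of the taxonomy."""
    result: List[List[str]] = []

    def visit(node: str, prefix: List[str]) -> None:
        for child in taxonomy.get(node, []):
            path = prefix + [child]
            result.append(path)      # a pipeline ending at this child
            visit(child, path)       # longer pipelines below it

    visit(root, [])
    return result


# Hypothetical three-level taxonomy used only for the example.
toy_taxonomy = {"root": ["A", "B"], "A": ["A1", "A2"], "B": []}
toy_pipelines = pipelines(toy_taxonomy, "root")
```

Here toy_pipelines contains four pipelines, one per non-root node: A, A then A1, A then A2, and B.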
PF at a Glance




  Starting from the root, each input traverses the taxonomy as a
     “token”
Classifiers in PF




  Partitioning the taxonomy in pipelines gives rise to a set of new
    classifiers, each represented by a pipeline


Classifiers in PF




  The same classifier may behave differently, depending on the
   pipeline in which it is embedded
  Each pipeline can be considered in isolation from the others
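
A pipeline can be read as a single composite classifier: an input is accepted only if every binary classifier along the path accepts it. The sketch below assumes that each classifier returns a score in [0, 1] and that each pipeline carries its own thresholds; the scores and names are hypothetical.

```python
from typing import Callable, Dict, List


def pipeline_accepts(doc: str, pipeline: List[str],
                     thresholds: Dict[str, float],
                     score: Callable[[str, str], float]) -> bool:
    """Accept the document only if every classifier in the pipeline
    scores at or above its threshold; stop at the first rejection."""
    for category in pipeline:
        if score(doc, category) < thresholds[category]:
            return False
    return True


# Hypothetical fixed scores standing in for trained binary classifiers.
DEMO_SCORES = {"A": 0.6, "A1": 0.7, "A2": 0.2}


def demo_score(doc: str, category: str) -> float:
    return DEMO_SCORES[category]


# The classifier for A appears in both pipelines with different thresholds.
accepted_by_a = pipeline_accepts("doc", ["A"], {"A": 0.7}, demo_score)
accepted_by_a1 = pipeline_accepts("doc", ["A", "A1"],
                                  {"A": 0.5, "A1": 0.6}, demo_score)
```

In this example the same classifier for A rejects the document in the one-node pipeline but lets it through in the pipeline ending at A1, because its threshold differs between the two.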
Threshold Selection in PF

  A relevant problem is how to calibrate the thresholds of the
   binary classifiers embedded in each pipeline in order to
   optimize the pipeline behaviour
  Searching for an optimal or sub-optimal combination of
   thresholds in a pipeline can be viewed as the problem of
   finding a maximum of a utility function F that depends on the
   corresponding threshold vector θ




TSA

  For each pipeline the best combination of thresholds is
    calculated according to a bottom-up algorithm that uses two
    functions
       Repair, which increases/decreases (↑ / ↓) the threshold until the
        utility function reaches a maximum
       Calibrate, which recursively operates downward from the given
        classifier by repeatedly calling Repair (↑ / ↓)


                A. Addis, G. Armano, E. Vargiu. A comparative experimental
                assessment of a threshold selection algorithm in hierarchical text
                categorization. In: Advances in Information Retrieval. The 33rd
                European Conference on Information Retrieval (ECIR 2011), 2011
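
The following is a simplified sketch in the spirit of the bottom-up search, not the published algorithm. It assumes thresholds in [0, 1], a fixed step size, a greedy stop at the first non-improving move, and an iterative calibrate in place of a recursive one; toy_utility is a made-up function standing in for a real utility F measured on validation data.

```python
from typing import Callable, List

Utility = Callable[[List[float]], float]


def repair(theta: List[float], k: int, step: float, utility: Utility) -> None:
    """Move theta[k] by `step` (positive = up, negative = down) for as
    long as the utility strictly improves. Modifies theta in place."""
    best = utility(theta)
    while True:
        candidate = round(theta[k] + step, 6)   # avoid float drift
        if not 0.0 <= candidate <= 1.0:
            break
        previous = theta[k]
        theta[k] = candidate
        value = utility(theta)
        if value > best:
            best = value
        else:
            theta[k] = previous                 # undo the unhelpful move
            break


def calibrate(theta: List[float], k: int, step: float, utility: Utility) -> None:
    """Repair level k in both directions, then every level below it."""
    for j in range(k, len(theta)):
        repair(theta, j, +step, utility)
        repair(theta, j, -step, utility)


def select_thresholds(n_levels: int, step: float, utility: Utility) -> List[float]:
    """Bottom-up search: start at the last classifier of the pipeline
    and move towards the first, recalibrating the levels below."""
    theta = [0.0] * n_levels
    for k in range(n_levels - 1, -1, -1):
        calibrate(theta, k, step, utility)
    return theta


def toy_utility(theta: List[float]) -> float:
    """Hypothetical utility with a single maximum at (0.3, 0.6)."""
    return -((theta[0] - 0.3) ** 2 + (theta[1] - 0.6) ** 2)


best_theta = select_thresholds(2, 0.1, toy_utility)
```

On this toy utility the search settles on thresholds of about 0.3 and 0.6. A utility with several local maxima could trap such a greedy search, which is why the slides speak of an optimal or sub-optimal combination.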


TSA: An Example




The Prototype

  MultiAgent Architecture
     X.MAS
  Agent Framework
     JADE



                A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent
                Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6,
                Sixth International Workshop, From Agent Theory to Agent
                Implementation, pp. 3–9, 2008.

                F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent
                Systems with JADE (Wiley Series in Agent Technology). John Wiley
                and Sons, 2007.
X.MAS at a Glance

  Macro-architecture




X.MAS at a Glance
  Micro-architecture
     Information Agent: Scheduler, Source
     Middle Agent: Scheduler, Dispatcher
     Filter Agent: Scheduler, Actuator
     Middle Agent: Scheduler, Dispatcher
     Task Agent: Scheduler, Actuator
     Middle Agent: Scheduler, Dispatcher
     Interface Agent: Scheduler


Pub.MAS




                G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for
                Retrieving Bioinformatics Publications from Web Sources. IEEE
                Transactions on Nanobioscience, Special Session on GRID, Web
                Services, Software Agents and Ontology Applications for Life Science,
                6(2), pp. 104-109, 2007.
Information Extraction

  It is supported by a set of agents explicitly devoted to
     wrapping the selected information sources
     encoding the extracted documents
  An information agent wraps the BMC Bioinformatics web site
     HTML wrapper
  An information agent wraps the PubMed Central digital archive
     Web service wrapper
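
As an illustration of what an HTML wrapper does, the sketch below pulls article titles out of a page using only the Python standard library. The markup it expects (titles inside h3 elements with class "title") and the sample page are hypothetical and do not reflect the actual structure of any journal site or of the prototype's Java agents.

```python
from html.parser import HTMLParser
from typing import List


class TitleWrapper(HTMLParser):
    """Collect the text of every <h3 class="title"> element
    (hypothetical markup chosen for the example)."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: List[str] = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h3" and ("class", "title") in attrs:
            self._inside = True
            self.titles.append("")

    def handle_data(self, data):
        if self._inside:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self._inside = False


def extract_titles(html: str) -> List[str]:
    parser = TitleWrapper()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


SAMPLE_PAGE = """
<html><body>
  <h3 class="title">A toy paper on sequence alignment</h3>
  <p>Abstract text that the wrapper ignores.</p>
  <h3 class="title">Another toy paper on text mining</h3>
</body></html>
"""

sample_titles = extract_titles(SAMPLE_PAGE)
```

A wrapper of this kind is tied to the page layout, so it has to be revised whenever the source changes its markup. A web service wrapper avoids that by consuming a structured response instead.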




Hierarchical Text Categorization

  The PF approach previously described has been implemented
  Documents have been encoded to
     remove all non-informative words
     remove the most common morphological and inflexional suffixes
     select the relevant features
     generate a feature vector for each document
  Classification is performed by wkNN classifiers
     the score is assigned using non-parametric density estimation of the
      “a posteriori” probability
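
The encoding steps and the weighted kNN score can be sketched as follows. Everything here is an assumption made for illustration: the tiny stopword list, the crude suffix stripping (a real system would use a proper stemmer), the fixed vocabulary standing in for feature selection, and the inverse-distance weighting, which is one common choice and not necessarily the one used in the prototype.

```python
import math
import re
from collections import Counter
from typing import List, Sequence, Tuple

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for"}  # illustrative only
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer


def encode(text: str, vocabulary: Sequence[str]) -> List[int]:
    """Drop stopwords, strip a few suffixes, and count the selected terms."""
    stems = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    counts = Counter(stems)
    return [counts[term] for term in vocabulary]


def wknn_score(x: Sequence[float],
               training: Sequence[Tuple[Sequence[float], bool]],
               k: int = 3) -> float:
    """Share of the neighbours' total weight that belongs to positive
    examples, with each neighbour weighted by inverse distance."""
    neighbours = sorted(training, key=lambda ex: math.dist(x, ex[0]))[:k]
    weighted = [(1.0 / (1e-9 + math.dist(x, vec)), label)
                for vec, label in neighbours]
    total = sum(weight for weight, _ in weighted)
    return sum(weight for weight, label in weighted if label) / total


VOCABULARY = ["align", "sequenc", "protein"]   # hypothetical selected features
vector = encode("Aligning the sequences of proteins", VOCABULARY)
TRAINING = [([1, 1, 1], True), ([1, 0, 1], True), ([0, 0, 0], False)]
positive_score = wknn_score(vector, TRAINING)
```

The returned value lies between 0 and 1 and can be compared with a threshold, which is what a pipeline of binary classifiers needs.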




The Adopted Taxonomy




                P.G. Baker, C.A. Goble, S. Bechhofer, N.W. Paton, R. Stevens, and A.
                Brass. An ontology for bioinformatics applications. Bioinformatics,
                15(6), pp. 510–520, 1999.
Users' Feedback

  This component handles any feedback provided
   by the user
  Two solutions have been tested
       training an ANN
       using a kNN classifier
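
One reason a kNN classifier suits user feedback is that "further training" amounts to storing new labelled examples. The sketch below is a hypothetical illustration of that idea, with made-up class and method names and plain majority voting; it is not the prototype's code.

```python
import math
from typing import List, Sequence, Tuple


class FeedbackKNN:
    """Keeps the user's judgments and predicts by majority vote among
    the k nearest stored examples."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.examples: List[Tuple[List[float], bool]] = []

    def add_feedback(self, vector: Sequence[float], relevant: bool) -> None:
        # "Training" during the operating phase is just storing the example.
        self.examples.append((list(vector), bool(relevant)))

    def predict(self, vector: Sequence[float], default: bool = True) -> bool:
        if not self.examples:
            return default                      # no feedback yet
        nearest = sorted(self.examples,
                         key=lambda ex: math.dist(vector, ex[0]))[: self.k]
        votes = sum(1 for _, relevant in nearest if relevant)
        return votes * 2 > len(nearest)


# Hypothetical two-feature documents and judgments.
model = FeedbackKNN(k=1)
model.add_feedback([1.0, 0.0], relevant=False)
model.add_feedback([0.9, 0.1], relevant=False)
model.add_feedback([0.0, 1.0], relevant=True)
```

After these three judgments, a document close to [0, 1] is predicted relevant and one close to [1, 0] is filtered out. Before any feedback arrives, the model falls back on the default.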




Experiments

  Different kinds of tests have been performed, each aimed at
    highlighting a specific issue
       we estimated the (normalized) confusion matrix for each classifier
        belonging to the highest level of the taxonomy
       we studied the impact of using pipelines of classifiers, also
        trying to assess whether any residual independence was in fact
        present
       we assessed the solution devised for implementing user feedback,
        based on the kNN technique
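
As an illustration of the first kind of test, a normalized confusion matrix for a single binary classifier can be computed as in the sketch below. It assumes that "normalized" means row-normalized (each row sums to 1); the function name and the sample labels are hypothetical and are not results from the experiments.

```python
def normalized_confusion_matrix(actual, predicted):
    """Return a 2x2 row-normalized confusion matrix for a binary classifier.

    Rows are the actual class (positive, negative);
    columns are the predicted class (positive, negative).
    """
    counts = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        counts[0 if a else 1][0 if p else 1] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # An empty row (no examples of that class) is reported as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Made-up labels for ten documents, for illustration only.
ACTUAL = [True, True, True, True, False, False, False, False, False, False]
PREDICTED = [True, True, True, False, False, False, False, False, True, True]

for row in normalized_confusion_matrix(ACTUAL, PREDICTED):
    print(row)
```

Row normalization makes classifiers trained on categories of different sizes directly comparable, since each row reports rates rather than raw counts.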




Experiments

  Tests have been performed using selected publications extracted
   from the BMC Bioinformatics site and from the PubMed Central
   digital archive
  Publications have been classified by a domain expert
   according to the proposed taxonomy
  For each item of the taxonomy, a set of about 100-150 articles
   has been selected to train the corresponding wkNN classifier,
   and 300-400 articles have been used to test it




Conclusions

  Bioinformatics needs suitable, automated, and “intelligent”
   solutions to acquire, analyse, organize, and store biological data
  IR can be very useful for tackling bioinformatics problems
  Currently, only a few IR techniques have been adopted to solve
   bioinformatics tasks
  A system aimed at retrieving and filtering bioinformatics
   publications has been presented as a case study
  We argue that further investigations and experiments could be
   carried out to exploit IR in bioinformatics



Acknowledgments

  This work was partially supported by the Italian Ministry of
   Education – Investment funds for basic research, under the
   project ITALBIONET – Italian Network of Bioinformatics
  I wish to thank all the IASC Group members for their valuable
   help
  IASC Group members are:
       G. Armano – head
       A. Addis, F. Mascia and E. Vargiu – PhD, Post Doc
       A. Giuliani, N. Hatami, M. Javarone and F. Ledda – PhD students
       S. Curatti – collaborator, programmer
  I also wish to thank Andrea Manconi for his suggestions

Thanks for your attention!
Contact: Eloisa Vargiu vargiu@diee.unica.it

February 16, 2011 – Valencia (Spain)

Mais conteúdo relacionado

Mais procurados

Gene regulatory networks
Gene regulatory networksGene regulatory networks
Gene regulatory networksMadiheh
 
Major resources of bioinformatics 2
Major resources of bioinformatics 2Major resources of bioinformatics 2
Major resources of bioinformatics 2Mohd Affan
 
Protein information resource (PIR)
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)ShivaniShewale2
 
Sequence Submission Tools
Sequence Submission ToolsSequence Submission Tools
Sequence Submission ToolsRishikaMaji
 
Genomics and its application in crop improvement
Genomics and its application in crop improvementGenomics and its application in crop improvement
Genomics and its application in crop improvementKhemlata20
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomicsajay301
 
Bioinformatics and its Applications in Agriculture/Sericulture and in other F...
Bioinformatics and its Applications in Agriculture/Sericulture and in other F...Bioinformatics and its Applications in Agriculture/Sericulture and in other F...
Bioinformatics and its Applications in Agriculture/Sericulture and in other F...mohd younus wani
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to BioinformaticsAsad Afridi
 
Bioinformatics
BioinformaticsBioinformatics
BioinformaticsJTADrexel
 
sequence alignment
sequence alignmentsequence alignment
sequence alignmentammar kareem
 
Publicly available tools and open resources in Bioinformatics
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in BioinformaticsArindam Ghosh
 
Metabolic Network Analysis
Metabolic Network AnalysisMetabolic Network Analysis
Metabolic Network AnalysisAreejit Samal
 

Mais procurados (20)

Gene regulatory networks
Gene regulatory networksGene regulatory networks
Gene regulatory networks
 
Major resources of bioinformatics 2
Major resources of bioinformatics 2Major resources of bioinformatics 2
Major resources of bioinformatics 2
 
Protein database
Protein databaseProtein database
Protein database
 
Protein databases
Protein databasesProtein databases
Protein databases
 
EMBL
EMBLEMBL
EMBL
 
NCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology InformationNCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology Information
 
Protein information resource (PIR)
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)
 
Blast Algorithm
Blast AlgorithmBlast Algorithm
Blast Algorithm
 
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
 
Sequence Submission Tools
Sequence Submission ToolsSequence Submission Tools
Sequence Submission Tools
 
Genomics and its application in crop improvement
Genomics and its application in crop improvementGenomics and its application in crop improvement
Genomics and its application in crop improvement
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Bioinformatics and its Applications in Agriculture/Sericulture and in other F...
Bioinformatics and its Applications in Agriculture/Sericulture and in other F...Bioinformatics and its Applications in Agriculture/Sericulture and in other F...
Bioinformatics and its Applications in Agriculture/Sericulture and in other F...
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Data Retrieval Systems
Data Retrieval SystemsData Retrieval Systems
Data Retrieval Systems
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Publicly available tools and open resources in Bioinformatics
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in Bioinformatics
 
SEQUENCE ANALYSIS
SEQUENCE ANALYSISSEQUENCE ANALYSIS
SEQUENCE ANALYSIS
 
Metabolic Network Analysis
Metabolic Network AnalysisMetabolic Network Analysis
Metabolic Network Analysis
 

Destaque

Windows Azure Casestudy on Document Search & Retrieval
Windows Azure Casestudy on Document Search & RetrievalWindows Azure Casestudy on Document Search & Retrieval
Windows Azure Casestudy on Document Search & RetrievalSaviant Consulting
 
Realtime search engine concept
Realtime search engine conceptRealtime search engine concept
Realtime search engine concept상욱 송
 
Developing Document Image Retrieval System
Developing Document Image Retrieval SystemDeveloping Document Image Retrieval System
Developing Document Image Retrieval SystemKonstantinos Zagoris
 
google search engine
google search enginegoogle search engine
google search engineway2go
 

Destaque (6)

NBITSearch. Features.
NBITSearch. Features.NBITSearch. Features.
NBITSearch. Features.
 
Windows Azure Casestudy on Document Search & Retrieval
Windows Azure Casestudy on Document Search & RetrievalWindows Azure Casestudy on Document Search & Retrieval
Windows Azure Casestudy on Document Search & Retrieval
 
Text Indexing and Retrieval
Text Indexing and RetrievalText Indexing and Retrieval
Text Indexing and Retrieval
 
Realtime search engine concept
Realtime search engine conceptRealtime search engine concept
Realtime search engine concept
 
Developing Document Image Retrieval System
Developing Document Image Retrieval SystemDeveloping Document Image Retrieval System
Developing Document Image Retrieval System
 
google search engine
google search enginegoogle search engine
google search engine
 

Semelhante a Bioinformatics Meets Information Retrieval

BIOINFO unit 1.pptx
BIOINFO unit 1.pptxBIOINFO unit 1.pptx
BIOINFO unit 1.pptxrnath286
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Sciencedrnigam
 
Bioinformatics in biotechnology by kk sahu
Bioinformatics in biotechnology by kk sahu Bioinformatics in biotechnology by kk sahu
Bioinformatics in biotechnology by kk sahu KAUSHAL SAHU
 
The Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark DataThe Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark Datavbrant
 
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Bryan Heidorn
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATIONONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATIONIJwest
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION dannyijwest
 
Databases in Bioinformatics
Databases in BioinformaticsDatabases in Bioinformatics
Databases in BioinformaticsMeghaj Mallick
 
RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework
RDAP14: Maryann Martone, Keynote, The Neuroscience Information FrameworkRDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework
RDAP14: Maryann Martone, Keynote, The Neuroscience Information FrameworkASIS&T
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...SBituila
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...BibiQuinah
 
How do we know what we don’t know: Using the Neuroscience Information Framew...
How do we know what we don’t know:  Using the Neuroscience Information Framew...How do we know what we don’t know:  Using the Neuroscience Information Framew...
How do we know what we don’t know: Using the Neuroscience Information Framew...Maryann Martone
 
Biodiversity Informatics: An Interdisciplinary Challenge
Biodiversity Informatics: An Interdisciplinary ChallengeBiodiversity Informatics: An Interdisciplinary Challenge
Biodiversity Informatics: An Interdisciplinary ChallengeBryan Heidorn
 
Pratt SILS Knowledge Organization Spring 2011
Pratt SILS Knowledge Organization Spring 2011Pratt SILS Knowledge Organization Spring 2011
Pratt SILS Knowledge Organization Spring 2011PrattSILS
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)Besnik Fetahu
 
Nucleic acid and protein databanks
Nucleic acid and protein databanksNucleic acid and protein databanks
Nucleic acid and protein databanksNithyaNandapal
 
Facilitating semantic alignment.-biohackathon-jupp
Facilitating semantic alignment.-biohackathon-juppFacilitating semantic alignment.-biohackathon-jupp
Facilitating semantic alignment.-biohackathon-juppSimon Jupp
 
Phyloinformatics and the Semantic Web
Phyloinformatics and the Semantic WebPhyloinformatics and the Semantic Web
Phyloinformatics and the Semantic WebRutger Vos
 

Semelhante a Bioinformatics Meets Information Retrieval (20)

BIOINFO unit 1.pptx
BIOINFO unit 1.pptxBIOINFO unit 1.pptx
BIOINFO unit 1.pptx
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Science
 
Bioinformatics in biotechnology by kk sahu
Bioinformatics in biotechnology by kk sahu Bioinformatics in biotechnology by kk sahu
Bioinformatics in biotechnology by kk sahu
 
The Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark DataThe Path to Enlightened Solutions for Biodiversity's Dark Data
The Path to Enlightened Solutions for Biodiversity's Dark Data
 
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT...
 
Biological databases.pptx
Biological databases.pptxBiological databases.pptx
Biological databases.pptx
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATIONONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
 
Databases in Bioinformatics
Databases in BioinformaticsDatabases in Bioinformatics
Databases in Bioinformatics
 
RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework
RDAP14: Maryann Martone, Keynote, The Neuroscience Information FrameworkRDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework
RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...
 
How do we know what we don’t know: Using the Neuroscience Information Framew...
How do we know what we don’t know:  Using the Neuroscience Information Framew...How do we know what we don’t know:  Using the Neuroscience Information Framew...
How do we know what we don’t know: Using the Neuroscience Information Framew...
 
Biodiversity Informatics: An Interdisciplinary Challenge
Biodiversity Informatics: An Interdisciplinary ChallengeBiodiversity Informatics: An Interdisciplinary Challenge
Biodiversity Informatics: An Interdisciplinary Challenge
 
Pratt SILS Knowledge Organization Spring 2011
Pratt SILS Knowledge Organization Spring 2011Pratt SILS Knowledge Organization Spring 2011
Pratt SILS Knowledge Organization Spring 2011
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
 
Nucleic acid and protein databanks
Nucleic acid and protein databanksNucleic acid and protein databanks
Nucleic acid and protein databanks
 
Facilitating semantic alignment.-biohackathon-jupp
Facilitating semantic alignment.-biohackathon-juppFacilitating semantic alignment.-biohackathon-jupp
Facilitating semantic alignment.-biohackathon-jupp
 
Presentation (3).pptx
Presentation (3).pptxPresentation (3).pptx
Presentation (3).pptx
 
Phyloinformatics and the Semantic Web
Phyloinformatics and the Semantic WebPhyloinformatics and the Semantic Web
Phyloinformatics and the Semantic Web
 

Mais de Eloisa Vargiu

Citizen empowerment throughout the 4 pillars of health
Citizen empowerment throughout the 4 pillars of healthCitizen empowerment throughout the 4 pillars of health
Citizen empowerment throughout the 4 pillars of healthEloisa Vargiu
 
Improving Sleeping Habits: Preliminary Experiments in Barcelona and Lleida
Improving Sleeping Habits: Preliminary Experiments in Barcelona and LleidaImproving Sleeping Habits: Preliminary Experiments in Barcelona and Lleida
Improving Sleeping Habits: Preliminary Experiments in Barcelona and LleidaEloisa Vargiu
 
Medical Technology in Sleep
Medical Technology in SleepMedical Technology in Sleep
Medical Technology in SleepEloisa Vargiu
 
Patient Empowerment from an Integrated Care Approach
Patient Empowerment from an Integrated Care ApproachPatient Empowerment from an Integrated Care Approach
Patient Empowerment from an Integrated Care ApproachEloisa Vargiu
 
Connected Care for Complex Chronic Patients in Lleida
Connected Care for Complex Chronic Patients in LleidaConnected Care for Complex Chronic Patients in Lleida
Connected Care for Complex Chronic Patients in LleidaEloisa Vargiu
 
Self-Management of Complex Chronic Patients: Needs and A Proposal
Self-Management of Complex Chronic Patients: Needs and A ProposalSelf-Management of Complex Chronic Patients: Needs and A Proposal
Self-Management of Complex Chronic Patients: Needs and A ProposalEloisa Vargiu
 
Patient Empowerment in CONNECARE
Patient Empowerment in CONNECAREPatient Empowerment in CONNECARE
Patient Empowerment in CONNECAREEloisa Vargiu
 
From Healthy to Happy Ageing: the Power of Self-Management
From Healthy to Happy Ageing: the Power of Self-ManagementFrom Healthy to Happy Ageing: the Power of Self-Management
From Healthy to Happy Ageing: the Power of Self-ManagementEloisa Vargiu
 
The CONNECARE Project
The CONNECARE ProjectThe CONNECARE Project
The CONNECARE ProjectEloisa Vargiu
 
Self-management of complex chronic patients: the CONNECARE experience
Self-management of complex chronic patients: the CONNECARE experienceSelf-management of complex chronic patients: the CONNECARE experience
Self-management of complex chronic patients: the CONNECARE experienceEloisa Vargiu
 
A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...
A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...
A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...Eloisa Vargiu
 
Integrated Care for Complex Chronic Patients
Integrated Care for Complex Chronic PatientsIntegrated Care for Complex Chronic Patients
Integrated Care for Complex Chronic PatientsEloisa Vargiu
 
Automatic Support for Improving Management and Treatment of Patients with Obt...
Automatic Support for Improving Management and Treatment of Patients with Obt...Automatic Support for Improving Management and Treatment of Patients with Obt...
Automatic Support for Improving Management and Treatment of Patients with Obt...Eloisa Vargiu
 
The CONNECARE project at a glance
The CONNECARE project at a glanceThe CONNECARE project at a glance
The CONNECARE project at a glanceEloisa Vargiu
 
Challenge - Choice - Change of CONNECARE
Challenge - Choice - Change of CONNECAREChallenge - Choice - Change of CONNECARE
Challenge - Choice - Change of CONNECAREEloisa Vargiu
 
Third Generation Teleassistance - Intelligent Monitoring Makes the Difference
Third Generation Teleassistance - Intelligent Monitoring Makes the DifferenceThird Generation Teleassistance - Intelligent Monitoring Makes the Difference
Third Generation Teleassistance - Intelligent Monitoring Makes the DifferenceEloisa Vargiu
 
Monitoring Elderly People at Home: Results and Lessons Learned
Monitoring Elderly People at Home: Results and Lessons LearnedMonitoring Elderly People at Home: Results and Lessons Learned
Monitoring Elderly People at Home: Results and Lessons LearnedEloisa Vargiu
 
Monitoring people that need assistance: the BackHome experience
Monitoring people that need assistance: the BackHome experienceMonitoring people that need assistance: the BackHome experience
Monitoring people that need assistance: the BackHome experienceEloisa Vargiu
 
Brain Computer Interfaces on Track to Home: Results and Lessons Learnt
Brain Computer Interfaces on Track to Home: Results and Lessons LearntBrain Computer Interfaces on Track to Home: Results and Lessons Learnt
Brain Computer Interfaces on Track to Home: Results and Lessons LearntEloisa Vargiu
 
Monitoring People that Need Assistance through a Sensor-based System: Evaluat...
Monitoring People that Need Assistance through a Sensor-based System: Evaluat...Monitoring People that Need Assistance through a Sensor-based System: Evaluat...
Monitoring People that Need Assistance through a Sensor-based System: Evaluat...Eloisa Vargiu
 

Mais de Eloisa Vargiu (20)

Citizen empowerment throughout the 4 pillars of health
Citizen empowerment throughout the 4 pillars of healthCitizen empowerment throughout the 4 pillars of health
Citizen empowerment throughout the 4 pillars of health
 
Improving Sleeping Habits: Preliminary Experiments in Barcelona and Lleida
Improving Sleeping Habits: Preliminary Experiments in Barcelona and LleidaImproving Sleeping Habits: Preliminary Experiments in Barcelona and Lleida
Improving Sleeping Habits: Preliminary Experiments in Barcelona and Lleida
 
Medical Technology in Sleep
Medical Technology in SleepMedical Technology in Sleep
Medical Technology in Sleep
 
Patient Empowerment from an Integrated Care Approach
Patient Empowerment from an Integrated Care ApproachPatient Empowerment from an Integrated Care Approach
Patient Empowerment from an Integrated Care Approach
 
Connected Care for Complex Chronic Patients in Lleida
Connected Care for Complex Chronic Patients in LleidaConnected Care for Complex Chronic Patients in Lleida
Connected Care for Complex Chronic Patients in Lleida
 
Self-Management of Complex Chronic Patients: Needs and A Proposal
Self-Management of Complex Chronic Patients: Needs and A ProposalSelf-Management of Complex Chronic Patients: Needs and A Proposal
Self-Management of Complex Chronic Patients: Needs and A Proposal
 
Patient Empowerment in CONNECARE
Patient Empowerment in CONNECAREPatient Empowerment in CONNECARE
Patient Empowerment in CONNECARE
 
From Healthy to Happy Ageing: the Power of Self-Management
From Healthy to Happy Ageing: the Power of Self-ManagementFrom Healthy to Happy Ageing: the Power of Self-Management
From Healthy to Happy Ageing: the Power of Self-Management
 
The CONNECARE Project
The CONNECARE ProjectThe CONNECARE Project
The CONNECARE Project
 
Self-management of complex chronic patients: the CONNECARE experience
Self-management of complex chronic patients: the CONNECARE experienceSelf-management of complex chronic patients: the CONNECARE experience
Self-management of complex chronic patients: the CONNECARE experience
 
A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...
A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...
A Hierarchical Approach to Recognize Purposeful Movements Using Inertial Sens...
 
Integrated Care for Complex Chronic Patients
Integrated Care for Complex Chronic PatientsIntegrated Care for Complex Chronic Patients
Integrated Care for Complex Chronic Patients
 
Automatic Support for Improving Management and Treatment of Patients with Obt...
Automatic Support for Improving Management and Treatment of Patients with Obt...Automatic Support for Improving Management and Treatment of Patients with Obt...
Automatic Support for Improving Management and Treatment of Patients with Obt...
 
The CONNECARE project at a glance
The CONNECARE project at a glanceThe CONNECARE project at a glance
The CONNECARE project at a glance
 
Challenge - Choice - Change of CONNECARE
Challenge - Choice - Change of CONNECAREChallenge - Choice - Change of CONNECARE
Challenge - Choice - Change of CONNECARE
 
Third Generation Teleassistance - Intelligent Monitoring Makes the Difference
Third Generation Teleassistance - Intelligent Monitoring Makes the DifferenceThird Generation Teleassistance - Intelligent Monitoring Makes the Difference
Third Generation Teleassistance - Intelligent Monitoring Makes the Difference
 
Monitoring Elderly People at Home: Results and Lessons Learned
Monitoring Elderly People at Home: Results and Lessons LearnedMonitoring Elderly People at Home: Results and Lessons Learned
Monitoring Elderly People at Home: Results and Lessons Learned
 
Monitoring people that need assistance: the BackHome experience
Monitoring people that need assistance: the BackHome experienceMonitoring people that need assistance: the BackHome experience
Monitoring people that need assistance: the BackHome experience
 
Brain Computer Interfaces on Track to Home: Results and Lessons Learnt
Brain Computer Interfaces on Track to Home: Results and Lessons LearntBrain Computer Interfaces on Track to Home: Results and Lessons Learnt
Brain Computer Interfaces on Track to Home: Results and Lessons Learnt
 
Monitoring People that Need Assistance through a Sensor-based System: Evaluat...
Monitoring People that Need Assistance through a Sensor-based System: Evaluat...Monitoring People that Need Assistance through a Sensor-based System: Evaluat...
Monitoring People that Need Assistance through a Sensor-based System: Evaluat...
 

Último

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Bioinformatics Meets Information Retrieval

  • 1. Bioinformatics Meets Information Retrieval State of the Art and a Case Study Eloisa Vargiu Intelligent Agents and Soft-Computing Group Dept. of Electrical and Electronic Engineering University of Cagliari, Italy February 16, 2011 – Valencia (Spain) email: vargiu@diee.unica.it
  • 2. My Background. 2000 – 2004: Automatic planning (classic domains: HW[]; dynamic domains: HIPE). 2004 – 2009: Bioinformatics (protein secondary structure prediction: MASSP3 and GAME/SSP). 2000 – …: Multiagent systems (a Personalized Adaptive and Cooperative Multiagent System: PACMAS; a generic architecture to perform information retrieval tasks: X.MAS). 2006 – …: Information Retrieval (hierarchical text categorization: PF and TSA; recommender systems and contextual advertising: ConCA).
  • 3. Outline: Context and Mission; Why Bioinformatics Needs Information Retrieval; Bioinformatics Meets Information Retrieval; Case Study: Retrieving and Filtering Bioinformatics Publications; Conclusions.
  • 4. Context and Mission
  • 5. Web Evolution. Web 1.0 (1993): source of information; personal homepages. Web 2.0 (2004): social networks; (micro)blogging. Web 3.0 (2005): semantic web; web composition.
  • 6. Web Evolution and Bioinformatics. A long time ago: data was stored in local DBs; data was shared as flat files; biologists worked alone or in small groups.
  • 7. Web Evolution and Bioinformatics. Today: online repositories, where the major sources of nucleotide sequences are those belonging to the International Nucleotide Sequence Database Collaboration, namely DDBJ (DNA DataBank of Japan), EMBL (European Molecular Biology Laboratory) and GenBank (NIH genetic sequence database); web services, where basic bioinformatics services are classified by the EBI into three categories, namely SSS (Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological Sequence Analysis).
  • 8. Web Evolution and Scientific Publications. A long time ago: publications were consulted at the library; just two or three relevant journals were available; relevant publications were selected manually.
  • 9. Web Evolution and Scientific Publications. Today: online journals; online conference proceedings; publications are often available for free; manual selection of relevant publications becomes unfeasible.
  • 10. As a Consequence... Unstructured information; information overload; personalized information selection and input imbalance.
  • 11. Our Mission. To cope with: unstructured information, by classifying documents according to a given taxonomy; information overload, by filtering information to reduce redundancy; personalized information selection and input imbalance, by filtering information according to user preferences. Case study: retrieving and filtering bioinformatics publications.
  • 12. Research Topics: Information Retrieval; Bioinformatics.
  • 13. Information Retrieval. Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. The user must first translate this information need into a query which can be processed by an IR system. Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the user. Reference: R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. New York: Addison-Wesley, 1999.
  • 14. Main IR Topics: indexing; search and web search; information filtering; text mining; text categorization and hierarchical text categorization.
  • 15. Bioinformatics. Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. Reference: National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
  • 16. Main Bioinformatics Research Areas: sequence analysis; genome annotation; computational evolutionary biology; analysis of gene expression; analysis of protein expression; analysis of mutations in cancer; comparative genomics; modelling biological systems; prediction of protein structure; molecular interaction.
  • 17. Why Bioinformatics Needs Information Retrieval
  • 18. Does Bioinformatics Need IR? Bioinformatics is concerned with researching, developing and applying tools and methods to acquire, analyse, organize and store biological and medical data. Indexing and search techniques may help in acquiring data. Information filtering, text mining and text categorization techniques may be useful for analysing data. Text categorization, with particular reference to hierarchical text categorization, may be used in the organization and storage tasks.
  • 19. Bioinformatics Data. A huge amount of data to be: indexed; searched for in large databases or on the web; filtered according to users' preferences; text mined; categorized according to its textual content.
  • 20. DB Indexing. Why: data types are relegated to blob and unstructured text fields; few results have been achieved in building persistent access paths to support fast retrieval methods; genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample; annotations are not mapped to concepts in any ontology.
  • 21. DB Indexing. Who: MoBIoS (Molecular Biological Information System). What: a specialized database management system; the storage manager is based on metric-space indexing; the query language entails biological data types. Where: sequence homology (local alignment and mutations). Reference: D. Miranker, W. Xu, and R. Mao. MoBIoS: a Metric-Space DBMS to Support Biological Discovery. Proceedings of the International Conference on Scientific and Statistical Database Management Systems, 2003.
  • 22. DB Indexing. Who: --. What: ontology-driven indexing of public datasets for translational bioinformatics; methods to map text annotations of gene expression datasets to concepts in the UMLS. Where: Gene Expression Omnibus; Stanford Tissue Microarray Database. Reference: N.H. Shah, C. Jonquet, A.P. Chiang, A.J. Butte, R. Chen, and M.A. Musen. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics, 10(Suppl 2):S1, 2009.
  • 23. Web Indexing. Why: most often, sequence retrieval tools and sequence analysis tools are separated; the usage of sequence DBs is often general and limited to keyword searching and entry retrieval; discovering and accessing the appropriate bioinformatics resource for a specific task has become increasingly important.
  • 24. Web Indexing. Who: SIRW (a Web Server for the Simple Indexing and Retrieval System). What: a WWW interface to the Simple Indexing and Retrieval (SIR) system to parse and index flat file DBs; a framework for doing sequence analysis for selected biological sequences. Where: sequence analysis (motif pattern searches). Reference: C. Ramu. SIRW: a web server for the Simple Indexing and Retrieval System that combines sequence motif searches with keyword searches. Nucleic Acids Research, 31(13), pp. 3771-3774, 2003.
  • 25. Web Indexing. Who: BIRI (BIoinformatics Resource Inventory). What: an approach for automatically discovering and indexing public bioinformatics resources. Where: the scientific literature. Reference: G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V. Maojo. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics, 10:320, 2009.
  • 26. DB Search. Why: a wealth of bioinformatics tools and databases has been created over the last decade and most are freely available; it is often desirable to visualize the database hits stacked according to the query sequence; there is no inventory presenting an up-to-date and easily searchable index of all these resources.
  • 27. DB Search. Who: MView (Multiple alignment Viewer). What: a tool for converting the result of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query. Where: multiple alignment. Reference: N.P. Brown, C. Leroy, and C. Sander. MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14(4), pp. 380-381, 1998.
  • 28. DB Search. Who: BioWareDB. What: an extensive and current catalog of software and DBs of relevance to researchers in the field of biology and medicine. Where: current and available biomedical computing resources. Reference: M.W. Matthiessen. BioWareDB: the biomedical software and database search engine. Bioinformatics, 19(17), pp. 2319-2320, 2003.
  • 29. Web Search. Why: today, scientists can easily post their research findings on the Web or compare their discoveries with previous work; manually maintaining a wrapper library will not scale to accommodate the growth of genomics data sources on the Web.
  • 30. Web Search. Who: ---. What: an automated system able to find, classify, and wrap new sources without constant human intervention. Where: distributed genomics data sources. Reference: D. Rocco and T. Critchlow. Automatic discovery and classification of bioinformatics Web sources. Bioinformatics, 19(15), pp. 1927-1933, 2003.
  • 31. Web Search. Who: GoPubMed. What: an ontology-based literature search applied to Gene Ontology (GO) and PubMed. Where: scientific literature. Reference: R. Delfs, A. Doms, A. Kozlenkov, and M. Schroeder. GoPubMed: ontology-based literature search applied to gene ontology and PubMed. In Proceedings of German Bioinformatics Conference, pp. 169–178, 2004.
  • 32. Information Filtering. Why: in the Web 2.0 scenario, users look for collaborative environments in which they can meet further users with similar preferences and needs; researchers need to search for and/or generate specialized datasets that meet specific requirements.
  • 33. Information Filtering. Who: ProDaMa-C (Protein Dataset Management – Collaborative). What: a web application aimed at generating specialized protein structure datasets and favouring collaboration among researchers. Where: protein structures. Reference: G. Armano and A. Manconi. A Collaborative Web Application for Supporting Researchers in the Task of Generating Protein Datasets. Advances in Distributed Agent-based Retrieval Tools, V. Pallotta, A. Soro, E. Vargiu (eds.), Springer-Verlag, 2011.
  • 34. Information Filtering. Who: Gene Recommender. What: an algorithm that ranks genes according to how strongly they correlate with a set of query genes. Where: analysis of gene expression. Reference: A.B. Owen, J. Stuart, K. Mach, A.M. Villeneuve, S. Kim. A gene recommender algorithm to identify coexpressed genes. Genome Research, 13(8), pp. 1828-37, 2003.
  • 35. Text Mining. Why: web-based tools capable of filtering public DBs are more and more required; interesting and useful information, relevant to the researcher, could appear in documents (e.g., papers) they have not read and therefore be missed entirely; of paramount importance to DB search methods is a reliable means of distinguishing true hits from false hits; biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but the link to the original article is lost.
  • 36. Text Mining. Who: MedMiner. What: an Internet text mining tool that filters the literature and presents the most relevant portions in a well-organized way that facilitates understanding. Where: gene expression profiling. Reference: L. Tanabe, U. Scherf, L.H. Smith, J.K. Lee, L. Hunter, and J.N. Weinstein. MedMiner: an Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling. Biotechniques, 27(6), pp. 1210-4, 1999.
  • 37. Text Mining. Who: BioRAT. What: a research assistant that, given a query, autonomously finds a set of papers, reads them, and highlights the most relevant facts in each. Where: scientific literature. Reference: D. P. A. Corney, B. F. Buxton, W. B. Langdon, and D. T. Jones. BioRAT: Extracting biological information from full-length papers. Bioinformatics, 20(17), pp. 3206–3213, 2004.
  • 38. Text Mining. Who: SAWTED (Structure Assignment With Text Description). What: an automated system for filtering DB hits. Where: homologues annotation. Reference: R.M. MacCallum, L.A. Kelley, and M.J. Sternberg. SAWTED: structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics, 16(2), pp. 125-9, 2000.
  • 39. Text Mining. Who: PathText. What: a system to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. Where: pathway visualizations. Reference: B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, and J. Tsujii. PathText: a text mining integrator for biological pathway visualizations. Bioinformatics, 26(12), pp. i374-i381, 2010.
  • 40. Text Categorization. Why: information in text form, such as MEDLINE records, is a greatly underutilized source of biological information; individual researchers find it difficult to keep up with all the new, relevant information; systems that extract structured information from natural language passages have been highly successful in specialized domains; the time is ripe for developing such applications for molecular biology and genomics.
  • 41. Text Categorization. Who: --. What: constructing biological knowledge bases by extracting information from text sources. Where: MEDLINE. Reference: M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999.
  • 42. Text Categorization. Who: Genies. What: a natural-language processing system for the extraction of molecular pathways. Where: scientific publications. Reference: C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, pp. 574–582, 2001.
  • 43. Hierarchical Text Categorization. Why: a great deal of genomics information accumulated through the years is available in online text repositories (such as MEDLINE); these resources still do not provide adequate mechanisms for retrieving the required information; traditional filtering techniques based on keyword search are often inadequate to express what the user is really searching for; web repositories, such as Medical Subject Headings (MeSH) in MEDLINE, encompass an underlying taxonomy.
  • 44. Hierarchical Text Categorization. Who: --. What: a tool for assisting biologists with literature search for the task of associating genes with Gene Ontology codes. Where: MEDLINE. Reference: S. Kiritchenko, S. Matwin, and A. F. Famili. Hierarchical text categorization as a tool of associating genes with gene ontology codes. In 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, pp. 26–30, 2004.
  • 45. Hierarchical Text Categorization. Who: Pub.MAS. What: a multiagent system for retrieving and classifying publications. Where: BMC Bioinformatics; PubMed Central. Reference: G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007.
  • 46. Case Study: Retrieving and Filtering Bioinformatics Publications
  • 47. An IR Task (diagram). Stages: Information Extraction (wrapping information sources); Text Categorization (taxonomic classification of items); User's Feedback (adaptive behavior). Data labels: Online Repositories; Extracted Data/Information; Selected Data/Information.
  • 48. Information Extraction. Essential to retrieve documents provided by heterogeneous and distributed sources. Reference: A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J.S. Teixeira. A brief survey of web data extraction tools. SIGMOD Rec. 31(2), pp. 84–93, 2002.
  • 49. Text Categorization. It is the task of determining and assigning topical labels to content. Typical approaches to text categorization: statistical; semantic. In recent years several researchers have investigated the use of hierarchies for text categorization. Reference: F. Sebastiani. A tutorial on automated text categorisation. Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp. 7-35, 1999.
  • 50. Users' Feedback. It is aimed at dealing with any feedback provided by the user. In semiautomated classification and adaptive filtering, we may expect the user of a classifier to provide feedback on how test documents have been classified. In this case, further training may be performed during the operating phase.
  • 51. Hierarchical Text Categorization. Hierarchical Text Categorization (HTC) deals with problems where categories are organized in the form of a hierarchy. Reference: D. Koller, M. Sahami. Hierarchically classifying documents using very few words. Proceedings of 14th International Conference on Machine Learning, pp. 170–178, 1997.
  • 52. HTC at a Glance. HTC studies how to improve the performance of classical text categorization techniques by exploiting the knowledge of the taxonomic relationships among classes.
  • 53. Motivations. People organize large collections of documents in hierarchies of topics, or arrange a large body of knowledge in ontologies. The main goal of automatic text categorization is to deal with underlying taxonomies. A hierarchical approach can give benefits in real-world scenarios, characterized by information overload and imbalanced data.
  • 54. HTC Approaches. Pachinko machine: at each level of the hierarchy, the classifier selects the single most probable category, then goes down the hierarchy inspecting only the children of the selected nodes. Probabilistic hierarchical local approach: at each level of the hierarchy, the classifier makes probabilistic decisions, then selects the leaf categories on the most probable paths. Reference: S. Kiritchenko. Hierarchical text categorization and its application to bioinformatics. Ph.D. Thesis, University of Ottawa, Canada, 2006.
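The pachinko-machine strategy on slide 54 can be illustrated with a minimal sketch. The taxonomy, the keyword lists and the `keyword_score` function below are hypothetical stand-ins chosen for illustration; a real system would plug in trained classifiers instead.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy (parent -> children); not the taxonomy used in the talk.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["biology", "computing"],
    "biology": ["genomics", "proteomics"],
    "computing": ["databases", "algorithms"],
}

# Hypothetical keyword lists standing in for trained per-category classifiers.
KEYWORDS: Dict[str, List[str]] = {
    "biology": ["gene", "protein"],
    "computing": ["index", "query"],
    "genomics": ["gene", "genome"],
    "proteomics": ["protein", "structure"],
    "databases": ["index", "storage"],
    "algorithms": ["query", "sort"],
}


def keyword_score(doc: str, category: str) -> float:
    """Toy scorer: number of keyword occurrences for the category."""
    words = doc.lower().split()
    return float(sum(words.count(k) for k in KEYWORDS[category]))


def pachinko_classify(doc: str,
                      score: Callable[[str, str], float],
                      node: str = "root") -> List[str]:
    """At each level keep only the best-scoring child, then descend into it."""
    path: List[str] = []
    while node in TAXONOMY:  # stop once a leaf is reached
        node = max(TAXONOMY[node], key=lambda child: score(doc, child))
        path.append(node)
    return path


path = pachinko_classify("protein structure and gene expression of a protein",
                         keyword_score)
```

Because only one child survives at each level, an early mistake cannot be recovered further down, which is the usual motivation for the probabilistic variant listed on the same slide.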
  • 55. HTC Approaches. Local classifier per node: each classifier decides whether to forward the document to its children. Local classifier per parent node: each classifier decides to which subtree(s) the document should be sent. Local classifier per level: the number of outputs per level grows while going down through the taxonomy. Global classifier: one classifier is trained, able to discriminate among all categories. Reference: C.J. Silla and A. Freitas. A survey on hierarchical classification across different application domains. Journal of Data Mining and Knowledge Discovery, 2(1-2), pp. 31-72, 2010.
  • 56. Progressive Filtering. Progressive Filtering (PF) is a simple categorization technique that operates on hierarchically structured categories. A way to implement PF consists of decomposing a given rooted taxonomy into pipelines, one for each path that exists between the root and each node of the taxonomy. Each node is a binary classifier able to recognize whether or not an input belongs to the corresponding class. A threshold selection algorithm (TSA) can be run to identify an optimal, or sub-optimal, combination of thresholds for each pipeline. Reference: A. Addis, G. Armano, E. Vargiu. Assessing Progressive Filtering to Perform Hierarchical Text Categorization in Presence of Input Imbalance. Proceedings of International Conference on Knowledge Discovery and Information Retrieval, pp. 14-23, 2010.
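The decomposition into pipelines described on slide 56 can be sketched as follows. This is a minimal illustration, assuming the root itself is not a classifier and that each pipeline carries its own list of thresholds; the function names and the toy taxonomy are hypothetical.

```python
from typing import Callable, Dict, List


def pipelines(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return one pipeline (path from below the root to a node) per non-root node."""
    result: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        for child in taxonomy.get(node, []):
            path = prefix + [child]
            result.append(path)
            walk(child, path)

    walk(root, [])
    return result


def pipeline_accepts(doc,
                     pipeline: List[str],
                     score: Callable[[object, str], float],
                     thresholds: List[float]) -> bool:
    """A document is accepted only if every binary classifier on the path accepts it."""
    return all(score(doc, node) >= t for node, t in zip(pipeline, thresholds))


# Toy taxonomy and precomputed scores (assumptions for illustration only).
toy_taxonomy = {"root": ["A", "B"], "A": ["A1", "A2"]}
toy_doc = {"A": 0.9, "A1": 0.4, "A2": 0.7, "B": 0.1}
toy_score = lambda doc, node: doc[node]

all_pipelines = pipelines(toy_taxonomy, "root")
```

Keeping the thresholds per pipeline rather than per node reflects the remark on slide 60 that the same classifier may behave differently depending on the pipeline it is embedded in.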
  • 57. PF at a Glance. Starting from the root, each input traverses the taxonomy as a “token”.
  • 58. Classifiers in PF. Partitioning the taxonomy into pipelines gives rise to a set of new classifiers, each represented by a pipeline.
  • 59. Classifiers in PF
  • 60. Classifiers in PF. The same classifier may have different behaviours, depending on which pipeline it is embedded in. Each pipeline can be considered in isolation from the others.
  • 61. Threshold Selection in PF. A relevant problem is how to calibrate the thresholds of the binary classifiers embedded in each pipeline in order to optimize the pipeline behaviour. Searching for an optimal or sub-optimal combination of thresholds in a pipeline can be viewed as the problem of finding a maximum of a utility function F that depends on the corresponding threshold vector θ.
  • 62. TSA. For each pipeline, the best combination of thresholds is calculated by a bottom-up algorithm that uses two functions: Repair, which increases or decreases (↑ / ↓) the threshold until the utility function reaches a maximum; and Calibrate, which recursively operates downward from the given classifier by repeatedly calling Repair (↑ / ↓). Reference: A. Addis, G. Armano, E. Vargiu. A comparative experimental assessment of a threshold selection algorithm in hierarchical text categorization. In: Advances in Information Retrieval. The 33rd European Conference on Information Retrieval (ECIR 2011), 2011.
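The slide gives only a high-level description of Repair and Calibrate, so the following is a simplified greedy hill-climbing sketch in that spirit, not the published TSA. The step size, the starting thresholds, the [0, 1] bounds and the toy utility function are all assumptions made for illustration.

```python
from typing import Callable, List

Utility = Callable[[List[float]], float]


def repair(thresholds: List[float], k: int, utility: Utility,
           step: float, direction: int) -> List[float]:
    """Move thresholds[k] up (+1) or down (-1) while the utility keeps improving."""
    best = utility(thresholds)
    while True:
        candidate = thresholds[k] + direction * step
        if not 0.0 <= candidate <= 1.0:
            break
        trial = thresholds[:k] + [candidate] + thresholds[k + 1:]
        value = utility(trial)
        if value <= best:
            break
        thresholds, best = trial, value
    return thresholds


def calibrate(thresholds: List[float], k: int, utility: Utility,
              step: float) -> List[float]:
    """Repair classifier k in both directions, then recurse on the classifiers below it."""
    thresholds = repair(thresholds, k, utility, step, +1)
    thresholds = repair(thresholds, k, utility, step, -1)
    if k + 1 < len(thresholds):
        thresholds = calibrate(thresholds, k + 1, utility, step)
    return thresholds


def select_thresholds(n: int, utility: Utility,
                      step: float = 0.1, start: float = 0.0) -> List[float]:
    """Bottom-up pass: begin with the deepest classifier and move towards the root."""
    thresholds = [start] * n
    for k in reversed(range(n)):
        thresholds = calibrate(thresholds, k, utility, step)
    return thresholds


# Toy utility with a known maximum at (0.3, 0.6); a real F would be measured on data.
TARGET = [0.3, 0.6]
toy_utility = lambda th: -sum((t - g) ** 2 for t, g in zip(th, TARGET))

found = select_thresholds(2, toy_utility)
```

In practice F would be an evaluation measure computed on a validation set for the pipeline; the quadratic toy utility is used here only so that the expected maximum is known in advance.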
  • 63. TSA: An Example
  • 64. The Prototype. MultiAgent architecture: X.MAS. Agent framework: JADE. References: A. Addis, G. Armano, E. Vargiu. From a Generic Multiagent Architecture to Multiagent Information Retrieval Systems. In: AT2AI-6, Sixth International Workshop, From Agent Theory to Agent Implementation, pp. 3–9, 2008. F. Bellifemine, G. Caire, D. Greenwood. Developing Multi-Agent Systems with JADE (Wiley Series in Agent Technology). John Wiley and Sons, 2007.
  • 65. X.MAS at a Glance. Macro-architecture.
  • 66. X.MAS at a Glance. Micro-architecture (diagram labels): Information Agent (Scheduler, Source); Middle Agent (Scheduler, Dispatcher); Filter Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Task Agent (Scheduler, Actuator); Middle Agent (Scheduler, Dispatcher); Interface Agent (Scheduler).
  • 67. Pub.MAS
  • 68. Pub.MAS. Reference: G. Armano, A. Manconi, and E. Vargiu. A MultiAgent System for Retrieving Bioinformatics Publications from Web Sources. IEEE Transactions on Nanobioscience, Special Session on GRID, Web Services, Software Agents and Ontology Applications for Life Science, 6(2), pp. 104-109, 2007.
  • 69. Information Extraction. It is supported by a set of agents explicitly devoted to wrapping the selected information sources and encoding the extracted documents. An information agent wraps the BMC Bioinformatics web site (HTML wrapper). An information agent wraps the PubMed Central digital archive (web service wrapper).
  • 70. Hierarchical Text Categorization. The PF approach previously described has been implemented. Documents are encoded to: remove all non-informative words; remove the most common morphological and inflexional suffixes; select the relevant features; generate a feature vector for each document. Classification is performed by wkNN classifiers; the score is assigned using non-parametric density estimation of the “a posteriori” probability.
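The encoding steps and the wkNN scoring on slide 70 can be illustrated with a small sketch. The stopword list, the suffix list and the similarity-weighted vote below are assumptions; the slide does not specify the actual stemmer, feature selection method or weighting used in Pub.MAS, and feature selection is omitted here for brevity.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "ed", "s")  # crude suffix stripping, not a real stemmer


def encode(text: str) -> Counter:
    """Lower-case, drop stopwords, strip common suffixes, return term counts."""
    terms = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return Counter(terms)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def wknn_score(doc: Counter, training: List[Tuple[Counter, bool]], k: int = 3) -> float:
    """Similarity-weighted share of the k nearest neighbours labelled positive."""
    neighbours = sorted(((cosine(doc, vec), label) for vec, label in training),
                        key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    positive = sum(sim for sim, label in neighbours if label)
    return positive / total if total else 0.0


# Hypothetical training set for one binary classifier of the taxonomy.
training = [
    (encode("protein structure prediction"), True),
    (encode("protein folding"), True),
    (encode("web search engines"), False),
]
score_bio = wknn_score(encode("predicting protein structures"), training)
score_web = wknn_score(encode("search engines for the web"), training)
```

The returned value lies in [0, 1] and can be read as a rough estimate of the posterior probability of the class, which is what a pipeline threshold would then be compared against.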
  • 71. The Adopted Taxonomy. Reference: P. G. Baker, C. A. Goble, S. Bechhofer, N. W. Paton, R. Stevens, and A. Brass. An ontology for bioinformatics applications. Bioinformatics, 15(6), pp. 510–520, 1999.
  • 72. The Adopted Taxonomy
  • 73. The Adopted Taxonomy
  • 74. Users' Feedback. User feedback is aimed at dealing with any feedback provided by the user. Two solutions have been tried: training an ANN; using a kNN classifier.
  • 75. Experiments. Different kinds of tests have been performed, each aimed at highlighting a specific issue: we estimated the (normalized) confusion matrix for each classifier belonging to the highest level of the taxonomy; we studied the impact of taking into account pipelines of classifiers, also trying to assess whether a residual independence was in fact present; we assessed the solution devised for implementing user’s feedback, based on the k-NN technique.
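As a small illustration of the first kind of test, a normalized confusion matrix for one binary classifier might be computed as below. Normalizing each row by the number of items in the actual class is an assumption; the slide does not say which normalization was used, and the labels are made up.

```python
from typing import List, Sequence


def normalized_confusion(actual: Sequence[int],
                         predicted: Sequence[int]) -> List[List[float]]:
    """Return [[TN, FP], [FN, TP]] with every row divided by its own total."""
    counts = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        counts[int(a)][int(p)] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]


# Made-up labels: 1 = document belongs to the category, 0 = it does not.
matrix = normalized_confusion([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

With row normalization, the diagonal entries are the per-class recall values, which stay comparable across categories even when the test sets have different sizes.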
  • 76. Experiments. Tests have been performed using selected publications extracted from the BMC Bioinformatics site and from the PubMed Central digital archive. Publications have been classified by a domain expert according to the proposed taxonomy. For each item of the taxonomy, a set of about 100-150 articles has been selected to train the corresponding wkNN classifier, and 300-400 articles have been used to test it.
  • 77. Conclusions
  • 78. Conclusions. Bioinformatics needs suitable, automated, and “intelligent” solutions to acquire, analyse, organize, and store biological data. IR might be very useful for tackling bioinformatics problems. Currently, few IR techniques have been adopted to solve some bioinformatics tasks. A system aimed at retrieving and filtering bioinformatics publications has been presented as a case study. We argue that further investigations and experiments could be made to exploit IR in bioinformatics.
  • 79. Acknowledgments. This work was partially supported by the Italian Ministry of Education (Investment funds for basic research), under the project ITALBIONET (Italian Network of Bioinformatics). I wish to thank all the IASC Group members for their valuable help. IASC Group members are: G. Armano (head); A. Addis, F. Mascia and E. Vargiu (PhD, Post Doc); A. Giuliani, N. Hatami, M. Javarone and F. Ledda (PhD students); S. Curatti (collaborator, programmer). I also wish to thank Andrea Manconi for his suggestions.
  • 80. Thanks for your attention! Contact: Eloisa Vargiu, vargiu@diee.unica.it