Bioinformatics: biology by other means

               Alberto Labarga
             UGR, November 2008
Computational Biology
          Bioinformatics
     [Biological Information]
Toward a scientific theory of heredity




1859         1866         1870             1900   1902
In 1859 Charles Darwin publishes 'On the Origin of Species',
proposing that living beings are the result of natural selection
and that all creatures have evolved over the generations
through small changes.

Mendel's laws,
published in 1866,
rediscovered in 1900

In 1870, the Swiss scientist Friedrich Miescher, working in
Germany, isolates the components stored in the cell nucleus,
made up mainly of proteins and nucleic acids. At the time,
the element that stored hereditary information was believed
to be protein, built from 20 amino acids, whereas nucleic
acids had only 4 components.

At the start of the 20th century, Phoebus Levene discovered
that DNA is a chain of nucleotides, in which each nucleotide
is composed of a sugar (deoxyribose), a phosphate group, and
a nitrogenous base of one of four types: adenine, thymine,
guanine, and cytosine.

Walter Sutton, a graduate student in E. B. Wilson’s
lab at Columbia University, observed that in the
process of cell division, called meiosis, that produces
sperm and egg cells, each sperm or egg receives only
one chromosome of each type. (In other parts of the
body, cells have two chromosomes of each type, one
inherited from each parent.) The segregation pattern
of chromosomes during meiosis matched the
segregation patterns of Mendel’s genes.




The discovery of DNA




1928        1944       1949   1952   1953
1928 Frederick Griffith: the transforming principle

When he mixed R pneumococci with S pneumococci previously
killed by heat, the mice died. Moreover, in the blood of
these dead mice Griffith found encapsulated (S) pneumococci.

In 1944 Oswald Avery and his colleagues, who were studying
Pneumococcus, the bacterium that causes pneumonia, discovered
that bacteria contain nucleic acids and that the DNA molecule
is what stores the genes. Other studies with viruses went on
to confirm this theory, even though DNA was still believed
to be too simple.

Life can be seen as a process of storing and transmitting
biological information.
Chromosomes are the carriers of this information.
The information is stored in the form of a molecular code.
To understand life we must identify these molecules and
decipher the code.

1949 DNA is duplicated during cell division
     Chargaff: A = T and G = C

1952 - Hershey-Chase Experiment




M.H.F. Wilkins, A.R. Stokes, H.R. Wilson:
               Molecular Structure of Deoxypentose Nucleic
               Acids. Nature 171, 738 (1953)



                         R.E. Franklin and R.G. Gosling
                         Molecular Configuration in Sodium
                         Thymonucleate, Nature 171, 740
                         (1953)




MOLECULAR STRUCTURE
OF NUCLEIC ACIDS
“We wish to propose a
structure for the salt of
desoxyribose nucleic acid
(DNA). This structure has
novel features which are of
considerable biological
interest”
Nature, 25 April 1953

“It has not escaped our
              attention that the specific
              pairing we have
              postulated immediately
              suggests a possible
              copying mechanism for
              the genetic material.”




The base pairs
Challenges of Bioinformatics
In 1955 Ochoa, together with the French-Russian biochemist
Marianne Grunberg-Manago, published in the Journal of the
American Chemical Society the isolation of an enzyme from the
colon bacillus that catalyzes the synthesis of RNA, the
intermediary between DNA and proteins. The discoverers named
the enzyme "polynucleotide phosphorylase"; it was at first
taken to be the cell's RNA-synthesizing enzyme, a role now
attributed to RNA polymerase. The discovery of polynucleotide
phosphorylase made it possible to prepare synthetic
polynucleotides of varying base composition, with which
Severo Ochoa's group, in parallel with Marshall Nirenberg's
group, went on to decipher the genetic code.




1955          1959                   1962             1966
When Perutz arrived in Cambridge, the largest molecular
structure that had been solved was that of the pigment
phthalocyanine, with 58 atoms. A protein has thousands of
atoms. Bernal, his supervisor, had taken some X-ray
diffraction images of crystals of a protein, pepsin, but
without managing to interpret them. The subject Perutz chose
for his thesis was another protein, hemoglobin, the oxygen
carrier that gives our blood its red color. Hemoglobin has
no fewer than 11,000 atoms. It took him 23 years.

Over the course of several years,
Marshall Nirenberg, Har Gobind Khorana,
Severo Ochoa, and their colleagues
elucidated the genetic code – showing
how nucleic acids with their 4-letter
alphabet determine the order of the 20
kinds of amino acids in proteins.
Messenger RNA is interpreted three
letters at a time; a set of three
nucleotides forms a "codon" that
encodes an amino acid. A three-letter
word made of four possible letters can
have 64 (4 x 4 x 4) permutations, which
is more than enough to encode the 20
amino acids in living beings.
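
The codon arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `all_codons` and `split_into_codons` are made up for this example, and the short mRNA string is an arbitrary toy input.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # the 4-letter RNA alphabet


def all_codons(alphabet=NUCLEOTIDES, length=3):
    """Enumerate every possible word of the given length over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=length)]


def split_into_codons(mrna):
    """Read an mRNA string three letters at a time, ignoring a trailing partial codon."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


print(len(all_codons()))               # number of distinct three-letter words
print(split_into_codons("AUGGCCUAA"))  # toy message read codon by codon
```

Because 64 codons map onto only 20 amino acids (plus stop signals), several codons necessarily encode the same amino acid.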




From DNA to protein
Understanding the mechanisms, building the tools




1970    1971           1975         1977             1980
The Central Dogma

Created in 1971
                     with seven
                     structures




Recombinant DNA is a DNA molecule formed by joining two
heterologous molecules, that is, molecules of different
origin.
It is made using restriction enzymes, which are able to
"cut" DNA at specific points.
Put very simply, we "cut out" a human gene and "paste" it
into the DNA of a bacterium; if, for example, it is the gene
that regulates the production of insulin, placing it in a
bacterium "forces" the bacterium to produce insulin.

A precursor-RNA may often be matured to
                      mRNAs with alternative structures. An example
                      where alternative splicing has a dramatic
                     consequence is somatic sex determination in the
                     fruit fly Drosophila melanogaster.

                     In this system, the female-specific Sxl protein
                     is a key regulator. It controls a cascade of
                     alternative RNA splicing decisions that finally
                     result in female flies.




Understanding the mechanisms, building the tools




1981   1982   1983       1985       1987             1990
Read out the letters from a DNA sequence




                                GTGAGGCGCTGC




1983 The polymerase chain reaction (PCR) is a molecular
biology technique, described in 1986 by Kary Mullis, whose
aim is to obtain a large number of copies of a particular
DNA fragment starting from a minimal amount; in theory, a
single copy of the original fragment, or template, is enough.

Total nucleotides (Nov 07): 188,490,792,445
Number of entries (Nov 07): 106,144,026

The Human Genome Project (HGP) aims to determine the relative
positions of all the nucleotides (or base pairs) of the human
genome and to identify the estimated 100,000 genes present
in it.
The project, funded with 3 billion dollars, was founded in
1990 by the US Department of Energy and the National
Institutes of Health, with a planned duration of 15 years.

"Imagine several copies of a book, each cut into 10 million
small pieces in such a way that the pieces overlap. Suppose
1 million pieces have been lost and the other 9 million are
stained with ink. Recover the original text."
HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by
fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The
genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones
are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct
the sequence of the genome.
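
The final step, assembling overlapping fragments back into one sequence, can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions (error-free reads, no repeats, no read contained inside another) and is not the method used by the genome project; the names `overlap` and `greedy_assemble` and the example reads are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap until none is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the result stays as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical overlapping reads drawn from a short made-up sequence.
print(greedy_assemble(["GTGAGGCG", "GGCGCTGC", "CTGCATTACG"]))
```

Real assemblers must also cope with sequencing errors, repeats, and missing coverage, which is what makes the "torn book" puzzle hard.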
Deciphering the book of life




1990              1995     1996      1997 1998 1999   2001
Altschul, S.F. et al. (1990), "Basic Local Alignment Search
   Tool", J. Mol. Biol., 215(3): 403-410. 15,306 citations

Altschul, S.F. et al. (1997), "Gapped BLAST and PSI-BLAST: a
   new generation of protein database search programs",
   Nucleic Acids Res., 25(17): 3389-3402.




• SSAHA (Ning et al., 2001)
   •   http://www.sanger.ac.uk/Software/analysis/SSAHA/
   •   SSAHA is an algorithm for very fast matching and alignment of DNA
       sequences. It stands for Sequence Search and Alignment by Hashing
       Algorithm. It achieves its fast search speed by converting sequence
       information into a `hash table' data structure, which can then be
       searched very rapidly for matches.

• BLAT (J. Kent, 2002)
   •   http://genome.ucsc.edu/cgi-bin/hgBlat
   •   BLAT on DNA is designed to quickly find sequences of 95% and greater
       similarity of length 40 bases or more. It may miss more divergent or
       shorter sequence alignments. It will find perfect sequence matches of 33
       bases, and sometimes find them down to 20 bases. BLAT on proteins
       finds sequences of 80% and greater similarity of length 20 amino acids
       or more.
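
The hash-table idea behind tools such as SSAHA can be sketched as a k-mer index. This is a simplified illustration, not the SSAHA or BLAT implementation: it indexes every overlapping k-mer of a single sequence, and the names `build_index` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def build_index(seq, k):
    """Map every k-letter word of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(index, query, k):
    """Return (query_pos, subject_pos) pairs where a k-mer of the query occurs in the subject."""
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


subject = "GTGAGGCGCTGC"
index = build_index(subject, k=4)
print(find_seeds(index, "AGGCGC", k=4))
```

Hits that share the same offset (subject position minus query position) lie on one diagonal and suggest a longer match, which a real tool would then extend into an alignment.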
Thompson, J., Gibson, T., Higgins, D. (1994) CLUSTAL W:
   improving the sensitivity of progressive multiple sequence
   alignment … Nucleic Acids Res., 22, 4673-4680.

Flowchart of computation steps in
Clustal W (Thompson et al., 1994)

1.   Pairwise alignment: calculation of distance matrix
2.   Creation of unrooted neighbor-joining tree
3.   Rooted NJ tree (guide tree) and calculation of sequence weights
4.   Progressive alignment following the guide tree
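
The first step of the flowchart, a distance matrix built from pairwise alignments, might look like the sketch below. The scoring values and the conversion from alignment score to distance are assumptions chosen for simplicity; they are not Clustal W's actual parameters or formula, and `nw_score` and `distance_matrix` are illustrative names.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, computed row by row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def distance_matrix(seqs, match=1):
    """Crude dissimilarity: 1 - score / best possible score for the longer sequence."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = nw_score(seqs[i], seqs[j], match=match)
            d = 1.0 - score / (match * max(len(seqs[i]), len(seqs[j])))
            dist[i][j] = dist[j][i] = d
    return dist


for row in distance_matrix(["ACGT", "ACGA", "TTTT"]):
    print(row)
```

A tree-building method such as neighbor joining would take this matrix as input to produce the guide tree used in the later steps.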
Other methods


Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for
   fast and accurate multiple sequence alignment. J. Mol. Biol, 302, 205–217.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high
   accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5:
   improvement in accuracy of multiple sequence alignment. Nucleic Acids
   Res, 33, 511–518.

Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple
   sequence alignment algorithm. BMC Bioinformatics , 6, 298.

Larkin M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics 2007
   23(21): 2947-2948.
Tree of Life




http://tolweb.org/tree/phylogeny.html       http://itol.embl.de/
1995
              • The first complete genome of an organism:
                Haemophilus influenzae.

1996
• The yeast genome is completed:
approximately 6,000 genes and
14,000,000 base pairs

1997

• The genome of the bacterium E. coli is sequenced:
4,600 genes,
4.5 million nucleotides.

1998

The genome of the worm
Caenorhabditis elegans
has 18,000 genes and about
100 million nucleotides

1999
   • The complete sequence of chromosome 22 is obtained.
   The HGP is ahead of schedule.
   The small number of genes found (about 300) comes as
   a surprise.

Fire A, Xu S, Montgomery M, Kostas
S, Driver S, Mello C (1998). "Potent
and specific genetic interference by
double-stranded RNA in
Caenorhabditis elegans". Nature 391
(6669): 806–11. doi:10.1038/35888.
PMID 9486653
Hamilton A, Baulcombe D
(1999). "A species of small
antisense RNA in
posttranscriptional gene
silencing in plants". Science
286 (5441): 950–2.
PMID 10542148
Dr Alan Wolffe (1999)
• Epigenetics is heritable
  changes in gene expression
  that occur without a change
  in DNA sequence
• Such changes cannot be
  attributed to changes in DNA
  sequence (mutations)
• They are as irreversible as
  mutations (or difficult to
  reverse)
Gene prediction




            Where are the genes?




                  In humans:

                  ~22,000 genes
                  ~1.5% of human DNA
The GENCODE pipeline

1.   mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the
     human genome
2.   manual curation to resolve conflicting evidence
3.   additional computational predictions
4.   experimental verification
5.   FINAL ANNOTATION
Genome annotation - building a pipeline

                          Genome sequence



       Map repeats             Map ESTs                       Map Peptides


                                                 Genefinding


                           nc-RNAs                  Protein-coding genes


                                                    Functional annotation


                                   Release


August 2008            Bioinformatics tools for Comparative                  64
                               Genomics of Vectors
Genefinding - ab initio predictions

    Use compositional features of the DNA sequence to define coding
   segments (essentially exons)
          ORFs
          Coding bias
          Splice site consensus sequences
          Start and stop codons
    Each feature is assigned a log likelihood score
    Use dynamic programming to find the highest scoring path
    Need to be trained using a known set of coding sequences
    Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
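
As a small illustration of two of the features listed above (ORFs, start and stop codons), the sketch below scans the three forward reading frames for ATG...stop stretches. It is a toy example, not how Genefinder, Augustus, or the other listed programs work: it ignores the reverse strand, introns, and scoring, and `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) half-open intervals of ATG...stop ORFs on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:  # codons before the stop
                    orfs.append((start, i + 3))
                start = None
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

A real gene finder would turn such signals into log-likelihood scores and combine them with dynamic programming, as the slide describes.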




ab initio prediction

[Figure: genome tracks showing coding potential, ATG and stop codons, and splice sites, from which the best prediction is found]
Genefinding - similarity

  Use known coding sequence to define coding regions
        EST sequences
        Peptide sequences
  Needs to handle fuzzy alignment regions around splice sites
  Needs to attempt to find start and stop codons
  Examples: EST2Genome, exonerate, genewise


  Use 2 or more genomic sequences to predict genes based on
 conservation of exon sequences
  Examples: Twinscan and SLAM




Similarity-based prediction


Genome


                                                Align
           cDNA/peptide



                                                Create prediction




Example of a simple HMM




    Top: model architecture and parameters. Bottom: sequence generation process.
    green: state transition probabilities, red: emission probabilities.
    Prob(sequence, path|model) = 6.8e-8.
EPFL – Bioinformatics I – 05 Dec 2005
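
The dynamic-programming search for the highest-scoring path through an HMM is usually done with the Viterbi algorithm. The sketch below uses a made-up two-state model (N for AT-rich non-coding, C for GC-rich coding); all probabilities are hypothetical and are not the parameters of the model in the figure.

```python
import math

# Hypothetical two-state model, for illustration only.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log joint probability."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best previous state to arrive from, scored in log space.
            prev, best = max(
                ((p, scores[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda t: t[1],
            )
            row[s] = best + math.log(emit_p[s][symbol])
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the end to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return path, scores[-1][last]


path, logp = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
print("".join(path), math.exp(logp))
```

The value returned alongside the path is the joint probability of the sequence and that path under the model, the same kind of quantity as the Prob(sequence, path|model) quoted in the caption.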
Automatic Annotation vs Manual

Automatic Annotation
• Quick whole genome analysis ~ weeks
• Consistent annotation
• Use unfinished sequence/shotgun assembly
• No polyA sites/signals, pseudogene
• Predicts ~70% loci

Manual Annotation
• Extremely slow ~3 months Chr 6
• Need finished seq
• Flexible, can deal with inconsistencies in data
• Most rules have exception
• Consult publications as well as databases
Analysis EGASP predictions vs manual annotation
[Figure: four bar charts of sensitivity (Sn) and specificity (Sp) at the exon, nucleotide, transcript, and gene levels for predictions 9_101_1, 20_79_1, 36_46_1, and 41_77_1]
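
The Sn and Sp values in this comparison can be illustrated at the nucleotide level. The sketch assumes the convention common in gene prediction, Sn = TP / (TP + FN) and Sp = TP / (TP + FP); the slide itself does not state the formulas, and the interval data and function names are invented.

```python
def to_positions(intervals):
    """Expand half-open (start, end) intervals into a set of nucleotide positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_sn_sp(annotated, predicted):
    """Nucleotide-level sensitivity and specificity of a prediction against an annotation."""
    truth, pred = to_positions(annotated), to_positions(predicted)
    tp = len(truth & pred)                  # coding in both annotation and prediction
    sn = tp / len(truth) if truth else 0.0  # fraction of annotated bases recovered
    sp = tp / len(pred) if pred else 0.0    # fraction of predicted bases that are annotated
    return sn, sp


# Hypothetical exon coordinates.
print(nucleotide_sn_sp(annotated=[(0, 100), (200, 300)], predicted=[(0, 100)]))
```

Exon, transcript, and gene level figures follow the same idea but count whole features that match exactly instead of single bases.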
And this is only the beginning




2002           2004         2005   2007   2010
Date:                            10/3/02   8/28/03   5/07    10/08

Published complete genomes:      104       156       500     874

Ongoing prokaryotic genomes:     316       386       1500    2124

Ongoing eukaryotic genomes:      218       246       700     1004

http://www.genomesonline.org                               4000




• Applied Biosystems ABI 3730XL: 1 Mb / day
• Roche / 454 Genome Sequencer FLX: 100 Mb / run
• Illumina / Solexa Genetic Analyzer: 2000 Mb / run
• Applied Biosystems SOLiD: 3000 Mb / run

[Figure: bases per run (millions) by date of introduction, 1994 to 2006, for ABI 370/377, ABI 3700, ABI 3730, and 454-GS20 (labelled 32,000,000)]

Although we humans share 99.9 percent of our genetic
information, we have small variations called single
nucleotide polymorphisms, or SNPs (pronounced "snip"). An
estimated 10 million SNPs exist in the human species, and
these differences are thought to be related to greater
resistance or susceptibility to diseases and drugs.

VARIATION IN THE HUMAN DNA SEQUENCE

Mutation rate = 10^-8 per site per generation
Number of generations from common ancestor to present-day humans: 10^4 to 10^5
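
These two numbers give a rough estimate of how much two human sequences should differ. The sketch below assumes the simplest possible model, in which two lineages each accumulate mutations independently since the common ancestor (hence the factor of 2); that model and the function name are assumptions, not part of the slide.

```python
MUTATION_RATE = 1e-8  # per site per generation, as quoted above


def expected_pairwise_divergence(generations, rate=MUTATION_RATE):
    """Expected differences per site between two lineages separated for `generations`."""
    return 2 * rate * generations


for generations in (1e4, 1e5):
    print(generations, expected_pairwise_divergence(generations))
```

Under these assumptions the expected difference falls between 0.02% and 0.2% of sites, the same order of magnitude as the 0.1% implied by humans sharing 99.9 percent of their genetic information.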
ENCyclopedia Of DNA Elements




Functional genomics

• Sequence (DNA/RNA) & phylogeny
• Comparative genomics
• Protein sequence analysis & evolution
• Regulation of gene expression; transcription factors & micro RNAs
• Protein structure & function: computational crystallography
• Protein families, motifs and domains
• Chemical biology
• Protein interactions & complexes: modelling and prediction
• Pathway analysis
• Data integration & literature mining
• Image analysis
• Systems modelling
• Copies of the DNA of the genes of interest are prepared and
  printed on the chip.
• The RNA samples of interest (control and sample) are
  prepared: reverse transcription, then fluorescent labelling.
• The samples are hybridized on the microarray.
• The chip is excited with two different lasers (Laser 1,
  Laser 2): the control responds to one of them and the
  sample to the other.
• Comparing the two images shows which genes are expressed
  differently.

                                                Schena et al. Science 1995
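
Comparing the two channels is commonly expressed as a log2 ratio of sample to control intensity per gene. The sketch below is illustrative only: the intensities are invented, the two-fold cutoff is an arbitrary assumption, and real analyses normalize the data and apply statistical tests first.

```python
import math


def log_ratios(sample, control):
    """log2(sample / control) for every gene measured in both channels."""
    return {g: math.log2(sample[g] / control[g]) for g in sample if g in control}


def differentially_expressed(sample, control, threshold=1.0):
    """Genes whose absolute log2 ratio reaches the threshold (1.0 means two-fold)."""
    return {g: lr for g, lr in log_ratios(sample, control).items() if abs(lr) >= threshold}


# Hypothetical spot intensities for three genes.
sample = {"geneA": 800.0, "geneB": 100.0, "geneC": 210.0}
control = {"geneA": 200.0, "geneB": 400.0, "geneC": 200.0}
print(differentially_expressed(sample, control))
```

Positive values indicate higher expression in the sample, negative values higher expression in the control.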
Microarray analysis
            Clinical prediction of Leukemia type

• 2 types
    – Acute lymphoid (ALL)
    – Acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?




                                   Golub et al. Science 286:531-537. (1999)
Biomarkers discovery

• Data management: 30,000 genes
• Statistical analysis: 1500 genes
• Annotation: 150 genes
• Network analysis: 50 elements
• Selection: 10 targets
RT-PCR Standard Processing Procedure

TaqMan Assays

Step 1: Calculate Ct with SDS and export text file
Step 2: Retrieve data and define experiment design
Step 3: Biological Replicates
Step 4: Selection of Optimal Endogenous Controls & Calculation of ΔCt
Step 5: Differential Expression Analysis ΔΔCt

Checkpoints (!): Overview Plates & Samples; Quality Control Raw Values; Discard Samples; Quality Control ΔCt Overview
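
The ΔCt and ΔΔCt quantities in steps 4 and 5 can be sketched as follows. The code assumes the widely used 2^-ΔΔCt approach with perfect amplification efficiency and a single endogenous control; the slide does not spell out the formulas, and the Ct values and function names here are made up.

```python
def delta_ct(ct_target, ct_reference):
    """Normalize the target gene's Ct against an endogenous control gene."""
    return ct_target - ct_reference


def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression of treated vs control samples as 2 ** (-ΔΔCt)."""
    ddct = delta_ct(ct_target_treated, ct_ref_treated) - delta_ct(ct_target_control, ct_ref_control)
    return 2 ** (-ddct)


# Hypothetical Ct values: the target appears two cycles earlier in the treated sample.
print(fold_change(22.0, 18.0, 24.0, 18.0))
```

A lower Ct means more starting template, so a negative ΔΔCt corresponds to higher expression in the treated sample.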
Example of Array CGH Technology*




Chari et al, Cancer Informatics, 2006, 2, 48-58
ChIP-on-chip




     Source: http://www.chiponchip.org/
ChIP (Chromatin ImmunoPrecipitation)

• Chromatin immunoprecipitation, or ChIP, refers to a procedure
  used to determine whether a given protein binds to a specific
  DNA sequence in vivo

1.   DNA-binding proteins are crosslinked to DNA with
     formaldehyde in vivo.
2.   Isolate the chromatin. Shear DNA along with bound
     proteins into small fragments.
3.   Bind antibodies specific to the DNA-binding protein to
     isolate the complex by precipitation. Reverse the
     cross-linking to release the DNA and digest the proteins.
4.   Use PCR (Polymerase Chain Reaction) to amplify specific
     DNA sequences to see if they were precipitated with the
     antibody.
Protein Microarray
         G. MacBeath and S.L. Schreiber, 2000, Science 289:1760




                                         arrayIT TM



Spotting platform and protein microarray
Different Kinds of Protein Arrays*



Antibody Array    Antigen Array       Ligand Array




 Detection by: SELDI MS, fluorescence, SPR,
 electrochemical, radioactivity, microcantilever
The Microarray Study Process
Preprocessing
Some Questions:



• Which genes have expression levels that are correlated
  with some external variable?
• For a given pathway, which of the genes in our collection
  are most likely to be involved?
• For a diffuse disease, which genes are associated with
  different outcomes?
Challenges for Data Analysis


• Normalization (removing systematic measurement effects)
• Variable Selection (Identification of relevant Variables)
• Large sample Effects:

        Type I and Type II errors (False positives / False negatives)

• Dimensionality Reduction
• Identification of new disease classes
• Classification of data into known disease classes
Data Analysis Methods
Dimension Reduction
 • PCA (Principal Component Analysis)
 • ICA (Independent Component Analysis)
 • Multidimensional Scaling

Unsupervised Learning
 • K-Means / K-Medoid
 • Hierarchical Clustering Algorithms

Supervised Learning
 • Linear Discriminant Analysis
 • Maximum Likelihood Discrimination
 • Nearest Neighbor Methods
 • Decision Trees
 • Random Forests
Matrix factorization
Retos de la Bioinformatica
Popular Classification Methods

• Decision Trees/Rules
   – Find smallest gene sets, but not robust – poor performance
• Neural Nets - work well for reduced number of genes
• K-nearest neighbor – good results for small number of genes, but
  no model
• Naïve Bayes – simple, robust, but ignores gene interactions
• Support Vector Machines (SVM)
   – Good accuracy, does own gene selection,
     but hard to understand
• Specialized methods, D/S/A (Dudoit), …




Support Vector Machine (SVM)




• Main idea: Select hyperplane that is more likely to
  generalize on a future datum
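
The hyperplane "more likely to generalize" is usually formalized as the maximum-margin hyperplane. As a sketch, assuming linearly separable training samples $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$ (notation that is not on the slide), the hard-margin problem reads:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1,\dots,n
```

The margin between the two classes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ widens it.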
Best Practices



• Capture the complete process, from raw data to final
  results
• Gene (feature) selection inside cross-validation
• Randomization testing
• Robust classification algorithms
      – Simple methods give good results
      – Advanced methods can be better
• Wrapper approach for best gene subset selection
• Use bagging to improve accuracy
• Remove/relabel mislabeled or poorly differentiated
  samples
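
The point about gene selection inside cross-validation can be sketched as follows: the gene is chosen again in every fold, using the training part only, so the held-out samples never influence the selection. The selection rule (largest difference in class means), the nearest-mean classifier and the toy data are assumptions for illustration.

```python
def class_mean(samples, label, gene):
    """Mean expression of one gene over the samples with the given label."""
    values = [x[gene] for x, y in samples if y == label]
    return sum(values) / len(values)

def select_gene(samples):
    """Pick the gene whose class means differ most."""
    n_genes = len(samples[0][0])
    return max(
        range(n_genes),
        key=lambda g: abs(class_mean(samples, 1, g) - class_mean(samples, 0, g)),
    )

def cross_validate(samples, n_folds):
    """Accuracy with gene selection repeated inside every fold."""
    correct = 0
    for fold in range(n_folds):
        test = samples[fold::n_folds]
        train = [s for i, s in enumerate(samples) if i % n_folds != fold]
        gene = select_gene(train)  # selection sees the training fold only
        m0, m1 = class_mean(train, 0, gene), class_mean(train, 1, gene)
        for x, y in test:
            predicted = 1 if abs(x[gene] - m1) < abs(x[gene] - m0) else 0
            correct += int(predicted == y)
    return correct / len(samples)

# Hypothetical data: gene 0 separates the classes, gene 1 is noise.
samples = [
    ((1.0, 5.0), 0), ((4.0, 5.1), 1), ((1.2, 4.9), 0), ((4.1, 5.0), 1),
    ((0.9, 5.2), 0), ((3.9, 4.8), 1), ((1.1, 5.1), 0), ((4.2, 4.9), 1),
]
accuracy = cross_validate(samples, n_folds=4)
print(accuracy)
```

Selecting genes once on the full data set and cross-validating afterwards would leak information from the test samples and give an optimistic accuracy estimate.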


Enrichment Analysis

•   What are major enriched GO terms?
•   What are the highly active pathways?
•   What are the frequently interacting proteins?
•   What are the known disease associations?




Alistair Chalk, 2008
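
One common way to answer the first question (enriched GO terms) is a one-sided hypergeometric test. The slide does not say which test was used, so the sketch below is an assumption, and the counts are invented for illustration.

```python
from math import comb

def enrichment_p_value(population, annotated, selected, hits):
    """P(X >= hits) when `selected` genes are drawn from `population` genes,
    of which `annotated` carry the GO term (hypergeometric upper tail)."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes on the array, 5 annotated with the term,
# 4 genes in the differentially expressed list, 3 of them annotated.
p = enrichment_p_value(population=20, annotated=5, selected=4, hits=3)
print(round(p, 4))
```

When many GO terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.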
Meta-analysis example: “Creation and
    implications of a phenome-genome network”
Butte and Kohane. Nat Biotech. 2006
•   Clustered experiments based on
    mapping concepts found in sample
    annotations to UMLS meta-thesaurus.
•   Relationships found between
    phenotype (e.g., aging), disease (e.g.,
    leukemia), environmental (e.g., injury)
    and experimental (e.g., muscle cells)
    factors and genes with differential
    expression.
•   “the ease and accuracy of automating
    inferences across data are crucially
    dependent on the accuracy and
    consistency of the human annotation
    process, which will only happen when
    every investigator has a better
    prospective understanding of the long-
    term value of the time invested in
    improving annotations.”
Systems biology
PPI ANNOTATION AND DATABASES


Database    Reference                                      URL
MINT        (Zanoni et al., 2002)                          http://mint.bio.uniroma2.it/mint

IntAct      (Hermjakob et al., 2004)                       http://www.ebi.ac.uk/intact

DIP         (Xenarios et al., 2002)                        http://dip.doe-mbi.ucla.edu/

HPID        (Han et al., 2004)                             http://www.hpid.org

HPRD        (Peri et al., 2004)                            http://www.hprd.org/

                                     iMEX agreement to share curation efforts

                                     Proteomics Standards Initiative (PSI) recommendation

                                     Molecular Interaction (MI) Ontology

                                     Large scale experiments

                                    Literature curation
Complex networks


• Many systems can be represented as
  networks (graphs)
   – Nodes: individual component (proteins)
   – Edges: relationships (interactions)
• They share common properties
   – Scale-free
   – Hierarchical
   – Clustering
• Some properties may be intrinsic
  and can be understood better when
  putting into the context of evolution
Detecting Hierarchical Organization
Summary: Network Measures

• Degree ki
   The number of edges involving node i
• Degree distribution P(k)
   The probability (frequency) of nodes of degree k
• Mean path length
   The avg. shortest path between all node pairs
• Network Diameter
   – i.e. the longest shortest path
• Clustering Coefficient
   – A high CC is found for modules
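
The measures summarized above can be computed on a small undirected graph as in the sketch below. The adjacency-set representation, the function names and the four-node toy network are assumptions for illustration.

```python
from collections import deque

def degrees(graph):
    """Degree k_i: number of edges involving each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def degree_distribution(graph):
    """P(k): fraction of nodes that have degree k."""
    counts = {}
    for k in degrees(graph).values():
        counts[k] = counts.get(k, 0) + 1
    return {k: c / len(graph) for k, c in counts.items()}

def shortest_paths_from(graph, source):
    """Breadth-first search distances from one node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def path_statistics(graph):
    """Mean shortest path length and diameter over all connected node pairs."""
    lengths = [
        d
        for s in graph
        for t, d in shortest_paths_from(graph, s).items()
        if t != s
    ]
    return sum(lengths) / len(lengths), max(lengths)

def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    nbs = list(graph[node])
    k = len(nbs)
    if k < 2:
        return 0.0
    links = sum(
        1 for i in range(k) for j in range(i + 1, k) if nbs[j] in graph[nbs[i]]
    )
    return 2 * links / (k * (k - 1))

# Hypothetical interaction network: a triangle A-B-C with D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
mean_path, diameter = path_statistics(graph)
print(degrees(graph), mean_path, diameter, clustering_coefficient(graph, "A"))
```

Node A sits inside the triangle and has a clustering coefficient of 1, which matches the remark that a high CC is found for modules.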
Mapping the phenotypic data to the network
                             •Systematic phenotyping
                             of 1615 gene knockout
                             strains in yeast
                             •Evaluation of growth of
                             each strain in the presence
                             of MMS (and other DNA
                             damaging agents)
                             •Screening against a
                             network of 12,232 protein
                             interactions


                             Begley TJ, Rosenbach AS, Ideker T,
                             Samson LD. Damage recovery pathways
                             in Saccharomyces cerevisiae revealed by
                             genomic phenotyping and interactome
                             mapping. Mol Cancer Res. 2002
                             Dec;1(2):103-12.
The Role of Proteomics


• The existence of an ORF does not imply the
    existence of a functional gene.
•   Limitations of comparative genomics.
•   mRNA levels may not correlate with protein levels.
•   Protein modifications: post-transcriptional
    modifications, isoforms, post-translational
    modifications, mutants.
•   Issues of proteolysis, sequestration, etc. relevant only
    at the protein level.
•   Protein complex composition, protein-protein
    interactions, structures.
Structural proteomics



•   Folding
•   Structure and function
•   Protein structure prediction
•   Secondary structure
•   Tertiary structure
•   Function
•   Post-translational modification
•   Prot.-Prot. Interaction -- Docking algorithm
•   Molecular dynamics/Monte Carlo
What kinds of methods are available?


5 main levels of protein Structure prediction:

  1. Extensive Sequence Search
  2. Threading and 1D-3D profiles
  3. Ab initio prediction of protein structure
  4. Comparative Modelling
  5. Docking (domain interaction prediction)
Prediction of Protein Structures


• Examples – a few good examples




[Figure: four pairs of structures, each shown as actual vs. predicted]
MODPIPE: Large-Scale Comparative Protein Structure Modeling
START

For each target sequence (PSI-BLAST stage):
  1. Get profile for sequence (NR)
  2. Scan sequence profile against representative PDB chains
  3. Scan PDB chain profiles against sequence
  4. Select templates using permissive E-value cutoff

For each template structure (MODELLER stage):
  5. Expand match to cover complete domains
  6. Align matched parts of sequence and structure
  7. Build model for target segment by satisfaction of spatial restraints
  8. Evaluate model

END


R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.
N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali.                              3/25/03
Structural Proteomics:
                            The Motivation*


[Chart: number of sequences (left axis, 0 to 2,000,000) and number of structures (right axis, 0 to 200,000) by year, 1980 to 2005]
The hierarchies of protein structure
Docking Programs

• Dock (UCSF)
• Autodock (Scripps)
• Glide (Schrödinger)
• ICM (Molsoft)
• FRED (OpenEye)
• Gold, FlexX, etc.




Cell cycle network from KEGG
Graphical Notation: a necessity for the conceptual representation
    of biopathways



Qualitative <-> Mechanistic: various degrees of detail, mixed levels of presentation

[Figures, left to right: Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol. 7:131 (2006); Aladjem et al., Science STKE pe8 (2004)]
Strategies: simulate or analyse?
                 (or rather what to do first)


[Diagram, top row of boxes, left to right:]
 • convert diagram into a quantitative model
 • simulate model behavior numerically
 • obtain qualitative understanding through numerical results and model reduction

[Diagram, bottom row of boxes, left to right:]
 • build and simulate a reduced model
 • identify “elementary modes”
 • qualitatively analyze network topology, stability, etc.
Space of modeling methods

[Figure: modeling methods, including stochsim and Boolean networks, placed along a continuous ↔ discrete axis]
Continuum of modeling approaches




Top-down                          Bottom-up
Frazier et al. (2003) Science 11 April Vol 300:290-293
Data integration
Nucleic Acids Research article lists
      1078 public databases




  Nucleic Acids Research, 2008, Vol. 36, Database issue
  http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2
Growth in Available Bioinformatics Databases
Too much unintegrated data



• Data sources incompatible
• No (or few) standard naming convention
• No common interface (varying tools for browsing,
  querying and visualizing data)
– Large experiments or large research groups/labs, possibly distributed
– Large service provider institutes.
– Tightly coupled provider-consumer of resources.
– Commonly resource providers.
– Some or lots of access to sys admin

versus

– Small, isolated, independent groups/individuals
– Loosely coupled provider-consumer of resources.
– Commonly resource consumers
– Boutique suppliers.
– Poor access to systems admins
Challenges: Names and Identity

•     WSL-1 protein
•     Apoptosis-mediating receptor DR3
•     Apoptosis-mediating receptor TRAMP
•     Death domain receptor 3
•     WSL protein
•     Apoptosis-inducing receptor AIR
•     Apo-3
•     Lymphocyte-associated receptor of death
•     LARD
•     GENE: Name=TNFRSF25

Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor

Annotation history:
 Q92983  P78515  O00275  Q93036  O00276  Q93037  O00277  Q99722  O00278  Q99830
 O00279  Q99831  O00280  Q9BY86  O14865  Q9UME0  O14866  Q9UME1  P78507  Q9UME5

GUIDs, Life Science Identifier? Normalisation

http://www.expasy.org/uniprot/Q93038
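
The naming problem on this slide can be sketched as a lookup that maps every known name or older accession to one primary accession. The accessions and names are the ones listed on the slide; treating the "annotation history" entries as retired accessions of Q93038, the function names and the case-insensitive matching are assumptions.

```python
PRIMARY = "Q93038"

# Accessions listed under "Annotation history" on the slide.
OLD_ACCESSIONS = [
    "Q92983", "P78515", "O00275", "Q93036", "O00276", "Q93037", "O00277",
    "Q99722", "O00278", "Q99830", "O00279", "Q99831", "O00280", "Q9BY86",
    "O14865", "Q9UME0", "O14866", "Q9UME1", "P78507", "Q9UME5",
]

# A few of the names listed on the slide.
NAMES = ["WSL-1 protein", "Apo-3", "LARD", "Death domain receptor 3", "TNFRSF25"]

def build_index(primary, accessions, names):
    """Map every known identifier (upper-cased) to the primary accession."""
    index = {primary.upper(): primary}
    for key in list(accessions) + list(names):
        index[key.upper()] = primary
    return index

def normalize(index, identifier):
    """Return the primary accession, or None when the identifier is unknown."""
    return index.get(identifier.strip().upper())

index = build_index(PRIMARY, OLD_ACCESSIONS, NAMES)
print(normalize(index, "O00275"), normalize(index, " lard "), normalize(index, "P12345"))
```

A real resolver would also have to handle names shared by several proteins, which a flat dictionary like this one cannot represent.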
Why must we support standards?


• Unambiguous representation, description
  and communication
  – Final results and metadata
• Interoperability
  – Data management and analysis
• Integration of OMICS -> systems biology
What to standardize?


•   CONTENT: Minimal/Core Information to be reported
    – MIBBI (http://www.mibbi.org)
•   SEMANTIC: Terminology Used -> Ontologies
    – OBI (http://obi-ontology.org)
•   SYNTAX: Data Model, Data Exchange
    – FuGE (http://fuge.sourceforge.net/)
MIBBI: Standard Content




Promoting Coherent Minimum Reporting Requirements for
Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotech.
Link Integration: Integration Lite




[Diagram labels: Ontology Authority, Identity Authority, Application interface, Application, User interface]
Warehouse




[Diagram labels: Wrappers, Unified model, Data Access and Query, Application, User interface]

•   Copy the data sets, clean and massage data into shape
•   Combine them into a (different) pre-determined model before query
•   ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART
•   Often called “Knowledge bases”
View integration




[Diagram labels: Wrappers, Unified model, Data Access and Query, Application, User interface]

•   Data at Source; Virtual integrating database view
•   Global as View / Local as View mappings between models
•   Map from model to databases dynamically so always fresh
•   TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE
Specialist Integrating Application




[Diagram labels: Wrappers, Application, User interface]


E.g. Ensembl, UTOPIA
• Very popular. Known to be one application.


Workflows


[Diagram labels: Wrapper, Workflow Engine, Application, User interface]

•     Data flow protocol. Automated data chaining.
•     General technique for describing and enacting a process
•     Describes what you want to do, not how you want to do it
•     Various degrees of data type compliance anticipated
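
The idea of automated data chaining can be sketched as a list of named steps in which each step consumes the output of the previous one. The tiny engine, the step functions and the hard-coded sequence below are hypothetical; in a real workflow the steps would wrap remote services.

```python
def run_workflow(steps, data):
    """Feed the output of each step into the next and keep a trace of results."""
    trace = []
    for name, step in steps:
        data = step(data)
        trace.append((name, data))
    return data, trace

def fetch_sequence(accession):
    # Hypothetical stand-in for a call to a sequence retrieval service.
    return {"accession": accession, "sequence": "GGCCAT"}

def gc_content(record):
    seq = record["sequence"]
    return {**record, "gc": (seq.count("G") + seq.count("C")) / len(seq)}

def classify(record):
    return {**record, "gc_rich": record["gc"] > 0.5}

# The workflow is declared as data: what to run, in which order.
steps = [("fetch", fetch_sequence), ("gc", gc_content), ("classify", classify)]
result, trace = run_workflow(steps, "X00001")  # "X00001" is a made-up accession
print(result)
```

Keeping the step list separate from the engine is what lets the same description be re-run, shared or modified without touching the code that enacts it.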
Mash-Up Data Marshalling
[Diagram labels: objects, Protocol (three times), Mash Up Application, User interface]

•     Content syndication and feeds
•     Emphasis on User creating specific integration by mapping.
•     Just in time, just enough design
•     On demand integration
Composite applications
Can the Semantic Web help?




[Diagram labels: Wrappers, Access and Query, Application, User interface; Semantic Enrichment, Model flattening, Mapping Transparency]

• Slight problem: we have no first class metadata migration and
  management infrastructure, where metadata is outside the application and
  in the middleware, and we can handle progressive curation
Service Oriented Architecture



[Diagram labels: Advanced Search, Retrieve data, Submit data; submission, curation; web services (ws); dataflow, workflow]
Distributed Annotation System
An Integrative Analysis Example


[Figure: overlapping labels of analysis components. Legible fragments include relational data mining, text mining, spectrum data mining, decision tree model, chemical structure visualization, text mining visualization, and visualization of cluster statistics, sequence data, pathway data, multidimensional data and metabonomic profiles]
From experiments to scientific publications

1- Experiments: planning and carrying out experiments (lab work)
2- Results: processing and interpretation of obtained results
3- Scientific peer-reviewed articles: 'relevant' results are published in scientific journals
PubMed/Medline database at NCBI

- Developed at the National Center for Biotechnology Information (NCBI).
- The core 'Textome'.
- Repository of citation entries of scientific articles.
- PubMed titles and abstracts are the primary data source for Bio-NLP.
- ~450,000 new abstracts per year
- > 4,800 biomedical journals
- ENTREZ search engine
Data in scientific articles

Scientific journals contain:
 - Free text: title, abstracts, keywords, text body, references
 - Tables
 - Figures

Journal-specific information:
 • Format
 • Paper structure (sections)
 • Article type

Biomedical literature characteristics:
 - Heavy use of domain specific terminology (12% biochemistry related technical terms).
 - Polysemic words (word sense disambiguation).
 - Most words with low frequency (data sparseness).
 - New names and terms created.
 - Typographical variants
 - Different writing styles (native languages)
BioCreative
BioCreative results

                      TP: prediction evaluated as protein
                      and GO terms correct

                      Precision: TP / Total nr. of
                                  evaluated submissions



                                   1: Chiang et al.
                                   2: Couto et al.
                                   3: Ehrler et al.
                                   4: Ray et al.
                                   5: Rice et al.
                                   6: Verspoor et al.
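
The precision definition above amounts to a single division. The counts in the sketch below are invented for illustration and are not BioCreative results.

```python
def precision(true_positives, evaluated):
    """TP divided by the total number of evaluated submissions."""
    if evaluated == 0:
        raise ValueError("no evaluated submissions")
    return true_positives / evaluated

# Hypothetical run: 120 evaluated predictions, 30 with protein and GO term both correct.
score = precision(true_positives=30, evaluated=120)
print(score)
```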
 Data Integration
   • Standards, DBs                                 Infrastructure

 Knowledge Discovery
   • Algorithms, Informatics, Machine Learning

 Integrate knowledge
   • Text mining, Ontologies

 Modelling
   • Pathways, Circuits, Abstraction

                                         Research                    Support
The challenges of biology in the next 50 years
• List all the molecular components that make up an organism:
     – Genes, proteins, and other functional elements
•   Understand the function of each component
•   Understand how they interact
•   Study how function has evolved
•   Find the genetic defects that cause disease
•   Design drugs and therapies rationally
•   Sequence the genome of every individual and use it in
    personalized medicine

• Bioinformatics is an essential component
  for achieving all of these goals

 
Presentación Laboratorio de Fabricación Digital UPNA 2014
Presentación Laboratorio de Fabricación Digital UPNA 2014Presentación Laboratorio de Fabricación Digital UPNA 2014
Presentación Laboratorio de Fabricación Digital UPNA 2014Alberto Labarga
 
Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014
Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014
Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014Alberto Labarga
 
Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...
Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...
Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...Alberto Labarga
 
Introducción a la impresión 3D
Introducción a la impresión 3DIntroducción a la impresión 3D
Introducción a la impresión 3DAlberto Labarga
 

Mais de Alberto Labarga (20)

El Salto Communities - EditorsLab 2017
El Salto Communities - EditorsLab 2017El Salto Communities - EditorsLab 2017
El Salto Communities - EditorsLab 2017
 
Shokesu - Premio Nobel de Literatura a Bob Dylan
Shokesu - Premio Nobel de Literatura a Bob DylanShokesu - Premio Nobel de Literatura a Bob Dylan
Shokesu - Premio Nobel de Literatura a Bob Dylan
 
Genome visualization challenges
Genome visualization challengesGenome visualization challenges
Genome visualization challenges
 
SocialLearning: descubriendo contenidos educativos de manera colaborativa
SocialLearning: descubriendo contenidos educativos de manera colaborativaSocialLearning: descubriendo contenidos educativos de manera colaborativa
SocialLearning: descubriendo contenidos educativos de manera colaborativa
 
Hacksanfermin 2015 :: Dropcoin Street
Hacksanfermin 2015 :: Dropcoin StreetHacksanfermin 2015 :: Dropcoin Street
Hacksanfermin 2015 :: Dropcoin Street
 
hacksanfermin 2015 :: Parking inteligente
hacksanfermin 2015 :: Parking inteligentehacksanfermin 2015 :: Parking inteligente
hacksanfermin 2015 :: Parking inteligente
 
jpd5 big data
jpd5 big datajpd5 big data
jpd5 big data
 
Vidas Contadas :: Visualizar 2015
Vidas Contadas :: Visualizar 2015Vidas Contadas :: Visualizar 2015
Vidas Contadas :: Visualizar 2015
 
Periodismo de datos y visualización de datos abiertos #siglibre9
Periodismo de datos y visualización de datos abiertos #siglibre9Periodismo de datos y visualización de datos abiertos #siglibre9
Periodismo de datos y visualización de datos abiertos #siglibre9
 
myHealthHackmedicine
myHealthHackmedicinemyHealthHackmedicine
myHealthHackmedicine
 
Big Data y Salud
Big Data y SaludBig Data y Salud
Big Data y Salud
 
Arduino: Control de motores
Arduino: Control de motoresArduino: Control de motores
Arduino: Control de motores
 
Entrada/salida analógica con Arduino
Entrada/salida analógica con ArduinoEntrada/salida analógica con Arduino
Entrada/salida analógica con Arduino
 
Práctica con Arduino: Simon Dice
Práctica con Arduino: Simon DicePráctica con Arduino: Simon Dice
Práctica con Arduino: Simon Dice
 
Entrada/Salida digital con Arduino
Entrada/Salida digital con ArduinoEntrada/Salida digital con Arduino
Entrada/Salida digital con Arduino
 
Presentación Laboratorio de Fabricación Digital UPNA 2014
Presentación Laboratorio de Fabricación Digital UPNA 2014Presentación Laboratorio de Fabricación Digital UPNA 2014
Presentación Laboratorio de Fabricación Digital UPNA 2014
 
Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014
Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014
Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014
 
Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...
Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...
Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...
 
Introducción a la impresión 3D
Introducción a la impresión 3DIntroducción a la impresión 3D
Introducción a la impresión 3D
 
Vidas Contadas
Vidas ContadasVidas Contadas
Vidas Contadas
 

Último

IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 

Último (20)

IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 

Challenges in Bioinformatics

  • 1. Bioinformatics: biology by other means. Alberto Labarga, UGR, November 2008
  • 2. Computational Biology Bioinformatics [Biological Information]
  • 3. Toward a scientific theory of heredity 1859 1866 1870 1900 1902
  • 4. In 1859 Charles Darwin publishes 'The Origin of Species', which proposes that living beings are the result of natural selection and that all creatures have evolved over the generations through small changes.
  • 5. Mendel's laws, published in 1866 and rediscovered in 1900.
  • 6. In 1870 the Swiss scientist Friedrich Miescher isolates the components stored in the cell nucleus, made up mainly of proteins and nucleic acids. At the time it was believed that the element storing hereditary information had to be protein, which is built from 20 amino acids, whereas nucleic acids had only 4 components.
  • 7. At the beginning of the 20th century, Phoebus Levene discovered that DNA is a chain of nucleotides, each composed of a sugar (deoxyribose), a phosphate group, and a nitrogenous base, which can be one of four types: adenine, thymine, guanine, or cytosine.
  • 8. Walter Sutton, a graduate student in E. B. Wilson’s lab at Columbia University, observed that in meiosis, the process of cell division that produces sperm and egg cells, each sperm or egg receives only one chromosome of each type. (In other parts of the body, cells have two chromosomes of each type, one inherited from each parent.) The segregation pattern of chromosomes during meiosis matched the segregation patterns of Mendel’s genes.
  • 9. The discovery of DNA 1928 1944 1949 1952 1953
  • 10. 1928, Frederick Griffith: the transforming principle. When he mixed R pneumococci with heat-killed S pneumococci and injected the mixture into mice, the mice died. Moreover, in the blood of the dead mice Griffith found encapsulated (S) pneumococci.
  • 11. In 1944 Oswald Avery and his colleagues, who were studying Pneumococcus, the bacterium that causes pneumonia, discovered that bacteria contain nucleic acids and that DNA is the molecule that stores the genes. Other studies with viruses went on to confirm this theory, even though DNA was still believed to be too simple.
  • 12. Life can be seen as a process of storing and transmitting biological information. Chromosomes are the carriers of this information. The information is stored in the form of a molecular code. To understand life we must identify these molecules and decipher the code.
  • 13. 1949: DNA is duplicated during cell division. Chargaff: A = T and G = C.
  • 14. 1952: Hershey-Chase experiment
  • 15. M.H.F. Wilkins, A.R. Stokes, H.R. Wilson: Molecular Structure of Deoxypentose Nucleic Acids. Nature 171, 738 (1953). R.E. Franklin and R.G. Gosling: Molecular Configuration in Sodium Thymonucleate. Nature 171, 740 (1953).
  • 16. MOLECULAR STRUCTURE OF NUCLEIC ACIDS “We wish to suggest a structure for the salt of deoxyribose nucleic acid (D.N.A.). This structure has novel features which are of considerable biological interest.” Nature, 25 April 1953
  • 17. “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”
  • 20. In 1955 Ochoa, together with the Franco-Russian biochemist Marianne Grunberg-Manago, published in the Journal of the American Chemical Society the isolation of an enzyme from E. coli that catalyzes the synthesis of RNA, the intermediary between DNA and proteins. The discoverers named the enzyme "polynucleotide phosphorylase"; it was at first taken to be the cell's RNA-synthesizing enzyme, a role later shown to belong to RNA polymerase. The discovery of polynucleotide phosphorylase made it possible to prepare synthetic polynucleotides of varying base composition, with which Severo Ochoa's group, in parallel with Marshall Nirenberg's group, deciphered the genetic code.
  • 22. 1955 1959 1962 1966
  • 23. When Perutz arrived in Cambridge, the largest molecular structure that had been solved was that of the pigment phthalocyanine, with 58 atoms. A protein has thousands of atoms. Bernal, his supervisor, had recorded some X-ray diffraction images of crystals of a protein, pepsin, but had not managed to interpret them. The subject Perutz chose for his thesis was another protein, hemoglobin, the oxygen carrier that gives our blood its red color. Hemoglobin has no fewer than 11,000 atoms. It took him 23 years.
  • 25. Over the course of several years, Marshall Nirenberg, Har Gobind Khorana and Severo Ochoa and their colleagues elucidated the genetic code, showing how nucleic acids with their 4-letter alphabet determine the order of the 20 kinds of amino acids in proteins. Messenger RNA is interpreted three letters at a time; a set of three nucleotides forms a "codon" that encodes an amino acid. A three-letter word over a four-letter alphabet can take 64 (4 x 4 x 4) forms, which is more than enough to encode the 20 amino acids in living beings.
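The 4 x 4 x 4 count on this slide can be checked by listing every codon. The following is a minimal sketch; the variable names are illustrative and not taken from the slides.

```python
from itertools import product

BASES = "ACGU"  # the four letters of the messenger RNA alphabet

# Every ordered triple of bases is one codon: 4 * 4 * 4 = 64 in total.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

print(len(codons))  # number of distinct three-letter words
print(codons[:4])   # the first few codons in alphabetical order
```

With only 20 amino acids to encode, 64 codons leave room for several codons per amino acid as well as for stop signals.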
  • 27. From DNA to protein
  • 28. Understanding the mechanisms, creating the tools 1970 1971 1975 1977 1980
  • 29. The Central Dogma
  • 30. The Protein Data Bank (PDB): created in 1971 with seven structures
  • 31. Recombinant DNA is a DNA molecule formed by joining two heterologous molecules, that is, molecules of different origin. It is made using restriction enzymes, which are able to "cut" DNA at specific points. Put very simply, we "cut" a human gene and "paste" it into the DNA of a bacterium; if, for example, it is the gene that directs insulin production, placing it in a bacterium "forces" the bacterium to make insulin.
  • 33. A precursor RNA may often be matured to mRNAs with alternative structures. An example where alternative splicing has a dramatic consequence is somatic sex determination in the fruit fly Drosophila melanogaster. In this system, the female-specific Sxl protein is a key regulator. It controls a cascade of alternative RNA splicing decisions that finally result in female flies.
  • 34. Understanding the mechanisms, creating the tools 1981 1982 1983 1985 1987 1990
  • 35. Read out the letters from a DNA sequence: GTGAGGCGCTGC
  • 36. 1983: The polymerase chain reaction (PCR) is a molecular biology technique, described in 1986 by Kary Mullis, whose aim is to obtain a large number of copies of a particular DNA fragment starting from a minimal amount; in theory a single copy of the original fragment, or template, is enough.
  • 37. Total nucleotides (Nov 07: 188,490,792,445). Number of entries (Nov 07: 106,144,026).
  • 39. The Human Genome Project (HGP) aims to determine the relative positions of all the nucleotides (or base pairs) of the human genome and to identify the estimated 100,000 genes present in it. The project, funded with 3 billion dollars, was launched in 1990 by the United States Department of Energy and National Institutes of Health, with a planned duration of 15 years.
  • 40. "Imagine several copies of a book, each cut into 10 million small pieces in such a way that the pieces overlap. Suppose that 1 million pieces have been lost and that the other 9 million are stained with ink. Recover the original text."
  • 42. HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome.
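The book analogy and the shotgun strategy both come down to merging fragments that overlap. The sketch below is a toy greedy assembler under strong assumptions: reads are error-free, come from one strand, and no read is contained inside another. It is not the assembler used by the genome project, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads taken from the 12-letter sequence on slide 35.
reads = ["GTGAGGC", "AGGCGCT", "GCGCTGC"]
print(greedy_assemble(reads))
```

Real genomes contain repeats and sequencing errors, which is why the hierarchical strategy first orders large clones on a physical map before assembling each one.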
  • 43. Deciphering the book of life 1990 1995 1996 1997 1998 1999 2001
  • 44. S.F. Altschul et al. (1990), "Basic Local Alignment Search Tool", J. Molec. Biol., 215(3): 403-10. 15,306 citations. S.F. Altschul et al. (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., vol. 25, no. 17, pp. 3389-402.
  • 47. SSAHA (Ning et al., 2001), http://www.sanger.ac.uk/Software/analysis/SSAHA/: SSAHA is an algorithm for very fast matching and alignment of DNA sequences. It stands for Sequence Search and Alignment by Hashing Algorithm. It achieves its fast search speed by converting sequence information into a 'hash table' data structure, which can then be searched very rapidly for matches. BLAT (J. Kent, 2002), http://genome.ucsc.edu/cgi-bin/hgBlat: BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 20 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more.
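The hash-table idea described for SSAHA can be illustrated with a small k-mer index. This is a simplified sketch, not the published algorithm: it indexes every overlapping k-mer of the subject sequence, and the function names and the value of k are illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_index(subject: str, k: int) -> dict[str, list[int]]:
    """Map every k-letter word of the subject to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return dict(index)


def find_seed_hits(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_position, subject_position) pairs for every shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits


subject = "GTGAGGCGCTGC"
index = build_kmer_index(subject, k=4)
hits = find_seed_hits("GGCGCT", index, k=4)
print(hits)  # hits that share the same offset (spos - qpos) lie on one diagonal
```

Building the index costs one pass over the subject; after that each query word is a single dictionary lookup, which is where the speed comes from. Hits on the same diagonal are then candidates for extension into a full alignment.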
  • 48. J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment … Nuc. Acids. Res. 22, 4673-4680
  • 49. Flowchart of computation steps in Clustal W (Thompson et al., 1994): (1) pairwise alignment: calculation of distance matrix; (2) creation of unrooted neighbor-joining tree; (3) rooted NJ tree (guide tree) and calculation of sequence weights; (4) progressive alignment following the guide tree.
  • 50. Other methods: Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797. Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518. Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 298. Larkin, M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics, 23(21), 2947–2948.
  • 52. 1995: The first complete genome of a free-living organism, Haemophilus influenzae.
  • 53. 1996: The yeast genome is completed: approximately 6,000 genes and 14,000,000 base pairs.
  • 55. 1997: The genome of the bacterium E. coli is sequenced: 4,600 genes and 4.5 million nucleotides.
  • 56. 1998: The genome of the worm Caenorhabditis elegans has 18,000 genes and about 100 million nucleotides.
  • 57. 1999: The complete sequence of chromosome 22 is obtained. The HGP is ahead of schedule. The small number of genes found (about 300) comes as a surprise.
  • 58. Fire A, Xu S, Montgomery M, Kostas S, Driver S, Mello C (1998). "Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans". Nature 391 (6669): 806–11. doi:10.1038/35888. PMID 9486653
  • 59. Hamilton A, Baulcombe D (1999). "A species of small antisense RNA in posttranscriptional gene silencing in plants". Science 286 (5441): 950–2. PMID 10542148
  • 60. Dr Alan Wolffe (1999): Epigenetics is the study of heritable changes in gene expression that occur without a change in DNA sequence. Such changes cannot be attributed to changes in DNA sequence (mutations). They are as irreversible as mutations (or difficult to reverse).
  • 62. Gene prediction: where are the genes? In humans: ~22,000 genes, ~1.5% of human DNA.
  • 63. The GENCODE pipeline: 1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome; 2. manual curation to resolve conflicting evidence; 3. additional computational predictions; 4. experimental verification; 5. final annotation.
  • 64. Genome annotation - building a pipeline. Genome sequence; map repeats; map ESTs; map peptides; genefinding; nc-RNAs; protein-coding genes; functional annotation; release. (August 2008, Bioinformatics tools for Comparative Genomics of Vectors)
  • 65. Genefinding - ab initio predictions. Use compositional features of the DNA sequence to define coding segments (essentially exons): ORFs, coding bias, splice site consensus sequences, start and stop codons. Each feature is assigned a log likelihood score. Use dynamic programming to find the highest scoring path. Needs to be trained using a known set of coding sequences. Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh.
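Of the features listed on this slide, the open reading frame (ORF) is the simplest to compute. The sketch below scans only the three forward-strand frames for an ATG followed in frame by a stop codon; it ignores splice sites, coding bias and scoring, so it is far short of a real gene finder. The function name and the `min_codons` threshold are illustrative assumptions.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs; end is exclusive and includes the stop codon.

    min_codons counts the codons before the stop codon, including the ATG.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START:
                start = pos  # open a candidate ORF at the first ATG in this frame
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # look for the next ATG after this stop
    return orfs


dna = "AAATGGCCTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

In prokaryotic genomes a scan like this, extended to the reverse strand, already recovers many genes; in eukaryotes introns interrupt the reading frame, which is why the slide also lists splice site signals.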
  • 66. Ab initio prediction. Figure labels: Genome; Coding potential; ATG & Stop codons; Splice sites; ATG & Stop codons; Coding potential.
  • 68. Ab initio prediction. Figure labels: Genome; Coding potential; ATG & Stop codons; Splice sites; ATG & Stop codons; Coding potential; Find best prediction.
  • 69. Genefinding - similarity. Use known coding sequence to define coding regions: EST sequences, peptide sequences. Needs to handle fuzzy alignment regions around splice sites. Needs to attempt to find start and stop codons. Examples: EST2Genome, exonerate, genewise. Use 2 or more genomic sequences to predict genes based on conservation of exon sequences. Examples: Twinscan and SLAM.
  • 70. Similarity-based prediction. Figure labels: Genome; Align cDNA/peptide; Create prediction.
  • 71. Example of a simple HMM. Top: model architecture and parameters. Bottom: sequence generation process. Green: state transition probabilities; red: emission probabilities. Prob(sequence, path|model) = 6.8e-8. (EPFL – Bioinformatics I – 05 Dec 2005)
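The quantity Prob(sequence, path|model) on this slide is the product of one start probability, the transition probabilities along the path, and the emission probability of each letter in its state. The sketch below computes it for a made-up two-state model; the states and every number in it are hypothetical and are not the parameters from the slide, so it does not reproduce the 6.8e-8 value.

```python
import math

# Hypothetical two-state model (not the parameters shown on the slide).
START_P = {"exon": 0.5, "intron": 0.5}
TRANS_P = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
EMIT_P = {
    "exon": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def joint_log_prob(sequence: str, path: list[str]) -> float:
    """log P(sequence, path | model), summed in log space to avoid underflow."""
    if len(sequence) != len(path):
        raise ValueError("sequence and path must have the same length")
    logp = math.log(START_P[path[0]]) + math.log(EMIT_P[path[0]][sequence[0]])
    for prev, state, letter in zip(path, path[1:], sequence[1:]):
        logp += math.log(TRANS_P[prev][state]) + math.log(EMIT_P[state][letter])
    return logp


logp = joint_log_prob("GAT", ["exon", "exon", "intron"])
print(math.exp(logp))  # 0.5*0.25 * 0.9*0.25 * 0.1*0.4
```

Working in log space matters for real sequences: multiplying thousands of probabilities below 1 would underflow to zero in floating point.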
  • 72. Automatic annotation vs manual annotation. Automatic annotation: quick whole genome analysis (~weeks); consistent annotation; can use unfinished sequence/shotgun assembly; no polyA sites/signals, pseudogenes; predicts ~70% of loci. Manual annotation: extremely slow (~3 months for Chr 6); needs finished sequence; flexible, can deal with inconsistencies in data; most rules have exceptions; consults publications as well as databases.
  • 73. Analysis: EGASP predictions vs manual annotation. [Figure: bar charts of sensitivity (Sn) and specificity (Sp), in percent, at the nucleotide, exon, transcript and gene level for 9_101_1, 20_79_1, 36_46_1 and 41_77_1.]
  • 74. And this is only the beginning 2002 2004 2005 2007 2010
  • 76. Genome projects (http://www.genomesonline.org) as of 10/3/02, 8/28/03, 5/07 and 10/08 respectively. Published complete genomes: 104, 156, 500, 874. Ongoing prokaryotic genomes: 316, 386, 1500, 2124. Ongoing eukaryotic genomes: 218, 246, 700, 1004. Total by 10/08: about 4000.
  • 77. [Figure: number of bases per run against date of introduction (1994 to 2006) for Applied Biosystems sequencers (370/377, 3700, 3730, 3730XL; about 1 Mb/day) and for the newer platforms: Roche/454 GS20 and Genome Sequencer FLX (100 Mb/run), Illumina/Solexa Genetic Analyzer (2000 Mb/run), Applied Biosystems SOLiD (3000 Mb/run).]
  • 78. Although humans share 99.9 percent of their genetic information, we have small variations called single nucleotide polymorphisms, or SNPs (pronounced "snip"). An estimated 10 million SNPs exist in the human species, and these differences are thought to be related to greater resistance or susceptibility to diseases and drugs.
  • 79. VARIATION IN THE HUMAN DNA SEQUENCE. Mutation rate = 10^-8 per site per generation. Number of generations from the common ancestor to present-day humans: 10^4 to 10^5.
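The two figures on this slide can be combined into a rough expected divergence between two present-day human sequences. The sketch below assumes that two lineages each accumulate mutations independently since their common ancestor (hence the factor of 2) and that the haploid genome is about 3 billion base pairs; neither assumption appears on the slide.

```python
MUTATION_RATE = 1e-8        # per site per generation (from the slide)
GENERATIONS = (1e4, 1e5)    # range of generations back to the common ancestor (from the slide)
GENOME_SIZE = 3e9           # assumed haploid genome size in base pairs (not on the slide)


def expected_differences_per_site(mu: float, generations: float) -> float:
    """Expected per-site differences between two lineages that diverged `generations` ago."""
    return 2 * mu * generations


for g in GENERATIONS:
    per_site = expected_differences_per_site(MUTATION_RATE, g)
    print(f"{g:.0e} generations: {per_site:.0e} per site, "
          f"about {per_site * GENOME_SIZE:.0e} differences genome-wide")
```

Under these assumptions the per-site divergence falls between 2 x 10^-4 and 2 x 10^-3, a range that brackets the 0.1 percent difference implied by the 99.9 percent identity quoted on the previous slide.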
  • 80. ENCODE: ENCyclopedia Of DNA Elements
  • 83. Sequence (DNA/RNA) & phylogeny; comparative genomics; protein sequence analysis & evolution; regulation of gene expression: transcription factors & micro RNAs; protein structure & function: computational crystallography; protein families, motifs and domains; chemical biology; protein interactions & complexes: modelling and prediction; pathway analysis; data integration & literature mining; image analysis; systems modelling.
  • 84. • DNA copies of the genes of interest are prepared and printed on the chip • RNA samples of interest (control and sample) are prepared, reverse transcribed, and labeled with fluorescence • The samples are hybridized on the microarray • The chip is excited with two different lasers (Laser 1, Laser 2): the control responds to one of them and the sample to the other • Comparing the two images indicates which genes are expressed differently Schena et al. Science 1995
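The final comparison step can be sketched as a per-gene log ratio of the two channel intensities. This is a minimal illustration, not the method of Schena et al.: the gene names, intensity values, function names, and the twofold threshold are all hypothetical.

```python
import math

# Hypothetical background-corrected intensities for the two laser channels.
control = {"geneA": 500.0, "geneB": 1200.0, "geneC": 300.0}
sample = {"geneA": 2000.0, "geneB": 1100.0, "geneC": 75.0}


def log2_ratios(sample_channel, control_channel):
    """log2(sample / control) per gene; positive means higher in the sample."""
    return {
        gene: math.log2(sample_channel[gene] / control_channel[gene])
        for gene in sample_channel
    }


def differentially_expressed(ratios, threshold=1.0):
    """Genes whose absolute log2 ratio reaches the threshold (1.0 = twofold)."""
    return sorted(gene for gene, ratio in ratios.items() if abs(ratio) >= threshold)


ratios = log2_ratios(sample, control)
hits = differentially_expressed(ratios)
print(hits)
```

A real analysis would normalize the two channels before taking ratios; that step is left out here to keep the sketch short.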
  • 85. Microarray analysis: Clinical prediction of Leukemia type • 2 types – Acute lymphoid (ALL) – Acute myeloid (AML) • Different treatment & outcomes • Predict type before treatment? Golub et al. Science 286:531-537 (1999)
  • 86. Biomarker discovery • Stages: data management, statistical analysis, annotation, network analysis, selection • Funnel: 30,000 genes → 1500 genes → 150 genes → 50 elements → 10 targets
  • 87. RT-PCR Standard Processing Procedure (TaqMan Assays) • Step 1: Calculate Ct with SDS and export text file (raw values) • Step 2: Retrieve data and define experiment design (overview of plates & samples; quality control; discard samples) • Step 3: Biological replicates • Step 4: Selection of optimal endogenous controls & calculation of ΔCt (quality control; ΔCt overview) • Step 5: Differential expression analysis (ΔΔCt)
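The ΔCt and ΔΔCt quantities in steps 4 and 5 can be illustrated with the widely used 2^-ΔΔCt calculation. The slide does not say which formula the procedure uses, so this is a sketch under that assumption, and it further assumes ideal amplification efficiency (a doubling per cycle). The Ct values and function names are hypothetical.

```python
def delta_ct(target_ct: float, reference_ct: float) -> float:
    """Ct of the target gene relative to an endogenous control gene."""
    return target_ct - reference_ct


def fold_change(treated_target, treated_reference, control_target, control_reference):
    """Relative expression, assuming perfect doubling per PCR cycle."""
    ddct = delta_ct(treated_target, treated_reference) - delta_ct(
        control_target, control_reference
    )
    return 2 ** (-ddct)


# Hypothetical Ct values: the target crosses the threshold two cycles earlier
# in the treated sample, relative to the same endogenous control.
result = fold_change(
    treated_target=22.0, treated_reference=18.0,
    control_target=24.0, control_reference=18.0,
)
print(result)
```

Lower Ct means more starting template, which is why the exponent carries a minus sign.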
  • 88. Example of Array CGH Technology* Chari et al, Cancer Informatics, 2006, 2, 48-58
  • 89.
  • 90. ChIP-on-chip Source: http://www.chiponchip.org/
  • 91. ChIP (Chromatin ImmunoPrecipitation) • Chromatin immunoprecipitation, or ChIP, refers to a procedure used to determine whether a given protein binds to a specific DNA sequence in vivo • 1. DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo • 2. Isolate the chromatin. Shear DNA along with bound proteins into small fragments • 3. Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation • 4. Reverse the cross-linking to release the DNA and digest the proteins • 5. Use PCR (Polymerase Chain Reaction) to amplify specific DNA sequences to see if they were precipitated with the antibody
  • 93. Protein Microarray G. MacBeath and S.L. Schreiber, 2000, Science 289:1760 arrayIT TM Spotting platform and protein microarray
  • 94. Different Kinds of Protein Arrays* • Antibody Array • Antigen Array • Ligand Array • Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever
  • 97. Some Questions: • Which genes have expression levels that are correlated with some external variable? • For a given pathway, which of the genes in our collection are most likely to be involved? • For a diffuse disease, which genes are associated with different outcomes?
  • 98. Challenges for Data Analysis • Normalization (removing systematic measurement effects) • Variable Selection (Identification of relevant Variables) • Large sample Effects: Type I and Type II errors (False positives / False negatives) • Dimensionality Reduction • Identification of new disease classes • Classification of data into known disease classes
  • 99. Data Analysis Methods: Dimension Reduction • PCA (Principal Component Analysis) • ICA (Independent Component Analysis) • Multidimensional Scaling. Unsupervised Learning • K-Means / K-Medoid • Hierarchical Clustering Algorithms. Supervised Learning • Linear Discriminant Analysis • Maximum Likelihood Discrimination • Nearest Neighbor Methods • Decision Trees • Random Forests
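As one concrete instance of the unsupervised methods listed above, here is a minimal K-means sketch in plain Python. The expression profiles are hypothetical, and taking the first k points as starting centroids is a simplifying assumption made only to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of profiles."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(points, k, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(points[:k])  # assumption: first k points as the start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical two-condition expression profiles forming two obvious groups.
profiles = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centroids = k_means(profiles, k=2)
print(labels)
```

In practice the result depends on the starting centroids, so implementations usually restart several times from random positions.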
  • 102. Popular Classification Methods • Decision Trees/Rules – Find smallest gene sets, but not robust – poor performance • Neural Nets – work well for reduced number of genes • K-nearest neighbor – good results for small number of genes, but no model • Naïve Bayes – simple, robust, but ignores gene interactions • Support Vector Machines (SVM) – Good accuracy, does own gene selection, but hard to understand • Specialized methods, D/S/A (Dudoit), …
  • 103. Support Vector Machine (SVM) • Main idea: Select hyperplane that is more likely to generalize on a future datum
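One common way to make "more likely to generalize" concrete is the margin: the distance from a separating hyperplane to the closest training point. The sketch below only compares two hand-picked candidate hyperplanes by their margin; an actual SVM finds the maximum-margin hyperplane by optimization, which is not shown. All data points and candidate hyperplanes are hypothetical.

```python
import math


def margin(weights, bias, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive value means every point lies on its own side of the hyperplane.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    return min(
        label * (sum(w * x for w, x in zip(weights, point)) + bias) / norm
        for point, label in zip(points, labels)
    )


# Hypothetical two-class data, classes labeled -1 and +1.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

# Two candidate separating hyperplanes, given as (weights, bias).
candidates = {
    "A": ((1.0, 0.0), -3.0),  # vertical line x = 3
    "B": ((1.0, 1.0), -5.5),  # diagonal line x + y = 5.5
}

margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(best, margins)
```

Both candidates separate the training data perfectly; the margin is what distinguishes them.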
  • 104. Best Practices • Capture the complete process, from raw data to final results • Gene (feature) selection inside cross-validation • Randomization testing • Robust classification algorithms – Simple methods give good results – Advanced methods can be better • Wrapper approach for best gene subset selection • Use bagging to improve accuracy • Remove/relabel mislabeled or poorly differentiated samples
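The point about doing gene selection inside cross-validation can be sketched as follows: the held-out sample must not influence which genes are chosen, so selection is repeated on the training part of every fold. The example uses leave-one-out cross-validation, a difference-of-class-means gene score, and a nearest-centroid classifier; all three are illustrative choices rather than anything prescribed by the slide, and the data are hypothetical.

```python
def select_genes(samples, labels, n_genes):
    """Rank genes by the absolute difference between the two class means."""
    first, second = sorted(set(labels))

    def class_mean(gene, cls):
        values = [s[gene] for s, label in zip(samples, labels) if label == cls]
        return sum(values) / len(values)

    n_features = len(samples[0])
    scores = [
        abs(class_mean(g, first) - class_mean(g, second)) for g in range(n_features)
    ]
    return sorted(range(n_features), key=lambda g: scores[g], reverse=True)[:n_genes]


def nearest_centroid(train, train_labels, genes, sample):
    """Predict the class whose centroid (on the selected genes) is closest."""
    best_label, best_distance = None, float("inf")
    for cls in sorted(set(train_labels)):
        members = [s for s, label in zip(train, train_labels) if label == cls]
        centroid = [sum(m[g] for m in members) / len(members) for g in genes]
        distance = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if distance < best_distance:
            best_label, best_distance = cls, distance
    return best_label


def leave_one_out_accuracy(samples, labels, n_genes):
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Selection happens here, using the training fold only.
        genes = select_genes(train, train_labels, n_genes)
        if nearest_centroid(train, train_labels, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical expression values: gene 0 is informative, genes 1 and 2 are not.
data = [
    [5.0, 1.0, 3.0], [5.2, 0.9, 1.0], [4.9, 1.2, 2.5],  # class "ALL"
    [1.0, 1.1, 2.9], [1.2, 1.0, 1.2], [0.8, 0.9, 2.0],  # class "AML"
]
classes = ["ALL"] * 3 + ["AML"] * 3
accuracy = leave_one_out_accuracy(data, classes, n_genes=1)
print(accuracy)
```

Selecting genes once on the full data set and then cross-validating only the classifier would let information from the held-out samples leak into the model and make the accuracy estimate too optimistic.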
  • 105. Enrichment Analysis • What are major enriched GO terms? • What are the highly active pathways? • What are the frequently interacting proteins? • What are the known disease associations? Alistair Chalk, 2008
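A question such as "what are the major enriched GO terms?" is often answered with a hypergeometric (one-sided Fisher) test: given how many genes in the whole population carry a term, how surprising is the number found in the selected list? The slide does not name a test, so this is a sketch under that assumption; the counts and the function name are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, hits):
    """P(at least `hits` annotated genes) when `selected` genes are drawn
    without replacement from `population` genes, `annotated` of which carry
    the term of interest (hypergeometric upper tail)."""
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes measured, 5 annotated with the term,
# 5 genes in the selected list, 3 of which carry the term.
p_value = enrichment_p_value(population=20, annotated=5, selected=5, hits=3)
print(p_value)
```

When many terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.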
  • 106. Meta-analysis example: “Creation and implications of a phenome-genome network” Butte and Kohane. Nat Biotech. 2006
  • 107. Meta-analysis example: “Creation and implications of a phenome-genome network” Butte and Kohane. Nat Biotech. 2006 • Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus. • Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression. • “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”
  • 110. PPI ANNOTATION AND DATABASES
        Database   Reference                   URL
        MINT       (Zanoni et al., 2002)       http://mint.bio.uniroma2.it/mint
        IntAct     (Hermjakob et al., 2004)    http://www.ebi.ac.uk/intact
        DIP        (Xenarios et al., 2002)     http://dip.doe-mbi.ucla.edu/
        HPID       (Han et al., 2004)          http://www.hpid.org
        HPRD       (Peri et al., 2004)         http://www.hprd.org/
        • IMEx agreement to share curation efforts • Proteomics Standards Initiative (PSI) recommendation • Molecular Interaction (MI) Ontology • Large scale experiments • Literature curation
  • 112. Complex networks • Many systems can be represented as networks (graphs) – Nodes: individual component (proteins) – Edges: relationships (interactions) • They share common properties – Scale-free – Hierarchical – Clustering • Some properties may be intrinsic and can be understood better when putting into the context of evolution
  • 114. Summary: Network Measures • Degree ki The number of edges involving node i • Degree distribution P(k) The probability (frequency) of nodes of degree k • Mean path length The avg. shortest path between all node pairs • Network Diameter – i.e. the longest shortest path • Clustering Coefficient – A high CC is found for modules
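The measures summarized above can be computed directly on a small undirected graph. The sketch below uses plain Python and breadth-first search; the five-node network is hypothetical, and the path statistics assume the graph is connected.

```python
from collections import Counter, deque


def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_distribution(graph):
    """P(k): fraction of nodes that have degree k."""
    counts = Counter(len(neighbours) for neighbours in graph.values())
    return {k: n / len(graph) for k, n in counts.items()}


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2 * links / (k * (k - 1))


def shortest_paths_from(graph, source):
    """Breadth-first search distances from one node to all reachable nodes."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def path_statistics(graph):
    """(mean shortest path length, diameter) over all node pairs."""
    lengths = [
        d
        for source in graph
        for target, d in shortest_paths_from(graph, source).items()
        if target != source
    ]
    return sum(lengths) / len(lengths), max(lengths)


# Hypothetical interaction network: a triangle A-B-C with a tail C-D-E.
network = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
mean_path, diameter = path_statistics(network)
print(degree_distribution(network), clustering_coefficient(network, "C"), mean_path, diameter)
```

Node A sits inside the triangle and has a clustering coefficient of 1, while D on the tail has 0, which illustrates why a high coefficient is associated with modules.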
  • 115. Mapping the phenotypic data to the network •Systematic phenotyping of 1615 gene knockout strains in yeast •Evaluation of growth of each strain in the presence of MMS (and other DNA damaging agents) •Screening against a network of 12,232 protein interactions Begley TJ, Rosenbach AS, Ideker T, Samson LD. Damage recovery pathways in Saccharomyces cerevisiae revealed by genomic phenotyping and interactome mapping. Mol Cancer Res. 2002 Dec;1(2):103-12.
  • 117. The Role of Proteomics • The existence of an ORF does not imply the existence of a functional gene. • Limitations of comparative genomics. • mRNA levels may not correlate with protein levels. • Protein modifications: post-transcriptional modifications, isoforms, post-translational modifications, mutants. • Issues of proteolysis, sequestration, etc. relevant only at the protein level. • Protein complex composition, protein-protein interactions, structures.
  • 118. Structural proteomics • Folding • Structure and function • Protein structure prediction • Secondary structure • Tertiary structure • Function • Post-translational modification • Prot.-Prot. Interaction -- Docking algorithm • Molecular dynamics/Monte Carlo
  • 119. What kinds of methods are available? 5 main levels of protein structure prediction: 1. Extensive Sequence Search 2. Threading and 1D-3D profiles 3. Ab initio prediction of protein structure 4. Comparative Modelling 5. Docking (domain interaction prediction)
  • 121. Prediction of Protein Structures • Examples – a few good examples [Figure: four pairs of actual vs. predicted structures]
  • 123. MODPIPE: Large-Scale Comparative Protein Structure Modeling • START • For each target sequence: – Get profile for sequence (NR) with PSI-BLAST – Scan sequence profile against representative PDB chains – Scan PDB chain profiles against sequence – Select templates using permissive E-value cutoff • For each template structure: – Expand match to cover complete domains – Align matched parts of sequence and structure (MODELLER) – Build model for target segment by satisfaction of spatial restraints – Evaluate model • END R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998. N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali. 3/25/03
  • 124. Structural Proteomics: The Motivation* [Figure: number of sequences and number of structures, 1980 to 2005, plotted on two axes (0 to 2,000,000 and 0 to 200,000)]
  • 125. The hierarchies of protein structure
  • 126. Docking Programs • Dock (UCSF) • Autodock (Scripps) • Glide (Schrödinger) • ICM (Molsoft) • FRED (OpenEye) • Gold, FlexX, etc.
  • 127. Cell cycle network from KEGG
  • 128. Graphical Notation: a necessity for the conceptual representation of biopathways • Qualitative ↔ Mechanistic • Various degrees of detail, mixed level of presentation Aladjem et al., Science STKE pe8 (2004); Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol 7:131 (2006)
  • 129. Strategies: simulate or analyse? (or rather what to do first) • Simulate first: convert diagram into a quantitative model; simulate model behavior numerically; obtain qualitative understanding through numerical results and model reduction • Analyse first: qualitatively analyze network topology, stability, etc.; identify “elementary modes”; build and simulate a reduced model
  • 130. Space of modeling methods: continuous ↔ discrete [Diagram labels: stochsim, Boolean networks]
  • 131. Continuum of modeling approaches Top-down Bottom-up
  • 132. Frazier et al. (2003) Science 11 April Vol 300:290-293
  • 134. Nucleic Acids Research article lists 1078 public databases Nucleic Acids Research, 2008, Vol. 36, Database issue http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2
  • 135. Growth in Available Bioinformatics Databases
  • 136. Too much unintegrated data • Data sources incompatible • No (or few) standard naming conventions • No common interface (varying tools for browsing, querying and visualizing data)
  • 137. [Two-column comparison] Column 1: – Large experiments or large research groups/labs, possibly distributed – Large service provider institutes – Tightly coupled provider-consumer of resources – Commonly resource providers – Some or lots of access to sys admin. Column 2: – Small, isolated, independent groups/individuals – Loosely coupled provider-consumer of resources – Commonly resource consumers – Boutique suppliers – Poor access to systems admins
  • 138. Challenges: Names and Identity • Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor • GENE: Name=TNFRSF25 • Other names: WSL-1 protein; Apoptosis-mediating receptor DR3; Apoptosis-mediating receptor TRAMP; Death domain receptor 3; WSL protein; Apoptosis-inducing receptor AIR; Apo-3; Lymphocyte-associated receptor of death; LARD • Annotation history: Q92983 P78515 O00275 Q93036 O00276 Q93037 O00277 Q99722 O00278 Q99830 O00279 Q99831 O00280 Q9BY86 O14865 Q9UME0 O14866 Q9UME1 P78507 Q9UME5 • GUIDs, Life Science Identifier? Normalisation http://www.expasy.org/uniprot/Q93038
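The normalisation problem on this slide can be sketched as a lookup that maps every known name or accession to one primary identifier. The names and accessions below are a subset of those on the slide; the sketch assumes that the accessions listed under the annotation history are earlier identifiers of the same entry. The function names and the case-insensitive matching rule are hypothetical choices.

```python
PRIMARY_ACCESSION = "Q93038"

# Subset of the accessions and names shown on the slide.
# Assumption: the annotation-history accessions refer to the same entry.
EARLIER_ACCESSIONS = ["Q92983", "P78515", "O00275", "Q93036"]
NAMES = [
    "WSL-1 protein",
    "Apoptosis-mediating receptor DR3",
    "Death domain receptor 3",
    "Apo-3",
    "LARD",
    "TNFRSF25",
]


def build_index(primary, accessions, names):
    """Map every known identifier (upper-cased, trimmed) to the primary one."""
    index = {primary.upper(): primary}
    for key in list(accessions) + list(names):
        index[key.strip().upper()] = primary
    return index


def normalise(identifier, index):
    """Return the primary accession, or None if the identifier is unknown."""
    return index.get(identifier.strip().upper())


index = build_index(PRIMARY_ACCESSION, EARLIER_ACCESSIONS, NAMES)
print(normalise("lard", index), normalise("O00275", index), normalise("p53", index))
```

A flat dictionary works only when a name maps to a single entry; names shared by several proteins would need a one-to-many index and a disambiguation step.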
  • 140. Why must we support standards? • Unambiguous representation, description and communication – Final results and metadata • Interoperability – Data management and analysis • Integration of OMICS → systems biology
  • 141. What to standardize? • CONTENT: Minimal/Core Information to be reported – MIBBI (http://www.mibbi.org) • SEMANTIC: Terminology Used -> Ontologies – OBI (http://obi-ontology.org) • SYNTAX: Data Model, Data Exchange – FuGE (http://fuge.sourceforge.net/)
  • 142. MIBBI: Standard Content. “Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project”, Taylor et al., Nature Biotech.
  • 143. Link Integration: Integration Lite [Diagram: application interface, user interface, application, ontology authority, identity authority]
  • 144. Warehouse [Diagram: wrappers, unified model, data access and query, application, user interface] • Copy the data sets, clean and massage data into shape • Combine them into a (different) pre-determined model before query • ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART • Often called “Knowledge bases”
  • 145. View integration [Diagram: wrappers, unified model, data access and query, application, user interface] • Data at source; virtual integrating database view • Global as View / Local as View mappings between models • Map from model to databases dynamically so always fresh • TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE
  • 146. Specialist Integrating Application [Diagram: wrappers, application, user interface] • E.g. Ensembl, UTOPIA • Very popular. Known to be one application.
  • 147. Workflows [Diagram: wrapper, workflow engine, application, user interface] • Data flow protocol. Automated data chaining. • General technique for describing and enacting a process • Describes what you want to do, not how you want to do it • Various degrees of data type compliance anticipated
  • 148. Mash-Up [Diagram: data marshalling objects, protocols, mash-up application, user interface] • Content syndication and feeds • Emphasis on user creating specific integration by mapping • Just in time, just enough design • On-demand integration
  • 150. Semantic Web help? [Diagram: wrappers, access and query, application, user interface; semantic enrichment, model flattening, mapping transparency] • Slight problem: we have no first class metadata migration and management infrastructure, where metadata is outside the application and in the middleware, and we can handle progressive curation
  • 152. Service Oriented Architecture [Diagram: advanced search, retrieve data, submit data, submission, curation; web services (ws); dataflow; workflow]
  • 155. An Integrative Analysis Example [Diagram of analysis and visualization components, including relational data mining, text mining, a decision tree model of a metabonomic profile, and visualizations of serial/spectrum data, cluster statistics, sequence data, chemical structures, pathways, multidimensional data and relational data]
  • 156. From experiments to scientific publications • 1- Experiments: planning and carrying out experiments (lab work) • 2- Results: processing and interpretation of obtained results • 3- Scientific peer-reviewed articles: 'relevant' results are published in scientific journals
  • 157. PubMed/Medline database at NCBI - Developed at the National Center for Biotechnology Information (NCBI). - The core 'Textome'. - Repository of citation entries of scientific articles. - PubMed titles and abstracts are the primary data source for Bio-NLP. - ~ 450,000 new abstracts per year - > 4,800 biomedical journals - ENTREZ search engine
  • 158. Data in scientific articles • Scientific journals contain: free text (title, abstract, keywords, text body, references), tables, figures • Journal-specific information: format; paper structure (sections); article type • Biomedical literature characteristics: - Heavy use of domain-specific terminology (12% biochemistry-related technical terms). - Polysemic words (word sense disambiguation). - Most words with low frequency (data sparseness). - New names and terms created. - Typographical variants. - Different writing styles (native languages)
  • 162. BioCreative results TP: prediction evaluated as protein and GO terms correct Precision: TP / Total nr. of evaluated submissions 1: Chiang et al. 2: Couto et al. 3: Ehrler et al. 4: Ray et al. 5: Rice et al. 6: Verspoor et al.
  • 163. • Data Integration – Standards, DBs, Infrastructure • Knowledge Discovery – Algorithms, Informatics, Machine Learning • Integrate knowledge – Text mining, Ontologies • Modelling – Pathways, Circuits, Abstraction • Research Support
  • 164. The challenges of biology in the next 50 years • List all the molecular components that make up an organism: – Genes, proteins, and other functional elements • Understand the function of each component • Understand how they interact • Study how function has evolved • Find the genetic defects that cause disease • Design drugs and therapies rationally • Sequence the genome of every individual and use it in personalized medicine • Bioinformatics is an essential component for achieving all of these goals