Unison: Enabling easy, rapid, and
comprehensive proteomic mining
http://unison-db.org/
Online access, download, documentation, references.

Reece Hart
Genentech, Inc.

UCSF / SF PostgreSQL Users' Group
March 11, 2009
San Francisco, CA
                      Slides available at http://harts.net/reece/pubs/
A Bestiary of Life Sciences Data Types
Genomics: assemblies, transcripts, probes, trans. factors, expression, SNPs, haplotypes …
Proteomics: sequences, domains, PTMs, localization, structure, orthology, predictions, networks …
Annotation: GO, taxonomy, SCOP, disease, OMIM …
Chemistry: compounds, HCS, HTS, properties …
Clinical: assays, protocols, patient records, samples …
LIMS: animal records, protocols, request systems, personnel, samples …
Communications: literature, patents, and presentations …
Types of Integration

➢ Source Aggregation
     ● Aggregates data of the same type from multiple sources.
     ● Ensures completeness of data.
     [Diagram: RefSeq Sequences + UniProt Sequences → Sequences; In-house Structures + PDB Structures → Structures]

➢ Semantic Integration
     ● Integrates fundamentally distinct data types.
     ● Abstracts types to essential features.
     ● Improves contextual understanding of data.
     [Diagram: Sequences linked with Structures]
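
Source aggregation can be sketched in a few lines. This is only an illustration, not Unison's loader: the function name, the toy aliases, and the made-up sequences are assumptions. It merges same-type records from several origins into one non-redundant set and remembers every alias under which a sequence was seen.

```python
import hashlib
from collections import defaultdict

def aggregate_sources(sources):
    """Merge {origin: [(alias, sequence), ...]} into a non-redundant set.

    Returns (sequences, aliases): sequences maps a digest to the unique
    sequence, aliases maps the same digest to every (origin, alias) seen.
    """
    sequences = {}
    aliases = defaultdict(set)
    for origin, records in sources.items():
        for alias, seq in records:
            seq = seq.strip().upper()  # normalize before comparing
            key = hashlib.sha256(seq.encode()).hexdigest()
            sequences[key] = seq
            aliases[key].add((origin, alias))
    return sequences, dict(aliases)

# Toy records: hypothetical aliases and made-up sequences, not real entries.
sources = {
    "UniProt": [("P_EXAMPLE_1", "MSTESMIR"), ("P_EXAMPLE_2", "VRSSSRTP")],
    "RefSeq": [("NP_EXAMPLE_1", "mstesmir")],
}
sequences, aliases = aggregate_sources(sources)
```

Keying on the sequence itself (here via a digest) is what lets two origins contribute aliases to one record instead of creating two records.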
A Survey of Integration Methods

[Diagram: integration methods by tier]
Presentation: Link Integration (hypertext links between sites); Mashups (AJAX, iframe)
Middle Tier: Server Mashups
Database Integration (Federation / Warehouse): Federation (F); Warehouse (W)
Source Databases or Files: A, B, C

For review, see: Goble C, Stevens R. J Biomed Inform. 2008 Oct;41(5):687-93.
The Problems in a Nutshell
➢ Data integration is complex.
     ● Establishing semantic equivalences and relationships is difficult.
     ● Source database contents are updated often.

➢ Existing tools don't cut it.
     ● Licensing restrictions prevent sharing.
     ● Narrow in type of data and/or content, and not easily updated.
     ● Not specifically designed for mining.

➢ Scientists develop ad hoc integration solutions.
     ● Results are difficult to repeat.
     ● It wastes a lot of time.
     ● Questions don't get asked.
Unison in a Nutshell




[Diagram: Protein Sequences and Annotations at the center, connected to:
  Domain, Structure & Homology Predictions;
  Structures & Ligands;
  Genomes, Gene Mapping & Structure, Probes;
  Auxiliary Annotations (GO, RIF, SCOP, etc.)]

Sequences and Annotations: UniProt, IPI, Ensembl, RefSeq, PDB, PHANTOM, HUGE, ROUGE, MGC, Derwent, pataa, nr, etc.
    >13M seqs, >17k species, 69 origins
Auxiliary Data: HomoloGene, Gene Ontology, taxonomy, PDB, HUGO, SCOP, etc.
Precomputed predictions: Domains, homology, structure, TMs, localization, signals, disorder, etc.
    >200M predictions, 23 types, ~6 CPU-years
Unison Contents
[Schema diagram: data types linked to protein sequences, each with example values]

sequences:        >Unison:98 MSTESMIRDVE...FGIIAL
                  >Unison:23782 VRSSSRTPSD...FGIIAL
aliases:          TNFA_HUMAN, Q1XHZ6, IPI00001671.1, INCY:1109711.FL1p, CCDS4702.1, gi:25952111
patents:          Geneseq:AAP60074, 1991-10-29, SUNTORY, EP205038-A; New tumour...
HUGO:             TNFSF9, TNFSF10, TNFSF11
homologs:         NP_000585.2 NP_036807.1 | RAT
                  NP_000585.2 NP_038721.1 | MOUSE
                  NP_000585.2 XP_858423.1 | CANFA
GO:               Function: transcription (initiation, elongation)
SNPs:             P84L, A94T
Entrez:           gene_id, symbol, locus
taxonomy:         9606 Homo sapiens, 10090 Mus musculus, 10028 Rattus rattus
loci:             1 233 6+:31651498-31653288
genomes:          Hs35, Hs36, RAT
probes:           HGU133P, WHG
protein features:   1 |  23 |         | SS
                  108 | 143 | 1.8e-06 | EGF
                  162 | 184 |         | TM
                  133 | 138 |         | ITIM
alignments:       TNFA 1tnfA, TNFA 1tnfB, ..., TNFA 5tswF
aa-to-resid:      MSTESMIR, DVEFGIIA, TESMIRDV, IIAMDAC
structures:       1tnf, 1a8m, 2tun, 4tsv, 5tsw
SCOP:             all alpha, all beta, Ig, TNF-like, alpha+beta
Ex1: Mine for sequences w/conserved features.
[Schema diagram repeated from "Unison Contents".]
Ex2: Locate SNPs and domains on structure.
[Schema diagram repeated from "Unison Contents".]
Analysis and data mining have distinct needs.
[Diagram: prediction results sit at the intersection of sequences, feature types/models, and parameters]

sequences (source integration): non-redundant superset of all sequences
feature types/models (semantic integration): HMM, TM, signal, etc.
parameters: execution arguments/options for every prediction type and result
Prediction results: method-specific data such as score, e-value, p-value, kinase probability, etc.

Sequence Analysis
    i.e., show predictions for a given sequence.
    Typically involves minutes to hours of computing per sequence.
Feature-Based Mining
    i.e., show sequences that contain specified features.
    Typically entails days to months of computing results.
Mining for ITIMs the Old Way

[Diagram: protein with Ig, TM, and ITIM features in order]

➢ Collect sequences.
➢ Prune redundant sequences. (How?!)
➢ For each unique sequence, predict
     ● Immunoglobulin domains.
     ● Transmembrane domains.
     ● ITIM domains.
➢ Write a program that filters predictions.
➢ Summarize hits with external data.
➢ Do it again when source data are updated.

For Review: Daëron M Immunol Rev. 2008 Aug;224:11-43
Mining for ITIMs the Unison Way

[Diagram: protein with Ig, TM, and ITIM features in order]

SELECT IG.pseq_id,
       IG.start AS ig_start, IG.stop AS ig_stop, IG.score, IG.eval,
       TM.start AS tm_start, TM.stop AS tm_stop,
       ITIM.start AS itim_start, ITIM.stop AS itim_stop
  FROM pahmm_current_pfam_v IG
  JOIN pftmhmm_tms_v TM ON IG.pseq_id=TM.pseq_id   AND IG.stop<TM.start
  JOIN pfregexp_v ITIM  ON TM.pseq_id=ITIM.pseq_id AND TM.stop<ITIM.start
 WHERE IG.name='ig' AND IG.eval<1e-2
       AND ITIM.acc='MOD_TYR_ITIM';

         Ig     Ig                      TM     TM    ITIM   ITIM
pseq_id  start  stop  score  eval       start  stop  start  stop  best_annotation
    234    262   316     30  7.40E-06     440   462    518   523  UniProtKB/Swiss-Prot:SIGL5_HUMAN (RecNam
    254    158   213     36  1.90E-07     284   306    386   391  UniProtKB/Swiss-Prot:VSIG4_HUMAN (RecNam
    544    157   215     24  6.60E-04     348   370    431   436  UniProtKB/Swiss-Prot:SIGL9_HUMAN (RecNam
    797    254   312     40  7.60E-09    1099  1121   1361  1366  UniProtKB/Swiss-Prot:DCC_HUMAN (RecName
   1113     42   102     30  1.20E-05     243   265    300   305  UniProtKB/Swiss-Prot:KI2L2_HUMAN (RecNam
   1114     42   102     30  6.50E-06     243   265    330   335  UniProtKB/Swiss-Prot:KI2L1_HUMAN (RecNam
   1115     42   102     31  4.20E-06     243   265    301   306  UniProtKB/Swiss-Prot:KI2L3_HUMAN (RecNam
   1116     42    97     30  1.10E-05     339   361    396   401  UniProtKB/TrEMBL:Q95368_HUMAN (SubName
   1134    340   388     26  1.40E-04     603   625    688   693  UniProtKB/Swiss-Prot:PECA1_HUMAN (RecNam
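
The ordering constraint in that query (Ig before TM before ITIM on the same sequence) can be tried on toy data. This is a minimal sketch using Python's built-in sqlite3, not Unison itself: the single `feature` table, its columns, and the coordinates are hypothetical stand-ins for the three Unison views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (          -- hypothetical table, not Unison's schema
    pseq_id INTEGER, kind TEXT, start INTEGER, stop INTEGER
);
INSERT INTO feature VALUES
    (1, 'ig',  42, 102), (1, 'tm', 243, 265), (1, 'itim', 300, 305),
    (2, 'tm',  10,  32), (2, 'ig',  50, 110), (2, 'itim', 150, 155),
    (3, 'ig',  20,  80), (3, 'tm', 120, 142);
""")

# Self-join on pseq_id; each join condition enforces N-to-C order.
rows = conn.execute("""
    SELECT ig.pseq_id, ig.start, ig.stop,
           tm.start, tm.stop, itim.start, itim.stop
      FROM feature ig
      JOIN feature tm   ON tm.pseq_id = ig.pseq_id   AND ig.stop < tm.start
      JOIN feature itim ON itim.pseq_id = tm.pseq_id AND tm.stop < itim.start
     WHERE ig.kind = 'ig' AND tm.kind = 'tm' AND itim.kind = 'itim'
     ORDER BY ig.pseq_id
""").fetchall()
```

Only sequence 1 survives: sequence 2 has its TM before its Ig domain, and sequence 3 has no ITIM.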
Unison has many applications.
Applications built on Unison:
    Unison Web Tools
    Other In-House Tools
    Ad Hoc Mining (mining and analysis projects)

[Diagram repeated from "Unison in a Nutshell": Protein Sequences and Annotations at the center, connected to Domain, Structure & Homology Predictions; Structures & Ligands; Genomes, Gene Mapping & Structure, Probes; Auxiliary Annotations (GO, RIF, SCOP, etc.)]

Sequences and Annotations: UniProt, IPI, Ensembl, RefSeq, PDB, STRING, PHANTOM, HUGE, ROUGE, MGC, Derwent, pataa, nr, etc.
    >13M seqs, >17k species, 69 origins
Auxiliary Data: HomoloGene, Gene Ontology, taxonomy, PDB, HUGO, SCOP, etc.
Precomputed predictions: Domains, homology, structure, TMs, localization, signals, disorder, etc.
    >200M predictions, 23 types, ~6 CPU-years
Unison facilitates complex mining.




                             Jason Hackney
                             Nandini Krishnamurthy
                             Li Li
                             Yun Li
                             Jinfeng Liu
                             Shiu-ming Loh
                             Kiran Mukhyala
Data integration led to Bcl-2 discoveries.



Custom model building + sequences, models, HMM alignments, automation

Z'fish    Source Database           Human                          %         %
Protein   and Accession             Protein   E-value    Score  Identity  Coverage
Bax       RefSeq:NP_571637          BAX       2.00E-47    189      51        98
Bax2      E35:ENSDARP00000040899    BAX       1.00E-14     81      33        51
Bik       UP:Q5RGV6_BRARE           BIK       1.41E+04     20      47        12
Bmf       RefSeq:NP_001038689       BMF       1.00E-05     50      32        91
Bmf2      E35:FGENESH00000082230    BMF       1.10E-02     42      41        42
BBC3      E35:FGENESH00000078270    PUMA      2.10E+01     30      25        49

⇒ 4 novel Bcl-2 proteins in zebrafish

Kratz et al., Cell Death Differ. (2006).
Unison Web Tools




Structure Viewer with User Features!
http://unison-db.org/pseq_structure.pl?q=TNFA_HUMAN;userfeatures=Estrand@164-174,mysnp@170
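
A link like the one above can be assembled programmatically. The sketch below infers the `userfeatures` format (`name@start-stop`, or `name@position` for a single residue, comma-separated) from this one example URL only; the helper names are hypothetical, and no escaping of special characters is attempted.

```python
def userfeatures_param(features):
    """Format [(name, start, stop_or_None), ...] as name@start-stop,name@pos."""
    parts = []
    for name, start, stop in features:
        # A missing or equal stop means a single-residue feature.
        span = f"{start}" if stop is None or stop == start else f"{start}-{stop}"
        parts.append(f"{name}@{span}")
    return ",".join(parts)

def structure_url(query, features,
                  base="http://unison-db.org/pseq_structure.pl"):
    """Build a structure-viewer link carrying user-defined features."""
    return f"{base}?q={query};userfeatures={userfeatures_param(features)}"

url = structure_url("TNFA_HUMAN", [("Estrand", 164, 174), ("mysnp", 170, None)])
```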




Unison is a platform for diverse tools.




                                    Matt Brauer
                                    Guy Cavet
                                    Josh Kaminker
                                    Scott Lohr
                                    Kathryn Woods
                                    Jean Yuan
                                    Peng Yue
Unison Build Process

Phase   Task                 Time
0       Download             2d
1       Load Aux Data        4h
2       Load Sequences       2d
3       Update Sets          1h
4       Update Predictions   50 CPU-d
5       Update Mat Views     6h
6       Housekeeping         0

Makefile (phase 0)
    downloads all data
Makefile (phases 1 onward)
    loads auxiliary data
    loads sequences and annotations
        (in-house is just another source)
    updates sequence sets
    updates precomputed predictions
        (incremental update!)
    updates precomputed analyses and materialized views
    builds public database

➢ Runs in a cron job
➢ Requires ~10% of one person's time
➢ Consistent, reliable builds
Benefit Lessons

➢ Integrate to enable reasoning based on a corpus of data of multiple types and/or from multiple origins.
     ● To analyze biological data in broad context.
     ● To generate hypotheses by data mining.
     ● To enable business decisions based on a holistic view of decision criteria.

➢ Ancillary benefits:
     ● Data preparation is hard. Centralization means that questions get asked and asked efficiently.
     ● Integrated data provides a consistent foundation on which others can build.
     ● Integration improves currency.
Design Lessons

➢ Know what data to integrate, how they'll be used, and the converse.

➢ Integrate on simple, intuitively meaningful abstract concepts.
     ● Precise definitions are critical.
     ● Represent proprietary data elsewhere, if needed.

➢ Aggregate on data types.
     ● Corollary: Partitioning on content makes data silos.

➢ Design for Integrity.
     ● Reliability is everything.
Process Lessons

➢ Explicitly track the provenance of data.
     ● All data in Unison are tied to an origin: predictions, annotations, sequences, models.

➢ Plan for updates.
     ● Updates are completely automated and idempotent.

              idempotent
              i⋅dem⋅po⋅tent (/ˈaɪdəmˈpoʊtnt, ˈɪdəm-/)
              adj. [from mathematical techspeak] Acting as if used only once, even if used multiple times.

              idempotent. Dictionary.com. Jargon File 4.2.0.
              http://dictionary.reference.com/browse/idempotent (accessed: February 25, 2009).
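
Both lessons can be illustrated together. The following is a sketch under stated assumptions, not Unison's loader: the `pseq` and `alias` tables, the `ToySource` origin, and the made-up sequences are hypothetical, and SQLite stands in for PostgreSQL. Every alias row records its origin, and running the load twice leaves the database exactly as one run does.

```python
import sqlite3

def load_sequences(conn, origin, records):
    """Load (alias, sequence) pairs; safe to re-run with the same input."""
    conn.execute("CREATE TABLE IF NOT EXISTS pseq (seq TEXT PRIMARY KEY)")
    conn.execute("""CREATE TABLE IF NOT EXISTS alias (
                        origin TEXT, alias TEXT,
                        seq TEXT REFERENCES pseq(seq),
                        PRIMARY KEY (origin, alias))""")
    for alias, seq in records:
        # A sequence already present is skipped, not duplicated.
        conn.execute("INSERT OR IGNORE INTO pseq (seq) VALUES (?)", (seq,))
        # Re-loading an alias overwrites its row instead of adding another.
        conn.execute(
            "INSERT OR REPLACE INTO alias (origin, alias, seq) VALUES (?, ?, ?)",
            (origin, alias, seq))
    conn.commit()

def snapshot(conn):
    """Return the full database contents in a stable order."""
    return (
        conn.execute("SELECT seq FROM pseq ORDER BY seq").fetchall(),
        conn.execute(
            "SELECT origin, alias, seq FROM alias ORDER BY origin, alias"
        ).fetchall(),
    )

conn = sqlite3.connect(":memory:")
records = [("EXAMPLE_1", "MSTESMIR"), ("EXAMPLE_2", "VRSSSRTP")]
load_sequences(conn, "ToySource", records)
first = snapshot(conn)
load_sequences(conn, "ToySource", records)  # second run changes nothing
second = snapshot(conn)
```

Idempotence here comes from the primary keys plus conflict handling, so a cron-driven build can be restarted after a failure without any cleanup step.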
Other Lessons

➢ Design security from the start.
     ● The internal version of Unison uses Kerberos.
     ● Especially important in a world of distributed services and data.

➢ Include web services early in the design.
     ● (Oops, I blew it on this.)
A Few Reasons for PostgreSQL.

➢ Excellent support for server-side functions
     ● in PL/pgSQL, Perl, C, Java, Python, R, sh, ...
➢ Table inheritance
     ● Facilitates type abstraction
➢ GSSAPI/Kerberos support
     ● No password admin
     ● User identity all the way to the database
➢ psql rocks
➢ Pedantic and responsive development community
➢ Eases community adoption (?)
Kiran Mukhyala

Fernando Bazan, Matt Brauer, David Cavanaugh, Jason Hackney,
Pete Haverty, Ken Jung, Josh Kaminker, Nandini Krishnamurthy,
Li Li, Yun Li, Scott Lohr, Shiuh-ming Loh, Jinfeng Liu, Peng Yue,
Jianjun Zhang, Yan Zhang

Simran Hansrai, Marc Lambert, Dave Windgassen

A huge open science and open source community.

http://unison-db.org/
Open access web site, downloads, documentation, references, credits.

unison-db.org:5432
PostgreSQL & odbc/jdbc/sdbc access

“Are you sure about this Stan? It seems odd that a pointy head and a long beak is what makes them fly.”
J. Workman, Science 245:1399 (1989)
                                                                              25
26
Unison form follows function.
            Params/Models




Sequences   Results




                                   27
