2. Introduction
• An integrated, publicly accessible bioinformatics resource that supports genomic/proteomic research and scientific discovery.
• Established in 1984 by the National Biomedical Research Foundation (NBRF), located at Georgetown University Medical Center, Washington D.C., USA.
• A source of annotated protein databases and analysis tools for researchers.
• Serves as a primary resource for the exploration of protein information.
• Accessible by text search for entry and list retrieval, as well as by BLAST search and peptide match.
3. Features of PIR
• Comprehensive, non-redundant, annotated database containing protein sequences of prokaryotes, eukaryotes, viruses, phages and archaea.
• Data is well organized: entries are classified into protein families and super-families.
• The Protein Sequence Database (PSD) cross-references other genomic and proteomic public databases.
• Updated weekly; full releases are published quarterly.
• Provides cross-references between its own databases.
4. Database Organization and Annotation
• Database organization and annotation are based on structuring entries according to protein family relationships.
• According to these relationships, the database is structured at three levels:
1. Super-families and families for full-length sequence similarity
2. Homology domains for local functional and structural units
3. Motifs for functional and structural sites
5. Resources of PIR
The resources of PIR can be broadly classified into two
categories:
1. Data retrieval systems
2. Databases
6. Data Retrieval in PIR
Data retrieval in PIR consists of three types of search engines:
1. Interactive text-based search engine
• Boolean queries of text fields, where each condition evaluates to 0 (false) or 1 (true)
2. Standard sequence similarity search engines
• Peptide match
• Pattern match
• BLAST
• FASTA
• Pair-wise alignment
• Multiple alignment
3. Advanced search engines
• Combine sequence similarity and annotation searches
• Evaluation of gene-family relationships
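The Boolean text-field query and the peptide match listed above can be illustrated with a small in-memory sketch. This is not PIR's search engine or API: the `Entry` record, the `text_search` and `peptide_match` functions, and the three toy entries are all hypothetical names and data invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: str   # hypothetical identifier, not a real PIR accession
    name: str
    organism: str
    sequence: str


# Toy records invented for illustration only.
ENTRIES = [
    Entry("E1", "kinase alpha", "Homo sapiens", "MKTAYIAKQRQISFVKSHFSRQ"),
    Entry("E2", "kinase beta", "Mus musculus", "MKTAYLAKQRNISFVKAHFSRQ"),
    Entry("E3", "protease one", "Homo sapiens", "MSDNGPQNQRNAPRITFGGP"),
]


def text_search(entries, name_term=None, organism_term=None, op="AND"):
    """Boolean query over two text fields; each field test is true or false."""
    hits = []
    for e in entries:
        tests = []
        if name_term is not None:
            tests.append(name_term.lower() in e.name.lower())
        if organism_term is not None:
            tests.append(organism_term.lower() in e.organism.lower())
        if not tests:
            continue  # no search terms given, so nothing can match
        matched = all(tests) if op == "AND" else any(tests)
        if matched:
            hits.append(e.entry_id)
    return hits


def peptide_match(entries, peptide):
    """Exact peptide match: (entry_id, 1-based start) for every occurrence."""
    results = []
    for e in entries:
        start = e.sequence.find(peptide)
        while start != -1:
            results.append((e.entry_id, start + 1))
            start = e.sequence.find(peptide, start + 1)
    return results
```

For example, `text_search(ENTRIES, "kinase", "homo sapiens")` combines both field tests with AND, while `peptide_match(ENTRIES, "ISFVK")` reports every entry that contains the peptide exactly, together with its position.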
7. Databases of PIR
UniProt- Universal Protein Resource
• Formed by PIR + EBI (European Bioinformatics Institute) + SIB (Swiss Institute of Bioinformatics).
• A united protein database.
• Central resource of protein sequence and function.
8. UniProt- Universal Protein Resource
The UniProt database consists of the following three databases:
1. UniProt Knowledgebase (UniProtKB)
2. UniProt Reference Cluster (UniRef)
3. UniProt Archive (UniParc)
9. UniProt Knowledgebase (UniProtKB)
• Central database of protein sequences with annotation and functional information.
• Provides a single record for all protein products derived from a certain gene from a certain species.
• Gives details of accession number, alternative splicing, proteolytic cleavage and post-translational modifications for each form of the derived protein.
UniProtKB has 2 parts:
• UniProt/Swiss-Prot: contains manually annotated records.
• UniProt/TrEMBL: contains computationally analyzed records, which have yet to be manually annotated.
10. UniProt Reference Cluster (UniRef)
• Provides non-redundant data collections based on the UniProt Knowledgebase and UniParc to obtain complete coverage of sequence space at several resolutions.
3 separate datasets compress sequence space at different resolutions:
• Sequences that are 100% identical (UniRef100 database)
• Sequences that are >= 90% identical (UniRef90 database)
• Sequences that are >= 50% identical (UniRef50 database)
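The idea of compressing sequence space at the 100%, 90% and 50% identity levels can be sketched with a simple greedy clustering. This is only an illustration under strong assumptions, not the procedure UniRef actually uses: the sequences are assumed to be pre-aligned and of equal length, identity is a plain position-by-position comparison, and the first sequence placed in a cluster acts as its representative. The names `identity`, `greedy_cluster` and `SEQS` are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions; assumes pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence in the first cluster whose representative it matches
    at >= threshold identity; otherwise start a new cluster with it."""
    clusters = []  # list of (representative, members)
    for s in seqs:
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


# Toy sequences invented for illustration only.
SEQS = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAAAAAAC", "AAAAACCCCC", "GGGGGGGGGG"]

for level in (1.0, 0.9, 0.5):
    print(level, [len(members) for _, members in greedy_cluster(SEQS, level)])
```

Lowering the threshold merges more sequences into each cluster, which is why the lower-identity datasets are smaller than the 100% one.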
11. UniProt Archive (UniParc)
• Provides a stable, comprehensive, non-redundant sequence collection
by storing the complete body of publicly available protein sequence
data.
• When new or revised protein sequences are added, a UniParc sequence version is assigned or incremented, which makes it possible to track the history of sequence changes in all the source databases.
• To avoid redundancy, each unique sequence is assigned a unique identifier and is stored only once.
• The basic information stored with each UniParc entry is the identifier, the sequence, the cyclic redundancy check number, the source database with accession or version number, and a time stamp.
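The storage rules described above (one entry per unique sequence, a checksum, source cross-references with versions, and a time stamp) can be sketched as a small in-memory archive. This is a hedged illustration and not UniParc's implementation: the `SequenceArchive` class, the `ARC000001`-style identifier format and the use of `zlib.crc32` as the checksum are assumptions chosen for this example.

```python
import zlib
from datetime import datetime, timezone


class SequenceArchive:
    """Toy non-redundant archive: one entry per unique sequence."""

    def __init__(self):
        self._entries = {}  # sequence -> entry dict
        self._latest = {}   # (source_db, accession) -> latest cross-reference

    def add(self, sequence, source_db, accession):
        """Store the sequence once and record which source entry refers to it."""
        entry = self._entries.get(sequence)
        if entry is None:
            entry = {
                "id": f"ARC{len(self._entries) + 1:06d}",  # hypothetical ID format
                "sequence": sequence,
                # CRC32 is a stand-in checksum chosen for this sketch.
                "checksum": zlib.crc32(sequence.encode("ascii")),
                "xrefs": [],
            }
            self._entries[sequence] = entry
        key = (source_db, accession)
        last = self._latest.get(key)
        # A new cross-reference is recorded only when the source entry is new
        # or its sequence has changed; the version is then 1 or incremented.
        if last is None or last["entry_id"] != entry["id"]:
            xref = {
                "entry_id": entry["id"],
                "source_db": source_db,
                "accession": accession,
                "version": 1 if last is None else last["version"] + 1,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            entry["xrefs"].append(xref)
            self._latest[key] = xref
        return entry["id"]

    def version(self, source_db, accession):
        """Current sequence version of a source database entry."""
        return self._latest[(source_db, accession)]["version"]

    def __len__(self):
        return len(self._entries)
```

Adding the same sequence from two source databases returns the same identifier and leaves a single stored entry, while re-submitting an accession with a revised sequence creates a new entry and raises that accession's version.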
12. iProClass- Integrated Protein Knowledgebase
• Provides a comprehensive description of protein family, function and structure for UniProt protein sequences, and serves as a framework for data integration in a distributed networking environment.
• Contains non-redundant protein sequences from PIR-PSD, Swiss-Prot and TrEMBL.
iProClass describes:
• Family relationships, at the global level (superfamily, family) and the local level (domain, motif, site)
• Structural classifications
• Functional classifications
13. Types of Protein sequence reports
iProClass provides 2 types of reports:
1. The first type covers information on structure, function, family, genetics, disease, ontology, taxonomy and literature, with reference to relevant molecular databases.
2. The second type is a super-family report with length, taxonomy, keyword statistics and complete member listing.
14. PIRSF- Protein Family Classification System
• PIR extended its super-family concept and developed the Super-Family Classification system.
• It facilitates the sensible propagation and standardization of protein annotation and the systematic detection of annotation errors.
• Consists of two datasets: preliminary clusters and curated families.
• Curated families include family name, protein membership, parent-child relationship, domain architecture, and an optional description and bibliography.
15. iProLINK
Integrated Protein Literature INformation and Knowledge
• Provides annotated literature, a protein name directory and other information to facilitate text mining in the areas of literature-based database curation, protein ontology development and named entity recognition.