Major databases in bioinformatics


  1. What is a database?
     • A database is a convenient system for properly storing, searching, and retrieving any type of data.
     • A database makes it easy to handle and share large amounts of data, and it supports large-scale analysis through easy access and data updating.
  2. What is a biological database?
     • Biological databases are libraries of life sciences information collected from scientific experiments, published literature, high-throughput experimental technology, and computational analysis.
     • They contain information from fields such as genomics, proteomics, and microarray gene expression.
     • The information they hold includes gene function, structure, localization (both cellular and chromosomal), and biological sequences and structures.
  3. Database architecture
     [Diagram: the layers of an information system]
     • Query system: Google, Entrez, SRS (takes your search keywords)
     • Storage system: Oracle, MySQL, PC binary files, Unix text files, bookshelves
     • Data: GenBank flat file, PDB file, interaction record, title of a book, book
  4. A sequence retrieval and manipulation network
     [Diagram]
     • Databases (DNA): NCBI-GenBank, DDBJ, EBI-EMBL
     • Databases (protein): PIR, SWISS-PROT, ExPASy, PDB
     • Software: GCG, SeqWeb, Vector NTI, GenoMax
     • Formats: GenBank, GCG, FASTA, Staden, image
     • Retrieval systems: Entrez, SRS
     • Information: sequence, PDB, image
     • Sequence converter
  5. Types of biological databases
     • Primary databases
     • Secondary databases
  6. Primary databases
     These are the primary sources of data. They store nucleic acid sequences, protein sequences, and structural information about biological macromolecules. Some primary databases:
     • NCBI (National Center for Biotechnology Information)
     • GenBank
     • DDBJ (DNA Data Bank of Japan)
     • SWISS-PROT
     • PIR (Protein Information Resource)
     • PDB (Protein Data Bank)
     The sequence collections in these databases come from the efforts of basic research in academic, industrial, and sequencing labs.
  7. GenBank/EMBL/DDBJ: the International Nucleotide Sequence Database
     • IAM: International Advisory Meeting
     • ICM: International Collaborative Meeting
     • EMBL: European Molecular Biology Laboratory
     • EBI: European Bioinformatics Institute
     • DDBJ: DNA Data Bank of Japan
     • CIB: Center for Information Biology and DNA Data Bank of Japan
     • NIG: National Institute of Genetics
     • NCBI: National Center for Biotechnology Information
     • NLM: National Library of Medicine
  8. Secondary databases
     • A secondary database contains additional information derived from the analysis of data available in primary sources.
     • The data in secondary databases are analysed in a variety of ways and hold different information in different formats.
     • Some secondary databases: TrEMBL, Pfam, PROSITE, Profiles, SCOP, CATH
  9. Primary vs. secondary sequence databases
     [Diagram]
     • Primary: sequences submitted by sequencing centers and labs; updated ONLY by submitters
     • Secondary: derived by algorithms (UniGene), curators (RefSeq), and genome assembly; updated continually by NCBI
  10. Flat file storage data formats
      • By the time GenBank, EMBL, and DDBJ formed a collaboration (1986), sequence databases had moved to a defined flat file format with a shared feature table format and shared annotation standards.
      • The flat file formats from the sequence databases are still used to access and display sequences and annotation. They are also convenient for storing local copies.
  11. The National Center for Biotechnology Information (NCBI)
      Created in 1988 as a part of the National Library of Medicine at NIH (Bethesda, MD) to:
      • Establish public databases
      • Conduct research in computational biology
      • Develop software tools for sequence analysis
      • Disseminate biomedical information
  12. NCBI databases and services
      • GenBank: primary sequence database
      • Free public access to biomedical literature
        • PubMed: free MEDLINE access (3 million searches per day)
        • PubMed Central: full-text online access
      • Entrez: integrated molecular and literature databases
      • BLAST: highest-volume sequence search service (100-200 K searches per day)
      • VAST: structure similarity searches
      • Software and databases
  13. GenBank (Genetic Sequence Databank)
      • GenBank® is the genetic sequence database at the National Center for Biotechnology Information (NCBI).
      • It was established in 1982 and is now maintained by NCBI.
      • DNA sequences can be submitted to GenBank using several different methods.
      • It contains publicly available nucleotide sequences for more than 240,000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects.
  14. • GenBank has a flat file structure: each entry is an ASCII text file that is readable and downloadable by both humans and computers.
      • There are two main ways of making batch sequence submissions to GenBank: NCBI's Barcode Submission Tool (BarSTool) and Sequin.
  15. EMBL
      • The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 22 member states, four prospect member states, and two associate member states.
      • EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states.
      • The Laboratory operates from five sites: the main laboratory in Heidelberg, and outstations in Hinxton, England (the European Bioinformatics Institute, EBI), Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).
      • EMBL groups and laboratories perform basic research in molecular biology and molecular medicine and provide training for scientists, students, and visitors.
      • Israel is the only Asian state that has full membership.
      • The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI).
  16. • The database incorporates and distributes nucleotide sequences from public sources.
      • It is part of an international collaboration with DDBJ (Japan) and GenBank (USA).
      • Data are exchanged between the collaborating databases on a daily basis.
      • The web-based tool Webin is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data.
  17. • Automatic submission procedures are used for data from large-scale genome sequencing.
      • The latest data collection can be accessed via FTP, email, and WWW interfaces.
      • The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases.
      • For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database and other databases.
      • All available resources can be accessed via the EBI home page.
  18. EMBL format
      ID   LISOD      standard; DNA; PRO; 756 BP.
      XX
      AC   X64011; S78972;
      XX
      SV   X64011.1
      XX
      DT   28-APR-1992 (Rel. 31, Created)
      DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
      XX
      DE   L.ivanovii sod gene for superoxide dismutase
      XX
      KW   sod gene; superoxide dismutase.
      XX
      OS   Listeria ivanovii
      OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
      OC   Bacillus/Staphylococcus group; Listeria.
      XX
      RN   [1]
      RX   MEDLINE; 92140371.
      RA   Haas A., Goebel W.;
      RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
      RT   functional complementation in Escherichia coli and characterization of the
      RT   gene product.";
  19. (EMBL format, continued)
      RL   Mol. Gen. Genet. 231:313-322(1992).
      XX
      RN   [2]
      RP   1-756
      RA   Kreft J.;
      RT   ;
      RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
      RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
      RL   Hubland, 8700 Wuerzburg, FRG
      XX
      DR   SWISS-PROT; P28763; SODM_LISIV.
      XX
      FH   Key             Location/Qualifiers
      FH
      FT   source          1..756
      FT                   /db_xref="taxon:1638"
      FT                   /organism="Listeria ivanovii"
      FT                   /strain="ATCC 19119"
      FT   RBS             95..100
      FT                   /gene="sod"
      FT   terminator      723..746
      FT                   /gene="sod"
      FT   CDS             109..717
      FT                   /db_xref="SWISS-PROT:P28763"
      FT                   /transl_table=11
      FT                   /gene="sod"
      FT                   /EC_number="1.15.1.1"
      FT                   /product="superoxide dismutase"
      FT                   /protein_id="CAA45406.1"
      FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
      FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
      FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
      FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
      XX
      SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
           cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
           gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
           ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
           gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
           ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
           cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta       360
           ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca       420
           atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg       480
           gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt       540
  20. EMBL line codes
      • ID - Identification.
      • AC - Accession number(s).
      • DT - Date.
      • DE - Description.
      • GN - Gene name(s).
      • OS - Organism species.
      • OG - Organelle.
      • OC - Organism classification.
      • RN - Reference number.
      • RP - Reference position.
      • RC - Reference comments.
      • RX - Reference cross-references.
      • RA - Reference authors.
      • RL - Reference location.
      • CC - Comments or notes.
      • DR - Database cross-references.
      • KW - Keywords.
      • FT - Feature table data.
      • SQ - Sequence header.
      • (blanks) - Sequence data.
      • // - Termination line.
      Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
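The line codes above make an EMBL entry easy to read line by line. The following is a minimal Python sketch, not an official parser: the function name `parse_embl_entry` and the shortened `SAMPLE` entry are illustrative assumptions. It groups lines by their two-letter code, skips the `XX` spacer lines, collects the blank-coded sequence lines, and stops at the `//` terminator.

```python
from collections import defaultdict


def parse_embl_entry(text):
    """Group the lines of one EMBL flat-file entry by two-letter line code."""
    fields = defaultdict(list)
    sequence_parts = []
    for line in text.splitlines():
        code = line[:2]
        if code == "//":  # terminator line: end of the entry
            break
        if code == "XX" or not line.strip():  # spacer lines carry no data
            continue
        if code == "  ":  # blank code: sequence data under the SQ header
            # Keep letters only, dropping spaces and the trailing position count.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            fields[code].append(line[5:].strip())
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


# Shortened copy of the LISOD entry from the slides (hypothetical excerpt).
SAMPLE = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
//
"""

entry = parse_embl_entry(SAMPLE)
print(entry["AC"])             # accession line(s)
print(len(entry["sequence"]))  # number of bases collected
```

Because some line types repeat within an entry (for example DT, OC, or RL), each code maps to a list of lines rather than to a single string.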
  21. PubMed
      • PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life sciences and biomedical topics.
      • The PubMed system was offered free to the public in 1997.
      • The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.
      • A PMID is the unique identifier number used in PubMed.
  22. • A PMID is assigned to each article record when it enters the PubMed system.
      • The PMID is always found at the end of a PubMed citation.
      • PubMed Central (PMC) is a free digital archive of publicly accessible full-text scholarly articles published in the biomedical and life sciences journal literature.
      • A "PubMed Mobile" option, providing access to a mobile
  23. Entrez
      • WWW-based data retrieval system.
      • Developed by NCBI (National Center for Biotechnology Information).
      • Integrates information held in different databases.
  24. Databases covered by Entrez
      • Nucleic acid sequences: GenBank, RefSeq, PDB
      • Protein sequences: SWISS-PROT, PIR
      • 3D structures: MMDB
      • Genomes: many sources
      • PopSet: from GenBank
      • OMIM: OMIM
      • Taxonomy: NCBI taxonomy database
      • Books: Bookshelf
      • ProbeSet: GEO (Gene Expression Omnibus)
      • Literature: PubMed
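Entrez can also be queried programmatically through NCBI's E-utilities, which the slides do not cover. As a hedged sketch, the Python snippet below only builds an `esearch` request URL and makes no network call. The helper name `build_esearch_url` and the example search term are assumptions for illustration; `db`, `term`, and `retmax` are standard `esearch` parameters.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch endpoint (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for one Entrez database and one query term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


# Example: a literature search in PubMed (illustrative search term).
url = build_esearch_url("pubmed", "superoxide dismutase Listeria")
print(url)
```

Changing `db` to another Entrez database name, such as `nucleotide` or `protein`, points the same query at a different collection.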
  25. SRS
      • SRS is the Sequence Retrieval System.
        • Data retrieval tool developed by EBI
        • Integrates 80 molecular biology databases
        • Open source software (can be installed locally)
      • SRS has an associated scripting language called Icarus.
      • Central resource for molecular biology data
        • More than 250 databanks have been indexed.
        • More than 35 SRS servers are available worldwide over the WWW.
  26. • Information retrieval
        • Easy way to retrieve information from sequence and sequence-related databases
        • Possibility to search for multiple words or other criteria
        • Linkage between different databases, e.g. find all primary structures with a known three-dimensional structure
      • Different types of database in SRS
        • Sequence and structure: DNA, protein, three-dimensional structures
        • Sequence-related
        • Gene-related: genome, mapping, mutations, transcription factors
        • SNP
        • Bibliographic
  27. • SRS main toolbar tabs:
        • Top Page: displays databases in different database groups
        • Query: displays either the standard or the extended query form
        • Results, or "the query manager": maintains a history of all the results obtained during a session
        • Projects, or "the project manager": maintains a history of all queries and views used during a session
        • Views: allows a user to define a user-specific view for one or more databases
        • Databanks: contains a list of, and some facts about, the databases available in the system
  28. • Search terms in SRS
        SRS indexed fields can be searched using any of the following:
        • Single-word search
        • Multiple-word phrases
        • Numbers and dates
        • Regular expressions
        • Wildcards
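To show how wildcard and regular-expression search terms differ, here is a small Python sketch. It uses Python's own `fnmatch` and `re` pattern syntax, not the SRS query language, and the entry names in `ENTRY_NAMES` (apart from SODM_LISIV, which appears in the EMBL example) are illustrative assumptions.

```python
import re
from fnmatch import fnmatchcase

# Illustrative entry names; only SODM_LISIV comes from the slides.
ENTRY_NAMES = ["SODM_LISIV", "SODM_ECOLI", "SODC_HUMAN", "CATA_HUMAN"]


def wildcard_search(pattern, names):
    """Match names against a shell-style wildcard (* = any run of characters)."""
    return [name for name in names if fnmatchcase(name, pattern)]


def regex_search(pattern, names):
    """Match whole names against a regular expression."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


print(wildcard_search("SODM_*", ENTRY_NAMES))        # all SODM entries
print(regex_search(r"SOD[CM]_HUMAN", ENTRY_NAMES))   # SODC or SODM, human only
```

A wildcard is enough for a simple prefix match, while a regular expression can constrain individual positions in the term.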
  29. LocusLink
      • LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National Center for Biotechnology Information (NCBI) online resource.
      • It is principally intended for use by graduate students and professional researchers in the biomedical sciences.
      • It is designed to bring together related information on genetic loci and gene products from several sources.
      • LocusLink provides a central point of access for basic biomedical information and molecular data for genes, transcripts, and proteins from model organisms, including human, rat, mouse, fruit fly, and zebrafish.
      • It is no longer available at NCBI.
