Pau Corral Montañés

         RUGBCN
        03/05/2012
VIDEO:
“what we used to do in Bioinformatics“


note: due to memory space limitations, the video has not been embedded into
this presentation.

Content of the video:
A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done
interactively, leading to a single entry in the database related to Dengue Virus,
complete genome. Thanks to the facilities that the site provides, a FASTA file for
this complete genome is downloaded and thereafter opened with a text editor.

The purpose of the video is to show how tedious it is to access the site and
download an entry, which takes no fewer than 5 mouse clicks.
Unified records with in-house coding:
        The 3 databases share all sequences, but they use different
        accession numbers (IDs) to refer to each entry.
        They update every night.

        NCBI – ex.: #000134
        EMBL – ex.: #000012
        DDBJ – ex.: #002221

ACNUC:
        A "mirror" of the content of other databases.
        A series of commands and their arguments were defined that allow
        (1) database opening,
        (2) query execution,
        (3) annotation and sequence display,
        (4) annotation, species and keywords browsing, and
        (5) sequence extraction

[Diagram labels: NCBI, EMBL, DDBJ, ACNUC, Other Databases]

– NCBI - National Center for Biotechnology Information - (www.ncbi.nlm.nih.gov)
– EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
– DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
– ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
ACNUC retrieval programs:

     i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)

     ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)

     iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)


     a.I) Query_win – GUI client for remote ACNUC retrieval operations
                   (http://pbil.univ-lyon1.fr/software/query_win.html)
     a.II) raa_query – same functionality as Query_win in command line interface
                  (http://pbil.univ-lyon1.fr/software/query.html)
Why use seqinR:
the Wheel          - available packages
Hotline            - discussion forum
Automation         - same source code for different purposes
Reproducibility    - tutorials in vignette format
Fine tuning        - function arguments

Usage of the R environment!!



#Install the package seqinr_3.0-6 (April 2012)

>install.packages("seqinr")
package ‘seqinr’ was built under R version 2.14.2

A vignette:
http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
#Choose a mirror
>chooseCRANmirror()    #pick a mirror from the list that is offered
>install.packages("seqinr")

>library(seqinr)

#The command lseqinr() lists everything that is defined in the package:
>lseqinr()[1:9]

  [1]   "a"              "aaa"
  [3]   "AAstat"         "acnucclose"
  [5]   "acnucopen"      "al2bp"
  [7]   "alllistranks"   "alr"
  [9]   "amb"

>length(lseqinr())
[1] 209
How many different ways are there to work with biological sequences using SeqinR?

1)Sequences you have locally:

     i) read.fasta() and s2c() and c2s() and GC() and count() and translate()
     ii) write.fasta()
     iii) read.alignment() and consensus()

2) Sequences you download from a Database:

     i) browse Databases
     ii) query() and getSequence()
FASTA files example:
 Example with DNA data: (4 different characters, normally)
 Example with Protein data: (20 different characters, normally)

The FASTA format is very simple and widely used for simple import of biological
sequences. It begins with a single-line description starting with a character '>',
followed by lines of sequence data of at most 80 characters each. Lines starting
with a semicolon character ';' are comment lines.

Check Wikipedia for:
 i) Sequence representation
 ii) Sequence identifiers
 iii) File extensions
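
To make the format description concrete, here is a minimal base-R sketch that parses FASTA text by hand. The record name `toy_dna` and its bases are made up for illustration, and this is not how SeqinR itself reads files; it only mirrors the rules stated above ('>' opens a record, ';' marks a comment).

```r
# Toy FASTA text; the record name and the bases are hypothetical.
fasta_lines <- c(
  ">toy_dna hypothetical example record",
  "; a comment line",
  "agttgttagt",
  "ctacgtggac"
)

# Drop comment lines, then locate the description line(s).
fasta_lines <- fasta_lines[!startsWith(fasta_lines, ";")]
is_header   <- startsWith(fasta_lines, ">")

# Each sequence line belongs to the closest preceding description line.
record_id <- cumsum(is_header)
sequences <- vapply(
  split(fasta_lines[!is_header], record_id[!is_header]),
  paste, character(1), collapse = ""
)

# Use the first word of each description line (without '>') as the name.
names(sequences) <- sub(" .*$", "", sub("^>", "", fasta_lines[is_header]))

print(sequences)
```

The sketch assumes every record has at least one sequence line; a real parser would also need to handle empty records and blank lines.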
Read a file with read.fasta()

#Read the sequence from a local directory
> setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
> dir()
[1] "dengue_whole_sequence.fasta"

#Use the read.fasta function (see next slide) to load the sequence
> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
$`gi|9626685|ref|NC_001477.1|`
[1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
[19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
[37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
[55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
  [............................................................]
[10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
[10729] "a" "g" "g" "t" "t" "c" "t"
attr(,"name")
[1] "gi|9626685|ref|NC_001477.1|"
attr(,"Annot")
[1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
attr(,"class")
[1] "SeqFastadna"



> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
$`gi|9626685|ref|NC_001477.1|`
[1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"

> seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
> str(seq)
List of 1
 $ gi|9626685|ref|NC_001477.1|: chr
"agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
Usage:
read.fasta(file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE,
    set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE,
    bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong,
    endian = .Platform$endian, apply.mask = TRUE)

Arguments:
file - path to the FASTA file (relative to the working directory [getwd] if an absolute path is not given)
seqtype - the nature of the sequence: DNA or AA
as.string - if TRUE sequences are returned as a string instead of a vector of characters
forceDNAtolower - if TRUE, DNA sequences are returned in lower case
set.attributes - whether sequence attributes should be set
legacy.mode - if TRUE lines starting with a semicolon ';' are ignored
seqonly - if TRUE, only sequences are returned (execution time is divided approximately by a factor 3)
strip.desc - if TRUE, removes the '>' at the beginning of the description lines
bfa - if TRUE the fasta file is in MAQ binary format
sizeof.longlong - relative to MAQ files
endian - relative to MAQ files
apply.mask - relative to MAQ files

Value:
a list of vectors of characters (or of strings when as.string = TRUE)
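
As a small illustration of the `as.string` and `set.attributes` arguments, the sketch below writes a toy FASTA file to a temporary location and reads it back twice. The record name `toy1` and its bases are hypothetical, and the SeqinR calls only run if the package is installed.

```r
# Write a toy FASTA file to a temporary location (content is hypothetical).
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">toy1 hypothetical example record", "ACGTACGT", "ACGT"), fasta_file)

if (requireNamespace("seqinr", quietly = TRUE)) {
  # Default call: one vector of single characters per record, with attributes.
  as_chars <- seqinr::read.fasta(fasta_file, seqtype = "DNA")

  # One plain string per record, without attributes.
  as_text <- seqinr::read.fasta(fasta_file, seqtype = "DNA",
                                as.string = TRUE, set.attributes = FALSE)

  print(length(as_chars[["toy1"]]))
  print(as_text[["toy1"]])
} else {
  message("Package 'seqinr' is not installed; skipping the example.")
}
```

Spelling out `TRUE` and `FALSE` instead of `T` and `F` is a little safer, because `T` and `F` are ordinary variables that can be reassigned.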
Basic manipulations:
#Turn the sequence into characters and count how many there are
> length(s2c(seq[[1]]))
[1] 10735

#Count how many occurrences there are of each character
> table(s2c(seq[[1]]))
   a    c    g    t
3426 2240 2770 2299

#Count the fraction of G and C bases in the sequence
> GC(s2c(seq[[1]]))
[1] 0.4666977

#Count all possible words in a sequence with a sliding window of size = wordsize
> seq_2 <- "actg"
> count(s2c(seq_2), wordsize=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0

> count(s2c(seq_2), wordsize=2, by=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0

#Translate the nucleotide sequence into amino acids.
> c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
[1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-...
RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD
QRSCCLYSIIPGTERQKMEWCC*INRF"
#Note:
#    1)frame=0 means that the first nucleotide is taken as the first position in the codon
#    2)sens='F' means that the sequence is translated in the forward sense
#    3)numcode=1 means the standard genetic code is used (in SeqinR, up to 23 variations)
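
The GC fraction and the sliding-window word extraction shown above can be reproduced in base R. This is only a sketch to show the arithmetic: `sliding_words` is a hypothetical helper, not a SeqinR function, and unlike `count()` it returns the words themselves rather than a table over all possible words.

```r
# Base counts reported above for the Dengue virus 1 genome (NC_001477).
base_counts <- c(a = 3426, c = 2240, g = 2770, t = 2299)

# GC content is the share of g and c among all bases.
gc_fraction <- sum(base_counts[c("g", "c")]) / sum(base_counts)
print(round(gc_fraction, 7))

# Hypothetical helper: extract words of length `wordsize`, moving the start
# of the window forward by `by` positions each time.
sliding_words <- function(x, wordsize, by = 1) {
  stopifnot(nchar(x) >= wordsize)
  starts <- seq(1, nchar(x) - wordsize + 1, by = by)
  substring(x, starts, starts + wordsize - 1)
}

print(sliding_words("actg", wordsize = 2))          # overlapping windows
print(sliding_words("actg", wordsize = 2, by = 2))  # non-overlapping windows
```

With `by = 1` the word "ct" is seen, whereas with `by = 2` it is skipped, which matches the difference between the two `count()` tables above.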
Write a FASTA file (and how to save time)
> dir()
[1] "dengue_whole_sequence.fasta" "ER.fasta"
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33

> length(seq)
[1] 9984
> seq[[1]]
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
> names(seq)[1:5]
[1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"

#Clean the sequences keeping only the ones with at least 25 nucleotides
> char_seq = rapply(seq, s2c, how="list")       # create a list turning strings to characters
> len_seq = rapply(char_seq, length, how="list") # count how many characters are in each list
> bigger_25_list = c()
> for (val in 1:length(len_seq)){
+     if (len_seq[[val]] >= 25){
+         bigger_25_list = c(bigger_25_list, val)
+     }
+ }
> seq_25 = seq[bigger_25_list]            #indexing to get the desired list
> length(seq_25)
[1] 8928
> seq_25[1]
$ER0101A01F
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
#Write a FASTA file (see next slide)
> write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
> dir()
[1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
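
The length filter above can also be written without an explicit loop. The following is a sketch on a made-up list of reads (the names `short_read`, `long_read` and `edge_read` are hypothetical); it assumes, as in the session above, that each element is a single string.

```r
# Toy stand-in for the list returned by read.fasta(..., as.string = TRUE).
seqs <- list(
  short_read = "acgt",
  long_read  = strrep("acgt", 10),  # 40 nucleotides
  edge_read  = strrep("a", 25)      # exactly 25 nucleotides
)

# nchar() gives the length of each string; keep those with at least 25.
seq_lengths <- vapply(seqs, nchar, integer(1))
seqs_25 <- seqs[seq_lengths >= 25]

print(names(seqs_25))
```

Because `nchar()` works directly on strings, the intermediate conversion to character vectors with `s2c()` is not needed for this step.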
Usage:
write.fasta(sequences, names, file.out, open = "w", nbchar = 60)

Arguments:
sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
names - The name(s) of the sequences.
file.out - The name of the output file.
open - Open the output file, use "w" to write into a new file, use "a" to append at the end of an already existing file.
nbchar - The number of characters per line (default: 60)


Value:
none in the R space. A FASTA formatted file is created
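
To illustrate what the `nbchar` argument controls, here is a base-R sketch that wraps a sequence into lines of fixed width and writes a FASTA file by hand. `wrap_sequence` and the record name `toy_seq` are hypothetical; this is not the SeqinR implementation.

```r
# Hypothetical helper: cut a sequence string into chunks of `nbchar` characters.
wrap_sequence <- function(x, nbchar = 60) {
  starts <- seq(1, nchar(x), by = nbchar)
  substring(x, starts, pmin(starts + nbchar - 1, nchar(x)))
}

toy_seq  <- strrep("acgt", 40)  # 160 nucleotides
out_file <- tempfile(fileext = ".fasta")

# One description line followed by the wrapped sequence lines.
writeLines(c(">toy_seq", wrap_sequence(toy_seq, nbchar = 60)), out_file)

print(nchar(readLines(out_file)))
```

The last line is simply shorter when the sequence length is not a multiple of `nbchar`.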
Write a FASTA file (and how to save time)
# Remember what we did before
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33


# After cleaning seq, we produced a file "clean_seq.fasta" that we now read
> system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   1.30    0.01    1.31

#Time can be saved with the save function:
> save(seq_25, file = "ER_CLEAN_seqs.RData")

> system.time(load("ER_CLEAN_seqs.RData"))
   user system elapsed
   0.11    0.02    0.12
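
The save/load round trip can be tried on a small object. This sketch uses a made-up list `reads` and a temporary file, and loads into a separate environment so that nothing in the workspace is overwritten.

```r
# Toy object standing in for the cleaned list of sequences.
reads <- list(read_a = "acgtacgt", read_b = "ttgacca")

# save() stores the object under its own name in a binary .RData file.
rdata_file <- tempfile(fileext = ".RData")
save(reads, file = rdata_file)

# load() restores the object under the same name and returns that name.
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)

print(loaded_names)
print(identical(restored$reads, reads))
```

Note that `load()` recreates objects under the names they had when saved, so loading into the global environment silently replaces any object with the same name.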
Read an alignment (or create an alignment object to be aligned):
#Create an alignment object by reading a FASTA file with two sequences (see next slide)
> fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
> fasta
$nb
[1] 2
$nam
[1] "LmjF01.0030" "LinJ01.0030"
$seq
$seq[[1]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$seq[[2]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$com
[1] NA
attr(,"class")
[1] "alignment"

# The consensus() function builds a consensus sequence from the two aligned sequences. With method="IUPAC", positions where they differ are written with IUPAC ambiguity symbols.
> fixed_align = consensus(fasta, method="IUPAC")

> table(fixed_align)
fixed_align
  a   c   g   k   m  r    s   t   w    y
411 636 595   3   5 20   13 293   2   20
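
The letters k, m, r, s, w and y in the table are IUPAC codes for positions where the two sequences carry different bases. The base-R sketch below shows the idea for two aligned sequences of equal length; `pair_consensus` is a hypothetical helper, it ignores gaps, and it is not the SeqinR implementation.

```r
# IUPAC codes for two-base ambiguities, keyed by the alphabetically sorted pair.
iupac_pairs <- c(ag = "r", ct = "y", cg = "s", at = "w", gt = "k", ac = "m")

# Hypothetical helper: consensus of two aligned, gap-free, lower-case sequences.
pair_consensus <- function(seq1, seq2) {
  stopifnot(nchar(seq1) == nchar(seq2))
  b1 <- strsplit(seq1, "")[[1]]
  b2 <- strsplit(seq2, "")[[1]]
  # Sort each pair so that "ga" and "ag" map to the same key.
  keys <- ifelse(b1 < b2, paste0(b1, b2), paste0(b2, b1))
  # Identical bases are kept; differing bases get their ambiguity code.
  out <- ifelse(b1 == b2, b1, iupac_pairs[keys])
  paste(unname(out), collapse = "")
}

print(pair_consensus("acgt", "gcta"))
```

With more than two sequences, three-base and four-base codes (such as n) would also be needed; the sketch only covers the pairwise case seen here.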
Usage:
read.alignment(file, format, forceToLower = TRUE)

Arguments:
file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative
path, the file name is relative to the current working directory, getwd.
format - A character string specifying the format of the file: mase, clustal, phylip, fasta or msf
forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in
lower case



Value:
An object is created of class alignment which is a list with the following components:
nb ->the number of aligned sequences
nam ->a vector of strings containing the names of the aligned sequences
seq ->a vector of strings containing the aligned sequences
com ->a vector of strings containing the commentaries for each sequence or NA if there are no comments
Access a remote server:
> choosebank()
 [1] "genbank"       "embl"          "emblwgs"         "swissprot"   "ensembl"       "hogenom"
 [7] "hogenomdna"    "hovergendna"   "hovergen"        "hogenom5"    "hogenom5dna"   "hogenom4"
[13] "hogenom4dna"   "homolens"      "homolensdna"     "hobacnucl"   "hobacprot"     "phever2"
[19] "phever2dna"    "refseq"        "greviews"        "bacterial"   "protozoan"     "ensbacteria"
[25] "ensprotists"   "ensfungi"      "ensmetazoa"      "ensplants"   "mito"          "polymorphix"
[31] "emglib"        "taxobacgen"    "refseqViruses"

#Access a bank and see complementary information:
> choosebank(bank="genbank", infobank=T)
> ls()
[1] "banknameSocket"
> str(banknameSocket)
List of 9
 $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3
  .. ..- attr(*, "conn_id")=<externalptr>
 $ bankname: chr "genbank"
 $ banktype: chr "GENBANK"
 $ totseqs : num 1.65e+08
 $ totspecs: num 968157
 $ totkeys : num 3.1e+07
 $ release : chr "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
 $ status :Class 'AsIs' chr "on"
 $ details : chr [1:4] "              ****     ACNUC Data Base Content      ****                        " "
      GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170
sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive,
Universite Lyon I "
> banknameSocket$details
[1] "              ****     ACNUC Data Base Content      ****                        "
[2] "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
[3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
[4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
Make a query:
# query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
# that are not partial sequences
> system.time(query("All_Dengue_viruses_NOTpartial", "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\""))
   user system elapsed
   0.78    0.00    6.72

> All_Dengue_viruses_NOTpartial[1:4]
$call
query(listname = "All_Dengue_viruses_NOTpartial", query = "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"")
$name
[1] "All_Dengue_viruses_NOTpartial"
$nelem
[1] 7741
$typelist
[1] "SQ"

> All_Dengue_viruses_NOTpartial[[5]][[1]]
    name   length    frame   ncbicg
"A13666"    "456"      "0"      "1"

> myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
> myseq[1:20]
 [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"

> closebank()
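
Because the request contains double quotes of its own, it can be easier to assemble the query string separately before passing it to query(). This is a base-R sketch; the variable names are hypothetical and no connection to a bank is opened.

```r
# Criteria on the species field, using @ as the wildcard as in the query above.
species_terms <- c("@virus@", "@dengue@")

# Single quotes on the R side let the inner double quotes be typed unescaped.
species_part <- paste(sprintf('"sp=%s"', species_terms), collapse = " AND ")
acnuc_query  <- paste(species_part, 'AND NOT "k=partial"')

# cat() shows the string exactly as it would be sent.
cat(acnuc_query, "\n")
```

The resulting string could then be passed as the second argument of query() once a bank has been opened with choosebank().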
Usage:
query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)

Arguments:
listname - The name of the list as a quoted string of chars
query - A quoted string of chars containing the request with the syntax given in the details section
socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened
database).
invisible - if FALSE, the result is returned visibly.
verbose - if TRUE, verbose mode is on
virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req
component of the list is set to NA.

Value:
The result is a list with the following 6 components:

call - the original call
name - the ACNUC list name
nelem - the number of elements (for instance sequences) in the ACNUC list
typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for
a list of species names.
req - a list of sequence names that fit the required criteria, or NA when called with virtual = TRUE
socket - the socket connection that was used
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

SeqinR - biological data handling

  • 1. Pau Corral Montañés RUGBCN 03/05/2012
  • 2. VIDEO: "what we used to do in Bioinformatics". Note: due to memory space limitations, the video has not been embedded in this presentation. Content of the video: an interactive search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 leads to a single database entry, the complete genome of Dengue virus. Using the facilities the site provides, a FASTA file for this complete genome is downloaded and then opened with a text editor. The purpose of the video is to show how tedious it is to access the site and download an entry, which takes no fewer than 5 mouse clicks.
  • 3. Unified records with in-house coding: The 3 databases share all sequences, but they use different accession numbers (IDs) to refer to each entry. They update every night.
      NCBI - ex.: #000134
      EMBL - ex.: #000012
      DDBJ - ex.: #002221
      Other Databases
    ACNUC: A "mirror" of the content of other databases. A series of commands and their arguments were defined that allow
      (1) database opening,
      (2) query execution,
      (3) annotation and sequence display,
      (4) annotation, species and keywords browsing, and
      (5) sequence extraction
    Links:
      - NCBI - National Center for Biotechnology Information - (www.ncbi.nlm.nih.gov)
      - EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
      - DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
      - ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
  • 4.
  • 5. ACNUC retrieval programs:
      i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)
      ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)
      iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)
      a.I) Query_win - GUI client for remote ACNUC retrieval operations (http://pbil.univ-lyon1.fr/software/query_win.html)
      a.II) raa_query - same functionality as Query_win in a command line interface (http://pbil.univ-lyon1.fr/software/query.html)
  • 6. Why use seqinR:
      the Wheel - available packages
      Hotline - discussion forum
      Automation - same source code for different purposes
      Reproducibility - tutorials in vignette format
      Fine tuning - function arguments
      Usage of the R environment!!
    # Install the package seqinr_3.0-6 (April 2012)
    > install.packages("seqinr")
    package ‘seqinr’ was built under R version 2.14.2
    A vignette: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
  • 7. # Choose a mirror (called without arguments, R prompts for one interactively)
    > chooseCRANmirror()
    > install.packages("seqinr")
    > library(seqinr)
    # The command lseqinr() lists everything that is defined in the package:
    > lseqinr()[1:9]
    [1] "a"            "aaa"
    [3] "AAstat"       "acnucclose"
    [5] "acnucopen"    "al2bp"
    [7] "alllistranks" "alr"
    [9] "amb"
    > length(lseqinr())
    [1] 209
  • 8. How many different ways are there to work with biological sequences using SeqinR?
    1) Sequences you have locally:
      i) read.fasta() and s2c() and c2s() and GC() and count() and translate()
      ii) write.fasta()
      iii) read.alignment() and consensus()
    2) Sequences you download from a database:
      i) browse databases
      ii) query() and getSequence()
  • 9. FASTA files example: The FASTA format is very simple and widely used for simple import of biological sequences. It begins with a single-line description starting with a '>' character, followed by lines of sequence data of at most 80 characters each. Lines starting with a semicolon character ';' are comment lines.
    Example with DNA data: (4 different characters, normally)
    Example with Protein data: (20 different characters, normally)
    Check Wikipedia for: i) Sequence representation ii) Sequence identifiers iii) File extensions
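The format rules on this slide can be sketched in a few lines of base R. This is only an illustration of the layout (header lines starting with '>', comment lines starting with ';', sequence data spread over several lines); `parse_fasta()` is a hypothetical helper name, not a seqinr function, and the toy records are invented. For real files, read.fasta() remains the tool to use.

```r
# Minimal FASTA reader in base R (a sketch, not seqinr's read.fasta()).
# parse_fasta() is a hypothetical helper introduced for illustration.
parse_fasta <- function(lines) {
  # Drop blank lines and comment lines (those starting with ';').
  lines <- lines[nzchar(lines) & !grepl("^;", lines)]
  is_header <- grepl("^>", lines)
  if (!any(is_header)) stop("no '>' header line found")
  # Number the records: every header line starts a new record.
  record <- cumsum(is_header)
  keep <- record > 0            # ignore anything before the first header
  lines <- lines[keep]
  is_header <- is_header[keep]
  record <- record[keep]
  headers <- sub("^>", "", lines[is_header])
  # Concatenate the sequence lines of each record into one string.
  pieces <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_along(headers)))
  seqs <- tolower(vapply(pieces, paste, character(1), collapse = ""))
  # Use the first word of each description as the record name.
  names(seqs) <- sub(" .*$", "", headers)
  seqs
}

# Toy input, invented for this example.
example <- c(
  ">seq1 toy DNA record",
  "; a comment line",
  "ACTG",
  "ACTG",
  ">seq2 another toy record",
  "GGCC"
)
records <- parse_fasta(example)
print(records)
```

The sketch returns a named character vector, one string per record, which is roughly the shape read.fasta() gives with as.string = TRUE and set.attributes = FALSE.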
  • 10. Read a file with read.fasta()
    # Read the sequence from a local directory
    > setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
    > dir()
    [1] "dengue_whole_sequence.fasta"
    # Use the read.fasta function (see next slide) to load the sequence
    > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
    $`gi|9626685|ref|NC_001477.1|`
    [1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
    [19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
    [37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
    [55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
    [............................................................]
    [10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
    [10729] "a" "g" "g" "t" "t" "c" "t"
    attr(,"name")
    [1] "gi|9626685|ref|NC_001477.1|"
    attr(,"Annot")
    [1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
    attr(,"class")
    [1] "SeqFastadna"
    > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
    $`gi|9626685|ref|NC_001477.1|`
    [1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
    > seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
    > str(seq)
    List of 1
     $ gi|9626685|ref|NC_001477.1|: chr "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
  • 11. Usage:
    read.fasta(file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE, set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE, bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong, endian = .Platform$endian, apply.mask = TRUE)
    Arguments:
      file - path to the FASTA file (relative to getwd() if an absolute path is not given)
      seqtype - the nature of the sequence: DNA or AA
      as.string - if TRUE, sequences are returned as a string instead of a vector of characters
      forceDNAtolower - whether DNA sequences are returned in lower case
      set.attributes - whether sequence attributes should be set
      legacy.mode - if TRUE, lines starting with a semicolon ';' are ignored
      seqonly - if TRUE, only sequences are returned (execution time is divided approximately by a factor of 3)
      strip.desc - if TRUE, removes the '>' at the beginning
      bfa - if TRUE, the FASTA file is in MAQ binary format
      sizeof.longlong, endian - relevant to MAQ files
      apply.mask - relevant to MAQ files
    Value: a list of vectors of chars
  • 12. Basic manipulations:
    # Turn the sequence into characters and count how many there are
    > length(s2c(seq[[1]]))
    [1] 10735
    # Count the occurrences of each character
    > table(s2c(seq[[1]]))
       a    c    g    t
    3426 2240 2770 2299
    # Compute the fraction of G and C bases in the sequence
    > GC(s2c(seq[[1]]))
    [1] 0.4666977
    # Count all possible words in a sequence with a sliding window of size = wordsize
    > seq_2 <- "actg"
    > count(s2c(seq_2), wordsize=2)
    aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
     0  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0
    > count(s2c(seq_2), wordsize=2, by=2)
    aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
     0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0
    # Translate the genetic sequence into amino acids
    > c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
    [1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-...
    RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD
    QRSCCLYSIIPGTERQKMEWCC*INRF"
    # Note:
    # 1) frame=0 means that the first nucleotide is taken as the first position in the codon
    # 2) sens='F' means that the sequence is translated in the forward sense
    # 3) numcode=1 means the standard genetic code is used (in SeqinR, up to 23 variations)
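To make clear what GC() and count() compute, the same two quantities can be sketched in base R. The helper names `to_chars()`, `gc_fraction()` and `count_words()` are hypothetical and introduced only for this illustration; unlike seqinr's count(), this sketch reports only the words that actually occur rather than all possible words.

```r
# Split a string into single characters, in the spirit of seqinr's s2c().
to_chars <- function(s) strsplit(s, "")[[1]]

# Fraction of G and C among the bases of a character vector.
gc_fraction <- function(bases) {
  bases <- tolower(bases)
  sum(bases %in% c("g", "c")) / length(bases)
}

# Count words of length `wordsize`, moving the window `by` positions each step.
# Only words that occur are reported (zero counts are omitted).
count_words <- function(bases, wordsize, by = 1) {
  stopifnot(length(bases) >= wordsize)
  starts <- seq(1, length(bases) - wordsize + 1, by = by)
  words <- vapply(starts,
                  function(i) paste(bases[i:(i + wordsize - 1)], collapse = ""),
                  character(1))
  table(words)
}

seq_2 <- to_chars("actg")
gc_toy <- gc_fraction(seq_2)                      # 2 of the 4 bases are g or c
overlapping <- count_words(seq_2, wordsize = 2)   # windows: ac, ct, tg
stepped <- count_words(seq_2, wordsize = 2, by = 2)  # windows: ac, tg
print(gc_toy)
print(overlapping)
print(stepped)
```

With by = 2 the window jumps two positions at a time, so the word "ct" that straddles the two windows is no longer counted, which matches the difference between the two count() calls on the slide.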
  • 13. Write a FASTA file (and how to save time)
    > dir()
    [1] "dengue_whole_sequence.fasta" "ER.fasta"
    > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
    user system elapsed
    3.28 0.00 3.33
    > length(seq)
    [1] 9984
    > seq[[1]]
    [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
    > names(seq)[1:5]
    [1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"
    # Clean the sequences, keeping only the ones with at least 25 nucleotides
    > char_seq = rapply(seq, s2c, how="list") # create a list turning strings into characters
    > len_seq = rapply(char_seq, length, how="list") # count how many characters are in each element
    > bigger_25_list = c()
    > for (val in 1:length(len_seq)){
    +   if (len_seq[[val]] >= 25){
    +     bigger_25_list = c(bigger_25_list, val)
    +   }
    + }
    > seq_25 = seq[bigger_25_list] # indexing to get the desired list
    > length(seq_25)
    [1] 8928
    > seq_25[1]
    $ER0101A01F
    [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
    # Write a FASTA file (see next slide)
    > write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
    > dir()
    [1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
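The length filter on this slide can also be written without an explicit loop. The sketch below assumes the sequences are held as strings in a named list, as read.fasta() returns them with as.string = TRUE; the list contents and names here are invented stand-ins, not data from ER.fasta.

```r
# Toy stand-in for the list returned by read.fasta(..., as.string = TRUE);
# the names and sequences are invented for illustration.
seqs <- list(
  short_read = "acgtacgt",
  long_read  = "gtggtatcaacgcagagtacgcggggacg",
  edge_read  = strrep("a", 25)
)

# nchar() is vectorised, so the length of every sequence comes from one call.
lengths_nt <- nchar(unlist(seqs))

# Logical indexing keeps the sequences with at least 25 nucleotides.
seqs_25 <- seqs[lengths_nt >= 25]
print(names(seqs_25))
```

This avoids growing a vector inside a for loop and skips the intermediate conversion of every string into single characters.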
  • 14. Usage:
    write.fasta(sequences, names, file.out, open = "w", nbchar = 60)
    Arguments:
      sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
      names - The name(s) of the sequences.
      file.out - The name of the output file.
      open - Mode in which to open the output file: use "w" to write into a new file, use "a" to append at the end of an already existing file.
      nbchar - The number of characters per line (default: 60)
    Value: none in the R space. A FASTA formatted file is created.
  • 15. Write a FASTA file (and how to save time)
    # Remember what we did before
    > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
    user system elapsed
    3.28 0.00 3.33
    # After cleaning seq, we produced a file "clean_seq.fasta" that we read now
    > system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
    user system elapsed
    1.30 0.01 1.31
    # Time can be saved with the save function:
    > save(seq_25, file = "ER_CLEAN_seqs.RData")
    > system.time(load("ER_CLEAN_seqs.RData"))
    user system elapsed
    0.11 0.02 0.12
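The save()/load() round trip can be checked on a small object before relying on it for real data. The sketch below uses an invented one-element list and a temporary file, and loads into a separate environment so that nothing in the workspace is overwritten; timings are not shown because they depend on the machine and the data.

```r
# Invented stand-in for the cleaned sequence list.
seq_25 <- list(ER_example = "gtggtatcaacgcagagtacgcgggg")

# Save the object under its own name into a temporary .RData file.
rdata_path <- tempfile(fileext = ".RData")
save(seq_25, file = rdata_path)

# load() restores objects under the names they were saved with;
# a fresh environment keeps them apart from the current workspace.
restored <- new.env()
loaded_names <- load(rdata_path, envir = restored)

print(loaded_names)
print(identical(restored$seq_25, seq_25))
```

load() returns the names of the restored objects, which is a convenient way to see what an .RData file contains.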
  • 16. Read an alignment (or create an alignment object to be aligned):
    # Create an alignment object by reading a FASTA file with two sequences (see next slide)
    > fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
    > fasta
    $nb
    [1] 2
    $nam
    [1] "LmjF01.0030" "LinJ01.0030"
    $seq
    $seq[[1]]
    [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
    $seq[[2]]
    [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
    $com
    [1] NA
    attr(,"class")
    [1] "alignment"
    # The consensus() function computes a consensus sequence from the two aligned sequences. IUPAC symbols are used.
    > fixed_align = consensus(fasta, method="IUPAC")
    > table(fixed_align)
    fixed_align
      a   c   g   k   m   r   s   t   w   y
    411 636 595   3   5  20  13 293   2  20
  • 17. Usage:
    read.alignment(file, format, forceToLower = TRUE)
    Arguments:
      file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, getwd.
      format - A character string specifying the format of the file: mase, clustal, phylip, fasta or msf
      forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in lower case
    Value: An object of class alignment is created, which is a list with the following components:
      nb -> the number of aligned sequences
      nam -> a vector of strings containing the names of the aligned sequences
      seq -> a vector of strings containing the aligned sequences
      com -> a vector of strings containing the commentaries for each sequence or NA if there are no comments
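The idea behind an IUPAC consensus can be illustrated in base R: each column of the alignment is replaced by the IUPAC symbol that stands for the set of bases seen in that column. This is a simplified sketch under stated assumptions, not seqinr's consensus() implementation: `iupac_consensus()` is a hypothetical helper, the toy sequences are invented, every base present in a column is counted regardless of frequency, and gap characters are not handled.

```r
# IUPAC nucleotide codes, keyed by the sorted set of bases they represent.
iupac <- c(a = "a", c = "c", g = "g", t = "t",
           ag = "r", ct = "y", cg = "s", at = "w", gt = "k", ac = "m",
           cgt = "b", agt = "d", act = "h", acg = "v", acgt = "n")

# Hypothetical helper: consensus of equally long, already aligned sequences.
iupac_consensus <- function(aligned) {
  stopifnot(length(unique(nchar(aligned))) == 1)
  # One row per sequence, one column per alignment position.
  columns <- do.call(rbind, strsplit(tolower(aligned), ""))
  # Build the lookup key for each column from its distinct bases.
  keys <- apply(columns, 2, function(col) paste(sort(unique(col)), collapse = ""))
  unname(iupac[keys])
}

# Toy alignment, invented for this example.
toy <- c("acgt", "aagt", "acga")
toy_consensus <- iupac_consensus(toy)
print(toy_consensus)
```

In the toy alignment the second column contains both "c" and "a", which IUPAC writes as "m", and the fourth contains "t" and "a", written as "w"; symbols such as k, m, r, s, w and y in the table on the previous slide arise the same way.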
  • 18. Access a remote server:
    > choosebank()
    [1] "genbank" "embl" "emblwgs" "swissprot" "ensembl" "hogenom"
    [7] "hogenomdna" "hovergendna" "hovergen" "hogenom5" "hogenom5dna" "hogenom4"
    [13] "hogenom4dna" "homolens" "homolensdna" "hobacnucl" "hobacprot" "phever2"
    [19] "phever2dna" "refseq" "greviews" "bacterial" "protozoan" "ensbacteria"
    [25] "ensprotists" "ensfungi" "ensmetazoa" "ensplants" "mito" "polymorphix"
    [31] "emglib" "taxobacgen" "refseqViruses"
    # Access a bank and see complementary information:
    > choosebank(bank="genbank", infobank=T)
    > ls()
    [1] "banknameSocket"
    > str(banknameSocket)
    List of 9
     $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3
     .. ..- attr(*, "conn_id")=<externalptr>
     $ bankname: chr "genbank"
     $ banktype: chr "GENBANK"
     $ totseqs : num 1.65e+08
     $ totspecs: num 968157
     $ totkeys : num 3.1e+07
     $ release : chr " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
     $ status :Class 'AsIs' chr "on"
     $ details : chr [1:4] " **** ACNUC Data Base Content **** " " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
    > banknameSocket$details
    [1] " **** ACNUC Data Base Content **** "
    [2] " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
    [3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
    [4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
  • 19. Make a query:
    # Query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
    # that are not partial sequences. The request is wrapped in single quotes so that it can contain double quotes.
    > system.time(query("All_Dengue_viruses_NOTpartial", '"sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"'))
    user system elapsed
    0.78 0.00 6.72
    > All_Dengue_viruses_NOTpartial[1:4]
    $call
    query(listname = "All_Dengue_viruses_NOTpartial", query = "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"")
    $name
    [1] "All_Dengue_viruses_NOTpartial"
    $nelem
    [1] 7741
    $typelist
    [1] "SQ"
    > All_Dengue_viruses_NOTpartial[[5]][[1]]
        name   length    frame   ncbicg
    "A13666"    "456"      "0"      "1"
    > myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
    > myseq[1:20]
    [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"
    > closebank()
  • 20. Usage:
    query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)
    Arguments:
      listname - The name of the list as a quoted string of chars
      query - A quoted string of chars containing the request with the syntax given in the details section
      socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened database).
      invisible - if FALSE, the result is returned visibly.
      verbose - if TRUE, verbose mode is on
      virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA.
    Value: The result is a list with the following 6 components:
      call - the original call
      name - the ACNUC list name
      nelem - the number of elements (for instance sequences) in the ACNUC list
      typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names.
      req - a list of sequence names that fit the required criteria, or NA when called with virtual = TRUE
      socket - the socket connection that was used
  • 21. Pau Corral Montañés RUGBCN 03/05/2012