Use of bio-informatic tools in bacterial genetics

USE OF BIO-INFORMATIC TOOLS TO
STUDY IMPLICATIONS OF G-C
CONTENT OF DNA ON THE PROTEIN.

DEBTANU CHAKRABORTY

Index
1) Note of Acknowledgement
2) Bio-informatics
3) G-C content
4) Classification tree of Bacteria
5) List of low G-C bacteria
6) List of high G-C bacteria
7) Introduction to Carbonic Anhydrase
8) Peptide Sequence and their analysis
9) Gene Sequences and their analysis
10) Codon usage plot
11) Conclusion
12) Future work-scope

Note of Acknowledgement
The project would have been incomplete without the help of a number of people. First, I
would like to thank my mentor and guide, Prof. Chanchal K. Das Gupta, who gave me the
idea and inspiration for the project and helped me at every step whenever I was in
trouble. I would like to thank Prof. Punyasloke Bhadury, who introduced me to the
NCBI website and showed me how to perform tasks such as alignment and BLAST online.

I would be remiss if I did not mention my superiors Papri di, Amit da and
Shimonti di, who also helped me with the project.

In my work I have made extensive use of the NCBI and UniProt websites.

Bioinformatics is the application of statistics and computer science to the field
of molecular biology.
The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of
informatic processes in biotic systems. Its primary use since at least the late 1980s has
been in genomics and genetics, particularly in those areas of genomics involving large-
scale DNA sequencing.

Bioinformatics now entails the creation and advancement of databases, algorithms,
computational and statistical techniques and theory to solve formal and practical
problems arising from the management and analysis of biological data.

Over the past few decades, rapid developments in genomic and other molecular
research technologies, together with developments in information technology, have
produced a tremendous amount of information related to molecular biology.
Bioinformatics is the name given to the mathematical and computing approaches used
to glean an understanding of biological processes from this information.

Common activities in bioinformatics include mapping and analyzing DNA and protein
sequences, aligning different DNA and protein sequences to compare them and
creating and viewing 3-D models of protein structures.

The primary goal of bioinformatics is to increase the understanding of biological
processes. What sets it apart from other approaches, however, is its focus on
developing and applying computationally intensive techniques (e.g., pattern
recognition, data mining, machine learning algorithms, and visualization) to achieve this
goal. Major research efforts in the field include sequence alignment, gene
finding, genome assembly, drug design, drug discovery, protein structure
alignment, protein structure prediction, prediction of gene expression and protein-
protein interactions, genome-wide association studies and the modeling of evolution.

GC-content (or guanine-cytosine content), in molecular biology, is the percentage
of nitrogenous bases on a molecule which are either guanine or cytosine (from a
possibility of four different ones, also including adenine and thymine). This may refer to
a specific fragment of DNA or RNA, or that of the whole genome. When it refers to a
fragment of the genetic material, it may denote the GC-content of part of a gene
(domain), single gene, group of genes (or gene clusters) or even a non-coding region. G
(guanine) and C (cytosine) undergo a specific hydrogen bonding whereas A (adenine)
bonds specifically with T (thymine).
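The definition above can be written as a short calculation. The following is a minimal Python sketch; the function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for illustration, not part of the original analysis.

```python
def gc_content(sequence: str) -> float:
    """Return the GC-content of a nucleotide sequence as a percentage."""
    seq = sequence.upper()
    # Count only unambiguous bases, so placeholders such as 'N' are ignored
    # (an assumption of this sketch).
    counted = [base for base in seq if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, T or U bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


if __name__ == "__main__":
    # Works equally for a gene fragment or a whole genome string.
    print(gc_content("atgaaaaagacaacgtgggtattagcgatg"))
```

The same function can be applied to a gene fragment, a single gene or a whole genome, since it only depends on the string passed in.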

The GC pair is bound by three hydrogen bonds, while the AT pair is bound by two
hydrogen bonds. DNA with high GC-content is more stable than DNA with low
GC-content, but contrary to popular belief, the hydrogen bonds do not stabilize the DNA
significantly; stabilization is mainly due to stacking interactions. In spite of the
higher thermostability conferred to the genetic material, it is envisaged that cells with
DNA with high GC-content undergo autolysis, thereby reducing the longevity of the cell
per se. Due to the robustness endowed to the genetic material in high-GC organisms, it
was commonly believed that GC-content played a vital part in adaptation to
temperature, a hypothesis which has recently been refuted.

In PCR experiments, the GC-content of a primer is used to predict its annealing
temperature to the template DNA. A higher GC-content indicates a higher melting
temperature.
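As a rough illustration of how GC-content feeds into a melting temperature estimate, the sketch below uses the Wallace rule of thumb for short primers, Tm = 2(A+T) + 4(G+C) in degrees Celsius. This rule is swapped in here as an example; the report does not state which formula is meant, and the function name `wallace_tm` is hypothetical.

```python
def wallace_tm(primer: str) -> int:
    """Estimate the melting temperature (deg C) of a short primer.

    Uses the Wallace rule of thumb: 2 degrees per A/T and 4 degrees per G/C.
    It is only a coarse estimate, intended for short oligonucleotides.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # A GC-rich primer gets a higher estimate than an AT-rich one of equal length.
    print(wallace_tm("GGCCGGCC"), wallace_tm("AATTAATT"))
```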

THE EVOLUTIONARY TREE OF BACTERIA, WHERE G-C CONTENT STUDY IS AN
ANALYTICAL TOOL.

The guanine plus cytosine (GC) content in bacteria ranges from ~20% to 75%, whereas
eukaryotic genomes have GC contents that often fall within a more restricted range
of ~35-50% (about 40%-45% in vertebrates).

Some Bacteria with low G-C content -

Some Bacteria with high G-C content-

For our convenience, we chose Carbonic Anhydrase because it is present in all bacteria
across the G-C content spectrum-

The carbonic anhydrases (or carbonate dehydratases) form a family of enzymes that
catalyze the rapid conversion of carbon dioxide and water to bicarbonate and protons, a
reaction that occurs rather slowly in the absence of a catalyst.[1] The active site of most
carbonic anhydrases contains a zinc ion; they are therefore classified
as metalloenzymes.

THE CARBONIC ANHYDRASE PROTEIN-

In our analysis, we chose the following bacteria-

1) Methanococcus voltae A3 (UI-A8TF20) (G-Cc=27%)
2) Staphylococcus carnosus (UI-B9DMU8_STACT) (G-Cc=34%)
3) Vibrio cholerae (UI-Q9KMP6_VIBCH) (G-Cc=47%)
4) Escherichia coli (UI-P61517) (G-Cc=50%)
5) Truepera radiovictrix DSM1703 (UI-ADI14363) (G-Cc=68.2%)
6) Salinispora arenicola (UI-A8MOD8) (G-Cc=69.2%)
7) Frankia CcI (UI-Q2JF50) (G-Cc=71%)

*UI stands for the Uniprot Accession number of the Carbonic Anhydrase protein
of the respective bacteria.

We begin analyzing the protein Carbonic Anhydrase from these bacteria-

The peptide sequence goes as follows-

>Methanococcus voltae Carbonic Anhydrase Protein
LN*LFNLASVNVNHKPFNFHIFRNCRVIFD*FDTFQHVFFFVIHFTHPSFKVWRKVWIYS
SFNHFFSYLFNICSCHSTVGMTYDSYLFNI*TVYCNY*RP*YIVCNNITCVFDDFCVASF
*THFFR*EIYESCIHTSYYC*FLFRFGFCSDSFTYTQ

>Staphylococcus carnosus Carbonic Anhydrase Protein
YPXXXMTLLESILAYNKDFVGNKEFENYTTSKKPDKKAVLFTCMDTRLQDLGTKALGFNN
GDLKVVKNAGAIITHPYGSTIKSLLVGIYALGAEEIIIMAHKDCGMGCLDVSTVKDAMKE
RGVTEETFKIIEHSGVDVDSFLQGFKDAEENVRRNIDMVYNHPLFDKSVPIHGLVIDPHT
GELDLIQDGYELAAQNK*

>Vibrio cholerae Carbonic Anhydrase Protein
MKKTTWVLAMVASMSFGVQASEWGYEGEHAPEHWGKVAPLCAEGKNQSPIDVAQSVEADL
QPFTLNYQGQVVGLLNNGHTLQAIVRGNNPLQIDGKTFQLKQFHFHTPSENLLKGKQFPL
EAHFVHADEQGNLAVVAVMYQVGSENPLLKVLTADMPTKGNSTQLTQGIPLADWIPESKH
YYRFNGSLTTPPCSEGVRWIVLKEPAHLSNQQEQQLSAVMGHNNRPVQPHNARLVLQAD*

>1st Escherichia coli Carbonic Anhydrase Protein
LFVVGVFQLEVGDPVTVTLLKGFAVSRCDIQITQQAVVNAVGPAVNGDFLPAFPR*LHNS
GVAQVIHLFHDVQFTQGIQTALLRHFAEQ*AMFEPDIADMQQPVVDKPQFRVFNCGLYAA
ATVV

>2nd Escherichia coli Carbonic Anhydrase Final rip
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGEL
FVHRNVANLVIHINNWLLHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTI
MQSAWKRGQKVTIHGWAYGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK*

>3rd Escherichia coli Carbonic Anhydrase Final rip2
VKEIIDGFLKFQREAFPKREALFKQLATQQSPRTLFISCSDSRLVPELVTQREPGDLFVI
RNAGNIVPSYGPEPGGVSASVEYAVAALRVSDIVICGHSNCGAMTAIASCQCMDHMPAVS
HWLRYADSARVVNEARPHSDLPSKAAAMVRENVIAQLANLQTHPSVRLALEEGRIALHGW
VYDIESGSIAAFDGATRQFVPLAANPRVCAIPLRQPTAA*


>Truepera radiovictrix DSM1703 Carbonic Anhydrase Protein
S*PFQKRAVSGRAG*KGCRQQLEPARLEVVHGADDGERALGDARL*GRVRGDEANGRLDV
LPHGPLERTPRPPLSRVAATSGAPQSGLERPHEGRQRGVGAPLLEGCGGGRDRAAAGVPQ
HHDERHAEHRDAVGEARQNRVVDDVAGDPVGKEVAQALVEDDLRRHARVGAAEHRREGVL
LARQGRAPARVLVRVRHAPLEVALVPGQQALERPLGGQGRLGGGH

>1 Salinispora arenicola Carbonic Anhydrase 1st Protein
MNCPGTPDTQPGSHPVSSSGIGGSRSGPVGPEQALAELYDGNRRFAVGVPIRPHQDIDRR
VALADGQQPFAVIVGCSDSRLAAEIIFDRGLGDLFVVRTAGHTVGPEVLGSVEYAVTVLG
APLVVVLGHDSCGAVQAARTADATGAPASGHLRAVVDGVVPSVRRAGARGVTEIDQIVDI
HIEQTVEAVLGRSEAVAAAVAGGRCAVVGMSYRLTAGEVHTVTAVGLAAPTTPPAAPETR
PSAGPA*

>2 Salinispora arenicola Carbonic Anhydrase 2nd Protein
XXTXXESGRVAESESTAFRWAGGRCGRACGVFVDEGALVGDQRITDSVAHHAHRRIREAD
GGQPAVGAWRPSTTQPGSSSASSRGPRTEALWGGPRMH*LAGAA*TPLRHRPG*SFRDTY
GR*GDRPSGHWFCRVWTSDQWHSAHRGPWASALRRRQGGVHLPS*GQAAARQPTGDRYGP
PAGV*TGSLSGERRPDRRHGPSPGRADRKRPALQPGTSPTRGEAGPCRGQCLLFPRYRRG
GSPQWQTLL

>1 Frankia CcI Carbonic Anhydrase 1st Protein
CPSPTTT*PTTPPTRRPSPGRFRCRRPSTSPPSPAWTHGSTSTRSLAWATARLTSSATPA
ASSPTTRSVPSRSASACSAPARSS*STTPTAAC*PSPTTILNARSRTRPGSNQNGPWSRL
PTWPKTYASRLRGSRRARSSRIPTPSAASSSMLPPDCSPKSR

>2 Frankia CcI3 Carbonic Anhydrase 2nd Protein
VDTDDHTAVDPVADVHADDVHADTVRPADTVSPVSGAATATELLLSYAAGHPARRREAGL
PALPGARPRLGVAVVACMDVRIQVEALLGLVEGDAHILRNAGGVITPDVVRSLAVSQHVL
GTTEIILLHHTGCGLERITDDGFRDQLECKTGVRPEWAVYSFPDVEEDVRKSVRVLRSSP
FLQSTTSVRGFVYQVETGALVEVLP*

We have 3 protein sequences for E. coli and 2 sequences each for Salinispora and
Frankia. We now compare them amongst themselves.

For E. coli-

The sequence marked Escherishsia is the 1st sequence.

The sequence Ecoli is the 2nd sequence.

The sequence Final is the 3rd sequence.

For Salinispora-


For Frankia-

After viewing the alignment of the suspected Carbonic Anhydrases within the same
species, we wish to align the proteins from all the sources. All proteins from the same
species are also included.
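The report does not say which alignment program was used. As an illustration of what a global pairwise alignment computes, here is a minimal sketch of Needleman-Wunsch scoring in Python. The unit scores (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity; real protein alignment tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    # prev holds the previous row of the dynamic-programming matrix.
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Short fragments of two of the E. coli sequences listed above.
    print(needleman_wunsch_score("FLWIGCSDSRVPAE", "LFISCSDSRLVPE"))
```

Only two rows of the matrix are kept in memory, which is enough when the score alone is needed; recovering the alignment itself would require storing the full matrix or traceback pointers.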

The alignment sequence of the bacteria is as follows-

Analysis- we can see two things from the above.

1) Bacteria with high G-C have two genes for Carbonic Anhydrase and
consequently 2 proteins suspected to be Carbonic Anhydrase.
2) Bacteria with high G-C incorporate synonymous amino acids which require G-C
rich codons to compensate in their proteins.

We will elaborate on the 2nd point later using the codon-usage plot, which shows the
corresponding codons in the DNA of the Carbonic Anhydrase genes of these bacteria.
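A rough way to look at the 2nd point directly on the peptide sequences is to measure what fraction of residues are amino acids encoded only by G-C rich codons. The sketch below assumes the grouping Gly, Ala, Arg, Pro (G-C rich codons) versus Phe, Tyr, Met, Ile, Asn, Lys (A-T rich codons); this grouping and the function name `composition_bias` are assumptions introduced for illustration and are not taken from the original analysis.

```python
# Amino acids whose codons are G-C rich (assumed grouping).
GC_RICH = set("GARP")
# Amino acids whose codons are A-T rich (assumed grouping).
AT_RICH = set("FYMINK")


def composition_bias(protein: str) -> tuple[float, float]:
    """Return (fraction of GC-rich residues, fraction of AT-rich residues)."""
    # Skip stop symbols ('*') and unknown residues ('X').
    residues = [aa for aa in protein.upper() if aa.isalpha() and aa != "X"]
    if not residues:
        raise ValueError("protein contains no usable residues")
    gc_rich = sum(1 for aa in residues if aa in GC_RICH)
    at_rich = sum(1 for aa in residues if aa in AT_RICH)
    return gc_rich / len(residues), at_rich / len(residues)


if __name__ == "__main__":
    # First lines of two of the peptide sequences listed above.
    staphylococcus = "YPXXXMTLLESILAYNKDFVGNKEFENYTTSKKPDKKAVLFTCMDTRLQDLGTKALGFNN"
    salinispora = "MNCPGTPDTQPGSHPVSSSGIGGSRSGPVGPEQALAELYDGNRRFAVGVPIRPHQDIDRR"
    print(composition_bias(staphylococcus))
    print(composition_bias(salinispora))
```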

Now we move to analyzing the DNA of the genes of Carbonic Anhydrase-

The DNA sequences are as follows-

>Methanococcus voltae Carbonic Anhydrase of 471 bases
ttaaattaactttttaatctcgccagtgttaatgtcaatcataagcccttcaacttccac
atctttaggaattgcagggtgatttttgattaattcgacacctttcaacacgttttcttc
ttcgttatccattttacccatccaagcttcaaagtctggcgtaaagtatggatttactcc
tcttttaatcatttcttttcttatctcttcaatatctgctcctgccattccacagtcggt
atgacctacgatagctatcttttcaacatctaaacagtatattgcaactactaacgacct
taatacatcgtctgtaataatattacctgcgtttttgatgacttttgcgtcgcctctttc
taaacccattttttcaggtaagaaatttacgagtcttgtatccatacaagttattactgc
taatttctttttcggtttggcttctgctccgatagtttcacctatactcaa

>Staphylococcus carnosus Carbonic Anhydrase of 594 bases
taccccancancanaatgacgttattagaaagcattttagcttataataaagattttgtc
ggcaacaaagaatttgaaaactatacaacaagtaaaaaaccagataaaaaagcagtgtta
tttacatgtatggatacacgtttgcaagatttaggtacaaaagcactcggttttaataat
ggtgacttgaaagttgttaaaaatgcaggtgcaattatcacgcacccatatggttcaact
ataaaaagcttactagtaggtatttatgcattaggtgctgaagaaattattattatggca
cataaagattgcggaatgggttgtcttgatgtcagcactgttaaagacgcaatgaaagaa
cgtggcgtaacagaagaaacatttaaaatcatcgaacattctggtgtagatgtagacagc
tttttacaaggtttcaaagatgctgaagaaaatgtccgcagaaatatcgatatggtatat
aatcatcccttatttgataaatccgtacctattcacggcttagtcatcgatcctcatacg
ggggaattagatttaattcaagacggctatgaattagctgctcaaaataaataa

>Vibrio cholerae Carbonic Anhydrase of 720 bases
atgaaaaagacaacgtgggtattagcgatggtagccagtatgagcttcggcgtacaggct
tccgagtgggggtatgaaggagagcatgctccggagcattggggcaaagttgcccctctt
tgcgcagagggtaaaaatcaaagcccgattgatgtcgcgcaaagcgtagaagcggatcta
cagcctttcacgctcaattatcaagggcaagtggttgggctgctcaataacgggcacact
ttacaagcgatagtccgtggtaataacccactgcagatcgatggcaaaacgtttcagctt
aagcagtttcattttcataccccttctgaaaatttgctaaaaggaaaacaattcccactg
gaagcgcattttgttcatgccgacgagcaaggcaatctggcggttgttgcggtgatgtac
caagtggggtcggaaaatccgctgcttaaggttctcacggcggatatgccgaccaaaggg
aattcgactcagctcacgcaagggatccctttggctgattggatcccagaatcgaagcac
tactatcgtttcaatggttcattgactacgccgccttgcagtgaaggtgtacgttggatt
gtgttaaaagagccagcacatttgtcgaatcaacaagagcagcagcttagtgccgtgatg
ggacacaataatcgacccgtacaaccgcataatgctcgtcttgtcttgcaagccgactaa

>Escherichia coli Carbonic Anhydrase of 372 bases
ttatttgtggttggcgtgtttcagcttgaggttggagatcccgtgacggtaacgttgctc
aagggtttcgcggttagtcgctgtgacatccagatcacgcagcaagccgtcgtgaatgcc
gtaggcccagccgtgaatggtgactttctgcccgcgtttccacgctgattgcataatagt
ggagtggcccaggttatacacctgttccatgacgttcagttcacacaaggtatccagacg
gcgctcttgcggcatttcgccgagcaatgagctatgtttgaaccagatatcgcggatatg
cagcagccagttgttgataagccccagttccgggttttcaactgcggcttgtacgccgcc
gcaaccgtagtg

>123 Escherichia coli carbonic Anhydrase Final
aagccccagttccgggttttcaactgcggcttgtacgccgccgcaaccgtagtggccaca
gataataatgtgttcaacttcgagtacatccactgcatactgaaccacggaaaggcagtt
caggtcagtttatttgtggttggcgtgtttcagcttgaggttggaaatcccgtgacggta
acgttgctcaagggtttcgcggttggtggcggtaacatccagatcacgcagcaagccgtc
gtgaatgccgtaggcccagccgtgaatggtaactttctgcccgcgtttccacgctgattg
cataatggtggagtggcccaggttatacacctgttccatgacgttcagttcacacaaggt
atccagacggcgctcttgcggcatttcgccgagcaatgagctatgtttgaaccagatatc
gcggatatgcagcagccagttgttgatgtgaatgaccaggttagcaacattacggtgaac
aaagagttcgcccggctcaagaccggttaaacgttctgcaggaacgcgactgtcggaaca
tccaatccatagaaagcgcggtttttgcgcttgtgccagtttctcaaaaaacccgggatc
ctcttccaccagcatttttgaccatagtgcattgttgctgatgagtgtatctatgtctttcat

>456 Escherichia coli Carbonic Anhydrase Final 2
gtgaaagagattattgatggattccttaaattccagcgcgaggcatttccgaagcgggaagcct
tgtttaaacagctggcgacacagcaaagcccgcgcacactttttatctcctgctccgacagccg
tctggtccctgagctggtgacgcaacgtgagcctggcgatctgttcgttattcgcaacgcgggc
aatatcgtcccttcctacgggccggaacccggtggcgtttctgcttcggtggagtatgccgtcg
ctgcgcttcgggtatctgacattgtgatttgtggtcattccaactgtggcgcgatgaccgccat
tgccagctgtcagtgcatggaccatatgcctgccgtctcccactggctgcgttatgccgattca
gcccgcgtcgttaatgaggcgcgcccgcattccgatttaccgtcaaaagctgcggcgatggtac
gtgaaaacgtcattgctcagttggctaatttgcaaactcatccatcggtgcgcctggcgctcga
agaggggcggatcgccctgcacggctgggtctacgacattgaaagcggcagcatcgcagctttt
gacggcgcaacccgccagtttgtgccactggccgctaatcctcgcgtttgtgccataccgctac
gccaaccgaccgcagcgtaa

>Truepera radiovictrix DSM1703 Carbonic Anhydrase consisting of 675 bases
tcataaccgttccaaaagcgggccgtgagcgggcgcgctgggtaaaaggggtgtcggcag
cagctcgagcccgcccgtctcgaggtcgtacacggcgccgacgacggcgagcgtgcgctg
ggcgatgcgcgcctttaggggcgggtgcgcggcgatgaggcgaacggacgcctcgacgtt
ctccctcacggcccccttgagcgtacaccccgtccccccctcagccgtgtcgcagcgacg
agcggcgcgccgcaaagcgggctcgagcgcccgcacgaggggcgtcagcgaggggtcggc
gcccccctcctcgagggctgcggcggcggccgcgaccgcgccgcagccggtgtgccccag
caccacgatgagcggcacgccgagcaccgagacgccgtaggtgaggctcgccaaaatcgc
gtcgtcgacgatgttgccggcgacccggttggtaaagaggtcgcccaagccctggtcgaa
gatgatctgcggcggcacgcgcgagtcggcgcagccgagcaccgccgcgaaggggtgctg
ctcgcgcgtcaaggacgcgcgccagcgcgcgtcttggtgcgggtgcgccatgcgcccctc
gaggtagcgctggtgcccggccaacaggcgctcgagcgccccttggggggtcaggggcgt
ctggggggcggtcat

>Salinispora arenicola Carbonic Anhydrase 1 of 741 bases.
atgaactgcccaggaacgcccgacacacagccgggctcgcacccggtgtcctccagtgga
atcggcggttcccggagcgggccggtcgggcccgagcaggcgcttgccgagttgtacgac
ggcaaccggcgattcgccgttggtgttccgatccgcccacaccaggacatcgaccgtcgg
gtcgccctggcggatggtcagcagcccttcgcggtgatcgtcggctgttccgactcccga
cttgctgctgagatcatctttgaccgtggtctcggtgacctgttcgtggtacgcaccgct
gggcacacggtcgggccagaggtgctgggcagcgtcgagtacgcggtcaccgtgctgggt
gcgccgctggtggtggtgctcggccacgactcctgtggagcggtacaggcggcccggacc
gccgacgccaccggcgcaccggcgtccgggcacctccgcgctgtggtggacggggtggtg
ccgagcgtgcgtcgggccggggcccgtggggttaccgagatcgaccagatcgtcgacatt
catatcgagcagaccgttgaggcggtgcttggccgttctgaggcggtcgcagccgcggtg
gccggcggacggtgtgcggtggtgggaatgtcgtaccggctcaccgcaggtgaggtgcac
acggttaccgcggttggcctcgcggcgccgaccacaccaccggccgcgcctgagacccgc
cccagcgccggaccggcgtaa

>abc Salinispora arenicola Carbonic Anhydrase 2 of 748 bases
naancacancanatgaatcgggccgtgtggccgagtcggagagcactgctttccggtgg
gctggtgggcgctgtggccgtgcttgcggggtgttcgtcgacgaaggcgcgctcgtcggc
gaccagcgcatcaccgacagcgtcgcccaccacgcccaccgccgcattcgagaggctgat
ggagggcaaccagcggtgggtgcgtggagaccttcaacaacccaaccgggatccagctcg
gcgtcaagtcgtggcccacgaacagaagccctttggggcggtcctcgcatgcattgactc
gcgggtgccgcctgaactcctcttcgacaccggcctgggtgatcttttcgtgacacgtac
gggaggtgaggcgatcggcccagtggtcactggttctgtcgagtttggacctctgaccag
tggcactccgctcatcgtggtccttgggcatcagcgttgcggcgccgtcaaggcggcgta
cacctcccttcgtgagggcaagccgctgcccggcaacctaccggcgatcgttacggccct
ccagccggcgtatgaacaggtagcctcagcggggagcgccgacccgatcgacgccatggc
ccgagcccaggccgagctgatcgcaaacgacctgcgctccaacccggaactagccccact
cgtggcgaagcgggaccttgccgtggtcagtgcctactattccctcgataccggcgcggt
ggaagtcctcagtggcagaccctcctga

>Frankia CcI Carbonic Anhydrase 1 of 488 bases
tgtccgtcaccgacgactacctgaccaacaacgccgcctacgcgaagaccttcgccgggc
cgcttccgctgccgccgtccaagcacatcgccgccgtcgcctgcatggacgcacggctca
acgtctacgcgatccttggcctgggcgacggcgaggctcacgtcatccgcaacgccggcg
gcgtcgtcaccgacgacgagatccgttccctcgcgatcagccagcgcctgctcggcaccc
gcgagatcatcctgatccaccacaccgactgcggcatgctgaccttcaccgacgacgatt
ttaaacgctcgatccaggacgagaccgggatcaaaccagaatgggccgtggagtcgttta
ccgacctggccgaagacatacgccagtcgattgcgcggatcaaggcgagcccgttcatcc
cgcataccgacgccatccgcggcttcatcttcgatgttgccaccggactgctcaccgaag
tcgcgtga

>xyz Frankia CcI3 Carbonic Anhydrase 2 of 618 bases
gtggacaccgatgaccacaccgctgtcgaccccgttgccgatgtccatgcagacgatgtc
catgcggacaccgtgcgccccgcggatacggtgagcccggtgagcggcgctgccacggcg
accgaactcctgctgagctacgctgcaggtcaccccgcccggcggcgggaggccgggcta
cctgccctgcccggcgcgcggccgcgcctgggcgtcgcggtggttgcgtgtatggacgtg
cggatccaggtggaggccttgctcggtcttgtcgaaggtgacgcccacatcctgcgcaac
gccggtggtgtcatcaccccggatgtggtccgctcgctcgccgtgagccagcacgtgctg
ggaacgacggagatcattcttttgcatcacaccgggtgtggtctcgaaaggatcaccgac
gacgggttccgggaccagttggagtgcaagacgggcgttcgtcccgaatgggccgtgtat
tcctttcccgatgtcgaggaggacgtgcgcaagtccgtcagggtgctgcgttcgtcgccg
ttcctgcagtccaccacctcggtacgcgggttcgtctaccaggtggagaccggggcactg
gtcgaggttctgccgtag

We will now proceed to compare the translation product of the ORF of each gene with the
original protein product. Methanococcus produces the protein in reading frame 1 of the
reverse strand of the DNA segment. It does not start with ATG; the first amino acid is L
in place of M. Staphylococcus and Vibrio produce the protein in reading frame 1 of the
forward strand. The same is observed in Frankia and Salinispora.
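The reading-frame comparison can be reproduced with a small translation routine. The following Python sketch assumes the standard genetic code and writes X for any codon containing an ambiguous base such as n; the function names `translate` and `reverse_complement` are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement; characters other than ACGT are kept."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str, frame: int = 1) -> str:
    """Translate dna in reading frame 1, 2 or 3; '*' marks a stop codon."""
    seq = dna.upper()[frame - 1:]
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Codons with ambiguous bases are reported as 'X' (assumption of this sketch).
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    # First 60 bases of the Methanococcus voltae sequence as listed above.
    fragment = "ttaaattaactttttaatctcgccagtgttaatgtcaatcataagcccttcaacttccac"
    print(translate(fragment, frame=1))
    # Frames on the opposite strand are obtained from the reverse complement.
    print(translate(reverse_complement(fragment), frame=1))
```

Translating frames 1 to 3 of a sequence and of its reverse complement gives all six reading frames, which is how the frame matching a known protein can be identified.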

The gene product is typically labeled ‘orf’.

1) Methanococcus voltae A3-

2) Staphylococcus carnosus-

3) Vibrio cholerae-

4) The comparison of the E. coli gene product and protein is as follows-

For the rest, we will be comparing only 1 suspected protein and gene product
for consistency.

5) For Truepera-

6) For Salinispora-

7) For Frankia-

Codon Analysis is as follows-

Results for 411 residue sequence "Methanococcus voltae Carbonic
Anhydrase of 471 bases"
AmAcid Codon Number /1000 Fraction
Ala GCG 0.00 0.00 0.00
Ala GCA 0.00 0.00 0.00
Ala GCT 0.00 0.00 0.00
Ala GCC 1.00 7.30 1.00

Cys TGT 2.00 14.60 0.20
Cys TGC 8.00 58.39 0.80

Asp GAT 4.00 29.20 0.67
Asp GAC 2.00 14.60 0.33

Glu GAG 1.00 7.30 0.50
Glu GAA 1.00 7.30 0.50

Phe TTT 12.00 87.59 0.50
Phe TTC 12.00 87.59 0.50

Gly GGG 0.00 0.00 0.00
Gly GGA 0.00 0.00 0.00
Gly GGT 1.00 7.30 0.50
Gly GGC 1.00 7.30 0.50

His CAT 6.00 43.80 0.86
His CAC 1.00 7.30 0.14

Ile ATA 0.00 0.00 0.00
Ile ATT 4.00 29.20 0.40
Ile ATC 6.00 43.80 0.60

Lys AAG 0.00 0.00 0.00
Lys AAA 2.00 14.60 1.00

Leu TTG 0.00 0.00 0.00
Leu TTA 0.00 0.00 0.00
Leu CTG 0.00 0.00 0.00
Leu CTA 0.00 0.00 0.00
Leu CTT 2.00 14.60 0.67
Leu CTC 1.00 7.30 0.33

Met ATG 1.00 7.30 1.00

Asn AAT 5.00 36.50 0.71
Asn AAC 2.00 14.60 0.29

Pro CCG 0.00 0.00 0.00
Pro CCA 1.00 7.30 0.50
Pro CCT 1.00 7.30 0.50
Pro CCC 0.00 0.00 0.00

Gln CAG 0.00 0.00 0.00
Gln CAA 2.00 14.60 1.00

Arg AGG 3.00 21.90 0.50
Arg AGA 0.00 0.00 0.00
Arg CGG 1.00 7.30 0.17
Arg CGA 1.00 7.30 0.17
Arg CGT 1.00 7.30 0.17
Arg CGC 0.00 0.00 0.00

Ser AGT 2.00 14.60 0.17
Ser AGC 2.00 14.60 0.17
Ser TCG 0.00 0.00 0.00
Ser TCA 0.00 0.00 0.00
Ser TCT 4.00 29.20 0.33
Ser TCC 4.00 29.20 0.33

Thr ACG 0.00 0.00 0.00
Thr ACA 3.00 21.90 0.30
Thr ACT 1.00 7.30 0.10
Thr ACC 6.00 43.80 0.60

Val GTG 1.00 7.30 0.10
Val GTA 2.00 14.60 0.20
Val GTT 3.00 21.90 0.30
Val GTC 4.00 29.20 0.40

Trp TGG 2.00 14.60 1.00

Tyr TAT 5.00 36.50 0.45
Tyr TAC 6.00 43.80 0.55

End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 7.00 51.09 1.00
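The three numeric columns of these tables can be reproduced with a short script. The sketch below assumes that Number is the count of a codon in frame 1, /1000 is that count per thousand codons, and Fraction is the share of the codon among all codons for the same amino acid; this reading agrees with the figures in the table above (for example Cys TGT 2 and TGC 8 give fractions 0.20 and 0.80), but the function name `codon_usage` and the handling of ambiguous codons are assumptions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage(dna: str) -> dict[str, tuple[str, int, float, float]]:
    """Map each codon to (amino acid, number, per thousand, fraction)."""
    seq = dna.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Codons containing ambiguous bases are skipped (assumption of this sketch).
    counts = Counter(codon for codon in codons if codon in CODON_TABLE)
    total = sum(counts.values())
    per_amino_acid = Counter()
    for codon, number in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += number
    usage = {}
    for codon, aa in CODON_TABLE.items():
        number = counts.get(codon, 0)
        per_thousand = 1000 * number / total if total else 0.0
        fraction = number / per_amino_acid[aa] if per_amino_acid[aa] else 0.0
        usage[codon] = (aa, number, per_thousand, fraction)
    return usage


if __name__ == "__main__":
    table = codon_usage("tgttgctgctgctgc")
    for codon in ("TGT", "TGC"):
        aa, number, per_thousand, fraction = table[codon]
        print(aa, codon, f"{number:.2f} {per_thousand:.2f} {fraction:.2f}")
```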

Results for 594 residue sequence "Staphylococcus carnosus
Carbonic Anhydrase of 594 bases"
AmAcid Codon Number /1000 Fraction
Ala GCG 0.00 0.00 0.00
Ala GCA 7.00 35.90 0.58
Ala GCT 5.00 25.64 0.42
Ala GCC 0.00 0.00 0.00

Cys TGT 2.00 10.26 0.67
Cys TGC 1.00 5.13 0.33

Asp GAT 12.00 61.54 0.75
Asp GAC 4.00 20.51 0.25

Glu GAG 0.00 0.00 0.00
Glu GAA 13.00 66.67 1.00

Phe TTT 7.00 35.90 0.88
Phe TTC 1.00 5.13 0.13

Gly GGG 1.00 5.13 0.06
Gly GGA 1.00 5.13 0.06
Gly GGT 10.00 51.28 0.63
Gly GGC 4.00 20.51 0.25

His CAT 4.00 20.51 0.67
His CAC 2.00 10.26 0.33

Ile ATA 1.00 5.13 0.07
Ile ATT 8.00 41.03 0.57
Ile ATC 5.00 25.64 0.36

Lys AAG 0.00 0.00 0.00
Lys AAA 17.00 87.18 1.00

Leu TTG 2.00 10.26 0.11
Leu TTA 13.00 66.67 0.72
Leu CTG 0.00 0.00 0.00
Leu CTA 1.00 5.13 0.06
Leu CTT 1.00 5.13 0.06
Leu CTC 1.00 5.13 0.06

Met ATG 6.00 30.77 1.00

Asn AAT 8.00 41.03 0.80
Asn AAC 2.00 10.26 0.20

Pro CCG 0.00 0.00 0.00
Pro CCA 2.00 10.26 0.33
Pro CCT 2.00 10.26 0.33
Pro CCC 2.00 10.26 0.33

Gln CAG 0.00 0.00 0.00
Gln CAA 4.00 20.51 1.00

Arg AGG 0.00 0.00 0.00
Arg AGA 1.00 5.13 0.25
Arg CGG 0.00 0.00 0.00
Arg CGA 0.00 0.00 0.00
Arg CGT 2.00 10.26 0.50
Arg CGC 1.00 5.13 0.25

Ser AGT 1.00 5.13 0.13
Ser AGC 4.00 20.51 0.50
Ser TCG 0.00 0.00 0.00
Ser TCA 1.00 5.13 0.13
Ser TCT 1.00 5.13 0.13
Ser TCC 1.00 5.13 0.13

Thr ACG 3.00 15.38 0.25
Thr ACA 7.00 35.90 0.58
Thr ACT 2.00 10.26 0.17
Thr ACC 0.00 0.00 0.00

Val GTG 1.00 5.13 0.07
Val GTA 6.00 30.77 0.43
Val GTT 3.00 15.38 0.21
Val GTC 4.00 20.51 0.29

Trp TGG 0.00 0.00 0.00

Tyr TAT 6.00 30.77 0.86
Tyr TAC 1.00 5.13 0.14

End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 1.00 5.13 1.00

Results for 660 residue sequence "Vibrio cholerae Carbonic
Anhydrase of 720 bases"
AmAcid Codon Number /1000 Fraction
Ala GCG 7.00 31.82 0.44
Ala GCA 2.00 9.09 0.13
Ala GCT 3.00 13.64 0.19
Ala GCC 4.00 18.18 0.25

Cys TGT 0.00 0.00 0.00
Cys TGC 2.00 9.09 1.00

Asp GAT 5.00 22.73 0.71
Asp GAC 2.00 9.09 0.29

Glu GAG 7.00 31.82 0.50
Glu GAA 7.00 31.82 0.50

Phe TTT 4.00 18.18 0.57
Phe TTC 3.00 13.64 0.43

Gly GGG 7.00 31.82 0.41
Gly GGA 3.00 13.64 0.18
Gly GGT 4.00 18.18 0.24
Gly GGC 3.00 13.64 0.18

His CAT 8.00 36.36 0.73
His CAC 3.00 13.64 0.27

Ile ATA 1.00 4.55 0.17
Ile ATT 2.00 9.09 0.33
Ile ATC 3.00 13.64 0.50

Lys AAG 3.00 13.64 0.30
Lys AAA 7.00 31.82 0.70

Leu TTG 5.00 22.73 0.22
Leu TTA 2.00 9.09 0.09
Leu CTG 5.00 22.73 0.22
Leu CTA 2.00 9.09 0.09
Leu CTT 5.00 22.73 0.22
Leu CTC 4.00 18.18 0.17

Met ATG 3.00 13.64 1.00

Asn AAT 13.00 59.09 0.87
Asn AAC 2.00 9.09 0.13

Pro CCG 6.00 27.27 0.38
Pro CCA 4.00 18.18 0.25
Pro CCT 5.00 22.73 0.31
Pro CCC 1.00 4.55 0.06

Gln CAG 7.00 31.82 0.35
Gln CAA 13.00 59.09 0.65

Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 0.00 0.00 0.00
Arg CGA 1.00 4.55 0.20
Arg CGT 4.00 18.18 0.80
Arg CGC 0.00 0.00 0.00

Ser AGT 2.00 9.09 0.18
Ser AGC 2.00 9.09 0.18
Ser TCG 4.00 18.18 0.36
Ser TCA 1.00 4.55 0.09
Ser TCT 1.00 4.55 0.09
Ser TCC 1.00 4.55 0.09

Thr ACG 5.00 22.73 0.50
Thr ACA 0.00 0.00 0.00
Thr ACT 3.00 13.64 0.30
Thr ACC 2.00 9.09 0.20

Val GTG 5.00 22.73 0.29
Val GTA 3.00 13.64 0.18
Val GTT 6.00 27.27 0.35
Val GTC 3.00 13.64 0.18

Trp TGG 4.00 18.18 1.00

Tyr TAT 3.00 13.64 0.60
Tyr TAC 2.00 9.09 0.40

End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 1.00 4.55 1.00

Results for 372 residue sequence "Escherichia coli Carbonic
Anhydrase of 372 bases"
AmAcid Codon Number /1000 Fraction
Ala GCG 4.00 32.26 0.31
Ala GCA 1.00 8.06 0.08
Ala GCT 1.00 8.06 0.08
Ala GCC 7.00 56.45 0.54

Cys TGT 1.00 8.06 0.50
Cys TGC 1.00 8.06 0.50

Asp GAT 4.00 32.26 0.57
Asp GAC 3.00 24.19 0.43

Glu GAG 2.00 16.13 0.67
Glu GAA 1.00 8.06 0.33

Phe TTT 5.00 40.32 0.45
Phe TTC 6.00 48.39 0.55

Gly GGG 0.00 0.00 0.00
Gly GGA 2.00 16.13 0.25
Gly GGT 3.00 24.19 0.38
Gly GGC 3.00 24.19 0.38

His CAT 3.00 24.19 0.75
His CAC 1.00 8.06 0.25

Ile ATA 1.00 8.06 0.20
Ile ATT 0.00 0.00 0.00
Ile ATC 4.00 32.26 0.80

Lys AAG 2.00 16.13 1.00
Lys AAA 0.00 0.00 0.00

Leu TTG 4.00 32.26 0.40
Leu TTA 1.00 8.06 0.10
Leu CTG 2.00 16.13 0.20
Leu CTA 0.00 0.00 0.00
Leu CTT 1.00 8.06 0.10
Leu CTC 2.00 16.13 0.20

Met ATG 2.00 16.13 1.00

Asn AAT 3.00 24.19 0.75
Asn AAC 1.00 8.06 0.25

Pro CCG 0.00 0.00 0.00
Pro CCA 4.00 32.26 0.57
Pro CCT 0.00 0.00 0.00
Pro CCC 3.00 24.19 0.43

Gln CAG 9.00 72.58 0.75
Gln CAA 3.00 24.19 0.25

Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 2.00 16.13 0.50
Arg CGA 0.00 0.00 0.00
Arg CGT 0.00 0.00 0.00
Arg CGC 2.00 16.13 0.50

Ser AGT 2.00 16.13 1.00
Ser AGC 0.00 0.00 0.00
Ser TCG 0.00 0.00 0.00
Ser TCA 0.00 0.00 0.00
Ser TCT 0.00 0.00 0.00
Ser TCC 0.00 0.00 0.00

Thr ACG 4.00 32.26 0.67
Thr ACA 1.00 8.06 0.17
Thr ACT 0.00 0.00 0.00
Thr ACC 1.00 8.06 0.17

Val GTG 7.00 56.45 0.37
Val GTA 3.00 24.19 0.16
Val GTT 8.00 64.52 0.42
Val GTC 1.00 8.06 0.05

Trp TGG 0.00 0.00 0.00

Tyr TAT 0.00 0.00 0.00
Tyr TAC 1.00 8.06 1.00

End TGA 2.00 16.13 1.00
End TAG 0.00 0.00 0.00
End TAA 0.00 0.00 0.00

Results for 660 residue sequence "456 Ecoli Carbonic anhydrase
Final 2"
AmAcid Codon Number /1000 Fraction
Ala GCG 9.00 40.91 0.30
Ala GCA 4.00 18.18 0.13
Ala GCT 7.00 31.82 0.23
Ala GCC 10.00 45.45 0.33

Cys TGT 4.00 18.18 0.67
Cys TGC 2.00 9.09 0.33

Asp GAT 4.00 18.18 0.44
Asp GAC 5.00 22.73 0.56

Glu GAG 7.00 31.82 0.58
Glu GAA 5.00 22.73 0.42

Phe TTT 5.00 22.73 0.63
Phe TTC 3.00 13.64 0.38

Gly GGG 2.00 9.09 0.17
Gly GGA 1.00 4.55 0.08
Gly GGT 2.00 9.09 0.17
Gly GGC 7.00 31.82 0.58

His CAT 4.00 18.18 0.67
His CAC 2.00 9.09 0.33

Ile ATA 1.00 4.55 0.08
Ile ATT 8.00 36.36 0.62
Ile ATC 4.00 18.18 0.31

Lys AAG 1.00 4.55 0.20
Lys AAA 4.00 18.18 0.80

Leu TTG 3.00 13.64 0.18
Leu TTA 1.00 4.55 0.06
Leu CTG 8.00 36.36 0.47
Leu CTA 1.00 4.55 0.06
Leu CTT 3.00 13.64 0.18
Leu CTC 1.00 4.55 0.06

Met ATG 4.00 18.18 1.00

Asn AAT 4.00 18.18 0.57
Asn AAC 3.00 13.64 0.43

Pro CCG 7.00 31.82 0.47
Pro CCA 2.00 9.09 0.13
Pro CCT 5.00 22.73 0.33
Pro CCC 1.00 4.55 0.07

Gln CAG 6.00 27.27 0.60
Gln CAA 4.00 18.18 0.40

Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 3.00 13.64 0.19
Arg CGA 0.00 0.00 0.00
Arg CGT 4.00 18.18 0.25
Arg CGC 9.00 40.91 0.56

Ser AGT 0.00 0.00 0.00
Ser AGC 5.00 22.73 0.29
Ser TCG 2.00 9.09 0.12
Ser TCA 2.00 9.09 0.12
Ser TCT 2.00 9.09 0.12
Ser TCC 6.00 27.27 0.35

Thr ACG 1.00 4.55 0.14
Thr ACA 2.00 9.09 0.29
Thr ACT 1.00 4.55 0.14
Thr ACC 3.00 13.64 0.43

Val GTG 6.00 27.27 0.32
Val GTA 2.00 9.09 0.11
Val GTT 4.00 18.18 0.21
Val GTC 7.00 31.82 0.37

Trp TGG 2.00 9.09 1.00

Tyr TAT 2.00 9.09 0.50
Tyr TAC 2.00 9.09 0.50

End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 1.00 4.55 1.00

Results for 663 residue sequence "123 Ecoli carbonic Anhydrase
Final"
AmAcid Codon Number /1000 Fraction
Ala GCG 6.00 27.15 0.32
Ala GCA 2.00 9.05 0.11
Ala GCT 2.00 9.05 0.11
Ala GCC 9.00 40.72 0.47

Cys TGT 1.00 4.52 0.17
Cys TGC 5.00 22.62 0.83

Asp GAT 5.00 22.62 0.71
Asp GAC 2.00 9.05 0.29

Glu GAG 5.00 22.62 0.83
Glu GAA 1.00 4.52 0.17

Phe TTT 9.00 40.72 0.45
Phe TTC 11.00 49.77 0.55

Gly GGG 1.00 4.52 0.06
Gly GGA 4.00 18.10 0.25
Gly GGT 7.00 31.67 0.44
Gly GGC 4.00 18.10 0.25

His CAT 5.00 22.62 0.56
His CAC 4.00 18.10 0.44

Ile ATA 2.00 9.05 0.18
Ile ATT 2.00 9.05 0.18
Ile ATC 7.00 31.67 0.64

Lys AAG 4.00 18.10 0.50
Lys AAA 4.00 18.10 0.50

Leu TTG 6.00 27.15 0.38
Leu TTA 1.00 4.52 0.06
Leu CTG 3.00 13.57 0.19
Leu CTA 0.00 0.00 0.00
Leu CTT 1.00 4.52 0.06
Leu CTC 5.00 22.62 0.31

Met ATG 2.00 9.05 1.00

Asn AAT 8.00 36.20 0.50
Asn AAC 8.00 36.20 0.50

Pro CCG 0.00 0.00 0.00
Pro CCA 6.00 27.15 0.60
Pro CCT 0.00 0.00 0.00
Pro CCC 4.00 18.10 0.40

Gln CAG 13.00 58.82 0.81
Gln CAA 3.00 13.57 0.19

Arg AGG 1.00 4.52 0.14
Arg AGA 0.00 0.00 0.00
Arg CGG 4.00 18.10 0.57
Arg CGA 0.00 0.00 0.00
Arg CGT 0.00 0.00 0.00
Arg CGC 2.00 9.05 0.29

Ser AGT 1.00 4.52 0.33
Ser AGC 1.00 4.52 0.33
Ser TCG 0.00 0.00 0.00
Ser TCA 0.00 0.00 0.00
Ser TCT 0.00 0.00 0.00
Ser TCC 1.00 4.52 0.33

Thr ACG 6.00 27.15 0.50
Thr ACA 3.00 13.57 0.25
Thr ACT 1.00 4.52 0.08
Thr ACC 2.00 9.05 0.17

Val GTG 10.00 45.25 0.36
Val GTA 3.00 13.57 0.11
Val GTT 11.00 49.77 0.39
Val GTC 4.00 18.10 0.14

Trp TGG 0.00 0.00 0.00

Tyr TAT 1.00 4.52 0.33
Tyr TAC 2.00 9.05 0.67

End TGA 3.00 13.57 0.50
End TAG 2.00 9.05 0.33
End TAA 1.00 4.52 0.17

Results for 675 residue sequence "Truepera radiovictrix DSM1703
Carbo Anhyd consisting of 675 bases"
AmAcid Codon Number /1000 Fraction
Ala GCG 12.00 53.33 0.41
Ala GCA 3.00 13.33 0.10
Ala GCT 2.00 8.89 0.07
Ala GCC 12.00 53.33 0.41

Cys TGT 1.00 4.44 0.50
Cys TGC 1.00 4.44 0.50

Asp GAT 6.00 26.67 0.46
Asp GAC 7.00 31.11 0.54

Glu GAG 15.00 66.67 0.88
Glu GAA 2.00 8.89 0.12

Phe TTT 0.00 0.00 0.00
Phe TTC 1.00 4.44 1.00

Gly GGG 11.00 48.89 0.33
Gly GGA 2.00 8.89 0.06
Gly GGT 5.00 22.22 0.15
Gly GGC 15.00 66.67 0.45

His CAT 2.00 8.89 0.18
His CAC 9.00 40.00 0.82

Ile ATA 0.00 0.00 0.00
Ile ATT 0.00 0.00 0.00
Ile ATC 0.00 0.00 0.00

Lys AAG 2.00 8.89 0.67
Lys AAA 1.00 4.44 0.33

Leu TTG 2.00 8.89 0.10
Leu TTA 0.00 0.00 0.00
Leu CTG 6.00 26.67 0.29
Leu CTA 0.00 0.00 0.00
Leu CTT 2.00 8.89 0.10
Leu CTC 11.00 48.89 0.52

Met ATG 0.00 0.00 0.00

Asn AAT 1.00 4.44 0.50
Asn AAC 1.00 4.44 0.50

Pro CCG 4.00 17.78 0.25
Pro CCA 1.00 4.44 0.06
Pro CCT 1.00 4.44 0.06
Pro CCC 10.00 44.44 0.63

Gln CAG 6.00 26.67 0.50
Gln CAA 6.00 26.67 0.50

Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 7.00 31.11 0.21
Arg CGA 3.00 13.33 0.09
Arg CGT 8.00 35.56 0.24
Arg CGC 15.00 66.67 0.45

Ser AGT 0.00 0.00 0.00
Ser AGC 4.00 17.78 0.80
Ser TCG 0.00 0.00 0.00
Ser TCA 1.00 4.44 0.20
Ser TCT 0.00 0.00 0.00
Ser TCC 0.00 0.00 0.00

Thr ACG 1.00 4.44 0.50
Thr ACA 1.00 4.44 0.50
Thr ACT 0.00 0.00 0.00
Thr ACC 0.00 0.00 0.00

Val GTG 7.00 31.11 0.32
Val GTA 3.00 13.33 0.14
Val GTT 3.00 13.33 0.14
Val GTC 9.00 40.00 0.41

Trp TGG 0.00 0.00 0.00

Tyr TAT 0.00 0.00 0.00
Tyr TAC 0.00 0.00 0.00

End TGA 0.00 0.00 0.00
End TAG 1.00 4.44 0.33
End TAA 2.00 8.89 0.67

Results for 765 residue sequence "Salinispora arenicola"
AmAcid Codon Number /1000 Fraction
Ala GCG 17.00 68.55 0.47
Ala GCA 3.00 12.10 0.08
Ala GCT 4.00 16.13 0.11
Ala GCC 12.00 48.39 0.33

Cys TGT 3.00 12.10 0.75
Cys TGC 1.00 4.03 0.25

Asp GAT 1.00 4.03 0.08
Asp GAC 12.00 48.39 0.92

Glu GAG 11.00 44.35 1.00
Glu GAA 0.00 0.00 0.00

Phe TTT 1.00 4.03 0.25
Phe TTC 3.00 12.10 0.75

Gly GGG 8.00 32.26 0.26
Gly GGA 6.00 24.19 0.19
Gly GGT 7.00 28.23 0.23
Gly GGC 10.00 40.32 0.32

His CAT 1.00 4.03 0.13
His CAC 7.00 28.23 0.88

Ile ATA 0.00 0.00 0.00
Ile ATT 1.00 4.03 0.10
Ile ATC 9.00 36.29 0.90

Lys AAG 0.00 0.00 0.00
Lys AAA 0.00 0.00 0.00

Leu TTG 1.00 4.03 0.07
Leu TTA 0.00 0.00 0.00
Leu CTG 5.00 20.16 0.36
Leu CTA 0.00 0.00 0.00
Leu CTT 3.00 12.10 0.21
Leu CTC 5.00 20.16 0.36

Met ATG 2.00 8.06 1.00

Asn AAT 0.00 0.00 0.00
Asn AAC 2.00 8.06 1.00

Pro CCG 10.00 40.32 0.53
Pro CCA 4.00 16.13 0.21
Pro CCT 1.00 4.03 0.05
Pro CCC 4.00 16.13 0.21

Gln CAG 8.00 32.26 1.00
Gln CAA 0.00 0.00 0.00

Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 7.00 28.23 0.39
Arg CGA 2.00 8.06 0.11
Arg CGT 5.00 20.16 0.28
Arg CGC 4.00 16.13 0.22

Ser AGT 1.00 4.03 0.07
Ser AGC 4.00 16.13 0.27
Ser TCG 2.00 8.06 0.13
Ser TCA 0.00 0.00 0.00
Ser TCT 1.00 4.03 0.07
Ser TCC 7.00 28.23 0.47

Thr ACG 3.00 12.10 0.20
Thr ACA 2.00 8.06 0.13
Thr ACT 0.00 0.00 0.00
Thr ACC 10.00 40.32 0.67

Val GTG 18.00 72.58 0.53
Val GTA 2.00 8.06 0.06
Val GTT 6.00 24.19 0.18
Val GTC 8.00 32.26 0.24

Trp TGG 0.00 0.00 0.00

Tyr TAT 0.00 0.00 0.00
Tyr TAC 3.00 12.10 1.00

End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 1.00 4.03 1.00

Results for 774 residue sequence " abc Salinispora " starting

Ala GCG 13.00 52.85 0.37
Ala GCA 5.00 20.33 0.14
Ala GCT 2.00 8.13 0.06
Ala GCC 15.00 60.98 0.43

Cys TGT 1.00 4.07 0.33
Cys TGC 2.00 8.13 0.67

Asp GAT 3.00 12.20 0.30
Asp GAC 7.00 28.46 0.70

Glu GAG 6.00 24.39 0.55
Glu GAA 5.00 20.33 0.45

Phe TTT 2.00 8.13 0.40
Phe TTC 3.00 12.20 0.60

Gly GGG 5.00 20.33 0.23
Gly GGA 3.00 12.20 0.14
Gly GGT 4.00 16.26 0.18
Gly GGC 10.00 40.65 0.45

His CAT 1.00 4.07 0.33
His CAC 2.00 8.13 0.67

Ile ATA 0.00 0.00 0.00
Ile ATT 1.00 4.07 0.17
Ile ATC 5.00 20.33 0.83

Lys AAG 5.00 20.33 1.00
Lys AAA 0.00 0.00 0.00

Leu TTG 0.00 0.00 0.00
Leu TTA 0.00 0.00 0.00
Leu CTG 8.00 32.52 0.32
Leu CTA 2.00 8.13 0.08
Leu CTT 7.00 28.46 0.28
Leu CTC 8.00 32.52 0.32

Met ATG 3.00 12.20 1.00

Asn AAT 1.00 4.07 0.17
Asn AAC 5.00 20.33 0.83

Pro CCG 9.00 36.59 0.45
Pro CCA 3.00 12.20 0.15
Pro CCT 2.00 8.13 0.10
Pro CCC 6.00 24.39 0.30

Gln CAG 6.00 24.39 0.67
Gln CAA 3.00 12.20 0.33

Arg AGG 1.00 4.07 0.06
Arg AGA 2.00 8.13 0.11
Arg CGG 7.00 28.46 0.39
Arg CGA 1.00 4.07 0.06
Arg CGT 5.00 20.33 0.28
Arg CGC 2.00 8.13 0.11

Ser AGT 4.00 16.26 0.20
Ser AGC 2.00 8.13 0.10
Ser TCG 6.00 24.39 0.30
Ser TCA 2.00 8.13 0.10
Ser TCT 1.00 4.07 0.05
Ser TCC 5.00 20.33 0.25

Thr ACG 4.00 16.26 0.27
Thr ACA 2.00 8.13 0.13
Thr ACT 2.00 8.13 0.13
Thr ACC 7.00 28.46 0.47

Val GTG 13.00 52.85 0.57
Val GTA 1.00 4.07 0.04
Val GTT 1.00 4.07 0.04
Val GTC 8.00 32.52 0.35

Trp TGG 2.00 8.13 1.00

Tyr TAT 2.00 8.13 0.50
Tyr TAC 2.00 8.13 0.50

End TGA 1.00 4.07 1.00
End TAG 0.00 0.00 0.00
End TAA 0.00 0.00 0.00

Results for 488 residue sequence "Frankia CcI Carbonic Anhydrase 1 of 488 bases"

Ala GCG 7.00 43.21 0.39
Ala GCA 4.00 24.69 0.22
Ala GCT 2.00 12.35 0.11
Ala GCC 5.00 30.86 0.28

Cys TGT 1.00 6.17 0.20
Cys TGC 4.00 24.69 0.80

Asp GAT 0.00 0.00 0.00
Asp GAC 1.00 6.17 1.00

Glu GAG 0.00 0.00 0.00
Glu GAA 0.00 0.00 0.00

Phe TTT 0.00 0.00 0.00
Phe TTC 1.00 6.17 1.00

Gly GGG 1.00 6.17 0.20
Gly GGA 2.00 12.35 0.40
Gly GGT 0.00 0.00 0.00
Gly GGC 2.00 12.35 0.40

His CAT 0.00 0.00 0.00
His CAC 1.00 6.17 1.00

Ile ATA 1.00 6.17 0.50
Ile ATT 1.00 6.17 0.50
Ile ATC 0.00 0.00 0.00

Lys AAG 2.00 12.35 1.00
Lys AAA 0.00 0.00 0.00

Leu TTG 3.00 18.52 0.50
Leu TTA 2.00 12.35 0.33
Leu CTG 0.00 0.00 0.00
Leu CTA 0.00 0.00 0.00
Leu CTT 0.00 0.00 0.00
Leu CTC 1.00 6.17 0.17

Met ATG 1.00 6.17 1.00

Asn AAT 1.00 6.17 0.33
Asn AAC 2.00 12.35 0.67

Pro CCG 17.00 104.94 0.63
Pro CCA 4.00 24.69 0.15
Pro CCT 4.00 24.69 0.15
Pro CCC 2.00 12.35 0.07

Gln CAG 1.00 6.17 1.00
Gln CAA 0.00 0.00 0.00

Arg AGG 3.00 18.52 0.14
Arg AGA 4.00 24.69 0.18
Arg CGG 0.00 0.00 0.00
Arg CGA 6.00 37.04 0.27
Arg CGT 4.00 24.69 0.18
Arg CGC 5.00 30.86 0.23

Ser AGT 2.00 12.35 0.06
Ser AGC 2.00 12.35 0.06
Ser TCG 8.00 49.38 0.24
Ser TCA 12.00 74.07 0.35
Ser TCT 2.00 12.35 0.06
Ser TCC 8.00 49.38 0.24

Thr ACG 15.00 92.59 0.63
Thr ACA 4.00 24.69 0.17
Thr ACT 2.00 12.35 0.08
Thr ACC 3.00 18.52 0.13

Val GTG 0.00 0.00 0.00
Val GTA 0.00 0.00 0.00
Val GTT 1.00 6.17 1.00
Val GTC 0.00 0.00 0.00

Trp TGG 4.00 24.69 1.00

Tyr TAT 0.00 0.00 0.00
Tyr TAC 1.00 6.17 1.00

End TGA 3.00 18.52 1.00
End TAG 0.00 0.00 0.00
End TAA 0.00 0.00 0.00

Results for 618 residue sequence "xyz Frankia CcI3 Carbonic Anhydrase 2 of 618 bases"

Ala GCG 6.00 29.13 0.27
Ala GCA 3.00 14.56 0.14
Ala GCT 3.00 14.56 0.14
Ala GCC 10.00 48.54 0.45

Cys TGT 2.00 9.71 0.67
Cys TGC 1.00 4.85 0.33

Asp GAT 6.00 29.13 0.35
Asp GAC 11.00 53.40 0.65

Glu GAG 8.00 38.83 0.67
Glu GAA 4.00 19.42 0.33

Phe TTT 1.00 4.85 0.25
Phe TTC 3.00 14.56 0.75

Gly GGG 5.00 24.27 0.31
Gly GGA 1.00 4.85 0.06
Gly GGT 6.00 29.13 0.38
Gly GGC 4.00 19.42 0.25

His CAT 3.00 14.56 0.38
His CAC 5.00 24.27 0.63

Ile ATA 0.00 0.00 0.00
Ile ATT 1.00 4.85 0.17
Ile ATC 5.00 24.27 0.83

Lys AAG 2.00 9.71 1.00
Lys AAA 0.00 0.00 0.00

Leu TTG 3.00 14.56 0.15
Leu TTA 0.00 0.00 0.00
Leu CTG 10.00 48.54 0.50
Leu CTA 1.00 4.85 0.05
Leu CTT 2.00 9.71 0.10
Leu CTC 4.00 19.42 0.20

Met ATG 1.00 4.85 1.00

Asn AAT 0.00 0.00 0.00
Asn AAC 1.00 4.85 1.00

Pro CCG 5.00 24.27 0.42
Pro CCA 0.00 0.00 0.00
Pro CCT 1.00 4.85 0.08
Pro CCC 6.00 29.13 0.50

Gln CAG 5.00 24.27 1.00
Gln CAA 0.00 0.00 0.00

Arg AGG 2.00 9.71 0.13
Arg AGA 0.00 0.00 0.00
Arg CGG 6.00 29.13 0.38
Arg CGA 0.00 0.00 0.00
Arg CGT 2.00 9.71 0.13
Arg CGC 6.00 29.13 0.38

Ser AGT 0.00 0.00 0.00
Ser AGC 4.00 19.42 0.36
Ser TCG 4.00 19.42 0.36
Ser TCA 0.00 0.00 0.00
Ser TCT 0.00 0.00 0.00
Ser TCC 3.00 14.56 0.27

Thr ACG 5.00 24.27 0.33
Thr ACA 0.00 0.00 0.00
Thr ACT 0.00 0.00 0.00
Thr ACC 10.00 48.54 0.67

Val GTG 14.00 67.96 0.47
Val GTA 1.00 4.85 0.03
Val GTT 4.00 19.42 0.13
Val GTC 11.00 53.40 0.37

Trp TGG 1.00 4.85 1.00

Tyr TAT 1.00 4.85 0.33
Tyr TAC 2.00 9.71 0.67

End TGA 0.00 0.00 0.00
End TAG 1.00 4.85 1.00
End TAA 0.00 0.00 0.00

From the above lists, we conclude two things-
1) The codon usage of different gene ORFs from the same organism is largely the
same, differing only at some minor points.
2) The codon usage of the organisms confirms the suspicion we formed while
analyzing the peptide sequences: the choice of codons differs to suit the G-C
content of the organism.
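
The three numeric columns in the lists above appear to be, for each codon, the raw count, the frequency per 1000 codons, and the fraction of that amino acid's codons (for example, 18 GTG out of 34 Val codons gives 0.53). The following is a minimal sketch of how such a table could be computed, assuming the standard genetic code and a sequence read in frame from its first base. The function name codon_usage is our own and is not the tool that produced the lists above.

```python
from collections import Counter

# Standard genetic code in the usual TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage(seq):
    """Return {codon: (amino acid, count, per-1000, fraction)} for a DNA ORF.

    Assumption: the sequence is read in frame starting at its first base;
    trailing bases that do not fill a codon, and codons containing
    ambiguous letters, are ignored.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    total = sum(counts.values())

    # Total number of codons observed for each amino acid.
    per_amino = Counter()
    for codon, n in counts.items():
        per_amino[CODON_TABLE[codon]] += n

    rows = {}
    for codon, amino in CODON_TABLE.items():
        n = counts.get(codon, 0)
        per_thousand = 1000.0 * n / total if total else 0.0
        fraction = n / per_amino[amino] if per_amino[amino] else 0.0
        rows[codon] = (amino, n, round(per_thousand, 2), round(fraction, 2))
    return rows


# Toy sequence (made up for illustration): Met, Ala, Ala, Ala, stop.
table = codon_usage("ATGGCGGCGGCCTAA")
print(table["GCG"])  # ('A', 2, 400.0, 0.67)
```

Amino acids are reported here by their one-letter codes, whereas the lists above use three-letter codes.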

Corrections-
We made these corrections because we noticed that the gene products of Methanococcus voltae and
Frankia did not start with the amino acid Methionine.

Methanococcus voltae corrections-

The mistake seems to be in the database from which the sequence was downloaded. The DNA sequence
had ‘ata’ (which codes for Isoleucine) instead of ‘atg’.

Frankia sp CcI3 corrections-

The mistake again seems to be in the sequence. The DNA sequence began 27 bp earlier, and the
claimed starting site of the protein actually coded for Valine.
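
A check of this kind can be automated. The sketch below, under our own assumptions, looks at the first codon of a downloaded sequence and, if it is not a start codon, scans in frame for the first one and reports its offset in bp. The function names and the toy sequences are made up for illustration. ATG is taken as the only start codon by default; since GTG and TTG are also used as start codons in some bacterial genes, the accepted set is left as a parameter.

```python
def starts_with_start_codon(seq, starts=("ATG",)):
    """True if the first three bases of seq form an accepted start codon."""
    return seq[:3].upper() in starts


def first_in_frame_start(seq, starts=("ATG",)):
    """Offset (in bp) of the first in-frame start codon, or None if absent.

    Assumption: the reading frame is the one that begins at base 0.
    """
    seq = seq.upper()
    for pos in range(0, len(seq) - 2, 3):
        if seq[pos:pos + 3] in starts:
            return pos
    return None


# Toy examples, not the real database entries.
wrong_first_codon = "ataGCGAAATAA"             # 'ata' where 'atg' was expected
extra_upstream = "GCG" * 9 + "ATGAAAGCGTAA"    # 27 bp before the real start

print(starts_with_start_codon(wrong_first_codon))  # False
print(first_in_frame_start(extra_upstream))        # 27
```

An offset greater than zero suggests that the downloaded sequence begins before the actual start of the gene, and a result of None suggests that the start codon itself may have been recorded wrongly.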

Conclusion:
After studying the three analyses we did with the protein, the DNA and the ORF codons, we
conclude the following-

1) Bacteria choose among the codons for the same amino acid according to their
G-C composition. G-C rich codons get preference in G-C rich bacteria and,
conversely, A-T rich codons get preference in G-C poor bacteria.

2) Where the same amino acid is not present, an amino acid with the same or
nearly the same chemical properties is used in its place.

3) High G-C content bacteria often employ two different genes for the same purpose.
The finding of two possible genes for Carbonic Anhydrase in their genomes is our
evidence for this statement.

4) Most bacteria use Zinc at the metal site, yet a small number of bacteria use
Cadmium and other metals.

5) Even though the proteins are of varied length, one may look for Serine and Glycine
on the peptide chain and see that this region is conserved in all the proteins. This is
because the protein domains must be similar for all the anhydrases.
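
The codon preference stated in conclusion 1 can be put into a number by asking what fraction of an amino acid's codons end in G or C. The sketch below is only an illustration under our own assumptions: the function name gc3_fraction is ours, and the counts are the Ala and Val rows copied from the Salinispora list above.

```python
def gc3_fraction(codon_counts):
    """Fraction of codons whose third base is G or C.

    codon_counts maps a codon (e.g. "GCG") to the number of times it occurs.
    Returns 0.0 when there are no codons at all.
    """
    total = sum(codon_counts.values())
    gc_ending = sum(n for codon, n in codon_counts.items()
                    if codon[2].upper() in "GC")
    return gc_ending / total if total else 0.0


# Counts taken from the Salinispora list above.
salinispora_ala = {"GCG": 13, "GCA": 5, "GCT": 2, "GCC": 15}
salinispora_val = {"GTG": 13, "GTA": 1, "GTT": 1, "GTC": 8}

print(round(gc3_fraction(salinispora_ala), 2))  # 0.8
print(round(gc3_fraction(salinispora_val), 2))  # 0.91
```

A value well above 0.5 means that the G- or C-ending codons are preferred for that amino acid, as expected for a high G-C organism; a low G-C organism would be expected to give a value well below 0.5.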
