Repetitive sequences in the eukaryotic genome

Analysis of DNA Sequences in
Eukaryotic Genomes
• The technique that is used to determine the sequence
complexity of any genome involves the denaturation and
renaturation of DNA.
• DNA is denatured by heating, which breaks the H-bonds between the strands and
renders the DNA single-stranded.
• If the DNA is rapidly cooled, the DNA remains single-stranded.
• But if the DNA is allowed to cool slowly, sequences that are
complementary will find each other and eventually base pair
again.
• The rate at which the DNA reanneals is a function of the
species from which the DNA was isolated.

• The Y-axis is the percent of the DNA that remains
single stranded.
• This is expressed as a ratio of the concentration
of single-stranded DNA (C) to the total
concentration of the starting DNA (Co).
• The X-axis is a log-scale of the product of the
initial concentration of DNA (in moles/liter)
multiplied by length of time the reaction
proceeded (in seconds).
• This product is written Cot (also C0t) and is called
the "Cot" value.
• The curve itself is called a "Cot" curve.

• As can be seen, the curve is rather smooth,
which indicates that reannealing occurs
slowly but gradually over a period of time.
• One particular value that is useful is Cot½, the
Cot value where half of the DNA has
reannealed.
• The shape of a "Cot" curve for a given species
is a function of two factors:
– the size or complexity of the genome; and
– the amount of repetitive DNA within the genome
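The axes and the Cot½ value described above can be illustrated with a small sketch. It assumes ideal second-order reassociation kinetics, C/C0 = 1 / (1 + k × C0t), a standard textbook model that the slides themselves do not state. The rate constant k and the function names are illustrative assumptions.

```python
def fraction_single_stranded(c0t: float, k: float) -> float:
    """Return C/C0, the fraction of DNA still single-stranded.

    Assumes ideal second-order kinetics (an assumption, not from the slides):
        C/C0 = 1 / (1 + k * C0t)
    c0t is in mol*s/L; k is a hypothetical rate constant in L/(mol*s).
    """
    if c0t < 0 or k <= 0:
        raise ValueError("c0t must be >= 0 and k must be > 0")
    return 1.0 / (1.0 + k * c0t)


def cot_half(k: float) -> float:
    """Cot value at which half the DNA has reannealed (C/C0 = 0.5)."""
    if k <= 0:
        raise ValueError("k must be > 0")
    return 1.0 / k


if __name__ == "__main__":
    k = 2.0  # hypothetical rate constant
    # Sample the curve on a log scale, as on the X-axis of a Cot plot.
    for exponent in range(-3, 4):
        c0t = 10.0 ** exponent
        print(f"C0t = {c0t:8.3f}  C/C0 = {fraction_single_stranded(c0t, k):.3f}")
    print("Cot1/2 =", cot_half(k))
```

Under this model a slower-reannealing genome simply has a smaller k, which shifts the whole curve to the right on the log axis.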

Reassociation kinetics
• A sample with a highly repetitive sequence
will renature rapidly, while complex
sequences will renature slowly.
• The amount of renaturation is measured
relative to a C0t value.
• The C0t value is the product of C0 (the initial
concentration of DNA), t (time in seconds),
and a constant that depends on the
concentration of cations in the buffer.
• Repetitive DNA will renature at low C0t
values, while complex and unique DNA
sequences will renature at high C0t values.

• The larger the genome size the longer it will take for any one
sequence to encounter its complementary sequence in the
solution.
• This is because two complementary sequences must
encounter each other before they can pair.
• The more complex the genome, that is the more unique
sequences that are available, the longer it will take for any
two complementary sequences to encounter each other and
pair.
• Given similar concentrations in solution, it will then take a
more complex species longer to reach Cot½.

Repetitive DNA Sequences
• Repeated DNA sequences, that is, DNA sequences
that are found more than once in the
genome of the species, have distinctive
effects on "Cot" curves.
• If a specific sequence is represented twice in
the genome, it will have two complementary
sequences to pair with and as such will have a
Cot½ value half as large as a sequence
represented only once in the genome.
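The halving rule above generalizes naturally: if Cot½ is taken to be inversely proportional to copy number, a sequence present n times reanneals at 1/n of the single-copy Cot½. The sketch below assumes exactly that proportionality. The function name and the numbers in the example are hypothetical.

```python
def cot_half_for_repeat(cot_half_single_copy: float, copies: int) -> float:
    """Cot1/2 expected for a sequence present `copies` times per genome.

    Assumes Cot1/2 scales as 1/copy number, extending the
    "twice as many copies, half the Cot value" rule.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if cot_half_single_copy <= 0:
        raise ValueError("cot_half_single_copy must be positive")
    return cot_half_single_copy / copies


if __name__ == "__main__":
    single = 1000.0  # hypothetical single-copy Cot1/2, mol*s/L
    for n in (1, 2, 1000, 100000):
        print(n, "copies ->", cot_half_for_repeat(single, n))
```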

• Genomes that contain these different
classes of sequences reanneal in a
different manner than genomes with
only single copy sequences.
• Instead of having a single smooth "Cot"
curve, three distinct curves can be
seen, each representing a different
repetition class.
• The first sequences to reanneal are the
highly repetitive sequences because so
many copies of them exist in the
genome, and because they have a low
sequence complexity.
• The second portion of the genome to
reanneal is the middle repetitive DNA,
and the final portion to reanneal is the
single copy DNA or unique DNA
sequence.
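The three-step curve described above can be sketched as a sum of independent components, each following the same ideal second-order kinetics (an assumed model, not stated in the slides). The genome fractions and Cot½ values used in the example are hypothetical placeholders, chosen only to make the three steps visible.

```python
import math
from typing import Sequence, Tuple

# Each component is (fraction of genome, Cot1/2 of that component).
Component = Tuple[float, float]


def multi_component_fraction(c0t: float, components: Sequence[Component]) -> float:
    """Fraction of DNA still single-stranded for a mixed genome.

    Sums f_i / (1 + C0t / Cot_half_i) over all repetition classes,
    assuming each class reanneals independently (an assumption).
    """
    total = sum(fraction for fraction, _ in components)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("component fractions must sum to 1")
    return sum(fraction / (1.0 + c0t / half) for fraction, half in components)


if __name__ == "__main__":
    # Hypothetical genome: highly repetitive, middle repetitive, single copy.
    genome = [(0.10, 1e-3), (0.25, 1.0), (0.65, 1e3)]
    for exponent in range(-5, 6):
        c0t = 10.0 ** exponent
        print(f"C0t = 1e{exponent:+d}  C/C0 = {multi_component_fraction(c0t, genome):.3f}")
```

Plotted against log C0t, the output drops in three stages, one per class, with the highly repetitive fraction reannealing first.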

Single copy sequences are found once
or a few times in the genome.
• Unique or non-repetitive sequences are those
found once or a few times within the genome.
• Structural genes are typically unique sequences of
DNA.
• The vast majority of proteins in eukaryotic cells are
encoded by genes present in one or a few copies.
• In humans, unique sequences are estimated to make
up approximately 55–60% of the genome.

Some moderately repetitive
sequences are transcribed
• Moderately repetitive DNA is present in a few
to about 10^5 copies in the genome.
• Middle repetitive DNA can vary from 100–300 bp
to 5000 bp in length and can be dispersed
throughout the genome.
• In a few cases, moderately repetitive
sequences are multiple copies of the same
gene.

• For example, the genes that encode ribosomal RNA
(rRNA) are found in many copies.
– Ribosomal RNA is necessary for the functioning of
ribosomes. Cells need a large amount of rRNA for making
ribosomes, and this is accomplished by having multiple
copies of the genes that encode rRNA.
• Likewise, the histone genes are also found in
multiple copies because a large number of histone
proteins are needed for the structure of chromatin.
• In addition, other types of functionally important
sequences can be moderately repetitive

Highly repetitive sequences are
present in large numbers of copies
• The most abundant sequences are found in the
highly repetitive DNA class.
• Highly repetitive DNA is present in about 10^5 to
10^7 copies in the genome and can range in size
from a few to several hundred bases in length.
• These sequences are found in regions of the
chromosome such as heterochromatin,
centromeres and telomeres and tend to be
arranged as tandem repeats.
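The three repetition classes can be summarized as a rough lookup by copy number. The boundary of about 10^5 copies comes from the slides. The cutoff for "a few" copies (10 here) is an arbitrary assumption, since the slides give no number, and the function name is hypothetical.

```python
def repetition_class(copies: int, few: int = 10) -> str:
    """Assign a sequence to a repetition class by its copy number.

    `few` is an assumed cutoff for "once or a few times";
    the 10^5 boundary follows the ranges given in the slides.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if copies <= few:
        return "single copy"
    if copies < 100_000:
        return "moderately repetitive"
    return "highly repetitive"


if __name__ == "__main__":
    for n in (1, 3, 500, 50_000, 1_000_000):
        print(n, "->", repetition_class(n))
```

In practice the classes overlap at their edges, so a hard threshold like this is only a convenience.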

Species        Sequence Distribution
Bacteria       99.7% Single Copy
Mouse          60% Single Copy; 25% Middle Repetitive; 10% Highly Repetitive
Human          70% Single Copy
Cotton         61% Single Copy
Corn           30% Single Copy
Wheat          10% Single Copy
Arabidopsis    55% Single Copy

Repetitive-Sequence DNA.
• Both moderately repetitive and highly
repetitive DNA sequences are sequences that
appear many times within a genome.
• These sequences can be arranged within the
genome in one of two ways:
– distributed at irregular intervals—known as
dispersed repeated DNA or interspersed repeated
DNA
– or clustered together so that the sequence
repeats many times in a row—known as tandemly
repeated DNA.

Interspersed genome-wide repeats
• Dispersed repeated sequences consist of families of
repeated sequences interspersed throughout the
genome.
• They can be either short or long, and many have the
added distinction of being either actual mobile
elements (transposons or retrotransposons) or
sequences derived from mobile elements.
• Transposons are mobile DNA sequences which
migrate to different regions of the genome via
transposition.

• A large portion of eukaryotic genomes is
composed of such sequences.
• They fall into several classes, and together they can
form a substantial part of the genome: about 45% or
more in humans and 50% in maize.
• Most dispersed, repeated sequences correspond to
the category of middle repetitive DNA, the number
of copies varying between a few and a few thousand.

• Two types of dispersed repeated sequences
are known:
– Long interspersed elements (LINEs), in which the
sequences in the families are about 1,000–
7,000bp long; and
– Short interspersed elements (SINEs), in which the
sequences in the families are 100–400 bp long.

• All eukaryotic organisms have LINEs and
SINEs, with a wide variation in their relative
proportions.
• Humans and frogs, for example, have mostly
SINEs, whereas Drosophila and birds have
mostly LINEs.
• LINEs and SINEs represent a significant
proportion of all the moderately repetitive
DNA in the genome.

Long interspersed repeat sequences (LINEs)
• Long interspersed repeat sequences (LINEs)
are mammalian retrotransposons that, in
contrast to retroviruses, lack long terminal
repeats (LTRs).
• LINEs (long interspersed nuclear elements)
comprise about 21% of the human genome
and consist of repetitive sequences up to 6500
bp long that are adenine-rich at their 3' ends.

• Mammalian diploid genomes have about
500,000 copies of the LINE-1 (L1) family,
representing about 21% of the genome.
• Other LINE families may also be present, but
they are much less abundant than LINE-1.
• Full-length LINE-1 family members are 6–7 kb
long, although most are truncated elements of
about 1–2 kb.

• LINEs encode two open reading frames (ORF1 and ORF2),
which are translated.
• The LINE-1 (L1) element is about 6.1 kb long, and its
two open reading frames (ORF1, about 1 kb, and ORF2, about 4 kb) encode:
– the RNA-binding protein p40 (ORF1); and
– a protein with both endonuclease and reverse transcriptase activities (ORF2).
• At the 5' end and at the 3' end they have an
untranslated region (5' UTR and 3' UTR).

• The 5' UTR contains the promoter sequence,
while the 3' UTR contains a polyadenylation
signal (AATAAA) and a poly-A tail.
• Approximately 600,000 L1 elements are
dispersed throughout the human genome.
• An L1 insertion can result in genetic disease if it
lands within a gene (e.g., hemophilia A).
• LINE-2 and LINE-3 elements are inactive because reverse
transcription from the 3' end often fails to
proceed to the 5' end.

Short interspersed nuclear elements (SINEs)
• SINEs are found in a diverse array of
eukaryotic species, including mammals,
amphibians, and sea urchins.
• Each species with SINEs has its own
characteristic array of SINE families.
• A well-studied SINE family is the Alu family of
certain primates.

• This family is named for the cleavage site for the
restriction enzyme AluI typically found in the
repeated sequence.
• In humans, the Alu family is the most abundant SINE
family in the genome, consisting of 200–300-bp
sequences repeated as many as a million times and
making up about 10% of the human genome.
• One Alu repeat is located every 5,000 bp in the
genome, on average.
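The Alu figures above can be cross-checked with simple arithmetic. The sketch assumes a haploid human genome of about 3.2 × 10^9 bp, a commonly quoted size that does not appear in the slides. The function and constant names are illustrative.

```python
# Assumed haploid human genome size in bp (not given in the slides).
HUMAN_GENOME_BP = 3_200_000_000


def genome_fraction(copies: int, unit_bp: int, genome_bp: int = HUMAN_GENOME_BP) -> float:
    """Fraction of the genome occupied by `copies` elements of `unit_bp` each."""
    return copies * unit_bp / genome_bp


def mean_spacing_bp(copies: int, genome_bp: int = HUMAN_GENOME_BP) -> float:
    """Average distance between elements if they were evenly spread."""
    return genome_bp / copies


if __name__ == "__main__":
    # Upper-end figures from the slides: a million copies of about 300 bp.
    print(f"genome fraction: {genome_fraction(1_000_000, 300):.1%}")
    print(f"mean spacing:    {mean_spacing_bp(1_000_000):.0f} bp")
```

Under this assumed genome size, a million 300-bp copies come to a little over 9% of the genome, in line with the "about 10%" figure, while one repeat per 5,000 bp corresponds to roughly 640,000 copies, which fits "as many as a million" being an upper bound.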

• The SINEs are also transposons, but they do
not encode the enzymes they need for
movement. They can move, however, if those
enzymes are supplied by an active LINE
transposon.
• SINEs can be best described as
nonautonomous LINEs, because they have the
structural features of LINEs but do not encode
their own reverse transcriptase

Role of LINEs and SINEs
• While historically viewed as "junk DNA",
recent research suggests that in some rare
cases both LINEs and SINEs were incorporated
into novel genes, so as to evolve new
functionality.
• The distribution of these elements has been
implicated in some genetic diseases and
cancers.

Tandem Repeats
• However, some moderately and highly
repetitive sequences are clustered together in
a tandem array, also known as tandem
repeats.
• In a tandem array, a very short nucleotide
sequence is repeated many times in a row.
• In Drosophila, for example, 19% of the
chromosomal DNA is highly repetitive DNA
found in tandem arrays.

• Depending on the average size of the arrays of
repeat units, highly repetitive noncoding DNA
belonging to this class can be grouped into
three subclasses: satellite, minisatellite and
microsatellite DNA.
– Classical satellite DNA: arrays of 100–5000 kb
– Minisatellite DNA: arrays of 100 bp – 20 kb
– Microsatellite DNA: arrays of <150 bp; repeat unit usually 4 bp or less

Satellite DNA
• Human satellite DNA is composed of very
large arrays (100 kb to several Mb) of tandemly repeated DNA, with
the repeat unit being a simple or moderately
complex sequence.
• Repeated DNA of this type is not transcribed.
• It accounts for the bulk of the heterochromatic
regions of the genome, being notably found in
the vicinity of the centromeres.

Minisatellite DNA
• Minisatellite DNA comprises a collection of
moderately sized arrays of tandemly repeated
DNA sequences which are dispersed over
considerable portions of the nuclear genome
• Like satellite DNA sequences, they are not
normally transcribed
• Arrays are often within the 0.1–20 kb range.

• In humans, 90% of minisatellites are found at the
sub-telomeric region of chromosomes.
• The telomere sequence itself is a tandem repeat:
TTAGGG TTAGGG TTAGGG .
• Variation in size (array length) of these regions
between individuals in humans was originally the
basis for DNA fingerprinting.

• Hypervariable minisatellite DNA
– many of the arrays are found near the telomeres
– 9–64 bp repeating unit, with arrays 0.1–20 kb
long.
• Telomeric DNA
– 10–15 kb of tandem hexanucleotide repeat units,
especially TTAGGG, which are added by a
specialized enzyme, telomerase

Microsatellites (SSRs, STRs)
• Also known as Short Tandem Repeats (STRs), Simple
Sequence Length Polymorphisms (SSLPs) and
Simple Sequence Repeats (SSRs).
• The repeating unit is 1–6 base pairs of DNA and
can be repeated 10 to 100 times.
• The most common in humans is the (CA)n sequence,
where n varies from 5 to 50 or more.
• Found on average every 10 kbp in the human genome.
• Trinucleotide and tetranucleotide tandem repeats
are comparatively rare.
• The lengths of particular microsatellite sequences
tend to be highly variable among individuals. These
differences make up molecular "alleles".
• Although microsatellite DNA has generally been
identified in intergenic DNA or within the introns of
genes, a few examples have been recorded within
the coding sequences of genes.

VNTR
• At a tandem repeat site, the number of repeats
varies widely in the population, although the repeat
number is usually well preserved during
transmission.
• Therefore each different repeat number can be
treated as a separate "allele" and the site can be
treated as a highly polymorphic site with multiple
alleles. Such a site is known as a VNTR (variable
number of tandem repeats) site.

• A Variable Number Tandem Repeat (or VNTR) is a location in
a genome where a short nucleotide sequence is organized as
a tandem repeat.
• These can be found on many chromosomes, and often show
variations in length between individuals.

• Each variant acts as an inherited allele,
allowing VNTRs to be used for personal or
parental identification. Their analysis is useful
in genetics and biology research, forensics,
and DNA fingerprinting (DNA profiling).
• Two principal families of VNTRs:
microsatellites and minisatellites
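Treating each repeat number as an allele makes a parentage check a matter of simple set logic: at one locus, a child should carry one allele from each parent. The sketch below assumes stable transmission with no mutation and looks at a single locus. The genotypes and the function name are hypothetical, and real profiling combines many loci.

```python
from typing import Tuple

# A genotype at one VNTR locus: the repeat numbers of the two alleles.
Genotype = Tuple[int, int]


def consistent_with_parents(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child could have inherited one allele from each parent.

    Assumes repeat numbers are transmitted unchanged (no mutation).
    """
    first, second = child
    return (first in mother and second in father) or (
        second in mother and first in father
    )


if __name__ == "__main__":
    # Hypothetical repeat counts at a single locus.
    print(consistent_with_parents((12, 15), (12, 9), (15, 15)))
    print(consistent_with_parents((12, 15), (12, 9), (7, 8)))
```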

• VNTRs arise via recombination or replication errors,
leading to alleles with different numbers of
repeats.
