1. Integrated DNA Technologies
Use of NCBI Databases in qPCR Assay
Design
Elisabeth Wagner, PhD
Scientific Applications Specialist
2. 1
Session Outcomes
You will:
Learn which NCBI tools are useful for designing qPCR assays
Become proficient using tools for qPCR design in the IDT SciTools® suite
Navigate the features and tools available on the NCBI website
Obtain sequence information for your gene of interest
Perform a BLAST search for assay specificity
Search for SNPs
Understand how to proceed with a basic qPCR design
3. 2
qPCR Design Covers A Lot of Ground
There are many uses for quantitative PCR.
Some examples:
Gene expression
Copy number variation
Genotyping
Multi-species analysis
Splice variant specific (or common) expression
We will address the general considerations for design in this session,
and cover more specific examples later this afternoon.
4. 3
SciTools® Overview
http://www.idtdna.com/pages/scitools
Several tools are available in the IDT SciTools® suite to assist with qPCR design:
1. RealTime PCR Tool
2. PrimerQuest® Tool
3. OligoAnalyzer® Tool
4. PrimeTime® Predesigned qPCR Assay Database
5. 4
NCBI Databases Overview:
1. Obtain sequence information for your gene of interest
NCBI Nucleotide or Gene
2. Perform a BLAST search for assay specificity
NCBI BLAST
3. Search for SNPs
NCBI dbSNP
NCBI lets you access all of the information needed for design
in one location.
7. NCBI Overview (National Center for Biotechnology Information)
Founded in 1988 as part of the United States National Library of Medicine
Houses a series of databases relevant to biotechnology and biomedicine
Curates GenBank, a database of over 1 × 10^12 bp of DNA sequences
Gene database, which integrates gene-specific information from numerous
species
dbSNP, which is a database of reported Single Nucleotide Polymorphisms (SNPs)
Contains the BLAST sequence similarity search program
Maintains PubMed, a journal database for biomedical literature
Much, much more information!
8. NCBI Database Search: Sequence Information for qPCR Assay Design
http://www.ncbi.nlm.nih.gov/
9. NCBI Sequence Files
Files:
Can be entered by anyone
May or may not be checked for accuracy
May contain contaminated sequence (plasmid or other)
May contain annotation errors
Accession numbers:
Letters at the beginning indicate the type of file
Nucleotide sequences start with 1 or 2 letters:
10. The RefSeq Database
non-redundant
explicitly linked nucleotide and protein
sequences
ongoing curation by NCBI staff and
collaborators, with reviewed records
indicated
includes data validation and
format consistency
distinct accession numbers
all accessions include an
underscore '_' character
Different versions are tracked
21. 20
Once a Sequence Is Entered, 3 Defaults Become Available
Often you will need to adjust the
parameters of the tool to meet
experimental design requirements
23. 22
Parameter Changes Depend on the Assay Required
Before changing anything, make sure
you have selected the correct assay
Sometimes you
simply need to
increase the number
of designs returned
It is unlikely that you will need
to change these parameters
27. 26
Once Initial Design Completed, Back to NCBI
Use NCBI tools to:
Check whether assay is specific (BLAST)
Ensure there are no SNPs to worry about (dbSNP)
Use IDT OligoAnalyzer® Tool
Check primers (and probe) for secondary structure and dimer
formation
29. 28
What is BLAST?—Getting to BLAST
http://www.ncbi.nlm.nih.gov/
Or http://blast.ncbi.nlm.nih.gov/Blast.cgi
30. 29
What is BLAST (Basic Local Alignment Search Tool)?
BLAST stands for Basic Local Alignment Search Tool and is provided by the National Center for
Biotechnology Information (NCBI)
Aligns a user defined query (sequence) to a wide variety of databases
Can translate the query or the database to align sequences
Can align 2 or more sequences together
Uses a heuristic algorithm to create alignments very quickly
Breaks sequences into “words” and searches the database for matches
Reassembles these matches based on the criteria entered
32. 31
How BLAST Works—Words
BLAST divides the query sequence into subsets called “words”,
which the algorithm uses to perform the alignment
Example (35 nt sequence):
CGATCGGGCATCACACAAAGTTATGTAGTAGAAAT
All possible words that can be generated from the sequence are
used for the alignment
The maximum number of 7-letter words for this sequence is 29
(35 − 7 + 1 = 29)
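The sliding-window idea on this slide can be sketched in a few lines of Python. This is only an illustration of how the words are generated from the query, not BLAST's actual implementation; the function name `blast_words` is made up for this example.

```python
def blast_words(query: str, word_size: int = 7) -> list[str]:
    """Return every overlapping word of length word_size,
    shifting the window one base to the right each time."""
    if word_size <= 0:
        raise ValueError("word_size must be positive")
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


# The 35 nt example sequence from the slide.
query = "CGATCGGGCATCACACAAAGTTATGTAGTAGAAAT"
words = blast_words(query, word_size=7)

print(len(query), len(words))   # 35 bases give 35 - 7 + 1 = 29 words
print(words[0], words[1])       # first two words, offset by one base
```

A smaller word size produces more (and less specific) words, which is why lowering it makes the search more sensitive for short queries such as primers.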
33. 32
Overview—Definitions
Hit: A sequence to which the query aligns and that is returned in the
BLAST results
Identity: the extent of exact matches between 2 sequences (e.g.,
ACGT and ACGG have 75% identity)
Similarity = Positives (in BLAST scoring)
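As a quick check of the identity definition, the sketch below computes percent identity for two already-aligned sequences of equal length. It assumes an ungapped alignment; the helper name `percent_identity` is hypothetical and is not part of any BLAST tool.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions that match exactly.
    Assumes a and b are already aligned, ungapped, and the same length."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# The example from the slide: 3 of 4 positions match.
print(percent_identity("ACGT", "ACGG"))  # 75.0
```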
34. 33
How BLAST Works—Scores
The BLAST raw score is converted to a bit score for each alignment using
parameters based on statistics described in Karlin and Altschul (1990)
(www.ncbi.nlm.nih.gov/pmc/articles/PMC53667/pdf/pnas01031-0226.pdf).
A high score does not necessarily indicate that the query is unique
The score is only dependent on the alignment, length of the sequence, and the
length of the database
E-value is the expected number of alignments with an equivalent or better
score that would occur by chance
Calculated using the Max bit score and the length of the query and database
Tells you the relative strength of the alignment
Shorter sequences have higher E-values because the probability of finding that
sequence by chance is higher
A low E-value does not mean you have a unique match!
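The relationship between bit score and E-value can be illustrated with the standard Karlin-Altschul approximation E = m × n × 2^(−S′), where m is the query length, n is the database length, and S′ is the bit score. The sketch below is a simplification: BLAST itself uses effective (edge-corrected) lengths, and the database size and bit scores here are invented for illustration, not real BLAST output.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S'), with m = query length,
    n = database length and S' = bit score.
    Simplified: BLAST applies effective-length corrections not shown here."""
    return query_len * db_len * 2.0 ** (-bit_score)


db_len = 3_000_000_000  # hypothetical database size in bases

# Hypothetical perfect matches: a short primer earns a lower bit score
# than a longer oligo, so its E-value is much higher.
for query_len, bit_score in [(20, 40.0), (60, 111.0)]:
    e = expect_value(bit_score, query_len, db_len)
    print(f"query {query_len} nt, bit score {bit_score}: E = {e:.3g}")
```

This is why a perfectly matched 20 nt primer can still show a comparatively high E-value: the score it can earn is limited by its length, and neither the score nor the E-value says whether the match is unique.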
35. 34
BLAST Assessment for qPCR Primers
Go to the BLAST server:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Enter primer sequences
separated by 7+ N’s
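Building the combined query by hand is easy to get wrong, so here is a minimal sketch that joins the oligos with a run of N's. The function name and the primer sequences are hypothetical placeholders; the only rule taken from the slide is the spacer of at least 7 N's.

```python
def build_blast_query(*oligos: str, spacer_len: int = 7) -> str:
    """Join primer (and optional probe) sequences with a run of N's
    so they can be submitted to BLAST as a single query."""
    if spacer_len < 7:
        raise ValueError("use a spacer of at least 7 N's, as the slide recommends")
    if len(oligos) < 2 or not all(oligos):
        raise ValueError("provide at least two non-empty sequences")
    return ("N" * spacer_len).join(seq.strip().upper() for seq in oligos)


# Hypothetical primer sequences, for illustration only.
forward = "agctgaccgtatcgatcaag"
reverse = "TTGACCGATAGCTAGGCATC"
print(build_blast_query(forward, reverse))
```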
36. 35
Select the Correct Database
“Others” is the most general but contains a lot of sequences. If possible use
Human or Mouse specific databases
For species with completed genome
projects, consider using “NCBI
Genomes” to limit BLAST results
37. 36
Change the parameters of the BLAST scoring
Select less
rigorous
algorithm
Change
Word size
to “7”
38. 37
Looking at the Results
The Graphic
Summary can
immediately give
you a sense of
what the overall
results are
Hover over
each result in
the graphic to
identify the
sequence
name
39. 38
Then Look at Results List
Look at E-value and Query Coverage. Look for
jumps in either/both.
Looks like assay is specific to a single
gene by transcript
Ignore the “alternate”
chromosome assemblies
40. 39
Investigate details of alignment
Check distance between primer
binding if looking at mRNA
Open
Graphics
result in a
new
tab/window
41. 40
BLAST Shows Primer Aligned to Sequence
Zoom out with “-” sign
You can grab within
window and drag
sequence side to side
42. 41
The Target Gene is on Chromosome 6
This looks promising with primers on different exons.
43. 42
But We Had Other Chromosomal Hits……
“real” transcript
Pseudogene—
doesn’t look
transcribed
Primers (red
bar indicates
mismatch)
44. 43
And Another One……
Another pseudogene.
But what’s
this?
Intron of a transcribed gene. So potentially in
RNA samples. Recommend avoiding if possible
48. 47
Assessing SNP Data
Tells you it’s a
single base
substitution
Indicates alternate
forms (here recorded
on opposite strand)
Indicates allele
frequency if known
Sometimes more frequency
data at bottom of page
49. 48
SNP Data Roughly Divided by Risk
Trusted source
Very low frequency
No data, likely not
going to be
problematic
Significant risk. Look
to redesign if possible
51. 50
Checking Primers with OligoAnalyzer® Tool
PrimerQuest® design tools give you the “best” assays for the region
specified
They check for self- and hetero-dimers, but this is only part of the scoring
system used
An assay may be “better” even with dimer issues if it scores well on
other parameters
Go to the OligoAnalyzer Tool
Perform self-dimer checks for primers and probe
Perform heterodimer checks on all primer/probe combinations (especially
important to include all combinations when multiplexing)
Check hairpin structures.
Look for stability of < -9 kcal/mol
Or multiple hairpins forming with < -4 kcal/mol
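The two hairpin thresholds above can be applied to ΔG values copied from the OligoAnalyzer output. The sketch below is an assumption-laden helper, not an IDT tool: the function name is invented, and the ΔG values must be entered by hand.

```python
def flag_hairpins(delta_gs: list[float],
                  strong: float = -9.0,
                  moderate: float = -4.0) -> list[str]:
    """Flag hairpin ΔG values (kcal/mol) using the slide's two rules:
    any structure below -9, or multiple structures below -4."""
    reasons = []
    if any(dg < strong for dg in delta_gs):
        reasons.append(f"at least one structure more stable than {strong} kcal/mol")
    if sum(dg < moderate for dg in delta_gs) >= 2:
        reasons.append(f"multiple structures more stable than {moderate} kcal/mol")
    return reasons


# Hypothetical ΔG values for three predicted hairpins of one oligo.
print(flag_hairpins([-1.2, -4.6, -5.1]))
```

An empty list means neither rule was triggered. ΔG alone does not show whether a dimer is extendable, so the alignments themselves still need to be inspected.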
52. 51
Assessing Dimer Data
Looks stable (< -9 kcal/mol)
But this is not “dangerous”;
avoid if possible, but OK
Looks stable (< -9 kcal/mol)
Not extendable, not a problem
Doesn’t look stable (> -9 kcal/mol)
Danger of extension,
exponential amplification!
55. 54
Primer and Probe Design Criteria for PrimeTime® Assays
Primers
equal Tm (60–63 °C)
15–30 bases in length
no runs of 4 or more Gs
amplicon size 50–150 bp (max 400 bp)
Probe
Probe length no longer than 30–35 bases
Tm value 4–10 °C higher than primers
no runs of 4 or more consecutive Gs
G+C content 30–80%
no G at the 5′ end
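The sequence-based criteria listed above can be checked with a short script. This sketch covers only the rules that depend on the sequence itself (length, G runs, G+C content, 5′ base); Tm and amplicon size are left out because they need a thermodynamic model and the target sequence. The function names and test sequences are hypothetical, and 35 is taken as the upper probe length limit.

```python
import re


def gc_percent(seq: str) -> float:
    """G+C content as a percentage of the oligo length."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def check_primer(seq: str) -> list[str]:
    """Return the primer criteria from the slide that the sequence violates."""
    seq = seq.upper()
    problems = []
    if not 15 <= len(seq) <= 30:
        problems.append("length outside 15-30 bases")
    if re.search(r"G{4,}", seq):
        problems.append("run of 4 or more Gs")
    return problems


def check_probe(seq: str, max_len: int = 35) -> list[str]:
    """Return the probe criteria from the slide that the sequence violates."""
    seq = seq.upper()
    problems = []
    if len(seq) > max_len:
        problems.append(f"longer than {max_len} bases")
    if re.search(r"G{4,}", seq):
        problems.append("run of 4 or more consecutive Gs")
    if not 30.0 <= gc_percent(seq) <= 80.0:
        problems.append("G+C content outside 30-80%")
    if seq.startswith("G"):
        problems.append("G at the 5' end")
    return problems


# Hypothetical sequences, for illustration only.
print(check_primer("ACGGGGTACGTACGTACG"))
print(check_probe("GATATATATATATATATATATA"))
```

An empty list means the oligo passes these particular checks. It is not a substitute for the full scoring that the design tools perform.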
Sequences can be pasted in, or an accession number can be entered to download the sequence. This is not limited to RefSeq sequences; the tool will find any accession number.
Important to note: PrimerQuest will not look for introns, and lower- and upper-case letters are treated the same way. Clicking the orange buttons is preferred if possible.
Amplicon size defaults: general PCR 200-1000 bp, qPCR 75-150 bp.
The default for “set Design parameters for….” is General PCR. You must click the appropriate assay button first, as the page will reload and restore all defaults (removing any changes you made).
The first thing to try is simply increasing the number of assays returned and seeing whether one matches the parameters of your design.
Design across junction(s) allows you to specify intron junctions or to focus design on a specific location
Just some examples of using parameters to restrict design to specific regions
This is a place to discuss particularly GC- or AT-rich sequences. For AT-rich sequences, the maximum oligo length needs to be increased as well.
The first link is the URL for the NCBI homepage. The first link to the right under Popular Resources is the link for the BLAST website. The BLAST program is directly accessible by using the link on the bottom of the slide.
There are two different types of alignments: local and global. As the name implies, BLAST is a local alignment tool. Global alignments try to align each letter of the query sequence with the sequence in a database. A local alignment tries to find the best match without aligning the entire sequence. BLAST does this by breaking the query into smaller parts and aligning these smaller sequences, which makes it a heuristic algorithm. A heuristic algorithm uses knowledge about the problem to create a faster method of solving it. This reduces the search time but will not always retrieve all sequences that have high identity.
BLAST can also align two sequences together, but this is generally not the best method for this type of alignment.
BLAST can restrict a search by an Entrez query, such as aligning a query to only a specific gene across all species in the database.
This section has the most popular BLAST tools that give users the most flexibility.
The concept of splitting the query into words is essential to understanding how BLAST works. The sequence above is 35 bases long. The box positioned over the first 7 letters marks the first 7-letter word. The next 7-letter word is the red box shifted one base to the right. Each subsequent word is shifted one more base to the right until the end of the sequence is reached.
Homology indicates that the proteins are related, which usually implies similar function, but this is not necessarily the case. An example of homologous proteins is human actin and mouse actin. Identity is the percent of bases (or amino acids) in aligned sequences that are exactly the same. Similarity is the percent of aligned amino acids that are either identical or have conserved properties. Hits are the sequences returned by BLAST that aligned to the query. Homology is not the same as identity or similarity! These terms are frequently used improperly.
The BLAST raw score is converted to a bit score for each alignment using parameters based on statistics described in the Karlin and Altschul paper (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC53667/pdf/pnas01031-0226.pdf). The details of the conversion are beyond the scope of this presentation. The Max bit score is the bit score of the highest-scoring alignment for a hit. For example, a primer may align in several places within an mRNA sequence, all with different percent identity. The match with the highest identity will have the max bit score, while the total score for that transcript will be the sum of all the matches. Blastn: For a given database, bit score is only dependent on the alignment, length of the sequence, and the length of the database. Technically, there are scoring matrices in blastn, but these are not frequently changed.
A high bit score does not indicate that the query is unique. A database that is completely composed of the same 30 nt repeat will give the same bit score for a perfect match of that 30 nt sequence as a database that has this sequence only once.
This of course only works with genomic targets, but it is another chance to show some variations on BLAST that can give clearer results.
The Database dropdown menu includes the first two radio button options at the top. Users can select the database from the dropdown menu and then click the question mark to find out more about the sources of the data in this database, the molecule type, the update date, and the number of sequences.
The Nucleotide collection (nr/nt) is the most common choice for nucleotide information. It contains information from GenBank as well as from the European (EMBL) and Japanese (DDBJ) databases. This database also includes information from the Protein Data Bank (PDB), which is a depository for solved biological structures (mostly proteins and nucleic acids). "nr" stands for non-redundant; however, this database is full of redundancies. Reference RNA sequences are all RNA sequences that are curated by NCBI and deposited in RefSeq. RefSeq attempts to be non-redundant and more thoroughly annotated than the Nucleotide database.
RefSeq genomic sequences is the DNA equivalent of the RNA RefSeq. 16S ribosomal RNA sequences is a new database that allows users to search just the 16S sequences of bacteria and Archaea. The nucleotide collection does not contain these ribosomal RNAs.
Note that you can also increase the “expect threshold” to 1000
Always good to give a sense of what the BLAST alignments are like. The thin line indicates the two matches are on the same sequence contig. If the bar is short at the 3’ end of either sequence it is likely not a concern
The order that sequences are listed seems a bit random assuming they all have the same score.
Two checks here: 1. does match extend full length of primers? 2. what is the distance between primers? For genomic matches like this a large distance may indicate an intron
How to use the graphic interface
When looking at other off-target matches it may be a pseudogene. This one is non-transcribed and so likely not a problem for an expression assay
Some intron sequences hang around for a while so it’s not always clear if a pseudogene in an intron is a problem. Always best to avoid if possible but inform customer if necessary
SNP analysis on BLAST can be temperamental but it is the simplest analysis so we’ll keep that here
Remember to hover to get pop-up window then click on the rs number
An example of unhelpful frequency data for a SNP.
Examples of different frequencies and discussion of relative danger
The delta G value represents the likelihood an oligo will form a specific structure. The more negative the value, the more spontaneously the oligo will form this structure and remain in this conformation. Therefore, the more negative the value, the stronger the structure. For most analyses, we recommend values less negative than -9 kcal/mole. However, this value is relatively conservative, especially for longer oligos. PCR and qPCR reactions can often perform with delta G’s of up to -12 kcal/mole.