1. interPopula: Database and tool
integration for population genetics
With a focus on the HapMap project
Tiago Rodrigues Antão
http://popgen.eu/soft/interPop
tiagoantao@gmail.com
Liverpool School of Tropical Medicine, UK
interPopula – p.
2. Preamble – the HapMap project
(and UCSC Known Genes)
3. HapMap
The goal of the International HapMap Project is to develop
a haplotype map of the human genome, the HapMap, which
will describe the common patterns of human DNA
sequence variation. The HapMap is expected to be a key
resource for researchers to use to find genes affecting
health, disease, and responses to drugs and environmental
factors. The information produced by the Project will be
made freely available.
http://hapmap.ncbi.nlm.nih.gov/
4. What is there?
11 pops, 90–180 individuals/pop (some cases with family
trios), >3M SNPs
Frequencies (e.g. for population P and SNP S, 30% of the
alleles are A and 70% are C)
Genotypes (data per individual)
Phasing data
Pedigree info
LD (linkage disequilibrium) computations
Copy Number Variation (CNV) info – New!
A second generation human haplotype map of over 3.1
million SNPs. Nature 449, 851-861. 2007.
5. UCSC Known Genes
A gene set constructed by an automated process, based
on protein data from Swiss-Prot/TrEMBL (UniProt) and
the associated mRNA data from Genbank
Inside UCSC Genome Browser
http://genome.ucsc.edu/
Not only for humans (but options are limited: fewer than a
handful of species)
Really useful for HapMap data (relating SNPs to gene
information is much easier than with Entrez SNP)
Hsu et al., Bioinformatics 2006, 22(9):1036-1046 (but see
the Genome Browser updates in NAR)
6. We now return to our regularly
scheduled program – interPopula
7. Introduction – 1
A Python library to access HapMap and UCSC Known
Genes data
A set of scripts providing integration examples:
interPopula with Biopython, matplotlib, Genepop and
Entrez SNP. Interaction with the wider ecosystem of
PopGen databases and Python tools is encouraged
A set of guidelines to deal with inconsistencies across
databases
Very easy to use, many examples
For Perl: Ensembl Variation API (Rios et al. BMC
Bioinformatics 2010, 11:238)
8. Introduction – 2
Python (2.6) based. Test coverage very high
Uses sqlite (Python built-in, no extra dependencies)
Creates a local SQL database from ftp data files
Can be disk and network intensive
Intelligent download: on-demand and never repeats the
same data twice
Database not normalized (for performance and space
reasons)
Family support (triage of offspring)
Data export (Genepop). X and Y aware.
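The on-demand download behaviour listed above can be sketched with the standard-library sqlite3 module. This is only an illustration, not interPopula's actual code: the table layout, the function names (open_db, require_chr_pop) and the fetch callback are hypothetical, and the callback stands in for the real FTP download.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create the local database with a bookkeeping table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS loaded "
                 "(chrom TEXT, pop TEXT, PRIMARY KEY (chrom, pop))")
    conn.execute("CREATE TABLE IF NOT EXISTS freq "
                 "(chrom TEXT, pop TEXT, rs TEXT, pos INTEGER)")
    return conn

def require_chr_pop(conn, chrom, pop, fetch):
    """Load data for (chrom, pop) only if it is not already stored locally.

    `fetch` is a stand-in for the real download and returns (rs, pos) rows.
    Returns True if a download happened, False if the data was already cached.
    """
    cur = conn.execute("SELECT 1 FROM loaded WHERE chrom = ? AND pop = ?",
                       (chrom, pop))
    if cur.fetchone() is not None:
        return False
    rows = fetch(chrom, pop)
    with conn:  # commit the data and the bookkeeping row together
        conn.executemany("INSERT INTO freq VALUES (?, ?, ?, ?)",
                         [(chrom, pop, rs, pos) for rs, pos in rows])
        conn.execute("INSERT INTO loaded VALUES (?, ?)", (chrom, pop))
    return True

calls = []

def fake_fetch(chrom, pop):
    """Pretend download; records each call so repeats can be detected."""
    calls.append((chrom, pop))
    return [("rs1", 100), ("rs2", 250)]

conn = open_db()
first = require_chr_pop(conn, "22", "CEU", fake_fetch)   # triggers the fetch
second = require_chr_pop(conn, "22", "CEU", fake_fetch)  # served from the cache
```

Recording what has been loaded in the same database as the data keeps the "never download twice" guarantee across sessions when a file path is used instead of ":memory:".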
9. HapMap example
To get a feel for the interface...
# chr, pop, startPos and endPos are assumed to be defined by the caller
freqDB = Frequency()
freqDB.requireChrPop(chr, pop)
RSs = freqDB.getRSsForInterval(chr, startPos, endPos)
for rs in RSs:
    # We get frequency information
    freqSNP = freqDB.getPopSNPs(pop, rs)
    nuc1, nuc2 = freqSNP[5], freqSNP[6]
    a1a1, a2a2, a1a2 = freqSNP[7], freqSNP[8], freqSNP[9]
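As a follow-up to the loop above, here is a minimal sketch of turning genotype counts into allele frequencies. It assumes that a1a1, a1a2 and a2a2 are counts of diploid individuals at an autosomal SNP (the slide does not say whether these fields are counts or proportions); allele_frequencies is a hypothetical helper, not part of interPopula, and the numbers are made up.

```python
def allele_frequencies(a1a1, a1a2, a2a2):
    """Return (freq_allele1, freq_allele2) from diploid genotype counts."""
    total_alleles = 2 * (a1a1 + a1a2 + a2a2)
    if total_alleles == 0:
        raise ValueError("no genotypes observed")
    # Each homozygote carries two copies of allele 1, each heterozygote one
    freq1 = (2 * a1a1 + a1a2) / float(total_alleles)
    return freq1, 1.0 - freq1

# Made-up counts: 10 A/A, 40 A/C and 50 C/C individuals
f_a, f_c = allele_frequencies(10, 40, 50)
```

With these counts the result is 30% A and 70% C. X and Y chromosome SNPs would need a different allele total, since males carry a single copy.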
10. UCSC Known Genes support
Everything is supported (not that much, just a long text
file plus a link table)
Get different IDs (Accession ID, Prot ID, other links)
What is near a certain genomic position (chromosome
and position in chromosome)
Get exons for a certain gene
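The "what is near a position" query can be illustrated with a plain Python sketch. The genes_near function, the tuple layout (name, chromosome, start, end) and the gene names are invented for illustration; they are not the interPopula API or real Known Genes records.

```python
def genes_near(genes, chrom, pos, window=10000):
    """Return names of genes on `chrom` within `window` bp of `pos`, nearest first."""
    hits = []
    for name, g_chrom, start, end in genes:
        if g_chrom != chrom:
            continue
        # Distance is 0 when the position falls inside the gene
        if pos < start:
            dist = start - pos
        elif pos > end:
            dist = pos - end
        else:
            dist = 0
        if dist <= window:
            hits.append((dist, name))
    return [name for dist, name in sorted(hits)]

# Made-up gene spans
genes = [
    ("GENE_A", "chr1", 1000, 5000),
    ("GENE_B", "chr1", 20000, 30000),
    ("GENE_C", "chr2", 1000, 5000),
]
near = genes_near(genes, "chr1", 12000, window=8000)
```

Here both chr1 genes are returned, GENE_A first because it is closer; GENE_C is skipped because it sits on another chromosome.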
11. Integration
Many examples provided on interoperability (with
matplotlib, Entrez SNP, Genepop and Biopython)
Integrating heterogeneous databases
Databases do use different reference assemblies
Example: The exon positions given by the latest version
of UCSC Table Browser are not compatible with
HapMap (build 37 vs build 36)
Silent bug: applications rarely crash and the results
seem correct
This issue is discussed in the context of
HapMap/TableBrowser/EntrezSNP and might be useful
in other cases
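One way to turn this silent bug into a loud one is to carry the assembly build alongside the coordinates and refuse to combine mismatched data. The sketch below is an assumption about how such a guard could look, not the guideline shipped with interPopula; snps_in_exons, AssemblyMismatch and the sample data are hypothetical, and exon ends are treated as inclusive.

```python
class AssemblyMismatch(ValueError):
    """Raised when two data sets use different reference assemblies."""

def snps_in_exons(snps, exons, snp_build, exon_build):
    """Return rs IDs whose position falls in an exon; refuse mixed assemblies."""
    if snp_build != exon_build:
        raise AssemblyMismatch(
            "SNP positions use build %s but exons use build %s"
            % (snp_build, exon_build))
    return [rs for rs, pos in snps
            if any(start <= pos <= end for start, end in exons)]

# Made-up positions on a single chromosome
snps = [("rs1", 150), ("rs2", 900)]
exons = [(100, 200), (400, 500)]

same = snps_in_exons(snps, exons, "36", "36")
try:
    snps_in_exons(snps, exons, "36", "37")
    mismatch_detected = False
except AssemblyMismatch:
    mismatch_detected = True
```

Without the check, the second call would quietly return a plausible-looking list computed from incompatible coordinates.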
14. Future work
Focus on HapMap and maybe 1000 Genomes project
The whole UCSC Table Browser support will be spun off later
into a different project
Copy Number Variation support (since June on HapMap)
Phasing support due very soon (like next week)
Provide examples with genome-wide association studies