This document discusses using W-curves and the traveling salesman problem (TSP) to cluster and compare sequences of the HIV-1 virus, specifically focusing on the gp120 envelope protein. Standard clades based on the full genome do not correlate well with immune response data, so the authors explore using smaller clinically relevant sequences. They describe how W-curves allow geometric comparison of sequences that can identify smaller epitope features dispersed throughout gp120. Clustering sequences using the TSP may define "clinical clades" that better match neutralization outcomes compared to full genome clades. The goal is to correlate neutralization data with small regions near epitopes to find DNA clusters predictive of immune response.
Comparing HIV-1 Epitopes with W-curves and TSP Clusters
1. Taking a walk on the W-side:
Comparing Epitopes on HIV-1
with the W-curve & TSP.
Douglas J. Cork1,2,4, Steven Lembark3, Bruce K. Brown1,4, Victoria
R. Polonis1,4, Jerome Kim1,4, Nelson L. Michael5
US Military HIV Research Program (MHRP)/Henry Jackson
Foundation(HJF)1, Rockville, MD., Illinois Institute of Technology2,
Chicago, IL., Workhorse Computing3, Woodhaven, NY., Walter Reed
Army Institute For Research4, Rockville, MD., Walter Reed Army
Institute for Research, Washington, DC5
2. Statistically, HIV-1 is a problem.
● One of the major problems in studying HIV-1 is the apparent randomness of clinical response.
● Tests using clades based on genome sequences show no correlation with immune response.
● Part of the answer may be clades based on smaller, clinically specific sequences.
● HIV-1 mutates 10,000 times faster than people.
● Existing clades end up including too much white noise to correlate well with anything.
3. The Structure of HIV-1
● gp120 is the primary focus for immune studies.
● gp120 and gp41 make up the envelope protein, gp160.
4. Standard Clades vs. Neutralization Data
● Standard clades of HIV-1 are based on phylogenetic trees of the genome.
● They do not correlate well with neutralization data.
● Between-clade and within-clade results show similar variability.
● Antibody and cell studies have low correlation for within-clade results.
● The lack of correlation prevents developing any broadly neutralizing treatments.
● Today we have to sequence the virus to treat it.
6. Neutralization Heat Map
● Distribution of response to antibody pools lacks any correlation with the standard clades.
7. HIV-1 Genetics Complicate Analysis
● Genes and proteins are normally reported with respect to a single strain, HXB2.
● It is hard to compare local features between strains.
● They have to be rediscovered for each study.
● Neutralization data are specific to gp120.
● Variable regions in gp120 leave corresponding locations in different samples off by tens of bases.
● Antibody binding sites (epitopes) are only a few bases long, with a majority in the variable regions.
8. Another approach: W-curves
● The W-curve is based on chaos and game theory.
● It abstracts a sequence of DNA into a three-dimensional structure.
● It was originally designed for visualization; we have now adapted it for machine comparison.
● Geometric analysis of the curves allows for piecewise comparison of the sequences.
9. The W-curve
● Start with a square at the origin and a discrete Z-axis matching the sequence base numbers.
● Each point moves halfway towards the corner for the next base.
10.
● All curves start at (0,0,0).
● The curve (blue) moves halfway towards “C” then “G” (red lines).
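The construction on the last two slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides only specify a square at the origin, so the assignment of bases to corners in `CORNERS` and the function name `w_curve` are assumptions.

```python
# Assumed layout: the slides do not say which base sits on which corner.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def w_curve(sequence):
    """Return the curve as (x, y, z) points, beginning at the origin."""
    x, y = 0.0, 0.0
    points = [(x, y, 0)]
    # z is the 1-based base number, giving the discrete Z-axis.
    for z, base in enumerate(sequence.upper(), start=1):
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    for point in w_curve("CG"):
        print(point)
```

With this layout, `w_curve("CG")` steps from the origin halfway towards the "C" corner and then halfway towards the "G" corner, as in the figure.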
11. Autoregression
● Curves converge by base 7 after a SNP at base 3.
● Convergence is quick even after large indels.
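The convergence follows from the halving step: once two sequences agree again, each new base halves the remaining gap between their curves. The sketch below shows this for a single SNP. The corner layout and the two example sequences are assumptions made for illustration.

```python
import math

# Assumed layout: the slides do not say which base sits on which corner.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def xy_points(sequence):
    """Return the (x, y) position of the curve after each base."""
    x = y = 0.0
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def pointwise_distance(seq_a, seq_b):
    """Distance between the two curves at each base position."""
    pairs = zip(xy_points(seq_a), xy_points(seq_b))
    return [math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in pairs]


reference = "GAATACAGG"
variant = "GACTACAGG"  # hypothetical SNP at base 3 (1-based): A -> C
distances = pointwise_distance(reference, variant)

if __name__ == "__main__":
    for base_number, d in enumerate(distances, start=1):
        print(base_number, d)
```

Under these assumptions the gap is zero before the SNP, jumps at base 3, and has fallen to one sixteenth of that jump by base 7.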
12. Handling Gaps
● Curves converge after a gap as they do after SNPs, but with a phase shift.
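A small sketch of the phase shift, assuming a single-base deletion and an arbitrary corner layout (both are illustrative choices, not taken from the slides). After the gap, point i of the shortened curve is compared with point i+1 of the original.

```python
import math

# Assumed layout: the slides do not say which base sits on which corner.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def xy_points(sequence):
    """Return the (x, y) position of the curve after each base."""
    x = y = 0.0
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


original = "GACTGAGCCA"
gap_at = 3  # hypothetical deletion of the base at 0-based index 3
gapped = original[:gap_at] + original[gap_at + 1:]

orig_pts = xy_points(original)
gap_pts = xy_points(gapped)

# Shift the comparison by one base to account for the deletion.
shifted = [math.dist(gap_pts[i], orig_pts[i + 1])
           for i in range(gap_at, len(gapped))]

if __name__ == "__main__":
    print(shifted)
```

Each entry of `shifted` is half the previous one, the same decay seen after a SNP, but only once the one-base offset is applied.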
13. Scoring Curves
● Approximating the distance smooths over SNPs.
● Small angles reduce the difference; large angles add to it.
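The slides do not give the scoring formula, so the following is only one possible reading: sum the straight-line distance between corresponding points of two curves. By the law of cosines, that distance shrinks when the angle between the two points (seen from the Z-axis) is small and grows when it is large. The function name `curve_score`, the corner layout, and the use of an exact distance instead of an approximation are all assumptions.

```python
import math

# Assumed layout: the slides do not say which base sits on which corner.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def xy_points(sequence):
    """Return the (x, y) position of the curve after each base."""
    x = y = 0.0
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def curve_score(seq_a, seq_b):
    """Hypothetical score: summed distance between corresponding points."""
    pairs = zip(xy_points(seq_a), xy_points(seq_b))
    return sum(math.dist(p, q) for p, q in pairs)


if __name__ == "__main__":
    print(curve_score("GAATACAGG", "GAATACAGG"))  # identical sequences
    print(curve_score("GAATACAGG", "GACTACAGG"))  # one SNP at base 3
```

Because the gap halves with every matching base, a single SNP adds less than twice its initial jump to a score of this kind, however long the sequences are.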
14. Needle in a Haystack: CD4 Epitope
● The CD4 epitopes occupy only a few, widely dispersed locations on gp120.
● Locating portions of the discontinuous epitope is difficult.
● Variable regions between them change the locations between samples.
● Portions of the epitope within the variable region can be hidden by nearby changes.
15. Analyzing the 3D Structure
● The advantage of W-curves is that even small features of the gene generate unique geometry.
● Features are easier to identify in 3D than in 1D CATG strings.
● By first locating large-scale features, we can search for smaller ones more easily.
● First align extreme points on the curves.
● Then compare regions between them.
● With a library of fragments, we pick the best match.
16. W-curve Algorithm & Serial Comparison
● Large-scale features guide the search for smaller pieces.
● Conserved regions anchor the search.
● After aligning 'peaks' in the curves, we align smaller and less discriminating features.
● A library of W-curve fragments finds the best fit with multiple samples.
● A repeatable process allows examining and scoring large numbers of finer features.
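The fragment-library step can be sketched as a sliding comparison. This is a simplified stand-in, not the authors' algorithm: it skips the peak-alignment stage entirely, and the names `best_offset` and `best_fragment`, the `warmup` parameter, the corner layout, and the example sequences are all assumptions. The first few points of each fragment are ignored because a fragment's curve starts at the origin and needs a few bases to converge onto the target's curve.

```python
import math

# Assumed layout: the slides do not say which base sits on which corner.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def xy_points(sequence):
    """Return the (x, y) position of the curve after each base."""
    x = y = 0.0
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def best_offset(target, fragment, warmup=4):
    """Slide the fragment's curve along the target's curve.

    Returns (offset, mean distance) for the closest position. The fragment
    must be longer than `warmup` bases.
    """
    target_pts = xy_points(target)
    frag_pts = xy_points(fragment)
    best = (None, math.inf)
    for offset in range(len(target) - len(fragment) + 1):
        dists = [math.dist(frag_pts[i], target_pts[offset + i])
                 for i in range(warmup, len(fragment))]
        score = sum(dists) / len(dists)
        if score < best[1]:
            best = (offset, score)
    return best


def best_fragment(target, library):
    """Pick the library entry whose best position scores lowest."""
    scored = {name: best_offset(target, frag) for name, frag in library.items()}
    name = min(scored, key=lambda n: scored[n][1])
    return name, scored[name]


if __name__ == "__main__":
    target = "ACGTTGCAAGGCTTACGATCCGATAGCT"  # made-up sequence
    library = {"present": target[10:22], "absent": "TTTTTTTTTTTT"}
    print(best_fragment(target, library))
```

In the usage example the fragment cut from the target is selected at the offset it was cut from, while the unrelated fragment scores far worse at every position.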
17. W-curves of HXB2 genome and gp120
● The curve for HXB2 illustrates the most important features of W-curves.
● Looking at each section of the W-curve, you'll notice that each area is different from the others.
● This is what allows us to locate small features: it is easier to discern them in 3D than in a character string.
● This figure also highlights the location of gp120.
18. [Figure: W-curve of the HXB2 genome, with the location of gp120 highlighted.]
19. A detailed view of gp120
● The next slide shows the first portion of HXB2's env gene: gp120.
● Again, notice that each portion of the curve is distinct from the others.
● The different conserved (C) and variable (V) regions are marked across the bottom of the image.
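20. [Figure: W-curve of gp120 from HXB2, with conserved (C) and variable (V) regions marked across the bottom.]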
21. The CD4 epitope in gp120
● This is where the W-curve really becomes useful: isolating the epitope locations within gp120.
● The highlighted areas show the epitope locations with an additional 3 bases of conformational region before and after (which combines a few of the regions).
● Note that the epitope is dispersed and lives largely in the variable regions.
22. [Figure: W-curve of gp120 with the CD4 epitope locations highlighted.]
23. Clustering With the TSP
● Solutions to the Traveling Salesman Problem can be used to cluster genes.
● The shortest path places more-similar sequences next to each other.
● The difficulty is in getting clades out of the TSP.
● One approach uses dummy cities with small distances to all other cities.
● Dummies end up in the inter-cluster regions.
● This approach has proven fast & repeatable.
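The dummy-city idea can be illustrated on a toy problem. The sketch below is not the solver used in this work: it finds the optimal tour by brute force, which is only feasible for a handful of cities, and a real analysis would use a proper TSP solver. The function name `tsp_clusters`, the zero dummy cost, and the large dummy-to-dummy distance (added so that dummies do not sit next to each other) are assumptions.

```python
from itertools import permutations


def tsp_clusters(dist, n_dummies, dummy_cost=0.0):
    """Cluster cities by cutting an optimal tour at its dummy cities.

    dist is a symmetric matrix of distances between real cities.
    Requires n_dummies >= 1. Brute force: toy-sized inputs only.
    """
    n = len(dist)
    total = n + n_dummies
    # Larger than any tour of real cities, so dummies never end up adjacent.
    big = 1 + sum(sum(row) for row in dist)

    def d(i, j):
        if i >= n and j >= n:
            return big          # dummy to dummy (assumption)
        if i >= n or j >= n:
            return dummy_cost   # dummy to any real city
        return dist[i][j]

    start = n  # fix the first dummy as the start; rotations are equivalent
    others = [i for i in range(total) if i != start]
    best_tour, best_len = None, float("inf")
    for perm in permutations(others):
        tour = (start,) + perm
        length = sum(d(tour[k], tour[(k + 1) % total]) for k in range(total))
        if length < best_len:
            best_tour, best_len = tour, length

    # Walk the tour and start a new cluster at every dummy city.
    clusters, current = [], []
    for city in best_tour:
        if city >= n:
            if current:
                clusters.append(current)
                current = []
        else:
            current.append(city)
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    positions = [0, 1, 2, 10, 11, 12]  # made-up 1D "sequences"
    dist = [[abs(a - b) for b in positions] for a in positions]
    print(tsp_clusters(dist, n_dummies=2))
```

In the toy example the two dummies land in the wide gap between the two groups, so cutting the tour at them recovers cities 0-2 and 3-5 as separate clusters. The number of dummies fixes the number of clusters, which is why choosing it repeatably matters.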
28. Further Work on Clusters
● Detection.
● Find an algorithm for repeatably assigning the number of dummy cities.
● Comparison.
● Automate detecting “similar” clusters.
● Time-series analysis.
● Watch sample groups for new members.
● Track the evolution of drug resistance in clinical trial groups and individual patients.
29. Ongoing Research
● Our goal is to correlate neutralization outcomes with DNA sequence.
● Compare small regions near the epitopes.
● Find DNA that clusters similarly to neutralization data.
● DNA clusters that match the neutralization data are “clinical” clades.
● The biggest issue will be deciding what “similar” is.
● Probably a good application for fuzzy logic.
30. Acknowledgments
● Thanks to the authors of the Brown et al. study.
All of the work we've shown you was done on a computer. Without fieldwork and wet labs, it would be empty. Next time you sit down to crunch some numbers, stop and picture for a moment the process of acquiring them. You'll get a whole new appreciation for your work.