This document describes the development of an interactive tool to visualize errors in animal pedigree genotype data to help locate and isolate errors. The tool uses a novel "sandwich layout" to display pedigree data between generations in an easy-to-follow format. Individual errors are color-coded for easy identification. The tool allows users to mask problematic areas to infer corrected data and progressively "clean" the data set. An evaluation found the tool helped users better understand error patterns compared to previous table-based methods.
Visualising Errors in Animal Pedigree Genotype Data
1. VISUALISING ERRORS IN ANIMAL PEDIGREE GENOTYPE DATA
Martin Graham, Jessie Kennedy, Trevor Paterson & Andy Law
Edinburgh Napier University & The Roslin Institute, Univ of Edinburgh, UK
2. 2 years ago at Firbush...
I said:
“Aim is to develop interactive tools to locate and isolate errors in pedigree genotype
data in their datasets”
Where
Pedigree = Family tree of related animals
Genotype = Genetic makeup of an organism
3. Inheritance Basics (Very)
Humans have DNA
They in fact have 2 lots of DNA
(diploidy), which may or may not match at
certain points
Two lots of DNA bundled in a
chromosome
When two parents produce offspring, one lot of
DNA is passed onto the child from each parent
Which lot is used changes just to shuffle things up
a bit more
4. Inheritance Basics (Very)
By looking at many, many Single Nucleotide
Polymorphism (SNP) markers (points where we
know things vary between individuals at the
level of single DNA letters) we can check for
errors
A G A C A C
If one letter from each parent at these points
turns up in the same place in the child’s DNA
everything is good
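A minimal sketch of that check for a single marker, assuming each genotype is just a pair of single-letter alleles. The name is_consistent is made up for illustration and is not taken from the tool itself.

```python
from typing import Tuple

# Assumption: a genotype at one SNP marker is a pair of alleles, e.g. ("A", "G").
Genotype = Tuple[str, str]


def is_consistent(dad: Genotype, mum: Genotype, child: Genotype) -> bool:
    """True if one child allele can come from dad and the other from mum."""
    c1, c2 = child
    return (c1 in dad and c2 in mum) or (c2 in dad and c1 in mum)


# The slide's example "A G  A C  A C": the child can take A from one
# parent and C from the other, so everything is good.
print(is_consistent(("A", "G"), ("A", "C"), ("A", "C")))  # True
# A child with C C cannot have taken anything from an A G parent.
print(is_consistent(("A", "G"), ("C", "C"), ("C", "C")))  # False
```

The check is symmetric in the two parents, so it does not matter which of the first two genotypes is read as dad and which as mum.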
5. Errorz
But inevitably....
Errors creep in for various reasons: bad record-keeping, observations...
Muddled DNA sampling, animals “jumping the fence” etc etc
Nothing inherited from mum: A G C C C C
Nothing inherited from dad: A G C A G G
Novel allele. No inheritance from one parent, but we can’t tell which...: A G C A T A
Unusable data in this state
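The three error cases on this slide can be told apart with a few set checks. This is only a sketch: the labels "dad", "mum" and "unknown" and the function name classify_error are invented here, and reading each letter group on the slide as (mum, dad, child) is an assumption, since the slide does not state the order.

```python
from typing import Set, Tuple

Genotype = Tuple[str, str]  # assumed: two alleles at one marker


def classify_error(dad: Genotype, mum: Genotype, child: Genotype) -> Set[str]:
    """Return which parent(s) the child disagrees with at one marker.

    Empty set: no error. {"unknown"}: each parent could have supplied
    an allele, but not both at once (novel allele).
    """
    c1, c2 = child
    if (c1 in dad and c2 in mum) or (c2 in dad and c1 in mum):
        return set()
    from_dad = c1 in dad or c2 in dad
    from_mum = c1 in mum or c2 in mum
    if from_dad and from_mum:
        return {"unknown"}
    blamed = set()
    if not from_dad:
        blamed.add("dad")
    if not from_mum:
        blamed.add("mum")
    return blamed


# Slide examples, read as (mum, dad, child):
print(classify_error(dad=("C", "C"), mum=("A", "G"), child=("C", "C")))  # {'mum'}
print(classify_error(dad=("C", "A"), mum=("A", "G"), child=("G", "G")))  # {'dad'}
print(classify_error(dad=("C", "A"), mum=("A", "G"), child=("T", "A")))  # {'unknown'}
```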
6. Thus
There is a constant need to clean up pedigree
data
Roslin have a tool that views data as a table
(markers by individuals), so pedigree-based
patterns to error, such as the wrong dad for an
entire set of offspring, were very hard to spot
So they wanted a new tool, with a funky
7. Layouts
So (2 years ago) we looked at pedigree
layouts
And they were all rubbish
8. Layouts
Didn’t scale, became intractable to follow relationships, couldn’t
resolve generations, often only individual-out views rather than
whole pedigree etc
9. Layouts
So we developed what we called the sandwich
view. Between neighbouring generations, we
draw
Dads as the top slice of bread
Mums as the bottom slice of bread
Kids as the filling
Errors colour-coded across the marker set, more
10. Layouts
Each family forms a block between the
respective mum and dad, making it easy to
see who is whose offspring/parents
Layout works as males mate with multiple
females in each generation but the opposite is
rare
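A rough sketch of the grouping behind those family blocks, assuming each kid record simply names its dad and mum. Animal and family_blocks are hypothetical names, and the real layout code surely does more (ordering, sizing, drawing).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Animal(NamedTuple):
    name: str
    dad: str
    mum: str


def family_blocks(kids: List[Animal]) -> Dict[str, Dict[str, List[str]]]:
    """Group one generation's kids by dad (top slice), then mum (bottom slice)."""
    blocks = defaultdict(lambda: defaultdict(list))
    for kid in kids:
        blocks[kid.dad][kid.mum].append(kid.name)
    return {dad: dict(mums) for dad, mums in blocks.items()}


kids = [
    Animal("k1", "d1", "m1"),
    Animal("k2", "d1", "m1"),
    Animal("k3", "d1", "m2"),
    Animal("k4", "d2", "m3"),
]
for dad, mums in family_blocks(kids).items():
    for mum, names in mums.items():
        print(dad, mum, names)
```

Because a mum rarely appears under more than one dad in a generation, each (dad, mum) pair ends up as one contiguous block of filling.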
11. Layouts
Each child forms a glyph used to
show error
Divided into three parts
Up triangle coloured if error with dad
Down triangle coloured if error with
mum
Middle band coloured if error, but
parent in error is unknown (novel
allele)
Lo, pedigree-based error patterns
revealed themselves
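As a sketch, the glyph logic boils down to a mapping from a kid's per-marker error labels to the three parts. The labels ("dad", "mum", "unknown", "ok") and the function name glyph_parts are assumptions for illustration. Whether the real glyph also encodes how many markers are in error is not stated here, so this version only switches each part on or off.

```python
from typing import Dict, Iterable


def glyph_parts(marker_errors: Iterable[str]) -> Dict[str, bool]:
    """Decide which parts of a child's glyph get coloured.

    marker_errors holds one label per marker: "dad", "mum",
    "unknown" (novel allele) or "ok".
    """
    errors = set(marker_errors)
    return {
        "up_triangle": "dad" in errors,      # error with dad
        "down_triangle": "mum" in errors,    # error with mum
        "middle_band": "unknown" in errors,  # parent in error unknown
    }


print(glyph_parts(["ok", "dad", "unknown", "dad"]))
```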
12. Layouts
Tables full of data and histograms to show
error distribution by marker and individuals
also help
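The counts behind such histograms are easy to sketch, assuming the errors arrive as (individual, marker) pairs. The identifiers below are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def error_histograms(errors: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Count errors per marker and per individual."""
    by_marker: Counter = Counter()
    by_individual: Counter = Counter()
    for individual, marker in errors:
        by_individual[individual] += 1
        by_marker[marker] += 1
    return by_marker, by_individual


errors = [("k1", "snp1"), ("k2", "snp1"), ("k3", "snp1"), ("k1", "snp2")]
by_marker, by_individual = error_histograms(errors)
print(by_marker.most_common(1))  # [('snp1', 3)]
print(by_individual["k1"])       # 2
```

A marker that tops its histogram hints at a bad marker, while an individual that tops the other hints at a bad sample or a wrong parent.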
13. Cleaning
So, we can show errors nicely
But the aim is to get rid of all these errors
Masking is when we pretend we don’t know
the values for particular markers / individuals /
combinations thereof
What happens then is that those values are
inferred from the corresponding values in the
parents
A G G C
A G C C C C
? ? C C C C
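A minimal sketch of inferring a masked genotype, assuming only the two parents are consulted, as the slide says. The tool's actual inference is not described here, and the function names are hypothetical.

```python
from itertools import product
from typing import Optional, Set, Tuple

Genotype = Tuple[str, str]  # assumed: two alleles at one marker


def possible_child_genotypes(dad: Genotype, mum: Genotype) -> Set[Genotype]:
    """Every genotype a child could get: one allele from each parent."""
    return {tuple(sorted((d, m))) for d, m in product(dad, mum)}


def infer_masked(dad: Genotype, mum: Genotype) -> Optional[Genotype]:
    """Return the child's genotype if the parents pin it down, else None."""
    options = possible_child_genotypes(dad, mum)
    return next(iter(options)) if len(options) == 1 else None


print(infer_masked(("C", "C"), ("C", "C")))  # ('C', 'C')
print(infer_masked(("A", "G"), ("C", "C")))  # None
```

When the parents leave more than one option, the masked value simply stays unknown in this sketch, which fits the later point that a clean data set still has lots of missing data.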
14. Cleaning
The visualisation lets the biologist mask
individuals / bunches of markers / individual
genotype points / relationships
These are then shown in blue in the interface
15. Cleaning
This last point’s important as pedigree errors
just propagate down the pedigree. A wrong
parent for a child can’t be cured by hiding the
child
It’s also why we can’t clean these data sets
automatically, the biologist’s judgement in what
16. The Goal
Eventually we want a display with no nasty red
colours and then we can save it as a “clean”
data set
Though obviously with lots of missing data
But the biologists say their tools can handle
missing things; it’s wrong things that blow them up
And we did have to stick in a final “auto clean up”
button to fix sporadic errors that would have taken
ages to fix manually
But the major systematic errors are fixed by the
biologist
17. User Test
We did a user test with 11 biologists at Roslin
They preferred the new tool to the table-like
tool
Probably the most interesting thing past the
numbers was once again how much a bunch
of scientists are in thrall to Excel
Just like the taxonomists we’ve worked with /
social scientists we’re writing a proposal with
Which is why the Roslin guys made a table-a-like
tool in the first place to try and appease them
18. Conclusion
Built successful tool (got it published in
EuroVis, BioVis and AVI)
Whether it’s successful from the biologists’
point of view...
During the project, marker set sizes jumped from
thousands to hundreds of thousands
Sequencing the data used to be the costly part of
the process; staff time to clean it up was relatively
cheap
Biology in general is having a data crisis; some
opinions say it’s cheaper/easier to redo
experiments than store the TBs of information
19. Conclusion
Available at www.viper-project.org
Did do JavaDocs this time
I enjoyed it