2. Ancestry.com
• World’s largest online family history resource
• Started as a publishing company in 1983, online from 1996
• 2.7 million worldwide subscribers
3. Data at Ancestry
• Historical records: company acquired content collections
• User created content:
– Ancestor profiles and family trees
– Uploaded photographs and stories
• User behavior data on Ancestry.com
• Customer DNA data
• 10 PB of structured and unstructured data
4. Historical records
• Historical Content
– 14 billion historical records going back to 17th century
– Digitized and searchable
10. Record linkage
• Record linkage: finding and matching records in multiple data sets with non-unique identifiers (data matching, entity disambiguation, duplicate detection etc)
• Goal: bring together information about the same person
• Some non-unique identifiers:
– Names: first name, last name (John Smith: 300,000 records)
– Dates: date of birth, date of death
– Places: place of birth, residence, place of death
– Extra: family members, life events
• Records often incomplete and contain mistakes
• Other industries: banking, insurance, government etc
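The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the method used at Ancestry: the field names, the weights, and the two-year tolerance on birth years are all hypothetical choices. It scores a pair of records field by field and skips fields that are missing, since records are often incomplete.

```python
from difflib import SequenceMatcher

# Hypothetical field weights, chosen only for illustration.
WEIGHTS = {"first_name": 0.3, "last_name": 0.3, "birth_year": 0.2, "birth_place": 0.2}


def field_similarity(a, b):
    """Return a similarity in [0, 1], or None when either value is missing."""
    if a is None or b is None:
        return None
    if isinstance(a, int) and isinstance(b, int):
        # Years: full credit for an exact match, half credit within 2 years.
        diff = abs(a - b)
        return 1.0 if diff == 0 else (0.5 if diff <= 2 else 0.0)
    # Strings: tolerate spelling mistakes with an approximate comparison.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def match_score(rec_a, rec_b):
    """Weighted average similarity over the fields present in both records."""
    total, weight_sum = 0.0, 0.0
    for field, weight in WEIGHTS.items():
        sim = field_similarity(rec_a.get(field), rec_b.get(field))
        if sim is None:
            continue  # incomplete record: skip the field instead of penalizing
        total += weight * sim
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Made-up example records.
census = {"first_name": "John", "last_name": "Smith", "birth_year": 1850, "birth_place": "Ohio"}
death = {"first_name": "Jon", "last_name": "Smith", "birth_year": 1851, "birth_place": None}
other = {"first_name": "Mary", "last_name": "Jones", "birth_year": 1902, "birth_place": "Texas"}

print(match_score(census, death))  # high score: likely the same person
print(match_score(census, other))  # low score: likely different people
```

With a name as common as John Smith, a score like this would only rank candidates; the extra identifiers (family members, life events) would still be needed to tell them apart.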
11. User behavior data
• User behavior data:
– 75 mln searches daily
– 10 mln profiles added daily
– 3.5 mln records attached daily
12. DNA Data
• Direct to consumer DNA test
• 700,000 SNPs per sample
• 400,000 DNA samples
• No medical studies
14. Ancestry DNA
• Genetic inheritance
– Identity-by-descent
– Cousin matching
[Figure: Matching DNA]
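As a rough illustration of how shared DNA between two samples might be detected, the sketch below looks for long runs of SNPs with no opposite-homozygous site. This is a simplified heuristic and is not presented as the method used by AncestryDNA. The genotype coding (0, 1, or 2 copies of one allele per SNP), the function name `shared_segments`, and the `min_snps` threshold are assumptions made for the example.

```python
def shared_segments(geno_a, geno_b, min_snps):
    """Return (start, end) index ranges, end exclusive, of runs with no
    opposite-homozygous site that are at least min_snps long."""
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype vectors must cover the same SNPs")
    segments = []
    start = 0
    for i, (a, b) in enumerate(zip(geno_a, geno_b)):
        if abs(a - b) == 2:  # 0 vs 2: the pair cannot share an allele here
            if i - start >= min_snps:
                segments.append((start, i))
            start = i + 1  # the next candidate run begins after the break
    if len(geno_a) - start >= min_snps:
        segments.append((start, len(geno_a)))
    return segments


# Made-up genotypes for two samples over eight SNPs.
sample_a = [0, 1, 2, 2, 1, 0, 0, 2]
sample_b = [1, 1, 2, 0, 1, 1, 0, 0]
print(shared_segments(sample_a, sample_b, min_snps=3))
```

In a realistic setting the threshold would be far larger than three SNPs, and the total length of candidate segments would feed into an estimate of how closely two people are related.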
15. DNA data: privacy and research
[Figure: clipped excerpt of an article by Zhen Lin, Art B. Owen, Russ B. Altman, with the chart "Trade-offs between SNPs and privacy": privacy (Low, Medium, High) plotted against the number of independent SNPs, with ranges marked "Insufficient for future genomic research", "Insufficient for privacy protection", and "Needed to find genetic relationships".]
Z. Lin, A. Owen, R. Altman, Science, vol 305, 2004
16. Challenges
• Engineering
– Scalability
– Availability
– Security
• Research
– Information retrieval
– DNA genomic research
• Privacy