QCon San Francisco Presentation, Scaling Ancestry DNA with HBase and Hadoop

Scaling AncestryDNA
Using Hadoop and HBase

November 11, 2013
Jeremy Pollack (Engineer) and Bill Yetman (Manager)

1

What Does This Talk Cover?

What does Ancestry do?
How does the science work?
How did our journey with Hadoop start?
DNA matching with Hadoop and Hbase

Lessons Learned
What’s next?

2

Discoveries are the Key
We are the world's largest online family history resource

• Over 30,000 historical content collections

• 12 billion records and images
• Records dating back to 16th century
• 10 petabytes

4

Discoveries in Detail

The “eureka” moment drives our business

5

Discoveries with DNA

Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 200,000+ DNA samples
700,000 SNPs for each sample
10,000,000+ cousin matches
150,000

Genotyped Samples

100,000
50,000
-

DNA molecule 1 differs from DNA
molecule 2 at a single base-pair
location (a C/T polymorphism)
(http://en.wikipedia.org/wiki/Singlenucleiotide_polymorphism)

6

Network Effect – Cousin Matches
3,500,000

Cousin Matches

3,000,000
2,500,000
2,000,000
1,500,000
1,000,000
500,000
0
2,000

10,053

21,205

40,201

60,240

Database Size
7

80,405

115,756

Where Did We Start?
The process before Hadoop

8

What’s the Story?
Cast of Characters (Scientists and Software Engineers)
Scientists
Think they can code:
• Linux
• MySQL

• PERL and/or Python

Software Engineers
Think they are Scientists:
• Biology in HS and College
• Math/Statistics
• Read science papers

Pressures of a new business
– Release a product, learn, and then scale

Sr. Manager and 3 developers and 2 member Science Team
9

What Did “Get Something Running” Look Like?

Ethnicity Step
and Matching (Germline)
runs here

“Beefy Box”

Specifics:
1) Ran multiple threads for the two steps
2) Both steps were run in parallel
3) As the DNA Pool grew both steps required more memory

Single Beefy Box – Only option is to scale Vertically
10

Measure Everything Principle
• Start time, end time, duration in seconds, and sample
count for every step in the pipeline. Also the full end-toend processing time

• Put the data in pivot tables and graphed each step

• Normalize the data (sample size was changing)

• Use the data collected to predict future performance
11

Challenges and Pain Points
Performance degrades when DNA pool grows

• Static
(by batch size)

• Linear
(by DNA pool size)

• Quadratic (Matching related steps) – Time bomb

(Courtesy from Keith’s Plotting)
12

New Matching Algorithm
Hadoop and HBase

13

What is GERMLINE?
• GERMLINE is an algorithm that finds hidden
relationships within a pool of DNA
• GERMLINE also refers to the reference
implementation of that algorithm written in C++
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/

14

So What’s the Problem?
• GERMLINE (the implementation) was not meant to be
used in an industrial setting




Stateless
Single threaded
Prone to swapping (heavy memory usage)

• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would
slow to a crawl
• Put simply: GERMLINE couldn't scale

15

Hours

GERMLINE Run Times (in hours)
25

20

15

10

5

0

60,000
57,500
55,000
52,500
50,000
47,500
45,000
42,500
40,000
37,500
35,000
32,500
30,000

27,500
25,000
22,500
20,000
17,500
15,000
12,500
10,000
7,500

16

5,000
2,500

Samples

Projected GERMLINE Run Times (in hours)
700

600

500

Hours

400

300

200
GERMLINE run
times
100

Projected
GERMLINE run
times

0

122,…

112,…

102,…

92,5…

82,5…

Samples

72,5…

62,5…

52,5…

42,5…

32,5…

22,5…

12,5…

2,500
17

The Mission : Create a Scalable Matching
Engine

... and thus was
born
(aka "Jermline with a J")

18

What is Hadoop?

• Hadoop is an open-source platform for processing large
amounts of data in a scalable, fault-tolerant, affordable
fashion, using commodity hardware
• Hadoop specifies a distributed file system called HDFS
• Hadoop supports a processing methodology known as
MapReduce
• Many tools are built on top of Hadoop, such as HBase,
Hive, and Flume
19

What is HBase?

• HBase is an open-source NoSQL data store that runs on top of
HDFS
• HBase is columnar; you can think of it as a weird amalgam of a
hashtable and a spreadsheet
• HBase supports unlimited rows and columns
• HBase stores data sparsely; there is no penalty for empty cells
• HBase is gaining in popularity: Salesforce, Facebook, and Twitter
have all invested heavily in the technology, as well as many others
21

Battlestar Galactica Characters, in an HBase Table

KEY

is_cylon hair_color

gender

is_final_five
no

Six

blonde

female

Adama

22

true
false

brown

male

rank
admiral

Adding a Row to an HBase Table

KEY

is_cylon hair_color

gender

is_final_five
no

Six

blonde

female

Adama

false

brown

male

Baltar

23

true
false

brown

male

rank
admiral

Adding a Column to an HBase Table

KEY

is_cylon

hair_color

gender

is_final_five

Six

true

blonde

female

no

Adama

false

brown

male

Baltar

false

brown

male

24

rank

friends

admiral

Kara Thrace,
Saul Tigh

DNA Matching : How it Works

The Input
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC

Kara Thrace, aka
Starbuck
• Ace viper pilot
• Has a special
destiny
• Not to be trifled
with
25

Admiral Adama
• Admiral of the
Colonial Fleet
• Routinely saves
humanity from
destruction


Separate into words

0
1
2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC

26


Build the hash table
0
1
2
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama

27


Iterate through genome and find matches
0
1
2
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
Starbuck and Adama match from position 1 to position 2

28

Does that mean they're related?

...maybe
29

IBD to Relationship Estimation

0.02

0.03

0.04

m1
m2
m3
m4
m5
m6
m7
m8
m9
m10
m11

0.00

0.01

• This is basically a
classification problem

probability

• We use the total length of
all shared segments to
estimate the relationship
between to genetic
relatives

0.05

ERSA

5

10

20

50

100 200
total_IBD(cM)

30

500 1000

5000

But Wait...What About Baltar?
Baltar : TTAAGCCTAGGGGCG

Gaius Baltar
• Handsome
• Genius
• Kinda evil
31

Adding a new sample, the GERMLINE way

32

The GERMLINE Way
Step one: Rebuild the entire hash table from scratch,
including the new sample
0
1
2
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
GGGCG_2 : Baltar

33

The GERMLINE Way
Step two: Find everybody's matches all over
again, including the new sample. (n x n comparisons)
0
1
2
ACTGA_0 : Starbuck
GGGCG_2 : Baltar
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
34

The GERMLINE Way
Step three: Now, throw away the evidence!
0
1
2
ACTGA_0 : Starbuck
GGGCG_2 : Baltar
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1

You have done this before, and you will have to do
it ALL OVER AGAIN.
35

The

Way
Step one: Update the hash table
Starbuck

2_ACTGA_0

Adama

1

2_TTAAG_0

1

2_CCTAG_1

1

1

2_TTGAC_2

1

Already stored in HBase

1


New sample to add

Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that
position on that chromosome
36

The

Way
Step two: Find matches, update the results table
2_Starbuck

2_Starbuck
2_Adama

2_Adama
{ (1, 2), ...}

Already stored
in HBase

{ (1, 2), ...}

Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1

New matches
to add

Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
37

The

Way
Hash Table
Starbuck
2_ACTGA_0

Adama

Baltar

1

1
1

1

2_TTAAG_0
2_CCTAG_1

1

1

2_TTGAC_2

1

1

2_GGGCG_2

1

Results Table
2_Starbuck

2_Adama

38

{ (1), ...}

{ (1), ...}

{ (1, 2), ...}

2_Baltar

2_Baltar

{ (1, 2), ...}

2_Starbuck

2_Adama

{ (0,1), ...}
{ (0,1), ...}

But wait ... what about Zarek, Roslin, Hera,
and Helo?

39

Run them in parallel with Hadoop!

Photo by Benh Lieu Song

40

Parallelism with Hadoop
• Batches are usually about a thousand people
• Each mapper takes a single chromosome for a
single person
• MapReduce Jobs :
Job #1 : Match Words
o

Updates the hash table

Job #2 : Match Segments
o

41

Identifies areas where the samples match

How does

perform?

A 1700% performance improvement
over GERMLINE!
(Along with more accurate results)

42

Hours

Run times for Matching (in hours)
25

20

15

10

5

0

117,500
112,500
107,500
102,500

97,500
92,500
87,500
82,500
77,500
72,500
67,500
62,500
57,500
52,500
47,500

42,500
37,500
32,500
27,500
22,500
17,500
12,500

43

7,500
2,500

Samples

Run times for Matching (in hours)
180
160
140
120

Hours

100

GERMLINE run
times

80

Jermline run
times

60

Projected
GERMLINE run
times

40
20
0

44

Samples

Incremental Changes Over Time
• Support the business, move incrementally and adjust
• After H2, pipeline speed stays flat

(Courtesy from Bill’s plotting)
45

Dramatically Increased our Capacity

Bottom line: Without Hadoop and HBase, this would have
been expensive and difficult.

46

And now for everybody's favorite part ....

Lessons Learned

47

Lessons Learned

What went right?

48

Lessons Learned : What went right?

• This project would not have been possible without TDD
• Two sets of test data : generated and public domain
• 89% coverage

• Corrected bugs found in the reference implementation
• Has never failed in production

49

Lessons Learned

What would we do differently?

50

Lessons Learned : What would we do differently?
• Front-load some performance tests



HBase and Hadoop can have odd performance profiles
HBase in particular has some strange behavior if you're not
familiar with its inner workings

• Allow a lot of time for live tests, dry runs, and
deployment


51

These technologies are relatively new, and it isn't always
possible to find experienced admins. Be prepared to "get your
hands dirty"

What’s next for the Science Team?

52

Our new lab in Albuquerque, NM

53

Okay, for real this time. What’s next for the Science
Team?

54

More Accurate Results
Potential matches

Relevant matches

55

Mapping Potential Birth Locations for Ancestors
Birth locations from 1750-1900 of individuals with large amounts of genetic
ancestry from Senegal
1750-1850
1800-1900

Over-represented birth location in individuals with large amounts of Senegalese ancestry
Birth location common amongst individuals with W. African ancestry
56

How will the engineering team enable these
advances?

57

Engineering Improvements
• Implement algorithmic improvements to make our results
more accurate
• Recalculate data as needed to support new scientific
discoveries
• Utilize cloud computing for burst capacity
• Create asynchronous processes to continuously refine our
data
• Whatever science throws at us, we'll be there to turn
their discoveries into robust, scalable solutions

58

End of the Journey (for now)

Questions?
Tech Roots Blog: http://blogs.ancestry.com/techroots

59

Appendix A. Who are the presenters?

Bill Yetman
61

Jeremy Pollack

QCon San Francisco Presentation, Scaling Ancestry DNA with HBase and Hadoop

Recomendados

Recomendados

Mais conteúdo relacionado

Último

Último (20)

Destaque

Destaque (20)

QCon San Francisco Presentation, Scaling Ancestry DNA with HBase and Hadoop

Notas do Editor