At the StampedeCon 2015 Big Data Conference: The global Monsanto R&D pipeline produces millions of new plant populations every year, each of which contributes to a dataset of genetic ancestry spanning several decades. Historically, the constraints of modeling and processing this data within an RDBMS have made drawing inferences from this dataset complex and computationally infeasible at large scale. Fortunately, the genetic history of any plant population forms a naturally occurring directed acyclic graph, a property that has allowed us to use graph theory to re-imagine how ancestral lineage data is modeled, stored, and queried.
In this talk we present our solutions to these problems, as realized using a graph-based approach within Neo4j. We will discuss what we learned from running Neo4j in a production setting that includes transactional and high-throughput computation, including how we transitioned from recursive JOIN queries to Cypher and the Neo4j traversal framework to take full advantage of index-free adjacency. We will discuss our approach to polyglot persistence through our use of a distributed commit log, Apache Kafka, to feed our graph store from sources of live transactional data. Finally, we will touch upon how we are using these technologies to annotate our genetic ancestry dataset with molecular genomics data in order to build a pipeline-scale genotype imputation platform with core algorithms built using Apache Spark.
2. Food is a looming issue as populations rise and farm acres shrink
By 2050, the world will grow by 2 billion people. That’s as many people as there are currently in North and South America combined, TWICE!
Copyright 2015 Monsanto Company
3. Breeding for a Better Harvest
Approaches to make crops yield better under dwindling resources require huge advances in breeding
[Figure labels: FEED, FOOD, 10K YEARS]
9. Our reads do not scale…
[Chart: Response (s), axis ticks 0 to 30]
At a depth of 15, we killed the query at 1.5 hours
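As a rough illustration of the kind of read that stops scaling, the sketch below finds ancestors with a recursive self-join, using Python's built-in sqlite3 module. The table name `cross_record`, its columns, and the plant names are hypothetical; this is not the schema or the database from the talk.

```python
import sqlite3

def ancestors_via_recursive_join(conn, plant, max_depth):
    """Return {ancestor: shallowest depth} using a recursive self-join."""
    rows = conn.execute(
        """
        WITH RECURSIVE lineage(plant, depth) AS (
            SELECT parent, 1 FROM cross_record WHERE child = ?
            UNION
            SELECT c.parent, l.depth + 1
            FROM cross_record AS c
            JOIN lineage AS l ON c.child = l.plant
            WHERE l.depth < ?
        )
        SELECT plant, MIN(depth) FROM lineage GROUP BY plant
        """,
        (plant, max_depth),
    ).fetchall()
    return dict(rows)

# Hypothetical crossing records: (child, parent).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cross_record (child TEXT, parent TEXT)")
conn.executemany(
    "INSERT INTO cross_record VALUES (?, ?)",
    [("F2", "F1a"), ("F2", "F1b"),
     ("F1a", "P1"), ("F1a", "P2"),
     ("F1b", "P2"), ("F1b", "P3")],
)
result = ancestors_via_recursive_join(conn, "F2", 15)
```

Each extra level of depth adds another round of joins against the whole table, which is why deep ancestries become expensive in this model.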
10. Database indexes do not help
Identifying each set of related materials potentially requires a full scan of an index: O(m log n)
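One reading of the O(m log n) figure is m relationship lookups that each cost an O(log n) index probe. The sketch below makes that cost visible with a sorted list standing in for a B-tree index; the rows and names are hypothetical, and the probe counter is only there for illustration.

```python
from bisect import bisect_left

# Hypothetical (child, parent) rows, kept sorted to stand in for an index.
INDEX = sorted([
    ("F2", "F1a"), ("F2", "F1b"),
    ("F1a", "P1"), ("F1a", "P2"),
    ("F1b", "P2"), ("F1b", "P3"),
])

def parents_via_index(index, child):
    """One index probe: O(log n) binary search, then read the matching rows."""
    pos = bisect_left(index, (child,))
    parents = []
    while pos < len(index) and index[pos][0] == child:
        parents.append(index[pos][1])
        pos += 1
    return parents

def ancestors_via_index(index, plant):
    """Collect all ancestors; every node visited costs another index probe."""
    seen, frontier, probes = set(), [plant], 0
    while frontier:
        current = frontier.pop()
        probes += 1
        for parent in parents_via_index(index, current):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen, probes

found, probes = ancestors_via_index(INDEX, "F2")
```

The number of probes grows with the size of the ancestry, and each probe grows with the size of the index, so the cost depends on the whole dataset rather than only on the part of the graph being read.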
11. Ask a question about an Ancestry
Can you return to me all ancestors of a given plant?
12. Index Free Adjacency (IFA)
A single index hit finds my starting point; all other relationship identification is O(1)
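The idea can be sketched in memory with nodes that hold direct references to their parents. This is an analogy only, not Neo4j's storage format; the `Plant` class, the `by_name` lookup, and the plant names are hypothetical.

```python
class Plant:
    """A node that stores direct references to its parent nodes."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # following these needs no index lookup

def all_ancestors(start):
    """Walk parent references from the starting node; return ancestor names."""
    seen = set()
    stack = list(start.parents)
    while stack:
        node = stack.pop()
        if node.name not in seen:
            seen.add(node.name)
            stack.extend(node.parents)
    return seen

p1, p2, p3 = Plant("P1"), Plant("P2"), Plant("P3")
f1a = Plant("F1a", [p1, p2])
f1b = Plant("F1b", [p2, p3])
f2 = Plant("F2", [f1a, f1b])

by_name = {"F2": f2}              # the single index hit for the starting point
ancestor_names = all_ancestors(by_name["F2"])
```

After the initial lookup, the work done is proportional to the ancestry actually visited, independent of how many other plants exist.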
13. We were looking for…
Something that can accurately represent the domain model
14. We were looking for…
Query performance to remain near constant as we ask questions about particular plants
15. We were looking for…
Something that easily lends itself to TDD
16. We were looking for…
Ideally open source with a low barrier to entry
19. Ask a question about an Ancestry
Can you return to me all ancestors of a given plant?
20. Enabling Innovation
Providing the ability to consume raw trees gives our consumers a way to leverage the power of the Graph Database on top of our ancestry grammar
The team identified a basic set of features and codified patterns to identify important features in an Ancestry, derived at query time
• Return “raw” ancestral trees to consumers
• Allow on-demand pruning of raw trees
• Promote language consistency across business consumers
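A minimal sketch of returning a raw tree and pruning it on demand might look like the following. The nested-dict shape, the `parents_of` mapping, and the `keep_expanding` callback are assumptions made for illustration, not the API described in the talk.

```python
def raw_tree(parents_of, plant):
    """Build the full ("raw") ancestral tree as nested dicts."""
    return {
        "name": plant,
        "parents": [raw_tree(parents_of, p) for p in parents_of.get(plant, [])],
    }

def prune(tree, keep_expanding, depth=0):
    """Copy the tree, cutting the branch below any node the caller rejects."""
    if not keep_expanding(tree["name"], depth):
        return {"name": tree["name"], "parents": []}
    return {
        "name": tree["name"],
        "parents": [prune(p, keep_expanding, depth + 1) for p in tree["parents"]],
    }

# Hypothetical ancestry: child -> list of parents.
parents_of = {
    "F2": ["F1a", "F1b"],
    "F1a": ["P1", "P2"],
    "F1b": ["P2", "P3"],
}

full = raw_tree(parents_of, "F2")
one_generation = prune(full, lambda name, depth: depth < 1)
```

Passing the pruning rule in as a function keeps the raw tree generic while letting each consumer decide how much of it to keep.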
22. Predefined Ancestral Milestones
Given where I am on the Earth now, where is the closest sandwich shop “X”?
The team identified a basic set of features and codified patterns to identify important features in an Ancestry, derived at query time
• Traverse raw crossing records at query time
• Derivation at query time allows patterns to more easily adapt to changes in business process
• Prevents data decay
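In the spirit of the "closest sandwich shop" question, a milestone derived at query time can be sketched as a breadth-first walk over raw crossing records that stops at the nearest record matching a pattern. The record fields (`parent`, `kind`) and the values `"selfing"` and `"cross"` are hypothetical stand-ins, not the talk's ancestry grammar.

```python
from collections import deque

def nearest_milestone(crosses, plant, is_milestone):
    """Return (ancestor, generations back) for the closest matching record, or None."""
    queue = deque([(plant, 0)])
    seen = {plant}
    while queue:
        current, depth = queue.popleft()
        for record in crosses.get(current, []):
            if is_milestone(record):
                return record["parent"], depth + 1
            if record["parent"] not in seen:
                seen.add(record["parent"])
                queue.append((record["parent"], depth + 1))
    return None

# Hypothetical raw crossing records, keyed by child.
crosses = {
    "F3": [{"parent": "F2", "kind": "selfing"}],
    "F2": [{"parent": "P1", "kind": "cross"}, {"parent": "P2", "kind": "cross"}],
}

milestone = nearest_milestone(crosses, "F3", lambda r: r["kind"] == "cross")
```

Because the pattern is a function applied while traversing, changing the definition of a milestone means changing the predicate, not rewriting stored results.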
24. Let’s ask a more complex question
Do any ancestors of a given plant show a strong resistance to a particular disease?
Who are the first of my ancestors to immigrate from Germany and Ireland to America?
25. Decorating the Ancestry
[Diagram: ancestry graph with nodes labeled G]
Genotype nodes act as simple pointers to remote systems
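One way to picture a pointer node is as a small record that names a remote system and a key in it, attached to a plant in the ancestry. In the sketch below, `GenotypeRef`, the system name `"marker-store"`, and the identifier `"g-42"` are all hypothetical; the actual genotype data would stay in the remote system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenotypeRef:
    """A pointer to genotype data held elsewhere; it carries no genotype itself."""
    system: str       # hypothetical name of the remote system
    external_id: str  # key of the record inside that system

def ancestors_with_genotype(parents_of, genotype_of, plant):
    """Return {ancestor: GenotypeRef} for every ancestor decorated with a pointer."""
    found, seen = {}, set()
    stack = list(parents_of.get(plant, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in genotype_of:
            found[node] = genotype_of[node]
        stack.extend(parents_of.get(node, []))
    return found

# Hypothetical ancestry and decoration.
parents_of = {"F2": ["F1a", "F1b"], "F1a": ["P1", "P2"], "F1b": ["P2", "P3"]}
genotype_of = {"P2": GenotypeRef("marker-store", "g-42")}

decorated = ancestors_with_genotype(parents_of, genotype_of, "F2")
```

The traversal only collects pointers; a caller would then ask the named system for the data behind each one.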
27. Architecture
Informing our Ancestry backbone of additional data that identify significant events in a line’s history allows our APIs to evolve and adapt as our agronomic practices change.
28. Take this with you…
• Untie yourself from your database indexes
• Let Neo4j do the heavy lifting
• Value added even when it is not the system of record
• Keep the storage model as close to the mental model as possible