Shobhna Srivastava discusses Elsevier's Research Citation network. She describes how the effort to simplify the existing data processing pipeline, reduce costs, and choose the right solution to the problem opened the door to other potential use cases and innovation. Graph technology was applied to the scientific research domain to enhance content discovery.
2. Context
■ Elsevier is a global information & analytics business specializing in science & health
■ Scopus – “Expertly curated abstract & citation database”
■ https://www.scopus.com/
4. Problem definition
Does not support changes or enriching documents with new data points
The processing is fragile
The solution is costly
Hardware used
•90-node Solr indexing cluster (separate from the live search cluster)
•Redshift
•EC2 instances for processing
Old document enrichment pipeline
•An index is created in Solr
•Redshift is updated from Solr
•New counts are then calculated and diffed against the old Solr index
•The updates are then applied to the Solr index
•Finally, the live Solr cluster is updated
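The diff step in the old pipeline can be pictured with a small sketch. This is only an illustration of the idea (compare new counts with the previous ones and push only the changes); the function name, document IDs, and counts are hypothetical and not taken from the actual Solr/Redshift jobs.

```python
def diff_counts(old_counts, new_counts):
    """Return only the documents whose count changed, mapped to the new value."""
    changed = {}
    for doc_id, new_value in new_counts.items():
        # A document missing from the old index counts as changed.
        if old_counts.get(doc_id) != new_value:
            changed[doc_id] = new_value
    return changed


# Hypothetical citation counts before and after recalculation.
old_counts = {"doc-1": 10, "doc-2": 4, "doc-3": 0}
new_counts = {"doc-1": 12, "doc-2": 4, "doc-3": 0, "doc-4": 1}

updates = diff_counts(old_counts, new_counts)
print(updates)
```

Only the changed documents would then need to be written back to the index, which is why every stage of the old pipeline depends on the previous one completing correctly.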
5. Bounded context
Runtime system – performance is important
Aware of starting node or nodes
Depth first or breadth first traversal
Metrics generation
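A rough sketch of what "start from a known node, traverse, generate a metric" could look like. It assumes a plain in-memory adjacency map and a breadth-first walk with a depth limit; the `cited_by` data, the function name, and the metric are hypothetical and do not reflect the production system.

```python
from collections import deque


def citing_documents(cited_by, start, max_depth):
    """Breadth-first walk from `start`; returns {doc_id: depth} for reached documents."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in cited_by.get(node, ()):
            if neighbour not in depth:  # visit each document once
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    del depth[start]  # the starting document is not part of the result
    return depth


# Hypothetical graph: key is a document, value lists the documents citing it.
cited_by = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

reached = citing_documents(cited_by, "A", max_depth=2)
direct_citations = sum(1 for d in reached.values() if d == 1)
print(reached, direct_citations)
```

Swapping the queue for a stack would give a depth-first walk instead; which order suits a given metric depends on the query.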
6. Why graph?
Classic multi-level graph traversals
Many-to-many relations on input data
Non-trivial & multi-level joins
Most enrichment is done on relationships and how data are connected to each other
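To make the "many-to-many relations" and "multi-level joins" points concrete, here is a minimal sketch of a citation count rolled up to organisation level. The data model (documents, authors, organisations) and all names are assumptions for illustration only; in a relational store each hop below would be a join, whereas in a graph it is a traversal along edges.

```python
from collections import defaultdict

# Hypothetical many-to-many edges.
authored_by = [("doc-1", "alice"), ("doc-1", "bob"), ("doc-2", "bob")]
affiliated_with = [("alice", "org-x"), ("alice", "org-y"), ("bob", "org-y")]
cites = [("doc-3", "doc-1"), ("doc-4", "doc-1"), ("doc-4", "doc-2")]


def citations_per_organisation(authored_by, affiliated_with, cites):
    """Count citations received per organisation, via document -> author -> organisation."""
    orgs_of_author = defaultdict(set)
    for author, org in affiliated_with:
        orgs_of_author[author].add(org)

    # A set per document, so an organisation with several authors on one
    # document is credited once per citation.
    orgs_of_doc = defaultdict(set)
    for doc, author in authored_by:
        orgs_of_doc[doc] |= orgs_of_author[author]

    counts = defaultdict(int)
    for _citing, cited in cites:
        for org in orgs_of_doc[cited]:
            counts[org] += 1
    return dict(counts)


rollup = citations_per_organisation(authored_by, affiliated_with, cites)
print(rollup)
```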
7. Technology choice
Neo4j vs Neptune, by criterion
Meets QPS
•Neo4j: ✓
•Neptune: ⚠ Much slower with queries that require longer traversals (e.g. "rolled up" queries per organisation count: 7 ms on Neo4j vs 7 seconds on Neptune)
Scalability
•Neo4j: ⚠ Tested with a graph size that fits into cache; with a larger graph some smarter caching should be implemented
•Neptune: ⚠ Works fast on larger instances (supposedly because of the cache size), so with a larger graph some application-level optimisations might be required. A bit trickier than Neo4j because cache settings are not visible/configurable
Indexing
•Neo4j: ✓
•Neptune: ⚠ Indexes are not configurable
Transaction management
•Neo4j: ✓
•Neptune: ⚠ Every traversal is a single transaction; manual commit/rollback is not supported
Ease of cluster management
•Neo4j: ✓ Out-of-the-box clustering with an enterprise licence; unless the enterprise licence is purchased, clustering and data replication have to be handled by us
•Neptune: ✓ Easy out-of-the-box data replication, immediate consistency
Cost
•Neo4j: 2 r4.4xlarge instances + LB ~ 1800 USD/month
•Neptune: 2 r4.4xlarge instances + 250 GB storage (estimated based on test data) ~ 2015 USD/month + 0.2 USD per 1 million I/O requests (1,600 million requests made during testing alone)
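The Neptune cost has a fixed part and a usage-based part, so a quick calculation helps when comparing it with Neo4j. The sketch below uses only the figures quoted on the slide; the function name is illustrative, and the monthly I/O volume is left as an input because the slide only gives the request count seen during testing, not a monthly production figure.

```python
NEO4J_MONTHLY_USD = 1800           # 2 r4.4xlarge instances + LB
NEPTUNE_BASE_MONTHLY_USD = 2015    # 2 r4.4xlarge instances + 250 GB storage
NEPTUNE_USD_PER_MILLION_IO = 0.2   # usage-based I/O charge


def neptune_monthly_cost(io_requests_millions):
    """Fixed monthly cost plus the I/O charge for the given volume (in millions)."""
    return NEPTUNE_BASE_MONTHLY_USD + NEPTUNE_USD_PER_MILLION_IO * io_requests_millions


# I/O charge for the 1,600 million requests made during testing.
testing_io_cost = NEPTUNE_USD_PER_MILLION_IO * 1600
print(round(testing_io_cost, 2))  # about 320 USD

# If a month of production traffic matched the testing volume (an assumption),
# Neptune would come to roughly 2335 USD against 1800 USD for Neo4j.
print(round(neptune_monthly_cost(1600), 2), NEO4J_MONTHLY_USD)
```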