1. Relationships Matter:
Using Connected Data for
Better Machine Learning
Spring 2021
DR. ALICIA FRAME
Director of Data Science,
Neo4j
@ODSC #Neo4j @AliciaFrame1
4. Photo by Helena Lopes on Unsplash
Network Structure
is highly predictive of
pay and promotions
• People Near Structural Holes
• Organizational Misfits
“Organizational Misfits and the Origins of Brokerage in Intrafirm Networks” A. Kleinbaum
“Structural Holes and Good Ideas” R. Burt
5. But You Can’t Analyse
What You Can’t See
• Most data science ignores
relationships
• Graphs are built using
relationships
• You don’t have to guess at
correlations; with graphs the
relationships are inherent
James Fowler
Relationships
Are the Strongest
Predictors of Behavior
David Burkus
6. Top 10 Tech Trends in Data and Analytics, 16 Feb 2021
According to Gartner, “Graphs form
the foundation of modern D&A,
with capabilities to enhance and
improve user collaboration, ML models
and explainable AI.
The recent Gartner AI in Organizations
Survey demonstrates that graph
techniques are increasingly
prevalent as AI maturity grows,
going from 13% adoption when AI
maturity is lowest to 48% when
maturity is highest.”
AI Research Papers
Featuring Graph
Source: Dimensions Knowledge System
+168k
downloads since
2019
4x
Increase in
traffic to
Neo4j GDS
page in
2H-2020
Analytics & Data Science Interest
Exploding in Neo4j Community
3x
More Data
Scientists in
Neo4j
database in
2H-2020
7. 20 of the top 25 financial firms
7 of the top 10 retailers
7 of the top 10 software vendors
Neo4j: The Graph Company
Neo4j is the creator of:
• The world’s leading graph database
• The first graph data science platform
• The most flexible graph data model
• The easiest-to-use graph query language
Thousands of Organizations Use Neo4j
Silicon Valley
London
Munich
Paris
Malmö
8. Connections in Data are as
Valuable as the Data Itself
Networks of People
Relationships: Knows
E.g., Employees, Customers, Suppliers, Partners, Influencers
Transaction Networks
Relationships: Bought, Viewed, Returned
E.g., Risk management, Supply chain, Payments
Knowledge Networks
Relationships: Plays, Lives_in, In_sport, Likes, Fan_of, Plays_for
E.g., Enterprise content, Domain specific content, eCommerce content
9. What’s a graph?
Node
● Represents an entity in the graph
● Can have labels
Relationship
● Connects nodes to each other
● Has one type
Property
● Describes a node/relationship: e.g.
name, age, weight etc
● Key-value pair: String key; typed
value (string, number, list, ...)
Labeled Property Graph
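The labeled property graph model described on this slide can be sketched in a few lines of Python. This is only an illustration of the data model, not how Neo4j stores data; the class names (Node, Relationship, PropertyGraph) and the sample people are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Node:
    node_id: int
    labels: Set[str] = field(default_factory=set)             # zero or more labels
    properties: Dict[str, Any] = field(default_factory=dict)  # string key -> typed value


@dataclass
class Relationship:
    start: int      # id of the source node
    end: int        # id of the target node
    rel_type: str   # exactly one type per relationship
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), dict(properties))

    def add_relationship(self, start, end, rel_type, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before connecting them")
        self.relationships.append(Relationship(start, end, rel_type, dict(properties)))

    def neighbors(self, node_id, rel_type=None):
        """Ids reachable over outgoing relationships, optionally filtered by type."""
        return [r.end for r in self.relationships
                if r.start == node_id and (rel_type is None or r.rel_type == rel_type)]


# Hypothetical sample data
g = PropertyGraph()
g.add_node(1, labels=["Person"], name="Amy", age=34)
g.add_node(2, labels=["Person"], name="Emil")
g.add_relationship(1, 2, "KNOWS", since=2019)
print(g.neighbors(1, "KNOWS"))
```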
10. What is Graph Data Science?
Rather than just crunching
numbers like traditional
analytics, Graph Data Science
analyzes data relationships
and structures...
... to produce answers,
insights, and predictions
11. Graphs Enhance All Phases of Data Science & AI
• Data Modeling: Integrations and Knowledge Graphs for Heuristic AI
• Analysis: Graph Analytics, Investigations and Counterfactuals
• Capitalize: Graph Feature Engineering and Graph ML
12. Query (e.g. Cypher/Python): Local Patterns
Real-time, local decisioning and pattern matching
You know what you’re looking for and making a decision
Graph Algorithms: Global Computation
Global analysis and iterations
You’re learning the overall structure of a network, updating data, and predicting
13. Better Predictions with Graphs
Using the Data You Already Have
• Current data science models ignore network structure
• Graphs add highly predictive features to ML models, increasing accuracy
• Otherwise unattainable predictions based on relationships
15. From Simple Queries to Advanced ML
Queries
Human-crafted query, human-readable result
MATCH (p1:Person)-[:ENEMY]->(:Person)<-[:ENEMY]-(p2:Person)
MERGE (p1)-[:FRIEND]->(p2)
Algorithms
Predefined formula, human-readable result
PageRank(Emil) = 13.25
PageRank(Amy) = 4.83
PageRank(Alicia) = 4.75
Embeddings
AI-learned formula, machine-readable result
Node2Vec(Emil) =[5.4 5.1 2.4 4.5 3.1]
Node2Vec(Amy) =[2.8 1.8 7.2 0.9 3.0]
Node2Vec(Alicia)=[1.4 5.2 4.4 3.9 3.2]
Machine Learning Workflows
Train ML models based on results
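To make the "predefined formula" idea concrete, here is a minimal PageRank sketch using plain power iteration. It is an illustration only, not the Neo4j implementation: the edge list is hypothetical, the scores are normalized to sum to 1 (so they will not match the illustrative numbers on the slide), and the damping factor of 0.85 is the conventional default.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping) / count + damping * dangling / count
                    for n in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical "who points at whom" relationships
edges = [("Amy", "Emil"), ("Alicia", "Emil"), ("Emil", "Amy")]
scores = pagerank(edges)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

The result is a single human-readable number per node, which is what distinguishes this kind of algorithm from a learned embedding.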
16. The Neo4j Graph Data Science Library
50+ robust algorithms in one flexible analytics workspace
with supervised ML workflows
Pathfinding & Search
• Deep path analytics
• Optimal routing
Centrality & Importance
• Identifies importance of distinct nodes
• Influencer & risk identification
Community Detection
• Detects group clustering
• Partition options
Similarity
• Evaluates how alike
graph nodes are
Graph Embeddings
• Learns from structural information
• Reduces dimensionality for ML
Link Prediction
• Estimates likelihood of forming relationship
• Estimate missing information
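As one small example of the similarity category, the sketch below scores how alike two nodes are by comparing their neighbor sets with the Jaccard index (shared neighbors divided by all neighbors). The purchase data and the helper name most_similar are assumptions made for illustration.

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard index of two neighbor sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(node, neighborhoods):
    """Return the other node whose neighbor set overlaps most with `node`."""
    others = [n for n in neighborhoods if n != node]
    return max(others, key=lambda n: jaccard(neighborhoods[node], neighborhoods[n]))


# Hypothetical (person)-[:BOUGHT]->(product) neighborhoods
purchases = {
    "Amy": {"book", "laptop", "phone"},
    "Emil": {"laptop", "phone", "camera"},
    "Alicia": {"bike"},
}
print(jaccard(purchases["Amy"], purchases["Emil"]))
print(most_similar("Amy", purchases))
```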
17. What Are Graph Embeddings?
A graph embedding is a way of representing each node in your graph as a fixed-length vector.
• Preserves key features
• Reduces dimensionality
• Can be decoded
Different techniques may represent different aspects of a graph, and may use different approaches to learn that representation.
18. Node2Vec and FastRP
Node2Vec
Random walk through the graph to sample nodes and their properties
• Easy to understand
• Lots of examples
• Interpretable parameters
Just tell it how far to walk
FastRP
Project a similarity matrix into lower dimensional space with matrix math
• Up to 75,000x faster than Node2Vec
• Equivalent accuracy when tuned
• Flexible parameters for tuning
Both output fixed-length embedding vectors
Both must be rerun when new data is added
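The random-walk sampling step can be sketched as follows. This is a simplified illustration: the walks are uniform (unbiased), whereas Node2Vec biases each step with its return and in-out parameters and then trains a skip-gram model on the walks, which is omitted here. The adjacency data and function name are assumptions.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=42):
    """Sample fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:      # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical undirected graph stored as adjacency lists
adjacency = {"Emil": ["Amy", "Alicia"], "Amy": ["Emil"], "Alicia": ["Emil"]}
walks = random_walks(adjacency, walk_length=5, walks_per_node=2)
for walk in walks:
    print(" -> ".join(walk))
```

Each walk acts like a "sentence" of node names; nodes that co-occur in many walks end up with similar vectors once an embedding model is trained on them.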
19. GraphSAGE
• Assumes nodes in the same neighborhood have similar representations
• Uses node properties in addition to relationships
• Inductive approach that learns a function to calculate an embedding
[Figure: Sample → Aggregate → Predict]
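The sample and aggregate steps can be illustrated with a stripped-down mean aggregator. This sketch assumes a single layer with no learned weights or nonlinearity, so it shows only the shape of the computation, not GraphSAGE training; the feature values and function name are hypothetical.

```python
import random


def sage_mean_layer(features, adjacency, sample_size, seed=0):
    """One sample-and-aggregate step: concatenate each node's own features
    with the mean of the features of a fixed-size sample of its neighbors."""
    rng = random.Random(seed)
    output = {}
    for node, own in features.items():
        neighbors = adjacency.get(node, [])
        sampled = rng.sample(neighbors, min(sample_size, len(neighbors)))
        if sampled:
            mean = [sum(features[n][i] for n in sampled) / len(sampled)
                    for i in range(len(own))]
        else:
            mean = [0.0] * len(own)   # isolated node: nothing to aggregate
        output[node] = list(own) + mean
    return output


# Hypothetical node properties and neighborhoods
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
hidden = sage_mean_layer(features, adjacency, sample_size=2)
print(hidden["a"])
```

Because the output depends only on a node's properties and its sampled neighborhood, the same function can be applied to nodes that were not present when it was defined, which is the inductive property the slide refers to.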
20. Graph Embeddings for
Feature Engineering
Rather than running multiple algorithms to describe specific aspects
of your graph topology, embeddings learn a unique representation
of what’s important for your graph and your problem, letting you use
graph structure as a predictor.
Financial Transaction Data
21. Our Secret Sauce: The Graph Catalog
A graph-specific analytics workspace that’s mutable – integrated
with a native-graph database
• Neo4j automates data transformations
• Fast iterations & layering
• Production ready features, parallelization & enterprise support
[Figure: Mutable In-Memory Workspace, Computational Graph, Native Graph Store]
22. Machine Learning
Unsupervised ML (data is not labeled at all)
• Unlabelled data
• Data driven
• Pattern identification
• Clustering: divide by similarity (e.g., stack similar clothing)
• Association: identify sequences (e.g., find clothes often worn together)
• Dimension Reduction (generalization): find hidden dependencies (e.g., make best outfit from given clothes)
Supervised ML (data is labeled or categorized)
• Labeled data
• Task driven
• Value predicting
• Classification: predict a category (e.g., predict if an outfit is fancy or casual)
• Regression: predict a number (e.g., predict the length of a sock)
23. Machine Learning
Unsupervised ML (data is not labeled at all)
• Unlabelled data
• Data driven
• Pattern identification
• Graph Algorithms: Community Detection, Centrality, Embeddings, Similarity, Pathfinding
• Clustering: divide by similarity
• Association: identify sequences
• Dimension Reduction (generalization): find hidden dependencies
Questions graph algorithms answer:
• Which parts of my graph are more connected?
• Which nodes are most similar?
• How important is each node?
Supervised ML (data is labeled or categorized)
• Labeled data
• Task driven
• Value predicting
• Classification: predict a category (e.g., predict if an outfit is fancy or casual)
• Regression: predict a number (e.g., predict the length of a sock)
24. Supervised Machine Learning
Uses labeled data to learn a function to map input data onto outputs.
That model can then make predictions on new data.
How do you measure if it’s any good? Hold back some labeled data and measure accuracy.
[Figure: labeled data (cat, dog) → model training → prediction on new data → output (“It’s a cat!”)]
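The hold-out idea on this slide can be sketched with a deliberately tiny model. The example below assumes a one-feature nearest-centroid classifier and made-up cat and dog weights; it is meant only to show the split, train, predict, and measure steps, not a recommended model.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=7):
    """Shuffle labeled rows and hold back a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(train):
    """'Training': compute the mean feature value for each label."""
    totals, counts = {}, {}
    for value, label in train:
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


def predict(centroids, value):
    """Predict the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def accuracy(centroids, test):
    """Fraction of held-back rows whose label is predicted correctly."""
    hits = sum(1 for value, label in test if predict(centroids, value) == label)
    return hits / len(test)


# Hypothetical labeled data: (weight in kg, label)
rows = [(3.5, "cat"), (4.2, "cat"), (5.0, "cat"), (4.8, "cat"),
        (18.0, "dog"), (25.0, "dog"), (30.0, "dog"), (22.0, "dog")]
train, test = train_test_split(rows)
model = fit_centroids(train)
score = accuracy(model, test)
print(len(train), len(test), score)
```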
25. Types of Supervised ML in Neo4j
Node
classification:
“What kind of
node is this?”
Link prediction:
“Should there be a
relationship between
these nodes?”
Labeled data: Pairs of nodes
that are either linked or not
Features: Pre-existing
attributes, algorithms
(PageRank), embeddings
26. Train a Node Classification Model in Neo4j
Node classification:
Predicting a node label or (categorical) property
1. Load your in-memory graph with labels & features
2. Use nodeClassification.train: specify the property you want to predict and the features for making that prediction
3. The predictive model appears in the model catalog, ready to apply to new data
Neo4j Automates the Tricky Parts:
1. Splits data for train & test
2. Builds logistic regression models using the training data & specified parameters to predict the correct label
3. Evaluates the accuracy of the models using the test data
4. Returns the best performing model
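The automated steps listed on this slide (build logistic regression models, evaluate them on test data, keep the best one) can be approximated in plain Python. This is a sketch under stated assumptions, not the Neo4j procedure: the two node features stand in for something like a centrality score and an embedding dimension, the data is made up, and the only parameter varied is the learning rate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate, epochs=300):
    """Fit logistic regression with stochastic gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def classify(model, x):
    """Predict class 1 when the linear score is non-negative, else class 0."""
    weights, bias = model
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias >= 0 else 0


def accuracy(model, rows, labels):
    return sum(classify(model, x) == y for x, y in zip(rows, labels)) / len(rows)


# Hypothetical node features and a binary node property to predict
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = [1, 1, 0, 0]
test_x = [[0.95, 0.85], [0.05, 0.15]]
test_y = [1, 0]

# Train one model per candidate parameter setting, keep the best on test data.
candidates = [0.05, 0.5]
models = [train_logistic(train_x, train_y, lr) for lr in candidates]
best_model = max(models, key=lambda m: accuracy(m, test_x, test_y))
best_accuracy = accuracy(best_model, test_x, test_y)
print(best_accuracy)
```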
27. Train a Link Prediction Model in Neo4j
Link Prediction:
Predicting unobserved edges or relationships that will form in the future
1. Load your in-memory graph with labels & features
2. Split your graph into train & test with splitRelationships.mutate
3. Use linkPrediction.train
4. The predictive model appears in the model catalog, ready to apply to new data
Neo4j Automates the Tricky Parts:
1. Builds logistic regression models using the training data & specified parameters to predict the correct label
2. Evaluates the accuracy of the models using the test data
3. Returns the best performing model
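Preparing data for link prediction can be sketched as three steps: hold back some relationships, sample unlinked node pairs as negative examples, and turn each pair into a feature vector. The code below is an illustration under assumptions: the edges are made up, the embeddings for Emil, Amy, and Alicia reuse the illustrative Node2Vec values from the earlier slide while those for Bob and Carol are invented, and combining two embeddings by element-wise product is one common choice rather than a statement of what Neo4j does.

```python
import random
from itertools import combinations


def split_relationships(edges, test_fraction=0.2, seed=1):
    """Hold back a fraction of relationships for testing."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def sample_negatives(nodes, edges, count, seed=1):
    """Pick node pairs that are NOT linked, to serve as negative examples."""
    existing = {frozenset(edge) for edge in edges}
    candidates = [pair for pair in combinations(sorted(nodes), 2)
                  if frozenset(pair) not in existing]
    return random.Random(seed).sample(candidates, min(count, len(candidates)))


def pair_features(embeddings, a, b):
    """Combine two node embeddings into one pair feature (element-wise product)."""
    return [x * y for x, y in zip(embeddings[a], embeddings[b])]


embeddings = {
    "Emil": [5.4, 5.1, 2.4, 4.5, 3.1],
    "Amy": [2.8, 1.8, 7.2, 0.9, 3.0],
    "Alicia": [1.4, 5.2, 4.4, 3.9, 3.2],
    "Bob": [0.5, 2.2, 6.1, 1.0, 2.7],    # hypothetical
    "Carol": [4.9, 4.0, 1.9, 3.8, 2.5],  # hypothetical
}
edges = [("Emil", "Amy"), ("Emil", "Alicia"), ("Amy", "Bob"),
         ("Bob", "Carol"), ("Carol", "Emil")]

train_edges, test_edges = split_relationships(edges)
negatives = sample_negatives(embeddings, edges, count=len(train_edges))
rows = ([(pair_features(embeddings, a, b), 1) for a, b in train_edges] +
        [(pair_features(embeddings, a, b), 0) for a, b in negatives])
print(len(train_edges), len(test_edges), len(rows))
```

The labeled rows can then be fed to any binary classifier, such as logistic regression, and the held-back relationships used to check how well it ranks real links above non-links.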
28. Machine Learning Models in Neo4j
Train a model to make predictions on unseen parts of the graph,
or for new data
Not a data model — a predictive model
Models live in the Neo4j analytics
workspace in a model catalog
• Contains versioning information
What data was this model trained on?
• Time stamps
• Model names
• As of GDS 1.5 models can be
published, stored, and loaded from disk.
ML Models in the
Analytics Workspace
30. From Simple to Highly Sophisticated Data Science
Graphs Accelerate Innovation
[Chart: Analysis Complexity (up to High) vs. Analysis Repeatability (Simple, Ad Hoc to Full Production)]
• Analytics to improve reliability by predicting problems in a supply-chain knowledge graph
• R&D: Better health outcomes through machine learning on patient journeys
• Fraud Detection with graph feature engineering + AutoML
Chart labels: Analytics, Data Science, FinServ Customers
31. Graph Analytics: Improving Reliability
Medical device manufacturer with
10.74B annual revenue
Manufacture products like pacemakers,
stents and heart valves, all the way
through diagnostic tests. Integrated
development, design, manufacture, and
sales.
Neo4j GDS for supply chain & issues prediction
Simple data model: parts, finished product, and failures
• Knowledge Graph to support robust queries
• Centrality algorithms to rank nodes based on their proximity to
failures, similarity to find vulnerable components
• Creating new data from connections in Neo4j
Challenge: Predicting and preventing failures
• Integrated supply chain: from raw materials to complex devices
• Inconsistent analysis, unable to pinpoint cause of failures
32. Graph R&D: Improving Patient Outcomes
Global pharmaceutical company with
$22.1 billion revenue
Focus on oncology, cardiovascular,
renal, metabolism, & respiratory
Neo4j GDS to map & predict patient journeys
• 3 years of visits, tests & diagnoses with tens of billions of records
• Knowledge Graph, graph queries, algorithms and
traditional ML approaches
• Extracted paths to train embeddings to predict successful
interventions
Challenge: Better intervention for complex diseases
• Complex diseases develop over years with many touch points
• How can we intervene faster & improve outcomes?
33. Production Graph ML: Build Better Models
Neo4j GDS for Feature Engineering + AutoML
• Data science platforms help commoditize data science: build scalable, repeatable, and
deployable data science tools
• Embedding tuning is a major focus & challenge
• Feature generation as input for AutoML pipeline
Challenge: Adding predictive relationships to production ML
• Every percentage point of model accuracy matters
• Graphs are powerful in R&D & PoC models - but putting into production is challenging
Several FinServ
customers
34. Neo4j Graph Data
Science Library
Neo4j
Database
Neo4j
Bloom
Scalable Graph Algorithms &
Analytics Workspace
Native Graph Creation &
Persistence
Visual Graph
Exploration & Prototyping
50+ Graph
Algorithms
Graph-Native
ML
Data Scientist
Friendly
Neo4j Graph Data Science Framework
35. Neo4j Graph Data Science
50+ Graph Algorithms
More supported algorithms
than any other vendor
Graph-Native ML
Only commercial offering
with full graph ML workflows
Humane Experience
Automatic transformation
from storage to analytics and
visualization
Scalable Data Science
Algorithms running over tens of
billions of nodes in production
Extensible
Integrate with other data
sources and ML platforms
Strongest Community
220K+ practitioners
72K+ meetups
36. Get Started:
- Sandbox: https://neo4j.com/sandbox/
- Guides: neo4j.com/developer/graph-data-science/
- GitHub: github.com/neo4j/graph-data-science
Books
- O’Reilly Book on Graph Algorithms
neo4j.com/graph-algorithms-book/
- Graph Data Science For Dummies:
neo4j.com/graph-data-science-for-dummies