ADV Slides: Graph Databases on the Edge

Graph Databases on the Edge
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” Onalytica
President, McKnight Consulting Group
An Inc. 5000 Company in 2018 and 2017
@williammcknight
www.mcknightcg.com
(214) 514-1444
Second Thursday of Every Month, at 2:00 ET

AnzoGraph DB
Triplestore with labeled properties
Built for diverse data harmonization and
analytics at scale (trillions of triples and more)
Graph analytics like PageRank and
shortest path
BI-style analytics like graph views, named
queries, aggregates, built-in data science
functions
Inferencing and ontology native support

[Figure: key-value illustration with redacted entries; legible labels: KEY, VALUE, Order, Product, Location 1]

[Figure: fraud graph; legible labels: Used_with, Receive_payment, Sets_Up, Involved in Prior Fraud Cases]

Customers Achieve Sustainable Competitive
Advantage By Adopting Graph Databases
New Products & Services Leveraging
Data Relationships
• First to market, up and running in
days, not weeks or months
• Reduced churn, increasing
engagement and uncovering fraud
• Achieved new company vision
centered around Business Graph
• Leapfrogged the competition with a
360-degree view of the customer
Reimagine Existing Applications, and Innovate
with Data Relationships
• Kept the business running when data growth
threatened to stop it
• Drastically reduced project complexity and
risk
• Increased revenue and delighted
customers by improving user experience
• Brought new offering to market to
compete with Amazon Prime & Fresh,
and Google Express

Graph Growth Ahead
“The application of graph processing and graph DBMSs will
grow at 100 percent annually through 2022 to continuously
accelerate data preparation and enable more complex and
adaptive data science.”
“Graph analytics will grow in the next few years due to the
need to ask complex questions across complex data, which is
not always practical or even possible at scale using SQL
queries.”
Source: February 2019 press release by Gartner, https://www.gartner.com/en/newsroom/press-releases/2019-02-18-gartner-identifies-top-10-data-and-analytics-technolo

What Can Be Vertices?
• Things
– Bank accounts
– Customer accounts
• Mobile phones
– Products
– Trading networks, auctions
– Water, power, gas grids
– Disease, drugs, molecules
• Interactions, transmission
– Insurance policies
– Machines, servers, URLs
– Sensor networks
• People
– Customers, families
– Employees
– Affinity groups, clubs
• Politics, causes, doctors
• Professionals (LinkedIn)
– Companies, institutions
• Places
– Map locations
• Cities, landmarks
– Retail stores
– Houses or buildings
– Communication networks
– Transportation hubs
• Airports, shipping lanes, etc.

What Can Be Edges?
• People
– Relationships
– Ideas, preferences
– Email, phone calls, SMS, IM
– Collaborations
• Places
– Roads, routes, railways
– Water, power, gas,
pipelines, telephone lines
– Anything with GPS
coordinates
• Things
– Events
– Money, transactions
– Purchases
– Pressure, temperature
– Diseases
– Contraband
– URLs
– Phone calls
– Citations
– Weights, scores
– Timestamps

Social Network “path exists” Performance
• Experiment:
• 1000 persons
• Average 50 friends per person
• pathExists(a,b) limited to depth 4

Database              # persons   Query time
Relational database   1000        2000 ms
Graph db              1000        2 ms
Graph db              1000000     2 ms
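
A depth-limited path check like pathExists(a,b) can be sketched as a breadth-first search over an adjacency map. This is only an illustration, not the code used in the experiment: the function name `path_exists`, the `friends` dictionary, and the sample data are assumptions.

```python
from collections import deque

def path_exists(friends, a, b, max_depth=4):
    """Return True if b can be reached from a in at most max_depth hops."""
    if a == b:
        return True
    seen = {a}
    frontier = deque([(a, 0)])  # (person, hops taken so far)
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for friend in friends.get(person, ()):
            if friend == b:
                return True
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    return False

# Hypothetical friendship chain: a - b - c - d - e - f
friends = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}
print(path_exists(friends, "a", "e"))  # e is 4 hops away
print(path_exists(friends, "a", "f"))  # f is 5 hops away
```

The search only ever touches vertices within four hops of the start, which is why the cost depends on the neighborhood size rather than on the total number of persons.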

Healthcare Fraud
[Figure label: Excessive relationships]
• Monitor drugs and
treatments
– Excessive prescribers
– Excessive consumers
• Patients connected to
– Doctors, pharmacies
• Use Graph Access
– Find outliers and investigate
– Find X actual frauds
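
One simple way to look for "excessive" prescribers is to count the distinct patients connected to each doctor and flag doctors far above the typical count. The sketch below assumes a list of (doctor, patient) edges and an arbitrary cutoff of three times the median; the names, the data, and the cutoff are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def excessive_prescribers(prescriptions, factor=3.0):
    """Flag doctors whose distinct-patient count exceeds factor * median."""
    patients_by_doctor = defaultdict(set)
    for doctor, patient in prescriptions:
        patients_by_doctor[doctor].add(patient)
    degree = {d: len(p) for d, p in patients_by_doctor.items()}
    typical = median(degree.values())
    return sorted(d for d, n in degree.items() if n > factor * typical)

# Hypothetical prescription edges (doctor, patient)
prescriptions = (
    [("dr_a", f"p{i}") for i in range(2)]
    + [("dr_b", f"p{i}") for i in range(2)]
    + [("dr_c", f"p{i}") for i in range(3)]
    + [("dr_x", f"p{i}") for i in range(10)]
)
print(excessive_prescribers(prescriptions))
```

The flagged vertices are candidates for investigation only; whether they are actual fraud still has to be checked case by case.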

Relational DBs Can’t Handle Data
Relationships Well
• Cannot model or store data and
relationships without complexity
• Performance degrades with number
and levels of relationships, and
database size
• Query complexity grows with need for
JOINs
• Adding new types of data and
relationships requires schema redesign,
increasing time to market
Slow development
Poor performance
Low scalability
Hard to maintain
… making traditional databases inappropriate
when data relationships are valuable in real-time

Use the Right Database for the Right Job
Graph Databases are designed for data relationships
[Figure: spectrum from Discrete Data (minimally connected data) to Connected Data (focused on data relationships), with Other NoSQL, Relational DBMS, and Graph DB placed along it]
Development Benefits
Model maintenance
Deployment Benefits
Performance
Minimal resource usage

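PageRank
[Figure: pages A, B, C, and D each start with rank 1.0. Each link passes on 0.85 of its source page's rank, split across that page's outgoing links: two links carry 1*0.85/2 and three carry 1*0.85.]
New rank = sum of inputs + 0.15
http://www.whitelines.nl/html/google-page-rank.html (see spreadsheet)
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm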

PageRank: After 1st Results
Page   Contributions                                        Total
A      +0.15, Page C +0.85                                  1.00
B      +0.150, page A +0.425                                0.575
C      +0.150, page D +0.850, page B +0.850, page A +0.425  2.275
D      +0.150                                               0.150
[Figure: link weights unchanged: 1*0.85/2, 1*0.85/2, 1*0.85, 1*0.85, 1*0.85]
http://www.whitelines.nl/html/google-page-rank.html (see spreadsheet)

PageRank Iterations
End of iteration A result B result C result D result
1 1.000 0.575 2.275 0.150
2 2.084 0.575 1.191 0.150
3 1.163 1.036 1.652 0.150
4 1.554 0.644 1.652 0.150
5 1.554 0.810 1.485 0.150
6 1.413 0.810 1.627 0.150
7 1.533 0.750 1.567 0.150
8 1.482 0.801 1.567 0.150
9 1.482 0.780 1.588 0.150
10 1.500 0.780 1.570 0.150
11 1.485 0.788 1.578 0.150
12 1.491 0.781 1.578 0.150
13 1.491 0.784 1.575 0.150
14 1.489 0.784 1.577 0.150
15 1.491 0.783 1.576 0.150
16 1.490 0.784 1.576 0.150
17 1.490 0.783 1.577 0.150
18 1.490 0.783 1.576 0.150
19 1.490 0.783 1.577 0.150
20 1.490 0.783 1.577 0.150
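
The iteration table can be reproduced with a few lines of Python. The link structure below (A links to B and C, B to C, C to A, D to C) is inferred from the contributions listed on the slides, so treat it as an assumption; the update rule is the one shown there, new rank = 0.15 + sum of inputs, with all ranks updated from the previous iteration's values.

```python
def pagerank_step(ranks, out_links, damping=0.85):
    """One synchronous update: every page starts at 1 - damping, then
    receives damping * rank / out-degree from each page linking to it."""
    new_ranks = {page: 1.0 - damping for page in ranks}
    for page, targets in out_links.items():
        share = damping * ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks

# Link structure inferred from the slides (assumption)
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = {page: 1.0 for page in "ABCD"}

history = []
for _ in range(20):
    ranks = pagerank_step(ranks, out_links)
    history.append(ranks)

for i, r in enumerate(history, start=1):
    print(i, " ".join(f"{r[p]:.3f}" for p in "ABCD"))
```

Page D has no incoming links, so it stays at the 0.15 floor in every iteration.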

PageRank: 20 Iterations Until Convergence
[Figure: Page A 1.49, Page B 0.78, Page C 1.58, Page D 0.15]
Most important
web page
Page C
increases page A
importance

Betweenness
• Find bridges across different communities
• High score = edge links different
communities
[Figure: two communities joined by a single edge; both endpoints are labeled "Bridge vertex"]
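
As a rough sketch of how an edge betweenness score can be computed, the code below counts, for every edge of an unweighted, undirected graph, how many shortest paths between vertex pairs run through it (Brandes-style accumulation). The function name and the two-triangle sample graph are assumptions made for illustration.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for source in adj:
        dist = {source: 0}
        paths = defaultdict(float)   # number of shortest paths from source
        paths[source] = 1.0
        preds = defaultdict(list)    # predecessors on shortest paths
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        carried = defaultdict(float)
        for w in reversed(order):    # farthest vertices first
            for v in preds[w]:
                c = paths[v] / paths[w] * (1.0 + carried[w])
                score[frozenset((v, w))] += c
                carried[v] += c
    # Each vertex pair was counted once from each endpoint
    return {edge: value / 2.0 for edge, value in score.items()}

# Two triangles (1,2,3) and (4,5,6) joined by the edge 3-4
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
       4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
scores = edge_betweenness(adj)
bridge = max(scores, key=scores.get)
print(sorted(bridge), scores[bridge])
```

All nine cross-community vertex pairs must use the edge 3-4, so it receives the highest score.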

Closeness
• Based on the shortest paths between any two vertices
• High score = vertex is a short distance from all other vertices
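
A minimal sketch of closeness for an unweighted graph, using the common definition (number of reachable vertices divided by the sum of shortest-path distances to them). The function name and the three-vertex sample graph are assumptions.

```python
from collections import deque

def closeness(adj, v):
    """Closeness of v: reachable vertices / sum of shortest-path distances."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical path graph: a - b - c
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(adj, "b"), closeness(adj, "a"))
```

The middle vertex b is one hop from everyone and gets the maximum score of 1.0; the end vertices score lower.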

Eigen Centrality
• Measures the importance of a vertex by
the importance of its neighbors
[Figure: a vertex whose neighbors are each labeled "important" is labeled "must be important"]
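
The idea that a vertex is important when its neighbors are important can be sketched with power iteration: each vertex repeatedly takes its own score plus the sum of its neighbors' scores, and the scores are rescaled after every round. Adding the vertex's own score is a design choice here to avoid oscillation on bipartite graphs. The function name and sample graph are assumptions.

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration on (A + I); scores are rescaled so the maximum is 1."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        top = max(nxt.values())
        score = {v: s / top for v, s in nxt.items()}
    return score

# Hypothetical graph: triangle a-b-c with an extra vertex d attached to a
adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
scores = eigenvector_centrality(adj)
print({v: round(s, 3) for v, s in scores.items()})
```

Vertex d has only one neighbor, but because that neighbor is the best-connected vertex, d still receives a non-trivial score.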

Clustering Coefficient: Cascading Churn
If two people churn,
what is the likelihood
others will?
The two churners affect
the central influencer
Finally: All contacts churn.
Individual-focused model
underestimates churn by 6X.
SELECT *
FROM LocalClusteringCoefficient(
ON Calls as edges
PARTITION BY caller_from
ON caller_from as vertices
PARTITION BY caller_id
targetKey('caller_to')
directed('f')
degreeRange('[3:]')
accumulate('personId')
);
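
For readers without access to that SQL function, the local clustering coefficient itself is easy to sketch: for a vertex with k neighbors, it is the fraction of the k(k-1)/2 possible neighbor pairs that are themselves connected. This is a generic undirected version, not a reimplementation of the query above; the names and call data are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of v's neighbor pairs that are directly connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical undirected call graph (who has called whom)
adj = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}
for person in sorted(adj):
    print(person, round(local_clustering(adj, person), 3))
```

A high coefficient means a person's contacts also talk to each other, which is the kind of tight group where churn can cascade.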

Loopy Belief Propagation
• Loopy belief propagation works by peer pressure
– Node X gets a final belief value by listening to
its neighbors
– Known node values propagate through
the graph
• Adjacent nodes send messages saying
“update your beliefs”
– Based on priors, conditional probabilities, and
evidence
• Keep passing messages until a stable belief
state is reached
See https://www.ics.uci.edu/~welling/teaching/ICS279/GBP-vision.pdf
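
The message-passing loop can be sketched for the simplest case: every node has two states, each node has a prior, and neighboring nodes prefer to agree. This is a minimal sum-product sketch under those assumptions, not the formulation from the linked paper; the `coupling` parameter, function name, and sample graph are hypothetical.

```python
def loopy_bp(adj, prior, coupling=0.8, iterations=50):
    """Sum-product message passing on a pairwise model with binary states."""
    states = (0, 1)
    # psi[x][y]: compatibility of neighboring states (agreement is favored)
    psi = [[coupling, 1.0 - coupling], [1.0 - coupling, coupling]]
    msgs = {(i, j): [0.5, 0.5] for i in adj for j in adj[i]}
    for _ in range(iterations):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for xj in states:
                total = 0.0
                for xi in states:
                    incoming = 1.0
                    for k in adj[i]:
                        if k != j:  # exclude the recipient's own message
                            incoming *= msgs[(k, i)][xi]
                    total += prior[i][xi] * psi[xi][xj] * incoming
                out.append(total)
            z = sum(out)
            new_msgs[(i, j)] = [o / z for o in out]
        msgs = new_msgs
    beliefs = {}
    for i in adj:
        b = [prior[i][x] for x in states]
        for k in adj[i]:
            b = [b[x] * msgs[(k, i)][x] for x in states]
        z = sum(b)
        beliefs[i] = [x / z for x in b]
    return beliefs

# Hypothetical triangle: a is strongly believed to be in state 1,
# b and c start with no preference
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
prior = {"a": [0.1, 0.9], "b": [0.5, 0.5], "c": [0.5, 0.5]}
beliefs = loopy_bp(adj, prior)
print({n: [round(p, 3) for p in b] for n, b in beliefs.items()})
```

A fixed iteration count stands in for a proper convergence check to keep the sketch short.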

Great Questions for Graph Databases
• In what order did a specific set of related events
happen?
• Are there patterns of events in our data that seem
to be related by time?
• How far apart in a (social or physical) network are
two “actors” and how strong is their relationship?
• What are the identifiable social groups and what are
the general patterns of such groups?
• How important is any given “actor” in any given
network and event?
• What type of messages emanate from a specific
area?

How to Identify a Graph Workload
• Workload is identified by “network,
hierarchy, tree, ancestry, structure” words
• You are planning to use relational
performance tricks
• Your queries will be about pathing
• You are limiting queries by their complexity
• You are looking for “non-obvious” patterns
in the data

Actions
Model actions depending on what you want
as vertices
(Bill)-[:SENT]->(email)-[:TO]->(Jim)
OR
(Bill)-[:EMAILED]->(Jim)
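
The two modeling options can be compared with plain edge tuples. In this sketch the detailed model keeps the email as its own vertex, and a hypothetical helper `collapse` derives the compact EMAILED edges from it; the reverse is not possible, because the compact model has dropped the email vertex.

```python
# Option 1: the email is a vertex, so it can carry its own properties
detailed = [("Bill", "SENT", "email"), ("email", "TO", "Jim")]

# Option 2: the action is a single edge between the two people
compact = [("Bill", "EMAILED", "Jim")]

def collapse(edges):
    """Derive (sender, EMAILED, recipient) edges from SENT/TO pairs."""
    sent = [(s, o) for s, p, o in edges if p == "SENT"]
    to = [(s, o) for s, p, o in edges if p == "TO"]
    return [(sender, "EMAILED", recipient)
            for sender, message in sent
            for msg, recipient in to
            if msg == message]

print(collapse(detailed))
```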

Semantic Graphs
• Subject: John R Peterson Predicate: Knows Object: Frank T
Smith
• Subject: Triple #1 Predicate: Confidence Percent Object: 70
• Subject: Triple #1 Predicate: Provenance Object: Mary L Jones
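
Statements about a statement can be sketched by giving each triple an identifier and letting other triples use that identifier as their subject. The dictionary layout and the helper `describe` are assumptions made for illustration, not a particular product's storage format.

```python
# Triple #1, keyed by its identifier
triples = {1: ("John R Peterson", "Knows", "Frank T Smith")}

# Statements whose subject is the identifier of another triple
annotations = [
    (1, "Confidence Percent", 70),
    (1, "Provenance", "Mary L Jones"),
]

def describe(triple_id):
    """Collect every annotation made about the given triple."""
    return {pred: obj for subj, pred, obj in annotations if subj == triple_id}

print(triples[1], describe(1))
```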

Triple Store
• A triple is a data entity composed of
subject-predicate-object
– “Bob is 35”
– “Bob knows Fred”
– “William likes running”
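
A toy in-memory triple store shows how such statements are queried: every position of a pattern is either a fixed value or a wildcard. The function `match` and the use of `None` as the wildcard are assumptions for this sketch, not the API of any real triple store.

```python
triples = [
    ("Bob", "is", 35),
    ("Bob", "knows", "Fred"),
    ("William", "likes", "running"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None matches anything."""
    return [
        (s, p, o) for s, p, o in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

print(match(triples, subject="Bob"))
print(match(triples, predicate="likes"))
```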

Conclusion
• Graph is a fast-growing data category
• It’s all about the Use Case; Good for Graph:
– Real-time recommendations
– Fraud detection
– Network and IT operations
– Identity and access management
– Graph-based search
– Identifying relative importance
• Reimagine your data as a graph
– The whiteboard model is the physical model
• Remember PageRank

