Morpheus
SQL and Cypher® in Apache® Spark
Extending Apache Spark Graph for the Enterprise
with Morpheus and Neo4j
Martin Junghanns
Software Engineer
Graph Analytics Team
Graphs in Analytics Workloads
Property graphs and the Cypher query language
Graphs are everywhere
… and growing
https://db-engines.com/en/ranking_categories
… and coming to Apache Spark 3.0
Property graphs
An intuitive data model for connected data
Property Graphs
Node
● Represents an entity within the graph
● Can have labels
Relationship
● Connects a start node with an end node
● Has one type
Property
● Describes a node/relationship: e.g. name, age, weight etc
● Key-value pair: String key; typed value (string, number, bool, list, ...)
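The model above can be sketched in a few lines of plain Scala. This is only an illustration of the concepts (nodes with labels, relationships with exactly one type, key-value properties); the names `PropertyGraphModel`, `Node`, `Relationship` and `Graph` are hypothetical and are not the Morpheus or Neo4j types.

```scala
object PropertyGraphModel {
  // Property values are kept as Any for brevity (string, number, bool, list, ...).
  type Properties = Map[String, Any]

  // A node represents an entity and can have any number of labels.
  final case class Node(id: Long, labels: Set[String], properties: Properties)

  // A relationship connects a start node with an end node and has exactly one type.
  final case class Relationship(
      id: Long,
      source: Long,
      target: Long,
      relType: String,
      properties: Properties
  )

  final case class Graph(nodes: Seq[Node], relationships: Seq[Relationship]) {
    // All nodes carrying the given label.
    def nodesWithLabel(label: String): Seq[Node] =
      nodes.filter(_.labels.contains(label))

    // End nodes reachable from `start` over one relationship of the given type.
    def neighbours(start: Long, relType: String): Seq[Node] = {
      val targets = relationships.collect {
        case r if r.source == start && r.relType == relType => r.target
      }.toSet
      nodes.filter(n => targets.contains(n.id))
    }
  }

  // Illustrative data: a captain who commands a ship.
  val captain = Node(0L, Set("Captain", "Person"), Map("name" -> "Morpheus"))
  val ship = Node(1L, Set("Ship"), Map("name" -> "Nebuchadnezzar"))
  val commands = Relationship(0L, captain.id, ship.id, "COMMANDS", Map.empty)
  val example = Graph(Seq(captain, ship), Seq(commands))
}
```

For instance, `PropertyGraphModel.example.neighbours(0L, "COMMANDS")` returns the ship node.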
Graph Patterns with Cypher
The OLTP / OLAP landscape
                               Tables                           Graphs
Transactional                  PostgreSQL, Oracle, SQLServer    Neo4j
Data Integration & Analytics   Spark SQL                        Morpheus
Graphs in Spark and Neo4j
Spark is an immutable data processing engine
• Spark graphs are compositions of tables (DFs)
• Spark graphs can be transformed and combined
• Functions (including queries) over multiple graphs
• Cypher query plans mapped to Catalyst
Neo4j is a native transactional CRUD database
• Neo4j graphs use a native graph data representation
• Neo4j has optimized in-process MT graph algos
• Morpheus helps move data in and out of Neo4j
Morpheus: SQL + Cypher in one session
Graphs and tables are both useful data models
• Finding paths and subgraphs, and transforming graphs
• Viewing, aggregating and ordering values
The Morpheus project parallels Spark SQL
• PropertyGraph type (composed of DataFrames)
• Catalog of graph data sources, named graphs, and views
• Cypher query language
A CypherSession adds graphs to a SparkSession
What is Morpheus used for?
• Data integration
• Integrate (non-)graphy data from multiple, heterogeneous data sources
into one or more property graphs
• Distributed Cypher execution
• OLAP-style graph analytics
• Data science
• Integration with other Spark libraries
• Feature extraction using Neo4j Graph Algorithms
Graph Algorithms
Pathnding
& Search
Centrality /
Importance
Community
Detection
Link
Prediction
Finds optimal paths
or evaluates route
availability and quality
Determines the
importance of distinct
nodes in the network
Detects group
clustering or partition
options
Evaluates how
alike nodes are
Estimates the likelihood
of nodes forming a
future relationship
Similarity
Morpheus creates Spark Graphs ...
[Diagram: Morpheus builds a property graph by composing DataFrames. Sources are tables (Hive, DataFrames, JDBC) and stored subgraphs (file-system snapshots).]
… wrangles Spark Graphs ...
[Diagram: Cypher queries take a property graph, optionally with a DataFrame as driving table, and return either a table result (DataFrame) or a property graph result that further Cypher queries can consume.]
… analyses graphs in Spark and Neo4j ...
[Diagram: property graphs and DataFrames are passed to graph algorithms and analysis toolsets, which return DataFrames and property graphs.]
… and stores Spark Graphs
[Diagram: Morpheus stores a property graph or subgraph, for example as a file-system snapshot.]
Morpheus Architecture
Mapping Cypher to the Spark SQL API
High-level architecture
[Diagram components: Morpheus Query Engine, Property Graph Data Sources, Property Graph Catalog, Scala API, SQL, JDBC]
Query engine architecture
Example query:
MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
WHERE c.name = 'Morpheus'
RETURN c.name, s.name
Layers, from query text down to execution:
openCypher Frontend
● Parsing, Rewriting, Normalization
● Semantic Analysis (Scoping, Typing, etc.)
Morpheus
● Data Import and Export
● Schema and Type handling
● Query translation to Spark operations
Morpheus: Intermediate Language
● Backend Agnostic Query Representation
● Conversion and typing of Frontend expressions
Morpheus: Logical Planning
● Translation into Logical Operators
● Basic Logical Optimization
Morpheus: Relational Planning
● Translation into Relational Operations on abstract tables
● Column layout computation
Morpheus: Spark Backend
● Spark-specific table implementation
Spark SQL
● Rule- and Cost-based query optimization via Catalyst
Spark Core
● Distributed execution
“Tables for Labels”
• In Morpheus, PropertyGraphs are represented by
• Node Tables and Relationship Tables
• Tables are represented by DataFrames
• Require a fixed schema
• Property Graphs have a Graph Type
• Node and relationship types that occur in the graph
• Node and relationship properties and their data type
A Property Graph consists of Node Tables, Relationship Tables, and a Graph Type.
Example graph: (:Captain:Person {name: 'Morpheus'})-[:COMMANDS]->(:Ship {name: 'Nebuchadnezzar'})
Node table :Captain:Person
id  name
0   Morpheus
Node table :Ship
id  name
1   Nebuchadnezzar
Relationship table :COMMANDS
id  source  target
0   0       1
Graph Type {
  :Captain:Person (
    name: STRING
  ),
  :Ship (
    name: STRING
  ),
  :COMMANDS
}
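The "Tables for Labels" idea can be illustrated without Spark: group nodes by their exact label combination and give each group a table with a fixed column list. The sketch below uses plain Scala collections in place of DataFrames; `TablesForLabels`, `NodeTable` and `toNodeTables` are hypothetical names, not Morpheus API, and the real implementation may derive tables differently.

```scala
object TablesForLabels {
  final case class Node(id: Long, labels: Set[String], properties: Map[String, Any])

  // One table per distinct label combination, with a fixed column list.
  final case class NodeTable(labels: Set[String], columns: Seq[String], rows: Seq[Seq[Any]])

  def toNodeTables(nodes: Seq[Node]): Seq[NodeTable] = {
    val byLabels = nodes.groupBy(_.labels).toSeq
    byLabels
      .sortBy { case (labels, _) => labels.toSeq.sorted.mkString(":") }
      .map { case (labels, group) =>
        // Fixed schema: the union of property keys seen for this label combination.
        val propertyKeys = group.flatMap(_.properties.keys).distinct.sorted
        val rows = group.sortBy(_.id).map { n =>
          // A missing property becomes null, as in a nullable table column.
          (n.id: Any) +: propertyKeys.map(k => n.properties.getOrElse(k, null))
        }
        NodeTable(labels, "id" +: propertyKeys, rows)
      }
  }

  // The two nodes from the example above.
  val sampleNodes: Seq[Node] = Seq(
    Node(0L, Set("Captain", "Person"), Map("name" -> "Morpheus")),
    Node(1L, Set("Ship"), Map("name" -> "Nebuchadnezzar"))
  )
}
```

Calling `TablesForLabels.toNodeTables(TablesForLabels.sampleNodes)` yields one table for `:Captain:Person` and one for `:Ship`, each with the columns `id` and `name`.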
Query engine architecture
MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
WHERE c.name = 'Morpheus'
RETURN c.name, s.name
[Diagram: Morpheus relational planning turns the query into a plan of projections (π) and joins (⋈) over the tables of the property graph.]
Cypher®
An open language for graph querying
Cypher query language
Cypher 9 is the latest full version of openCypher
• Implemented in Neo4j 3.5
• Includes date/time types and functions
• Implemented in whole/part by six other vendors
• Several other partial and research implementations
• Cypher for Gremlin is another openCypher project
Cypher 9 support in Morpheus
Cypher is a full CRUD language ← OLTP database
• RETURNs only tabular results: not composable
• Results can include graph elements (paths, relationships, nodes) or
property values
Morpheus implements most of read-only Cypher
• No MERGE or DELETE
• Spark immutable data + transformations
Cypher 10 in Morpheus - Multiple graphs
Cypher 10 proposes Multiple Graph features
• Multiple Graph CIP: https://git.io/fjmrx
Allows for Cypher Query composition
• Similar to chaining transformations on DataFrames
Support Graph Catalog for managing Graphs
• Analogous to Spark SQL catalog
Query support for Graph Construction
Returning tabular data
Input: a property graph
Output: a table
FROM GRAPH socialNetwork
MATCH ({name: 'Dan'})-[:FRIEND*2]->(foaf)
RETURN toUpper(foaf.name) AS name
ORDER BY name DESC
Language features available in Morpheus
Constructing graphs
Input: a property graph
Output: a property graph
FROM GRAPH socialNetwork
MATCH (p:Person)-[:FRIEND*2]->(foaf)
WHERE NOT (p)-[:FRIEND]->(foaf)
CONSTRUCT
CREATE (p)-[:POSSIBLE_FRIEND]->(foaf)
RETURN GRAPH
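To make the matching logic of this query concrete, here is a minimal sketch over a plain set of directed FRIEND edges. It only mirrors the pattern (two hops, no direct FRIEND relationship) and is not how Morpheus executes Cypher. `PossibleFriends` and the sample names are made up, and, unlike CREATE, which produces one relationship per matched row, the result is a set, so duplicate paths collapse into one edge.

```scala
object PossibleFriends {
  // A directed FRIEND edge as (start, end); names stand in for node identities.
  type Edge = (String, String)

  def possibleFriends(friends: Set[Edge]): Set[Edge] = {
    val twoHops = for {
      (p, mid) <- friends
      (mid2, foaf) <- friends
      // Second hop must continue from the first, and the same edge must not be used twice.
      if mid == mid2 && (p, mid) != (mid2, foaf)
    } yield (p, foaf)
    // WHERE NOT (p)-[:FRIEND]->(foaf): drop pairs that are already direct friends.
    twoHops.filterNot(friends.contains)
  }

  val sample: Set[Edge] = Set(
    ("Dan", "Ann"),
    ("Ann", "Bob"),
    ("Ann", "Cat"),
    ("Dan", "Cat")
  )
}
```

With the sample edges, Dan reaches Bob and Cat in two hops, but Cat is already a direct friend, so only `("Dan", "Bob")` remains.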
Querying multiple graphs
Input: property graphs
Output: a property graph
FROM GRAPH socialNetwork
MATCH (p:Person)
FROM GRAPH products
MATCH (c:Customer)
WHERE p.email = c.email
CONSTRUCT ON socialNetwork, products
CREATE (p)-[:IS]->(c)
RETURN GRAPH
Creating graph views
Input: property graphs
Output: a property graph
CATALOG CREATE VIEW youngFriends($inGraph){
FROM GRAPH $inGraph
MATCH (p1:Person)-[r]->(p2:Person)
WHERE p1.age < 25 AND p2.age < 25
CONSTRUCT
CREATE (p1)-[COPY OF r]->(p2)
RETURN GRAPH
}
Using graph views
Input: property graphs
Output: table or graph
FROM youngFriends(socialNetwork)
MATCH (p:Person)-[r]->(o)
RETURN p, r, o
// and views over views
FROM youngFriends(europe(socialNetwork))
MATCH ...
Demo: Shaping data into graphs
Creating Property Graphs from DataFrames
Demo Big Picture
Part 1: From JSON to Graph
Create persistent Property Graph from raw Yelp dataset
• Read Yelp Data from JSON into DataFrames
• Create Property Graph from DataFrames
• Store Property Graph using Parquet
Part 2: A library of Graphs
Create a library of graph projections
• Read Property Graph from Parquet
• Create subgraph for a specific city
• Project and persist city subgraph
Part 3: Federated queries
Integrate reviews with social network data
• Define Graph Type and Mapping with Graph DDL
• Load data from Hive and H2
• Run analytical query on the integrated graph
Part 4: Neo4j Integration I
Find trending businesses
• Load graph projections from library
• Write graphs to Neo4j and run PageRank
• Combine graphs in Morpheus and select trending businesses
Part 5: Neo4j Integration II
Recommend businesses to users
• Load graph projections from library
• Write graphs to Neo4j, run Louvain + Jaccard
• Run analytical query in Morpheus to find recommendations
https://git.io/fjZ2b
The Yelp Open Dataset
• Yelp is a search service based on crowd-sourced reviews about local
businesses
• The Yelp Open Dataset is part of the Yelp Dataset Challenge
• Yelp's effort to encourage researchers to explore the dataset
• ~150K businesses, 10M users, 5M reviews, 35M friendships
https://www.yelp.com
https://www.yelp.com/dataset
https://www.yelp.com/dataset/challenge
The Yelp Open Dataset
:Business
name : ACME
address : 123 ACME Rd.
city : San Jose
state : CA
:User
name : Alice
since : 2013
elite : [2014, 2016]
:User
name : Bob
since : 2014
elite : null
:REVIEWS
stars : 5
date : 2014-02-03
:REVIEWS
stars : 4
date : 2014-08-03
Part 1: From JSON to Graph
Input: business.json, user.json, review.json
1. Create Node and Relationship Tables
2. Create Property Graph
3. Store Property Graph
https://git.io/fjZ2N
From DataFrame to NodeTable
// (:User)
val userDataFrame = spark.read.json(...).select(...)
val userNodeTable = MorpheusElementTable.create(
  NodeMappingBuilder.on("id")
    .withImpliedLabel("User")
    .withPropertyKey("name")
    .withPropertyKey("yelping_since")
    .withPropertyKey("elite")
    .build,
  userDataFrame)

id  name   yelping_since  elite
0   Alice  2013           [2014, 2016]
1   Bob    2014           null
Managing multiple graphs
The property graph catalog and graph data sources
Managing multiple graphs
• Property Graphs are managed within a catalog
Cypher Session
Property Graph Catalog
Property Graph Data Source <namespace>
Property Graph <name>
QualiedGraphName = <namespace>.<name>
Cypher Session
• API to operate with the query engine and the catalog
trait CypherSession {
  def cypher(
    query: String,
    parameters: CypherMap = CypherMap.empty,
    drivingTable: Option[CypherRecords] = None
  ): Result
  def catalog: PropertyGraphCatalog
}
Property Graph Catalog
• API to manage multiple Property Graphs
• Catalog functions can be executed via Cypher or Scala API
trait PropertyGraphCatalog {
  def register(namespace: Namespace, dataSource: PropertyGraphDataSource): Unit
  def store(qualifiedGraphName: QualifiedGraphName, graph: PropertyGraph): Unit
  def graph(qualifiedGraphName: QualifiedGraphName): PropertyGraph
  def drop(qualifiedGraphName: QualifiedGraphName): Unit
  // additional methods for managing views, listing namespaces and graphs
}
Property Graph Data Source (PGDS)
• API for loading and saving property graphs
trait PropertyGraphDataSource {
  def hasGraph(name: GraphName): Boolean
  def graph(name: GraphName): PropertyGraph
  def schema(name: GraphName): Option[Schema]
  // additional methods for storing, deleting, listing graphs
}
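How a catalog resolves a qualified graph name to a data source and then to a graph can be shown with a toy in-memory version. This is a sketch under stated assumptions and not the Morpheus implementation. `ToyCatalog` and its members are hypothetical, graphs are reduced to a description string, and the name is assumed to split at the first dot.

```scala
import scala.collection.mutable

object ToyCatalog {
  final case class QualifiedGraphName(namespace: String, name: String)

  object QualifiedGraphName {
    // Splits "<namespace>.<name>" at the first dot (an assumption of this sketch).
    def parse(s: String): QualifiedGraphName = {
      val idx = s.indexOf('.')
      require(idx > 0 && idx < s.length - 1, s"expected <namespace>.<name>, got: $s")
      QualifiedGraphName(s.substring(0, idx), s.substring(idx + 1))
    }
  }

  // Stand-in for a property graph; only a description is kept.
  final case class Graph(description: String)

  // Stand-in for a property graph data source: a named collection of graphs.
  final class DataSource {
    private val graphs = mutable.Map.empty[String, Graph]
    def hasGraph(name: String): Boolean = graphs.contains(name)
    def graph(name: String): Graph = graphs(name)
    def store(name: String, g: Graph): Unit = graphs.update(name, g)
  }

  // The catalog maps each namespace to one data source and delegates to it.
  final class Catalog {
    private val sources = mutable.Map.empty[String, DataSource]
    def register(namespace: String, ds: DataSource): Unit = sources.update(namespace, ds)
    def store(qgn: QualifiedGraphName, g: Graph): Unit = sources(qgn.namespace).store(qgn.name, g)
    def graph(qgn: QualifiedGraphName): Graph = sources(qgn.namespace).graph(qgn.name)
  }
}
```

A lookup such as `catalog.graph(QualifiedGraphName.parse("social-net.US"))` first selects the data source registered under `social-net` and then asks it for the graph named `US`.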
PGDS implementations in Morpheus
PGDS                                               Multiple graphs  Read graphs  Write graphs
File-based (Parquet, ORC, CSV; HDFS, local, S3)    Yes              Yes          Yes
SQL (Hive, JDBC)                                   Yes              Yes          No
Neo4j Bolt                                         Yes              Yes          Yes
Neo4j Bulk Import                                  No               No           Yes
Catalog operations via Cypher
Cypher Session
Property Graph Catalog
Property Graph Data Source <namespace>
Property Graph <name>
QualiedGraphName = <namespace>.<name>
Read from single Property Graph
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US” (Property Graph)
FROM social-net.US
MATCH (p:Person)
RETURN p
Read from multiple Property Graphs
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US”
“EU”
“products” (SQL PGDS)
“2018”
“2017”
FROM social-net.US
MATCH (p:Person)
FROM products.2018
MATCH (c:Customer)
WHERE p.email = c.email
RETURN p, c
Construct new Property Graphs
CATALOG CREATE GRAPH social-net.US_new {
FROM social-net.US
MATCH (p:Person)
FROM products.2018
MATCH (c:Customer)
WHERE p.email = c.email
CONSTRUCT ON social-net.US
CREATE (p)-[:SAME_AS]->(c)
RETURN GRAPH
}
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US”
“EU”
“products” (SQL PGDS)
“2018”
“2017”
“US_new”
Create and query Graph Views
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US”
“EU”
...
CATALOG CREATE VIEW youngPeople($sn) {
FROM $sn
MATCH (p:Person)-[r]->(n)
WHERE p.age < 21
CONSTRUCT
CREATE (p)-[COPY OF r]->(n)
RETURN GRAPH
}
FROM youngPeople(social-net.US)
MATCH (p:Person)
RETURN p
“youngPeople”
Views
2015 - 2018
Part 2: Building a library of graphs
https://git.io/fjZ25
Boulder City
(:User)-[:CO_REVIEWS]->(:User)
(:User)-[:REVIEWS]->(:Business)
(:User)-[:CO_REVIEWS]->(:User)
Construct graphs for each year
Extract Yelp subgraph for specific city
(:Business)-[:CO_REVIEWED]->(:Business)
Turning SQL tables into graphs
Property graph schema and mappings
PGDS on Steroids: The SQL PGDS
[Diagram: Spark SQL data sources (JDBC, Hive, Oracle, SQL Server, ORC, Parquet) expose tables and views. Graph DDL describes a Graph Type (element types, node types, relationship types) and a Graph Instance (table mappings). The SQL Property Graph Data Source uses both to turn SQL tables into a Property Graph made of node tables, relationship tables, and a graph type.]
Reminder: The Yelp Open Dataset
:Business
name : ACME
address : 123 ACME Rd.
city : San Jose
state : CA
:User
name : Alice
since : 2013
elite : [2014, 2016]
email : alice@yelp.com
:User
name : Bob
since : 2014
elite : null
email : bob@yelp.com
:REVIEWS
stars : 5
date : 2014-02-03
:REVIEWS
stars : 4
date : 2014-08-03
The Yelp Friendships (YelpBook)
:User
email: alice@yelp.com
:User
email : bob@yelp.com
:FRIEND
Part 3: Integrating Yelp and YelpBook
Yelp Reviews
Yelp Book
Graph DDL
+
SQL PGDS
(:User)-[:REVIEWS]->(:Business)
(:User)-[:FRIEND]->(:User)
https://git.io/fjZ2p
Graph DDL: Graph Type definition
CREATE GRAPH TYPE yelp (
  -- Element types (concepts used to describe a graph)
  User ( name STRING, since DATE ),
  Business ( name STRING, city STRING ),
  REVIEWS ( stars INTEGER, date LOCALDATETIME ),
  FRIEND,
  -- Node types
  (User),
  (Business),
  -- Relationship types
  (User)-[REVIEWS]->(Business),
  (User)-[FRIEND]->(User)
)
Graph DDL: Graph Instance definition
CREATE GRAPH yelp_and_yelpBook OF yelp (
  -- Node type mappings
  (User) FROM HIVE.yelp.user,
  (Business) FROM HIVE.yelp.business,
  -- Relationship type mappings
  (User)-[REVIEWS]->(Business) FROM HIVE.yelp.review e
    START NODES (User) FROM HIVE.yelp.user n JOIN e.user_email = n.email
    END NODES (Business) FROM HIVE.yelp.business n JOIN e.business_id = n.business_id,
  (User)-[FRIEND]->(User) FROM H2.yelpbook.friend e
    START NODES (User) FROM HIVE.yelp.user n JOIN e.user1_email = n.email
    END NODES (User) FROM HIVE.yelp.user n JOIN e.user2_email = n.email
)
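The START NODES and END NODES clauses describe joins from a relationship table to node tables. A rough sketch of that idea for the FRIEND mapping, using plain Scala collections instead of Hive and H2 tables, could look as follows. The row types, the `RelationshipMapping` object, and the choice of using a row's position as its node id are all assumptions made for illustration; Morpheus generates ids and plans these joins in its own way.

```scala
object RelationshipMapping {
  // Hypothetical row shapes for the user table and the friend table.
  final case class UserRow(email: String, name: String)
  final case class FriendRow(user1Email: String, user2Email: String)
  final case class Relationship(source: Long, target: Long, relType: String)

  def friendRelationships(users: Seq[UserRow], friends: Seq[FriendRow]): Seq[Relationship] = {
    // Assumption: a node id is simply the row's position in the user table.
    val idByEmail: Map[String, Long] =
      users.zipWithIndex.map { case (u, i) => u.email -> i.toLong }.toMap
    for {
      f <- friends
      // JOIN e.user1_email = n.email (start node)
      source <- idByEmail.get(f.user1Email)
      // JOIN e.user2_email = n.email (end node)
      target <- idByEmail.get(f.user2Email)
    } yield Relationship(source, target, "FRIEND")
  }

  val users = Seq(UserRow("alice@yelp.com", "Alice"), UserRow("bob@yelp.com", "Bob"))
  val friends = Seq(
    FriendRow("alice@yelp.com", "bob@yelp.com"),
    FriendRow("alice@yelp.com", "carol@yelp.com") // no matching user row, so it is dropped
  )
}
```

Friend rows whose emails do not match any user row produce no relationship, which is the behaviour of an inner join.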
Data Science with Graphs
Integrating Neo4j native graph algorithms
Spark and Neo4j Platforms
Coming in Spark 3.0
Native graph algorithms
The Neo4j database and Neo4j Graph Algorithms library
Neo4j Graph Algorithms
Pathfinding & Search
• Parallel Breadth First Search*
• Parallel Depth First Search
• Shortest Path*
• Single-Source Shortest Path
• All Pairs Shortest Path
• Minimum Spanning Tree
• A* Shortest Path
• Yen’s K Shortest Path
• K-Spanning Tree (MST)
• Random Walk
Centrality / Importance
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev, Wasserman & Faust
• Betweenness Centrality
• Approximate Betweenness Centrality
• PageRank*
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
Community Detection
• Triangle Count*
• Clustering Coefficients
• Connected Components (Union Find)*
• Strongly Connected Components*
• Label Propagation*
• Louvain Modularity – 1 Step & Multi-Step
• Balanced Triad (identification)
Similarity
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Overlap Similarity
• Pearson Similarity
Link Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
* Available in GraphFrames
Documentation: neo4j.com/docs/graph-algorithms/current/
Free O’Reilly Book: neo4j.com/graph-algorithms-book
• Spark & Neo4j Examples
• Machine Learning Chapter
Simple Data Science workflows
From Spark to Neo4j to Spark
PageRank Algorithm
• Use when
• Anytime you’re looking for broad influence over a network
• Many domain specific variations for differing analysis, e.g. Personalized
PageRank for personalized recommendations
• Examples:
• Twitter Recommendations
• Fraud Detection
Part 4: Yelp trending businesses
• Write the 2017 and 2018 graphs of (:Business)-[:CO_REVIEWED]->(:Business) to Neo4j
• call algo.pagerank on each graph
• Join (⋈) the 2017 and 2018 results per business
• trendRank = pageRank_2018 - pageRank_2017
https://git.io/fjZ2j
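The trendRank computation is a join of two per-year score tables followed by a subtraction. A minimal sketch with plain Scala maps in place of DataFrames is shown below; `TrendRank`, the business ids, and the scores are made up for illustration, and the join is assumed to be an inner join.

```scala
object TrendRank {
  // Inner join on business id: only businesses scored in both years get a trend value.
  def trendRank(
      rank2017: Map[String, Double],
      rank2018: Map[String, Double]
  ): Map[String, Double] =
    rank2018.collect {
      case (business, r18) if rank2017.contains(business) =>
        business -> (r18 - rank2017(business))
    }

  // Business ids ordered by descending trendRank, limited to the first n.
  def topTrending(trend: Map[String, Double], n: Int): Seq[String] =
    trend.toSeq.sortBy { case (_, delta) => -delta }.take(n).map(_._1)
}
```

A business whose PageRank rose between the two years gets a positive trendRank and sorts to the front of `topTrending`.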
Louvain Modularity
• Use when
• Community Detection in large networks
• Uncover hierarchical structures in data
• Examples
• Money Laundering
• Protein-Protein-Interactions
Jaccard Similarity
• Use when
• Computing pair-wise similarities
• Accommodates vectors of different lengths
• Examples
• Recommendations
• Disambiguation
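Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below computes it for sets of reviewed business ids. It is a plain Scala illustration, not the Neo4j `algo.jaccard` procedure, and returning 0.0 for two empty sets is a convention chosen here.

```scala
object Jaccard {
  // |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty (a convention of this sketch).
  def similarity[A](a: Set[A], b: Set[A]): Double = {
    val unionSize = (a | b).size
    if (unionSize == 0) 0.0 else (a & b).size.toDouble / unionSize
  }
}
```

Two users who reviewed `{b1, b2, b3}` and `{b2, b3, b4}` share two of four distinct businesses, so their similarity is 0.5. The sets may differ in size, which is why the measure suits comparing users with different numbers of reviews.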
Part 5: Community-centric Recommendation
• call algo.louvain on the 2017 graph (:User)-[:CO_REVIEWS]->(:User): compute communities based on co-reviews
• For each community, call algo.jaccard on (:User)-[:REVIEWS]->(:Business): compute similarity based on overlapping reviewed businesses, stored as :IS_SIMILAR
• Recommend businesses similar users have reviewed
https://git.io/fjZaU
Spark 3.0 Graph
Spark Graph SPIP uses DataFrame-based Property Graphs
Spark Project Improvement Proposal
• SPARK-25994 Spark Graph for Apache Spark 3.0
• Property Graphs, Cypher Queries, and Algorithms
• Defines a Cypher-compatible Property Graph type based on
DataFrames
• Replaces GraphFrames querying with Cypher
• Reimplements GraphFrames/GraphX algos on the Property Graph
type
SPIP: What are we trying to do?
• “Spark Cypher”
• Run a Cypher 9 query on a Property Graph returning a tabular result
• Migrate GraphFrames to Spark Graph
• Implementation is based on Spark SQL
• Property Graphs are composed of one or more DFs
• Provide Scala, Python and Java APIs
SPIP: What are we not solving?
• Addresses the Cypher Property Graph Model
• Does not deal with variants of that model (e.g. RDF)
• No Cypher 10 multiple graph features
• API is flexible to support this in future iterations
• No Property Graph Catalog
• Also no Property Graph Data Sources
Try it out and get involved
[SPARK-27299][GRAPH][WIP] Spark Graph API design proposal (GraphExamplesSuite.scala)
test("create PropertyGraph from Node- and RelationshipFrames") {
  val nodeData: DataFrame = spark.createDataFrame(Seq(0 -> "Alice", 1 -> "Bob")).toDF("id", "name")
  val relationshipData: DataFrame = spark.createDataFrame(Seq((0, 0, 1))).toDF("id", "source", "target")
  val nodeFrame: NodeFrame = NodeFrame(nodeData, "id", Set("Person"))
  val relationshipFrame: RelationshipFrame = RelationshipFrame(relationshipData, "id", "source", "target", "KNOWS")
  val graph: PropertyGraph = cypherSession.createGraph(Seq(nodeFrame), Seq(relationshipFrame))
  val result: CypherResult = graph.cypher(
    """
      |MATCH (a:Person)-[r:KNOWS]->(:Person)
      |RETURN a, r""".stripMargin)
  result.df.show()
}
https://git.io/fjqp6
Morpheus + Cypher in Spark 3.0
Morpheus will be plug-compatible with Cypher in Spark 3.0
Morpheus and Spark Graph: API compatibility
[Diagram: module stacks. SPIP (Spark Graph): spark-graph-api, spark-cypher, spark-sql. openCypher: morpheus, okapi (Cypher to relational operators compiler), spark-sql.]
Q&A
Thanks for listening

Mais conteĂşdo relacionado

Mais procurados

Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Databricks
 
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Lucidworks
 

Mais procurados (20)

Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your DataBuild Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
 
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa...
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for Graphs
 
Practical Graph Algorithms with Neo4j
Practical Graph Algorithms with Neo4jPractical Graph Algorithms with Neo4j
Practical Graph Algorithms with Neo4j
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
MCA and MyMobileBristol @ osjmob11
MCA and MyMobileBristol @ osjmob11MCA and MyMobileBristol @ osjmob11
MCA and MyMobileBristol @ osjmob11
 
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
 

Semelhante a Morpheus - SQL and Cypher in Apache Spark

GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 

Semelhante a Morpheus - SQL and Cypher in Apache Spark (20)

GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Cypher and apache spark multiple graphs and more in open cypher
Cypher and apache spark  multiple graphs and more in  open cypherCypher and apache spark  multiple graphs and more in  open cypher
Cypher and apache spark multiple graphs and more in open cypher
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
LD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and toolsLD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and tools
 
GraphX: Graph analytics for insights about developer communities

Morpheus - SQL and Cypher in Apache Spark

  • 1. Morpheus: SQL and Cypher® in Apache® Spark. Extending Apache Spark Graph for the Enterprise with Morpheus and Neo4j. Martin Junghanns, Software Engineer, Graph Analytics Team
  • 2. Graphs in Analytics Workloads Property graphs and the Cypher query language
  • 5. … and coming to Apache Spark 3.0
  • 6. Property graphs An intuitive data model for connected data
  • 7. Property Graphs Node ● Represents an entity within the graph ● Can have labels Relationship ● Connects a start node with an end node ● Has one type Property ● Describes a node/relationship: e.g. name, age, weight etc ● Key-value pair: String key; typed value (string, number, bool, list, ...)
  • 9. The OLTP / OLAP landscape
                                        Tables                          Graphs
      Transactional                     PostgreSQL, Oracle, SQLServer   Neo4j
      Data Integration & Analytics      Spark SQL                       Morpheus
  • 10. Graphs in Spark and Neo4j Spark is an immutable data processing engine • Spark graphs are compositions of tables (DFs) • Spark graphs can be transformed and combined • Functions (including queries) over multiple graphs • Cypher query plans mapped to Catalyst Neo4j is a native transactional CRUD database • Neo4j graphs use a native graph data representation • Neo4j has optimized in-process MT graph algos • Morpheus helps move data in and out of Neo4j
  • 11. Morpheus: SQL + Cypher in one session Graphs and tables are both useful data models • Finding paths and subgraphs, and transforming graphs • Viewing, aggregating and ordering values The Morpheus project parallels Spark SQL • PropertyGraph type (composed of DataFrames) • Catalog of graph data sources, named graphs, views, • Cypher query language A CypherSession adds graphs to a SparkSession
  • 12. What is Morpheus used for? • Data integration • Integrate (non-)graphy data from multiple, heterogeneous data sources into one or more property graphs • Distributed Cypher execution • OLAP-style graph analytics • Data science • Integration with other Spark libraries • Feature extraction using Neo4j Graph Algorithms
  • 13. Graph Algorithms
      ● Pathfinding & Search: Finds optimal paths or evaluates route availability and quality
      ● Centrality / Importance: Determines the importance of distinct nodes in the network
      ● Community Detection: Detects group clustering or partition options
      ● Similarity: Evaluates how alike nodes are
      ● Link Prediction: Estimates the likelihood of nodes forming a future relationship
  • 14. Morpheus creates Spark Graphs ... [diagram: Morpheus builds a PROPERTY GRAPH by composing DataFrames; SOURCES: TABLES (Hive, DF, JDBC), SUBGRAPH (FS snapshot)]
  • 15. … wrangles Spark Graphs ... DataFrame Table Result Cypher QUERY Property Graph Result Property Graph Cypher QUERY Cypher QUERY Property Graph Result DataFrame Driving Table
  • 16. … analyses graphs in Spark and Neo4j ... GRAPH ALGOS ANALYSIS toolsets DataFrame DataFrame Property Graph Property Graph
  • 17. … and stores Spark Graphs Morpheus STORE SUBGRAPH FS snapshot Property Graph
  • 18. Morpheus Architecture Mapping Cypher to the Spark SQL API
  • 19. High-level architecture: Morpheus Query Engine, Property Graph Data Sources, Property Graph Catalog, Scala API, SQL, JDBC
  • 20. Query engine architecture
      Example query:
          MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
          WHERE c.name = 'Morpheus'
          RETURN c.name, s.name
      openCypher Frontend
          ● Parsing, Rewriting, Normalization
          ● Semantic Analysis (Scoping, Typing, etc.)
      Intermediate Language, Logical Planning, Relational Planning, Spark Backend
          ● Translation into Logical Operators
          ● Basic Logical Optimization
          ● Backend Agnostic Query Representation
          ● Conversion and typing of Frontend expressions
          ● Translation into Relational Operations on abstract tables
          ● Column layout computation
          ● Spark-specific table implementation
      Morpheus
          ● Data Import and Export
          ● Schema and Type handling
          ● Query translation to Spark operations
      Spark SQL
          ● Rule- and Cost-based query optimization via Catalyst
      Spark Core
          ● Distributed execution
  • 21. “Tables for Labels” • In Morpheus, PropertyGraphs are represented by • Node Tables and Relationship Tables • Tables are represented by DataFrames • Require a fixed schema • Property Graphs have a Graph Type • Node and relationship types that occur in the graph • Node and relationship properties and their data type Property Graph Node Tables Rel. Tables Graph Type
  • 22. “Tables for Labels”
      Example graph: (:Captain:Person {name: Morpheus})-[:COMMANDS]->(:Ship {name: Nebuchadnezzar})
      Node table :Captain:Person
          id  name
          0   Morpheus
      Node table :Ship
          id  name
          1   Nebuchadnezzar
      Relationship table :COMMANDS
          id  source  target
          0   0       1
      Graph Type { :Captain:Person ( name: STRING ), :Ship ( name: STRING ), :COMMANDS }
  • 23. Query engine architecture
          MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
          WHERE c.name = 'Morpheus'
          RETURN c.name, s.name
      [diagram: Morpheus Relational Planning turns the query into joins (⋈) and projections (π) over the tables of the Property Graph]
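The mapping from a Cypher pattern to joins and projections can be sketched without Spark. The following is a minimal illustration in plain Scala, assuming in-memory sequences as stand-ins for the node and relationship tables from the “Tables for Labels” slide; the names `NodeRow`, `RelRow` and `captainAndShip` are hypothetical and not part of Morpheus.

```scala
// Plain Scala stand-ins for the node and relationship tables on the
// "Tables for Labels" slide. Morpheus itself uses Spark DataFrames;
// all names here are illustrative assumptions.
object TablesForLabels {
  final case class NodeRow(id: Long, name: String)
  final case class RelRow(id: Long, source: Long, target: Long)

  val captains: Seq[NodeRow] = Seq(NodeRow(0L, "Morpheus"))       // :Captain:Person
  val ships: Seq[NodeRow]    = Seq(NodeRow(1L, "Nebuchadnezzar")) // :Ship
  val commands: Seq[RelRow]  = Seq(RelRow(0L, 0L, 1L))            // :COMMANDS

  // MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
  // WHERE c.name = <captainName>
  // RETURN c.name, s.name
  def captainAndShip(captainName: String): Seq[(String, String)] =
    for {
      c <- captains if c.name == captainName // selection (WHERE)
      r <- commands if r.source == c.id      // join: node table with relationship table
      s <- ships if s.id == r.target         // join: relationship table with node table
    } yield (c.name, s.name)                 // projection (RETURN)
}
```

Calling `TablesForLabels.captainAndShip("Morpheus")` yields a single row pairing the captain with the ship. In Morpheus the same shape of plan would be handed to Spark SQL, where Catalyst decides the join order and strategy.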
  • 24. Cypher® An open language for graph querying
  • 25. Cypher query language Cypher 9 is the latest full version of openCypher • Implemented in Neo4j 3.5 • Includes date/time types and functions • Implemented in whole/part by six other vendors • Several other partial and research implementations • Cypher for Gremlin is another openCypher project
  • 26. Cypher 9 support in Morpheus Cypher is a full CRUD language ← OLTP database • RETURNs only tabular results: not composable • Results can include graph elements (paths, relationships, nodes) or property values Morpheus implements most of read-only Cypher • No MERGE or DELETE • Spark immutable data + transformations
  • 27. Cypher 10 in Morpheus - Multiple graphs Cypher 10 proposes Multiple Graph features • Multiple Graph CIP: https://git.io/fjmrx Allows for Cypher Query composition • Similar to chaining transformations on DataFrames Support Graph Catalog for managing Graphs • Analogous to Spark SQL catalog Query support for Graph Construction
  • 28. Returning tabular data (language features available in Morpheus)
      Input: a property graph. Output: a table.
          FROM GRAPH socialNetwork
          MATCH ({name: 'Dan'})-[:FRIEND*2]->(foaf)
          RETURN toUpper(foaf.name) AS name
          ORDER BY name DESC
  • 29. Constructing graphs (language features available in Morpheus)
      Input: a property graph. Output: a property graph.
          FROM GRAPH socialNetwork
          MATCH (p:Person)-[:FRIEND*2]->(foaf)
          WHERE NOT (p)-[:FRIEND]->(foaf)
          CONSTRUCT
            CREATE (p)-[:POSSIBLE_FRIEND]->(foaf)
          RETURN GRAPH
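The logic of the POSSIBLE_FRIEND query can be sketched in plain Scala. This is a simplified model, assuming the graph is a set of directed FRIEND edges between person names with no parallel edges or self-loops, so Cypher's relationship-uniqueness rule needs no extra handling; `possibleFriends` is a hypothetical helper, not a Morpheus function.

```scala
object PossibleFriends {
  // A directed FRIEND edge as a (from, to) pair of person names (assumed model).
  type Edge = (String, String)

  // For every two-hop path p -> mid -> foaf, emit (p, foaf)
  // unless p already has a direct FRIEND edge to foaf.
  def possibleFriends(friend: Set[Edge]): Set[Edge] =
    for {
      (p, mid)     <- friend
      (mid2, foaf) <- friend
      if mid == mid2 && !friend.contains((p, foaf))
    } yield (p, foaf)
}
```

For the edges Dan→Ann, Ann→Bob, Dan→Bob and Bob→Cat, the function returns Dan→Cat and Ann→Cat; Dan→Bob is skipped because that friendship already exists.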
  • 30. Querying multiple graphs (language features available in Morpheus)
      Input: property graphs. Output: a property graph.
          FROM GRAPH socialNetwork
          MATCH (p:Person)
          FROM GRAPH products
          MATCH (c:Customer)
          WHERE p.email = c.email
          CONSTRUCT ON socialNetwork, products
            CREATE (p)-[:IS]->(c)
          RETURN GRAPH
  • 31. Creating graph views (language features available in Morpheus)
      Input: property graphs. Output: a property graph.
          CATALOG CREATE VIEW youngFriends($inGraph) {
            FROM GRAPH $inGraph
            MATCH (p1:Person)-[r]->(p2:Person)
            WHERE p1.age < 25 AND p2.age < 25
            CONSTRUCT
              CREATE (p1)-[COPY OF r]->(p2)
            RETURN GRAPH
          }
  • 32. Using graph views (language features available in Morpheus)
      Input: property graphs. Output: table or graph.
          FROM youngFriends(socialNetwork)
          MATCH (p:Person)-[r]->(o)
          RETURN p, r, o

          // and views over views
          FROM youngFriends(europe(socialNetwork))
          MATCH ...
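One way to read graph views is as functions from graph to graph, so that "views over views" is ordinary function composition. The sketch below models this in plain Scala under that assumption. The `region` property and the body of `europe` are hypothetical (the slides only name the view), and the tiny `Graph` type is a stand-in for a property graph.

```scala
object GraphViews {
  // Simplified stand-ins; `region` is an assumed property used only by `europe`.
  final case class Person(name: String, age: Int, region: String)
  final case class Rel(from: Person, to: Person, relType: String)
  final case class Graph(rels: Set[Rel])

  // A view is modelled as a function from graph to graph.
  type View = Graph => Graph

  // Mirrors the youngFriends view: copy relationships whose endpoints are both under 25.
  val youngFriends: View =
    g => Graph(g.rels.filter(r => r.from.age < 25 && r.to.age < 25))

  // Hypothetical definition: keep relationships between people in the "EU" region.
  val europe: View =
    g => Graph(g.rels.filter(r => r.from.region == "EU" && r.to.region == "EU"))

  // FROM youngFriends(europe(socialNetwork)) corresponds to composing the two views.
  val youngEuropeans: View = youngFriends.compose(europe)
}
```

Because each view returns a new graph and never mutates its input, views can be nested in any order, much like chained DataFrame transformations.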
  • 33. Demo: Shaping data into graphs Creating Property Graphs from DataFrames
  • 34. Demo Big Picture https://git.io/fjZ2b
      Part 1: From JSON to Graph. Create persistent Property Graph from raw Yelp dataset
          ● Read Yelp Data from JSON into DataFrames
          ● Create Property Graph from DataFrames
          ● Store Property Graph using Parquet
      Part 2: A library of Graphs. Create a library of graph projections
          ● Read Property Graph from Parquet
          ● Create subgraph for a specific city
          ● Project and persist city subgraph
      Part 3: Federated queries. Integrate reviews with social network data
          ● Define Graph Type and Mapping with Graph DDL
          ● Load data from Hive and H2
          ● Run analytical query on the integrated graph
      Part 4: Neo4j Integration I. Find trending businesses
          ● Load graph projections from library
          ● Write graphs to Neo4j and run PageRank
          ● Combine graphs in Morpheus and select trending businesses
      Part 5: Neo4j Integration II. Recommend businesses to users
          ● Load graph projections from library
          ● Write graphs to Neo4j, run Louvain + Jaccard
          ● Run analytical query in Morpheus to find recommendations
  • 35. The Yelp Open Dataset • Yelp is a search service based on crowd-sourced reviews about local businesses • The Yelp Open Dataset is part of the Yelp Dataset Challenge • Yelp’s effort to encourage researchers to explore the dataset • ~150K businesses, 10M users, 5M reviews, 35M friendships https://www.yelp.com https://www.yelp.com/dataset https://www.yelp.com/dataset/challenge
  • 36. The Yelp Open Dataset
      (:Business) name: ACME, address: 123 ACME Rd., city: San Jose, state: CA
      (:User) name: Alice, since: 2013, elite: [2014, 2016]
      (:User) name: Bob, since: 2014, elite: null
      [:REVIEWS] stars: 5, date: 2014-02-03
      [:REVIEWS] stars: 4, date: 2014-08-03
  • 37. Part 1: From JSON to Graph business.json user.json review.json Create Node and Relationship Tables Create Property Graph Store Property Graph https://git.io/fjZ2N
  • 38. From DataFrame to NodeTable
          // (:User)
          val userDataFrame = spark.read.json(...).select(...)
          val userNodeTable = MorpheusElementTable.create(
            NodeMappingBuilder.on("id")
              .withImpliedLabel("User")
              .withPropertyKey("name")
              .withPropertyKey("yelping_since")
              .withPropertyKey("elite")
              .build,
            userDataFrame)
      Example rows:
          id  name   yelping_since  elite
          0   Alice  2013           [2014, 2016]
          1   Bob    2014           null
  • 39. Managing multiple graphs The property graph catalog and graph data sources
  • 40. Managing multiple graphs • Property Graphs are managed within a catalog Cypher Session Property Graph Catalog Property Graph Data Source <namespace> Property Graph <name> QualifiedGraphName = <namespace>.<name>
  • 41. Cypher Session • API to operate with the query engine and the catalog
          trait CypherSession {
            def cypher(
              query: String,
              parameters: CypherMap = CypherMap.empty,
              drivingTable: Option[CypherRecords] = None
            ): Result

            def catalog: PropertyGraphCatalog
          }
  • 42. Property Graph Catalog • API to manage multiple Property Graphs • Catalog functions can be executed via Cypher or Scala API
          trait PropertyGraphCatalog {
            def register(namespace: Namespace, dataSource: PropertyGraphDataSource): Unit
            def store(qualifiedGraphName: QualifiedGraphName, graph: PropertyGraph): Unit
            def graph(qualifiedGraphName: QualifiedGraphName): PropertyGraph
            def drop(qualifiedGraphName: QualifiedGraphName): Unit
            // additional methods for managing views, listing namespaces and graphs
          }
  • 43. Property Graph Data Source (PGDS) • API for loading and saving property graphs
          trait PropertyGraphDataSource {
            def hasGraph(name: GraphName): Boolean
            def graph(name: GraphName): PropertyGraph
            def schema(name: GraphName): Option[Schema]
            // additional methods for storing, deleting, listing graphs
          }
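How a catalog resolves a qualified graph name to a data source can be sketched with a toy in-memory model. The code below is not the Morpheus API: the types are simplified stand-ins that use plain strings, and splitting the qualified name at the first dot is an assumption made for illustration.

```scala
import scala.collection.mutable

object CatalogSketch {
  // Placeholder for a property graph.
  final case class Graph(description: String)

  // Simplified stand-in for a property graph data source.
  trait DataSource {
    def hasGraph(name: String): Boolean
    def graph(name: String): Graph
    def store(name: String, graph: Graph): Unit
  }

  final class InMemorySource extends DataSource {
    private val graphs = mutable.Map.empty[String, Graph]
    def hasGraph(name: String): Boolean = graphs.contains(name)
    def graph(name: String): Graph = graphs(name)
    def store(name: String, graph: Graph): Unit = graphs.update(name, graph)
  }

  final class Catalog {
    private val sources = mutable.Map.empty[String, DataSource]

    def register(namespace: String, source: DataSource): Unit =
      sources.update(namespace, source)

    // Assumption: "<namespace>.<name>" is split at the first dot.
    private def split(qualifiedName: String): (String, String) = {
      val i = qualifiedName.indexOf('.')
      require(i > 0, s"expected <namespace>.<name>, got: $qualifiedName")
      (qualifiedName.substring(0, i), qualifiedName.substring(i + 1))
    }

    def store(qualifiedName: String, g: Graph): Unit = {
      val (namespace, name) = split(qualifiedName)
      sources(namespace).store(name, g)
    }

    def graph(qualifiedName: String): Graph = {
      val (namespace, name) = split(qualifiedName)
      sources(namespace).graph(name)
    }
  }
}
```

With a source registered under the namespace `social-net`, storing and then loading `social-net.US` round-trips through that source, which is the delegation pattern the catalog and PGDS traits describe.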
  • 44. PGDS implementations in Morpheus
      PGDS                                               Multiple graphs   Read graphs   Write graphs
      File-based (Parquet, ORC, CSV; HDFS, local, S3)    Yes               Yes           Yes
      SQL (Hive, JDBC)                                   Yes               Yes           No
      Neo4j Bolt                                         Yes               Yes           Yes
      Neo4j Bulk Import                                  No                No            Yes
  • 45. Catalog operations via Cypher Cypher Session Property Graph Catalog Property Graph Data Source <namespace> Property Graph <name> QualifiedGraphName = <namespace>.<name>
  • 46. Read from single Property Graph
      Cypher Session, Property Graph Catalog: “social-net” (Neo4j PGDS) with graph “US” (Property Graph)
          FROM social-net.US
          MATCH (p:Person)
          RETURN p
  • 47. Read from multiple Property Graphs
      Cypher Session, Property Graph Catalog: “social-net” (Neo4j PGDS) with graphs “US”, “EU”; “products” (SQL PGDS) with graphs “2018”, “2017”
          FROM social-net.US
          MATCH (p:Person)
          FROM products.2018
          MATCH (c:Customer)
          WHERE p.email = c.email
          RETURN p, c
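Matching persons from one graph with customers from another on `email` is, in relational terms, an equi-join. A minimal plain Scala sketch follows, assuming simplified `Person` and `Customer` records (their fields other than `email` are hypothetical) and a hash-based join.

```scala
object MultiGraphJoin {
  // Simplified node records; only `email` appears in the query on the slide.
  final case class Person(name: String, email: String)
  final case class Customer(customerId: Long, email: String)

  // Equivalent in spirit to:
  //   FROM social-net.US  MATCH (p:Person)
  //   FROM products.2018  MATCH (c:Customer)
  //   WHERE p.email = c.email
  //   RETURN p, c
  def samePeople(persons: Seq[Person], customers: Seq[Customer]): Seq[(Person, Customer)] = {
    // Build a lookup table on the join key, then probe it once per person.
    val customersByEmail: Map[String, Seq[Customer]] = customers.groupBy(_.email)
    for {
      p <- persons
      c <- customersByEmail.getOrElse(p.email, Seq.empty)
    } yield (p, c)
  }
}
```

Persons without a matching customer produce no row, which matches the behaviour of a `WHERE` predicate over two `MATCH` clauses.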
  • 48. Construct new Property Graphs
      Cypher Session, Property Graph Catalog: “social-net” (Neo4j PGDS) with graphs “US”, “EU”; “products” (SQL PGDS) with graphs “2018”, “2017”
          CATALOG CREATE GRAPH social-net.US_new {
            FROM social-net.US
            MATCH (p:Person)
            FROM products.2018
            MATCH (c:Customer)
            WHERE p.email = c.email
            CONSTRUCT ON social-net.US
              CREATE (p)-[:SAME_AS]->(c)
            RETURN GRAPH
          }
  • 49. Construct new Property Graphs
          CATALOG CREATE GRAPH social-net.US_new {
            FROM social-net.US
            MATCH (p:Person)
            FROM products.2018
            MATCH (c:Customer)
            WHERE p.email = c.email
            CONSTRUCT ON social-net.US
              CREATE (p)-[:SAME_AS]->(c)
            RETURN GRAPH
          }
      Cypher Session, Property Graph Catalog after the query: “social-net” (Neo4j PGDS) with graphs “US”, “EU”, “US_new”; “products” (SQL PGDS) with graphs “2018”, “2017”
  • 50. Create and query Graph Views
      Cypher Session, Property Graph Catalog: “social-net” (Neo4j PGDS) with graphs “US”, “EU”, ...; Views: “youngPeople”
          CATALOG CREATE VIEW youngPeople($sn) {
            FROM $sn
            MATCH (p:Person)-[r]->(n)
            WHERE p.age < 21
            CONSTRUCT
              CREATE (p)-[COPY OF r]->(n)
            RETURN GRAPH
          }

          FROM youngPeople(social-net.US)
          MATCH (p:Person)
          RETURN p
  • 52. Reminder: The Yelp Open Dataset
      (:Business) name: ACME, address: 123 ACME Rd., city: San Jose, state: CA
      (:User) name: Alice, since: 2013, elite: [2014, 2016]
      (:User) name: Bob, since: 2014, elite: null
      [:REVIEWS] stars: 5, date: 2014-02-03
      [:REVIEWS] stars: 4, date: 2014-08-03
  • 53. Part 2: Building a library of graphs https://git.io/fjZ25 ● Extract Yelp subgraph for specific city (Boulder City): (:User)-[:REVIEWS]->(:Business) ● Construct graphs for each year (2015 - 2018): (:User)-[:CO_REVIEWS]->(:User), (:Business)-[:CO_REVIEWED]->(:Business)
  • 54. Turning SQL tables into graphs Property graph schema and mappings
  • 55. PGDS on Steroids: The SQL PGDS JDBC Hive Oracle SQL Server Orc Parquet Table/View Table/View Table/View ... ... Graph DDL Graph Instance - Table mappings SQL Tables Property Graphs Property Graph Node Tables Rel. Tables Graph Type SQL Property Graph Data Source Spark SQL Data Sources Graph Type - Element types - Node types - Relationship types
  • 57. Reminder: The Yelp Open Dataset
      (:Business) name: ACME, address: 123 ACME Rd., city: San Jose, state: CA
      (:User) name: Alice, since: 2013, elite: [2014, 2016], email: alice@yelp.com
      (:User) name: Bob, since: 2014, elite: null, email: bob@yelp.com
      [:REVIEWS] stars: 5, date: 2014-02-03
      [:REVIEWS] stars: 4, date: 2014-08-03
  • 58. The Yelp Friendships (YelpBook) :User email: alice@yelp.com :User email : bob@yelp.com :FRIEND
  • 59. Part 3: Integrating Yelp and YelpBook Yelp Reviews Yelp Book Graph DDL + SQL PGDS (:User)-[:REVIEWS]->(:Business) (:User)-[:FRIEND]->(:User) https://git.io/fjZ2p
  • 60. Graph DDL: Graph Type definition
          CREATE GRAPH TYPE yelp (
            -- Element types (concepts used to describe a graph)
            User ( name STRING, since DATE ),
            Business ( name STRING, city STRING ),
            REVIEWS ( stars INTEGER, date LOCALDATETIME ),
            FRIEND,

            -- Node types
            (User),
            (Business),

            -- Relationship types
            (User)-[REVIEWS]->(Business),
            (User)-[FRIEND]->(User)
          )
  • 61. Graph DDL: Graph Instance definition
          CREATE GRAPH yelp_and_yelpBook OF yelp (
            -- Node type mappings
            (User) FROM HIVE.yelp.user,
            (Business) FROM HIVE.yelp.business,

            -- Relationship type mappings
            (User)-[REVIEWS]->(Business) FROM HIVE.yelp.review e
              START NODES (User) FROM HIVE.yelp.user n JOIN e.user_email = n.email
              END NODES (Business) FROM HIVE.yelp.business n JOIN e.business_id = n.business_id,

            (User)-[FRIEND]->(User) FROM H2.yelpbook.friend e
              START NODES (User) FROM HIVE.yelp.user n JOIN e.user1_email = n.email
              END NODES (User) FROM HIVE.yelp.user n JOIN e.user2_email = n.email
          )
  • 62. Data Science with Graphs Integrating Neo4j native graph algorithms
  • 63. Spark and Neo4j Platforms Coming in Spark 3.0
  • 64. Native graph algorithms The Neo4j database and Neo4j Graph Algorithms library
  • 65. Neo4j Graph Algorithms neo4j.com/docs/graph-algorithms/current/
      Pathfinding & Search: • Parallel Breadth First Search* • Parallel Depth First Search • Shortest Path* • Single-Source Shortest Path • All Pairs Shortest Path • Minimum Spanning Tree • A* Shortest Path • Yen’s K Shortest Path • K-Spanning Tree (MST) • Random Walk
      Centrality / Importance: • Degree Centrality • Closeness Centrality • CC Variations: Harmonic, Dangalchev, Wasserman & Faust • Betweenness Centrality • Approximate Betweenness Centrality • PageRank* • Personalized PageRank • ArticleRank • Eigenvector Centrality
      Community Detection: • Triangle Count* • Clustering Coefficients • Connected Components (Union Find)* • Strongly Connected Components* • Label Propagation* • Louvain Modularity – 1 Step & Multi-Step • Balanced Triad (identification)
      Similarity: • Euclidean Distance • Cosine Similarity • Jaccard Similarity • Overlap Similarity • Pearson Similarity
      Link Prediction: • Adamic Adar • Common Neighbors • Preferential Attachment • Resource Allocations • Same Community • Total Neighbors
      * Available in GraphFrames
  • 66. Free O’Reilly Book neo4j.com/graph-algorithms-book • Spark & Neo4j Examples • Machine Learning Chapter
  • 67. Simple Data Science workflows From Spark to Neo4j to Spark
  • 69. PageRank Algorithm • Use when • Anytime you’re looking for broad influence over a network • Many domain specific variations for differing analysis, e.g. Personalized PageRank for personalized recommendations • Examples: • Twitter Recommendations • Fraud Detection
  • 70. Part 4: Yelp trending businesses, 2017 to 2018 https://git.io/fjZ2j ● call algo.pagerank on (:Business)-[:CO_REVIEWED]->(:Business) for 2017 and for 2018 ● Join (⋈) the results: trendRank = pageRank_2018 - pageRank_2017
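The trendRank formula is a per-business difference of two PageRank scores. The sketch below shows that arithmetic in plain Scala, assuming the scores are already available as maps keyed by a business identifier and that the join is an inner join (businesses missing in either year are dropped); both are assumptions for illustration, and the demo itself computes this in Morpheus.

```scala
object TrendRank {
  // trendRank = pageRank_2018 - pageRank_2017, for businesses ranked in both years.
  def trendRank(
      pageRank2017: Map[String, Double],
      pageRank2018: Map[String, Double]
  ): Map[String, Double] =
    pageRank2018.collect {
      case (business, rank2018) if pageRank2017.contains(business) =>
        business -> (rank2018 - pageRank2017(business))
    }

  // Businesses ordered from strongest upward trend to strongest downward trend.
  def trending(
      pageRank2017: Map[String, Double],
      pageRank2018: Map[String, Double]
  ): Seq[(String, Double)] =
    trendRank(pageRank2017, pageRank2018).toSeq.sortBy { case (_, delta) => -delta }
}
```

A positive trendRank means the business gained PageRank between the two yearly graphs, so the head of `trending` holds the candidates for "trending businesses".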
  • 72. Louvain Modularity • Use when • Community Detection in large networks • Uncover hierarchical structures in data • Examples • Money Laundering • Protein-Protein-Interactions
  • 73. Jaccard Similarity • Use when • Computing pair-wise similarities • Accommodates vectors of different lengths • Examples • Recommendations • Disambiguation
  • 74. Part 5: Community-centric Recommendation (2017) https://git.io/fjZaU ● call algo.louvain on (:User)-[:CO_REVIEWS]->(:User): compute communities based on co-reviews ● For each community, call algo.jaccard on (:User)-[:REVIEWS]->(:Business): compute similarity based on overlapping reviewed businesses (:IS_SIMILAR) ● Recommend businesses similar users have reviewed
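Jaccard similarity over "overlapping reviewed businesses" is the size of the intersection divided by the size of the union of two users' review sets. A minimal plain Scala sketch follows; returning 0.0 when both sets are empty is a convention chosen here, not something the slides specify, and the demo itself uses Neo4j's algo.jaccard.

```scala
object Jaccard {
  // |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty (assumed convention).
  def similarity[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else a.intersect(b).size.toDouble / a.union(b).size.toDouble
}
```

For example, two users who reviewed businesses {1, 2, 3} and {2, 3, 4} share two of four distinct businesses, giving a similarity of 0.5.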
  • 75. Spark 3.0 Graph. The Spark Graph SPIP uses DataFrame-based Property Graphs
  • 76. Spark and Neo4j Platforms Coming in Spark 3.0
  • 77. Spark Project Improvement Proposal • SPARK-25994 Spark Graph for Apache Spark 3.0 • Property Graphs, Cypher Queries, and Algorithms • Defines a Cypher-compatible Property Graph type based on DataFrames • Replaces GraphFrames querying with Cypher • Reimplements GraphFrames/GraphX algos on the Property Graph type
  • 78. SPIP: What are we trying to do? • “Spark Cypher” • Run a Cypher 9 query on a Property Graph returning a tabular result • Migrate GraphFrames to Spark Graph • Implementation is based on Spark SQL • Property Graphs are composed of one or more DFs • Provide Scala, Python and Java APIs
  • 79. SPIP: What are we not solving? • Addresses the Cypher Property Graph Model • Does not deal with variants of that model (e.g. RDF) • No Cypher 10 multiple graph features • API is flexible to support this in future iterations • No Property Graph Catalog • Also no Property Graph Data Sources
  • 80. Try it out and get involved https://git.io/fjqp6
      [SPARK-27299][GRAPH][WIP] Spark Graph API design proposal (GraphExamplesSuite.scala)
          test("create PropertyGraph from Node- and RelationshipFrames") {
            val nodeData: DataFrame = spark.createDataFrame(Seq(0 -> "Alice", 1 -> "Bob")).toDF("id", "name")
            val relationshipData: DataFrame = spark.createDataFrame(Seq((0, 0, 1))).toDF("id", "source", "target")

            val nodeFrame: NodeFrame = NodeFrame(nodeData, "id", Set("Person"))
            val relationshipFrame: RelationshipFrame = RelationshipFrame(relationshipData, "id", "source", "target", "KNOWS")

            val graph: PropertyGraph = cypherSession.createGraph(Seq(nodeFrame), Seq(relationshipFrame))

            val result: CypherResult = graph.cypher(
              """
                |MATCH (a:Person)-[r:KNOWS]->(:Person)
                |RETURN a, r""".stripMargin)
            result.df.show()
          }
  • 81. Morpheus + Cypher in Spark 3.0 Morpheus will be plug-compatible with Cypher in Spark 3.0
  • 82. Morpheus and Spark Graph: API compatibility spark-graph-api spark-cypher spark-sql okapi morpheus spark-sql openCypher SPIP Cypher to relational operators compiler openCypher