When it comes to large, complex, and disparate data sets, traditional database technologies are unable to keep pace with the rich analytics needed to power today’s data-driven applications. Graph analytics databases are becoming the underlying infrastructure for AI and machine learning. These databases let users ask complex questions across complex data, which is not always practical or even possible at scale with other approaches. They also enable faster insights against massive data sets when combined with pattern recognition, statistical analysis, and AI/machine learning. Standards-based graph databases also connect with popular visualization tools like Graphileon, allowing users to easily explore their data stores and quickly build compelling graph-based applications.
Scalable, Fast Analytics with Graph - Why and How
1. Scalable, Fast Analytics with Graph - Why and How
Thomas Cook, Director of Sales, AnzoGraph DB
Tom Zeppenfeldt, Founder Graphileon
2. Agenda
▪ Why Scalable, Fast Graph Analytics are Important
▪ How AnzoGraph DB will help you to achieve Scalable, Fast Graph Analytics
▪ Airline Flight Data Modeling and Analysis
▪ Graphileon Application Demo
3. About Cambridge Semantics
Based in Boston and San Diego
100+ Employees
TEAM
• Founded by senior team from IBM’s Advanced Internet Technology Group
• Complemented by the MPP technology team that previously founded Netezza & ParAccel (Amazon Redshift)
• Experienced executive team with proven track record of success
PRODUCTS
• Anzo
– Enterprise-scale data fabric for automated data management & analytics
• AnzoGraph DB
– Part of the Anzo platform. Now available separately
– Graph analytics database that is truly faster than other graph databases.
Award-winning Software
Select Customers
4. Anzo - The Modern Data Discovery and Integration Layer for the Enterprise Data Fabric
Automated Deployment and Operations
Storage and Compute Integration
Enterprise Data Sources
Ingest
MODEL: Graph Data Model
• Lift Data into Data Fabric
• Design Ontologies
• Connect Data Models
ON-BOARD: Ingest & Map
• Automated ETL
• Collaborative Mapping
• Metadata Capture
BLEND: GraphMarts
• Combine and Align Related Data Sets
• In-memory MPP OLAP Query Engine
• Data Layers
ACCESS: Hi-Res Analytics
• Analyze All Data Together
• Fast, Iterative Queries: Ad Hoc, What if
• Code Free or API
Graphical Application Interface
Machine Learning and AI
Enterprise Search
“Last Mile” Analytics Tools
Metadata Catalog: Semantic-based Metadata Management, Governance and Lineage
Data Storage Layer: Cloud or On-Prem Data Storage Infrastructure
5. Why Scalable, Fast Graph Analytics are Important
Common Use Cases
• Enterprise Knowledge Graphs
• Customer 360
• Cyber Security
• Supply Chain
• Fraud Detection
• Anti-Money Laundering
• Network Optimization
• IoT
• Unstructured Data Analysis
• Advanced Analytics
• Machine Learning/AI
• Web Scale Applications
Call for Graph Database Capabilities
• Connected Data
• Visualization
• Graph Algorithms
• Machine Learning
• Inferencing
• Traditional Analytics
• Graph Patterns
7. Need answers quickly from disparate systems
EDW
• Rigid, brittle data model
• Can take years
• Millions of dollars
• Does not adapt well to change
Many Hadoop projects fail
• Raw operational data dumps become unwieldy, difficult to consume and manage
• Referred to as the “Data Swamp”
Data engineering efforts are costly, complex, lack lineage, and are often not repeatable
Heavy-volume Spark clusters are difficult to manage and tune properly
10. Deployment: Bare Metal, Platform as a Service, Kubernetes and Helm
BARE METAL
• Red Hat/CentOS Deployment
PLATFORM AS A SERVICE
• AWS CloudFormation Deployment
• Deployment on Amazon Web Services (AWS), Azure and GCP.
KUBERNETES AND HELM
• Kubernetes and Docker Deployment
• Best with Kubernetes cluster managed by Helm
• Image deployment directly on Docker CE also supported
One or many nodes
Scale and Deploy as your data grows
12. Introduction: Graph Database built for Analytics
Built for Analytics
• Graph OLAP built for analytics (Joins vs traverse. Shared-nothing architecture. Each core contributes to the query)
• Well-suited for deep link analysis of small and large data sets
• Use for analytics or complement your OLTP graph database engine with OLAP
Massively Parallel
• Native MPP graph database
• The fastest data loading and analytics capability
• Highly parallelized and horizontally scalable
• Scales to trillions of triples (benchmarked), making analysis on even massive data sets possible
Standards-based
• Supports W3C standards (RDF & SPARQL) and Labelled Property Graph standards (RDF*/SPARQL* & OpenCypher)
• Access to a variety of NLP, visualization, and analytics partners supporting the standards.
Analytics-rich
• Semantic context with labelled properties to execute rich analytics
• Graph algorithms, BI-style analytics, inferencing, quad store mode, views, user defined extensions and much more.
14. Graph OLAP Built for Analytics at Scale and Speed: AnzoGraph DB Benchmark Results
The Fastest and Most Scalable Graph Database
• LUBM Benchmark: 113X faster than previous benchmark
• TPC-H (GHIB): 217X faster than a leading Graph OLTP system for load and query
• 10-300X faster than Spark SQL and GraphFrames
• Graph500: Load 41.6M vertices & 1.47B edges in 4.5 minutes
• Graph Algorithm (WCC): Neo4j in 73 seconds, AnzoGraph DB in 3.7 seconds
Download benchmarking reports & Bloor Group Graph Market Update 2019 at AnzoGraph.com
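To put the quoted figures on a common footing, the short sketch below derives a speedup ratio and a load rate from the numbers on this slide. It only does arithmetic on the slide's own values; the variable names are illustrative and the derived figures are not part of the published benchmark reports.

```python
# Figures quoted on the benchmark slide.
neo4j_wcc_seconds = 73.0
anzograph_wcc_seconds = 3.7
graph500_edges = 1.47e9
graph500_load_minutes = 4.5

# Speedup is the ratio of the two wall-clock times for the WCC algorithm.
wcc_speedup = neo4j_wcc_seconds / anzograph_wcc_seconds

# Load throughput in edges per second for the Graph500 load.
edges_per_second = graph500_edges / (graph500_load_minutes * 60)

print(f"WCC speedup: {wcc_speedup:.1f}x")
print(f"Graph500 load rate: {edges_per_second / 1e6:.2f}M edges/s")
```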
15. Scalability: Performance Impact on Scaling Servers with AnzoGraph DB
• 1 node: 2.6 billion triples
• 40 nodes: 105 billion triples
• One Minute
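A quick check of the two configurations shows that the data held per node is roughly constant, which suggests the slide describes scaling the data and the cluster together. The sketch below is plain arithmetic on the slide's numbers; reading "One Minute" as the time for both configurations is an assumption, since the slide text does not state it explicitly.

```python
# Figures quoted on the scalability slide.
single_node_triples = 2.6e9
cluster_triples = 105e9
cluster_nodes = 40

# Triples each node holds in the 40-node configuration.
triples_per_node = cluster_triples / cluster_nodes

# Ratio of per-node load in the cluster to the single-node load.
per_node_ratio = triples_per_node / single_node_triples

print(f"Triples per node at 40 nodes: {triples_per_node / 1e9:.3f} billion")
print(f"Per-node load relative to 1 node: {per_node_ratio:.2f}x")
```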
19. Conversion from CSV to Graph – Defining Triples
[Diagram: three triples]
Flight --FlightDeparture--> Airport
Flight --FlightArrival--> Airport
Airport --DESTINATION--> Airport
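20. Conversion from CSV to Graph
[Diagram: the triples combined into one graph. A Flight node links to one Airport via FlightDeparture and to another Airport via FlightArrival; the two Airport nodes are linked by DESTINATION.]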
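The conversion these two slides describe can be sketched in a few lines of Python. This is a minimal illustration, not the loader AnzoGraph DB uses: the sample rows, the `Flight/...` and `Airport/...` identifier scheme, and the function names are all assumptions; only the column names and the three edge names come from the slides.

```python
import csv
import io

# Hypothetical sample rows; only the column names come from the slides.
SAMPLE_CSV = """AIRLINE,FLIGHT_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT
AA,100,BOS,JFK
DL,200,JFK,AUS
"""

def row_to_triples(row):
    """Turn one CSV row into (subject, predicate, object) triples."""
    # Hypothetical identifier scheme: airline code plus flight number.
    flight = f"Flight/{row['AIRLINE']}{row['FLIGHT_NUMBER']}"
    origin = f"Airport/{row['ORIGIN_AIRPORT']}"
    dest = f"Airport/{row['DESTINATION_AIRPORT']}"
    return [
        (flight, "FlightDeparture", origin),
        (flight, "FlightArrival", dest),
        (origin, "DESTINATION", dest),
    ]

def csv_to_triples(text):
    """Parse CSV text and return the de-duplicated, sorted set of triples."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        triples.update(row_to_triples(row))
    return sorted(triples)

triples = csv_to_triples(SAMPLE_CSV)
for triple in triples:
    print(triple)
```

Using a set means that many flights on the same route still yield a single DESTINATION edge between the two airports.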
21. Nodes have types and properties
Flight CSV columns: YEAR, MONTH, DAY, DAY_OF_WEEK, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, ORIGIN_AIRPORT, DESTINATION_AIRPORT, ….
Node Type: Flight
Node Properties: Airline, Flight Number, Tail Number, etc.
*Note: Types can also be called Labels, as in Labeled Property Graphs or LPG
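One way to picture a typed node with properties is a small record that carries a type (or label) and a property map. The sketch below is illustrative only: the `Node` class, the `flight_node` helper, and the sample values are assumptions, and treating the two airport columns as edges rather than properties follows the triples defined on the earlier slides.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str  # the "type", called a "label" in LPG terms
    properties: dict = field(default_factory=dict)

# Columns that become edges to Airport nodes rather than node properties.
EDGE_COLUMNS = {"ORIGIN_AIRPORT", "DESTINATION_AIRPORT"}

def flight_node(row):
    """Build a Flight node from one CSV row (a dict of column -> value)."""
    props = {k: v for k, v in row.items() if k not in EDGE_COLUMNS}
    # Hypothetical identifier scheme: airline code plus flight number.
    node_id = f"{row['AIRLINE']}{row['FLIGHT_NUMBER']}"
    return Node(node_id=node_id, node_type="Flight", properties=props)

# Hypothetical values; only the column names come from the slide.
row = {
    "YEAR": "2015", "MONTH": "1", "DAY": "1", "DAY_OF_WEEK": "4",
    "AIRLINE": "AA", "FLIGHT_NUMBER": "100", "TAIL_NUMBER": "N123AA",
    "ORIGIN_AIRPORT": "BOS", "DESTINATION_AIRPORT": "JFK",
}
node = flight_node(row)
print(node.node_type, node.node_id, node.properties["TAIL_NUMBER"])
```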
22. With RDF* edges can also have properties
Airport (AIRPORT_CODE = 'BOS') --DESTINATION--> Airport (AIRPORT_CODE = 'JFK')
Edge Property: DISTANCE = 187
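In RDF*, an edge property is expressed as a triple whose subject is itself a triple. The sketch below models that nesting in plain Python and prints it in a Turtle*-style `<< ... >>` form. The `Triple` class, the `to_turtle_star` helper, and the `:BOS`-style prefixed names are hypothetical; the BOS to JFK direction follows the reading order of the slide, and no RDF library is used.

```python
from typing import NamedTuple, Union

class Triple(NamedTuple):
    subject: Union[str, "Triple"]  # a node name, or a whole triple (an edge)
    predicate: str
    obj: Union[str, int, float]

def to_turtle_star(triple):
    """Render a triple in a Turtle*-style syntax, quoting nested triples."""
    def term(value):
        if isinstance(value, Triple):
            return f"<< {term(value.subject)} {value.predicate} {term(value.obj)} >>"
        return str(value)
    return f"{term(triple.subject)} {triple.predicate} {term(triple.obj)} ."

# The edge itself: one airport has the other as a destination.
edge = Triple(":BOS", ":DESTINATION", ":JFK")
# The edge property: the edge is the subject of a second triple.
edge_property = Triple(edge, ":DISTANCE", 187)

print(to_turtle_star(edge))
print(to_turtle_star(edge_property))
```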
24. Combining additional data sets
[Diagram: the Flight and Airport graph (FlightDeparture, FlightArrival, DESTINATION) extended with CityState, Country, Airline, and Aircraft nodes.]
Data sets: Flight Delay, FAA Airline Census Data
25. Now we are ready to ask questions like:
BI-Style Analytics
#1 Longest flight segments by distance from Boston (BOS)
#2 Airports less than 400 mi from Boston (BOS) - Network Viewer output
#3 Longest distances between two airports
#4 Longest flights by elapsed time
#5 Airlines with the longest average delays
#6 Airlines with the most flights
#7 Longest 2 segments reachable from Boston and the distances of each segment
#8 Which segments have the longest average departure delays?
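A few of the BI-style questions above can be illustrated with ordinary aggregation over flight records. The sketch below is plain Python rather than the SPARQL or Cypher a graph database would run, and everything in it is an assumption made for illustration: the sample flights, the field names, and the function names.

```python
from collections import defaultdict

# Hypothetical sample flights; values are illustrative, not from the data set.
FLIGHTS = [
    {"airline": "AA", "origin": "BOS", "dest": "JFK", "distance": 187, "dep_delay": 10},
    {"airline": "AA", "origin": "BOS", "dest": "SFO", "distance": 2704, "dep_delay": 30},
    {"airline": "DL", "origin": "BOS", "dest": "ATL", "distance": 946, "dep_delay": 5},
    {"airline": "DL", "origin": "JFK", "dest": "AUS", "distance": 1521, "dep_delay": 15},
]

def longest_segments_from(flights, origin, top=3):
    """Question #1: longest distinct segments by distance from one airport."""
    segments = {(f["dest"], f["distance"]) for f in flights if f["origin"] == origin}
    return sorted(segments, key=lambda s: s[1], reverse=True)[:top]

def airports_within(flights, origin, max_miles):
    """Question #2: destinations closer than max_miles to the origin."""
    return sorted({f["dest"] for f in flights
                   if f["origin"] == origin and f["distance"] < max_miles})

def average_delay_by_airline(flights):
    """Question #5: airlines ranked by average departure delay."""
    delays = defaultdict(list)
    for f in flights:
        delays[f["airline"]].append(f["dep_delay"])
    averages = {airline: sum(d) / len(d) for airline, d in delays.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

print(longest_segments_from(FLIGHTS, "BOS"))
print(airports_within(FLIGHTS, "BOS", 400))
print(average_delay_by_airline(FLIGHTS))
```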
Graph Algorithms
#9 PageRank - Graph Algorithm - Show the most well-connected airports based on the PageRank algorithm
#10 Shortest Path Graph Algorithm - show shortest paths and # of segments (hops) from AUS
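The two graph-algorithm questions can be sketched on a toy route graph. This is a textbook-style illustration, not AnzoGraph DB's implementation: the routes, the damping factor of 0.85, the fixed iteration count, and the function names are all assumptions. PageRank is computed by power iteration, and hop counts from AUS by breadth-first search.

```python
from collections import deque

# Hypothetical directed route graph: airport -> airports it flies to.
ROUTES = {
    "AUS": ["DFW", "JFK"],
    "DFW": ["AUS", "BOS", "JFK"],
    "JFK": ["BOS", "DFW"],
    "BOS": ["JFK"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency-list graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in graph.items():
            if targets:
                # Split this node's damped rank evenly across its out-edges.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A node with no out-edges spreads its rank over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

def hops_from(graph, source):
    """Breadth-first search: fewest segments (hops) from source to each airport."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for t in graph[v]:
            if t not in hops:
                hops[t] = hops[v] + 1
                queue.append(t)
    return hops

ranks = pagerank(ROUTES)
hops = hops_from(ROUTES, "AUS")
for airport in sorted(ranks, key=ranks.get, reverse=True):
    print(airport, round(ranks[airport], 3), "hops from AUS:", hops[airport])
```

Breadth-first search gives the fewest hops because every segment counts equally; weighting segments by distance would call for a weighted shortest-path algorithm such as Dijkstra's instead.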