Financial crime prevention is something that affects everyone in one way or another. From the Deutsche Banks of the world to small and medium online merchants, regulations for anti-money laundering, know your customer, and customer due diligence apply.
Failing to comply with such regulations can bring on substantial fines. Even more importantly, it can hurt the bottom line and reputation of businesses, having far-reaching side effects. Complying with such regulations, and actively cracking down on financial crime, however, is not easy.
Cross-referencing interconnected data across various datasets, applying detection rules, and discovering patterns in the data is complicated. It takes expertise, effort, and the right technology to do this efficiently.
A natural and efficient way of looking for patterns and applying rules in troves of interconnected data is to model and view that data as a graph. By modeling data as a graph and applying graph algorithms such as PageRank or centrality measures, traversing paths, discovering connections, and extracting insights become possible.
Graphs and graph databases are the fastest-growing area of data management technology, not least because they are a perfect match for use cases involving interconnected data.
Queries that would be very complicated to express and very slow to execute using relational databases or other NoSQL technologies are feasible using graph databases. With the rising complexity of modern financial markets, detecting financial crime requires going 4 to 11 levels deep into the account-payment graph, and this calls for a different solution than either relational or NoSQL databases.
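As a rough illustration of what "going N levels deep" means, here is a depth-limited breadth-first search over a payment graph in plain Python (account names and the adjacency-dict representation are hypothetical; a graph database does this natively, at scale):

```python
from collections import deque

def accounts_within_hops(payments, start, max_hops):
    """Return every account reachable from `start` in at most
    `max_hops` transfers, mapped to its hop distance.
    `payments` maps an account to the accounts it has paid."""
    seen = {start: 0}
    frontier = deque([start])
    while frontier:
        acct = frontier.popleft()
        if seen[acct] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in payments.get(acct, ()):
            if nxt not in seen:
                seen[nxt] = seen[acct] + 1
                frontier.append(nxt)
    return seen

# A toy 5-hop chain of payments: a -> b -> c -> d -> e -> f
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": ["f"]}
```

On a relational system each extra hop is another self-join; here it is just one more frontier expansion.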
How are organizations such as Alibaba, OpenCorporates, and Visa using graph database technology to not just stay on top of regulation, but be one step ahead in the race against financial crime?
Is it possible to do this in real time?
What do graph query languages have to do with this?
10. Using Subgraph or Relationship Discovery Combined with Graph Computation to Find Diamonds of Money Laundering
Financial institutions collaborate to build transactional + knowledge graphs to identify money laundering rings and layering.
• Layering: split "dirty money" into smaller amounts, transfer it from account to account, and eventually merge it.
• Money laundering ring: money is transferred in a circle.
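A ring is simply a cycle in the transfer graph. As a minimal sketch (plain Python, hypothetical account names; production systems express this as a graph query over billions of edges), a depth-first search can enumerate transfer paths that return money to the originating account:

```python
def find_rings(transfers, start, max_len=8):
    """Find simple cycles through `start` in a transfer graph:
    chains of transfers that return money to the originating account.
    `transfers` maps an account to the accounts it has paid."""
    rings = []

    def dfs(node, path):
        if len(path) > max_len:
            return  # bound the search depth
        for nxt in transfers.get(node, ()):
            if nxt == start and len(path) > 1:
                rings.append(path + [start])   # closed the circle
            elif nxt not in path:              # keep the cycle simple
                dfs(nxt, path + [nxt])

    dfs(start, [start])
    return rings

# Toy graph: a -> b -> c -> a forms a ring; c -> d is a dead end.
g = {"a": ["b"], "b": ["c"], "c": ["a", "d"]}
```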
23. The Age of the Graph Is Upon Us (Again)
• Early-mid-90s: semi- or unstructured data research was all the rage
• data logically viewed as graph, initially motivated by modeling WWW (page=vertex, link=edge)
• query languages expressing constrained reachability in graph
• Late 90s-late 2000s: special case XML (graph restricted to tree shape)
• Mature: W3C standard ecosystem for modeling and querying (XQuery, XPath, XLink, XSLT, XML Schema, …)
• Since mid 2000s: JSON and friends (also graphs restricted to tree shape)
• MongoDB, Couchbase, SPARQL, GraphQL, AsterixDB, …
• ~2010 to present: back to unrestricted graphs
• Initially motivated by analytic tasks in social networks
• Now universal use (most interesting data is linked, after all)
24. The Graph Data Model
• Nodes model real-world entities
• Edges are binary, they model relationships
• may be directed or undirected (asymmetric, resp. symmetric relationships)
• Nodes and edges may carry labels
• Nodes and edges annotated with data
• both have sets of attributes (key-value pairs)
25. Example Graph
Vertex types:
• Product (name, category, price)
• Customer (ssn, name, address)
Edge types:
• Bought (discount, quantity)
• Customer c bought 100 units of product p at a 5% discount, modeled by the edge:
c --(Bought {discount=5%, quantity=100})--> p
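The example above can be held in a minimal property-graph structure. This is a plain-Python sketch of the data model (not any particular database's API); node ids and attribute values are made up:

```python
class PropertyGraph:
    """Minimal property graph: labeled nodes and edges, each carrying a
    dict of key-value attributes, mirroring the vertex/edge types above."""

    def __init__(self):
        self.nodes = {}   # node id -> (label, attributes)
        self.edges = []   # (source id, label, attributes, target id)

    def add_node(self, node_id, label, **attrs):
        self.nodes[node_id] = (label, attrs)

    def add_edge(self, src, label, dst, **attrs):
        self.edges.append((src, label, attrs, dst))

g = PropertyGraph()
g.add_node("c1", "Customer", ssn="123-45-6789", name="Ann", address="1 Main St")
g.add_node("p1", "Product", name="Blocks", category="Toys", price=9.99)
# Customer c1 bought 100 units of product p1 at a 5% discount:
g.add_edge("c1", "Bought", "p1", discount=0.05, quantity=100)
```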
26. Key Language Ingredients from the Past
• Pioneered by academic work on relational query extensions for graphs (since ‘87)
• Path expressions (PEs) for navigation
• Variables for manipulating data found during navigation
• Stitching multiple PEs into complex navigation patterns: conjunctive path queries
• Constructors for new nodes and edges
28. Current Representative Graph QLs
in Order of Appearance
• SPARQL
• mature, W3C standard recommendation, but not aimed at analytics of arbitrary graphs: RDF, ontologies, semantic web
• Cypher (Neo4j)
• essentially 1990s’ StruQL with bells and whistles; inherits CRPQ syntactic style
• Gremlin (Apache project and commercial products)
• dataflow programming model: graph annotated with tokens (“traversers”) that flow through it according to user program
• GSQL (TigerGraph)
• Inspired by SQL, with support for massively parallel graph analytics
29. Key Language Ingredients Needed in Modern Applications
• All primitives inherited from the past (path expressions, conjunctive patterns, variables, node/edge construction): SPARQL, Cypher, Gremlin, GSQL
• Support for large-scale graph analytics:
• Customizable path traversal semantics: Gremlin, GSQL
• Aggregation of data encountered during traversal: SPARQL (partial), Cypher, Gremlin, GSQL
• Control flow for the class of iterative algorithms that converge in multiple steps (e.g. PageRank-class, recommender systems, shortest paths, etc.): Gremlin, GSQL
• Intermediate results assigned to nodes/edges to support parallel computation (programming mindset + execution): GSQL
31. Aggregation in Current Graph QLs
• Cypher’s RETURN clause uses syntax similar to aggregation-extended CRPQs
• Gremlin and SPARQL use an SQL-style GROUP BY clause
• GSQL uses aggregating containers called “accumulators”
• soon to add the above modes as syntactic sugar, but will preserve accumulators, which remain strictly more versatile
32. GSQL Accumulators
• GSQL traversals collect and aggregate data by writing it into accumulators
• Accumulators are containers (data types) that
• hold a data value
• accept inputs
• aggregate inputs into the data value using a binary operation
• May be built-in (sum, max, min, etc.) or user-defined
• May be
• global (a single container)
• vertex-attached (one container per vertex)
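A rough Python analogy for how accumulators behave (illustrative only; GSQL's actual accumulators are declared types that the engine evaluates in parallel): a container holding a value, folding every accepted input into it with a fixed binary operation.

```python
class SumAccum:
    """Accumulator whose binary operation is addition (GSQL spells
    the accept-input step `+=`)."""
    def __init__(self, initial=0):
        self.value = initial
    def accum(self, x):
        self.value += x

class MaxAccum:
    """Accumulator whose binary operation is max."""
    def __init__(self, initial=float("-inf")):
        self.value = initial
    def accum(self, x):
        self.value = max(self.value, x)

# Global accumulator: a single container.
total = SumAccum()
for amount in (10, 5, 7):
    total.accum(amount)

m = MaxAccum()
m.accum(2)
m.accum(5)

# Vertex-attached accumulators: one container per vertex.
per_vertex = {v: SumAccum() for v in ("a", "b")}
per_vertex["a"].accum(3)
```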
33. Vertex-Attached Accumulator Example: Revenue per Customer and per Product
• Maximize opportunities for parallel evaluation
SumAccum<float> @cSales, @pSales;
SELECT c
FROM Customer :c -(-Bought-> :b)- Product :p
ACCUM float thisSaleRevenue = b.quantity*(1-b.discount)*p.price,
c.@cSales += thisSaleRevenue,
p.@pSales += thisSaleRevenue;
• vertex-attached accums: one instance per node; groups are distributed, and each node accumulates its own group, so the computation can be parallelized
• this sale’s revenue contributes to two aggregations, each with a distinct grouping criterion
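The same computation rendered as plain Python (hypothetical sale tuples), which makes the two simultaneous groupings explicit, though it loses the parallel evaluation the vertex-attached form enables:

```python
from collections import defaultdict

def revenue_per_customer_and_product(sales):
    """Each sale (customer, product, quantity, discount, price)
    contributes to two running sums at once: the customer's revenue
    and the product's revenue, mirroring the ACCUM clause above."""
    c_sales = defaultdict(float)   # plays the role of @cSales
    p_sales = defaultdict(float)   # plays the role of @pSales
    for c, p, quantity, discount, price in sales:
        this_sale_revenue = quantity * (1 - discount) * price
        c_sales[c] += this_sale_revenue
        p_sales[p] += this_sale_revenue
    return c_sales, p_sales

sales = [("ann", "blocks", 100, 0.05, 10.0),   # 950.0
         ("ann", "doll", 2, 0.0, 20.0)]        # 40.0
```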
34. Recommended Toys Ranked by Log-Cosine Similarity
SumAccum<float> @rank, @lc;
SumAccum<int> @inCommon;
I = {Customer.1};
SELECT p INTO ToysILike, o INTO OthersWhoLikeThem
FROM I:c -(-Likes->)- Product:p -(<-Likes-)- Customer:o
WHERE p.category == "Toys" AND o != c
ACCUM o.@inCommon += 1
POST-ACCUM o.@lc = log(1 + o.@inCommon);
SELECT t INTO ToysTheyLike
FROM OthersWhoLikeThem:o -(-Likes->)- Product:t
WHERE t.category == "Toys"
ACCUM t.@rank += o.@lc;
RecommendedToys = ToysTheyLike - ToysILike;
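A plain-Python rendering of the two-hop query (toy data, hypothetical names), useful for checking the ranking logic: each other customer o gets a log-cosine weight log(1 + |toys we both like|), and every toy o likes inherits that weight.

```python
import math

def recommend_toys(likes, me, category):
    """Two-hop recommendation mirroring the GSQL query above.
    `likes` maps a customer to the products they like;
    `category` maps a product to its category."""
    my_toys = {p for p in likes.get(me, ()) if category[p] == "Toys"}

    in_common = {}                        # @inCommon, per other customer
    for o, prods in likes.items():
        if o == me:
            continue
        shared = my_toys & set(prods)
        if shared:
            in_common[o] = len(shared)

    rank = {}                             # @rank, per candidate toy
    for o, n in in_common.items():
        lc = math.log(1 + n)              # @lc for customer o
        for t in likes[o]:
            if category[t] == "Toys":
                rank[t] = rank.get(t, 0.0) + lc

    # RecommendedToys = ToysTheyLike - ToysILike
    return {t: r for t, r in rank.items() if t not in my_toys}

likes = {"me": ["ball", "lego"], "o1": ["ball", "train"], "o2": ["lego", "train"]}
category = {"ball": "Toys", "lego": "Toys", "train": "Toys"}
```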
35. Essential: Control-Flow, Particularly Loops
• Loops (until condition is satisfied)
• Necessary to program iterative algorithms, e.g. PageRank, recommender
systems, shortest-path, etc.
• They synergize with accumulators. This GSQL-unique combination
concisely expresses sophisticated graph analytics
• Can be used to program unbounded-length path traversal under various
semantics
36. PageRank in GSQL
CREATE QUERY pageRank (float maxChange, int maxIteration, float dampingFactor) {
MaxAccum<float> @@maxDifference = 9999; // max score change in an iteration
SumAccum<float> @received_score = 0; // sum of scores received from neighbors
SumAccum<float> @score = 1; // initial score for every vertex is 1.
AllV = {Page.*}; // start with all vertices of type Page
WHILE @@maxDifference > maxChange LIMIT maxIteration DO
@@maxDifference = 0;
S= SELECT s
FROM AllV:s -(Linkto)-> :t
ACCUM t.@received_score += s.@score/s.outdegree()
POST-ACCUM s.@score = 1-dampingFactor + dampingFactor * s.@received_score,
s.@received_score = 0,
@@maxDifference += abs(s.@score - s.@score');
END;
}
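For readers more comfortable with an imperative rendering, the same loop structure in plain Python (a sketch only; the GSQL version evaluates the SELECT in parallel across vertices, and `s.@score'` reads the score from before the update):

```python
def pagerank(links, max_change=0.001, max_iteration=25, damping=0.85):
    """Iterate until the largest per-vertex score change drops below
    `max_change` or `max_iteration` rounds have run. `links` maps each
    vertex to the vertices it links to; every vertex must be a key."""
    score = {v: 1.0 for v in links}               # @score, initially 1
    for _ in range(max_iteration):
        received = {v: 0.0 for v in links}        # @received_score
        for s, targets in links.items():
            if targets:
                share = score[s] / len(targets)   # s.@score / s.outdegree()
                for t in targets:
                    received[t] += share
        max_diff = 0.0                            # @@maxDifference
        for v in links:
            new_score = (1 - damping) + damping * received[v]
            max_diff = max(max_diff, abs(new_score - score[v]))
            score[v] = new_score
        if max_diff <= max_change:
            break
    return score
```

On a symmetric two-page graph the scores settle at 1.0 immediately, which is a quick sanity check for the update rule.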