Given a large-scale graph with millions of nodes and edges, how can we reveal macro patterns of interest, such as cliques, bi-partite cores, stars, and chains? Furthermore, how can we visualize such patterns together, gaining insights from the graph that support sound decision-making? Although there are many algorithmic and visual techniques for analyzing graphs, none of the existing approaches can present the structural information of graphs at large scale. Hence, this paper describes StructMatrix, a methodology for highly scalable visual inspection of graph structures with the goal of revealing macro patterns of interest. StructMatrix combines algorithmic structure detection and adjacency matrix visualization to present the cardinality, distribution, and relationship features of the structures found in a given graph. We performed experiments on real, large-scale graphs with up to one million nodes and millions of edges. StructMatrix revealed that graphs of high relevance (e.g., Web, Wikipedia and DBLP) have characterizations that reflect the nature of their corresponding domains; our findings have not been reported in the literature so far. We expect that our technique will bring deeper insights into large graph mining, supporting its use in decision making.
StructMatrix: large-scale visualization of graphs by means of structure detection and dense matrices
1. StructMatrix: large-scale visualization of graphs by
means of structure detection and dense matrices
Hugo Gualdron, Robson L. F. Cordeiro, Jose F Rodrigues-Jr
University of Sao Paulo
In collaboration with Carnegie Mellon University
(Prof. Christos Faloutsos, and PhD Danai Koutra)
Funding by research agency Fapesp (2013/03906-0, 2014/07879-0, 2015/18335)
In: The Fifth IEEE ICDM Workshop on Data Mining in Networks,
Atlantic City, NJ, USA - November, 2015
http://www.icmc.usp.br/pessoas/junio
Jose F Rodrigues-Jr (University of Sao Paulo) 1 / 20
2. Introduction
Motivation
Big Data!!!
A lot of information, much of it in the form of relationships;
Large-scale graphs: graphs generated by applications in which users
or entities are distributed along large geographical areas - even the
entire planet;
Social networks, recommendation networks, road nets, e-commerce,
computer networks, client-product logs, and many others.
Data analysis is the differentiator in industrial competition.
General Electric & Accenture.
3. Introduction
Problem
Such graphs are too big:
node-link visualization cannot handle even thousand-vertex graphs;
adjacency matrices are limited by the number of pixels of the screen;
in any case, the sheer number of nodes prevents reasoning about the graph;
non-visual analytical techniques may produce far too many
patterns for human cognition to absorb.
Still, we want to characterize the structure of graphs for:
understanding the overall structure, beyond distribution-based
analyses;
spotting outliers and trends that are not dominant;
requesting details on demand concerning subregions of the graph
topology.
4. Introduction
Problem
Layouts: node-link and adjacency matrix
[Figure: node-link layout (left) and adjacency matrix layout (right)]
Scalability: node-link, hundreds of nodes; adjacency matrix, thousands of nodes.
5. Introduction
Methodology overview
Assumptions:
graphs are made of recurrent simple structures (cliques, bi-partite
cores, stars, and chains);
such structures are more meaningful than sole nodes;
even at lower resolutions, the graph's main properties are preserved in
a visualization.
Hypothesis: we reach more scalable and meaningful graph visualizations
with:
graph summarization by detecting recurrent structures of the graph;
dense adjacency matrices.
6. Methodology
Proposed method: StructMatrix
Our method has two parts:
1 An algorithm to detect substructures;
2 A dense adjacency matrix of the structures that were detected.
8. Methodology
1. Structure detection
We designed a graph partitioning algorithm based on the fact that
real-world graphs obey power-law degree distributions;
In such graphs: few nodes with very high degree and the majority of
nodes with low degree;
Kang and Faloutsos [1] demonstrated that the ordered removal of the
higher degree nodes leads to the removal of hubs from the giant CC,
creating satellite (much smaller) connected components;
This ordered removal lends itself to a structural scanning of the graph.
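The hub-removal idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_adjacency`, `connected_components`, `shatter_once`) and the toy edge list are assumptions made for the example, and only a single removal step is shown.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def connected_components(adj, nodes):
    """Connected components (as sets) of the subgraph induced by `nodes`."""
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], {start}
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        components.append(comp)
    return components


def shatter_once(adj, nodes):
    """Remove the highest-degree node (the hub) and return the pieces left behind."""
    nodes = set(nodes)
    # Degree is measured inside the current subgraph only.
    hub = max(sorted(nodes), key=lambda u: len(adj[u] & nodes))
    return hub, connected_components(adj, nodes - {hub})


# Toy graph (hypothetical): node 0 is a hub holding three satellites together.
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (4, 5)]
adj = build_adjacency(EDGES)
hub, satellites = shatter_once(adj, adj.keys())
print(hub, satellites)
```

Repeating `shatter_once` on any component that is still too large or unrecognized gives the ordered scan described above.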
15. Methodology
1. Structure detection – Algorithm
4 We store the classified subcomponents; the ones that were not
identified go to the queue waiting for a new round of shattering.
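The classify-or-requeue step can be sketched as below. The degree-based rules are assumptions chosen for illustration (the slides do not give the exact tests), only three structure types are covered, and the names `classify` and `triage` are hypothetical.

```python
from collections import deque


def classify(component, adj):
    """Label a connected component from its degree sequence, or return None."""
    n = len(component)
    degrees = sorted(len(adj[u] & component) for u in component)
    if n < 3:
        # Arbitrary choice for this sketch: tiny pieces are trivial cases.
        return "chain" if n == 2 else "singleton"
    if all(d == n - 1 for d in degrees):
        return "clique"
    if n >= 4 and degrees[-1] == n - 1 and all(d == 1 for d in degrees[:-1]):
        return "star"
    # A connected graph with two degree-1 ends and degree 2 elsewhere is a path.
    if degrees[:2] == [1, 1] and all(d == 2 for d in degrees[2:]):
        return "chain"
    return None


def triage(components, adj):
    """Store classified components; queue the rest for another shattering round."""
    classified, queue = [], deque()
    for comp in components:
        label = classify(comp, adj)
        if label is None:
            queue.append(comp)
        else:
            classified.append((label, comp))
    return classified, queue


def adjacency(edges):
    """Undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Toy input (hypothetical): a triangle, a star, a path, and a 4-cycle.
adj = adjacency([(1, 2), (2, 3), (1, 3),
                 (4, 5), (4, 6), (4, 7),
                 (8, 9), (9, 10),
                 (11, 12), (12, 13), (13, 14), (14, 11)])
components = [{1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10}, {11, 12, 13, 14}]
classified, queue = triage(components, adj)
print([label for label, _ in classified], list(queue))
```

In this sketch the 4-cycle matches no rule, so it is the only component placed back in the queue.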
19. Methodology
2. Visualization – Projection
After structure detection, we build a structure-to-structure adjacency
matrix whose edge weights indicate the number of edges between the
nodes of each pair of structures;
Although smaller than the original matrix, for million-scale graphs
the struct matrix is still too large to fit on the screen;
For this reason we create a dense matrix according to a linear
proportion (x, y) → (ρx, ρy) given by:
$$\rho_x = (Res_x - 1)\,\frac{x - x_{min}}{x_{max} - x_{min}} + \frac{1}{2}, \qquad \rho_y = (Res_y - 1)\,\frac{y - y_{min}}{y_{max} - y_{min}} + \frac{1}{2} \tag{1}$$
where (x, y) are points of the original matrix and Resx, Resy are the
target resolutions; the higher the resolution, the more detail is shown
– these parameters allow details to be explored interactively.
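The two steps on this slide, building the structure-to-structure matrix and projecting it onto a fixed resolution, can be sketched as follows. Several choices here are assumptions rather than details from the slides: the result of Equation (1) is truncated to an integer pixel index (so the 1/2 term acts as rounding), weights that land on the same pixel are summed, and edges inside a structure are counted on the diagonal. The names `struct_matrix`, `project`, and `dense_matrix` are hypothetical.

```python
from collections import Counter


def struct_matrix(edges, structure_of):
    """Sparse symmetric matrix {(i, j): number of edges between structures i and j}."""
    weights = Counter()
    for u, v in edges:
        i, j = structure_of[u], structure_of[v]
        weights[(i, j)] += 1
        if i != j:
            weights[(j, i)] += 1
    return weights


def project(x, x_min, x_max, res):
    """Equation (1), truncated to an integer pixel index in [0, res - 1]."""
    if x_max == x_min:
        return 0  # degenerate range: everything maps to one pixel
    return int((res - 1) * (x - x_min) / (x_max - x_min) + 0.5)


def dense_matrix(weights, res_x, res_y):
    """Aggregate the sparse struct matrix into a res_y-by-res_x grid of summed weights."""
    xs = [x for x, _ in weights]
    ys = [y for _, y in weights]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    grid = [[0] * res_x for _ in range(res_y)]
    for (x, y), w in weights.items():
        grid[project(y, y_min, y_max, res_y)][project(x, x_min, x_max, res_x)] += w
    return grid


# Toy input (hypothetical): five nodes assigned to three structures.
structure_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
edges = [(1, 2), (2, 3), (1, 4), (4, 5)]
weights = struct_matrix(edges, structure_of)
grid = dense_matrix(weights, res_x=2, res_y=2)
print(dict(weights), grid)
```

Raising `res_x` and `res_y` spreads the same weights over more pixels, which is the detail-on-demand behavior the resolution parameters are meant to provide.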
21. Methodology
2. Visualization – Layout
We organize the matrix according to structure type, and to number of
edges – size of structures (number of nodes) is given by color.
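The ordering described above can be sketched as a two-level sort. The type order, the descending direction of the edge sort, and the record fields (`id`, `type`, `edges`, `nodes`) are assumptions for illustration; the slides do not specify them.

```python
# Hypothetical block order for structure types along the matrix axes.
TYPE_ORDER = ("clique", "bipartite core", "star", "chain")


def layout_order(structures):
    """Sort by structure type first, then by number of edges (largest first)."""
    rank = {name: k for k, name in enumerate(TYPE_ORDER)}
    return sorted(structures, key=lambda s: (rank[s["type"]], -s["edges"]))


# Toy input (hypothetical): "nodes" is the size a renderer would map to color.
structures = [
    {"id": "a", "type": "star", "edges": 5, "nodes": 6},
    {"id": "b", "type": "clique", "edges": 2, "nodes": 4},
    {"id": "c", "type": "star", "edges": 9, "nodes": 12},
    {"id": "d", "type": "chain", "edges": 1, "nodes": 3},
]
ordered = layout_order(structures)
row_of = {s["id"]: row for row, s in enumerate(ordered)}
print(row_of)
```

The resulting `row_of` mapping gives each structure its row and column index in the struct matrix.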
24. Experiments
Experiments–Real datasets–WWW-barabasi
WWW-barabasi: webpages and links between them.
Stars (st) and false stars (fs) refer to webpages with many out-links.
Most webpages have fewer than one thousand connections;
however, a few have an unusually high number, in the thousands.
25. Experiments
Experiments–Real datasets–Road nets
[Figure panels: Pennsylvania, California, Texas]
The three road graphs have a similar structure – all U.S. roads;
There is a hierarchical connectivity: bigger to smaller cities;
Surprising grid-like (due to symmetry) structure: intersections refer to
hub cities, and lines refer to inter-city paths.
26. Experiments
Experiments–Real datasets–Road nets
Comparison: Structure-to-structure vs Node-to-node.
[Figure panels: California (structure-to-structure), California (node-to-node)]
Main differences:
1 The partitioning according to structures;
2 The ordering by number of edges to other structures;
3 There is a hierarchical connectivity: bigger to smaller cities;
4 Surprising grid-like structure: intersections refer to hub cities, and
lines refer to inter-city paths.
27. Experiments
Experiments–Real datasets–DBLP
[Figure panels: overall view, FC-FC zoom]
DBLP is mainly characterized by false stars – possibly because
advisors have students, and students connect to one another;
By zooming into FC-FC, one can see outliers, for instance k3 = “The
Biomolecular Interaction Network Database and related tools 2005
update” (75 authors).
28. Conclusions
Contributions
Visualization technique: we introduce a processing and visualization
methodology that puts together algorithmic techniques and design in
order to reach large-scale visualizations;
Analytical scalability: our technique extends the most scalable
technique found in the literature; plus, it is engineered to plot millions
of edges in a matter of seconds;
Practical analysis: we show that large-scale graphs have well-defined
behaviors concerning the distribution of structures, their size, and
how they are related to one another; finally, using a standard
laptop, our technique allowed us to experiment on real, large-scale
graphs coming from domains of high impact, i.e., WWW, Wikipedia,
Roadnet, and DBLP.
29. References
[1] U. Kang and C. Faloutsos, “Beyond ‘caveman communities’: Hubs
and spokes for graph compression and mining,” in ICDM, 2011, pp.
300–309.
[2] D. Koutra, U. Kang, J. Vreeken, and C. Faloutsos, “VoG:
Summarizing and understanding large graphs,” in SDM, 2014.