Graph Processing and Mining
in the Era of Big Data
Chengqi Zhang
Centre for Quantum Computation & Intelligent Systems (QCIS)
University of Technology, Sydney (UTS)
Outline
 Background
 Challenges and Opportunities
 Our Work: Graph Semantics
 Our Work: Graph Mining
 Our Work: Query Processing
 Our Work: Indexing
 Our Work: Computing Models
 Graph Processing System Design
 Future Developments
Graph Everywhere!
Big Data Characteristics
Volume
• Petabytes
• Records
• Transactions
Velocity
• Batch
• Real time
• Streaming
Variety
• Structured
• Unstructured
• Semi-structured
Graph in Big Data: Volume
• 1.23 billion active users in 2013
• 190 friends/user on average
• 500 TB data/day in 2012
• 2.1 billion webpages in 2000
• 15 billion edges in 2000
• 20 PB data/day in 2008
• 180-200 PB data in 2011
• 6.5 PB data + 50 TB/day in 2009
Graph in Big Data: Velocity
• Fast flowing data
• Evolving data structures and relationships
Graph in Big Data: Variety
• Directed vs Undirected
• Labeled vs Unlabeled
• Weighted vs Unweighted
• Heterogeneous vs Homogeneous
Challenges and Opportunities
New Graph Semantics (Variety)
New Query Processing Algorithms (Volume & Velocity)
New Indexing Techniques (Volume & Velocity)
New Computing Models (Volume)
New Graph Mining Tasks (Variety)
New Graph Semantics
Traditional (Google)
• Input: keywords
• Output: webpages containing keywords
• Ranked by PageRank
New (Google)
• Input: keywords
• Output: knowledge graph/subgraph
• Ranking should consider both structural and content information
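The traditional ranking named above can be illustrated with a minimal power-iteration sketch of PageRank. The damping factor 0.85, the iteration count, the function name `pagerank`, and the toy link graph are assumptions chosen for illustration; they do not come from the talk.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [successors]}."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: an edge u -> v means page u links to page v.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

Note that this score depends only on link structure; the "New" setting above asks for rankings that also take content into account.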
New Graph Mining Tasks
Chemical Compound Database → Chemical Features
• Team of Experts: Several Years
• Graph Mining: Several Hours
New Query Processing Algorithms
• Location: spatial query processing, nearest neighbor search …
• Relationship: link analysis, shortest path search, community detection …
• Text: text processing, string matching, semantic analysis …
All of these should be processed in milliseconds
New Indexing Techniques
Traditional: webpages, files ?
Hash table, B-tree, Inverted Index …
New: subgraphs, trees, paths ?
What’s more
Graph is Frequently Changing…
New Computing Models
Single Machine vs Multiple Machines
Internal Algorithms vs External Algorithms
Single Core vs Multiple Cores
Our Work: Graph Semantics
Structural Keyword Search
Query: Jim, data mining
Traditional: Content Keyword Search
New: Structural Keyword Search
[Figure: results for the query "Jim, data mining" under content keyword search and under structural keyword search]
Our Work:
• ICDE’07: Finding Top-K Min-Cost Connected Trees in Databases
• SIGMOD’09: Keyword Search in Databases: The Power of RDBMS
• Morgan & Claypool 2009 (Book): Keyword Search in Databases
• VLDBJ’11: Scalable Keyword Search on Large Data Streams
• ICDE’11 & TKDE’12: Computing Structural Statistics by Keywords in Databases
Graph Matching
[Figure: a match between the nodes of Graph 1 and Graph 2, and a match of a pattern against a graph]
Our Work:
• EDBT’12: Finding Top-K Similar Graphs in Graph Databases
• CIKM’11 & VLDBJ’13: High Efficiency and Quality: Large Graphs Matching
• VLDB’14: Leveraging Graph Dimensions in Online Graph Search
Community Detection
What is a community in a graph?
A cohesive subgraph?
A dense subgraph?
Everyone is highly connected to others?
Everyone is within a small distance of others?
An Example: k-core
1-core
2-core
3-core
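The k-core example can be made concrete with a short peeling sketch. It uses the standard definition (the k-core is the largest subgraph in which every vertex has at least k neighbours inside the subgraph). The function names and the toy graph are assumptions for illustration, and the simple minimum search makes this a quadratic-time version rather than an optimized one.

```python
def core_numbers(adj):
    """Return {node: core number} for an undirected graph {node: set(neighbours)}."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a vertex of minimum remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def k_core(adj, k):
    """Vertices of the k-core: those whose core number is at least k."""
    core = core_numbers(adj)
    return {v for v in adj if core[v] >= k}


# Hypothetical graph: a 4-clique {1,2,3,4}, node 5 attached to 1 and 2, node 6 attached to 5.
toy = {
    1: {2, 3, 4, 5}, 2: {1, 3, 4, 5}, 3: {1, 2, 4},
    4: {1, 2, 3}, 5: {1, 2, 6}, 6: {5},
}
three_core = k_core(toy, 3)
two_core = k_core(toy, 2)
```

The cores are nested: every vertex of the 3-core also belongs to the 2-core and the 1-core.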
Community Detection
[Figure: an example graph with its 3-core, 4-clique, 3-edge connected component (3-edge-cc), and 4-truss]
Other Semantics?
Our Work:
• SIGMOD’13: Efficiently Computing k-Edge Connected Components via Graph Decomposition
• SIGMOD’14: Querying k-truss Community in Large and Dynamic Graphs
• VLDB’15: Influential Community Search in Large Networks
• KDD’15: Locally Densest Subgraph Discovery
Influential Community (VLDB’15)
Which are the most influential research groups?
A Collaboration Network
Locally Densest Subgraph (KDD’15)
Which are the most representative dense subgraphs?
Our Work: Graph Mining
Graph Classification
[Figure: a graph database of positively (+) and negatively (-) labeled graphs feeding three classification pipelines]
Traditional: 3 Phases
Graph Database → Frequent Subgraphs → Optimal Subgraphs → Classifier
Our work (CIKM’12): 2 Phases
Graph Database → Optimal Subgraphs (Direct Selection) → Classifier
Our work (PR’15): 1 Phase
Graph Database → Optimal Subgraphs and Classifier (Direct Selection)
Our Work:
• CIKM’12: Graph Classification: A Diversified Discriminative Feature Selection Approach
• ICDE’13: Graph Stream Classification using Labeled and Unlabeled Graphs
• IJCAI’13: Graph Classification with Imbalanced Class Distributions and Noise
• TKDE’14: Bag Constrained Structure Pattern Mining for Multi-Graph Classification
• SDM’14: Multi-Graph Learning with Positive and Unlabeled Bags
• ICDM’14: Multi-Graph-View Learning for Graph Classification
• IJCAI’15: Multi-Graph-View Learning for Complicated Object Classification
• TKDE’15: CogBoost: Boosting for Fast Cost-sensitive Graph Classification
• PR’15: Finding the Best not the Most: Regularized Loss Minimization Subgraph Selection for Graph Classification
Our Work: Query Processing
Polynomial Delay
Enumeration Problems in Graph?
• Structural keyword search
• Community detection
• Graph pattern matching
• Similar graph search
Polynomial Time w.r.t. Input? Impossible! The output can be exponential.
So…
Polynomial Total: total time is polynomial in Input+Output
Possible, but…
New Solution
Polynomial Delay: the delay time between consecutive answers is polynomial in the Input
Total time is still large, but…
[Figure: timelines for Polynomial Total ("Many answers! Can’t you be faster?") and Polynomial Delay ("How about this?")]
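A minimal sketch of polynomial-delay enumeration, using a deliberately simple problem that is not one of the talk's own: listing all source-to-target paths in a directed acyclic graph. The restriction to DAGs, the function names, and the toy graph are assumptions for illustration. The search only enters nodes that can still reach the target, so every branch it explores ends in an answer and the work between two consecutive answers stays polynomial in the graph size, even when the number of paths is exponential.

```python
def reachable_to(adj, target):
    """Nodes from which `target` can be reached (search on the reversed graph)."""
    reverse = {}
    for u, successors in adj.items():
        for v in successors:
            reverse.setdefault(v, []).append(u)
    seen, stack = {target}, [target]
    while stack:
        v = stack.pop()
        for u in reverse.get(v, []):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def enumerate_paths(adj, source, target):
    """Yield every source-to-target path of a DAG, one answer at a time."""
    useful = reachable_to(adj, target)
    if source not in useful:
        return
    path = [source]

    def extend(v):
        if v == target:
            yield list(path)
            return
        for w in adj.get(v, []):
            if w in useful:  # never enter a dead end
                path.append(w)
                yield from extend(w)
                path.pop()

    yield from extend(source)


# Hypothetical DAG; "d" is a dead end that the pruning skips.
dag = {"s": ["a", "b", "d"], "a": ["t", "c"], "b": ["a", "t"], "c": ["t"], "d": []}
first_answer = next(enumerate_paths(dag, "s", "t"))
all_answers = list(enumerate_paths(dag, "s", "t"))
```

Because the function is a generator, a caller can consume the first answer immediately instead of waiting for the full result set.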
Our Work:
• ICDE’09: Querying Communities in Relational Databases
• Algorithmica’13: Fast Maximal Cliques Enumeration in Sparse Graphs
• EDBT’15: Efficiently Computing Top-K Shortest Path Join
• VLDB’15: Optimal Enumeration - Efficient Top-k Tree Matching
Diversified Graph Search
Enumeration Problems in Graph
• Structural keyword search
• Community detection
• Graph pattern matching
• Frequent graph pattern mining
• …
[Figure: a graph with its Top-6 Answers and its Top-6 Diversified Answers]
Top-K Densest Communities? Consider Diversity?
Our Work:
• VLDB’12: Diversifying Top-K Results
• CIKM’12: Graph Classification: A Diversified Discriminative Feature Selection Approach
• VLDB’13 & VLDBJ’15: Top-K Structural Diversity Search in Large Networks
• ICDE’15: Diversified Top-K Clique Search
Diversified Top-K Cliques (ICDE’15)
[Figure: an example graph on nodes A to K, highlighting each of the three results below]
Maximum Clique
Top-2 Maximum Cliques: Too much overlap!
Diversified Top-2 Maximum Cliques: Cover All Nodes!
Problem Statement:
Compute k Cliques to Cover Maximum Number of Nodes
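The problem statement can be illustrated with a small sketch. It is a baseline for intuition only, not the ICDE’15 algorithm: it enumerates maximal cliques with the basic Bron-Kerbosch recursion and then applies the standard greedy heuristic for maximum coverage. All function names and the toy graph are assumptions, and enumerating every maximal clique is only practical on small graphs.

```python
def build_graph(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def greedy_diversified_cliques(adj, k):
    """Pick up to k cliques, each time taking the one that covers most new nodes."""
    candidates = maximal_cliques(adj)
    covered, chosen = set(), []
    for _ in range(k):
        best = max(candidates, key=lambda c: len(c - covered), default=None)
        if best is None or not (best - covered):
            break
        chosen.append(best)
        covered |= best
    return chosen, covered


# Hypothetical graph: two 4-cliques sharing B, C, D, plus a triangle F, G, H linked by E-F.
graph = build_graph([
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("B", "E"), ("C", "E"), ("D", "E"),
    ("F", "G"), ("F", "H"), ("G", "H"), ("E", "F"),
])
chosen, covered = greedy_diversified_cliques(graph, 2)
```

In this toy graph the two largest cliques overlap in three nodes and together cover only five, while the greedy choice pairs one 4-clique with the triangle and covers seven.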
Our Work: Indexing
Shortest Path Computation
• Dijkstra’s Algorithm? A* Algorithm? Traverse the whole graph in worst case
• Precompute all-pair shortest paths? Impractical!
• Our approach (VLDBJ’12): Compute a subset of pairs
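To make the worst case concrete, here is a minimal sketch of Dijkstra's algorithm that also reports how many nodes it settles before reaching the target. The function name, the returned counter, and the toy path graph are assumptions for illustration; this is the textbook online search, not the index-based approach of the VLDBJ’12 paper.

```python
import heapq


def dijkstra(adj, source, target):
    """Return (distance, settled_count); adj is {node: [(neighbour, weight), ...]}."""
    dist = {source: 0}
    settled = set()
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in settled:
            continue  # stale heap entry
        settled.add(v)
        if v == target:
            return d, len(settled)
        for w, weight in adj.get(v, []):
            nd = d + weight
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return float("inf"), len(settled)


# Hypothetical undirected path graph 0 - 1 - ... - 9 with unit weights.
path_graph = {i: [] for i in range(10)}
for i in range(9):
    path_graph[i].append((i + 1, 1))
    path_graph[i + 1].append((i, 1))

distance, settled_count = dijkstra(path_graph, 0, 9)
```

Querying the two endpoints of the path forces the search to settle all ten nodes, which is the whole-graph traversal that precomputed distances aim to avoid.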
Our Work:
• VLDBJ’12: The Exact Distance to Destination in Undirected World
• VLDB’13: Top-K Nearest Keyword Search on Large Graphs
• VLDBJ’13: Computing Weight Constraint Reachability in Large Networks
• SIGMOD’15: Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity
Our Work: Computing Models
I/O Efficient Computation
[Figure: memory hierarchy with speed and size per level, annotated "Our Focus"]
• Processor (Control, Datapath) Registers: 1 ns, 100 B
• On-Chip Cache: 5 ns, KB
• Second Level Cache (SRAM): 10 ns, MB
• Main Memory (DRAM): 100 ns, GB
• Secondary Storage (Disk): 10 ms, TB
• Tertiary Storage (Tape): 10 sec, PB
Graph Problems
Main Memory vs Disk
Sequential I/O vs Random I/O
External vs Semi-external
Partition based vs Nested loop based
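As a sketch of the semi-external setting, assuming its usual meaning (per-node state fits in main memory while the edge list stays on disk), the example below counts connected components with a union-find array and one sequential scan of an edge file. The file format (one "u v" pair per line), the function names, and the toy data are assumptions for illustration; the talk's own algorithms target harder problems such as SCCs and DFS.

```python
import os
import tempfile


def find(parent, v):
    """Union-find root lookup with path halving."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def count_components(edge_path, num_nodes):
    """Count connected components using O(|V|) memory and one sequential edge scan."""
    parent = list(range(num_nodes))  # only per-node state is kept in memory
    components = num_nodes
    with open(edge_path) as edge_file:  # edges are read sequentially, never stored
        for line in edge_file:
            u, v = map(int, line.split())
            root_u, root_v = find(parent, u), find(parent, v)
            if root_u != root_v:
                parent[root_u] = root_v
                components -= 1
    return components


# Hypothetical edge file for a 6-node graph: {0,1,2}, {3,4}, and the isolated node 5.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0 1\n1 2\n3 4\n")
    edge_path = tmp.name

num_components = count_components(edge_path, num_nodes=6)
os.remove(edge_path)
```

The access pattern is the point of the sketch: the disk sees only sequential I/O, and random access is confined to the in-memory node array.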
Our Work:
• EDBT’12: I/O Cost Minimization: Reachability Queries Processing over Massive Graphs
• SIGMOD’13 & VLDBJ’14: I/O Efficient: Computing SCCs in Massive Graphs
• ICDE’14: Contract & Expand: I/O Efficient SCCs Computing
• SIGMOD’15: Divide and Conquer - I/O Efficient Depth-First Search
Parallel Computation
[Figure: a multicore machine (pairs of cores with separate L1 caches, an L2 cache, and a switch, under one memory) and a distributed cluster (machines with their own CPU, memory, and disk, connected by a network)]
Multicore
• Computation Sensitive
• Shared Memory
• Separated L1 Cache
• Reduce Cache Miss
• Divide Tasks
Distributed Computing
• Data Sensitive
• Shared Nothing
• Separated CPU, Memory, Disk
• Reduce Communication
• Divide Data
MapReduce, BSP…
Comparison…
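The "divide data" style of MapReduce can be sketched in a single process. The example computes node degrees from edge partitions with explicit map, shuffle, and reduce steps. The function names and the toy partitions are assumptions for illustration; a real deployment would run the map and reduce calls on different machines, with the shuffle as the communication step to minimize.

```python
from collections import defaultdict


def map_edge(edge):
    """Map: each edge contributes one to the degree of both endpoints."""
    u, v = edge
    yield u, 1
    yield v, 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key (the communication step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_degree(node, ones):
    """Reduce: sum the contributions collected for one node."""
    return node, sum(ones)


def degrees_mapreduce(edge_partitions):
    mapped = [pair
              for partition in edge_partitions
              for edge in partition
              for pair in map_edge(edge)]
    return dict(reduce_degree(node, ones) for node, ones in shuffle(mapped).items())


# Hypothetical edge list split into two partitions, as if stored on two machines.
partitions = [[("a", "b"), ("a", "c")], [("b", "c"), ("c", "d")]]
degrees = degrees_mapreduce(partitions)
```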
Our Work:
• VLDB’10: Ten Thousand SQLs: Parallel Keyword Queries Computing
• SIGMOD’14: Scalable Big Graph Processing in MapReduce
• VLDB’15: Scalable Subgraph Enumeration in MapReduce
Graph Processing System Design
Objective 1:
Extracting Primitive Operators from DB and DM
Challenge: Completeness & Minimality
Objective 2:
Scalable Processing Techniques
Challenge: Guarantee of “Optimality”
Objective 3:
Characterizing Real-time Tractability
Challenge: Hard & Risky
Graph System Structure
Data Environments
Static, Streaming, Dynamic Graph, Probabilistic, Spatial, Evolving Graph, Random Graph
Computing Models
Main-memory, Distributed/Cloud/MapReduce/BSP/Spark/Pregel, SSD, Parallel/Multi-core, External/Semi-External
Advanced Applications
Social Network (Twitter, Facebook), Geo Social (Checkin), Chemical, Biological, Web Graph (Wiki), Collaboration (DBLP), Public Opinion Mining
Query Primitives
• Given a Graph Pattern: Similarity, Pattern, Sub/Super Graph
• Given a Set of Nodes: Topology: SimRank, Connectivity, Path, K-hop, Flow, Community, Reachability
• Given a Set of Keywords: Knowledge Graph, Attributed Graph, RDF
Mining Primitives
• Subgraph Based: Cohesive Subgraph Mining, Community Detection, Graph Clustering, Partition, Frequent Subgraph Mining
• Aggregate Based: PageRank, Outlier, Anonymity, Influence Maximization
Primitive Computing Paradigms
Joins, BFS, DFS, Topological Sort, Spanning Tree, Diameter
Our Current Development
Computing Models
SIGMOD’15b, VLDB’15a, VLDBJ’14, SIGMOD’14a, SIGMOD’13a, EDBT’12b, VLDB’10
Advanced Applications
VLDB’15c, VLDBJ’13b, VLDB’13a, TKDE’12, ICDE’11, CIKM’11b
Query Primitives
VLDBJ’15, SIGMOD’15a, VLDB’15b, KDD’15, ICDE’15b, VLDB’13b, VLDBJ’12, EDBT’12a, ICDE’09b, ICDE’07
Mining Primitives
Algorithmica’13, CIKM’12, CIKM’11a, IJCAI’15, TKDE’14, SDM’14, ICDE’13a, TKDE’15, ICDE’13b, ICDM’13, IJCAI’13, ICDM’14, PR’15
Primitive Computing Paradigms
ICDE’15a, EDBT’15, ICDE’14, VLDB’14, SIGMOD’13b, VLDBJ’13a, VLDB’12
Data Environments
SIGMOD’14b, VLDBJ’11, SIGMOD’09, ICDE’09a, EDBT’08, SSDBM’08
Future Developments
Social Network Recommendation
Location Based Social Network
Big Graph Processing in Cloud
Massive Graph Matching
Graph Summary
Graph Stream
Personalized Community Search
High Influence Community Search
Graph Clustering in Cloud
Massive Uncertain Graph
Conclusion
The Era of Big Data: Semantics, Mining and Query Processing, Indexing, Computing Model
Big Graph: Larger, More Complex
More Challenges!
More Opportunities to Explore the Unknown World!
Acknowledgements
1. Dr Lu Qin
2. Prof. Xingquan Zhu
3. Mr Jia Wu
4. Mr Shirui Pan
References
1. Jeffrey Xu Yu, Lu Qin, and Lijun Chang: Keyword Search in Databases, published by
Morgan & Claypool, 2009.
2. Xin Huang, Hong Cheng, Rong-Hua Li, Lu Qin, and Jeffrey Xu Yu: Top-K Structural
Diversity Search in Large Networks, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 24, No. 3, Pages 319-343, 2015.
3. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin: I/O Efficient:
Computing SCCs in Massive Graphs, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 24, No. 2, Pages 245-270, 2014.
4. Yuanyuan Zhu, Lu Qin, Jeffrey Xu Yu, Yiping Ke, and Xuemin Lin: High Efficiency and
Quality: Large Graphs Matching, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 22, No. 3, Pages 345-368, 2013.
5. Miao Qiao, Hong Cheng, Lu Qin, Jeffrey Xu Yu, Philip S. Yu, and Lijun Chang: Computing
Weight Constraint Reachability in Large Networks, in the International Journal on Very
Large Data Bases (VLDBJ), Vol. 22, No. 3, Pages 275-294, 2013.
6. Lijun Chang, Jeffrey Xu Yu, and Lu Qin: Fast Maximal Cliques Enumeration in Sparse
Graphs, in Algorithmica, Vol. 66, No. 1, Pages 173-186, 2013.
7. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Computing Structural Statistics by Keywords in
Databases. Invited paper by IEEE Transactions on Knowledge and Data Engineering
(TKDE), Vol. 24, No. 10, Pages 1731-1746, 2012.
8. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Hong Cheng, and Miao Qiao: The Exact Distance to
Destination in Undirected World, in the International Journal on Very Large Data Bases
(VLDBJ), Vol. 21, No. 6, Pages 869-888, 2012.
9. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Scalable Keyword Search on Large Data Streams,
in the International Journal on Very Large Data Bases (VLDBJ), Vol. 20, No. 1, Pages 35-
57, 2011.
10. Lu Qin, Rong-Hua Li, Lijun Chang, and Chengqi Zhang: Locally Densest Subgraph Discovery,
to appear in Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and
Data Mining (KDD'15), 2015.
11. Longbin Lai, Lu Qin, Xuemin Lin, and Lijun Chang: Scalable Subgraph Enumeration in
MapReduce, to appear in Proceedings of the Very Large Database Endowment (VLDB), 2015.
12. Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, Wenjie Zhang: Index-based Optimal
Algorithms for Computing Steiner Components with Maximum Connectivity, to appear in
Proceedings of ACM Conference on Management of Data (SIGMOD'15), 2015.
13. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao Shang: Divide & Conquer: I/O Efficient
Depth First Search, to appear in Proceedings of ACM Conference on Management of Data
(SIGMOD'15), 2015.
14. Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, and Jian Pei: Efficiently Computing Top-K
Shortest Path Join, in Proceedings of the 18th International Conference on Extending
Database Technology (EDBT'15), 2015.
15. Rong-Hua Li, Jeffrey Xu Yu, Lu Qin, Rui Mao, and Tan Jin: On Random Walk Based Graph
Sampling, in the 31st IEEE International Conference on Data Engineering (ICDE'15), 2015.
16. Long Yuan, Lu Qin, Xuemin Lin, Lijun Chang, and Wenjie Zhang: Diversified Top-K Clique
Search, in the 31st IEEE International Conference on Data Engineering (ICDE'15), 2015.
17. Lijun Chang, Xuemin Lin, Wenjie Zhang, Jeffrey Xu Yu, Ying Zhang, and Lu Qin: Optimal
Enumeration: Efficient Top-k Tree Matching, in Proceedings of the Very Large Database
Endowment (VLDB), Vol. 8, No. 5, Pages 533-544, 2015.
18. Rong-Hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao: Influential Community Search in Large
Networks, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 8, No. 5,
Pages 509-520, 2015.
19. Yuanyuan Zhu, Jeffrey Xu Yu, and Lu Qin: Leveraging Graph Dimensions in Online Graph
Search, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 8, No. 1, Pages
85-96, 2015.
20. Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu: Querying K-Truss
Community in Large and Dynamic Graphs, in Proceedings of ACM Conference on Management
of Data (SIGMOD'14), Pages 1311-1322, 2014.
21. Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, and Xuemin Lin: Scalable
Big Graph Processing in MapReduce, in Proceedings of ACM Conference on Management of
Data (SIGMOD'14), Pages 827-838, 2014.
22. Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu: Contract & Expand: I/O Efficient SCCs
Computing, in the 30th IEEE International Conference on Data Engineering (ICDE'14),
Pages 208-219, 2014.
23. Xin Huang, Hong Cheng, Rong-Hua Li, Lu Qin, and Jeffrey Xu Yu: Top-K Structural Diversity
Search in Large Networks, in Proceedings of the Very Large Database Endowment (VLDB),
Vol. 6, No. 13, Pages 1618-1629, 2013.
24. Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, and Wentao Tian: Top-K Nearest Keyword
Search on Large Graphs, in Proceedings of the Very Large Database Endowment (VLDB),
Vol. 6, No. 10, Pages 901-912, 2013.
25. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Xuemin Lin, Chengfei Liu, and Weifa Liang: Efficiently
Computing k-Edge Connected Components via Graph Decomposition, in Proceedings of ACM
Conference on Management of Data (SIGMOD'13), Pages 205-216, 2013.
26. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin: I/O Efficient:
Computing SCCs in Massive Graphs, in Proceedings of ACM Conference on Management of
Data (SIGMOD'13), Pages 181-192, 2013.
27. Yuanyuan Zhu, Jeffrey Xu Yu, Hong Cheng, and Lu Qin: Graph Classification: A Diversified
Discriminative Feature Selection Approach, in Proceedings of 2012 ACM International
Conference on Information and Knowledge Management (CIKM'12), Pages 205-214, 2012.
28. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Diversifying Top-K Results, in Proceedings of the
Very Large Database Endowment (VLDB), Vol. 5, No. 11, Pages 1124-1135, 2012.
29. Yuanyuan Zhu, Lu Qin, and Jeffrey Xu Yu: Finding Top-K Similar Graphs in Graph
Databases, in Proceedings of the 15th International Conference on Extending Database
Technology (EDBT'12), Pages 456-467, 2012.
30. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Qing Zhu, and Xiaofang Zhou: I/O Cost Minimization:
Reachability Queries Processing over Massive Graphs, in Proceedings of the 15th
International Conference on Extending Database Technology (EDBT'12), Pages 468-479,
2012.
31. Yuanyuan Zhu, Lu Qin, Jeffrey Xu Yu, Yiping Ke, and Xuemin Lin: High Efficiency and Quality:
Large Graphs Matching, in Proceedings of 2011 ACM International Conference on Information and
Knowledge Management (CIKM'11), Pages 1755-1764, 2011.
32. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Yuanyuan Zhu, and Haixun Wang: Finding Information Nebula
over Large Networks, in Proceedings of 2011 ACM International Conference on Information and
Knowledge Management (CIKM'11), Pages 1465-1474, 2011.
33. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Computing Structural Statistics by Keywords in
Databases, in Proceedings of the 27th IEEE International Conference on Data Engineering
(ICDE'11), Pages 363-374, 2011.
34. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Ten Thousand SQLs: Parallel Keyword Queries Computing,
in Proceedings of the Very Large Database Endowment (VLDB), Vol. 3, No. 1, Pages 58-69, 2010.
35. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Keyword Search in Databases: The Power of RDBMS, in
Proceedings of ACM Conference on Management of Data (SIGMOD'09), Pages 681-694, 2009.
36. Lu Qin, Jeffrey Xu Yu, Lijun Chang, and Yufei Tao: Querying Communities in Relational Databases,
in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE'09), Pages
724-735, 2009.
37. Lu Qin, Jeffrey Xu Yu, Lijun Chang, and Yufei Tao: Scalable Keyword Search on Large Data
Streams, in Proceedings of the 25th IEEE International Conference on Data Engineering
(ICDE'09), Short Paper, Pages 1199-1202, 2009.
38. Lu Qin, Jeffrey Xu Yu, Bolin Ding, and Yoshiharu Ishikawa: Monitoring Aggregate k-NN Objects in
Road Networks, in Proceedings of the 20th International Conference on Scientific and Statistical
Database Management (SSDBM’08), Pages 168-186, 2008.
39. Bolin Ding, Jeffrey Xu Yu, and Lu Qin: Finding Time-Dependent Shortest Paths over Large Graphs,
in Proceedings of the 11th International Conference on Extending Database Technology (EDBT'08),
Pages 205-216, 2008.
40. Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin, Xiao Zhang, and Xuemin Lin: Finding Top-k Min-Cost
Connected Trees in Databases, in Proceedings of the 23rd IEEE International Conference on Data
Engineering (ICDE'07), Pages 836-845, 2007. (Best Student Paper)
41. Jia Wu, Xingquan Zhu, Chengqi Zhang, Philip S. Yu. Bag Constrained Structure Pattern
Mining for Multi-Graph Classification. IEEE Transactions on Knowledge and Data
Engineering (TKDE), Vol 26, No 10, pp.2382-2396, 2014.
42. Jia Wu, Zhibin Hong, Shirui Pan, Xingquan Zhu, Chengqi Zhang, Zhihua Cai. Multi-Graph
Learning with Positive and Unlabeled Bags. SDM 2014: 217-225.
43. Jia Wu, Xingquan Zhu, Chengqi Zhang, Zhihua Cai: Multi-instance Multi-graph Dual
Embedding Learning. ICDM’13, 2013: 827-836.
44. Jia Wu, Shirui Pan, Xingquan Zhu, Chengqi Zhang. Multi-Graph-View Learning for
Complicated Object Classification. International Joint Conference on Artificial Intelligence
(IJCAI’15), 2015
45. Shirui Pan, Jia Wu, and Xingquan Zhu, "CogBoost: Boosting for Fast Cost-sensitive Graph
Classification", IEEE Transactions on Knowledge and Data Engineering (TKDE), Accepted,
2015.
46. Shirui Pan, Xingquan Zhu, Chengqi Zhang, and Philip S. Yu. "Graph Stream Classification
using Labeled and Unlabeled Graphs", International Conference on Data Engineering
(ICDE’13), 2013
47. Shirui Pan and Xingquan Zhu. "CGStream: Continuous Correlated Graph Query for Data
Streams". 21st ACM International Conference on Information and Knowledge Management
(CIKM), 2012.
48. Shirui Pan and Xingquan Zhu. "Graph Classification with Imbalanced Class Distributions and
Noise", 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013
49. Jia Wu, Zhibin Hong, Shirui Pan, Xingquan Zhu, Chengqi Zhang, Zhihua Cai. "Multi-graph-
view Learning for Graph Classification", Proceedings of the 2014 IEEE International
Conference on Data Mining (ICDM), 2014
50. Shirui Pan, Jia Wu, Xingquan Zhu, Guodong Long, Chengqi Zhang, “Finding the Best not the
Most: Regularized Loss Minimization Subgraph Selection for Graph Classification”, to
appear in Pattern Recognition (PR), 2015
Thank you!
Questions?

More Related Content

What's hot

Intro to Graphs and Neo4j
Intro to Graphs and Neo4jIntro to Graphs and Neo4j
Intro to Graphs and Neo4jjexp
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersSymeon Papadopoulos
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBMohamed Taher Alrefaie
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
Map reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsMap reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsNishant Gandhi
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveSpark Summit
 
OrientDB - the 2nd generation of (Multi-Model) NoSQL
OrientDB - the 2nd generation  of  (Multi-Model) NoSQLOrientDB - the 2nd generation  of  (Multi-Model) NoSQL
OrientDB - the 2nd generation of (Multi-Model) NoSQLLuigi Dell'Aquila
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 
How Graph Databases started the Multi Model revolution
How Graph Databases started the Multi Model revolutionHow Graph Databases started the Multi Model revolution
How Graph Databases started the Multi Model revolutionLuca Garulli
 
Interpreting Relational Schema to Graphs
Interpreting Relational Schema to GraphsInterpreting Relational Schema to Graphs
Interpreting Relational Schema to GraphsNeo4j
 
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...Dippy Aggarwal
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLMLconf
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Building a Graph-based Analytics Platform
Building a Graph-based Analytics PlatformBuilding a Graph-based Analytics Platform
Building a Graph-based Analytics PlatformKenny Bastani
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...Spark Summit
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesKonstantinos Xirogiannopoulos
 

What's hot (20)

Intro to Graphs and Neo4j
Intro to Graphs and Neo4jIntro to Graphs and Neo4j
Intro to Graphs and Neo4j
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
 
Graph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DBGraph databases: Tinkerpop and Titan DB
Graph databases: Tinkerpop and Titan DB
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Map reduce programming model to solve graph problems
Map reduce programming model to solve graph problemsMap reduce programming model to solve graph problems
Map reduce programming model to solve graph problems
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
 
Graph based data models
Graph based data modelsGraph based data models
Graph based data models
 
OrientDB - the 2nd generation of (Multi-Model) NoSQL
OrientDB - the 2nd generation  of  (Multi-Model) NoSQLOrientDB - the 2nd generation  of  (Multi-Model) NoSQL
OrientDB - the 2nd generation of (Multi-Model) NoSQL
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
How Graph Databases started the Multi Model revolution
How Graph Databases started the Multi Model revolutionHow Graph Databases started the Multi Model revolution
How Graph Databases started the Multi Model revolution
 
Interpreting Relational Schema to Graphs
Interpreting Relational Schema to GraphsInterpreting Relational Schema to Graphs
Interpreting Relational Schema to Graphs
 
Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...Employing Graph Databases as a Standardization Model towards Addressing Heter...
Employing Graph Databases as a Standardization Model towards Addressing Heter...
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Building a Graph-based Analytics Platform
Building a Graph-based Analytics PlatformBuilding a Graph-based Analytics Platform
Building a Graph-based Analytics Platform
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 

Chengqi zhang graph processing and mining in the era of big data

  • 1. Graph Processing and Mining in the Era of Big Data Chengqi Zhang Centre for Quantum Computation & Intelligent Systems (QCIS) University of Technology, Sydney (UTS)
  • 2. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 4. Big Data Characteristics Big Data Volume • Petabytes • Records • Transactions Velocity • Batch • Real time • Streaming Variety • Structured • Unstructured • Semi-structured
  • 5. Graph in Big Data: Volume • 1.23 billion active users in 2013 • 190 friends/user on average • 500 TB data/day in 2012 • 2.1 billion webpages in 2000 • 15 billion edges in 2000 • 20 PB data/day in 2008 • 180-200 PB data in 2011 • 6.5 PB data + 50 TB/day in 2009
  • 6. Graph in Big Data: Velocity • Fast flowing data • Evolving data structures and relationships
  • 7. Graph in Big Data: Variety • Directed vs Undirected • Labeled vs Unlabeled • Weighted vs Unweighted • Heterogeneous vs homogeneous
  • 8. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 9. Challenges and Opportunities New Graph Semantics (Variety) New Query Processing Algorithms (Volume & Velocity) New Indexing Techniques (Volume & Velocity) New Computing Models (Volume) New Graph Mining Tasks (Variety)
  • 10. New Graph Semantics Traditional (Google) • Input: keywords • Output: webpages containing keywords • Ranked by PageRank New (Google) • Input: keywords • Output: knowledge graph/subgraph • Ranking should consider both structural and content information
  • 11. New Graph Mining Tasks Chemical Compound Database Chemical Features Team of Experts Several Years Graph Mining Several Hours
  • 12. New Query Processing Algorithms Location: Spatial query processing, nearest neighbor search … Relationship: Link analysis, shortest path search, community detection … Text: Text processing, string matching, semantic analysis … All of these should be processed in milliseconds
  • 13. New Indexing Techniques Traditional: webpages, files → Hash table, B-tree, Inverted Index … New: subgraphs, trees, paths → ? What’s more, the graph is frequently changing…
  • 14. New Computing Models Single Machine vs Multiple Machines Internal Algorithms vs External Algorithms Single Core vs Multiple Cores
  • 15. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 16. Structural Keyword Search [Figure: the query "Jim, data mining" answered by Traditional: Content Keyword Search and by New: Structural Keyword Search] Our Work: • ICDE’07: Finding Top-K Min-Cost Connected Trees in Databases • SIGMOD’09: Keyword Search in Databases: The Power of RDBMS • Morgan & Claypool 2009 (Book): Keyword Search in Databases • VLDBJ’11: Scalable Keyword Search on Large Data Streams • ICDE’11 & TKDE’12: Computing Structural Statistics by Keywords in Databases
  • 17. Graph Matching [Figure: Graph 1 matched against Graph 2; a pattern matched against a graph] Our Work: • EDBT’12: Finding Top-K Similar Graphs in Graph Databases • CIKM’11 & VLDBJ’13: High Efficiency and Quality: Large Graphs Matching • VLDB’14: Leveraging Graph Dimensions in Online Graph Search
  • 18. Community Detection What is a community in a graph? A cohesive subgraph? A dense subgraph? Everyone is highly connected to others? Everyone is within a small distance of the others? An Example: k-core (1-core, 2-core, 3-core)
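The k-core example on this slide can be illustrated with a short sketch. It assumes the standard definition (the k-core is the largest subgraph in which every node has at least k neighbors inside the subgraph) and uses the textbook peeling procedure; the function names `core_numbers` and `k_core` are hypothetical and are not taken from the talk or the cited papers.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} for an undirected simple graph.

    Peeling: repeatedly remove a node of minimum remaining degree.
    This simple version is quadratic; it is meant for illustration only.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])  # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def k_core(edges, k):
    """Nodes of the k-core: all nodes whose core number is at least k."""
    return {v for v, c in core_numbers(edges).items() if c >= k}


# A 4-clique on nodes 1..4, node 5 attached to 1 and 2, node 6 attached to 5.
example_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
                 (5, 1), (5, 2), (6, 5)]
```

In this example the 1-core is the whole graph, the 2-core drops node 6, and the 3-core is the 4-clique alone, which mirrors the nested 1-core, 2-core, 3-core picture on the slide.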
  • 19. Community Detection Graph 3-core 4-clique 3-edge-cc 4-truss ? Other Semantics? Our Work: • SIGMOD’13: Efficiently Computing k-Edge Connected Components via Graph Decomposition • SIGMOD’14: Querying k-truss Community in Large and Dynamic Graphs • VLDB’15: Influential Community Search in Large Networks • KDD’15: Locally Densest Subgraph Discovery
  • 20. Influential Community (VLDB’15) Which are the most influential research groups? A Collaboration Network
  • 21. Locally Densest Subgraph (KDD’15) Which are the most representative dense subgraphs?
  • 22. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 23. Graph Classification [Figure: three pipelines from a labeled (+/-) graph database to a classifier. Traditional: 3 Phases (Graph Database, Frequent Subgraphs, Optimal Subgraphs, Classifier). Our work (CIKM’12): 2 Phases (Graph Database, Optimal Subgraphs, Classifier; Direct Selection). Our work (PR’15): 1 Phase (Graph Database, Optimal Subgraphs, Classifier; Direct Selection)] Our Work: • CIKM’12: Graph Classification: A Diversified Discriminative Feature Selection Approach • ICDE’13: Graph Stream Classification using Labeled and Unlabeled Graphs • IJCAI’13: Graph Classification with Imbalanced Class Distributions and Noise • TKDE’14: Bag Constrained Structure Pattern Mining for Multi-Graph Classification • SDM’14: Multi-Graph Learning with Positive and Unlabeled Bags • ICDM’14: Multi-Graph-View Learning for Graph Classification • IJCAI’15: Multi-Graph-View Learning for Complicated Object Classification • TKDE’15: CogBoost: Boosting for Fast Cost-sensitive Graph Classification • PR’15: Finding the Best not the Most: Regularized Loss Minimization Subgraph Selection for Graph Classification
  • 24. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 25. Polynomial Delay Enumeration Problems in Graph? • Structural keyword search • Community detection • Graph pattern matching • Similar graph search Polynomial Time w.r.t. Input? Output can be exponential Impossible! So… Polynomial Total: Polynomial to Input+Output Possible, but…
  • 26. Polynomial Delay [Figure: two answer timelines. Polynomial Total: "Many answers! Can’t you be faster?" Polynomial Delay: "How about this?"] New Solution: Polynomial Delay: Delay Time Polynomial to Input. Total time is still large, but… Our Work: • ICDE’09: Querying Communities in Relational Databases • Algorithmica’13: Fast Maximal Cliques Enumeration in Sparse Graphs • EDBT’15: Efficiently Computing Top-K Shortest Path Join • VLDB’15: Optimal Enumeration - Efficient Top-k Tree Matching
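To make the notion of polynomial delay concrete, here is a minimal sketch on a different, simpler enumeration problem than the ones in the cited papers: listing all simple paths from `s` to `t` in a directed graph. The total number of paths can be exponential, but the generator below only steps into a vertex after checking that `t` is still reachable from it, so every branch it explores leads to at least one answer and the time between consecutive answers stays polynomial in the graph size. The names `reachable` and `simple_paths` and the adjacency-dict input format are assumptions made for this illustration.

```python
def reachable(adj, src, dst, blocked):
    """Depth-first test: can src reach dst without visiting blocked vertices?"""
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for w in adj.get(u, ()):
            if w not in seen and w not in blocked:
                seen.add(w)
                stack.append(w)
    return False


def simple_paths(adj, s, t):
    """Yield every simple s-t path, one at a time, with polynomial delay."""
    path, on_path = [s], {s}

    def extend(u):
        if u == t:
            yield list(path)
            return
        for w in adj.get(u, ()):
            # Prune dead ends up front: only follow w if t can still be
            # reached from w without reusing a vertex already on the path.
            if w in on_path or not reachable(adj, w, t, on_path):
                continue
            path.append(w)
            on_path.add(w)
            yield from extend(w)
            path.pop()
            on_path.remove(w)

    if reachable(adj, s, t, set()):
        yield from extend(s)


example_graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}
first_answer = next(simple_paths(example_graph, "a", "d"))
```

Because the function is a generator, a caller can consume the first few answers and stop, which is the practical benefit the slide contrasts with waiting for the full polynomial-total computation.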
  • 27. Diversified Graph Search. Enumeration Problems in Graph: • Structural keyword search • Community detection • Graph pattern matching • Frequent graph pattern mining • … Top-K Densest Communities? Consider Diversity? [Figure: a graph, its Top-6 Answers, and its Top-6 Diversified Answers] Our Work: • VLDB’12: Diversifying Top-K Results • CIKM’12: Graph Classification: A Diversified Discriminative Feature Selection Approach • VLDB’13 & VLDBJ’15: Top-K Structural Diversity Search in Large Networks • ICDE’15: Diversified Top-K Clique Search
  • 28. Diversified Top-K Cliques (ICDE’15) [Figure: example graph with nodes A to K] Maximum Clique. Top-2 Maximum Cliques: Too much overlap! Diversified Top-2 Maximum Cliques: Cover All Nodes! Problem Statement: Compute k Cliques to Cover Maximum Number of Nodes
  • 29. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 30. Dijkstra’s Algorithm? Shortest Path Computation A* Algorithm? Traverse the whole graph in worst case Precompute all-pair shortest paths? Impractical! Our approach (VLDBJ’12): Compute a subset of pairs VLDBJ’12 Our Work: • VLDBJ’12: The Exact Distance to Destination in Undirected World • VLDB’13: Top-K Nearest Keyword Search on Large Graphs • VLDBJ’13: Computing Weight Constraint Reachability in Large Networks • SIGMOD’15: Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity
  • 31. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 32. I/O Efficient Computation. Memory hierarchy (speed, size): Processor (Control, Datapath) Registers (1 ns, 100 B); On-Chip Cache (5 ns, KB); Second Level Cache (SRAM) (10 ns, MB); Main Memory (DRAM) (100 ns, GB); Secondary Storage (Disk) (10 ms, TB); Tertiary Storage (Tape) (10 sec, PB). Our Focus: Main Memory (DRAM) and Secondary Storage (Disk). Graph Problems: • Main Memory vs Disk • Sequential I/O vs Random I/O • External vs Semi-external • Partition based vs Nested loop based Our Work: • EDBT’12: I/O Cost Minimization: Reachability Queries Processing over Massive Graphs • SIGMOD’13 & VLDBJ’14: I/O Efficient: Computing SCCs in Massive Graphs • ICDE’14: Contract & Expand: I/O Efficient SCCs Computing • SIGMOD’15: Divide and Conquer - I/O Efficient Depth-First Search
  • 33. Parallel Computation. [Figure: a multicore machine (cores with separate L1 caches, L2 caches, switches, and one memory) next to a distributed cluster (machines with their own CPU, memory, and disk, linked by a network)] Comparison: Multicore: • Computation Sensitive • Divide Tasks • Shared Memory • Separated L1 Cache • Reduce Cache Miss. Distributed Computing (MapReduce, BSP, …): • Data Sensitive • Divide Data • Shared Nothing • Separated CPU, Memory, Disk • Reduce Communication. Our Work: • VLDB’10: Ten Thousand SQLs: Parallel Keyword Queries Computing • SIGMOD’14: Scalable Big Graph Processing in MapReduce • VLDB’15: Scalable Subgraph Enumeration in MapReduce
  • 34. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 35. Graph Processing System Design Objective 1: Extracting Primitive Operators from DB and DM Challenge: Completeness & Minimality Objective 2: Scalable Processing Techniques Challenge: Guarantee of “Optimality” Objective 3: Characterizing Real-time Tractability Challenge: Hard & Risky
  • 36. Graph System Structure Data Environments Static, Streaming, Dynamic Graph, Probabilistic, Spatial, Evolving Graph, Random Graph Computing Models Main-memory, Distributed/Cloud/MapReduce/BSP/Spark/Pregel, SSD, Parallel/Multi-core, External/Semi-External Advanced Applications Social Network (Twitter, Facebook), Geo Social (Checkin), Chemical, Biological, Web Graph (Wiki), Collaboration (DBLP), Public Opinion Mining Query Primitives • Given a Graph Pattern: Similarity, Pattern, Sub/Super Graph • Given a Set of Nodes: Topology: SimRank, Connectivity, Path K-hop, Flow, Community, Reachability • Given a Set of Keywords: Knowledge Graph, Attributed Graph, RDF Mining Primitives • Subgraph Based: Cohesive Subgraph Mining Community Detection Graph Clustering, Partition Frequent Subgraph Mining • Aggregate Based: PageRank, Outlier, Anonymity Influence Maximization Primitive Computing Paradigms Joins, BFS, DFS, Topological Sort, Spanning Tree, Diameter
  • 37. Our Current Development Computing Models SIGMOD’15b, VLDB’15a, VLDBJ’14, SIGMOD’14a, SIGMOD’13a, EDBT’12b, VLDB’10 Advanced Applications VLDB’15c, VLDBJ’13b, VLDB’13a, TKDE’12, ICDE’11, CIKM’11b Query Primitives VLDBJ’15, SIGMOD’15a, VLDB’15b, KDD’15, ICDE’15b, VLDB’13b, VLDBJ’12, EDBT’12a, ICDE’09b, ICDE’07 Mining Primitives Algorithmica’13, CIKM’12, CIKM’11a, IJCAI’15, TKDE’14, SDM’14, ICDE’13a, TKDE’15, ICDE’13b, ICDM’13 IJCAI’13, ICDM’14, PR’15 Primitive Computing Paradigms ICDE’15a, EDBT’15, ICDE’14, VLDB’14, SIGMOD’13b, VLDBJ’13a, VLDB’12, Data Environments SIGMOD’14b, VLDBJ’11, SIGMOD’09, ICDE’09a, EDBT’08, SSDBM’08
  • 38. Outline  Background  Challenges and Opportunities  Our Work: Graph Semantics  Our Work: Graph Mining  Our Work: Query Processing  Our Work: Indexing  Our Work: Computing Models  Graph Processing System Design  Future Developments
  • 39. Future Developments • Social Network Recommendation • Location Based Social Network • Big Graph Processing in Cloud • Massive Graph Matching • Graph Summary • Graph Stream • Personalized Community Search • High Influence Community Search • Graph Clustering in Cloud • Massive Uncertain Graph
  • 40. Conclusion Mining and Query Processing The Era of Big Data Indexing Semantics Computing Model Big Graph: Larger, More Complex More Challenges! More Opportunities to Explore the Unknown World!
  • 41. Acknowledgements 1. Dr Lu Qin 2. Prof. Xingquan Zhu 3. Mr Jia Wu 4. Mr Shirui Pan
  • 42. References 1. Jeffrey Xu Yu, Lu Qin, and Lijun Chang: Keyword Search in Databases, published by Morgan & Claypool, 2009. 2. Xin Huang, Hong Cheng, Rong-Hua Li, Lu Qin, and Jeffrey Xu Yu: Top-K Structural Diversity Search in Large Networks, in the International Journal on Very Large Data Bases (VLDBJ), Vol. 24, No. 3, Pages 319-343, 2015. 3. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin: I/O Efficient: Computing SCCs in Massive Graphs, in the International Journal on Very Large Data Bases (VLDBJ), Vol. 24, No. 2, Pages 245-270, 2014. 4. Yuanyuan Zhu, Lu Qin, Jeffrey Xu Yu, Yiping Ke, and Xuemin Lin: High Efficiency and Quality: Large Graphs Matching, in the International Journal on Very Large Data Bases (VLDBJ), Vol. 22, No. 3, Pages 345-368, 2013. 5. Miao Qiao, Hong Cheng, Lu Qin, Jeffrey Xu Yu, Philip S. Yu, and Lijun Chang: Computing Weight Constraint Reachability in Large Networks, in the International Journal on Very Large Data Bases (VLDBJ), Vol. 22, No. 3, Pages 275-294, 2013. 6. Lijun Chang, Jeffrey Xu Yu, and Lu Qin: Fast Maximal Cliques Enumeration in Sparse Graphs, in Algorithmica, Vol. 66, No. 1, Pages 173-186, 2013. 7. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Computing Structural Statistics by Keywords in Databases. Invited paper by IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 24, No. 10, Pages 1731-1746, 2012. 8. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Hong Cheng, and Miao Qiao: The Exact Distance to Destination in Undirected World, in the International Journal on Very Large Data Bases (VLDBJ), Vol. 21, No. 6, Pages 869-888, 2012. 9. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Scalable Keyword Search on Large Data Streams, in the International Journal on Very Large Data Bases (VLDBJ), Vol. 20, No. 1, Pages 35- 57, 2011. 10. 
Lu Qin, Rong-Hua Li, Lijun Chang, and Chengqi Zhang: Locally Densest Subgraph Discovery, to appear in Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'15), 2015.
  • 43. References 11. Longbin Lai, Lu Qin, Xuemin Lin, and Lijun Chang: Scalable Subgraph Enumeration in MapReduce, to appear in Proceedings of the Very Large Database Endowment (VLDB), 2015. 12. Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, Wenjie Zhang: Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity, to appear in Proceedings of ACM Conference on Management of Data (SIGMOD'15), 2015. 13. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao Shang: Divide & Conquer: I/O Efficient Depth First Search, to appear in Proceedings of ACM Conference on Management of Data (SIGMOD'15), 2015. 14. Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, and Jian Pei: Efficiently Computing Top-K Shortest Path Join, in Proceedings of the 18th International Conference on Extending Database Technology (EDBT'15), 2015. 15. Rong-Hua Li, Jeffrey Xu Yu, Lu Qin, Rui Mao, and Tan Jin: On Random Walk Based Graph Sampling, in the 31st IEEE International Conference on Data Engineering (ICDE'15), 2015. 16. Long Yuan, Lu Qin, Xuemin Lin, Lijun Chang, and Wenjia Zhang: Diversified Top-K Clique Search, in the 31st IEEE International Conference on Data Engineering (ICDE'15), 2015. 17. Lijun Chang, Xuemin Lin, Wenjie Zhang, Jeffrey Xu Yu, Ying Zhang, and Lu Qin: Optimal Enumeration: Efficient Top-k Tree Matching, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 8, No. 5, Pages 533-544, 2015. 18. Rong-Hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao: Influential Community Search in Large Networks, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 8, No. 5, Pages 509-520, 2015. 19. Yuanyuan Zhu, Jeffrey Xu Yu, and Lu Qin: Leveraging Graph Dimensions in Online Graph Search, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 8, No. 1, Pages 85-96, 2015. 20. 
Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu: Querying K-Truss Community in Large and Dynamic Graphs, in Proceedings of ACM Conference on Management of Data (SIGMOD'14), Pages 1311-1322, 2014.
  • 44. References 21. Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, and Xuemin Lin: Scalable Big Graph Processing in MapReduce, in Proceedings of ACM Conference on Management of Data (SIGMOD'14), Pages 827-838, 2014. 22. Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu: Contract & Expand: I/O Efficient SCCs Computing, in the 30th IEEE International Conference on Data Engineering (ICDE'14), Pages 208-219, 2014. 23. Xin Huang, Hong Cheng, Rong-Hua Li, Lu Qin, and Jeffrey Xu Yu: Top-K Structural Diversity Search in Large Networks, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 6, No. 13, Pages 1618-1629, 2013. 24. Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, and Wentao Tian: Top-K Nearest Keyword Search on Large Graphs, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 6, No. 10, Pages 901-912, 2013. 25. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Xuemin Lin, Chengfei Liu, and Weifa Liang: Efficiently Computing k-Edge Connected Components via Graph Decomposition, in Proceedings of ACM Conference on Management of Data (SIGMOD'13), Pages 205-216, 2013. 26. Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin: I/O Efficient: Computing SCCs in Massive Graphs, in Proceedings of ACM Conference on Management of Data (SIGMOD'13), Pages 181-192, 2013. 27. Yuanyuan Zhu, Jeffrey Xu Yu, Hong Cheng, and Lu Qin: Graph Classification: A Diversified Discriminative Feature Selection Approach, in Proceedings of 2012 ACM International Conference on Information and Knowledge Management (CIKM'12), Pages 205-214, 2012. 28. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Diversifying Top-K Results, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 5, No. 11, Pages 1124-1135, 2012. 29. Yuanyuan Zhu, Lu Qin, and Jeffrey Xu Yu: Finding Top-K Similar Graphs in Graph Databases, in Proceedings of the 15th International Conference on Extending Database Technology (EDBT'12), Pages 456-467, 2012. 30. 
Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Qing Zhu, and Xiaofang Zhou: I/O Cost Minimization: Reachability Queries Processing over Massive Graphs, in Proceedings of the 15th International Conference on Extending Database Technology (EDBT'12), Pages 468-479, 2012.
  • 45. References 31. Yuanyuan Zhu, Lu Qin, Jeffrey Xu Yu, Yiping Ke, and Xuemin Lin: High Efficiency and Quality: Large Graphs Matching, in Proceedings of 2011 ACM International Conference on Information and Knowledge Management (CIKM'11), Pages 1755-1764, 2011. 32. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Yuanyuan Zhu, and Haixun Wang: Finding Information Nebula over Large Networks, in Proceedings of 2011 ACM International Conference on Information and Knowledge Management (CIKM'11), Pages 1465-1474, 2011. 33. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Computing Structural Statistics by Keywords in Databases, in Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE'11), Pages 363-374, 2011. 34. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Ten Thousand SQLs: Parallel Keyword Queries Computing, in Proceedings of the Very Large Database Endowment (VLDB), Vol. 3, No. 1, Pages 58-69, 2010. 35. Lu Qin, Jeffrey Xu Yu, and Lijun Chang: Keyword Search in Databases: The Power of RDBMS, in Proceedings of ACM Conference on Management of Data (SIGMOD'09), Pages 681-694, 2009. 36. Lu Qin, Jeffrey Xu Yu, Lijun Chang, and Yufei Tao: Querying Communities in Relational Databases, in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE'09), Pages 724-735, 2009. 37. Lu Qin, Jeffrey Xu Yu, Lijun Chang, and Yufei Tao: Scalable Keyword Search on Large Data Streams, in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE'09), Short Paper, Pages 1199-1202, 2009. 38. Lu Qin, Jeffrey Xu Yu, Bolin Ding, and Yoshiharu Ishikawa: Monitoring Aggregate k-NN Objects in Road Networks, in Proceedings of the 20th International Conference on Scientific and Statistical Database Management (SSDBM’08), Pages 168-186, 2008. 39. Bolin Ding, Jeffrey Xu Yu, and Lu Qin: Finding Time-Dependent Shortest Paths over Large Graphs, in Proceedings of the 11th International Conference on Extending Database Technology (EDBT'08), Pages 205-216, 2008. 
40. Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin, Xiao Zhang, and Xuemin Lin: Finding Top-k Min-Cost Connected Trees in Databases, in Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE'07), Pages 836-845, 2007. (Best Student Paper)
  • 46. References 41. Jia Wu, Xingquan Zhu, Chengqi Zhang, Philip S. Yu. Bag Constrained Structure Pattern Mining for Multi-Graph Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol 26, No 10, pp.2382-2396, 2014. 42. Jia Wu, Zhibin Hong, Shirui Pan, Xingquan Zhu, Chengqi Zhang, Zhihua Cai. Multi-Graph Learning with Positive and Unlabeled Bags. SDM 2014: 217-225. 43. Jia Wu, Xingquan Zhu, Chengqi Zhang, Zhihua Cai: Multi-instance Multi-graph Dual Embedding Learning. ICDM’13, 2013: 827-836. 44. Jia Wu, Shirui Pan, Xingquan Zhu, Chengqi Zhang. Multi-Graph-View Learning for Complicated Object Classification. International Joint Conference on Artificial Intelligence (IJCAI’15), 2015 45. Shirui Pan, Jia Wu, and Xingquan Zhu, "CogBoost: Boosting for Fast Cost-sensitive Graph Classification", IEEE Transactions on Knowledge and Data Engineering (TKDE), Accepted, 2015. 46. Shirui Pan, Xingquan Zhu, Chengqi Zhang, and Philip S. Yu. "Graph Stream Classification using Labeled and Unlabeled Graphs", International Conference on Data Engineering (ICDE’13), 2013 47. Shirui Pan and Xingquan Zhu. "CGStream: Continuous Correlated Graph Query for Data Streams". 21st ACM International Conference on Information and Knowledge Management (CIKM), 2012. 48. Shirui Pan and Xingquan Zhu. "Graph Classification with Imbalanced Class Distributions and Noise", 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013 49. Jia Wu, Zhibin Hong, Shirui Pan, Xingquan Zhu, Chengqi Zhang, Zhihua Cai. "Multi-graph-view Learning for Graph Classification", Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM), 2014 50. Shirui Pan, Jia Wu, Xingquan Zhu, Guodong Long, Chengqi Zhang, “Finding the Best not the Most: Regularized Loss Minimization Subgraph Selection for Graph Classification”, to appear in Pattern Recognition (PR), 2015

Editor's Notes

  1. A graph is a powerful data structure for modeling the relationships among entities in the real world. Web graph: nodes are webpages, edges are hyperlinks. Road network: nodes are road intersections, edges are road segments. Social network: nodes are users, edges are friendships. The Internet of Things: nodes are objects, edges are the relationships among objects.
  2. Our research mainly focuses on the three Vs of Big Data.
  3. The statistics of some real graph datasets
  4. What happens in one minute? Fast-flowing data: new data stream in rapidly. Evolving data structures and relationships: new relationships among people/entities are established and destroyed every second.
  5. A large variety of graph data. Directed: Twitter; undirected: Facebook. Labeled: chemical compound; unlabeled: web graph. Weighted: social network; unweighted: computer network. Heterogeneous: the Internet of Things; homogeneous: paper reference network.
  6. Different challenges tackle different big data Vs.
  7. Unlike traditional Google search, which answers a user query using a single webpage, Google Knowledge Graph search aims to answer a user question using a collection of correlated webpages (modeled as subgraphs).
  8. Identifying discriminative chemical features (modeled as subgraphs) in a chemical compound database (modeled as a database of graphs) is a critical task in bioinformatics and chemistry. Traditionally, this task relies on the experience of domain experts and takes a very long time. With the help of graph mining, we can greatly reduce the search space by providing a list of the most promising features. This shortens the time needed to identify useful chemical features and thus reduces the cost.
  9. Facebook introduced Graph Search in 2013. To support graph search efficiently, we need techniques that combine various types of information with graph algorithms. For example: 1. When combined with location information, we may need techniques such as spatial query processing and nearest neighbor search. An example: search for all male users aged 20-30 who are within 100 m of my current location. 2. When handling relationships, we may need techniques such as link analysis, shortest path search, and community detection. An example: search for all potential friends who share at least three common friends with me. 3. When combined with text information, we may need techniques such as text processing, string matching, and semantic analysis. An example: search for all my friends who like “hiking” and “swimming”.
  10. To index a set of documents/webpages, we can easily use a traditional hash table, B-tree, or inverted index in linear time and space. In graphs, however, the answers are usually subgraphs, trees, and paths, whose total size can be exponential in the size of the graph. Therefore, traditional indexing structures cannot be used directly. In addition, when the graph changes, we should be able to maintain the index structure incrementally without recomputing it from scratch.
  11. Traditionally, we use a single machine to store the graph. Now, when the graph is large, we may need to use multiple machines and derive distributed algorithms to process it. Traditionally, we keep the whole graph in the main memory of the machine. Now we need to consider external algorithms, since the graph may be too large to fit in the main memory of a machine. Traditionally, we use a single core to process a graph. Now we need to consider multi-core programming to improve the efficiency of query processing. A number of graph processing systems have been established, such as Hadoop@Apache, Pegasus@CMU, SNAP@Stanford, GraphLab@CMU, Hama@Apache, and Giraph@Apache.
  12. The traditional keyword search semantics in relational databases is content-based search. Given a list of keywords, the answers are individual tuples that contain all or part of the keywords in the query. By modeling the relational database as a graph, we proposed structural keyword search. Given a list of keywords, the answers are a set of subgraphs. Each subgraph contains the tuples that contain the keywords as well as the relationships among those tuples. Our work focuses on how to define a proper result semantics and how to answer the query efficiently under each semantics.
  13. Problem 1: Given two large graphs, how do we find the largest part that the two graphs have in common? This is the Maximum Common Subgraph (MCS) problem, which is computationally intractable. Problem 2: Given a large data graph and a small pattern graph, find all the subgraphs of the data graph that are isomorphic to the pattern graph. This is the subgraph isomorphism problem, which is NP-hard. Our work mainly focuses on how to find approximate solutions for graph matching, and how to increase the quality and efficiency of the matching.
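To make Problem 2 in note 13 concrete, here is a minimal backtracking sketch in Python. It is only an illustration of exact subgraph matching on tiny inputs, not the approximate matching method of the cited work. The function name `subgraph_matches` and the adjacency-dictionary representation (node mapped to a set of neighbours, undirected) are assumptions made for this example.

```python
def subgraph_matches(pattern, data):
    """Yield every injective mapping pattern-node -> data-node that
    maps each pattern edge onto a data edge (undirected graphs)."""
    order = list(pattern)  # fixed matching order over pattern nodes

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)  # copy, because mapping is mutated below
            return
        p = order[len(mapping)]
        used = set(mapping.values())
        for d in data:
            if d in used:
                continue
            # Every already-mapped neighbour of p must map to a neighbour of d.
            if all(mapping[q] in data[d] for q in pattern[p] if q in mapping):
                mapping[p] = d
                yield from extend(mapping)
                del mapping[p]

    yield from extend({})
```

The running time of this search is exponential in the worst case, which is consistent with the NP-hardness mentioned in the note.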
  14. How to define a community in a graph is an open problem. However, there is some common ground. Generally, a community should be (1) a cohesive subgraph, (2) a subgraph with high density, (3) a subgraph with low diameter, and (4) a subgraph with high connectivity. Here is an example using the k-core. A k-core is a subgraph in which every node has at least k neighbors within the subgraph. When k is small, the k-core is large but sparse. When k becomes large, the k-core becomes small but dense.
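The k-core definition in note 14 can be computed with the standard peeling procedure: repeatedly delete any node that has fewer than k remaining neighbours. The sketch below is a minimal Python version; the function name `k_core` and the adjacency-dictionary input are assumptions for illustration.

```python
def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    adj maps each node to the set of its neighbours. The result may be empty.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    stack = [v for v, d in degree.items() if d < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        # Deleting v lowers the degree of each surviving neighbour.
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return set(adj) - removed
```

On a 4-clique with a two-node tail attached, the 1-core is the whole graph while the 2-core and 3-core keep only the clique, matching the "large but sparse" versus "small but dense" behaviour described in the note.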
  15. Sometimes, the k-core may produce undesired subgraphs. In this example, the second result is undesirable because it is a loose combination of two dense subgraphs. Therefore, many other semantics have been proposed. For example: k-clique: a k-clique is a subgraph of k nodes in which every two nodes are connected by an edge. k-edge connected component (k-edge-cc): a k-edge-cc is a subgraph that is still connected after removing any k-1 edges. k-truss: a k-truss is a subgraph in which every edge is contained in at least k-2 triangles of the subgraph. Our work mainly focuses on defining an appropriate community semantics to handle a specific real-world application. We also focus on efficiency and dynamic updating issues.
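As an illustration of the k-truss definition in note 15, the sketch below removes edges that lie in fewer than k-2 triangles until no such edge is left. It is a deliberately simple fixed-point loop, not an optimized truss decomposition and not the algorithm of the cited papers. The name `k_truss` is hypothetical, and node identifiers are assumed to be mutually comparable so that each undirected edge can be written once as (u, v) with u < v.

```python
def k_truss(adj, k):
    """Return the edge set of the k-truss of an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    # Work on a private copy so the caller's graph is not modified.
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    changed = True
    while changed:
        changed = False
        for u in list(g):
            for v in list(g[u]):
                # Common neighbours of u and v = triangles containing edge (u, v).
                if u < v and len(g[u] & g[v]) < k - 2:
                    g[u].discard(v)
                    g[v].discard(u)
                    changed = True
    return {(u, v) for u in g for v in g[u] if u < v}
```

Removing one edge can destroy triangles of other edges, which is why the loop repeats until a full pass makes no change.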
  16. A new researcher in the database area may want to find the most influential research groups in the database collaboration network, and follow their publications. This is the focus of our VLDB 2015 paper: how to find the most influential communities in a large network?
  17. In community detection, if we simply focus on the density of the returned subgraphs, all the returned subgraphs may come from the densest region of the graph, and other regions are omitted. In this work, we focus on finding dense subgraphs by considering the density of each subgraph's local region. In this way, not only can the globally large dense regions be identified, but dense subgraphs in other regions can also be found. This can help us find emerging but not necessarily large communities in the graph.
  18. Given a graph database with a set of graphs, graph classification aims to train a classifier to distinguish different features (subgraphs) in the graph database. A traditional method needs three phases: (1) compute the frequent subgraphs using frequent subgraph mining techniques, (2) compute the optimal subgraphs (features) from the set of frequent subgraphs, and (3) train the classifier based on the optimal subgraphs. In our CIKM 2012 work, we propose to combine phase (1) and phase (2) into one phase. In this way, more structural information can be involved when selecting the set of optimal subgraphs (features). In our PR 2015 work, we use only one phase to compute the classifier. We integrate classifier training into optimal subgraph selection, and we allow iterative refinement to further optimize the algorithm. Our recent work along this direction mainly focuses on different learning models for graph classification.
  19. Let’s first consider some efficiency issues in graphs. A large number of graph problems are enumeration problems (e.g., enumerating a list of subgraphs that satisfy a certain property). We usually say an algorithm is efficient if it terminates in time polynomial in the size of the input (e.g., the size of the graph and the query). However, for an enumeration problem, the number of answers can be exponential in the size of the graph. Therefore, we need new terminology to measure the efficiency of an enumeration algorithm. The first attempt is this: instead of requiring the time complexity of an algorithm to be polynomial in the size of the input, we require it to be polynomial in the combined size of the input and the output. This is called polynomial total time. Polynomial total time is achievable for enumeration problems; however, it can still cause some problems (see next slide).
  20. Suppose that a certain enumeration problem has 3600 answers. With polynomial total time, it is possible that the user waits for an hour until all answers are output at once. Obviously, such a scenario is not desirable for the user. Therefore, in our new solution, instead of considering the total time, we require the delay between consecutive answers to be polynomial in the input only. This is called polynomial delay. In the above example, with polynomial delay the user can see a new result every second, and can decide whether to see the next result after getting a certain number of answers. Compared to polynomial total time, the total time may not decrease with polynomial delay, but the user experience is obviously much better. Our work in this direction focuses on deriving polynomial delay algorithms for different graph semantics, e.g., multi-center community (ICDE’09), clique (Algorithmica’13), top-k shortest paths (EDBT’15), and top-k tree matching (VLDB’15).
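To illustrate the idea of polynomial delay in note 20, the sketch below enumerates all paths from s to t in a directed acyclic graph with a Python generator. This toy problem is chosen only because it is easy to verify and is not one of the problems solved in the cited papers. Assumptions: the input is a DAG given as a dictionary from node to successor list, and the names `st_paths` and `useful` are hypothetical. The search first marks the nodes from which t is reachable and never steps outside that set, so every branch it enters ends in an answer and the work between two consecutive answers stays polynomial in the graph size, even though the number of paths can be exponential.

```python
def st_paths(dag, s, t):
    """Yield every path from s to t in a DAG (dict: node -> list of successors)."""
    # Pre-compute the nodes from which t is reachable (reverse search from t).
    reverse = {}
    for u, succs in dag.items():
        for v in succs:
            reverse.setdefault(v, []).append(u)
    useful = {t}
    stack = [t]
    while stack:
        v = stack.pop()
        for u in reverse.get(v, []):
            if u not in useful:
                useful.add(u)
                stack.append(u)

    def walk(path):
        node = path[-1]
        if node == t:
            yield list(path)  # emit the answer immediately
            return
        for nxt in dag.get(node, []):
            if nxt in useful:  # never enter a dead end
                path.append(nxt)
                yield from walk(path)
                path.pop()

    if s in useful:
        yield from walk([s])
```

Because the function is a generator, a caller can take the first answer with `next(st_paths(dag, s, t))` and stop whenever enough answers have been seen, which mirrors the user experience described in the note.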
  21. We now consider effectiveness issues in graphs, again for enumeration problems. A common property of most enumeration problems in graphs is that the answers are subgraphs that can overlap with each other. As a result, some results are very similar to each other. Consider the top-k densest subgraph enumeration problem. In the example shown, if we want to find the top-6 answers, it is possible that all subgraphs are derived from the densest region of the graph. Obviously, these top-6 answers are not desirable, because they are too similar and contain little information as a whole. This motivates us to consider diversity when answering graph problems. The definition of diversity varies across graph semantics, but the intuition is to enlarge the information contained in the returned answers. In the above example, if we consider diversity in the top-6 answers, the individual results may not be as large as the original top-6 answers, but the new subgraphs cover most of the graph and are thus more desirable. Along this direction, our work mainly focuses on defining diversity for different graph problems and deriving efficient solutions to compute the diversified answers. An example is shown in the next slide.
  22. A clique is a subgraph in which every two nodes are connected by an edge. Identifying large cliques in a graph is a useful graph operation that is widely applied in many applications. However, if we simply compute the top-k cliques with the largest size, the results can overlap heavily with each other. Therefore, we consider diversity: instead of maximizing the size of each individual clique, we aim to compute k cliques that together cover the maximum number of nodes in the graph. More technical details on how to solve the problem efficiently are in the paper.
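The objective in note 22 (pick k cliques that together cover as many nodes as possible) is an instance of maximum coverage. The sketch below applies the classic greedy heuristic to a list of candidate cliques that is assumed to be already available. This is an illustrative simplification and should not be read as the ICDE’15 algorithm, whose details are in the paper. The name `greedy_cover` is hypothetical.

```python
def greedy_cover(cliques, k):
    """Pick up to k cliques, each time taking the one that adds the most
    not-yet-covered nodes. Returns (chosen cliques, covered node set)."""
    covered = set()
    chosen = []
    remaining = [frozenset(c) for c in cliques]
    for _ in range(k):
        best = max(remaining, key=lambda c: len(c - covered), default=None)
        if best is None or not (best - covered):
            break  # nothing left that adds a new node
        chosen.append(set(best))
        covered |= best
        remaining.remove(best)
    return chosen, covered
```

For candidates {1,2,3,4}, {2,3,4,5}, and {6,7,8} with k = 2, picking the two largest cliques covers 5 nodes, whereas the greedy choice covers 7. Greedy selection is known to achieve a (1 - 1/e) approximation for maximum coverage over a fixed candidate collection.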
  23. To illustrate the indexing techniques, let us consider a fundamental graph problem: computing the shortest path between two nodes in a graph. Given a source node (red node) and a target node (blue node), a straightforward solution is to use the classic Dijkstra’s algorithm to compute the shortest path between them online. We can also use the A* algorithm to make it more efficient. However, both algorithms may traverse the whole graph in the worst case. When the graph is large, online computation is slow, because no index is used. Another solution is to precompute the shortest paths for every pair of nodes as an index. In this way, the answer to a query can be obtained directly from the precomputed answers. Obviously, when the graph is large, the precomputation cost is too high to be practical. In our VLDBJ work, we propose a new algorithm. The basic idea is as follows: for each node, instead of computing its shortest paths to all other nodes in the graph, we precompute only a small portion of them. In query processing, given a source node and a target node, we join the precomputed shortest paths of the source and target nodes; if the join succeeds, we concatenate the two shortest paths into one path, and we can guarantee that the new path is the shortest path from the source node to the target node. Our other work in this category focuses on deriving indexing techniques for various graph query semantics.
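The "precompute a subset, then join at query time" idea in note 23 can be illustrated with pruned landmark labeling, a well-known 2-hop labeling scheme. It is used here as a stand-in and is not claimed to be the VLDBJ’12 method. Assumptions: the graph is unweighted and undirected, given as an adjacency dictionary; the function names `build_labels` and `query` are hypothetical; the index answers distance queries rather than returning the paths themselves.

```python
from collections import deque

def build_labels(adj):
    """Build 2-hop distance labels: node -> {hub: distance}."""
    labels = {v: {} for v in adj}
    order = sorted(adj, key=lambda v: -len(adj[v]))  # high-degree hubs first
    for hub in order:
        dist = {hub: 0}
        queue = deque([hub])
        while queue:
            u = queue.popleft()
            d = dist[u]
            # Prune: earlier hubs already certify a distance <= d for (hub, u).
            if query(labels, hub, u) <= d:
                continue
            labels[u][hub] = d
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return labels

def query(labels, s, t):
    """Join the two label sets on common hubs; inf if s and t are disconnected."""
    a, b = labels[s], labels[t]
    if len(a) > len(b):
        a, b = b, a
    return min((d + b[h] for h, d in a.items() if h in b), default=float("inf"))
```

Each node stores distances to only some hubs, and a query touches just the two label sets of its endpoints instead of traversing the graph.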
  24. The figure shows a typical memory hierarchy of a computer, which is introduced in every operating systems textbook. The devices on the left have high processing speed but small storage size, whereas those on the right have low processing speed but large storage size. Here, we mainly focus on main memory (DRAM) and secondary storage (disk). When a graph is large, it is usually hard to fit it in the main memory of a machine. However, the disk is usually large enough to hold the graph. Therefore, the aim is to derive an I/O efficient algorithm for a graph problem. There are four issues. First, we need to decide which part of the graph should be loaded into main memory and which part is kept on disk. Second, when we access data on disk, we need to maximize sequential I/Os and minimize random I/Os, because random accesses on disk are much slower than sequential accesses. Third, for dense graphs such as social networks, we can usually guarantee that all nodes can be kept in main memory while the edges have to be stored on disk. This is called a semi-external algorithm. In the semi-external setting, some graph problems can be solved efficiently. Fourth, the two most popular approaches for I/O efficient graph computation can be considered: (1) Partition based: we partition the graph into several parts, each of which fits in main memory, and use a divide and conquer method to compute the result. (2) Nested loop based: we keep only the partial answers in memory and scan the graph on disk iteratively until the result converges to the final result. In this direction, we have derived I/O efficient algorithms for a number of fundamental graph problems, for example, reachability queries (EDBT’12), strongly connected components (SIGMOD’13, VLDBJ’14, and ICDE’14), and depth-first search (SIGMOD’15).
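The semi-external, scan-until-convergence approach in note 24 can be sketched on a simple problem: labeling connected components. The example keeps one integer per node in memory and reads the edges only through sequential passes over a file. It is an illustration under stated assumptions, not one of the cited algorithms: node identifiers are 0..num_nodes-1, the edge file holds one "u v" pair per line, and the names `write_edges` and `semi_external_components` are hypothetical.

```python
def write_edges(path, edges):
    """Store the edge list on disk, one 'u v' pair per line."""
    with open(path, "w") as f:
        for u, v in edges:
            f.write(f"{u} {v}\n")

def semi_external_components(path, num_nodes):
    """Return (labels, passes); labels[v] is the smallest node id in v's component."""
    label = list(range(num_nodes))  # the only per-node state kept in memory
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        with open(path) as f:  # one sequential scan of the edge file
            for line in f:
                u, v = map(int, line.split())
                low = min(label[u], label[v])
                if label[u] != low or label[v] != low:
                    label[u] = label[v] = low
                    changed = True
    return label, passes
```

The number of passes depends on the order in which edges appear in the file, which is one reason I/O efficient algorithms pay attention to how the graph is laid out on disk.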
  25. We consider two types of parallel computation models. The first is multicore programming, and the second is distributed computing such as MapReduce and BSP. We provide a simple comparison. Multicore programming is usually computation sensitive: given a problem, we focus on how to divide the computation among the cores in a balanced manner. Distributed computing is usually data sensitive: given a problem, we focus on how to divide the data to be stored on different computers. Multicore programming is based on a shared memory paradigm. Every core can access any part of the main memory. Each core has a separate L1 cache, in which computation can be fully parallelized. Distributed computing is based on a shared nothing paradigm. Every computer can access only its own CPU, memory, and disk. The computation on different computers can be fully parallelized, and computers can exchange data only over the network. Multicore programming mainly focuses on reducing cache misses to maximize parallelism. Distributed computing mainly focuses on reducing the communication cost to maximize parallelism. In this direction, we have derived an efficient multicore algorithm to answer keyword queries (VLDB’10) and processed several fundamental graph tasks using MapReduce (SIGMOD’14 and VLDB’15).
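For readers unfamiliar with the MapReduce model mentioned in note 25, the sketch below simulates one map, shuffle, and reduce round in a single Python process and uses it to count node degrees from an edge list. It does not use the Hadoop API or any real cluster; the helper names `map_reduce`, `degree_mapper`, and `degree_reducer` are assumptions made for this illustration.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one simulated MapReduce round and return {key: reduced value}."""
    groups = defaultdict(list)
    for record in records:                 # map phase: each record is independent
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: each key is independent of the others
    return {key: reducer(key, values) for key, values in groups.items()}

def degree_mapper(edge):
    """Emit one count for each endpoint of an undirected edge."""
    u, v = edge
    yield u, 1
    yield v, 1

def degree_reducer(node, counts):
    """Sum the counts collected for one node."""
    return sum(counts)
```

In a real deployment the shuffle step moves data between machines over the network, which is why the note stresses reducing communication cost in distributed computing.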
  26. Our final aim is to build a graph processing system with the following three objectives: First, we want to extract the primitive operators from graph processing and mining. However, the set of operators must be both complete and minimal, which is challenging. Second, we want to achieve high scalability in graph processing. However, it is not easy to guarantee the optimality of the algorithms. Third, we aim to make query processing tasks tractable in real time. However, since a large number of graph problems are NP-hard, it is not easy to guarantee real-time tractability.
  27. Our system structure consists of five layers: In the bottom data environment layer, we aim to handle different data environments, e.g., streaming, static, probabilistic, etc. In the computing model layer, we aim to support different computing models, e.g., in-memory, distributed, multicore, external, etc. In the computing paradigms layer, we focus on implementing the most primitive operators used in graphs, such as joins, breadth-first search, depth-first search, topological sort, spanning tree computation, etc. In the query/mining primitives layer, we focus on designing primitive operators for query processing and graph mining based on different semantics. For example, in query processing, we can design primitive operators depending on whether the query is a pattern, a set of nodes, or a set of keywords. In graph mining, we can design primitive operators depending on whether the results are subgraphs or some aggregated information. In the topmost application layer, we aim to combine different primitive operators to design algorithms that can be used in various application scenarios, e.g., social networks, chemistry, web search, etc.
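As a rough illustration of how the layers might build on one another, the sketch below defines a breadth-first search primitive (computing paradigms layer) and uses it to build a simple operator for a query given as a set of nodes (query/mining primitives layer). The operator `closest_connector` and its semantics are hypothetical and chosen only for illustration; they are not taken from the system described here. The graph is assumed to be an in-memory adjacency dictionary, and the query set is assumed to be non-empty.

```python
from collections import deque


def bfs_distances(graph, source):
    """Computing-paradigm primitive: hop distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closest_connector(graph, query_nodes):
    """Query primitive: node with the smallest total distance to all query nodes.

    Assumes query_nodes is non-empty. Returns None if no node is reachable
    from every query node.
    """
    per_query = [bfs_distances(graph, q) for q in query_nodes]
    # Only nodes reachable from every query node are candidates.
    candidates = set(per_query[0])
    for dist in per_query[1:]:
        candidates &= set(dist)
    if not candidates:
        return None
    # Ties are broken by the smaller node id to keep the result deterministic.
    return min(candidates, key=lambda n: (sum(d[n] for d in per_query), n))
```

Because the query operator only calls `bfs_distances`, the primitive could in principle be swapped for an external-memory or distributed implementation without changing the operator, which is the kind of separation the layered design aims for.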
  28. For each layer, we have published a large number of research papers in top database and data mining conferences and journals.
  29. Following the same framework, our future developments aim to enrich each component of our system, from the data layer to the application layer, so that we can finally deliver a general-purpose graph processing system that integrates all graph semantics, data environments, and computing models for various applications.
  30. In conclusion, the emergence of the era of big data brings a number of new challenges for traditional graph processing techniques, including new graph semantics, new mining tasks, new query processing algorithms, new indexing techniques, and new computing models. Meanwhile, graphs are still becoming larger and more complex. Big graph processing is still in its early stage, since many challenges remain unsolved. There are more opportunities for us to explore the unknown world!