Real-world social and informational entities are interconnected, forming gigantic, integrated social and information networks. By structuring these data objects into multiple types, such networks become semi-structured heterogeneous social and information networks. Most real-world applications that handle big data, including social media and social networks, medical information systems, online e-commerce systems, and database systems, can be structured into typed, heterogeneous social and information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, and medications, and links such as visits, diagnoses, and treatments are intertwined, providing rich information and forming heterogeneous information networks. Effective analysis of large-scale heterogeneous social and information networks poses an interesting but critical challenge.
In this talk, we present a set of data mining scenarios in heterogeneous social and information networks and show that mining typed, heterogeneous networks is a new and promising frontier in data mining research. However, such mining raises challenging problems of scalable computation. We identify a set of these problems and call for serious study of them. They include how to efficiently compute (1) meta-path-based similarity search, (2) rank-based clustering, (3) rank-based classification, (4) meta-path-based link/relationship prediction, and (5) topical hierarchies from heterogeneous information networks. We introduce some recent efforts, discuss the trade-offs between query-independent precomputation and query-dependent online computation, and point out some promising research directions.
Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han
1. Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks
Jiawei Han
Computer Science, University of Illinois at Urbana-Champaign
Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao
Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing
August 12, 2013
2. Outline
Why Is Mining Heterogeneous Social and Info. Networks Promising?
Homogeneous vs. Heterogeneous Social and Info. Networks
On the Power of Mining Structured, Heterogeneous Social and Info. Networks
Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks
PathSim: Online, Query-Based Similarity Search
PathPredict: Query-Based Prediction Using Meta-Path
Efficient Hidden Network Discovery: A Scalability Challenge
Conclusions
3. Where There Is Information, There Are Networks!
Social Networking Websites
Biological Network: Protein Interaction
Research Collaboration Network
Product Recommendation Network via Emails
4. The Real World: Heterogeneous Networks
Multiple object types and/or multiple link types
DBLP Bibliographic Network: Venue, Paper, Author
The IMDB Movie Network: Actor, Movie, Director, Studio
The Facebook Network
Homogeneous networks are information-loss projections of heterogeneous networks!
Directly mining information-richer heterogeneous networks
5. Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
DBLP: A Computer Science bibliographic database
A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
Power of heterogeneous network modeling: Treat Author, Venue, Term, and Paper all as first-class citizens!
7. On the Power of Mining Structured, Heterogeneous Networks
Links carry a lot of hidden information in structured, heterogeneous social and information networks
Effectiveness of mining
Clustering in heterogeneous networks: rank-based clustering (RankClus [EDBT’09] and NetClus [KDD’09]) and user-guided, meta-path-based clustering [KDD’12]
Knowledge propagation through heterogeneous links (GNetMine [ECMLPKDD’10]) and rank-based classification (RankClass [KDD’11])
Meta-path-based similarity search (PathSim [VLDB’11])
Meta-path-based prediction in heterogeneous networks (PathPredict [ASONAM’11])
8. RankClus: Integrated Clustering and Ranking
Highly ranked objects are more important (i.e., more weighted) in a cluster than weakly ranked ones
Ranking will make more sense within one cluster than in multiple clusters
Ranking, as the feature of the cluster, is conditional to a specific cluster
(Figure: Sub-Network, Ranking, Clustering)
Clustering and ranking mutually enhance each other at each iteration
RankClus [EDBT’09]: An efficient, EM-like algorithm
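The mutual enhancement of ranking and clustering can be illustrated with a toy loop. This is a heavily simplified sketch, not the RankClus algorithm itself: it assumes a bi-typed venue-author network given as a link-count matrix `W`, uses link counts as a stand-in ranking function, and reassigns each venue to the cluster whose author ranking best explains its links. The function names, the smoothing constant, and the data are all hypothetical.

```python
import numpy as np

def conditional_rank(W, labels, n_clusters, smooth=1e-2):
    """Rank authors within each cluster by (smoothed) link counts, normalized to sum to 1."""
    rank = np.vstack([W[labels == c].sum(axis=0) + smooth for c in range(n_clusters)])
    return rank / rank.sum(axis=1, keepdims=True)

def rank_then_cluster(W, init_labels, n_clusters, n_iter=20):
    """Alternate a ranking step and a clustering step until the labels stop changing."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(init_labels)
    for _ in range(n_iter):
        # Ranking step: author ranking conditional on each current cluster.
        rank = conditional_rank(W, labels, n_clusters)
        # Clustering step: assign each venue to the cluster whose ranking
        # gives its author links the highest log-likelihood.
        new_labels = (W @ np.log(rank).T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, conditional_rank(W, labels, n_clusters)

# Hypothetical venue x author link counts: venues 0-1 share authors 0-1,
# venues 2-3 share authors 2-3, with a little cross-linking noise.
W = [[5, 3, 0, 0],
     [4, 6, 1, 0],
     [0, 1, 5, 4],
     [0, 0, 3, 6]]
labels, rank = rank_then_cluster(W, init_labels=[0, 0, 0, 1], n_clusters=2)
```

The returned `rank` holds one author ranking per cluster, which mirrors the idea that ranking is conditional on a specific cluster. Like other EM-style procedures, the loop can settle in a poor local optimum for an unlucky initialization.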
9. [...] with Star Network Schema [KDD’09]
Beyond bi-typed information network: A Star Network Schema
Split a network into different layers, each represented by a net-cluster
10. NetClus: Database System Cluster
Term (ranking score)
database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707
Author (ranking score)
Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185
Venue (ranking score)
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849
Go one level deeper: Authors in the XML, XQuery cluster
11. Rank-Based Clustering for Others
RankCompete: Organize your photo album automatically!
Rank treatments for AIDS from MEDLINE
12. Classification in Heterogeneous Networks
GNetMine [ECMLPKDD’10]: Knowledge propagation across heterogeneous links
RankClass [KDD’11]: Integration of ranking and classification in heterogeneous network analysis
Highly ranked objects play a greater role in classification
An object can only be ranked high in some focused classes
Class membership and ranking are statistical distributions
Let ranking and classification mutually enhance each other!
Output: Classification results + ranking list of objects within each class
13. Experiments with Very Small Training Set
DBLP: 4-field data set (DB, DM, AI, IR) forming a heterogeneous info. network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent rankings within each class
Top-5 ranked conferences
Database: VLDB, SIGMOD, ICDE, PODS, EDBT
Data Mining: KDD, SDM, ICDM, PKDD, PAKDD
AI: IJCAI, AAAI, ICML, CVPR, ECML
IR: SIGIR, ECIR, CIKM, WWW, WSDM
Top-5 ranked terms
Database: data, database, query, system, xml
Data Mining: mining, data, clustering, classification, frequent
AI: learning, knowledge, reasoning, logic, cognition
IR: retrieval, information, web, search, text
14. Similarity Search: Find Similar Objects in Networks
Who are most similar to Christos Faloutsos?
Meta-Path: Meta-level description of a path between two objects
Meta-Path: Author-Paper-Author (APA): Christos’s students or close collaborators
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA): Similar reputation at similar venues
(Figure: Schema of the DBLP Network)
Different meta-paths lead to very different results!
Different meta-paths carry rather different semantics
15. Which Similarity Measure Is Better?
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)
Anhai Doan: CS, Wisconsin; Database area; PhD: 2002
• Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998
• Amol Deshpande: CS, Maryland; Database area; PhD: 2004
• Jun Yang: CS, Duke; Database area; PhD: 2001
PathSim [VLDB’11]
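As a minimal sketch of the measure, the PathSim score for a symmetric meta-path, as defined in the VLDB'11 paper, is s(x, y) = 2·M[x, y] / (M[x, x] + M[y, y]), where M is the commuting matrix (path counts) of the meta-path. The matrix values below are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

def pathsim(M):
    """PathSim scores for all pairs, given a symmetric commuting matrix M."""
    M = np.asarray(M, dtype=float)
    diag = np.diag(M)
    denom = diag[:, None] + diag[None, :]  # M[x, x] + M[y, y] for every pair
    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs with no path instances at all get similarity 0.
        return np.where(denom > 0, 2.0 * M / denom, 0.0)

# Hypothetical APVPA path counts among three authors.
M = [[4, 2, 0],
     [2, 2, 1],
     [0, 1, 1]]
S = pathsim(M)
```

The normalization by the self-path counts M[x, x] and M[y, y] is what favors peers with similar visibility: a pair scores high only when both objects are strongly connected to each other relative to how strongly each is connected to itself.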
16. PathPredict: Meta-Path Based New Co-author Relationship Prediction in DBLP [ASONAM’11]
Co-authorship prediction: Whether two authors are going to collaborate for the first time
Co-authorship encoded in meta-path Author-Paper-Author (A-P-A)
Topological features encoded in meta-paths as below:
(Table: Meta-paths between authors under length 4; columns: Meta-Path, Semantic Meaning)
17. The Success of PathPredict: Exploring Meta-Paths
Explain the prediction power of each meta-path
Wald Test for logistic regression
Higher prediction accuracy than using projected homogeneous network
11% higher in prediction accuracy
Citation prediction
The selected meta-paths could be rather different
Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors!
(Feature collected in [1996, 2002]; Test period in [2003, 2009])
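The general setup, with meta-path-based topological features fed into a logistic regression, can be sketched as follows. This is an illustrative toy, not the PathPredict implementation: the two feature columns stand for hypothetical path counts under two meta-paths for an author pair, the labels are invented, and the model is fitted with plain gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit logistic regression weights and bias by batch gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        residual = sigmoid(X @ w + b) - y      # predicted probability minus label
        w -= lr * (X.T @ residual) / len(y)    # gradient of the mean log-loss w.r.t. w
        b -= lr * residual.mean()              # gradient w.r.t. the bias
    return w, b

# Hypothetical training data: one row per candidate author pair,
# columns = path counts under two meta-paths (made-up values).
X = [[0, 1], [1, 0], [0, 0], [3, 4], [4, 2], [5, 5]]
# Label 1 = the pair co-authored for the first time in the test period.
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
p_close = sigmoid(np.array([5.0, 5.0]) @ w + b)  # strongly connected pair
p_far = sigmoid(np.array([0.0, 0.0]) @ w + b)    # pair with no connecting paths
```

Each learned weight corresponds to one meta-path, so its size and significance (for example under a Wald test) indicate how much that meta-path contributes to the prediction.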
19. Challenges on BigMine
Scalable mining of massive information networks: Necessity
Many such networks are gigantic: News, PubMed, …
DBLP is a small one: 2M papers and 0.8M authors, …
Meta-path: Potentially long chains of matrix multiplication over such networks
APVPA: AP × PV × VP × PA
Comparative analysis of multiple meta-paths is costly
Scalable mining of massive information networks: Possibility
Many functions do not need to compute eigenvalues
Top-k computation may save computation cost substantially
Precomputation may save online computation substantially
Clustering-based precomputation
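To make the matrix-chain cost concrete, here is a small sketch of computing the APVPA commuting matrix as AP × PV × VP × PA. The adjacency matrices are toy data and the helper name is hypothetical; real networks would need sparse matrices.

```python
import numpy as np

# Toy adjacency matrices (hypothetical data): 3 authors, 4 papers, 2 venues.
AP = np.array([[1, 1, 0, 0],   # author 0 wrote papers 0 and 1
               [0, 1, 1, 0],   # author 1 wrote papers 1 and 2
               [0, 0, 0, 1]])  # author 2 wrote paper 3
PV = np.array([[1, 0],         # papers 0 and 1 appeared in venue 0
               [1, 0],
               [0, 1],         # papers 2 and 3 appeared in venue 1
               [0, 1]])

def commuting_matrix(*adjacency):
    """Multiply the adjacency matrices along a meta-path, left to right."""
    result = adjacency[0]
    for matrix in adjacency[1:]:
        result = result @ matrix
    return result

# APVPA = AP x PV x VP x PA, where VP = PV^T and PA = AP^T.
APVPA = commuting_matrix(AP, PV, PV.T, AP.T)

# For a symmetric meta-path, the half path APV can be computed once and reused.
APV = AP @ PV
APVPA_from_half = APV @ APV.T
```

Entry (i, j) counts the path instances between authors i and j along the meta-path. Materializing only the half path is one simple way to trade storage for online multiplication cost.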
20. Computing Eigenvalues: When Is It Needed?
Computations needed
Clustering (RankClus), classification (RankClass), similarity search (PathSim), prediction (PathPredict)
A small # of interactive processing (e.g., EM-styled)
Meta-path-based prediction: Selection from a set of “parallel” meta-paths
21. Long Meta-Path May Not Carry the Right Semantics
Repeat the meta-path 2, 4, and infinite times for conference similarity query
22. Top-K Computation Is What We Need
Similarity search: “Who are similar to Christos?”
There is no need/interest to calculate and rank the remaining 0.8M authors
Only top-k (e.g., top-100) authors are needed in practice
Lots of optimizations can be explored for top-k computation
Precomputation vs. online computation
Precomputation of long meta-paths will save online, costly multi-matrix multiplication
Clustering-based precomputation
Example: top-k similarity authors
Precomputation by clustering: only computing rather similar author groups
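A minimal sketch of the top-k idea: given similarity scores for one query, a partial selection finds the k best candidates without fully sorting everyone else. The scores and the function name are hypothetical, and this shows only the selection step, not the optimizations discussed in the talk.

```python
import numpy as np

def top_k_similar(scores, query, k):
    """Return indices of the k highest-scoring objects, excluding the query itself."""
    s = np.asarray(scores, dtype=float).copy()
    s[query] = -np.inf                       # never return the query as its own match
    k = min(k, len(s) - 1)
    # argpartition places the k largest scores first in O(n), in arbitrary order.
    candidates = np.argpartition(-s, k - 1)[:k]
    # Only the k selected candidates are fully sorted.
    return candidates[np.argsort(-s[candidates])]

# Hypothetical similarity scores of five authors to the query author (index 0).
scores = [0.9, 0.1, 0.5, 0.7, 0.3]
top2 = top_k_similar(scores, query=0, k=2)
```

With 0.8M authors and k = 100, this avoids ordering the long tail of irrelevant candidates, although it still assumes every score has been computed; pruning methods go further by skipping the score computation itself.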
23. Co-Clustering-Based Pruning Algorithm
General idea:
Store commuting matrices for short path schemas and compute top-k queries online
Framework
Generate co-clusters for materialized commuting matrices, for feature objects and target objects
Derive upper bound for similarity between object and target cluster, and between object and object
Safely prune target clusters and objects if the upper bound similarity is lower than current threshold
Dynamically update top-k threshold
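The pruning framework can be sketched in a simplified form. This is not the co-clustering algorithm from the PathSim paper: it clusters only target objects (the clusters are given as input), assumes a non-negative half-path matrix `W` so that the commuting matrix is W·Wᵀ, and uses one simple cluster-level upper bound. All names and data are hypothetical.

```python
import heapq
import numpy as np

def topk_with_pruning(W, query, k, clusters):
    """Top-k PathSim search that skips clusters whose upper bound cannot beat the threshold."""
    W = np.asarray(W, dtype=float)
    x = W[query]
    x_norm = x @ x                        # M[x, x]
    sq_norms = (W * W).sum(axis=1)        # M[y, y] for every object y

    # Cluster-level upper bound: since W >= 0, x . W[y] <= x . (column-wise max),
    # and M[y, y] >= the smallest self-count in the cluster.
    bounds = []
    for members in clusters:
        cmax = W[members].max(axis=0)
        denom = x_norm + sq_norms[members].min()
        ub = 2.0 * (x @ cmax) / denom if denom > 0 else 0.0
        bounds.append((ub, members))
    bounds.sort(key=lambda t: -t[0])      # most promising clusters first

    heap, pruned = [], 0                  # min-heap of (score, object); heap[0] is the threshold
    for ub, members in bounds:
        if len(heap) == k and ub <= heap[0][0]:
            pruned += len(members)        # safe: no member can beat the current k-th score
            continue
        for y in members:
            if y == query:
                continue
            denom = x_norm + sq_norms[y]
            score = 2.0 * (x @ W[y]) / denom if denom > 0 else 0.0
            if len(heap) < k:
                heapq.heappush(heap, (score, y))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, y))   # threshold is updated dynamically
    return sorted(heap, reverse=True), pruned

# Hypothetical author x venue publication counts and a given 2-cluster grouping.
W = [[5, 0], [4, 1], [5, 1], [0, 6], [1, 5], [0, 4]]
clusters = [[0, 1, 2], [3, 4, 5]]
result, pruned = topk_with_pruning(W, query=0, k=2, clusters=clusters)
```

In this toy run the whole second cluster is skipped, because its upper bound falls below the score of the current second-best match. The cluster summaries (column-wise maximum and minimum self-count) do not depend on the query, so in practice they would be precomputed.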
24. Similarity Search: Experiments on Efficiency
Searching for top-20 objects vs. 1001st-1020th objects: PathSim-pruning is more efficient than PathSim-baseline
The denser the corresponding commuting matrix, the more PathSim-pruning can improve
The more neighbors of a query, the more PathSim-pruning can improve
Then compare the efficiency under different top-k’s (k = 5, 10, 20) for PathSim-pruning using query set 1
A smaller top-k has stronger pruning power, and thus needs less execution time
25. PathPredict: Exploring Big Data Space
Scalable computation in really huge heterogeneous networks?
Sampling may lead to similar judgment on importance of meta-path
Query-dependent prediction can be “selective” and thus may not need that many resources
Precomputation and clustering may further enhance its efficiency
26. Mining Query-Relevant “Hidden” Networks
Query-relevant hidden networks
What is the hidden network closely relevant to “SVM”?
The network should be a weighted network consisting of papers, terms, authors and venues
Is “kernel machine” closely relevant to “SVM”? How could we know it?
It takes substantial computation to derive such a “weighted/ranked” hidden heterogeneous network
Due to the diversity of queries (e.g., SVM + Cloud + SIGMOD), it is impossible to precompute every possible combination
How can we compute such a hidden network efficiently on the fly?
An interesting open problem
27. Conclusions
Heterogeneous social & information networks are ubiquitous
Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
Surprisingly rich knowledge can be mined from structured heterogeneous info. networks
Clustering, ranking, classification, path prediction, …
Knowledge is power, but knowledge is hidden in massive, but “relatively structured” nodes and links!
Challenge to BigMine: How to mine massive, heterogeneous information networks efficiently
Some progress/tricks on scalability and efficiency
Many open problems and much more to be explored!
28. From Data Mining to Mining Info. Networks
Han, Kamber and Pei, Data Mining, 3rd ed., 2011
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks, 2012
29. References
M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11.
Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012.
Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09.
Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09.
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11.
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11.
Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12.
F. Tao, et al., "EventCube: Multi-Dimensional Search and Mining of Structured and Text Data" (system demo), KDD'13.
C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10.
C. Wang, M. Danilevsky, et al., "A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy", KDD'13.