Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han

Jiawei Han
Computer Science, University of Illinois at Urbana-Champaign
Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao
Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing
August 12, 2013

Outline
 Why Is Mining Heterogeneous Social and Info Networks Promising?
 Homogeneous vs. Heterogeneous Social and Info. Networks
 On the Power of Mining Structured, Heterogeneous Social and Info. Networks
 Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks
 PathSim: Online, Query-Based Similarity Search
 PathPredict: Query-Based Prediction Using Meta-Path
 Efficient Hidden Network Discovery: A Scalability Challenge
 Conclusions

Where There Is Information, There Are Networks!
Social Networking Websites
Biological Network: Protein Interaction
Research Collaboration Network
Product Recommendation Network via Emails

The Real World: Heterogeneous Networks
 Multiple object types and/or multiple link types
DBLP Bibliographic Network: Venue, Paper, Author
The IMDB Movie Network: Actor, Movie, Director, Studio
The Facebook Network
Homogeneous networks are information-loss projections of heterogeneous networks!
Directly mining information-richer heterogeneous networks

Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
 DBLP: A Computer Science bibliographic database
A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
Power of het. network modeling: Treat Author, Venue, Term, and Paper all as first-class citizens!


On the Power of Mining Structured, Heterogeneous Networks
 Links carry a lot of hidden information in structured, heterogeneous social and information networks
 Effectiveness of mining
 Clustering in heterogeneous networks: Rank-based clustering (RankClus [EDBT’09] and NetClus [KDD’09]) and user-guided, meta-path-based clustering [KDD’12]
 Knowledge propagation through heterogeneous links (GNetMine [ECMLPKDD’10]) and rank-based classification (RankClass [KDD’11])
 Meta-path-based similarity search (PathSim [VLDB’11])
 Meta-path-based prediction in heterogeneous networks (PathPredict [ASONAM’11])

RankClus: Integrated Clustering and Ranking
 Highly ranked objects are more important (i.e., more weighted) in a cluster than weakly ranked ones
 Ranking will make more sense within one cluster than in multiple clusters
 Ranking, as the feature of the cluster, is conditional on a specific cluster
Figure: Sub-Network, Ranking, Clustering
 Clustering and ranking mutually enhance each other at each iteration
 RankClus [EDBT’09]: An efficient, EM-like algorithm
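The point that ranking is conditional on a cluster can be illustrated with a small sketch. It assumes a simple count-based ranking (an author's share of the papers published in a cluster's venues), which is only a stand-in and not the full EM-like RankClus algorithm. The function name, the authors, the paper counts, and the venue-to-cluster assignment are all hypothetical.

```python
from collections import defaultdict

def rank_authors_by_cluster(author_venue_counts, venue_cluster):
    """Return {cluster: {author: share of that cluster's papers}}.

    Assumption: a simple count-based ranking, used here only to show
    that an author's rank depends on which cluster is considered.
    """
    totals = defaultdict(float)
    per_cluster = defaultdict(lambda: defaultdict(float))
    for author, counts in author_venue_counts.items():
        for venue, n in counts.items():
            cluster = venue_cluster[venue]
            per_cluster[cluster][author] += n
            totals[cluster] += n
    return {
        cluster: {a: n / totals[cluster] for a, n in authors.items()}
        for cluster, authors in per_cluster.items()
    }

# Hypothetical toy data: papers per author per venue.
author_venue_counts = {
    "Ann": {"KDD": 3, "ICDM": 1},
    "Bob": {"SIGMOD": 4, "KDD": 1},
    "Cat": {"VLDB": 2, "SIGMOD": 2},
}
# Hypothetical cluster assignment of venues.
venue_cluster = {"SIGMOD": "DB", "VLDB": "DB", "KDD": "DM", "ICDM": "DM"}

ranks = rank_authors_by_cluster(author_venue_counts, venue_cluster)
```

In this toy data Bob holds half of the rank mass in the DB cluster but only a fifth in the DM cluster, so a single global ranking would blur the two roles.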

NetClus: Rank-Based Clustering with Star Network Schema [KDD’09]
 Beyond bi-typed information network: A Star Network Schema
 Split a network into different layers, each represented by a net-cluster

NetClus: Database System Cluster
Term
database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707
Author
Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185
Venue
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849
Go one level deeper: Authors in XML, XQuery cluster

Rank-Based Clustering for Others
RankCompete: Organize your photo album automatically!
Rank treatments for AIDS from MEDLINE

Classification in Heterogeneous Networks
 GNetMine [ECMLPKDD’10]: Knowledge propagation across heterogeneous links
 RankClass [KDD’11]: Integration of ranking and classification in heterogeneous network analysis
 Highly ranked objects play a greater role in classification
 An object can only be ranked high in some focused classes
 Class membership and ranking are statistical distributions
 Let ranking and classification mutually enhance each other!
 Output: Classification results + ranking list of objects within each class

Experiments with Very Small Training Set
 DBLP: 4-field data set (DB, DM, AI, IR) forming a heterog. info. network
 Rank objects within each class (with extremely limited label information)
 Obtain high classification accuracy and excellent rankings within each class
Top-5 ranked conferences
Database    Data Mining       AI          IR
VLDB        KDD               IJCAI       SIGIR
SIGMOD      SDM               AAAI        ECIR
ICDE        ICDM              ICML        CIKM
PODS        PKDD              CVPR        WWW
EDBT        PAKDD             ECML        WSDM
Top-5 ranked terms
Database    Data Mining       AI          IR
data        mining            learning    retrieval
database    data              knowledge   information
query       clustering        reasoning   web
system      classification    logic       search
xml         frequent          cognition   text

Similarity Search: Find Similar Objects in Networks
 Who are most similar to Christos Faloutsos?
 Meta-Path: Meta-level description of a path between two objects
Schema of the DBLP Network
Meta-Path: Author-Paper-Author (APA): Christos’s students or close collaborators
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA): Similar reputation at similar venues
Different meta-paths lead to very different results!
 Different meta-paths carry rather different semantics
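A small sketch can make the difference between the two meta-paths concrete. It counts path instances from one author by following each relation of a meta-path in turn. The helper names (`transpose`, `follow`, `path_counts`) and the toy authors, papers, and venue are hypothetical and chosen only for illustration.

```python
def transpose(adj):
    """Reverse a relation given as {source: [targets]}."""
    rev = {}
    for src, targets in adj.items():
        for t in targets:
            rev.setdefault(t, []).append(src)
    return rev

def follow(counts, adj):
    """Advance path counts by one hop along a relation."""
    out = {}
    for node, n in counts.items():
        for nbr in adj.get(node, ()):
            out[nbr] = out.get(nbr, 0) + n
    return out

def path_counts(start, relations):
    """Count path instances from `start` along a meta-path (a list of relations)."""
    counts = {start: 1}
    for adj in relations:
        counts = follow(counts, adj)
    return counts

# Hypothetical toy network: Ann and Bob co-author p2; Cat never writes
# with Ann but publishes at the same venue.
AP = {"Ann": ["p1", "p2"], "Bob": ["p2"], "Cat": ["p3", "p4"]}
PV = {"p1": ["KDD"], "p2": ["KDD"], "p3": ["KDD"], "p4": ["KDD"]}
PA, VP = transpose(AP), transpose(PV)

apa = path_counts("Ann", [AP, PA])            # Author-Paper-Author
apvpa = path_counts("Ann", [AP, PV, VP, PA])  # Author-Paper-Venue-Paper-Author
```

Under APA, Cat is unreachable from Ann; under APVPA, Cat has more path instances to Ann than Bob does, because the path goes through the shared venue rather than shared papers.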

Which Similarity Measure Is Better?
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)
 Anhai Doan: CS, Wisconsin; Database area; PhD: 2002
 Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998
 Amol Deshpande: CS, Maryland; Database area; PhD: 2004
 Jun Yang: CS, Duke; Database area; PhD: 2001
PathSim [VLDB’11]
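The slide does not spell out the measure, so the sketch below assumes the PathSim definition from the cited VLDB’11 paper for a symmetric meta-path: s(x, y) = 2·M[x][y] / (M[x][x] + M[y][y]), where M counts path instances. Here M is derived from half-path counts W (for APVPA, papers per author per venue), so M[x][y] is the dot product of rows x and y. The author names and counts are hypothetical toy data.

```python
def pathsim(W, x, y):
    """PathSim between x and y under a symmetric meta-path.

    W maps each object to its half-path counts, e.g. author -> {venue: papers}.
    Assumed definition: 2 * M[x][y] / (M[x][x] + M[y][y]) with M = W * W^T.
    """
    def dot(u, v):
        return sum(w * v.get(f, 0) for f, w in u.items())

    denom = dot(W[x], W[x]) + dot(W[y], W[y])
    return 2 * dot(W[x], W[y]) / denom if denom else 0.0

# Hypothetical toy counts of papers per author per venue.
W = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 2, "ICDE": 1},
    "Bob": {"SIGMOD": 2, "VLDB": 1},
    "Ann": {"ICDE": 1, "KDD": 1},
}

scores = {a: pathsim(W, "Mike", a) for a in W if a != "Mike"}
```

In this toy data Jim has by far the most path instances to Mike (120), yet his score is the lowest among authors who share a venue with Mike, because the normalization favors peers of similar visibility. Bob, with an identical publication profile, scores 1.0.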

PathPredict: Meta-Path Based New Co-author Relationship Prediction in DBLP [ASONAM’11]
 Co-authorship prediction: Whether two authors are going to collaborate for the first time
 Co-authorship encoded in meta-path
 Author-Paper-Author (A-P-A)
 Topological features encoded in meta-paths as below:
Table: Meta-paths between authors under length 4 (columns: Meta-Path, Semantic Meaning)

The Success of PathPredict: Exploring Meta-Paths
 Explain the prediction power of each meta-path
 Wald Test for logistic regression
 Higher prediction accuracy than using projected homogeneous network
 11% higher in prediction accuracy
 Citation prediction
 The selected meta-paths could be rather different
Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors!
(Features collected in [1996, 2002]; test period in [2003, 2009])
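The prediction setup above (meta-path features fed to logistic regression) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PathPredict implementation: the two features (path counts under A-P-A-P-A and A-P-V-P-A), the four training pairs, and the learning-rate settings are all hypothetical, and the Wald test for feature significance is not included.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability that an author pair will co-author, given its features."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def train_logistic(X, y, lr=0.1, steps=5000):
    """Fit logistic regression by plain batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical features per author pair, measured in the feature period:
# [path count under A-P-A-P-A, path count under A-P-V-P-A]
X = [[3, 2], [0, 1], [2, 3], [0, 0]]
# Hypothetical labels: 1 if the pair co-authored for the first time
# in the test period, else 0.
y = [1, 0, 1, 0]

w, b = train_logistic(X, y)
```

The learned weights give a rough reading of how much each meta-path contributes to the prediction; with real data, class imbalance such as the 42-in-4809 case above would also need to be handled.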


Challenges on BigMine
 Scalable mining of massive information networks: Necessity
 Many such networks are gigantic: News, PubMed, …
 DBLP is a small one: 2M papers and 0.8M authors, …
 Meta-path: Potentially long chains of matrix multiplication of such networks
 APVPA: AP × PV × VP × PA
 Comparative analysis of multi-meta-paths is costly
 Scalable mining of massive information networks: Possibility
 Many functions do not need to compute eigenvalues
 Top-k computation may save computation cost substantially
 Precomputation may save online computation substantially
 Clustering-based precomputation
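The APVPA chain above can be written out on a toy example. The sketch below uses plain nested lists rather than a sparse-matrix library, and the three authors, four papers, and two venues are hypothetical. It also shows one possible saving: because VP and PA are the transposes of PV and AP, the four-factor product equals W × Wᵀ with W = AP × PV, so only the half-path needs to be multiplied out.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical toy network.
# Rows: authors Ann, Bob, Cat; columns: papers p1..p4.
AP = [[1, 1, 0, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 1]]
# Rows: papers p1..p4; columns: venues KDD, VLDB.
PV = [[1, 0],
      [1, 0],
      [0, 1],
      [0, 1]]
VP, PA = transpose(PV), transpose(AP)

# Full chain as written on the slide: AP x PV x VP x PA.
full_chain = matmul(matmul(matmul(AP, PV), VP), PA)

# Same result from the half-path W = AP x PV (author-by-venue paper counts).
W = matmul(AP, PV)
M = matmul(W, transpose(W))
```

Both routes give the same author-by-author matrix of path counts; at DBLP scale the intermediate W (authors by venues) is far smaller than any paper-indexed intermediate.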

Computing Eigenvalues: When Is It Needed?
 Computations needed
 Clustering (RankClus), classification (RankClass), similarity search (PathSim), prediction (PathPredict)
 A small number of iterative processing steps (e.g., EM-style)
 Meta-path-based prediction: Selection from a set of “parallel” meta-paths

Long Meta-Path May Not Carry the Right Semantics
 Repeat the meta-path 2, 4, and infinite times for conference similarity query

Top-K Computation Is What We Need
 Similarity search: “Who are similar to Christos?”
 There is no need/interest to calculate and rank the remaining 0.8M authors
 Only top-k (e.g., top-100) authors are needed in practice
 Lots of optimizations can be explored for top-k computation
 Precomputation vs. online computation
 Precomputation of long meta-paths will save online, costly multi-matrix multiplication
 Clustering-based precomputation
 Example: top-k similar authors
 Precomputation by clustering: only computing rather similar author groups

Co-Clustering-Based Pruning Algorithm
 General idea:
 Store commuting matrices for short path schemas and compute top-k queries online
 Framework
 Generate co-clusters for materialized commuting matrices, for feature objects and target objects
 Derive upper bound for similarity between object and target cluster, and between object and object
 Safely prune target clusters and objects if the upper bound similarity is lower than current threshold
 Dynamically update top-k threshold
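The prune-by-upper-bound loop can be sketched at the object level. To keep the example short, this sketch swaps in a simpler bound than the co-clustering bounds of the framework above: by the Cauchy-Schwarz inequality M[x][y] ≤ sqrt(M[x][x]·M[y][y]), so a PathSim-style score 2·M[x][y] / (M[x][x] + M[y][y]) can be bounded from the precomputed diagonal alone. The function name, the score definition, and the toy counts are assumptions made for illustration.

```python
import heapq
import math

def dot(u, v):
    return sum(w * v.get(f, 0) for f, w in u.items())

def topk_pathsim_pruned(W, query, k):
    """Top-k similar objects with upper-bound pruning.

    W maps object -> half-path counts. Returns (top-k list, number of
    candidates whose exact score was computed).
    """
    if k <= 0:
        return [], 0
    diag = {x: dot(row, row) for x, row in W.items()}  # precomputed M[x][x]

    def upper_bound(y):
        denom = diag[query] + diag[y]
        return 2 * math.sqrt(diag[query] * diag[y]) / denom if denom else 0.0

    # Visit candidates in decreasing order of their bound.
    candidates = sorted(((upper_bound(y), y) for y in W if y != query), reverse=True)
    heap, evaluated = [], 0  # min-heap; heap[0][0] is the current top-k threshold
    for bound, y in candidates:
        if len(heap) == k and bound <= heap[0][0]:
            break  # no remaining candidate can beat the threshold
        evaluated += 1
        denom = diag[query] + diag[y]
        score = 2 * dot(W[query], W[y]) / denom if denom else 0.0
        if len(heap) < k:
            heapq.heappush(heap, (score, y))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, y))  # threshold rises dynamically
    return sorted(heap, reverse=True), evaluated

# Hypothetical toy counts of papers per author per venue.
W = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 2, "ICDE": 1},
    "Bob": {"SIGMOD": 2, "VLDB": 1},
    "Ann": {"ICDE": 1, "KDD": 1},
}

top2, evaluated = topk_pathsim_pruned(W, "Mike", 2)
```

With k = 2 only three of the four candidates are scored exactly; Jim is pruned from the diagonal bound alone. A cluster-level bound, as in the framework above, would let whole groups of candidates be skipped in the same way.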

Similarity Search: Experiments on Efficiency
 Searching for top-20 objects vs. 1001st-1020th objects: PathSim-pruning is more efficient than PathSim-baseline
 The denser the corresponding commuting matrix, the more PathSim-pruning can improve
 The more neighbors of a query, the more PathSim-pruning can improve
 Then compare the efficiency under different top-k’s (k = 5, 10, 20) for PathSim-pruning using query set 1
 A smaller top-k has stronger pruning power, and thus needs less execution time

PathPredict: Exploring Big Data Space
 Scalable computation in really huge heterogeneous networks?
 Sampling may lead to similar judgment on importance of meta-path
 Query-dependent prediction can be “selective” and thus may not need that many resources
 Precomputation and clustering may further enhance its efficiency

Mining Query-Relevant “Hidden” Networks
 Query-relevant hidden networks
 What is the hidden network closely relevant to “SVM”?
 The network should be a weighted network consisting of papers, terms, authors and venues
 Is “kernel machine” closely relevant to “SVM”? How could we know it?
 It takes substantial computation to derive such a “weighted/ranked” hidden heterogeneous network
 Due to the diversity of queries (e.g., SVM + Cloud + SIGMOD), it is impossible to precompute every possible combination
 How can we compute such a hidden network efficiently on the fly?
 An interesting open problem

Conclusions
 Heterogeneous social & information networks are ubiquitous
 Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
 Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
 Surprisingly rich knowledge can be mined from structured heterogeneous info. networks
 Clustering, ranking, classification, path prediction, …
 Knowledge is power, but knowledge is hidden in massive, yet “relatively structured”, nodes and links!
 Challenge to BigMine: How to mine massive, heterogeneous information networks efficiently
 Some progress/tricks on scalability and efficiency
 Many open problems and much more to be explored!

From Data Mining to Mining Info. Networks
Han, Kamber and Pei, Data Mining, 3rd ed. 2011
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks, 2012

References
 M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11
 Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
 Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09
 Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09
 Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11
 Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
 Y. Sun, J. Han, C. C. Aggarwal, N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12
 F. Tao, et al., "EventCube: Multi-Dimensional Search and Mining of Structured and Text Data", (system demo) KDD'13
 C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10
 C. Wang, M. Danilevsky, et al., "A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy", KDD'13
