Pharos Social Map Based Recommendation For Content Centric Social Websites
1. IBM China Research Laboratory
Social Map Based Recommendation for
Content-Centric Social Websites
IBM Research - China
Presenter: Shiwan Zhao (zhaosw@cn.ibm.com)
Pharos Team:
赵石顽 袁泉 张夏天 郑文涛
Advisors: Michelle Zhou, Rongyao Fu, Changyan Chi
2. IBM China Research Laboratory
About me
1993~1998
– B.S. Computer Science, Tsinghua University
1998~2000
– M.S. Computer Science, Tsinghua University
2000~now
– IBM Research - China
2007~now
– Focus on recommendation technologies
3. IBM China Research Laboratory
Agenda
Part 1:
– Problem & challenges
– Pharos solution overview
– Demo
Part 2:
– Some technology details
4. IBM China Research Laboratory
Problem
Content-centric social websites (e.g., forums,
wikis, and blogs) have flourished with the
exponential growth of user-generated information
– Overwhelming amount
– Evolving over time
– Not well organized
It is hard for users, especially new users, to grasp
what is out there and then find the information that
interests them
5. IBM China Research Laboratory
Example
A blog website contains a huge amount of dynamically evolving content (blog
entries) but does not provide effective ways to navigate it
– Search
• Useful when users have well-defined goals
– Recent entries
– Top entries by
• most comments
• most ratings
• most visits
– Featured blog entries
– Tag cloud
– …
Without guidance, finding content is like looking for a needle in a haystack:
novice users cannot find anything interesting, so they leave BlogCentral
quickly and do not come back (low stickiness)
6. IBM China Research Laboratory
Existing solutions & challenges
Researchers have developed recommender
systems to solve this information overload
problem
– E.g. Blog/News/Webpage recommender
However, current recommenders must address
two challenges:
– difficult to make effective recommendations for new users
(the cold start problem) due to the lack of user
information
– difficult to explain recommendation rationales to end
users to make the recommendation more trustworthy
7. IBM China Research Laboratory
Pharos Solution
Dynamically create a social map helping users find out who's talking
about what in an online site.
Social map creation
– Modeling & summarizing
time-sensitive user
behaviors of content-centric
online sites as a set of
“latent communities”
Social map based
recommendations
– Provide social landmarks
for new users to jump start
– Provide personalized social
map for experienced users
to effectively navigate the
community
8. IBM China Research Laboratory
Demo screenshot
[Screenshot; user labels shown: John, Steve, Michael, Alice, Tom]
9. IBM China Research Laboratory
Agenda
Part 1:
– Problem & challenges
– Pharos solution overview
– Demo
Part 2:
– Some technology details
10. IBM China Research Laboratory
Pharos Overview
[Architecture diagram, layers from bottom to top]
– User behavior on content, over time
– Content modeling and behavior mining
– Social map: a time-sensitive social map around the target user, used as the recommendation context
– Recommendation algorithms, with explicit and implicit triggers
– Visual recommendation explanations
* Multi-faceted recommendation
– Info item (page, fragment)
– People (reference to Bluepages, URL)
– Community (latent, dynamic community)
11. IBM China Research Laboratory
Pharos Technical Focus
[Architecture diagram as on the Pharos Overview slide, layers from bottom to top: user behavior on content over time; content modeling and behavior mining; social map around the target user; recommendation algorithms; visual recommendation explanations]
Three technical focus areas are marked on the diagram:
1. Latent community extraction
2. Community/item/people ranking
3. Community summary
12. IBM China Research Laboratory
Latent community extraction
Three approaches
– Model user-content relationships directly by using
co-clustering methods
– Group people first, then find the associated content
– Group content first, then find the associated people
13. IBM China Research Laboratory
Approach 1: time-elastic co-clustering
How large a time window should we use to mine the communities?
[Figure: user actions along a time axis; time-elastic ad hoc community detection produces a community map (example label: April 2009)]
GraphScope: Parameter-free Mining of Large Time-evolving
Graphs, Jimeng Sun, et al. KDD’07
14. IBM China Research Laboratory
Input Data – Graph Stream
User actions arrive as a stream over time
Split the click stream into many small atomic time frames
The click stream data of one frame can be
represented as a user-item
matrix (graph).
– In the matrix, 1 means one
interaction between user
and item.
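The framing step described on this slide can be sketched as follows. This is a minimal illustration, not the Pharos implementation: the event format `(timestamp, user, item)`, the function names, and the fixed `users`/`items` ordering are all assumptions made for the example.

```python
from collections import defaultdict

def split_into_frames(events, frame_seconds):
    """Group (timestamp, user, item) events into fixed-length time frames.

    Returns {frame_index: set of (user, item) pairs seen in that frame}.
    The event tuple layout is an assumption for this sketch.
    """
    frames = defaultdict(set)
    for timestamp, user, item in events:
        frames[int(timestamp // frame_seconds)].add((user, item))
    return dict(sorted(frames.items()))

def frame_matrix(interactions, users, items):
    """Build the binary user-item matrix for one frame.

    Rows follow `users`, columns follow `items`; a 1 marks an interaction.
    """
    return [[1 if (u, i) in interactions else 0 for i in items] for u in users]

# Hypothetical click stream: three events, 10-second frames.
events = [(0, "u1", "a"), (5, "u2", "b"), (12, "u1", "b")]
frames = split_into_frames(events, frame_seconds=10)
matrices = {k: frame_matrix(v, ["u1", "u2"], ["a", "b"]) for k, v in frames.items()}
```

Using one fixed user and item ordering for every frame keeps all matrices the same shape, which is what allows consecutive frames to be compared or merged.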
15. IBM China Research Laboratory
Approach
Two steps
– Co-cluster the graphs
– Decide whether a newly arrived graph should be merged into the
current segment or should start a new segment
Based on the MDL (Minimum Description Length) of the
graphs
– MDL measures how far the graphs can be compressed
– Decide whether to merge or split a segment
• If compressing the graphs together costs less to encode
than compressing them separately, we merge the new graph
into the current segment.
• Otherwise, we start a new segment with the new graph
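The merge-or-split rule above can be illustrated with a deliberately simplified cost model. The sketch below is an assumption-laden stand-in, not GraphScope's actual encoding: it treats each segment as a single block with one shared density and ignores the co-clustering partitions. The names `encoding_cost` and `should_merge` are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary entropy in bits; 0 for a degenerate distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def encoding_cost(graphs):
    """Approximate bits needed to encode binary matrices that share one density.

    Cost = bits to state the number of ones + cells * entropy(density).
    This single-block model is a simplification chosen for illustration.
    """
    cells = sum(len(g) * len(g[0]) for g in graphs)
    ones = sum(sum(row) for g in graphs for row in g)
    return math.log2(cells + 1) + cells * entropy_bits(ones / cells)

def should_merge(segment, new_graph):
    """Merge when encoding everything together is no more costly than separately."""
    joint = encoding_cost(segment + [new_graph])
    separate = encoding_cost(segment) + encoding_cost([new_graph])
    return joint <= separate

# Hypothetical 4x4 frames: two sparse frames form the current segment.
sparse = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
dense = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0]]
segment = [sparse, sparse]
```

With these inputs, a new sparse frame resembles the segment and is merged, while the dense frame is cheaper to encode on its own and starts a new segment.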
16. IBM China Research Laboratory
Pros and cons
Pros
– Clusters users and items at the same time
– Parameter free
• No need to specify the number of clusters
– Automatically decides the size of the time window
Cons
– Fixed graph size
• All graphs must have the same size (rows and columns)
• Cannot handle new users and items
– Cannot handle large-scale graphs
– Cannot guarantee an optimal result
– Results on very sparse graphs are poor
• The communities do not make sense.
• Our data is extremely sparse (< 0.1%)
17. IBM China Research Laboratory
Approach 2: evolutionary spectral clustering for user
clustering
Discover communities within a time window
– Get high quality clustering in each time window
Model community evolution for a sequence of time windows
– Make the evolution between time windows smooth
[Figure: community maps along a time axis, one per time window: Jan 2009, Feb 2009, Mar 2009, Apr 2009 (in the BlogCentral domain)]
18. IBM China Research Laboratory
Evolutionary framework
Basic idea
– Cost function: Cost = α*CS + β*CT
• Snapshot cost (CS) measures the snapshot quality of the current
clustering result with respect to the current data features
• Temporal cost (CT) measures the temporal smoothness in terms of the
goodness-of-fit of the current clustering result with respect to either
historic data features or historic clustering results
Two evolutionary frameworks
– PCQ (preserving cluster quality): the current partition is applied to
historic data, and the resulting cluster quality determines the temporal
cost.
– PCM (preserving cluster membership): the current partition is directly
compared with the historic partition, and the resulting difference
determines the temporal cost.
– PCQ is the framework we currently implement
Evolutionary Spectral Clustering by Incorporating Temporal
Smoothness, Yun Chi, et al. KDD’07
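A small sketch may help make the PCQ cost concrete. It assumes cluster quality is measured by average association on a similarity matrix, with each cost taken as the negated quality; the function names, the weights `alpha=0.8` and `beta=0.2`, and the toy matrices are all illustrative assumptions rather than values from the Pharos system.

```python
def average_association(similarity, labels):
    """Sum over clusters of within-cluster similarity divided by cluster size."""
    clusters = {}
    for node, label in enumerate(labels):
        clusters.setdefault(label, []).append(node)
    total = 0.0
    for members in clusters.values():
        within = sum(similarity[i][j] for i in members for j in members)
        total += within / len(members)
    return total

def pcq_cost(current_sim, historic_sim, labels, alpha=0.8, beta=0.2):
    """Cost = alpha * CS + beta * CT under the PCQ framework.

    CS scores the partition on current data, CT scores the same partition
    on historic data; higher association means lower cost.
    """
    snapshot_cost = -average_association(current_sim, labels)
    temporal_cost = -average_association(historic_sim, labels)
    return alpha * snapshot_cost + beta * temporal_cost

# Toy data: the current window cannot tell the two partitions apart,
# but the previous window clearly grouped users {0, 1} and {2, 3}.
current = [[1, 0.5, 0.5, 0], [0.5, 1, 0, 0.5], [0.5, 0, 1, 0.5], [0, 0.5, 0.5, 1]]
historic = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
cost_a = pcq_cost(current, historic, [0, 0, 1, 1])
cost_b = pcq_cost(current, historic, [0, 1, 0, 1])
```

Both partitions have the same snapshot cost here, so the temporal term breaks the tie in favor of the partition that agrees with history, which is the smoothing effect the framework is meant to provide.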
19. IBM China Research Laboratory
Approach 3: LDA for content clustering
Latent Dirichlet Allocation (LDA), a probabilistic latent
semantic model for topic analysis
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$
[Blei et al. 03]
LDA is a generative probabilistic model of a corpus. The basic
idea is that the documents are represented as random mixtures
over latent topics, where a topic is characterized by a
distribution over words.
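The generative story behind LDA can be sketched in a few lines. This is a toy sampler only, not an inference procedure and not the Pharos code: the vocabulary, the two topic-word distributions, and the function names are invented for illustration, and the Dirichlet draw uses the standard normalized-gamma construction.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topics, vocab, n_words, rng):
    """Generate one document under the LDA generative process.

    theta ~ Dirichlet(alpha); for each word: z ~ Multinomial(theta),
    then w ~ Multinomial(topics[z]), where topics[z] is the topic's word distribution.
    """
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words

# Hypothetical vocabulary and two topics (rows are distributions over vocab).
vocab = ["java", "jvm", "cloud", "vm"]
topics = [[0.6, 0.3, 0.05, 0.05], [0.05, 0.05, 0.6, 0.3]]
rng = random.Random(0)
theta, words = generate_document([0.5, 0.5], topics, vocab, n_words=20, rng=rng)
```

Each document gets its own topic mixture `theta`, which is why content grouped by dominant topic can serve as a latent community.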
21. IBM China Research Laboratory
Latent community extraction - comparison
Co-clustering
– Does not work well for extremely sparse data (<0.1%)
Spectral clustering for users
– Most behavior comes from anonymous users, so it is difficult to
distinguish users
– Topics are not concentrated within each community
* LDA for content clustering
– Users are more likely to be interested in content
22. IBM China Research Laboratory
Pharos Technical Focus
[Architecture diagram, layers from bottom to top: user behavior on content over time; content modeling and behavior mining; social map around the target user; recommendation algorithms; visual recommendation explanations]
Three technical focus areas are marked on the diagram:
1. Latent community extraction
2. Item/people ranking
3. Community summary
23. IBM China Research Laboratory
Item/People Ranking
Authority-based ranking by context-sensitive PageRank, considering
– Time factor
– Context information, e.g., item attributes, report chain of people
$$PR(p_i) = (1 - d)\, cv_i + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$
where cv is the context vector (e.g., item attributes)
[Diagram: a graph linking people (A, B, C, D) and blog entries (1, 2, 3, 4)]
– Influential people: active authors with high-quality entries
– Influential entry: written by influential authors, highly visited/commented
Edge types in the graph
– Authority from author to entry
– Authority from entry to author
– Authority from commenter/rater to entry
– Authority from visitor to entry
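The recurrence PR(p_i) = (1 − d)·cv_i + d·Σ PR(p_j)/L(p_j) can be sketched with plain power iteration. The sketch reads the symbols in the usual PageRank way (d as the damping factor, M(p_i) as the nodes linking to p_i, L(p_j) as the out-link count of p_j), which is an assumption since the slide does not define them. The graph, the uniform context vector, and the function name are hypothetical, and the time factor is not modeled.

```python
def context_pagerank(out_links, context, d=0.85, iterations=50):
    """Power iteration for PR(p_i) = (1 - d) * cv_i + d * sum(PR(p_j) / L(p_j)).

    out_links: {node: list of nodes it confers authority on}
    context:   {node: cv_i}, assumed to sum to 1
    Nodes without out-links simply drop their share in this sketch.
    """
    nodes = list(context)
    rank = dict(context)  # start from the context vector
    for _ in range(iterations):
        new_rank = {n: (1 - d) * context[n] for n in nodes}
        for src in nodes:
            targets = out_links.get(src, [])
            if not targets:
                continue
            share = d * rank[src] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: author A wrote entries e1 and e2 (authority flows both
# ways), and commenter C commented on e1.
out_links = {"A": ["e1", "e2"], "e1": ["A"], "e2": ["A"], "C": ["e1"]}
context = {"A": 0.25, "C": 0.25, "e1": 0.25, "e2": 0.25}
ranks = context_pagerank(out_links, context)
```

In this toy graph the commented entry e1 outranks e2, and the author inherits authority from both entries. A non-uniform context vector would bias the ranking toward nodes that match the context.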
24. IBM China Research Laboratory
Pharos Technical Focus
[Architecture diagram, layers from bottom to top: user behavior on content over time; content modeling and behavior mining; social map around the target user; recommendation algorithms; visual recommendation explanations]
Three technical focus areas are marked on the diagram:
1. Latent community extraction
2. Item/people ranking
3. Community summary
25. IBM China Research Laboratory
Community Summary & visualization
Community representative keyword extraction
– Modified TF/IDF
– Content topic modeling by LDA (Latent Dirichlet Allocation)
Visualization
– A bubble chart layout (as used by ManyEyes) packs the top-N
communities tightly on the social map
• A bubble’s size is determined by the community’s ‘hotness’
– Inside each community, a Wordle layout is used to pack labels
tightly
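As a rough illustration of keyword extraction for community labels, the sketch below applies plain TF/IDF with each community's pooled text treated as one document. The slide does not say how the Pharos TF/IDF variant is modified, so this is standard TF/IDF under that assumption; the function name and sample tokens are hypothetical, and every community is assumed to have at least one token.

```python
import math
from collections import Counter

def community_keywords(community_tokens, top_n=3):
    """Rank each community's terms by TF * IDF across communities.

    community_tokens: {community name: non-empty list of tokens}
    Ties are broken alphabetically so the output is deterministic.
    """
    n_communities = len(community_tokens)
    doc_freq = Counter()
    for tokens in community_tokens.values():
        doc_freq.update(set(tokens))
    keywords = {}
    for name, tokens in community_tokens.items():
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_communities / doc_freq[term])
            for term, count in term_freq.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[name] = ranked[:top_n]
    return keywords

# Hypothetical pooled tokens for two communities.
sample = {
    "c1": ["java", "jvm", "java", "the"],
    "c2": ["cloud", "the", "vm", "cloud"],
}
labels = community_keywords(sample, top_n=2)
```

Terms that appear in every community, such as "the" here, score zero and fall to the bottom, so the labels favor words that distinguish one community from the others.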
26. IBM China Research Laboratory
Summary
Model, detect, and use a social map that summarizes user behavior on
online sites to make accurate and trustworthy recommendations
Increase recommendation accuracy
– Eases the “cold start” problem by giving new users “social landmarks” of
a social site to jump-start their engagement
– Gives users overall social awareness to compensate for
recommendation inaccuracy
Enhance recommendation trustworthiness
– Explain recommendation results in the context of a social map
Interactive recommendation
– Users can navigate the social map to find what they need