Social Relation Based Scalable Semantic Search Refinement
1. Social Relation Based Scalable Semantic Search Refinement
Yi Zeng (1), Xu Ren (1), Yulin Qin (1, 2), Ning Zhong (1, 3), Zhisheng Huang (4), Yan Wang (1)
1. International WIC Institute, Beijing University of Technology, China
2. Carnegie Mellon University, USA
3. Maebashi Institute of Technology, Japan
4. Vrije University Amsterdam, the Netherlands
2. Motivation
• Vague or incomplete queries over large-scale semantic data: how can queries be refined to reduce the size of the result set?
• Large-scale semantic data vs. the most relevant data for a specific user.
Diversity among users in the context of large-scale semantic data:
• User interests: interest-based search refinement
• Network of friends, collaborators, etc.: search refinement through social relationships
• Group-interest-based search refinement
3. Social Relations and Social Networks
• Most social networks follow a power-law distribution.
• The DBLP coauthor network is built using the FOAF vocabulary.
Fig. 1: Coauthor number distribution in the SwetoDBLP dataset.
Fig. 2: Log-log diagram of Figure 1.
• The distribution is approximately a power law: few authors have many coauthors, and most authors have very few.
• On scalability: when the number of authors grows rapidly, rebuilding the coauthor network will not be hard, since most authors have only a few links.
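The power-law observation above can be checked with a short sketch. It assumes the coauthor network is given as a list of author lists, one per paper; the function names and the toy data are hypothetical and are not taken from the slides. A clearly negative, roughly constant slope on the log-log scale is what an approximate power law would show.

```python
import math
from collections import Counter, defaultdict

def coauthor_counts(papers):
    """Map each author to the number of distinct coauthors."""
    neighbors = defaultdict(set)
    for authors in papers:
        for a in authors:
            neighbors[a].update(x for x in authors if x != a)
    return {a: len(n) for a, n in neighbors.items()}

def degree_distribution(counts):
    """Map a coauthor number k to how many authors have exactly k coauthors."""
    return Counter(counts.values())

def loglog_slope(dist):
    """Least-squares slope of log(frequency) against log(k), for k > 0.

    Assumes at least two distinct coauthor numbers are present.
    """
    pts = [(math.log(k), math.log(f)) for k, f in dist.items() if k > 0 and f > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx

# Toy data (hypothetical): one hub author "A" and several low-degree authors.
papers = [
    ["A", "B"], ["A", "C"], ["A", "D"], ["A", "E"],
    ["B", "C"], ["F", "A"], ["G", "A"],
]
counts = coauthor_counts(papers)
dist = degree_distribution(counts)
slope = loglog_slope(dist)
```

Because each author only stores a set of neighbors, adding a new paper touches just the authors on that paper, which is the scalability point made above.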
4. Search Refinement through Social Relationship
Table 1: A partial result of the expert finding search task “Artificial Intelligence authors” (User name: John McCarthy).
Satisfied authors without social relation refinement | Satisfied authors with social relation refinement
Carl Kesselman (312) | Hans W. Guesgen (117) *
Thomas S. Huang (271) | Virginia Dignum (69) *
Edward A. Fox (269) | John McCarthy (65) *
Lei Wang (250) | Aaron Sloman (36) *
John Mylopoulos (245) | Carl Kesselman (312)
Ewa Deelman (237) | Thomas S. Huang (271)
... | ...
Diagram: the domain experts dataset and the coauthor network dataset are linked through user URIs. Bridging the two separate datasets helps to refine the expert finding task.
In an enterprise setting, if the experts found have a previous relationship with the employer, cooperation may be smoother.
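A minimal sketch of the kind of re-ranking shown in Table 1, assuming that the refinement moves experts who belong to the user's coauthor circle (including the user) ahead of the rest while keeping the original order inside each group. The function name, the data layout, and the placeholder names are hypothetical; the slides do not specify the actual procedure.

```python
def refine_by_social_relation(experts, coauthors_of, user):
    """Put experts who are the user or the user's coauthors first.

    experts: list of (name, score) pairs, already ranked by the expert finder.
    coauthors_of: dict mapping an author name to a set of coauthor names.
    Returns (name, score, related) triples; order inside each group is kept.
    """
    circle = set(coauthors_of.get(user, ())) | {user}
    related = [(n, s, True) for n, s in experts if n in circle]
    others = [(n, s, False) for n, s in experts if n not in circle]
    return related + others

# Placeholder names and scores (hypothetical, not real coauthor data).
experts = [("expert_a", 312), ("expert_b", 271), ("expert_c", 117), ("user_x", 65)]
coauthors_of = {"user_x": {"expert_c"}}
refined = refine_by_social_relation(experts, coauthors_of, "user_x")
```

Keeping the unrelated experts at the end, rather than dropping them, mirrors the right-hand column of Table 1, where entries from the unrefined list still appear after the marked ones.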
6. Obtaining the Retained Interests
• Do retained interests appear more frequently than others?
(Frequency) Total Interest: $TI(i) = \sum_{j=1}^{n} m(i, j)$
• Besides frequency, what else matters for correctly identifying retained interests?
The forgetting mechanism in cognitive memory retention (exponential function model, power function model) [Anderson, Schooler 1991].
(Frequency and Recency) Memory Retention:
$P = A e^{-bT}$ (exponential function model); $P = A T^{-b}$ (power function model)
Pictures from: [Schooler 1993] Schooler, L. J. & Anderson, J. R.: Recency and Context: An
Environmental Analysis of Memory. In Proceedings of the Fifteenth Annual Conference of the
Cognitive Science Society, pp. 889-894, 1993.
7. Obtaining the Retained Interests
• (Frequency and Recency) Exponential Model for Interest Retention:
$EIR(i) = \sum_{j=1}^{n} m(i, j) \times A e^{-b T_{i,j}}$
• (Frequency and Recency) Power Model for Interest Retention:
$PIR(i) = \sum_{j=1}^{n} m(i, j) \times A T_{i,j}^{-b}$
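The three interest measures (total interest and the two retention models) can be compared with a small sketch. It assumes that m(i, j) is the number of times interest i appears in time interval j and that T_{i,j} is the time elapsed since interval j, with T_{i,j} > 0; the slides do not define these symbols, so both readings, the function names, and the sample values of A and b are assumptions.

```python
import math

def total_interest(m):
    """TI(i): sum of m(i, j) over all intervals j."""
    return sum(m)

def exponential_retention(m, T, A, b):
    """EIR(i): sum of m(i, j) * A * exp(-b * T_ij)."""
    return sum(mj * A * math.exp(-b * tj) for mj, tj in zip(m, T))

def power_retention(m, T, A, b):
    """PIR(i): sum of m(i, j) * A * T_ij ** (-b); needs every T_ij > 0."""
    return sum(mj * A * tj ** (-b) for mj, tj in zip(m, T))

# Two hypothetical interests with the same frequency but different recency.
T = [3, 2, 1]            # elapsed time for each interval, oldest first
old_interest = [5, 0, 0]
recent_interest = [0, 0, 5]
A, b = 1.0, 1.0          # illustrative parameter values only

ti_old = total_interest(old_interest)
ti_recent = total_interest(recent_interest)
pir_old = power_retention(old_interest, T, A, b)
pir_recent = power_retention(recent_interest, T, A, b)
eir_old = exponential_retention(old_interest, T, A, b)
eir_recent = exponential_retention(recent_interest, T, A, b)
```

Both interests have the same total interest, but either retention model ranks the recent one higher, which is the effect that adding recency to frequency is meant to capture.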
[Zeng 2009a] Cognitive Memory Retention Based Starting Point for Query Extension and
Granular Selection, Yi Zeng, Haiyan Zhou, Ning Zhong, Yulin Qin, Shengfu Lu, Yiyu Yao, Yang
Gao. In: Cognitive Memory Component (v1), LarKC deliverable 2-3-1, Coordinated by Jose
Quesada and Yi Zeng, March 30, 2009.
[Zeng 2009b] Yi Zeng, Yiyu Yao, Ning Zhong. DBLP-SSE: A DBLP Search Support Engine, In:
Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE
Computer Society, Milan, Italy, September 15-18, 2009.
[Maanen 2009] Leendert van Maanen, Julian N. Marewski. Recommender Systems for Literature Selection: A Competition between Decision Making and Memory Models, CogSci 2009, July 31-August 1, 2009.
8. Obtaining the Retained Interests
• To some extent, current interests are relevant to interest retention.
Using the power law model with A = 0.855 and b = 1.295, we selected all authors with more than 100 publications and predicted their top 9 interests from 2000 to 2007 using interest retention (1226 persons). For 49.54% of this sample, 3 of the 9 interests can be predicted.
Figure 7a: A comparative study of total research interests from 1990 to 2008 and retained interests in 2009 (based on both the power law and exponential law models).
Figure 7b: Difference in the contribution values from papers published in different years.
Figure 7c: A comparison of total interests and interest retentions of the author “Ricardo A. Baeza-Yates” (Nov. 2009, from DBLP).
• We analyzed research interest retention for all 615,124 computer scientists in the SwetoDBLP dataset and released the “computer scientists’ research interest RDF dataset”:
http://www.iwici.org/dblp-sse
http://wiki.larkc.eu/csri-rdf
9. Retained Interests in a Social Environment
Group Retained Interests:
• Diversity
• Consistency
[Figure: interest terms of Ricardo A. Baeza-Yates and Carlos Castillo: Network, Link, Search, PageRank, Information, Retrieval, Web, Query, Content, Spam, Challenge, Engine, Mining, Analysis, Detection]
Group Retained Interest:
$E(i, p) = \begin{cases} 1 & (i \in RI_p^{top\,9}) \\ 0 & (i \notin RI_p^{top\,9}) \end{cases}$, $GIR(i) = \sum_{p=1}^{n} E(i, p)$
For the most prolific authors in DBLP (publication number > 50): 5161 persons.
On average, 52.55% of an individual’s retained interests are consistent with his/her group retained interests.
Table: Top 9 retained interests of a user and his group retained interests (Ricardo A. Baeza-Yates, based on the May 2008 version of SwetoDBLP).
Top 9 Retained Interests | Top 9 Group Retained Interests
Web 7.81 | Search 35
Search 5.59 | Retrieval 30
Retrieval 3.19 | Web 28
Information 2.27 | Information 26
Query 2.14 | System 19
Engine 2.10 | Query 18
Mining 1.26 | Analysis 14
Challenge … | Text …
Analysis … | Model …
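The group retained interest GIR(i) counts how many group members have interest i among their top 9 retained interests. The sketch below follows that definition and adds one possible consistency measure. It assumes the group is a dictionary of members, each with a mapping from interest to retention value, and that consistency is the overlap between a person's top-k interests and the group's top-k interests; the slides do not state how the 52.55% figure was computed, so that measure, the function names, and the toy data are hypothetical.

```python
from collections import Counter

def top_k(retained, k=9):
    """Return the k interests with the highest retention values."""
    return set(sorted(retained, key=retained.get, reverse=True)[:k])

def group_retained_interest(members, k=9):
    """GIR(i): number of group members p whose top-k retained interests contain i."""
    gir = Counter()
    for retained in members.values():
        gir.update(top_k(retained, k))
    return gir

def consistency(own_top, gir, k=9):
    """Share of a person's top-k interests that are also in the group's top k."""
    group_top = {i for i, _ in gir.most_common(k)}
    return len(own_top & group_top) / len(own_top)

# Toy group (hypothetical values), using k = 2 instead of 9 to stay small.
members = {
    "p1": {"web": 3.0, "search": 2.0, "query": 1.0},
    "p2": {"search": 4.0, "retrieval": 2.5, "web": 0.5},
    "p3": {"search": 1.0, "web": 0.9, "mining": 0.1},
}
gir = group_retained_interest(members, k=2)
share = consistency(top_k(members["p2"], k=2), gir, k=2)
```

In the toy group, "search" is in every member's top 2, so it gets the highest group value, much as "Search" leads the group column in the table above.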
10. Search Refinement by Interests from Different Perspectives
• Vague or incomplete queries may produce so many results that users have to wade through them.
• Research interests may be closely related to search tasks.
• Research interests can be evaluated from various perspectives:
(1) Total interests;
(2) Retained interests;
(3) Coauthor group retained interests.
11. Refinement with Retained Interests and Group Retained Interests
Requests were sent to 8 DBLP authors; 7 replied.
Participants (7 DBLP authors):
• Preference order, 100%: List 2, List 3 > List 1
• Preference order, 100%: List 2 ≈ List 3
• Preference order, 83.3%: List 2 > List 3 > List 1
• Preference order, 16.7%: List 3 > List 2 > List 1
13. Semantic Similarity
---- Obtaining More Accurate Interest Descriptions and Observations of Interest Dynamics
Figure 14. Consistent interests without consideration of semantic similarity.
Figure 15. Consistent interests with consideration of semantic similarity.
[Both figures show the interest terms of Ricardo A. Baeza-Yates and Carlos Castillo: Network, Link, Search, PageRank, Information, Retrieval, Web, Query, Content, Spam, Challenge, Engine, Mining, Analysis, Detection]
Table. Some examples of semantic similarities based on Normalized Google Distance.
Term 1 | Term 2 | Similarity
search | retrieval | 0.645
search | query | 0.552
search | pagerank | 0.813
retrieval | query | 0.467
retrieval | pagerank | 0.293
query | pagerank | 0.098
logic | reasoning | 0.667
logic | inference | 0.606
reasoning | inference | 0.909
ontology | OWL | 0.805
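For reference, the Normalized Google Distance between two terms can be computed from search hit counts as sketched below. The distance formula is the standard NGD definition; the conversion to a similarity score (1 minus the clipped distance) is only an assumption, since the slides do not say how the values in the table were derived from the distance. The function names and the hit counts in the example are hypothetical.

```python
import math

def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts: fx, fy for each term, fxy for both, n = index size."""
    if fxy == 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def similarity(fx, fy, fxy, n):
    """Hypothetical mapping to [0, 1]: 1 - NGD, clipped; not taken from the slides."""
    return max(0.0, 1.0 - normalized_google_distance(fx, fy, fxy, n))

# Illustrative hit counts only (not real search engine results).
ngd = normalized_google_distance(1000, 1000, 100, 10**6)
sim = similarity(1000, 1000, 100, 10**6)
same = similarity(1000, 1000, 1000, 10**6)
never_together = similarity(1000, 1000, 0, 10**6)
```

With a score of this kind, two authors' interest terms such as "search" and "retrieval" could be treated as consistent when their similarity exceeds a chosen threshold, which is one way to read the difference between Figures 14 and 15.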