Authors: Vacharasintopchai, Thiti and Nguyen-Huu, Phong
Issue Date: 11-Dec-2009
Type: Article
Series/Report no.: Proc. 2nd International Conference on Robotics, Informatics, and Intelligent Technology (RIIT2009);
Abstract: Social networks offer incredible opportunities for users to create contents and share their experiences. The number of users joining these social networks has been rising dramatically. However, in a social network several users may share the same name. This causes name ambiguity in which search engine returns homogeneous search results for each queried name. To solve this problem we propose an approach to improve search results for finding friends within a large social network by using friendships among users as our backbone feature. Our approach finds most ranked seeds by using PageRank algorithm before computing approximate shortest path in a directed graph. We also retrieve real data from the social network Twitter to verify our approach. Results show that our approach outperforms the SeedBase approach which selects seeds randomly with large margin.
URI: http://dspace.siu.ac.th/handle/1532/718
An Improved People-Search Technique for Directed Social Network Graphs
The 2nd International Conference on Robotics, Informatics, and Intelligent Technology (RIIT2009)
December 11-14, 2009 at Bangkok, Thailand
AN IMPROVED PEOPLE-SEARCH TECHNIQUE
FOR DIRECTED SOCIAL NETWORK GRAPHS
Thiti Vacharasintopchai, Nguyen Huu Phong
School of Technology, Shinawatra University
Pathumthani, Thailand 12160
Email: thitiv@siu.ac.th, phongnh174@yahoo.com.vn
ABSTRACT

Social networks offer incredible opportunities for users to create content and share their experiences. The number of users joining these social networks has been rising dramatically. However, in a social network several users may share the same name. This causes name ambiguity, in which a search engine returns homogeneous search results for each queried name. To solve this problem, we propose an approach to improve search results for finding friends within a large social network by using friendships among users as our backbone feature. Our approach finds the top-ranked seeds by using the PageRank algorithm before computing approximate shortest paths in a directed graph. We also retrieved real data from the social network Twitter to verify our approach. The results show that our approach outperforms the SeedBase approach, which selects seeds randomly, by a large margin.

Index Terms-- Search algorithm, social network analysis, authority analysis, shortest-path algorithm, graph algorithm

1. INTRODUCTION

Recently, social networks have seen explosive growth in popularity on the Web, and the number of people joining these networks is increasing significantly. These social networks assist users in creating a network of friends and help in maintaining relationships among long-distance friends, finding friends, and sharing information among networks. Moreover, in the very near future, social network sites will serve as an important source of knowledge and information [1].

Popular social networks on the Web include MySpace and Twitter. These sites assist users in creating and customizing their personal information, blogs, multimedia, groups and other features. MySpace began in July 2003 and was the largest social network in the world in November 2006, with more than 130 million users [2]. Twitter, a microblogging service, has increased its number of users significantly since it began in October 2006 [3].

Research demonstrates that the majority of users' activity on social networks is searching for friends with whom they already have offline relationships, in order to reconnect with them [4, 5]. Users can find their friends by providing their names in addition to other information. However, search results from large popular sites return longer lists of irrelevant users than one might imagine.

In this paper, we propose an approach to improve searching for friends in a social network. Our approach employs the PageRank algorithm to find seeds in order to compute approximate shortest paths within a social network. We use friendships among users as the backbone feature. We also conduct experiments to verify the effectiveness of the proposed approach.

The rest of this paper is organized as follows. First, we review previous developments of people-search techniques in Section 2. Then we present our approach to searching for friends in a social network in Section 3. Later, results are presented in Section 4. Finally, conclusions are discussed in Section 5.

2. RELATED WORK

The top web search providers such as Google and Yahoo offer standard search services where users can search by entering keywords. These may be the lyrics of a song, a movie trailer, the time of a fashion show, the title of a textbook, or the name of a friend. In the traditional method, search engines match the provided keyword to contents in their database and return to users a list of homogeneous search results.

In the web search and information retrieval area, the accuracy of search results provided by search engines is determined by a method called a ranking algorithm. PageRank is one of the well-known algorithms in this area [6, 7]. This algorithm takes the number of forward links and the number of back links of a web page as important factors to rank each web page [8]. In this way, the search engine returns to users an ordered list of ranked web pages for each particular keyword. To get better search results, a user can provide further information about the searched keyword. For instance, when finding a friend, users could add the name of the high school where they studied together and the school year to the input so that search engines are able to filter out irrelevant data and return more desirable results.

Furthermore, search results can be improved by using
implicit user information such as social annotation [7] and relationship queries [9, 10].

To improve web search results, the authors of [7] discuss two algorithms, namely SocialSimRank and SocialPageRank. The former is based on the observation that when users browse and annotate a web page, this can be a good indicator of the web page content [7]. The latter is based on another observation, that the number of users who annotate a web page can demonstrate the quality of the web page [7]. That research shows that both can improve web search significantly [7].

Other research observes that the top-ranked web page pairs could contain relationships between two entities, and that these relationships can be used to improve web search [10].

In the social network context, search engines could apply the same patterns as in the web search area for the particular purpose of people search, as mentioned in [5]. However, this approach could meet the same problem as in web search, which is returning the same search result for every user who provides the same keyword [5, 9].

In general, a social network can be represented as a graph in which vertices represent users and edges represent their relationships. The simplest form of relationship is friendship, where a user is a friend of another. In a social network, when users search for a person's name, they are likely to recognize people who have a closer relationship with them. In other words, the friend a person is looking for is more likely a person who has the "shortest path" to them in their relationship graph. Figure 1 shows an example of searching for a person's name in the social network. The user named Thuan searches for his friend whose name is Huyen. Two results are returned, in which the first person is at a distance of 1 and the second person is at a distance of 2. The correct result should be the first person since she is closer to Thuan than the second person.

Figure 1: Searching for People Name in a Social Network

In this case, the authors of [5] use the approximate shortest path in a friendship graph as a factor in their ranking algorithms. These algorithms include on-the-fly ranking and seed-based ranking. The former computes distances at scoring time by running Breadth-First Search (BFS) from a finder to all users [5]. The number of calculations is limited by stopping BFS after a desired number of hops. The latter uses seeds and computes approximate shortest distances from these seeds to all users [5]. The research demonstrates that the seed-based ranking algorithm outperforms other algorithms in terms of performance and precision [5].

In this section, we propose an approach to select better seeds than the approach discussed in Section 2, in which seeds are selected randomly before computing approximate shortest paths. In our approach, all vertices are first ranked by using the PageRank algorithm. Then these vertices are sorted in descending order of rank and the top-ranked vertices are selected as seeds. After that, these seeds are used to compute shortest paths from them to all vertices.

3.1 Seed Distances

According to the authors of [11, 12], for a given graph structure with $n$ vertices and $m$ edges, a query for the distance between any pair of vertices takes a smaller amount of time and space than computing all-pairs shortest paths when these distances are precomputed.

The authors of [5] applied the concept above by selecting a small fraction of vertices randomly. These seeds (landmarks) are used as navigational beacons in their friendship graph. Then shortest paths from these seeds to all vertices are computed. Later, the shortest path between any pair of vertices can be queried.

Figure 2 shows an example of the seed distance approach for the convenience of demonstration. Suppose that we need to compute the shortest path between Vertex 1 and Vertex 7. We also suppose that Vertex 5 is chosen as a seed. We first find the shortest path between Vertex 5 and Vertex 1; this shortest path is $D_{51} = 1$. In the same manner, the shortest path between Vertex 5 and Vertex 7 is $D_{57} = 1$. Finally, the shortest path between Vertex 1 and Vertex 7 is the sum of the above two shortest paths, which is $D_{17} = D_{51} + D_{57} = 2$. In this case, the shortest path is correct. However, consider another case in which we need to compute the shortest path between Vertex 1 and Vertex 2 and the seed is again Vertex 5. By using this approach, the shortest path between Vertex 1 and Vertex 2 is $D_{12} = D_{51} + D_{52} = 3$, which is incorrect.

Table 1 shows the pre-processing result from seed vector 1. From this table, we can find seed distances for any pair of vertices by computing the sum of their distances to the seed.
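The seed-distance estimate in the worked example can be sketched in a few lines of Python. The edge list below is hypothetical: it was chosen only so that the distances agree with the values quoted in the text ($D_{51} = 1$, $D_{57} = 1$, $D_{52} = 2$), and it is not taken from Figure 2. The function names are illustrative, and the graph is treated as undirected and unweighted, which is an assumption.

```python
from collections import deque

# Hypothetical undirected edge list, chosen only to match the distances
# quoted in the text; the real Figure 2 graph may differ.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (5, 7)]


def build_adjacency(edges):
    """Return an adjacency dict for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Exact hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def seed_estimate(seed_dist, u, v):
    """Approximate d(u, v) as d(seed, u) + d(seed, v)."""
    return seed_dist[u] + seed_dist[v]


adj = build_adjacency(EDGES)
d5 = bfs_distances(adj, 5)  # pre-computed once for seed 5

# Estimate versus exact distance for the two pairs discussed in the text.
print(seed_estimate(d5, 1, 7), bfs_distances(adj, 1)[7])  # 2 2
print(seed_estimate(d5, 1, 2), bfs_distances(adj, 1)[2])  # 3 1
```

The estimate is an upper bound on the true distance: it is exact when the seed lies on a shortest path between the two vertices (Vertex 1 and Vertex 7) and too large otherwise (Vertex 1 and Vertex 2).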
Figure 2: Example Graph of Vertices and Edges

In a social network with 100 million users, we would need up to $10^{16}$ computations to know the distances between all vertices. By using a small fraction of vertices as seeds, the required runtime and space can be reduced significantly.

Table 1: Example of Pre-processing Seed Distance

The approach in [5] selects the seeds randomly, which may yield lower accuracy than if better seeds are chosen. Therefore, we propose an approach that selects the most important seeds instead of choosing seeds randomly. Since our social network forms a directed graph, which has a feature in common with the web links used by the PageRank algorithm, we decided to use the PageRank algorithm for selecting seeds in our approach.

3.2 PageRank

The PageRank algorithm is used to rank web pages based on the number of forward links and back links of a web page [8, 13, 14, 15, 16]. The intuition of this algorithm is that when users link from a web page to other web pages, this could indicate endorsement of the web page content [13, 14]. We observed that in our social network Twitter, one tweeter may follow other tweeters (friends) and may be followed by others (followers). Therefore, applying the PageRank algorithm could help to find better seeds.

We use the PageRank algorithm in our approach to rank each tweeter in our social network, in which friends act as forward links and followers act as back links. According to [8, 15], the PageRank algorithm can be stated as follows. Given a graph of Twitter, the number of Followers ($F$) that follow a Tweeter ($T$) is denoted $n$. The parameter $d$ is the damping factor, which ranges between 0 and 1 and is set at 0.85. The number of Friends that a Tweeter follows is defined as $C(T)$. The PageRank of a Tweeter, $PR(T)$, is computed as follows:

$$PR(T) = (1 - d) + d \left( \frac{PR(F_1)}{C(F_1)} + \dots + \frac{PR(F_n)}{C(F_n)} \right) \qquad (1)$$

To compute the PageRank of all Tweeters, we first set the PageRank of each of them to one. Then we iterate over all tweeters and compute their PageRank by using Equation 1.

Figure 3 shows an example directed graph in Twitter. In this graph, the user named Thuan has five followers. Each of them also has some other followers. From Equation 1, Thuan has the highest rank since he is followed by many important followers. The next ranked is Huyen since she has more followers than Hanh and Thanh. The remaining followers are ranked equally.

Figure 3: Example Directed Graph in Twitter

3.3 Vectors Distances

The distance vectors of seeds consist of distances from a fraction of the vertices (the seeds) to all vertices. First, the vertices are ranked by using the PageRank algorithm as described in Section 3.2. Then they are sorted in descending order so that the highest-ranked ones come first. Next, a fraction of them is selected from the top as seeds. Later, seed distances from these selected seeds to all vertices are computed. The exact shortest path between two given vertices is computed using the classical Dijkstra's algorithm as described in [17, 18] instead of using BFS and Map-reduce computation as presented in [5]. Though there are several algorithms that achieve faster running time, such as implementing Dijkstra's algorithm with a Fibonacci heap [11, 19], this goes beyond our scope. Finally, the approximate shortest path between two given vertices is the smallest sum of shortest paths from these vertices to the selected seeds.

In Figure 2, suppose that we choose Vertex 1 and Vertex 5 as seeds and we need to find the approximate shortest
path between two vertices, Vertex 5 and Vertex 7. The seed distances from Vertex 1 to all vertices are $D_1 = [0, 1, 1, 1, 1, 1, 2]$. Also, the distances from Vertex 5 to all vertices are $D_5 = [1, 2, 2, 2, 0, 2, 1]$. The approximate shortest paths between the two vertices using the two seed distances are 3 and 1, respectively. The approximate shortest path as described above is the smallest distance, 1.

3.4 Experimental Setup

First, we selected a pair of vertices randomly since we did not have access to data logs from the data resource (Section 3.5) for name queries. We then computed the approximate shortest path between them using our approach. This result was compared to the on-the-fly ranking as described in [5] since it yields 100% accuracy.

In order to assess the performance of our approach, we implemented the seed-based ranking algorithm (SeedBase) as described in [5]. For comparative purposes, we modified this algorithm by replacing its combination of BFS and Map-reduce with Dijkstra's algorithm.

We ran each experiment 10 times to compute accuracies and running times. We performed the experiments on a virtual machine with a 1.8 GHz processor and 512 MB RAM.

3.5 Data Collection

We evaluated the accuracy of our approach with real data from the social network Twitter. These data were gathered by using the snowball method described by [2]. The algorithm was executed through several steps: selecting tweeters as initial seeds, then running a BFS over all of their friends until it reaches a desired number of hops.

First, some tweeters were selected as initial seeds. These tweeters were retrieved from the Twitter public timeline, in which a list of 20 tweeters was generated randomly. As a result, these tweeters may not be connected to each other. Since the focus of this research is to examine the relationships among users, we decided to pick only one tweeter from the public timeline at a time.

Second, the number of friends of each chosen tweeter varies. Twitter limits the number of tweeters that one can follow to 2000. Even though this limitation can be lifted by increasing the number of one's followers, it can be observed that tweeters who exceed it are very likely to be representatives of an organization rather than individuals. For this reason, these tweeters are not retrieved.

Finally, the number of hops can also be varied, depending on the maximum desired size of the community. For example, if each tweeter has 10 friends and the number of hops is equal to 5, the maximum size of this community is $1 \times 10^5$.

4. EXPERIMENTS AND RESULTS

In this section, we present our results and a discussion of the two methods: SeedBase and PageRank.

In Experiment 1, two sub-experiments were conducted using two different datasets. The maximum size of each dataset was set to 125. The first dataset contained 87 vertices and 103 edges. The second dataset contained 107 vertices and 120 edges. The number of vertices was less than the maximum of 125 since some tweeters had protected data which are only accessible by those in their friend lists. The number of seeds was varied from 1 to 10 in increments of 1. The mean accuracies of SeedBase and PageRank were compared to the on-the-fly ranking, which yields 100% accuracy. The outcomes are presented in Table 2 and Table 3, respectively. These data were also plotted in Figure 4 for the convenience of comparison.

Table 2: Accuracies of SeedBase and PageRank from Experiment 1 Dataset 1

#seed   SeedBase (%)   PageRank (%)
1       14             46
2       14             78
3       21             74
4       20             83
5       27             85
6       38             80
7       35             81
8       37             95
9       47             95
10      34             99

Table 3: Accuracies of SeedBase and PageRank from Experiment 1 Dataset 2

#seed   SeedBase (%)   PageRank (%)
1       4              41
2       15             41
3       21             52
4       17             81
5       26             78
6       30             83
7       27             79
8       43             86
9       39             88
10      41             87

With both datasets, the accuracies of both PageRank and SeedBase went higher as the number of seeds increased. However, PageRank outperformed SeedBase by a large margin. Even with only one seed given,
PageRank's accuracy was between 40% and 50%, whereas the accuracy of SeedBase was below 20%. PageRank's accuracy increased slowly at first, then grew rapidly after two or four seeds. Then, the performance kept increasing slowly and reached above 90% accuracy when the number of seeds was between eight and ten. SeedBase's accuracy increased constantly as the number of seeds was increased but reached only about 40%, which is much lower than that of PageRank.

Figure 4: Accuracies of SeedBase and PageRank from Experiment 1 Dataset 1 and Dataset 2

In Experiment 2, two different datasets were also used. However, the maximum size of each dataset was increased to 1000. The first dataset contained 181 vertices and 230 edges. The second dataset contained 482 vertices and 550 edges. The number of seeds was varied from 1 to 25 in increments of 1. The mean accuracies of SeedBase and PageRank were compared to the on-the-fly ranking. The results are plotted in Figure 5.

Figure 5: Accuracies of SeedBase and PageRank from Experiment 2 Dataset 1 and Dataset 2

In these experiments, the accuracies of both PageRank and SeedBase were also higher as the number of seeds increased. PageRank's accuracies were between 18% and 30% when the first seed was given, whereas the accuracy of SeedBase was lower than 10%. PageRank's accuracy increased constantly at first, then grew rapidly after seven or thirteen seeds, and then kept increasing slowly until it reached above 90% when the number of seeds was between nineteen and twenty-five. SeedBase's accuracy also increased constantly but reached only about 66% and 32%.

In summary, with all datasets PageRank outperformed SeedBase significantly. In addition, it can be seen that the accuracies of both PageRank and SeedBase trended upward as the number of seeds went higher.

Our Twitter network forms a directed graph where the directions from one tweeter to others are ordered. As a result, a tweeter has a higher rank than others when many highly ranked tweeters follow that tweeter. Our results are also in agreement with the results from [20], where the centrality method is used for choosing seeds (landmarks) in undirected graphs, and where vertices at the center of the graph, with many shortest paths going through them, are important.

The PageRank method takes a longer runtime than SeedBase. The reason is that, from the PageRank Equation 1, each vertex may be traversed several times to rank all vertices before picking seeds, so that in the worst case the runtime is $O(n^2)$, whereas in SeedBase the runtime is constant, $O(1)$, since it is spent only on picking seeds. However, in our social network Twitter, the number of friends that a tweeter follows is many times smaller than the number of all tweeters, so our approach is reasonable and effective.

5. CONCLUSIONS

In this research, the approximate shortest path between tweeters in Twitter is used as our backbone factor in ranking search results. We have applied our strategy by using the PageRank algorithm to select the most important tweeters before computing approximate shortest paths among them.

In terms of accuracy, the results show that our strategy outperforms the SeedBase method in [5], which selects seeds randomly, by a large margin. The results also show that high accuracy can be achieved with a small fraction of seeds. Our approach uses a small number of seeds (about 2%-5%) but yields very high accuracy. Applying this approach in social networks will make the search results for finding friends more effective. Future work includes reducing the preprocessing time by speeding up the seed ranking process. The implemented source code in the PHP programming language is made available.
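As a rough illustration of the approach summarized above, the following Python sketch ranks tweeters with a PageRank iteration, takes the top-ranked ones as seeds, and estimates a distance as the smallest sum of seed distances. It is not the authors' PHP implementation. Several things are assumptions: the update rule is taken to be the Brin and Page form cited as [15], with all ranks initialized to one and $d = 0.85$; the `FOLLOWS` list, the handles in it, and all function names are hypothetical; distances are measured on the undirected version of the follow graph; and plain BFS is swapped in for Dijkstra's algorithm, which gives the same result on unweighted edges.

```python
from collections import deque

# Hypothetical (follower, followed) pairs; handles are illustrative only.
FOLLOWS = [("huyen", "thuan"), ("hanh", "thuan"), ("thanh", "thuan"),
           ("u1", "huyen"), ("u2", "huyen"), ("u3", "hanh")]


def pagerank(follows, d=0.85, iterations=50):
    """Iterate PR(T) = (1 - d) + d * sum(PR(F) / C(F)) over followers F of T."""
    users = {u for pair in follows for u in pair}
    followers = {u: [] for u in users}   # back links: who follows u
    friends = {u: 0 for u in users}      # C(u): how many tweeters u follows
    for follower, followed in follows:
        followers[followed].append(follower)
        friends[follower] += 1
    rank = {u: 1.0 for u in users}       # every rank starts at one
    for _ in range(iterations):
        rank = {u: (1 - d) + d * sum(rank[f] / friends[f] for f in followers[u])
                for u in users}
    return rank


def top_seeds(rank, k):
    """Return the k highest-ranked users, best first."""
    return sorted(rank, key=rank.get, reverse=True)[:k]


def hop_distances(follows, source):
    """BFS hop counts from source, ignoring edge direction (an assumption)."""
    adj = {}
    for a, b in follows:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def approx_distance(seed_dists, u, v):
    """Smallest d(seed, u) + d(seed, v) over all seeds that reach both."""
    return min((dist[u] + dist[v] for dist in seed_dists
                if u in dist and v in dist), default=float("inf"))


rank = pagerank(FOLLOWS)
seeds = top_seeds(rank, 2)
seed_dists = [hop_distances(FOLLOWS, s) for s in seeds]
print(seeds)                                      # ['thuan', 'huyen']
print(approx_distance(seed_dists, "u1", "hanh"))  # 3
```

In this toy graph the most-followed user becomes the first seed, and because that seed lies on the path between `u1` and `hanh`, the estimate equals the true distance. A randomly chosen seed such as `u3` would give a larger estimate for the same pair.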
6. REFERENCES

[1] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, "Measurement and analysis of online social networks", Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC '07, pp. 29-42, 2007.

[2] Y. Ahn, S. Han, H. Kwak, S. Moon and H. Jeong, "Analysis of topological characteristics of huge online social networking services", Proceedings of the 16th International Conference on World Wide Web, WWW '07, pp. 835-844, 2007.

[3] A. Java, X. Song, T. Finin, and B. Tseng, "Why we twitter: understanding microblogging usage and communities", Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, WebKDD/SNA-KDD '07, pp. 56-65, 2007.

[4] A. N. Joinson, "Looking at, looking up or keeping up with people?: motives and use of facebook", Proceeding of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, CHI '08, pp. 1027-1036, 2008.

[5] M. V. Vieira, B. M. Fonseca, R. Damazio, P. B. Golgher, D. d. Reis and B. Ribeiro-Neto, "Efficient search ranking in social networks", Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, pp. 563-572, 2007.

[6] E. Amitay, D. Carmel, N. Har'El, S. Ofek-Koifman, A. Soffer, S. Yogev and N. Golbandi, "Social search and discovery using a unified approach", Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, HT '09, pp. 199-208, 2009.

[7] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su, "Optimizing web search using social annotations", Proceedings of the 16th International Conference on World Wide Web, WWW '07, pp. 501-510, 2007.

[8] L. Page, S. Brin, R. Motwani and T. Winograd, "The pagerank citation ranking: Bringing order to the web", Technical report, Stanford Digital Library Technologies Project, 1998.

[9] D. V. Kalashnikov, R. Nuray-Turan and S. Mehrotra, "Towards breaking the quality curse: a web-querying approach to web people search", Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pp. 27-34, 2008.

[10] G. Luo, C. Tang and Y. Tian, "Answering relationship queries on the web", Proceedings of the 16th International Conference on World Wide Web, WWW '07, pp. 561-570, 2007.

[11] M. Thorup and U. Zwick, "Approximate distance oracles", Journal of ACM 52, 1, pp. 1-24, 2005.

[12] S. Baswana and S. Sen, "Approximate distance oracles for unweighted graphs in expected O(n^2) time", ACM Transactions on Algorithms 2, 4, pp. 557-577, 2006.

[13] L. Ding, T. Finin, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. Doshi and J. Sachs, "Swoogle: a search and metadata engine for the semantic web", Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, pp. 652-659, 2004.

[14] M. Richardson, A. Prakash and E. Brill, "Beyond PageRank: machine learning for static ranking", Proceedings of the 15th International Conference on World Wide Web, WWW '06, pp. 707-715, 2006.

[15] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Comput. Netw. ISDN Syst. 30, 1-7, pp. 107-117, 1998.

[16] Y. Zhang, L. Zhang, Y. Zhang, X. Li, "XRank: Learning More from Web User Behaviors", Computer and Information Technology, International Conference on, pp. 36, Sixth IEEE International Conference on Computer and Information Technology (CIT'06), 2006.

[17] T. G. Michael and T. Roberto, "Data Structure and Algorithms in Java", John Wiley & Son, Inc., ISBN-0471644528, 2004.

[18] E. Dijkstra, "A note on two problems in connexion with graphs", Numerische Mathematik, 1: pp. 269-271, 1959.

[19] M. Holzer, F. Schulz, D. Wagner and T. Willhalm, "Combining speed-up techniques for shortest-path computations", ACM Journal on Experimental Algorithmics 10, 2.5, 2005.

[20] P. Michalis, B. Francesco, C. Carlos and G. Aristides, "Fast Shortest Path Distance Estimation in Large Networks", to appear in Proceedings of the Eighteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '09, 2009.