Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Networks
Apr 9, 2013
Jinha Kim, Seung-Keol Kim, Hwanjo Yu
Pohang University of Science and Technology (POSTECH)

Micro Level (CWW 10)
• Cannot count influence propagation routes between two nodes

Evaluating σ(S)
• Monte-Carlo simulation (KKT 03)
• Simultaneous simulation (CWY 09)
• Breaking down a graph into communities (WCS 10)
• Shortest path between two nodes (KS 06)
• Local arborescence based on the most probable path (CWW 10)

Intuition
• How about extremely localizing influence?
• Influence paths between two nodes as the influence evaluation unit
• Considering all paths is not tractable (#P-hard)
• Only consider meaningful influence paths

Parallel evaluation
• To approximate σ({v}), P_{v→V} is required
• For v≠u, P_{v→V} and P_{u→V} do not have common paths
• Independent evaluation of σ({v}) is guaranteed
[Figure: two separate path groups, v1 with u11, ..., u1n and v2 with u21, ..., u2n]

Approximating σ(S ∪ {v}) - σ(S) is not trivial
• σ({v}) ≠ σ(S ∪ {v}) - σ(S)
• Influence blocking!
• v blocks a path from u ∈ S
• We should detect blocked (invalid) paths
[Figure: a path through u and v, before and after v is added as a seed]

Approximating σ(S ∪ {v}) - σ(S)
• Marginal influence of a node v
• Influence of v on itself
• Influence of seeds S on a node v
• Only consider valid paths

• KKT 03: Kempe, D., Kleinberg, J., and Tardos, E. Maximizing the spread of influence through a social network. (KDD ’03)
• KS 06: Kimura, M., and Saito, K. Tractable models for information diffusion in social networks. (PKDD ’06)
• LKG 07: Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., and Glance, N. Cost-effective outbreak detection in networks. (KDD ’07)
• CWY 09: Chen, W., Wang, Y., and Yang, S. Efficient influence maximization in social networks. (KDD ’09)
• CWW 10: Chen, W., Wang, C., and Wang, Y. Scalable influence maximization for prevalent viral marketing in large-scale social networks. (KDD ’10)
• WCS 10: Wang, Y., Cong, G., Song, G., and Xie, K. Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. (KDD ’10)
• JSC 11: Jiang, Q., Song, G., and Cong, G. Simulated annealing based influence maximization in social networks. (AAAI ’11)
• LYK 12: Lee, W., Kim, J., and Yu, H. CT-IC: Continuously activated and Time-restricted Independent Cascade model for viral marketing. (ICDM ’12)

Editor's Notes
Hello, my name is Jinha Kim, and I will present our research on scalable and parallelizable processing of influence maximization for large-scale social networks. This is joint work with Seung-Keol Kim and my advisor, Hwanjo Yu.
The goal of this work is to devise a method that efficiently evaluates influence, which is the most time-consuming part of the influence maximization problem.
This diagram outlines this talk. First, how viral marketing exploits the social network is shown briefly. Then, to find the most effective users in viral marketing, the influence maximization problem is formulated. To concretize the problem, how social networks are abstracted as graphs and how influence is propagated throughout graphs are described briefly. Finally, how the influence maximization problem is solved using our method is described in detail.
Let me show how viral marketing works in social networks.
In social networks, a user’s opinion spreads throughout the network. For example, a Twitter user writes an impressive post, and his or her followers may retweet it as a sign of agreement. Then the followers of those followers may retweet it again, and this chain reaction affects the whole network. This is called the ‘word of mouth’ effect.
To exploit the word-of-mouth effect, marketers persuade some influential users and hope that their positive opinion spreads through the social network. This is how viral marketing works in social networks. Therefore, for the marketing to succeed, finding the most influential people is the most crucial task.
Then an important question arises: how can we find such users algorithmically?
The question is formulated as the influence maximization problem.
First, the influence should be quantified. Given a user subset S, the function sigma of S returns the expected number of users influenced by S. This is the quantified influence in the network.
Then, the influence maximization problem is formulated as a combinatorial optimization expression.
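The expression itself is on the slide and is not reproduced in these notes. As a hedged reconstruction, the standard formulation from KKT 03 is written below. The seed budget k is an assumption here, since the notes do not name it.

```latex
S^{*} = \operatorname*{arg\,max}_{S \subseteq V,\ |S| = k} \sigma(S)
```

Here V is the set of users, and sigma of S is the expected number of users influenced by S, as defined above.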
To define sigma of S concretely, a graph and an influence diffusion process should be modeled.
A social network can be abstracted as a weighted directed graph.
For example, assume that on Facebook a user ‘v’ likes a post of his or her friend ‘u’. In the corresponding graph, users ‘u’ and ‘v’ become nodes, their friendship becomes an edge, and how much user ‘v’ likes the posts of friend ‘u’ becomes the weight of the edge.
A diffusion model should also be defined.
Given a graph, the quantified influence sigma of S depends on how influence is propagated. In this research, our method is based on the independent cascade model which is simple but well-established. In the next few slides, I will explain how the independent cascade model works in an inductive way.
At time zero, several seed nodes are activated by the marketers, and all the other nodes are inactive.
At time i plus one, as shown in the figure, each node that was activated at time i has one chance to activate each of its inactive out-neighbors. In contrast, nodes that were activated before time i have no such chance.
The influence propagation continues until no more nodes are activated. If no nodes are activated at time j, then at time j plus one no inactive node has a chance to be activated.
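The three steps above can be sketched as one simulated cascade. This is a minimal illustration, not the authors' code. The graph encoding (a dict mapping each node to its out-neighbors and edge weights), the name `simulate_ic`, and the toy weights are all assumptions.

```python
import random

def simulate_ic(graph, seeds, rng):
    """Run one independent cascade and return the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)  # nodes activated at the current time step
    while frontier:         # stop once a time step activates nobody
        next_frontier = []
        for u in frontier:
            for v, w in graph.get(u, {}).items():
                # u gets exactly one chance to activate each inactive out-neighbor
                if v not in active and rng.random() < w:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Hypothetical toy graph: weights 1.0 and 0.0 make the outcome deterministic.
graph = {"a": {"b": 1.0, "c": 0.0}, "b": {"d": 1.0}}
activated = simulate_ic(graph, ["a"], random.Random(0))
```

Because `frontier` only holds the nodes activated in the previous step, older active nodes never get a second activation attempt.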
After defining the graph and the diffusion model, the influence maximization problem can be solved.
Before describing our method, let me show two challenges that the influence maximization processing confronts.
At the macro level,
The optimization problem itself is NP-hard. Intuitively, finding the optimal solution requires finding the best among all possible combinations of seed nodes. The set cover problem can be reduced to it, which proves that it is NP-hard.
To sidestep the NP-hardness, a greedy algorithm was proposed in the seminal paper on the influence maximization problem. The greedy algorithm repeatedly chooses the node that gives the largest marginal influence increase over the current seed set. In the greedy algorithm, the influence of each node and the marginal influence increase are the two major evaluation components. However, evaluating the exact influence is also hard.
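A minimal sketch of the greedy loop follows, assuming the influence function is passed in as a callable. The names `greedy_seeds`, `coverage`, and `reach` are hypothetical. The toy coverage function only stands in for sigma and is not a diffusion model.

```python
def greedy_seeds(nodes, k, influence):
    """Pick up to k seeds, each time adding the node with the largest marginal gain."""
    seeds = []
    for _ in range(k):
        base = influence(seeds)
        best, best_gain = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            gain = influence(seeds + [v]) - base  # marginal influence increase
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # fewer than k candidate nodes
            break
        seeds.append(best)
    return seeds

# Toy stand-in for sigma(S): the number of distinct users reached by the seeds.
reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}

def coverage(seeds):
    return len(set().union(*(reach[s] for s in seeds)))

chosen = greedy_seeds(list(reach), 2, coverage)
```

Every round calls `influence` once per remaining candidate, which is why the cost of a single influence evaluation dominates the running time.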
We call it the micro level challenge of the influence maximization.
The influence evaluation itself is #P-hard. #P is a class of counting problems, and no polynomial-time algorithm is known for a #P-hard problem. From the influence evaluation perspective, this is related to the fact that we cannot count all influence propagation paths, even between two nodes, in polynomial time.
To overcome the micro-level challenge, several methods have been proposed. In the seminal paper on influence maximization, influence is evaluated using Monte-Carlo simulation, in which the actual diffusion process is repeated over ten thousand times and the average number of activated nodes is taken as the influence. However, Monte-Carlo simulation takes too much time. To reduce the evaluation time, local structures such as the shortest path between two nodes or local arborescences are used.
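The Monte-Carlo evaluation described above can be sketched as follows. The function name, the graph encoding, and the toy graphs are assumptions for illustration. The cascade uses a stack instead of explicit time steps. That is enough for counting, because each node still tries each out-neighbor only once.

```python
import random

def estimate_influence(graph, seeds, runs, rng):
    """Average the number of activated nodes over `runs` simulated cascades."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v, w in graph.get(u, {}).items():
                if v not in active and rng.random() < w:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / runs

# Deterministic toy graph: a always activates b, b never activates c.
fixed = {"a": {"b": 1.0}, "b": {"c": 0.0}}
exact = estimate_influence(fixed, ["a"], 100, random.Random(0))

# Probabilistic toy graph: the estimate should approach 1.5 as runs grow.
coin = {"a": {"b": 0.5}}
noisy = estimate_influence(coin, ["a"], 20000, random.Random(0))
```

Each estimate costs `runs` full cascades, and the greedy algorithm needs one estimate per candidate node per round. This explains the long running times reported for this approach.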
Along with these methods, we propose a more efficient influence approximation heuristic, IPA.
To evaluate influence efficiently, existing methods confine the influence diffusion locally. Our intuition is to localize influence to the extreme. This leads us to use the set of all meaningful paths between two nodes as the influence evaluation unit. The word ‘meaningful’ in this context is formally defined in the next slide.
When an influence path is a sequence of nodes, its influence propagation probability ipp(.) is defined as the product of the edge weights along the sequence. We only consider influence paths whose ipp is no less than the pre-defined threshold theta. For example, assuming that all edge weights are 0.1 and the threshold is 0.001, we consider influence paths of length up to three (three edges), and longer paths are ignored.
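The worked example above can be checked with a short sketch. The node names and the edge-weight dictionary are hypothetical. `Fraction` is used only so that the comparison at the threshold boundary is exact, with no floating-point rounding.

```python
from fractions import Fraction

def ipp(path, weight):
    """Influence propagation probability: product of edge weights along the path."""
    prob = Fraction(1)
    for u, v in zip(path, path[1:]):
        prob *= weight[(u, v)]
    return prob

theta = Fraction(1, 1000)  # threshold 0.001
w = Fraction(1, 10)        # every edge weight is 0.1
weight = {("a", "b"): w, ("b", "c"): w, ("c", "d"): w, ("d", "e"): w}

three_edges = ipp(["a", "b", "c", "d"], weight)      # 0.1 ** 3
four_edges = ipp(["a", "b", "c", "d", "e"], weight)  # 0.1 ** 4
kept = three_edges >= theta     # meaningful: ipp is no less than theta
dropped = four_edges < theta    # ignored: ipp falls below theta
```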
With the definition of the meaningful influence paths, let us see how meaningful influence paths are collected and organized to evaluate the single node influence. Suppose that a graph is given as in the left figure. IPA first traverses the graph from each node in a breadth-first way. The right figure is the result of the traversal from node a. A branch of the traversal stops when a cycle is detected or ipp() becomes less than the threshold.
After the traversal, IPA extracts the influence paths from the tree. The influence paths are all the paths from the root to each non-root node in the traversal tree. From the traversal tree in the left figure, ten paths that start from node ‘a’ are extracted. We call this path set P sub a to V.
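A hedged sketch of this collection step follows. It does a breadth-first expansion from one source, pruning a branch when it revisits a node or when ipp drops below theta. The function name, the path encoding (a tuple of nodes paired with its ipp), and the toy graph are assumptions. The graph in the slide figure is not available here.

```python
from collections import deque

def collect_paths(graph, source, theta):
    """Return all meaningful influence paths from `source` as (path, ipp) pairs."""
    paths = []
    queue = deque([((source,), 1.0)])
    while queue:
        path, prob = queue.popleft()
        for nxt, w in graph.get(path[-1], {}).items():
            if nxt in path:       # cycle detected: stop this branch
                continue
            p = prob * w
            if p < theta:         # ipp below the threshold: stop this branch
                continue
            extended = path + (nxt,)
            paths.append((extended, p))  # root-to-node path in the traversal tree
            queue.append((extended, p))
    return paths

# Hypothetical toy graph with a cycle a -> c -> a.
graph = {"a": {"b": 0.5, "c": 0.5}, "b": {"c": 0.5}, "c": {"a": 0.5}}
paths_from_a = collect_paths(graph, "a", theta=0.2)
```

Running the same function from every node yields the path sets grouped by starting node.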
For each node, the graph traversal is conducted and all influence paths are collected. The paths are grouped by their starting nodes.
Now, IPA can approximate the influence of a single node. The hat symbol indicates that the value is an approximation. The influence of a single node ‘v’ is the sum of one, which is the influence of ‘v’ on itself, and the sum of the influence from ‘v’ to the nodes in its reaching area, O sub v. The reaching area of ‘a’ in the example is b, c, d and e. The influence between two nodes is defined as the complement of the probability that none of the paths between them influences the sink node.
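Read literally, this description gives 1 plus, for every node u in the reaching area, 1 minus the product of (1 - ipp(p)) over the paths p from v to u. The sketch below implements that reading. The function name and the sample paths are hypothetical, and the exact formula on the slide is not reproduced in these notes.

```python
from collections import defaultdict

def single_node_influence(paths_from_v):
    """Approximate the influence of v from its (path, ipp) pairs."""
    miss = defaultdict(lambda: 1.0)  # probability that no path influences the sink
    for path, p in paths_from_v:
        miss[path[-1]] *= 1.0 - p
    # 1 for v itself, plus the complement of `miss` for every reached node
    return 1.0 + sum(1.0 - m for m in miss.values())

# Hypothetical paths from node a; its reaching area here is {b, c}.
paths_from_a = [(("a", "b"), 0.5), (("a", "c"), 0.5), (("a", "b", "c"), 0.25)]
score = single_node_influence(paths_from_a)
```

In this example, b contributes 0.5 and c contributes 1 - 0.5 × 0.75 = 0.625, so the score is 2.125.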
The parallel evaluation of single node influence is simple. To approximate the influence of a node v, P sub v to capital V is required. For two different nodes u and v, P sub v to capital V and P sub u to capital V do not have common paths. Therefore, the influence of each single node can be evaluated independently and in parallel.
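Because the path sets are disjoint, each node can be handed to a separate worker with no shared state. Below is a minimal sketch using Python's standard `concurrent.futures`. The names and toy data are assumptions. In CPython, threads show the structure rather than guarantee a speed-up, and the threading technology used by the authors is not stated in these notes.

```python
from concurrent.futures import ThreadPoolExecutor

def influence_of(paths):
    """Influence of one source node, computed only from its own (path, ipp) pairs."""
    miss = {}
    for path, p in paths:
        miss[path[-1]] = miss.get(path[-1], 1.0) * (1.0 - p)
    return 1.0 + sum(1.0 - m for m in miss.values())

def evaluate_all(paths_by_source, workers=4):
    """Evaluate every source node independently; no locking is needed."""
    sources = list(paths_by_source)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = pool.map(influence_of, (paths_by_source[v] for v in sources))
        return dict(zip(sources, scores))

# Hypothetical path sets grouped by starting node.
paths_by_source = {
    "a": [(("a", "b"), 0.5)],
    "b": [(("b", "a"), 0.25), (("b", "c"), 0.5)],
    "c": [],
}
scores = evaluate_all(paths_by_source)
```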
Up to now, IPA has evaluated the single node influence. To evaluate the marginal influence increase, IPA reorganizes the paths. In the single node influence evaluation, paths are grouped by their starting nodes. Now, paths are regrouped by their ending nodes. By reorganizing, IPA can efficiently evaluate the marginal influence increase in parallel.
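The regrouping amounts to re-keying the same paths by their last node. Here is a small sketch under the same assumed (path, ipp) encoding, with hypothetical names and data.

```python
from collections import defaultdict

def group_by_end(paths_by_source):
    """Re-key paths grouped by starting node into paths grouped by ending node."""
    by_end = defaultdict(list)
    for paths in paths_by_source.values():
        for path, p in paths:
            by_end[path[-1]].append((path, p))
    return dict(by_end)

paths_by_source = {
    "a": [(("a", "b"), 0.5), (("a", "b", "c"), 0.25)],
    "b": [(("b", "c"), 0.5)],
}
paths_by_end = group_by_end(paths_by_source)
```

After regrouping, everything needed to judge how the seeds influence one target node sits in one bucket, so targets can be processed independently.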
Now we reach the marginal influence increase evaluation phase. It is complicated because the marginal increase, sigma of S union v minus sigma of S, is not equal to the influence of the new seed candidate v alone. For example, before v is added as a seed, a path that runs from u through v to the remaining nodes is valid. However, after v is added, such a path becomes meaningless because an activation trial from u to v is impossible in the independent cascade model. We call this influence blocking, and such invalid paths should be detected.
For the current seed set S, among the paths that start from seed nodes, that is, the paths of P sub capital S to capital V, all paths that contain v are invalidated. For the new seed node v, among the paths that start from v, that is, the paths of P sub v to capital V, all paths that contain any current seed node are invalidated. In sum, a valid path contains exactly one seed node, which is its starting node.
Let us see how invalid paths are detected. Suppose that the current seed set consists only of ‘a’ and that ‘d’ is added as a new seed candidate. The left figure shows that before adding d, only the paths from a to e are valid. After adding ‘d’ to the seed set, five paths become candidate paths.
However, adding a new seed makes some paths invalid. For example, d blocks the influence of a in the path (a,d,e), and a blocks the influence of d in the path (d,a,c,e). In the end, only three paths are valid and are used to evaluate the marginal influence increase.
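The validity rule can be sketched as a one-line predicate: a path is valid when it starts at a seed and contains no other seed. The two blocked paths below come from the notes. The other three candidates are hypothetical, because the figure is not available here.

```python
def is_valid(path, seeds):
    """A valid path starts at a seed and contains no seed anywhere else."""
    return path[0] in seeds and not any(node in seeds for node in path[1:])

seeds = {"a", "d"}
candidates = [
    ("a", "c", "e"),       # hypothetical
    ("a", "b", "e"),       # hypothetical
    ("a", "d", "e"),       # from the notes: d blocks the influence of a
    ("d", "e"),            # hypothetical
    ("d", "a", "c", "e"),  # from the notes: a blocks the influence of d
]
valid = [p for p in candidates if is_valid(p, seeds)]
```

With these five candidates, three paths survive the check, matching the count given in the notes.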
Using the valid paths, the marginal influence increase is evaluated. The marginal influence increase is the sum of one, which is the influence of the new seed v on itself, and the sum of the marginal influence increases from the seed nodes to the nodes in v’s reaching area. The marginal influence increase of the seed nodes on a node u in v’s reaching area is the complement of the probability that none of the valid paths from the seed nodes influences u. As in the single node influence evaluation, the part in the green box is also parallelizable. This is all about how IPA evaluates influence.
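The exact formula is on the slide and is not reproduced in these notes, so the following is only a simplified sketch, not the IPA formula. It scores each target node by the complement of the probability that no valid path reaches it. It then sums the change in that score when v joins the seed set. The function names, the choice to give seeds a score of 1, and the toy data are all assumptions.

```python
def activation(paths_to_u, u, seeds):
    """Approximate probability that u is influenced, using valid paths only."""
    if u in seeds:
        return 1.0
    miss = 1.0
    for path, p in paths_to_u:
        # valid path: starts at a seed and contains no other seed
        if path[0] in seeds and not seeds.intersection(path[1:]):
            miss *= 1.0 - p
    return 1.0 - miss

def marginal_gain(paths_by_end, seeds, v):
    """Change in the summed activation scores when v is added to the seeds."""
    new_seeds = seeds | {v}
    targets = set(paths_by_end) | {v}
    return sum(
        activation(paths_by_end.get(u, []), u, new_seeds)
        - activation(paths_by_end.get(u, []), u, seeds)
        for u in targets
    )

# Hypothetical paths grouped by ending node, with made-up ipp values.
paths_by_end = {
    "c": [(("a", "c"), 0.5), (("d", "a", "c"), 0.25)],
    "d": [(("a", "d"), 0.5)],
    "e": [(("a", "c", "e"), 0.25), (("a", "d", "e"), 0.25),
          (("d", "e"), 0.5), (("d", "a", "c", "e"), 0.125)],
}
gain = marginal_gain(paths_by_end, {"a"}, "d")
```

In this example the gain is 0.6875: 0.5 from d itself, which the seed a already reached with probability 0.5, plus 0.1875 from e. Each term of the sum depends on one target node only, so the terms could be computed in parallel.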
Now, let me show the empirical evaluation results of our method.
Five publicly available real datasets are used. The number of nodes ranges from 75 thousand to 5 million, and the number of edges ranges from 500 thousand to 70 million.
Along with IPA, four other influence evaluation methods are used. Monte-Carlo is the Monte-Carlo simulation method used in the seminal paper on the problem; the number of repetitions is 20,000. PMIA is the state-of-the-art influence evaluation method, which exploits the local arborescence structure. SD (single discount) is an influence evaluation method that relies only on the graph structure and not on the influence diffusion model. Random selects seeds at random. All five influence evaluation methods are plugged into the greedy algorithm.
First, we should find the threshold for each dataset for IPA and PMIA. As shown in the figure, although short processing time and high influence are both desirable, there is a trade-off between them. Thus, we choose the elbow point, at which neither is sacrificed for the other.
This plot shows the log-scaled processing time of the five methods. Greedy is slow: on Patent and LiveJournal, it could not finish within one hundred thousand seconds. Single discount and Random are trivially fast because they do not consider the influence diffusion, but the influence of their solutions is not good. IPA shows an order of magnitude shorter processing time than PMIA, which is the state of the art. PMIA did not finish on LiveJournal due to a memory problem.
Along with the processing time, we also evaluate how quickly the next seed node is produced after the first seed node is found. As shown in the plots, IPA
These plots show the influence of the solutions of the five methods on the five datasets. In terms of the influence of the seed nodes, greedy is trivially the best because it repeats the influence diffusion simulation until a stable influence is acquired. On Epinion, both IPA and PMIA show influence close to that of greedy. Single discount and Random show low influence. On Stanford, IPA loses only 8% of influence compared to greedy, but PMIA loses over 20%.
On DBLP, IPA shows slightly lower influence than PMIA, but the difference is small compared to that on the Stanford dataset.
On Patent, IPA shows more influence as the number of seed nodes increases. On LiveJournal, only IPA produces meaningful influence.
Finally, we report the parallelization effect. It is measured by the speed-up, which is the ratio of the processing time of single-threaded IPA to that of multi-threaded IPA. As shown in the figure, the speed-up is larger when the dataset is bigger.
That’s it. This is the end of this talk. Any questions?