Jaideep

Social Analytics
Mining Behaviors of a Connected
World
PAKDD School
April 11, 2013
Sydney, Ausytralia

Jaideep Srivastava
University of Minnesota
srivasta@cs.umn.edu

4/9/2013 University of Minnesota 1

Course Outline
• Module 1
• Introduction to Social Analytics – applying data mining to social computing
systems; examples of a number of social computing systems, e.g.
FaceBook, MMO games, etc.
• Module 2
• Computational online trust
• Identifying key influencers
• Information flow in networks
• Module 3
• Analysis of clandestine networks
• Katana - game analytics engine


Social Network Analysis
O a iz tio a
rg n a n l Sc l
o ia
A th p lo y
n ro o g T e ry
ho P y h lo y
sco g

C g itiv
on e
P rc p n S c -C g itiv
e e tio o io o n e K o le g
nw de
N tw rk
e o s N tw rk
e o s

Ra
e lity Sc l
o ia K o le g
nw de
N tw rk
e o s N tw rk
e o s

A q a ta c
c u in n e K o le g
nw de
E id m lo y
p e io g (lin s
k) (c n n
o te t)

S c lo y
o io g

 Social science networks have widespread application in various
fields
 Most of the analyses techniques have come from Sociology,
Statistics and Mathematics
 See (Wasserman and Faust, 1994) for a comprehensive introduction
to social network analysis
12/02/06 IEEE ICDM 2006 4

What have been it‟s key scientific successes?
 In classical social sciences numerous results
 „Six degree of separation‟ [Milgram]
 Popularized by the „Kevin Bacon game‟

 „The strength of weak ties‟ [Granovetter]
 „Online networks as social networks‟ [Wellman, Krackhardt]
 „Dunbar Number‟
 Various types of centrality measures
 Etc.
 In the Web era
 „The Bow-Tie model of the Web‟ [Raghavan]
 „Preferential attachment model‟ [Barabasi] (Yes and No)
 „Powerlaw of degree distribution‟ [Lots of people] (NO!)
 Etc.

Application successes
 Numerous in social sciences
 Google – PageRank
 LinkedIn – expanding your Cognitive Social Network
 making you aware that „you‟re more connected and closer than you
think you are‟
 Expertise discovery in organizations
 Knowledge experts, „authorities‟
 Well-connected individuals, „hubs‟
 Rapid-response teams in emergency management
 Information flow in organizations
 Twitter – real time information dissemination
 Etc.

Online (Multiplayer) Games

High

How
social

Low
Low High
How enagaging

Player Behavior & Revenue Model
 Blizzard (subscription)  Zynga (free2play)
 World of Warcraft  Farmville, Fishville, Mafia
 12 million subscribers Wars, etc.
 Revenue model  180 million players
 $15/month  Revenue model
 Approx $3billion annual  Virtual goods
revenue  $700 million in 2010
 4 hours a day, 7 days a  0.5 hrs a day, 7 days a week
week!

Hard core gamers Everyone

Less socially acceptable More socially acceptable

Like Cocaine Like Caffeine

Implications of this „addiction‟
 3 billion hours a week are being spent playing online
games
 Jane McGonigal in “Reality is Broken”
 Labor economics
 What is the impact of so much labor being removed from the pool
[Castranova]
 Entertainment economics
 If MMO players can get 100 hrs/month of entertainment by spending
$25 or so, what will happen to other entertainment industries?
 Psychological/Sociological
 Is it an addiction – the prevailing view (Chinese government‟s „detox
centers‟ for kids)
 Are they fulfilling a deeper need that real world is not (McGonigal)
 Societal
 A trend far too important to not be taken seriously!

Levis‟ – Example of Social Retail

 Levis‟ leverages its brand to ensure customers provide their social
network
 Levis‟ can leverage predictive social analytics technology to understand
the value of the customer‟s social network

11

Opportunity, Innovation, Impact
 Companies do not understand the social graph of their customers
 It‟s not just about how they relate to their customers, but also about
how customers relate to each other

vs.

 Understanding these relationships unlocks immense value
 Innovation: Understanding the social network of customers
 Key influencers, relationship strength, …
 Impact: Deriving actionable insights from this understanding
 Customer acquisition, retention, customer care, …
 Social recommendation, influence-based marketing, identifying trend-setters, …

Ninja Metrics confidential information. Copyright 2012 12

Unlocking true value by product, category, or store

0021

-$128.61

-$293.79

-$79.63

13

True Value of each customer
 True value = individual value + social value
 Who really matters, and to what degree
 Some empirical facts
 31% activity due to socialization
 23% more individual + 8% more social activity

The individual‟s their social and their true
lifetime value influence total

14

Impact of New Instrumentation on Science
 1950s
 Invention of the electron microscope fundamentally changed
chemistry from „playing with colored liquids in a lab‟ to „truly
understanding what‟s going on‟
 1970s
 Invention of gene sequencing fundamentally changed biology from
a qualitative field to a quantitative field
 1980s
 Deployment of the Hubble (and other) Space telescopes has had
fundamental impact on astronomy and astrophysics
 2000s
 Massive adoption is fundamentally changing social science
research
 Massively Multiplayer Online Games (MMOGs) and Virtual Worlds
(VWs) are acting as „macroscopes of human behavior‟

The Virtual World
Observatory (VWO) Project
• Four PIs, 30+ Post-docs, PhD and MS students, UGs, high-schoolers
• Noshir Contractor, Northwestern: Networks
• M. Scott Poole, Illinois Urbana-Champaign/NCSA: Groups
• Jaideep Srivastava, Minnesota: Computer Science
• Dmitri Williams, USC: Social Psychology
• Collaborators
• Castronova (Sociology, Indiana), Yee (Xerox PARC), Consalvo, Caplan
(Economics, Delaware), Burt (Sociology, U of Chicago), Adamic (Info Sci,
Michigan), …
• Data and technology partners
• Sony (EverQuest 2), Linden Labs (2nd Life), Bungie (Halo3), Kingsoft
(Chevalier‟s Romance), others …
• Cloudera Systems (Hadoop), Microsoft (SQL Server), Weka, …

Overall Goals of the VWO Project

Basic Science
• behavior, socialization, …
• novel, scalable algorithms
• NSF
Sticky Social
Media tons of Analytics Government applications
data • Social science
• Games • team dynamics (Army)
models • skills acq, leadership (Army)
• Virtual Worlds • New algorithms • social influence & adolescent
• Other social • Ultra-scalability health (CDC)
• etc.
apps
Business applications
• customer churn
• game design
• anti-social behavior
• etc.

Collaborators, Sponsors, Partners
 Team

 Financial Sponsors

 Data Partners

 Technology Partners

Who is playing?

 It is not just a
bunch of kids
 Average age is
31.16 (US
population median
is 35)
 More players in
their 30s than in
their 20s.

How much do they play?

 Mean is 25.86
hours/week
 Compares to
US mean of
31.5 for TV
(Hu et al,
2001)

• From prior experimental work, MMO play eats into entertainment TV
and going out, not news
• So much for kids being the ones with the free time.

Gender Differences
 More men players (78/22%)
 Men played to compete , and women played to socialize
 Men play more other games, but it was the women who were more
satisfied EQ2 players
 Women: 29.32 hours/week

 Men: 25.03 hours/week

 Likelihood of quitting: “no plans to quit”:
women 48.66%, men 35.08%
 Self reported play times
 Women: 26.03 (3 hours less than actual)
 Men: 24.10 (1 hour less than actual)
 Boys and girls are socialized early on, and thus have clear role
expectations for their behaviors and identities (Gender Role Theory
in action!!)

Inferring RW gender from VW data
Goal
What virtual world behaviors and characteristics predict
real world gender?
Data:
 Survey Data n=7119
 Survey Character Store
 EQ2 Character Store
Variables
 Avatar Characteristics:
 Gender, Race, Class, Experience, Guild Rank, Alignment,
Archetypes
 Game play Behaviors:
 Total Deaths & Quests, PvP Kills & Deaths, Achievement Points,
Number of Characters, Time played, Communication patterns

25

Gender prediction results
 Close to 95% prediction accuracy
 Decision trees work rather well
 Character Gender, Race and Class are significant
predictors to real life gender.
 Gender swapping is rarer, but systematically different by
real gender
 Players tend to choose:
 character gender based on their real life gender
 character races that are gendered: women play
elves/men play barbarians
 classes that are gendered: women play priests/men
play fighters

26

Gender swapping behavior
Game Character Real Gender
Gender
Male Female Total

Male 4065 82.6% 98 8.2% 4163 68.0%

Female 855 17.4% 1104 91.8% 1959 32.0%

Total 4920 100.0% 1202 100.0% 6122 100.0%

Observation
• Far more males gender swap than females
• Why?
• Men are more creative?
• Women have less identity confusion?
• Women get their „fill of gender swapping in real life‟ 
27

Economics: A test of RW  VW
mapping

 Do players behave in virtual worlds as we expect
them to in the actual world?
 Economics is an obvious dimension to test
 In the real world, perfect aggregate data are hard to
get

GDP and Price Level
 GDP and price levels are robust but comparatively unstable

GDP and Prices on Antonia Bayle

5,000,000 160
4,500,000 140
4,000,000
120

Prices (January = 100)
3,500,000
100
GDP (Gold)

3,000,000
2,500,000 80
2,000,000 60
1,500,000
40
1,000,000
20
500,000
0 0
January February March April May

Nominal GDP Price Level (January = 100)

Money Supply and Price
Change in Money Supply and Population on Antonia Bayle
 The instability is 2500 4000

explicable through 3000

Change in Money (000 Gold)
2000
the Quantity Theory 2000

Change in Accounts
1000

of Money 1500
0

-1000

 a rapid influx of 1000
-2000

money . . . 500 -3000

-4000

0 -5000
February March April May

Change in Money Supply (000 Gold) Change in Active Accounts

 . . . dramatically Price Level on Antonia Bayle

boosted prices 50
40
Percent Change in Price Level

30
 More evidence that 20
this behaves like a 10

real economy 0
-10 February March April May June

-20
Price Level

Networks in Virtual Worlds

SONIC

Advancing the Science of
Networks in Communities

Why do we create and sustain
networks?
 Theories of self-interest  Theories of contagion
 Theories of social and  Theories of balance
resource exchange  Theories of homophily
 Theories of mutual interest  Theories of proximity
and collective action  Theories of co-evolution

Sources:
Contractor, N. S., Wasserman, S. & Faust, K. (2006). Testing multi-theoretical multilevel hypotheses
about organizational networks: An analytic framework and empirical example. Academy of
Management Review.
Monge, P. R. & Contractor, N. S. (2003). Theories of Communication Networks. New York: Oxford
University Press.

SONIC


“Structural signatures” of Social Theories
A A

B
B F

+ F

+ -
C
- E
C
E

D
D

Self interest Exchange Balance

A
A

B F
B F
- +
+ C E

C D
E

Novice
D Expert
Collective Action Homophily Contagion

SONIC


Black: male
Red: female

Partnership Instant messaging

SONIC
Trade Mail

4 5 2
0 0 0
(1)
0 0 0
(2)
0 1 0
1 3 0
1 2 0
0 1 0
(n)
0 0 0

Social Networks (1) (2) (n)
as network structure frequency Cluster Structure
vectors in a bag-of-words model Vectors using
Text clustering
methods
Social theory + IR Cluster means provide Attribute values for
based network modes of network each cluster can be
analysis structure configurations used to discover
Making up all the trends between network
social networks structures and attributes

Results – normalized network structure
vector means for all clusters

Results
 Clusters 1 and 4 are similar
 Groups kill fewer monsters
 Group members in cluster 4 do not communicate much
 Group members in cluster 1 generally limit their communication to just
one other person in the group
 Most people belong to these two clusters
 Consistent with previous research - users in virtual environments are
less likely to interact with strangers
[N. Ducheneaut, N. Yee, E. Nickell and R. Moore, “Alone Together?” Exploring
the social dynamics of massively multiplayer online games, Proceedings
CHI06, ACM Press, New York, 407-416.]
 Cluster 5 groups have many 1-edge and 2-out stars
 Most of the communication is one way possibly indicating presence of
central people
 Maximum number of monsters killed out of all clusters
 Performance of the groups is very good
 Minimal communication
 It is possible that cluster 5 consists of groups more focused on playing
and performing well in the game and less on socializing

Some Open Questions
 How similar/dissimilar are online social networks from real-
world social networks?
 Is online socialization
 Only a sustainability activity of real-world networks?
 Causing new social networks to be formed?
 Is fundamentally a different type of networking activity?
 A fundamental tenet of socialization has been
“geography/proximity drives socialization”
 How is this being impacted by online socialization?

TeamSkill: Modeling Team
Chemistry in Online Multi-
Player Games

Description and motivation
 The goal of this work is to improve skill assessment approaches
used in multiplayer games, especially team-based games
 Xbox Live, PSN, Steam, and Battle.net have between 149-167
million total users, collectively (and that number is growing)
 Many of the most popular games are team-based: Halo, Call of Duty,
TF2, CS, etc
 Why?
 Better skill assessment = fairer games = less player attrition
 More accurate rankings of players/teams
 Most previous work has focused on individuals, not teams
 It‟s a hard problem
 Online setting: Updates to a player‟s skill distribution must be done
after each game is played
 Applicability: Any generalized assessment technique cannot include
game-specific data
 Our datasets are from the games of professional Halo 3 players

40

What is Halo?

 Halo is one of the most well-known “first-person shooter” video
game series in the world
 Online/LAN-based multiplayer is its most significant component
 650,000 to 850,000 unique users per day at its peak
41

Major League Gaming
 Our work focuses on
professional Halo 3 players
 Online scrimmage data, as
well as complete tournament
data, is readily available from
bungie.net and mlgpro.com
 Players are highly-skilled
individually – allows us to
better focus on group-level
What is Major League Gaming (MLG)? performance characteristics
• A professional league for competitive  Players change teams
gaming, including Halo 3 from ‟08-‟10
• Best players in the world compete in this regularly, helps isolate
league impact of particular players
• Last tournament watched by over on overall team performance
1,000,000 people online
 Unexplored boundary case
• Our datasets are comprised of games
played by these players for skill assessment

42

Skill assessment
 An old problem (with different applications in different
contexts)
 Paired comparison estimation
 Foundational work by Thurstone (1927) and Bradley-Terry (1952)
 Elo (1959), popularized in chess ranking (FIDE, USCF)
 Glicko (1993) – player-level ratings volatility incorporated (σ2),
addition of rating periods
 TrueSkill (2006) - factor-graph based approach used in
Microsoft‟s Xbox Live gaming service
 Used to match players/teams up with each other online
 Our work focuses on Elo, Glicko, and TrueSkill

43

Elo (1959)
 Arpad Elo, Dept of Physics, Marquette University
 Was a master chess player in USCF
 Proposed the Elo Rating System
 Replaced earlier systems of competitive rewards (i.e.,
tournaments) in USCF/FIDE
 Simple to implement: assumes player skill is normally-
distributed with constant variance β2
 Still widely-used today

44

Glicko (1993)
 Mark E. Glickman, Department of Health Policy
and Management , Boston University
 Proposed the Glicko Rating System
 Addresses rating reliability issue through the use of
rating periods and player-specific variances
 Elo is a special case of Glicko
 Iterative approach for approximating the
marginal posterior distribution of a player‟s skill
conditional on other players‟ priors
 More computationally tractable for large datasets

45

TrueSkill (2006)
 Ralf Herbrich and Thore
Graepel at Microsoft
Research, Cambridge, UK
 Used for automated
ranking/matchmaking on Xbox
Live
 Uses factor graphs to model
multi-player, multi-team
environments
 Converges quickly (~5 games
for a player)
 Large-scale deployment
 30 million Xbox Live members
 150+ games use TrueSkill for
ranking & matchmaking

46

Issues with current approaches
1 2 3 4
+
1234

 Basic idea
 Given: skill ratings of each team member, i.e., si ~ N(μi, σi2)
 Sum across all team members
 Not intuitive: Team chemistry is a well-known concept in team-based
competition [Martens 1987, Yukelson 1997] and it is not captured in
any of these models
 Can think of it as the overall dynamics of a team resulting from
leadership, confidence, relationships, and mutual trust
 Independence assumption not realistic in teams, especially at
high levels of play

47

More information is available, however…
1 2 3 4

12 13 14 23 24 34

123 124 134 234

1234

 Observation: We have more information than just the history of
individual players – we also know the histories of groups of players
 Player 1‟s history ∩ player 2‟s history  history of {1, 2}
 Existing approaches only make use of top row
 Idea: Estimate the skills of subgroups of players on a team, combine
in some way, and use to produce better estimate of a team‟s skill

48

Reframing the assessment
problem
k=1 1 2 3 4

k=2 12 13 14 23 24 34

k=3 123 124 134 234

k=4 1234

 Alterations to existing approaches necessary (i.e., Elo/Glicko/TrueSkill)
 Player-level skill representation  generalized subgroup skill
representation
 For each game and for each k <= size of the team (K), treat each
group as you would an individual player and update skill accordingly
 Hashing of skill variable matrix according to unordered subgroup
membership
 Treat Elo/Glicko/TrueSkill as „base learners‟, a la boosting
 Each rating says something about the skill of that particular subgroup
 …but how do we aggregate these ratings to estimate the skill of a
team?
49

TeamSkill
 Four different aggregation approaches:
 TeamSkill-K
 TeamSkill-AllK
 TeamSkill-AllK-EV
 TeamSkill-AllK-LS

50

Aggregation issues

black = history available red = no history available

time: t=1 t = 100 t = 200
After 1 game New player – 5 - who‟s New player – 6 – who‟s
never played with 1, 2, or 3 played with 3 or 5 (not both)
 Real world case: assume that players can leave/join the 4-player
team and look at the timeline
 The question at each point in time, t: how best to combine the
available group ratings to produce a team rating?
 The problem: the feature space is expanding and contracting over
time.
51

Data set overview
 Collected over the course of
2009
 7,590 games (2,076 from
tournaments and 5,514 from
Xbox Live scrimmages)
 448 players on 140 different
professional and semi-
professional teams
 Games took place during
January 2008 through January
2010
 Websites pertaining to this data
 http://stats.halofit.org - Player/team
statistics
 http://halofit.org – Datasets and related
information
“Friend or foe” social network for all tournaments in 2008 and
2009

52

Evaluation
 The TeamSkill approaches were evaluated by predicting the outcomes
of games occurring prior to 10 MLG tournaments and comparing their
accuracy to unaltered versions (k = 1) of their base learner rating
systems - Elo, Glicko, and TrueSkill (for TeamSkill-K, all possible
choices of k for teams of 4, 1 ≤ k ≤ 4, were used)
 For each tournament, we evaluated each rating approach using:
 3 types of training data sets - games consisting only of previous tournament data,
games from online scrimmages only, and games of both types.
 3 periods of game history - all data except for the data between the test tournament
and the one preceding it (“long”), all data between the test tournament and the one
preceding it (“recent”), and all data before the test tournament (“complete”).
 We will only show “complete” as results for “long” and “recent” mirrored those for “complete”
 2 types of games - the full dataset and those games considered “close” (i.e., prior
probability of one team winning close to 50%).

53

Results – all games

Prediction accuracy for both tournament and scrimmage/custom games using complete history

Prediction accuracy for tournament games using complete history

Prediction accuracy for scrimmage/custom games using complete history
54

Results – “close” games

Prediction accuracy for both tournament and scrimmage/custom games using complete history

Prediction accuracy for tournament games using complete history

Prediction accuracy for scrimmage/custom games using complete history
55

Conclusions
 Modified several existing assessment approaches to
operate on generalized player subgroup entities (instead of
just individual players)
 Introduced four aggregation methods for combining player
subgroup information to produce final forecast
 Shown evidence that close games are decided on the
basis of team chemistry, consistent with sports psychology
research

56

Time Equals Stickiness

6.00

5.00
6
Ratio of Quitters to Stayers

4.00

3.00

2.00

1.00

0.00
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69
Character Level

 There are less quitters as the levels go up, and focus
should be on the first 20 levels.

Solo vs. Social Players

6.00

5.00
Ratio of Quitters to Stayers

4.00

3.00 Solo
Social

2.00

1.00

0.00
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69
Character Level

 Isolated players are 3.5x more likely to quit (B = 1.26, p<.001).
Focus design on facilitating social interaction.

Problem Statement & Approach
 Objective: At time t, for player p, given a window w, compute
the probability, P(p,t,w), of p churning within the interval
(t,t+w), and the confidence in this probability.
 Approach: Estimate P(p,t,w) and its confidence using
 Data about p‟s activity and socialization behavior
 Statistics and machine learning (linear, non-linear, ensemble, etc.)
 Use of socio-psychological theories of player motivation
 Novel synthesis of data driven (DD) and theory driven (TD) approaches

61

Player Motivation Theories

Richard Bartle, 1996

Nick Yee, 2005

62

Mapping MMO Behaviors to Theory

 Example MMORPG
 Persistent virtual world
 Players have avatars
 Quest-driven
 Participation in groups,
guilds
 Dataset (game logs)
 Time period: FEB-JUN, 2006
 No. of accounts: ~16000
 Churner definition: 2 months of inactivity

63

Synthesis of TD and DD approach

64

Model Evaluation (Confusion Matrix)

 Model performance for different features (10
fold cross-validation)
Features Tree size Precision (%) Recall (%) F-measure
(nodes) (%)
Pure DD 485 69.3 84 76
TD: 59 67.1 76.2 71.3
Achievement+Social
TD: Achievement 37 67.2 73.2 70.1
TD: Socialization 7 64.3 55.4 59.5

 Observations
 Decision tree learning works quite well
 DD model very accurate (76%); difficult to interpret (485
nodes)
 TD model interpretable (7 – 59 nodes), but less accurate
 Achievement-orientation more important than socialization
65

Model Evaluation (Lift Chart)

 Observations
 TD model does better in
predicting top quintile of
churners
 DD model performs better in the
40%-70% range

66

Ensemble Approach – Model
Evaluation
 Ensemble approach
 A „committee of classifiers‟ votes on the result
 Heterogeneity helps
No. of Precision Recall F-measure
clusters
5 66.23 91.94 76.99
10 66.49 91.08 76.87
15 67.58 89.80 77.12
20 67.6 90.13 77.25

 Improvements over single model
 Best recall value shows 7.94% improvement , i.e. model
can identify larger proportion of potential churners)
 Best F-measure shows 1.25% improvement

67

Conclusions

 Comparison of theory-driven and data-driven model in
terms of prediction accuracy and model interpretability
 Achievement-orientation is more important than
socialization-orientation in identifying potential churners
 Ensemble model can identify larger proportion of
potential churners as compared to single global model

68

Course Outline

• Module 1
• Introduction to Social Analytics – applying data mining to social computing
systems; examples of a number of social computing systems, e.g.
FaceBook, MMO games, etc.
• Module 2
• Computational online trust
• Identifying key influencers
• Module 3
• Analysis of clandestine networks
• Katana - game analytics engine


Computational Trust in
Multiplayer Online
Games

“If you want to go fast walk alone, if you
want to go far then walk with a group.”
- Proverb from Ghana

Virtual Worlds & Massive Online
Games

 Massively Multiplayer Online Role
Playing Games
(MMORPGs/MMOs)
 Simulated Environments like
SecondLife
 Millions of people can interact with
one another is shared virtual
environment
 People can engage in a large number
of activities with one another and with
the environment
 Many of the observed behaviors have
offline analogs
72
72

Big Picture Questions
 How is trust expressed differently in different social
contexts?
 Cooperative (PvE), Adversarial (PvP), …
 How is trust expressed in different types of social
networks?
 Housing, Mentoring, Trade, Group, …
 What are the characteristics of trust and related networks
in MMOs?
 Similarities and differences with social networks in other domains
e.g., citation networks, co-authorship networks
 What role can features derived from the trust network play
in prediction tasks e.g., link prediction (formation,
breakage, change), trust propensity, success prediction

73

Computational Trust in Multiplayer Online Games
Department of Computer Science, University of Minnesota
Muhammad Aurangzeb Ahmad
Trust as a Multi-Level Network Phenomenon

Trust as a multi-modal multi-level Network formulations of Traditional Incorporating social science theories Generative Network Models for Trust based
network phenomenon Trust related concepts in trust related prediction task social interactions



Social Characteristics of Trust

Effect of Social Environments on Trust Trust and Homophily Trust and Clandestine Behavior Trust and Mentoring Trust and Trade

Most types of homophily do not
Different Social environments result in carry over to the MMO domain No Honor Amongst Thieves
differences in network signatures

Trust based and other social networks in MMOs
exhibit anomalous network characteristics







Trust and Prediction
Trust Prediction Family of Problems Item Recommendation Success Prediction (Social Capital)

Predict Formation, Change, Breakage of Trust with in the same
network and across social networks. Predict trust propensity.
Effects of social environments on trust Effect of network structure on trust







Trust and Prediction
Trust Prediction Family of Problems Item Recommendation Success Prediction (Social Capital)

network and across Social networks. Predict trust propensity.
Effects of social environments on trust Effect of network structure on trust


Computational Social Science
Semantics Structure

Trust and Homophily Characteristics of Trust Networks
Most types of homophily do not Trust based and other social networks in MMOs
carry over to the MMO domain exhibit anomalous network characteristics

Algorithmic
Trust Prediction Family of Problems


Applications
Detection of Clandestine Actors

No Honor Amongst Thieves
Gold Farmer Detection

Housing-Trust in EQ2

• Access permissions to in-game house
as trust relationships
• None: Cannot enter house.
• Visitor: Can enter the house and can
interact with objects in the house.
• Friend: Visitor + move items
• Trustee: Friend + remove items

• Houses can contain also items which
allow sales to other characters without
exchanging on the market

80

Homophily and Trust
 Homophily: Birds of a feather flock together
 There is no one form of homophily and homophily in
general is described in multiple ways: Status vs. Value
Homophily
 Each of these homophiles are in turn defined in multiple
ways themselves
 Previous literature instantiates homophily in MMOs in
terms of player characteristics and behavior in the game

81

Network Models, Homophily and
Trust Networks in MMOs

 RQ1: Does homophily in MMOs operate in ways similar to
homophily in the offline world?
 RQ2: How do we map characteristics that define
homophily in the offline world to online settings?

82

Mapping Homophily in MMOs

 In general, studies of homophily in MMOs assume only one type of homophily and
generalize based on that type
 Even in the offline world homophily is of different types
 Hence the necessity of Mapping Homophily which we address here
 Mapping and Proteus Effect
83

Trust and Homophily in MMOs

Homophily Type Hypothesis Observation
H Gender Homophily Players trust other players who ?
1 are of the same gender
H Age Homophily Players trust other players who ?
2 are of the same age cohorts
H Class Homophily Players trust other players who ?
3 are of the same class
H Race Homophily Players trust other players who ?
4 are of the same race
H Guild Homophily Players trust other players who ?
5 belong to the same guild
H Level Homophily Players trust other players who ?
6 are at a similar level
H Challenge Players trust other players who ?
7 Homophily like similar types of challenges

84

Key Observations

H1: Players trust other players who are of the same gender?

In general players trust other players who are of
the same gender

H2: Players trust other players who are of similar age?

The stronger the type of trust the lesser is the age
difference between the people specifying trust

H3: Players trust other players who are of the same class?

Class does not seem to effect the choice of trusting
others

85

Key Observations

H4: Players trust other players who are of the same race?

Race does not seem to effect the choice of trusting
others

H5: Players trust other players who are of the same guild?

In general, the stronger the type of trust, the greater
is the percentage of the people who trust people in
their own guilds

H6: Players trust other players more who are level at a similar rate?

Leveling at the same rate does not seem to greatly
effect trust amongst players

86

Key Observations

H7: Players trust other players who are of the same level?

• Level difference seems to have some effect on trust
• For Trustee (strongest) and the Visitor (weakest) form of trust, the lower level
players are more likely to trust players who are at a higher level

Summary:
• Homophily is observed for a subset of types in MMOs as compared to what it is
observed for in the offline world
• The types of homophily which are not observed in MMOs are the ones which are
greatly effected by game mechanics

87

Social Networks: General
Observations
 There is an extensive literature on characteristics of social
networks (Leskovec PAKDD 2005, Leskovec ICDM 2005,
McGlohon ICDM 2008, McGlohon KDD 2008)
 The network exhibits monotonically
shrinking diameter over time (Leskovec PAKDD 2005,
Leskovec ICDM 2005, McGlohon ICDM 2008)
 At a certain point in time called the Gelling Point many
smaller connected connect together and become part of
the largest connected component (Leskovec PAKDD 2005,
Leskovec ICDM 2005, McGlohon ICDM 2008)
 The largest connected component (LCC) comprises of
the majority of the nodes in the network (>= 80%)
(McGlohon ICDM 2008, McGlohon KDD 2008)

89

Social Networks: General
Observations
 The size of the second and the third largest connected
components remain constant (more or less) even though
the identity of these components change over time
(Leskovec PAKDD 2005, Leskovec ICDM 2005, McGlohon
ICDM 2008, McGlohon KDD 2008)
 Network isolates are few in number (<5%)
(Leskovec PAKDD 2005, Leskovec ICDM 2005)
 The number of connected components decreases over
time
(Leskovec PAKDD 2005, Leskovec ICDM 2005, McGlohon
ICDM 2008)
 Relatively fast growth of LCC close to the gelling point
(Leskovec PAKDD 2005, Leskovec ICDM 2005)

90

Trust Networks in MMOs

 Data from 4 servers is available. Results from one server
(Player vs. Environment, „guk‟) are shown
 The network consists of 15,237 nodes, 30,686 edges
and 1,476 connected components
 Dataset spans from January 2006 to August 2006
 Average node degree of 4.03. The size of the three
largest connected components are as follows: 9039, 51
and 49. The largest connected component accounts for
59% of all the nodes in the network

The Trust Network on „guk‟ on August 31, 2006

91

Key Observations

 Observation 1: Preferential Attachment: The rich get
richer but not too rich
Explanation: Social bandwidth is limited, Dunbar
Number
 Observation 2: The growth of the LCC is retarded after
the gelling point
Explanation: The trust network has a relatively low
growth rate as compared to the other networks
 Observation 3: Non-monotonic change in the diameter
of the largest connected component
Explanation: Players have different levels of activity at
various points in time and can also “drop out” of the
network if they churn from the game

92

Key Observations
 Observation 4: A large number of isolate components are
observed (> 1000)
Explanation: People join in groups and spend all the time
playing with one another instead of interacting with people
from the outside
 Observation 5: The number of isolate components
increases monotonically over time
Explanation: (Same as observation 4)
 Observation 6: Nodes in the non-LCC constitute a
significant portion of the network (41%, 8 months after
gelling point)
Explanation: (Same as observation 4)

93

Generative Models of Trust Networks
 Time bound Preferential Attachment: The rich get rich
but not so much after a certain point in time. Edge
formation is bound by time
 Presence of Auxiliary Components: Isolate components
are added to the network at an almost constant rate over
time

94

Generative Models of Trust Networks
(ii)

 Non-Monotonic Decrease in the Diameter:
Nodes become inert after a certain point in time. Sample
the lifetime of nodes from a normal distribution
 Homophily in Edge formation: Probability of edge
formation dependent upon node degree as well as
agreement (similarity) in node characteristics

95

Results

Diameter: Non-Monotonically Changing Diameter

% LCC as being relatively small

96

Results

Number of Connected Components

Network growth and the gelling point

97

Conclusion: Trust Networks
 Trust Networks in MMOs exhibit many properties which are
not exhibited by other social networks in most other
domains
 Proposed a model based on observations and domain
knowledge
 Models of social networks should incorporate the
peculiarities which are observed in MMOs in general
 Generalization? Similar observations have been made for
mentoring networks but not for PvP, Trade and Chat
Networks

98

Trust Prediction Family of
Problems
Trust Prediction: Given a trust network G predict which
nodes are going to trust one another in the future

Prediction Across Networks: Given a set of actors who
participate in multiple types of interactions using features from
one network predict the existence of links in the other network

100

Trust Prediction Family of
Problems
 Machine Learning/Classification approach to the problem
of trust prediction
 Introduced a set of new problems (inter-network link
prediction, trust propensity prediction)
 Proposed an algorithm for link prediction which used
domain knowledge from social science theories
 New Contributions:
 Does the social context (adversarial vs. cooperative) effect
prediction results?
 If so then what is the implication for generalization?

101

Prediction Task

 Trust Prediction as a classification problem
 60,000 examples for each prediction task
 10 Fold Cross-validation
 Data from Guk (Cooperative) and Nagafen (Adversarial)
Servers
 Six Standard Classifiers for Comparison: J48, JRip,
AdaBoost, Bayes Network, Naive Bayes and k-nearest
neighbor
Positive Example: Negative Example:

Training Period Test Period Training Period Test Period

102

Prediction: Group Network

 In general good prediction results are obtained for the
combat network (in either direction), except one case
 The grouping network is a union of all grouping
instances and thus it is extremely dense (1,796,438
edges, 31,900 nodes)
103

Prediction: Group Network

 Grouping is not a good predictor of mentoring but the vice
versa is correct
 Mentoring is (more often than not is accompanied by
grouping) but grouping can be in a variety of contexts e.g.,
raids, quests, dungeon instances, …
104

Prediction: Combat Network

 In general good prediction results are obtained for the combat
network (in either direction)
 In general if players have played against another person then
they have friends in other networks who have done the same
or something similar
105

Comparison across Social
Environments

 T: Trust; M: Mentoring; B: Trade
 In general the results are similar for both the
environments with some exceptions
 Mentoring-Mentoring: In the adversarial environment a
more cliqueish behavior is observed i.e., friends of
friends are likely to mentor one another in the future
which is not the case for the cooperative environment

106

Comparison across Social
Environments

 Mentoring-Related Tasks: In general mentoring is
much more prevalent in the adversarial environment as
compared to the cooperative environment (~3 times) and
is much more intense. Overlap between mentoring and
trade is thus more likely

107

Trust Amongst Clandestine Actors

“A plague upon‟t when thieves
cannot be true one to another!”
– Sir Falstaff, Henry IV, Part 1, II.ii

Do gold farmers
trust each other?
109

Hypergraphs to Represent Tripartite Graphs

• Accounts can have several characters
• Houses can be accessed by several characters
• Projecting to one- or two-model data obscures crucial
information about embededdness and paths
• Figure 2a: Can ca31 access the same house as ca11?
• Figure 2b: Are characters all owned by same account?
110

Hypergraphs: Key Concepts

• Hyperedge: An edge between three or
more nodes in a graph. We use three
types of nodes: Character, account and
house
• Node Degree: The number of
hyperedges which are connected to a
node
NDh1 = 3
• Edge Degree: The number of
hyperedges that an edge participates in
EDa1-h1 = 2

111

Network Characteristics

• Long tail distributions are observed for the various degree distributions
• The mapping from character-house to an account is always unique
• Players who are connected to a large number of houses are highly active
players otherwise as well
112

Characteristics of Hypergraph Projection Networks
• Account Projection: Majority of the gold farmer nodes are isolates
(79%). Affiliates well-connected (8.89) vs. non-affiliates (3.47)
• Character Projection: Majority of the gold farmer nodes are isolates
(84%). Affiliates well-connected (10.42) vs. non-affiliates (3.23)
• House Projection: 521 gold farmer houses. Most are isolates (not
shown) but others are part of complex structures. Densely connected
network with gold farmers (7.56) and affiliates (84.02)

The Housing Projection Network
113

Key Observations

• Gold farmers grant trust ties less frequently than either
affiliates or general players
• Gold farmers grant and receive fewer housing permissions
(1.82) than their affiliates (4.03) or general player population
(2.73)

Total degree In degree Out degree

<n> < nGF > < nAff > <n> < nGF > < nAff > <n> < nGF > < nAff >

Farmers 1.82 0.29 1.82 0.89 0.29 0.89 1.07 0.29 1.07

Affiliates 4.03 1.28 0.70 1.55 0.75 0.70 2.88 0.63 0.70

Non-Affiliates 2.73 - 7.77 1.57 - 5.98 1.56 - 2.34

114

Key Observations

• No honor among thieves
• Gold farmers also have very low tendency to grant other
gold farmers permission (0.29)
• Affiliates also unlikely to trust other affiliates (0.70)



Farmers 1.82 0.29 1.82 0.89 0.29 0.89 1.07 0.29 1.07

Affiliates 4.03 1.28 0.70 1.55 0.75 0.70 2.88 0.63 0.70

Non-Affiliates 2.73 - 7.77 1.57 - 5.98 1.56 - 2.34

115

Key Observations

• Affiliates are brokers:
• Farmers trust affiliates more (1.82) than other farmers (0.29)
• Affiliates trust farmers more (1.28) than other affiliates (0.70)
• Non-affiliates have a greater tendency to grant permissions to
affiliates (7.77) than in general (2.73)



Farmers 1.82 0.29 1.82 0.89 0.29 0.89 1.07 0.29 1.07

Affiliates 4.03 1.28 0.70 1.55 0.75 0.70 2.88 0.63 0.70

Non-Affiliates 2.73 - 7.77 1.57 - 5.98 1.56 - 2.34

116

Frequent Itemset Mining for Frequent Hyper-subgraphs

Support of a Hyper-subgraph: Given a sub-hypergraph of size k,
subP is the pattern of interest containing the label P, shP is a pattern
of the same size as subP and contains the label P, the support is
defined as follows:

Support of pattern also containing a gold farmer (red) = 5/8

117

Frequent Itemset Mining for Frequent Hyper-subgraphs

Confidence of a Hyper-Subgraph: Given a sub-hypergraph of
size k, subP is the pattern of interest containing the label P, subG is
a pattern which is structurally equivalent but which does not contain
the label P, the confidence is defined as follows:

Confidence of pattern and containing a gold farmer = 5/7

118

Frequent Patterns of GFs
• Very low (s ≤ 0.1) support and confidence for almost all
(except 8) frequent patterns with gold farmers
• Remaining 8 patterns can be used for discrimination
between gold farmers and non-gold farmers in a subset of
the instances
• Gold farmers & affiliates are more connected: A 3rd of more
complex patterns are associated with affiliates (15/44)

119

Application: Gold Farmer Detection (Behavioral)
Classification using in-game behavioral and demographic features
Each model corresponds to a grouping of different features

120

Application: Gold Farmer Detection (Label
Propagation)
• Problem: Not all people who are labeled as normal players are such. Some
of them are gold farmers but have not been identified as such
• Analogue: If someone is socializing almost exclusively with criminals then
he may be a criminal
Classifier Metric Initial Dataset Label Prop Change In Performance
Bayes Net Precision 0.17 0.189 0.019
Recall 0.834 0.819 -0.015
F-Score 0.282 0.307 0.025
J48 Precision 0.494 0.62 0.126
Recall 0.189 0.337 0.148
F-Score 0.273 0.437 0.164
J Rip Precision 0.495 0.537 0.042
Recall 0.462 0.462 0
F-Score 0.478 0.497 0.019
KNN Precision 0.436 0.46 0.024
Recall 0.396 0.428 0.032
F-Score 0.415 0.443 0.028
Logistic Regression Precision 0.455 0.534 0.079
Recall 0.189 0.271 0.082
F-Score 0.267 0.36 0.093
Naïve Bayes Precision 0.146 0.142 -0.004
Recall 0.538 0.502 -0.036
F-Score 0.23 0.221 -0.009
Adaboost w/ DT Precision 0.405 0.471 0.066
Recall 0.105 0.08 -0.025
F-Score 0.167 0.137 -0.03

121

 Model 1 (Player Attribute Based Features): These features are
based on the attributes of the player‟s character in the game e.g.,
character race, character gender, distribution of gaming activities etc.
 Model 2 (Item Based Features): These are the features which are
derived from items bought and sold from the consignment network.
These features are based on the frequency of the frequent items sold
or bought by gold farmers.
 Model 3 (Player Attribute & Item Based Features): All the attributes
from the previous two models.
 Model 4 (Item Network Based Features): Features which are derived
from the item network in a manner analogous to Model 2.
 Model 5 (Player Attribute & Item-Network Based Features): A
combination of features from Model 1 and Model 4.
 Model 6 (Item Network & Item-Network Based Features): A
combination of features from Model 2 and Model 4.
 Model 7 (Player Attribute, Item & Item-Network Based Features):
Union of all the features described above.

122

Application: Structural Signatures Approach Applied to Trade Networks

Not sufficient data is present to use the structural signature approach to catch
gold farmers
Alternative: Application of the same approach in other networks: Trade Network

A set of standard machine learning models are used: Naive Bayes, Bayes Net,
Logistic Regression, KNN, J48, JRip, AdaBoost and SMO

123

Conclusion: Gold Farmer Behavior Analysis and Prediction

Representation Related Issues
addressed with respect to trust in
MMOs
Application of frequent pattern mining to
discover distinct trust patterns associated
with gold farmers

No honor between thieves:
Gold farmers tend not to trust other
gold farmers

124


Computational Social Science
Semantics Structure

Trust and Homophily Characteristics of Trust Networks
Most types of homophily do not Trust based and other social networks in MMOs
carry over to the MMO domain exhibit anomalous network characteristics

Predictive Analysis
Trust Prediction Family of Problems


Applications
Detection of Clandestine Actors

No Honor Amongst Thieves

Conclusion
 Availability of data allows one to analyze phenomenon
where it was not possible to do so in the past (Trust in
MMOs in the current case)
 Explored a series of big-picture questions pertaining to
trust in MMOs
 Similar affordances and contexts (with respect to the offline
world) lead to similar outcomes in the online world
 Explored and expanded the scope of trust related
prediction tasks
 Analysis is required on more datasets for generalizability

127

Acknowledgement
 DMR Lab
 Professor Jaideep Srivastava and all DMR lab members especially
Zoheb Borbora, Amogh Mahapatra, Young Ae Kim, Nishith Pathak,
Kyong Jin Shim and Nisheeth Srivastava
 Virtual Worlds Observatory
 Professor Noshir Contractor (Northwestern), Professor Scott Poole
(UIUC), Professor Dmitri Williams (USC)
 External Collaborators at Northwestern, UIUC, USC, U Toronto
especially Brian Keegan
 Funding Agencies:
 NSF, AFRL, NSCTA, IARPA

128

Publication Summary

Publication Type Total Thesis Related Comment
Book 1 1 Book on Clandestine
Behaviors and Networks
(Springer 2012)

Conference 14 5
Papers
Book Chapters 1 -
Journal Papers 2 -
Workshop Papers 10 3 2 Best Paper Awards
Short/Poster 8 1
Papers
Technical Reports 4 1
Tutorials 4 1
27 Co-authors
Patents 2 1
129

A Computational Model
for
Social Influence

Objective: Model social
influence in a dynamic network
setting as a spread of cascades
to identify key influential nodes

131

Outline

 Team Overview
 Optimal Target Selection
 Current approaches
 Proposed Approach
 Results
 Update summary
 Q&A

132

Motivation

Original
Retweet

Global retweets of Tweets coming from Japan for one
hour after the earthquake
133
Courtesy:http://blog.twitter.com/2011/06/global-pulse.html

Optimal Target People Selection
 Goal: Find the optimal targets to maximize influence
spread in network
 Why is it important?
 Public Polls: Sentiment spread
 Sales and Marketing: Word-of-mouth spread
 Public Health: Disease spread
 Influence: A force that attempts to change the opinion or
behavior of the individual
 Influence is causal
 My friend buys a product -> I buy a product

134

Current Approaches (1/2)

 Assume a certain model for propagation of influence [1]
 Independent Cascade Model
 Each node is independently influences the neighbor
using a biased coin toss
 Linear Threshold Model
 Each node picks a random threshold
 Each node infects the neighbor by a specified
amount

All of them need to know the
propagation probability 135

Current Approaches (1/2)

 Use a two step approach
 Step 1: Choose influence Model

 Step 2: Find the subset of top-k nodes that maximizes
the influence
 Bad News: Step 2 is NP-Hard [2]
 Popular method for step 2: Greedy Heuristics
 Greedy heuristics gives (1-1/e) approximation

 Address scalability issues in the second step [3][4][5][6]
 CELF [3]: Influence maximization function follows law
of diminishing returns (sub-modular)

136

Limitations of current approaches

 Estimation/Learning of propagation
probabilities
 Data-driven and model free approach
 Propagation models are not content
unaware
 Content-aware influencer mining
 No causality effect in influence model
 Add causal effect in influence
propagation
137

Proposed Approach

Communication
Logs

Frequent Network
Sequence Discovery of
Mining Influencers

Ordered list of
influencers

Network
Structure
138

Sequence Cascade Mining

Example

Temporal order
Append neighbors from network i=2

Check downward closure and
support count
Peter, Charles, Kim, Vicky, Nancy
Network structure

Let Support = 2
139


Example


Check downward closure and
support count
Network structure

Let Support = 2
140


Example


Break loop; No more work to do…

Network structure
Return Frequent Sequences

Let Support = 2
141


L1 Frequent Sequence Mining

Candidate Generation for k+1
By appending neighbors

Downward closure pruning

Support Count and reorganize

142

Influence Function
 Node pattern Influence Set (Q(i,p))
=
 Influence Set of a Node

 Influence Set of a Node Set

It can be shown that influence function I(V,s) is
monotone and sub-modular
143

Network Discovery of Influencers

Greedy heuristic with (1-1/e) approximation
guarantee 144


Frequent Sequences

Example
i=1
Choose the node with max out-degree
Find top-3 (Randomly break tie between P
influencers and C)
P C chosen
Current Set of Influenced People
C N
C K P

K V
145


Frequent Sequences

Example
i=2

P P chosen

C N
C K P V N

K V
146


Frequent Sequences

Example
i=3
(Randomly break tie between K, V,
N)
P K chosen

C N
C K P V N

K V
147

Context-Aware Influencers

Context Specific
influencers

Bag words Frequent Network
representing Sequence Discovery of
the context Mining Influencers

148

Initial Results (1/2)

DBLP Dataset USPTO Dataset

Significant performance gain over established
baseline PrefixSpan 149

Initial Results (2/2)

DBLP
Dataset

USPTO
Dataset

150

Jaideep

Recommended

Recommended

More Related Content

Similar to Jaideep

Similar to Jaideep (20)

Recently uploaded

Recently uploaded (20)

Jaideep