Thesis defense presentation - SDSU Computational Sciences, 2013. The project uses network analysis tools, including community detection, PageRank, and eigenvector centrality, to compute key metrics of graphs built from keyword searches. These custom graphs and their metrics are then presented in interactive graphics and tables.
2. INTRODUCTION
• Deliverable of this project:
– Provide a means to identify who to turn to for
more information on a topic, or several topics
– Provide better insight into project organization
• Use Network Analysis tools
– Augment, not replace, existing enterprise search
tools
– Social Network Analysis
• Community detection algorithm
• PageRank, …
3. WHAT WAS DONE
• Used several datasets as input:
– Archived “public” Qualcomm mail-lists:
• 136 mailboxes, 10.0 GB
• 20393 vertices, 940403 edges
– Enron email:
• 158 mailboxes, 1.3 GB
• 90026 vertices, 3715056 edges
– Test datasets:
• Karate Club: 34 vertices, 78 edges
• Les Misérables: 77 vertices, 820 edges
• Created an interactive web client:
– User search term input
– Interactive graphics and tables of metrics
– Perl CGI, R, C++, JavaScript, D3 (Data-Driven Documents)
4. MAILBOX INPUT
Enron dataset: user based
• Wide range of topics per inbox
• All emails in each mailbox are
“from:”, “to:”, or “cc:” the
same person
• Examples:
• Jeff Skilling
• Kenneth Lay
Qualcomm dataset: topic based
• All emails have, for the most part, a
common theme
• Emails in each mailbox are from multiple
senders
• All emails include the mail-list as a
recipient
• Emails may include other
recipients, including other mail-lists
• Examples:
• Photography
• Android
• Hiking
7. USER INTERFACE – SEARCH
Enron:
Search term parsing:
user input: ‘San Diego Power’
regex: ‘(san|diego|power)’
All Fields: searches all dataset columns
Topic: searches only the “subject” column
People: searches only the “to:” and “from:” columns
Maillist: searches only the “mail-list” column*
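The search term parsing above can be sketched as follows. This is a minimal illustration, not the project's Perl CGI code: the function names, the `SEARCH_COLUMNS` mapping, and the sample row are hypothetical, and escaping of regex metacharacters is an added assumption.

```python
import re

def build_search_regex(user_input):
    """Turn free-text input into an alternation of its lowercased words."""
    # Escape each term so regex metacharacters typed by the user match literally.
    terms = [re.escape(t) for t in user_input.lower().split()]
    return "(" + "|".join(terms) + ")"

# Hypothetical mapping from search mode to the columns it inspects.
SEARCH_COLUMNS = {
    "topic": ["subject"],
    "people": ["to", "from"],
    "maillist": ["mail-list"],
}

def matches(row, pattern, mode="all"):
    """Return True if any searched column of `row` (a dict) matches the pattern."""
    columns = SEARCH_COLUMNS.get(mode, list(row))  # "all" searches every column
    regex = re.compile(pattern, re.IGNORECASE)
    return any(regex.search(str(row.get(c, ""))) for c in columns)

pattern = build_search_regex("San Diego Power")
row = {"subject": "Power outage update", "to": "a@example.com",
       "from": "b@example.com", "mail-list": "facilities"}
print(pattern)                                   # (san|diego|power)
print(matches(row, pattern, "topic"))            # True
print(matches(row, pattern, "people"))           # False
```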
16. PREPROCESSING
Edge weight
Two methods compared:
“bytes sent” and “message count”.
Bytes sent is the sum of the
number of characters in each
matching message, minus any
included emails in each message.
Message count is the total
number of matching messages.
Analysis and intuition both suggest
that either method provides similar
results.
Message count chosen because it
is simpler and faster.
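The two edge-weight methods can be compared with a small sketch. It assumes that included emails appear as lines quoted with a leading ">" and that every recipient of a message receives the full weight; both are illustrative assumptions, and the message records are made up.

```python
from collections import defaultdict

def strip_quoted(body):
    """Drop quoted lines; assumes included emails are marked with a leading '>'."""
    kept = [line for line in body.splitlines() if not line.lstrip().startswith(">")]
    return "\n".join(kept)

def edge_weights(messages):
    """Return (message_count, bytes_sent) dicts keyed by (sender, recipient)."""
    count = defaultdict(int)
    size = defaultdict(int)
    for msg in messages:
        n_chars = len(strip_quoted(msg["body"]))
        for rcpt in msg["to"]:
            edge = (msg["from"], rcpt)
            count[edge] += 1        # "message count" weight
            size[edge] += n_chars   # "bytes sent" weight
    return dict(count), dict(size)

# Hypothetical matching messages.
messages = [
    {"from": "alice", "to": ["bob"], "body": "Lunch?"},
    {"from": "alice", "to": ["bob", "carol"], "body": "> Lunch?\nYes, at noon."},
]
count, size = edge_weights(messages)
print(count[("alice", "bob")], size[("alice", "bob")])  # 2 19
```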
Enron dataset
19k vertices, 3.5M edges
17. COMMUNITY DETECTION
• Communities are clusters of vertices that have
more interconnections within the cluster than
outside of the cluster
• Any non-trivial social network will have
communities
• The metric associated with communities is
“modularity”, which ranges from -1 to 1, and is
defined as:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
18. COMMUNITY DETECTION
• $A_{i,j}$ is the edge weight between vertices i and j
• $k_i$ is the sum of the weights of the edges attached to vertex i
• m is ½ the sum of $A_{i,j}$ over all i and j, i.e. the total edge weight in the graph
(for compatibility with earlier definitions of modularity)
• $\delta(c_i, c_j)$ is equal to 1 if vertices i and j are in the same
community, and 0 if not.
• A completely random network has modularity ~ 0
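As a worked example of these definitions, the sketch below evaluates Q for a toy graph of two triangles joined by one edge. The function and the toy graph are illustrative and assume an undirected graph without self-loops; they are not taken from the thesis code.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = 1/(2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).

    edges: (i, j, weight) triples of an undirected graph without self-loops.
    community: dict mapping vertex -> community label.
    """
    k = defaultdict(float)       # k_i: total weight attached to vertex i
    inside = defaultdict(float)  # sum of A_ij over ordered pairs inside a community
    two_m = 0.0
    for i, j, w in edges:
        k[i] += w
        k[j] += w
        two_m += 2 * w
        if community[i] == community[j]:
            inside[community[i]] += 2 * w   # A_ij and A_ji both count
    total = defaultdict(float)   # sum of k_i per community
    for v, kv in k.items():
        total[community[v]] += kv
    return sum(inside[c] / two_m - (total[c] / two_m) ** 2 for c in total)

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
triangles = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
singletons = {v: v for v in range(6)}
print(modularity(edges, triangles))    # 5/14, about 0.357
print(modularity(edges, singletons))   # -34/196, about -0.173
```

Placing every vertex in its own community gives a negative Q, which matches the negative starting modularity shown for the Karate Club example.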
19. COMMUNITY DETECTION
• Modularity-based community detection algorithms seek to maximize Q
• Exact solutions are computationally expensive,
particularly for large networks
• Various community detection algorithms exist
• For this project the “Louvain” method was used
• Authors:
• Vincent Blondel • Jean-Loup Guillaume
• Renaud Lambiotte • Etienne Lefebvre
21. COMMUNITY DETECTION
Zachary − Karate Club
Step 2
[Figure: Zachary Karate Club network, vertices 1–34]
Modularity: -0.04980276
Assign each vertex to a unique community
22. COMMUNITY DETECTION
Zachary − Karate Club
Step 2
[Figure: Zachary Karate Club network, vertices 1–34]
Modularity: -0.04980276
Pick a vertex and tentatively place it in the
community of each of its neighbors in turn.
Measure the change in modularity for
each move.
Place the vertex in the community
that gives the greatest positive change.
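The vertex-moving step can be sketched in a few lines. This is a deliberately simple version that recomputes Q from scratch for every trial move, which a production Louvain implementation avoids; the names `local_moving` and `min_gain` and the toy graph are assumptions for illustration.

```python
import random
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of an undirected weighted graph without self-loops."""
    k, inside, total = defaultdict(float), defaultdict(float), defaultdict(float)
    two_m = 0.0
    for i, j, w in edges:
        k[i] += w
        k[j] += w
        two_m += 2 * w
        if community[i] == community[j]:
            inside[community[i]] += 2 * w
    for v, kv in k.items():
        total[community[v]] += kv
    return sum(inside[c] / two_m - (total[c] / two_m) ** 2 for c in total)

def local_moving(edges, seed=0, min_gain=1e-9):
    """Move vertices between neighboring communities until no move raises Q."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for i, j, _ in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    community = {v: v for v in neighbors}   # start: one community per vertex
    improved = True
    while improved:                         # one pass = one complete cycle
        improved = False
        order = list(neighbors)
        rng.shuffle(order)                  # visit vertices in random order
        for v in order:
            best_c, best_q = community[v], modularity(edges, community)
            for c in sorted({community[n] for n in neighbors[v]}):
                trial = dict(community)
                trial[v] = c                # try v in a neighbor's community
                q = modularity(edges, trial)
                if q > best_q + min_gain:
                    best_c, best_q = c, q
            if best_c != community[v]:
                community[v] = best_c       # keep the move with the largest gain
                improved = True
    return community

# Two triangles joined by one edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
result = local_moving(edges, seed=1)
print(result, modularity(edges, result))
```

Every accepted move strictly increases Q, so the loop terminates; the final partition can depend on the random visiting order.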
23. COMMUNITY DETECTION
Modularity: -0.04980276
Continue picking vertices at random
and swapping communities until a
complete cycle yields less than a
minimum increase in modularity.
Zachary − Karate Club
Step 3 through N
swap communities
[Figure: Zachary Karate Club network, vertices 1–34]
24. COMMUNITY DETECTION
Modularity: 0.2483563
Continue picking vertices at random
and swapping communities until a
complete cycle yields less than a
minimum increase in modularity.
Zachary − Karate Club
Step 4 through N
[Figure: Zachary Karate Club network, vertices 1–34]
27. COMMUNITY DETECTION
Collapse all nodes and edges in each
new community into a single
“super-community” node. Repeat the previous steps.
[Figure: collapsed network of super-communities 1–7]
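The collapse step can be sketched as below. It assumes that weights between two communities are summed into one edge and that weights inside a community become a self-loop on its super-vertex; self-loop weighting conventions differ between implementations, so treat this as one possible choice. The function name and toy graph are illustrative.

```python
from collections import defaultdict

def aggregate(edges, community):
    """Collapse each community into one super-vertex.

    Returns (a, b, weight) triples between community labels. Edges inside a
    community become a self-loop (a == b) carrying their summed weight.
    """
    merged = defaultdict(float)
    for i, j, w in edges:
        a, b = sorted((community[i], community[j]))
        merged[(a, b)] += w
    return [(a, b, w) for (a, b), w in sorted(merged.items())]

# Two triangles joined by one edge, already split into communities A and B.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
super_edges = aggregate(edges, community)
print(super_edges)  # [('A', 'A', 3.0), ('A', 'B', 1.0), ('B', 'B', 3.0)]
```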
28. COMMUNITY DETECTION
Repeat previous steps until no further
increase in modularity can be gained, or
an upper limit on iterations is reached
[Figure: collapsed network of super-communities 1–4]
32. METRICS
• Eigenvector Centrality
– Favors nodes that are highly connected to other
highly connected nodes
• PageRank
– Favors nodes that are connected from other highly
connected nodes
– Strongly biased towards mail-lists
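Both metrics can be approximated by power iteration. The sketch below is a plain-Python illustration, not the R routines used in the project; the damping factor 0.85, the iteration count, and the toy "three senders, one mail-list" graph are assumptions. The eigenvector routine iterates with A + I instead of A, which keeps the same leading eigenvector but avoids oscillation on bipartite graphs such as this star.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """PageRank on a directed graph given as {node: [targets]}."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * rank[v] / n
        rank = new
    return rank

def eigenvector_centrality(neighbors, iterations=100):
    """Eigenvector centrality on an undirected graph {node: [neighbors]}."""
    score = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        # One step of x <- (A + I) x, then rescale so the largest score is 1.
        new = {v: score[v] + sum(score[u] for u in neighbors[v]) for v in neighbors}
        norm = max(new.values())
        score = {v: s / norm for v, s in new.items()}
    return score

# Three senders who all post to one mail-list.
rank = pagerank({"a": ["list"], "b": ["list"], "c": ["list"]})
centrality = eigenvector_centrality(
    {"list": ["a", "b", "c"], "a": ["list"], "b": ["list"], "c": ["list"]})
print(max(rank, key=rank.get))   # list
print(centrality)
```

In this toy graph the mail-list collects rank from every sender, which illustrates the bias towards mail-lists noted above.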
35. METRICS
• Degree
– the number of other vertices connected to a
vertex
• Strength (in, out, total)
– sum of the weights of a vertex's inbound, outbound, or all edges
• Betweenness Centrality:
– The number of shortest paths passing through
a vertex divided by the total number of shortest
paths in the network
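These three metrics can be sketched on small graphs as follows. The betweenness routine uses the common formulation (for each pair of other vertices, add the fraction of their shortest paths that pass through the vertex) on an unweighted, undirected graph, and it is a brute-force version suited only to small inputs. Function names and sample data are hypothetical.

```python
from collections import defaultdict, deque

def degree_and_strength(edges):
    """edges: (source, target, weight) triples of a directed graph."""
    nbrs = defaultdict(set)
    s_in, s_out = defaultdict(float), defaultdict(float)
    for s, t, w in edges:
        nbrs[s].add(t)
        nbrs[t].add(s)
        s_out[s] += w
        s_in[t] += w
    return {v: {"degree": len(nbrs[v]), "in": s_in[v], "out": s_out[v],
                "total": s_in[v] + s_out[v]} for v in nbrs}

def bfs_counts(neighbors, source):
    """Distances from source and number of shortest paths to each vertex."""
    dist = {source: 0}
    sigma = defaultdict(int)
    sigma[source] = 1
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in neighbors[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
            if dist[u] == dist[v] + 1:
                sigma[u] += sigma[v]
    return dist, sigma

def betweenness(neighbors):
    """Unnormalized betweenness on an undirected graph {node: [neighbors]}."""
    info = {v: bfs_counts(neighbors, v) for v in neighbors}
    nodes = sorted(neighbors)
    score = {v: 0.0 for v in nodes}
    for a, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[a + 1:]:
            if t not in dist_s:
                continue                      # s and t are not connected
            dist_t, sigma_t = info[t]
            for v in nodes:
                if v in (s, t) or v not in dist_s or v not in dist_t:
                    continue
                if dist_s[v] + dist_t[v] == dist_s[t]:   # v lies on a shortest path
                    score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score

stats = degree_and_strength([("a", "list", 2.0), ("b", "list", 1.0), ("list", "a", 1.0)])
path = betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]})
print(stats["list"])   # {'degree': 2, 'in': 3.0, 'out': 1.0, 'total': 4.0}
print(path)            # {'a': 0.0, 'b': 1.0, 'c': 0.0}
```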
37. SOFTWARE DESIGN
• Testing done by using known datasets and
comparing values to other published values
• First step of the CGI is to run PageRank and
keep only the top 750 nodes
• Most searches likely only want the top few
ranked vertices
• Keep processing on local machine manageable
• Prevents “hairballs”
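The top-750 filter can be sketched as keeping the highest-ranked vertices and only the edges between them. The function name, the `limit` parameter, and the sample scores are illustrative assumptions; the PageRank scores are taken as already computed.

```python
def keep_top_nodes(edges, rank, limit=750):
    """Keep the `limit` highest-ranked vertices and the edges among them."""
    top = set(sorted(rank, key=rank.get, reverse=True)[:limit])
    kept = [(s, t, w) for s, t, w in edges if s in top and t in top]
    return top, kept

# Hypothetical PageRank scores and weighted edges.
rank = {"list": 0.5, "alice": 0.3, "bob": 0.15, "carol": 0.05}
edges = [("alice", "list", 4.0), ("bob", "list", 2.0), ("carol", "alice", 1.0)]
top, kept = keep_top_nodes(edges, rank, limit=2)
print(sorted(top), kept)  # ['alice', 'list'] [('alice', 'list', 4.0)]
```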
38. SOFTWARE DESIGN
• Perl CGI running under either Apache or
Python CGIHTTPServer
• R does all the heavy lifting for the analysis
• Force-Directed Graph from D3, a JavaScript
library, is used for interactive graphics
• DataTables creates interactive HTML tables for
sorting and filtering
• The size of the vertices is an average of the
PageRank and Eigenvector values
• Color is assigned by community
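The vertex sizing and coloring can be sketched as a small export step that produces JSON for a force-directed layout. The field names ("name", "size", "group", "source", "target", "value") follow a common D3 example layout but are assumptions here, as is rescaling each metric to a maximum of 1 before averaging so that the two scales are comparable.

```python
import json

def d3_graph(edges, pagerank, eigenvector, community):
    """Build a nodes/links JSON document for a force-directed graph."""
    def rescale(scores):
        top = max(scores.values())
        return {v: (s / top if top else 0.0) for v, s in scores.items()}

    pr, ev = rescale(pagerank), rescale(eigenvector)
    names = sorted(pr)
    index = {v: i for i, v in enumerate(names)}
    nodes = [{"name": v,
              "size": (pr[v] + ev[v]) / 2,   # average of the two metrics
              "group": community[v]}         # community drives the color
             for v in names]
    links = [{"source": index[s], "target": index[t], "value": w}
             for s, t, w in edges]
    return json.dumps({"nodes": nodes, "links": links})

# Hypothetical metric values for two vertices.
doc = d3_graph(edges=[("a", "list", 3.0)],
               pagerank={"a": 0.2, "list": 0.8},
               eigenvector={"a": 0.5, "list": 1.0},
               community={"a": 1, "list": 1})
print(doc)
```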
42. SOFTWARE USAGE
• Results from a chip design project
– Dark blue: configuration management
– Light blue: hardware design
– Dark green: senior leadership
– Light green: test and design for test
– Salmon: not exactly sure
• One of the senior leads recognized this as a
good visualization of the
organization of the team
and said this would be of
value to Qualcomm
43. FUTURE STEPS
• More data sources
– ClearCase, email, communities, Perforce, HR
databases
• Better search
• Deeper search
• Make Gephi more scriptable
• Commercial products