A Study of Internet RFC Authors using NetDraw and yEd

© 2009 – Olivier MJ Crépin-Leblond. Full copyright notice on Page 1

A Study of Internet RFC Authors using
NetDraw and yEd.
Olivier M. J. Crépin-Leblond, PhD.

Abstract— The Internet is a very important yet extremely sophisticated aspect of modern life.
There has often been discussion in online forums about its origins. In particular, the community
feels that it is time to say “thank you” to those people who contributed to its design and evolution.
Some of the main contributors are already well known and recognized. This essay shows how to
use Social Network Analysis to identify the other significant contributors to this adventure. The
analysis rests on the main assumption that the Internet Engineering Task Force’s (IETF) 5000+
“Request For Comments” (RFCs) constitute the engineering basics for the Internet. Here, we use
novel methods to extract data from the RFCs using readily available software, and use a suite of
free downloadable software to draw several social maps of the RFC authors’ space. Our results
highlight recent techniques for social mappings & data analysis in complex interaction
environments such as large organizations and emerging bottom-up process governance circles
such as those considered for governing the Internet.

Index Terms—NetDraw, Mage, yEd, RFC, Father, Internet, Social, Networking, IETF.

I. INTRODUCTION
IN APRIL 1969, Dr. Steve Crocker, then at UCLA, published the first Request for Comments, RFC 1 [1],
entitled Host Software. The RFC repository, consisting of more than 5000 entries, remains one of the
“technical pillars” of the network of networks called the Internet. Once published, an RFC cannot be
modified. Many RFCs are therefore superseded (or made obsolete) as new ones replace them, but each
publication contributes to the overall Internet edifice. As mentioned on the RFC Editor Web page, “The
RFC (Request for Comments) series contains technical and organizational documents about the
Internet, including the technical specifications and policy documents produced by the Internet
Engineering Task Force (IETF).”[2].
So who is the “Father of the Internet”? There is no single answer to this frequently posed question. Dr.
Leonard Kleinrock is credited with packet switching theory [3], and Dr. Joseph Licklider with the concept
that computers could all be connected together into a giant network to talk to each other [4]. What about
Dr. Douglas Engelbart [5], inventor of the computer mouse? One of the most important advances in the
Internet’s development was the TCP communications protocol, developed in 1974 by Dr. Vinton Cerf
and Dr. Robert Kahn [6]. However, circa 1977 the “IP” in TCP/IP was split off from TCP at the
urging of Dr. Danny Cohen, Dr. David Reed and Jon Postel, to support real-time, unsequenced packet
streams. Furthermore, Dr. Robert Metcalfe is credited with co-inventing Ethernet [7], which today is the
basic physical communication standard in most wired networked computers. How do all these people
relate to each other?

Draft manuscript completed December 5, 2008. Revised April 2009. Working Title: “Will the Real Father(s) please stand up?” This work was supported
in part by Global Information Highway Limited. The Author is with Global Information Highway Ltd, 7 Kensington Church Court, London, W8 4SP, UK.
(e-mail: ocl@gih.com)
© 2008/2009 Olivier MJ Crépin-Leblond. All Rights Reserved. The Copyright for this paper rests with the Author but permission to freely distribute the
information contained within this publication is granted provided the source of the article is credited. Parts of this document may be reproduced in a
commercial publication ONLY if prior permission has been granted by the copyright holder.

However, the Internet is not solely TCP/IP and Ethernet. A great number of services and other
protocols at each layer of the Internet model make this network of networks what it is today. It is
therefore likely that each protocol and component of today's Internet has several “fathers” (and
“mothers”). In fact, there are several thousand such contributors, both inside and outside the realm of
RFC space. Nevertheless, because their proposals are contained in the many RFCs, we decided to look
specifically at the Internet standards, the RFCs and their authors, possibly the largest “family” of Internet
pioneers and contributors available.

This essay serves to determine the most prolific authors/contributors to the RFC database and to
extract a social network of RFC authors in order to better understand their working relationships and
spheres of influence. It uses modern social network engineering tools to make the vast amounts of data
available to us today more easily understandable. It will also serve to highlight the shortcomings of such
a method, mainly caused by its restricted input data set consisting solely of the RFC database.

Why this research?
By undertaking this research, we show the use of social networking topology modeling to elucidate
the workings of bottom-up processes promoted to construct at-large governance. We define a
methodology for such study and look forward to such an analysis being used in future organizational
processes involving large groups of participants. Finally, we explore avenues to more fully comprehend
the change in social paradigm that the Internet brings to the traditional governing processes used in
non-Internet regimes.


II. DATA COLLECTION METHOD

A. Collecting Data
The source data for the study was downloaded from the RFC Editor FTP site as the RFC Bibliographic
Listing (Created 09/08/2008) [8]. This listing follows a set format, which makes it easier to process
by machine than other RFC documents. It indexed 5340 RFCs.

B. Refining/Formatting Data
Using data mining techniques to extract the names of authors and their interpersonal relationships
from the list of RFC authors forms a crucial part of the work.
No purpose-built software was used for data mining: the data set was filtered in several stages using
text processing techniques available to anyone familiar with standard Microsoft software.

This consisted of importing the list of RFC authors as a text file into MS Word and reformatting the
text with consistent delimiters using the “replace” functions of that software. The resulting file was
imported into an MS Excel table in which each line corresponds to one RFC entry, with the names of its
authors one per column: a formatted table of authors working together. The most time-consuming
step was to cross-check the accuracy and alignment of the data manually, because of errors
caused by inconsistent formatting in the original file. For example, missing punctuation delimiters in
the MS Word file caused names to land in the wrong columns. Intermediate stages included tables 52
columns across and 5 170 rows in height. This table was transformed (using cut/paste) into a linear
numbered referential X-Y listing of authors containing 10 735 entries.
The file was imported into an MS Access database. Two cross-linking rules were set up. The first one
served to add up the total number of publications per author. The second one was used to add up the
number of publications of each pair of authors.
The input was a table of 10 735 entries. The outputs consisted of a table of 3 480 entries for the
authors listing and a table of 17 266 pairing links. This constituted our network of authors.
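
The two counting rules can be illustrated outside MS Access. The following is a minimal Python sketch, not the procedure used in the study: the function name and the toy input are hypothetical, and it assumes the cleaned listing is available as one list of author names per RFC. Pairs are counted once per unordered pair, whereas the tie listing in this paper records each pair in both directions.

```python
from collections import Counter
from itertools import combinations

def count_authors_and_pairs(rfc_author_lists):
    """Return (publications per author, joint publications per author pair)."""
    per_author = Counter()
    per_pair = Counter()
    for authors in rfc_author_lists:
        # Guard against a name repeated within a single RFC entry.
        unique = sorted(set(authors))
        per_author.update(unique)
        # Each unordered pair of co-authors gains one joint publication.
        per_pair.update(combinations(unique, 2))
    return per_author, per_pair

# Hypothetical toy input: one inner list per RFC entry.
rfcs = [
    ["Postel_J.", "Reynolds_J."],
    ["Postel_J."],
    ["McCloghrie_K.", "Rose_M."],
    ["Postel_J.", "Reynolds_J.", "Rose_M."],
]
per_author, per_pair = count_authors_and_pairs(rfcs)
```

Sorting the names inside each RFC keeps every pair in a single canonical order, so that (A, B) and (B, A) are never counted separately.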

Cutting and pasting into a text file and adding the correct formatting code resulted in a file satisfying
the input “.vna” format required for the NetDraw Software. The format is human-readable and therefore
easy to generate manually or automatically, without being a proprietary binary file format. It is shown
next.


*Node data
ID, publi
Postel_J. 205
McCloghrie_K. 92
Rose_M. 75
Rekhter_Y. 69
Reynolds_J. 64
Schulzrinne_H. 62
McKenzie_A. 60
Braden_R. 51
Crocker_D. 51
[…]
*Tie data
from to intensity
Postel_J. Reynolds_J. 37
Reynolds_J. Postel_J. 37
McCloghrie_K. Rose_M. 26
Rose_M. McCloghrie_K. 26
[…]

ID is the author’s name; publi is a variable denoting the number of publications; intensity is the
number of publications for the author pair. Obviously, this collaboration is reciprocal so it is
automatically shown going both ways. “[…]” denotes all further entries.
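
Generating such a file automatically can be sketched as follows. This is an illustration only: the function name is hypothetical, the layout simply mirrors the excerpt shown above rather than a full specification of the .vna format, and the counts are the ones quoted in the excerpt.

```python
def to_vna(per_author, per_pair):
    """Build .vna-style text following the layout of the excerpt in this paper."""
    lines = ["*Node data", "ID, publi"]
    # Most prolific authors first, ties broken alphabetically.
    for name, count in sorted(per_author.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{name} {count}")
    lines += ["*Tie data", "from to intensity"]
    for (a, b), count in sorted(per_pair.items(), key=lambda kv: (-kv[1], kv[0])):
        # Collaboration is reciprocal, so each tie is written in both directions.
        lines.append(f"{a} {b} {count}")
        lines.append(f"{b} {a} {count}")
    return "\n".join(lines) + "\n"

# Counts taken from the excerpt above.
per_author = {"Postel_J.": 205, "Reynolds_J.": 64, "McCloghrie_K.": 92, "Rose_M.": 75}
per_pair = {("Postel_J.", "Reynolds_J."): 37, ("McCloghrie_K.", "Rose_M."): 26}
vna_text = to_vna(per_author, per_pair)
```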

The data mining mechanism defines the data which is made available for the NetDraw software to
analyze and plot. Different data sets can be designed for different purposes and the stage of information
collection and data mining is therefore crucial in relation to the targeted end results.


III. GLOSSARY OF TERMS
In order to analyze a social network, we start by looking at each individual.
In the field of bottom-up analysis, all networks are composed of groups (or sub-graphs). When two
participants have a tie, they form a "group". One approach to thinking about the group structure of a
network begins with this most basic group, and seeks to see how far this kind of close relationship can
be extended. This is a useful method, because sometimes more complex social structures evolve, or
emerge, from very simple ones, and this is the type of hidden information which we are hoping to detect
when analyzing the network.
Social networking analysis relies on graph theory, a discipline which has been traditionally
mathematical in nature. Because each discipline speaks a particular language, it is important to define a
restricted number of terms which will be used at length in this essay.

For ease of reference, those terms are presented here, taking into account the context of our
analysis. In general, different terms can have the same meaning depending on the context
(bibliographical, scientific, geographic, mathematical, etc.). Their equivalence is shown here.

A “node”, also referred to mathematically as a “vertex” (plural: “vertices”), is a point representing a
single RFC author. In NetDraw, this is also called a “symbol”. In the paragraph above, we referred to a
node as a “participant” or an “individual”. In order to reduce confusion, we use only “node” and
“author”.
When two or more nodes (RFC authors) work together on an RFC, they are linked by a “line”. A line
therefore ends at nodes. In NetDraw, this is also called a “link”. A mathematical designation of a link is
an “edge”. All three terms will be used in this essay.
A “graph” is the set of nodes and the set of lines between pairs of nodes, as visualized in a 2- or
3-dimensional space.
A “network” consists of a graph and additional information on the nodes or the lines of the graph.
This is effectively what we are building with NetDraw.
A “cluster” is a group of 2 or more nodes connected together.
A “clique” is a maximal complete sub-network containing 3 nodes or more. It is a specific form of
cluster. In graph theory, this sub-set of a network contains nodes which are more closely and intensely
tied to one another than they are to other members of the network. Strictly speaking a group is identified
as a clique when every node is directly linked to every other node in the group.
A “dyad” is the smallest grouping of nodes, that is, two nodes linked together.
“Betweenness” is defined as the degree to which a node lies between other nodes in the network. In
effect, such a node is an intermediary, also known as a bridge or a liaison. It is commonly measured as
the number of shortest paths between pairs of other nodes that pass through the node. The degree of
betweenness is important in a social network because it identifies the nodes connecting sometimes
vastly different groups together.
“Closeness” is defined as the degree to which a node is near all other nodes in a network (directly or
indirectly). Thus, the closeness of a node is the inverse of the sum of the shortest distances between
that node and every other node in the network. A node with a high degree of closeness is more “central”
to the network than one with lower closeness.
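
The closeness definition can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not NetDraw's implementation: the network is held as a hypothetical adjacency dictionary, every link has length 1, and nodes that cannot be reached are simply left out of the sum.

```python
from collections import deque

def shortest_distances(adj, source):
    """Breadth-first search: number of links on the shortest path to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, node):
    """Inverse of the sum of shortest distances from `node` to every other reachable node."""
    dist = shortest_distances(adj, node)
    total = sum(d for other, d in dist.items() if other != node)
    return 1.0 / total if total else 0.0

# Hypothetical chain of four authors: A - B - C - D
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
closeness_a = closeness(adj, "A")   # distances 1 + 2 + 3
closeness_b = closeness(adj, "B")   # distances 1 + 1 + 2
```

In this chain the inner node B scores higher than the end node A, matching the statement that a node with higher closeness is more central.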
“Pendants” are nodes connected to the rest of the network through a single link.
“Isolates” are nodes which are not connected to any other node in the network. In our case, this is an
author having published all of his or her RFCs solo.
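
Dyads, pendants and isolates can be told apart from the number of links at each node. The sketch below is one possible reading of the definitions above, with a hypothetical function name and toy network: a node with a single link is treated as half of a dyad when its only neighbor also has a single link, and as a pendant otherwise.

```python
def classify_nodes(adj):
    """Split nodes into pendants, isolates and dyads, based on their number of links."""
    pendants, isolates, dyads = [], [], set()
    for node, neighbors in adj.items():
        if len(neighbors) == 0:
            isolates.append(node)              # no co-author at all
        elif len(neighbors) == 1:
            (other,) = neighbors
            if len(adj[other]) == 1:
                # Two nodes linked only to each other form a dyad.
                dyads.add(frozenset((node, other)))
            else:
                pendants.append(node)          # hangs off the network by one link
    return sorted(pendants), sorted(isolates), sorted(tuple(sorted(d)) for d in dyads)

# Hypothetical network: triangle A-B-C, pendant D on A, dyad E-F, isolate G.
adj = {
    "A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"A"}, "E": {"F"}, "F": {"E"}, "G": set(),
}
pendants, isolates, dyads = classify_nodes(adj)
```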


IV. PLOT AND ANALYSIS
NetDraw [9] is network visualization software that can be downloaded from the Internet for free. Its
license agreement allows it to be freely copied. A set of analytical procedures is available to extract
meaning from the data. The algorithms included in the software are used in social network analysis,
micro-molecular analysis, physics as used in astronomy, and other disciplines. In this section, results
are presented for several types of analysis.

A. Circle Layout
1) Method/Theory
The Circle Layout uses a simple algorithm to plot nodes around a circle. In NetDraw, it is
possible to define the order of the nodes around the circle to be alphabetic or to depend on the number
of RFCs published.
The best connected nodes are found by simply looking at the concentration of links and their
thickness. User intervention is, however, required to detect pendants since these are also plotted within
the circle and are not immediately discernible.
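
The principle of such a layout can be sketched in a few lines of Python. This is an illustrative sketch, not NetDraw's code: the function name, the unit radius and the choice to start at angle zero are assumptions, and the publication counts are those quoted earlier in this paper.

```python
import math

def circle_layout(nodes, key=None, radius=1.0):
    """Place nodes at equal angles around a circle, in the order given by `key`."""
    ordered = sorted(nodes, key=key)
    n = len(ordered)
    return {
        node: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, node in enumerate(ordered)
    }

publi = {"Postel_J.": 205, "McCloghrie_K.": 92, "Rose_M.": 75}
alphabetic = circle_layout(publi)                                # ordered by name
by_count = circle_layout(publi, key=lambda name: -publi[name])   # most RFCs first
```

Because the position depends only on the rank in the chosen ordering, two closely linked authors can end up on opposite sides of the circle.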
2) Results
A graph of the resulting plot is shown in Figure 1.

Figure 1: Network nodes plot using the Circle Layout (Authors having published 20+ RFCs)

The parameters used for the plot were as follows:
• Data subset: author has published more than 20 RFCs (64 authors satisfied this rule)
• Node size according to number of publications
• Link thickness between two nodes according to number of publications the authors have written
together

There is a concentration of links around Jon Postel. This is to be expected since, as RFC editor for
many years, his contribution to the RFC process vastly exceeded any other author’s contribution. Other
link concentrations can also be easily discerned, directly related to the closeness of each author.
A dyad, a few isolates, as well as several pendants are visible. The order around the circle is set
automatically by the program using a user-chosen parameter, in this case alphabetical order.
Another straightforward parameter which could be chosen for this function is the number of links to
other nodes. Nonetheless, neither parameter avoids the following pitfalls of the software, which
require a human eye to correct:
• The dyad, pendants and isolates had to be extracted manually from the circle’s layout;
• Nodes are not arranged in an order which reduces link distance. For example, Malkin_G is
connected to Reynolds_J and Baker_F but is placed on the other side of the
circle, thus adding to a possibly false impression of extensive inter-connection between nodes.

3) Scaling up
Loosening the data subset constraint of 20 RFC publications per author brings more nodes into the
picture. The restricted data set may show no connection between the isolates and the main group, but
this may only be due to the constraint used. In fact, the isolates may connect to the main group via other
authors who do not satisfy the sample’s constraint but have a high degree of betweenness.
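
This effect of the threshold can be reproduced on a toy example. The Python sketch below uses hypothetical author names and counts: author D appears as an isolate when only authors of 20 or more RFCs are kept, yet is connected to the main group through author C once the threshold is lowered to 10.

```python
def subset_network(per_author, per_pair, min_rfcs):
    """Keep authors with at least `min_rfcs` publications and the ties among them."""
    kept = {a for a, n in per_author.items() if n >= min_rfcs}
    ties = {pair: n for pair, n in per_pair.items()
            if pair[0] in kept and pair[1] in kept}
    return kept, ties

def isolates_in(kept, ties):
    """Authors in the subset with no tie to any other author in the subset."""
    linked = {author for pair in ties for author in pair}
    return sorted(kept - linked)

# Hypothetical data: C links D to the rest but has published fewer than 20 RFCs.
per_author = {"A": 25, "B": 30, "C": 12, "D": 22}
per_pair = {("A", "B"): 5, ("B", "C"): 3, ("C", "D"): 4}

isolates_at_20 = isolates_in(*subset_network(per_author, per_pair, 20))
isolates_at_10 = isolates_in(*subset_network(per_author, per_pair, 10))
```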
Reducing the constraint by selecting authors of 10 RFCs or more (171 authors) reveals an increase in
mesh density within the network. This is shown in Figure 2.

Figure 2: Network nodes plot using the Circle Layout (Authors having published 10+ RFCs)


Removing the subset constraints altogether shows the overall graph shape of the network, including
all 3 480 authors, as shown in Figure 3. It is clear that RFC authors are well connected together and that
the RFC process provides a real sense of community.

Figure 3: Network nodes plot using the Circle Layout (all RFC authors)

4) Conclusions
The Circle layout uses a simple algorithm to display nodes around a circle. Its advantages
are modest computing power requirements and a display giving the eye a clear sense of
cross-group connectivity. It has two weaknesses. First, isolates, dyads and nodes connected to only a
few other nodes are poorly placed. Second, the algorithm does not position nodes according to their
links to other nodes.
Both weaknesses can be corrected by human intervention. As a result, the algorithm is very useful for
displaying social interaction between the authors of RFCs and detecting some of the synergies that
originated in building the RFC standards.


B. Multi-Dimensional Scaling (MDS) Analysis
1) Method/Theory
Multi-Dimensional Scaling (MDS) Analysis [10] comprises a set of statistical techniques used together
to visualize data in an N-dimensional space. The MDS algorithm looks at similarities within the data
and assigns a location to each node of the input network. This algorithm is particularly suited to 3D
visualization.
MDS is not so much an exact procedure as a way to "rearrange" nodes in an efficient manner,
so as to arrive at a configuration that best approximates the link structure.
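
The idea of rearranging nodes until the layout approximates the link structure can be sketched as a simple stress minimization. The Python code below is an illustration only and is not the MDS variant implemented in NetDraw: it assumes that the dissimilarity between two authors is the number of links on the shortest path between them, and it moves the nodes by plain gradient descent with a hypothetical step size and a fixed random seed.

```python
import math
import random

def stress(pos, dist):
    """Sum of squared differences between layout distances and target distances."""
    nodes = list(pos)
    total = 0.0
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            total += (math.dist(pos[u], pos[v]) - dist[u][v]) ** 2
    return total

def mds_layout(dist, dims=3, steps=500, rate=0.05, seed=0):
    """Start from random positions and nudge each node to reduce the stress."""
    rng = random.Random(seed)
    pos = {u: [rng.uniform(-1.0, 1.0) for _ in range(dims)] for u in dist}
    for _ in range(steps):
        for u in dist:
            grad = [0.0] * dims
            for v in dist:
                if u == v:
                    continue
                d = math.dist(pos[u], pos[v]) or 1e-9   # avoid division by zero
                coeff = 2.0 * (d - dist[u][v]) / d
                for k in range(dims):
                    grad[k] += coeff * (pos[u][k] - pos[v][k])
            for k in range(dims):
                pos[u][k] -= rate * grad[k]
    return pos

# Hypothetical target distances for a chain of four authors: A - B - C - D
dist = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 3},
    "B": {"A": 1, "B": 0, "C": 1, "D": 2},
    "C": {"A": 2, "B": 1, "C": 0, "D": 1},
    "D": {"A": 3, "B": 2, "C": 1, "D": 0},
}
layout = mds_layout(dist)
```

The three coordinates returned for each node are the kind of data a 3D viewer needs; nothing in the sketch guarantees that the global minimum of the stress is reached.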

2) Results & tri-dimensional MAGE Plot
• Link thickness between two authors according to number of publications written together

MDS analysis yields poor results when plotted in 2 dimensions because the nodes overlap each other,
thus making the graph illegible. However, it is possible to view a 3D graph by exporting the graph
data (in Kinetic Image .kin format) to a separate (free) 3-dimensional rendering program named
MAGE [11].
MAGE is used for all sorts of 3-dimensional rendering such as molecular chemistry and physics,
biology, mathematical analysis and even archeological modeling. NetDraw can export data to a Kinetic
Image format, which makes it suitable for displaying the network in 3D, as seen in Figure 4.

Figure 4 : Mage visualisation of authors of 20+ RFCs

The overall structure consists of a main cluster of nodes and several isolates. Pendants are also clearly
discernible. Nodes at the center of the cluster can be clearly seen as being more connected.
An important feature of MAGE is the ability to rotate the structure taking any node as an axis.
Zooming in/out is also possible. Rotation is particularly important in helping the brain to
understand 3D structures, although we are only using a subset of the features of MAGE.

The zoom feature is illustrated in Figure 5 which shows a clique within the network structure. This
shows a working group of authors who wrote several RFCs together. Whilst it does not mean that all
authors were present in each RFC, it shows extended collaboration between the authors represented by
the nodes.

Figure 5: Zooming in on a cluster within a Mage Plot of Multi-dimensional Analysis

Once zoomed-in, rotating the structure around the central cluster’s node is also possible and yields good
results.

3) Scaling up
Multi-Dimensional Scaling analysis is a processor- and memory-intensive method since its results are
best represented in 3 dimensions. A test run was undertaken by selecting authors having published at
least 10 RFCs. This brought the number of authors up to 171. MDS cluster analysis, although
demanding much processing power, gave poor results, even when plotted using MAGE. The cause was
traced to tight clustering of the nodes, which required parameters in MAGE to be tweaked to omit
displaying the nodes. This resulted in a diagram showing the links only: a wire frame of the whole
structure which required maximum zooming in to be displayed. The resulting view was very unclear.

Scaling up the MDS analysis from the initial 64 nodes to hundreds or even
thousands of nodes requires more computing power and several gigabytes of memory. Insufficient
memory triggers buffer overflows which crash the software. Future versions of the software might avoid
this condition, although increasing the node count increases complexity rapidly.


4) Conclusions
Multi-Dimensional Scaling analysis is useful for displaying the network in three dimensions. The
NetDraw feature to transfer the results to MAGE (through a .kin export file) is very useful to plot the
network in true 3D, including changing the position of lights as well as visualizing the network from
any angle and traveling virtually through it. Node clusters can be visualized with ease. However, some
information is lost, for example the thickness of links or the size of nodes. It is hoped that future
versions of NetDraw and MAGE will incorporate those features to enhance the visualization.


C. Geodesic Distance through Spring Embedding
1) Method/Theory
The Spring Embedding method is based on the geometric theory of gravitation [12], although
constrained to the 2-dimensional plane, hence the crowded display. Each node is considered to be acting
on the other nodes through attraction and repulsion, and the links between the nodes are taken as springs
enabling the nodes to travel. This iterative method places nodes on the plane and eventually reaches a
stable state, provided enough iterations are calculated.
The “geodesic distance” is the length of the shortest path between two nodes. If node x is connected to
node y, which is connected to node z, and no shorter path exists between x and z, then the geodesic
distance from node x to node z is the sum of the geodesic distances from x to
y and from y to z. In the context of social networking, this enables us to analyze the “networking extent”
of an individual based on his or her connections. In other words, how well are they connected
to the rest of the network? This is the concept of “centrality”, also referred to as “closeness” and
described earlier in the glossary.

The constraints of the layout criteria, whilst introducing some error margin, included “node repulsion”
and “equal edge lengths”.
Node repulsion introduces a minimum distance between nodes displayed on the graph and is required
to avoid a clustering of the nodes to the extent that the overall diagram would be unreadable.
Equal edge lengths is self-explanatory and serves to constrain the length of the links between the
nodes in order to provide some space within the graph. It does not mean that all links will have the same
lengths: the program will just try to make the lengths as similar as possible. Both constraints were used
specifically to improve the readability of the graph.

Because the analysis is iterative, separate runs do not yield a
geometrically identical layout, although the layouts produced are very similar in shape and
geometric positioning. The structures and clusters are the same.
This type of plot is readily available in the NetDraw software.
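
An iterative layout of this kind can be sketched as follows. The Python code is a simplified force-directed layout in the style of Fruchterman and Reingold, offered as an illustration rather than as NetDraw's algorithm: all nodes repel each other, linked nodes are pulled together, and the maximum displacement per iteration shrinks over time. The constants, the fixed random seed and the function name are assumptions.

```python
import math
import random

def spring_layout(nodes, edges, iterations=200, seed=0):
    """Force-directed 2D layout: repulsion between all nodes, attraction along links."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))      # preferred link length
    temperature = 0.1                    # largest move allowed per iteration
    for _ in range(iterations):
        move = {n: [0.0, 0.0] for n in nodes}
        # Node repulsion keeps every pair of nodes apart.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                push = k * k / d
                move[u][0] += dx / d * push
                move[u][1] += dy / d * push
                move[v][0] -= dx / d * push
                move[v][1] -= dy / d * push
        # Links act as springs pulling their two end nodes together.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            pull = d * d / k
            move[u][0] -= dx / d * pull
            move[u][1] -= dy / d * pull
            move[v][0] += dx / d * pull
            move[v][1] += dy / d * pull
        # Apply the moves, capped by the current temperature.
        for n in nodes:
            mx, my = move[n]
            length = math.hypot(mx, my) or 1e-9
            step = min(length, temperature)
            pos[n][0] += mx / length * step
            pos[n][1] += my / length * step
        temperature *= 0.99
    return pos

# Hypothetical network: triangle A-B-C with a pendant D attached to C.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
layout = spring_layout(nodes, edges)
```

Fixing the seed makes the sketch repeatable; with a different seed the overall shape would be similar but the exact coordinates would differ, as noted above.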

2) Results
• Layout using Spring Embedding iterative simulation
• Number of iterations: 100 to 1 billion

Since this is an iterative analysis, increasing the number of iterations should improve the
“accuracy” of the results. In fact, repeating the analysis from 100 iterations in regular steps to
1,000,000,000 iterations showed no significant difference in the layout. The resulting graph is shown in
Figure 6.

With each node representing an individual researcher, individuals which are located more at the center
of the diagram act as bridges between various groups of researchers. It is also possible to easily see
clusters of nodes which are well interlinked together. Cliques are clearly visible, and clusters including
thicker link width indicate more extensive collaboration between a number of authors.


Attempting to export the .kin data to a MAGE 3D plot yielded results which did not appear as
conclusive as the MDS analysis due to the high cluster concentration of nodes – the export process
shortened the link length to such an extent that the viewing of the cluster was affected.

Figure 6: Graph of Spring Embedding using Geodesic Distances, Node Repulsion and Equal Edge Lengths
(64 authors having published 20+ RFCs). 100 Million iterations.


3) Scaling up
Scaling up and running the simulation under the constraint that authors have published 10 or more RFCs
brings the total number of nodes to 171. Unfortunately, the clutter caused by more nodes makes the
resulting graph less useful than for a more restricted input set. It is possible to discern the largest
nodes, but smaller nodes are difficult to see (Figure 7).

Figure 7: Graph of Spring Embedding using Geodesic Distances, Node Repulsion and Equal Edge Lengths
(171 authors having published 10+ RFCs)

The new authors join the whole network with several pendants, very few isolates and only one dyad.
With all 3000+ authors included, the network becomes difficult to interpret due to lack of space.

4) Conclusions
The advantage of Geodesic Distance analysis using node repulsion is that of providing results which
are easily displayed in two dimensions. Since the analysis is based on an iterative process, the
computing power required for such an analysis can be user-selected. Lower iteration values yield results
which are slightly more unstable in geometric placement of the nodes. Cliques, clusters, isolates and
other features of the network can be clearly identified and reliable conclusions can be derived about the
centrality of an individual thanks to his or her final geometric location within the resulting “network of
people”.


D. K-Core Analysis
1) Method/Theory
K-Core analysis is based on the clustering of groups of people who are closely connected together. It
is a way to study the nested structure of a modular organization. The K-Core of a network is the
maximal sub-network in which every node has degree at least k. For example, the 1-core is simply the
original network without its isolates; the 2-core additionally has all the pendants removed, and so on.
Increasing k removes links and nodes which are less closely connected to the network.
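
The K-Core can be obtained by repeatedly peeling away nodes with too few links. The following Python sketch illustrates that standard peeling procedure on a hypothetical adjacency dictionary; it is not taken from NetDraw.

```python
def k_core(adj, k):
    """Return the nodes of the k-core: repeatedly drop nodes with fewer than k links."""
    core = {node: set(neighbors) for node, neighbors in adj.items()}
    changed = True
    while changed:
        changed = False
        for node in [n for n, neighbors in core.items() if len(neighbors) < k]:
            # Removing a node also removes its links, which may expose new weak nodes.
            for other in core.pop(node):
                if other in core:
                    core[other].discard(node)
            changed = True
    return set(core)

# Hypothetical network: triangle A-B-C, pendant D on A, isolate E.
adj = {
    "A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"A"}, "E": set(),
}
core_1 = k_core(adj, 1)   # the isolate is dropped
core_2 = k_core(adj, 2)   # the pendant is dropped as well
```

The peeling must be repeated because removing a pendant can leave its former neighbor with fewer than k links.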

2) Results

Four distinct groups of people are established: three main groups, and one group of authors that
published solo.

It can be seen that clustering is caused by “similarity” data. As expected from the algorithm, the
defining factor for the clustering is the number of links originating at each node. This in itself is a
limitation. When performing K-Core analysis, the resulting groups show inconsistencies.

Pendants and nodes connected to 2 groups with a single link to those groups, or to 2 nodes in the same
group, are defined as a separate group. This, of course, is a correct representation of K-Cores, but of no
use for our purpose of organizing the groups visually.
Manual translation of these nodes into the correct groups was therefore required and the resulting
graph is shown in Figure 8.
Dyads do not fare well either in K-Cores since they are not connected to the main group. Pendants
also need to be translated since they are not seen by the software as having integrated well with any of
the cliques present, although in real life, a pendant would probably benefit well from the clique through
the node to which it was linked.

3) Scaling up

Scaling up, running the simulation under the constraint that authors have published 10 or more RFCs
brings the total number of nodes to 171. Whilst the overall graph including all values of k is too
crowded, it is possible to run a different type of K-Core analysis by selecting only groups with a
specific value of k. This selects the nodes having a specific closeness or better.


Figure 8: K-Core Analysis of 64 authors having published 20+ RFCs

Since the network is divided into six groups (Group 1, Group 2, Group 3, Group 4, Pendants,
Isolates), the value of k can be selected to be any number from 1 to 6. k=0 selects the isolates; k=1, the
pendants; k=2, the nodes having 2 links; and so on.
Selecting nodes with k=6 and plotting them using Spring Embedding with Geodesic Distances, Node
Repulsion and Equal Edge Lengths, it is possible to display the most tightly connected nodes in the
network. These 15 nodes might not be the most central, but form the highest clique in the overall graph.
This is shown in Figure 9.
In another run, a value of k=5 was selected, thus incorporating more nodes in the graph, as shown in
Figure 10. The network obtained is the core network to which most other nodes link.
In the real world, and in non-technical language, the 64 authors shown in this graph are the “pillars of
the community” in that they have published in excess of 10 RFCs and have also networked extensively with
their peers. Some authors might have published more RFCs than them, but their network might not have
been as wide-ranging.
4) Conclusions
Whilst K-Core analysis might at first appear not to yield meaningful results, this is countered by
its usefulness in finding the nodes with the highest closeness within our target group. By performing
K-Core analysis and displaying the results grouped according to the K-Core criteria, it is possible to
see how many nodes of each type are present in the network. Displaying the results using the Spring
Embedding GeoDesic Layout shows who are the most socially connected authors in our network.
Mixing and matching parameters (constraints on the number of RFCs published, k value, display and
grouping methods) can reveal more interesting facts about the social network than first meet the eye.


Figure 9: K-Core Analysis K=6 of authors having published 10+ RFCs / Spring Embedding GeoDesic Layout

Figure 10: K-Core Analysis K=5 of authors having published 10+ RFCs / Spring Embedding GeoDesic Layout


E. Blocks & Cutpoints
In this analysis, the software checks for nodes that will specifically cut parts of the overall network off
if they were to be removed from the structure.
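
The test itself is simple to state: a node is a cutpoint if removing it splits the network into more separate pieces than before. The Python sketch below applies that test by brute force to a hypothetical network; dedicated software would normally use a faster algorithm.

```python
def count_components(adj, removed=None):
    """Count connected pieces of the network, optionally ignoring one node."""
    seen = {removed} if removed is not None else set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return components

def cutpoints(adj):
    """Nodes whose removal increases the number of connected pieces."""
    base = count_components(adj)
    return sorted(n for n in adj if count_components(adj, removed=n) > base)

# Hypothetical network: triangles A-B-C and C-D-E sharing node C, pendant F on A.
adj = {
    "A": {"B", "C", "F"}, "B": {"A", "C"}, "C": {"A", "B", "D", "E"},
    "D": {"C", "E"}, "E": {"C", "D"}, "F": {"A"},
}
found = cutpoints(adj)
```

In this example the cutpoints are C, which joins the two triangles, and A, the node to which the pendant F is attached; the pendant itself is not a cutpoint.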

1) Method/Theory and results
The parameters used in our analysis were as follows:

Results using this method are not useful in our case: the subset of authors selected has worked
extensively together since it is really composed of the core of our network of 3 480 authors. As a result,
the overall network of nodes features enough redundancy that there is no single “point of failure”, i.e.
“cutpoint”, except for pendants. Since this can be established visually, there is no requirement to run
the analysis and plot results.
However, this type of analysis would be useful in more loosely-connected communities because it
tags the nodes which are essential in linking disparate clusters which would otherwise have been
unconnected.

2) Scaling up
As the constraints on the RFC authors are eased by allowing authors having published fewer than 20
RFCs into the network, it is possible to discover where the cutpoints to these other authors are. This
could determine which of the core authors bring connectivity between the core network of authors and the
rest of the RFC community. However, in the case of RFCs, the network is too closely connected to be
affected by cutpoints.

3) Conclusions
“Blocks and cutpoint” analysis is useful in examining loosely-connected networks.

This type of analysis yields ambiguous results when used on closely-connected networks such as the
network of RFC authors since the only critically connected components of the network are pendants,
and those are easily detected by eye.

It is worth noting that this type of analysis can be combined with any of the above analyses since the
tagging of blocks and cutpoints can be undertaken by changing node colors and shapes. Sometimes, a
new network layout can enhance readability whilst keeping block and cutpoint tagging active.


F. Factions
A “faction” is a group or clique within a larger organization, or the like. In graph theory, a "faction" is
a part of a graph in which the nodes are more tightly connected to one another than they are to members
of other “factions”.

The NetDraw program can iteratively determine the most appropriate division of the network using a
“factioning” algorithm. It is worth comparing this analysis with K-Core data which is based on similar
principles of local clustering or sub-structure.
1) Method/Theory
The algorithm is different from the K-Core algorithm in that NetDraw actually asks how many
factions should be created. The algorithm then forms the number of groups desired by seeking to
maximize connections within, and minimize connections between, the groups. Nodes are colored, and
the information about which nodes fall in which partitions (i.e. which cases are in which factions) is
saved to the node attributes database.
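
The objective described above can be sketched in Python. The sketch below is a brute-force illustration only: it tries every possible assignment, which is feasible for a handful of nodes but not for the networks studied here, and it is not the iterative search used by NetDraw. The cost measure (missing ties inside factions plus ties crossing factions) is an assumption about how "connection within" and "connection between" might be scored; the function names are hypothetical.

```python
from itertools import combinations, product


def faction_cost(assignment, nodes, ties):
    """Count missing ties inside factions plus ties that cross factions."""
    cost = 0
    for a, b in combinations(nodes, 2):
        same_faction = assignment[a] == assignment[b]
        tied = frozenset((a, b)) in ties
        # A perfect division has ties exactly where nodes share a faction
        if same_faction != tied:
            cost += 1
    return cost


def best_factions(nodes, edges, n_factions):
    """Exhaustively search every assignment of nodes to n_factions groups.

    Returns (assignment, cost). The first assignment reaching the lowest
    cost is kept, so faction labels depend on enumeration order.
    """
    ties = {frozenset(edge) for edge in edges}
    best, best_cost = None, None
    for labels in product(range(n_factions), repeat=len(nodes)):
        assignment = dict(zip(nodes, labels))
        cost = faction_cost(assignment, nodes, ties)
        if best_cost is None or cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost
```

For two triangles joined by a single link and two factions requested, the search places each triangle in its own faction with a cost of 1, the one remaining mismatch being the link between the triangles.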
2) Results

In our example, expanding the K-Core analysis described earlier, it was assumed that we could
initially divide the network into 5 factions. This is shown in Figure 11. It is then possible to explore
further faction division by increasing the parameter for the number of clusters required. This
sometimes yields peculiar layouts, as shown in Figure 12.

Figure 11: 5 factions of authors having published 20+ RFCs / Layout, node color & shape, according to factions


Figure 12: 10 factions of authors having published 20+ RFCs / Layout, node color & shape, according to analysis
when dividing into 5 factions. Note which factions have been divided – hence which are the weaker factions

There appears to be no single correct or incorrect “answer” using the faction algorithm. There is just a
measure of the faithfulness of a node to a cluster depending on its connection to one, two, or more
groups.

Figure 13: 5 factions of authors having published 20+ RFCs / Layout, node color & shape, according to analysis
when dividing into 10 factions. Note which nodes have been grouped.


It is therefore possible to determine which factions are more strongly linked together and which are
likely to break apart when circumstances change. It is also possible to see which new factions are likely
to be created.
The algorithm can be used in the other direction. For example, it is possible to start with a larger
number of factions, and reduce the number of factions, with groups merging together. It is interesting to
see how there is no homogeneous gathering of all nodes when the number of factions is reduced. An
example is shown in Figure 13.
Another oddity is the grouping of nodes which are not connected to each other. Rather than
grouping them because of their inter-connectivity, the algorithm groups them because they do not fit in any
other faction.

The results can be displayed not only as a layout rendered by this algorithm, but also as another
layout, such as K-Cores, Circles, etc. This introduces interesting differences since some nodes which
might be part of one cluster during faction analysis, might be part of another group during K-Core
analysis grouped layout.

3) Scaling up
Scaling up, running the simulation under the constraint that authors have published 10 or more RFCs, brings the
total number of authors to 171. Reading individual node labels is impossible at this density. However, it
is possible to remove labels and perform macro-analysis.
For example, it is possible to divide the network into 10 factions and assign a color and shape to each
faction, then group the factions together by reducing the network to 5 factions. Some factions do not
wholly group with a single other faction but sometimes distribute their nodes among the other factions
according to the affinity each node had with other nodes in other factions. This makes for interesting
analysis in real world social grouping, for example in electoral processes.


This is shown in Figure 14.

Figure 14: 10 factions above are grouped into 5 factions below. It is possible to see how some clusters splintered
among several factions. Network subset: authors having published 20+ RFCs

4) Conclusions
The “Factions” feature in NetDraw is useful to group authors into clusters and detect those authors
having an affinity to another group when dividing the network into a different number of factions. This
type of analysis is sometimes more conclusive when a network is more loosely interconnected than in
our example making use of a restricted number of authors which are very closely related to each other.
This analysis is also useful when grouping clusters according to affinity. Our example shows that the
grouping of clusters is not one that takes place wholly and evenly since some factions divide themselves
among the remaining clusters.
As with any social network analysis, care must be taken not to jump to conclusions from first
examination because oddities might appear in the clustering process. These are caused by a lack of fit
within any other group, rather than by similarity or good connectivity within the group itself.


G. Girvan-Newman algorithm

The Girvan-Newman algorithm is one of the methods used to detect communities in complex systems.
In fact, the theory developed by Girvan and Newman [13] defined communities as not being quite the
same thing as clusters.

1) Method/Theory
A “community” is a cluster of nodes where the inter-relationship between nodes is high through a
high concentration of links. A clique would fit this description but a community is not restricted to a
clique. What distinguishes a community from a cluster is that the links to nodes in other communities are
specifically less dense, whilst clusters do not take this into account.

Without going into details about this algorithm, its basic function is as follows:

1. Calculate the betweenness of all existing links in the network;
2. Remove links with the highest betweenness;
3. Recalculate betweenness of all links affected by the removal;
4. Repeat steps 2 and 3 until no links remain.
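
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the code used by NetDraw: link betweenness is computed with a Brandes-style breadth-first accumulation, all betweenness values are recomputed after each removal (simpler than updating only the affected links), and the loop stops once a requested number of communities is reached instead of running until no links remain. All function names are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def edge_betweenness(graph):
    """Number of shortest paths running over each link (unweighted graph)."""
    score = {frozenset((n, o)): 0.0 for n in graph for o in graph[n]}
    for source in graph:
        dist, paths = {source: 0}, {source: 1}
        preds = {node: [] for node in graph}
        order, queue = [], deque([source])
        while queue:                      # breadth-first search from source
            node = queue.popleft()
            order.append(node)
            for other in graph[node]:
                if other not in dist:
                    dist[other] = dist[node] + 1
                    paths[other] = 0
                    queue.append(other)
                if dist[other] == dist[node] + 1:
                    paths[other] += paths[node]
                    preds[other].append(node)
        credit = {node: 0.0 for node in order}
        for node in reversed(order):      # farthest nodes first
            for pred in preds[node]:
                share = paths[pred] / paths[node] * (1.0 + credit[node])
                score[frozenset((pred, node))] += share
                credit[pred] += share
    # Every pair of nodes was counted once from each end
    return {edge: value / 2.0 for edge, value in score.items()}


def components(graph):
    """Connected components, each returned as a set of nodes."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for other in graph[node] - seen:
                seen.add(other)
                queue.append(other)
        groups.append(group)
    return groups


def girvan_newman(edges, n_communities):
    """Remove highest-betweenness links until at least n_communities exist."""
    graph = build_graph(edges)
    groups = components(graph)
    while len(groups) < n_communities and any(graph.values()):
        scores = edge_betweenness(graph)            # steps 1 and 3
        highest = max(scores.values())
        for edge, value in scores.items():          # step 2
            if highest - value < 1e-9:
                a, b = tuple(edge)
                graph[a].discard(b)
                graph[b].discard(a)
        groups = components(graph)
    return groups
```

Because all links tied for the highest betweenness are removed together, the function may return more communities than requested. On two triangles joined by one link, the joining link carries all nine shortest paths between the triangles and is removed first, leaving the two triangles as communities.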

2) Results

The analyst can choose how many communities to create from the network. Running the analysis
produces node data which will be saved with the rest of the data related to each node. Displaying the
results is possible by selecting the “Group by attribute” layout.

It is therefore possible to reach a large number of results, depending on the number of communities
chosen. With other methods, the meaning of the resulting data is left to the analyst’s eye. Selecting too
few communities will cluster nodes which are too loosely connected together. Too many communities
will explode more tightly knit communities but show the cliques within the communities with greater
detail.

However, the Girvan-Newman algorithm introduces the variable Modularity Q. The algorithm
calculates the Modularity of each grouping, and Q is an indicator of the quality of clustering.
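
For readers unfamiliar with Q, the sketch below computes modularity in its commonly published form: for each community, the fraction of all links that fall inside it, minus the square of the fraction of link ends attached to it, summed over communities. Whether NetDraw uses exactly this formulation is an assumption; the function name and input format are hypothetical.

```python
def modularity(edges, communities):
    """Modularity Q of a division of an undirected, unweighted network.

    `edges` is a list of (node, node) pairs and `communities` a list of
    node sets covering every node exactly once.
    """
    membership = {}
    for index, group in enumerate(communities):
        for node in group:
            membership[node] = index

    total_links = len(edges)
    inside = [0] * len(communities)   # links with both ends in the community
    ends = [0] * len(communities)     # link ends attached to the community
    for a, b in edges:
        ends[membership[a]] += 1
        ends[membership[b]] += 1
        if membership[a] == membership[b]:
            inside[membership[a]] += 1

    return sum(
        inside[c] / total_links - (ends[c] / (2 * total_links)) ** 2
        for c in range(len(communities))
    )
```

Placing every node in a single community gives Q = 0, so a positive Q indicates more links inside communities than the sizes of the communities alone would suggest. Two triangles joined by one link and divided into the two triangles give Q = 5/14, or about 0.357.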

Running the calculation for between 2 and 15 clusters in our target network gave the following
results:

Clusters   2      3      4      6      7      8      9      13     14     15
Q          0.013  0.294  0.460  0.500  0.493  0.487  0.482  0.463  0.458  0.442

Q is maximized when dividing the network into 6 clusters. This result therefore appears to be the most
befitting group structure in our network, and this is shown in Figure 15.

Figure 15: Girvan-Newman algorithm clustering for 6 clusters (Modularity Q=0.500)
For comparison, it is then possible to mix several analyses on one diagram. For example, the
above diagram layout can be kept while node attributes are modified according to other parameters such
as K-Core analysis. Performing such a plot, it is possible to find the degree of connectivity of nodes
within each community. This is shown in Figure 16.

In the diagram, the nodes with the highest K-Core value are shown as upward-pointing red triangles, the
next as downward-pointing blue triangles, then yellow circles in squares, etc. This provides information about
the key connecting nodes, both within and between communities.

Figure 16: Girvan-Newman algorithm clustering for 6 clusters (Modularity Q=0.500) & K-Core Analysis


3) Scaling up
Growing the sample size by loosening the constraint to 10+ publications per author, it is possible to
analyze 171 authors.
Running the data through the Girvan-Newman algorithm produced values for the Modularity factor Q
that differ from those of the smaller data set.

Clusters   7      8      9      10     11     12     13     14     15     16
Q          0.091  0.456  0.446  0.453  0.451  0.455  0.453  0.438  0.438  0.437

In this case, no type of clustering shows a dominant Q modularity. The network could be divided into 8
to 13 communities with similar Q modularity, thus demonstrating a very similar quality of clustering.
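
The comparison of the two sets of Q values can be made explicit with a short Python sketch. The Q values are those reported in this section; the tolerance of 0.011 used to decide which divisions count as "similar" to the best one is an arbitrary assumption, as are the function names.

```python
# Q values reported for the 20+ RFC subset and the 171-author (10+ RFC) subset
core_q = {2: 0.013, 3: 0.294, 4: 0.460, 6: 0.500, 7: 0.493,
          8: 0.487, 9: 0.482, 13: 0.463, 14: 0.458, 15: 0.442}
extended_q = {7: 0.091, 8: 0.456, 9: 0.446, 10: 0.453, 11: 0.451,
              12: 0.455, 13: 0.453, 14: 0.438, 15: 0.438, 16: 0.437}


def best_partition(q_by_clusters):
    """Number of clusters giving the highest Modularity Q."""
    return max(q_by_clusters, key=q_by_clusters.get)


def plateau(q_by_clusters, tolerance):
    """Cluster counts whose Q lies within `tolerance` of the maximum."""
    top = max(q_by_clusters.values())
    return sorted(k for k, q in q_by_clusters.items() if top - q <= tolerance)


print(best_partition(core_q), plateau(core_q, 0.011))
print(best_partition(extended_q), plateau(extended_q, 0.011))
```

With this tolerance, the larger data set yields a plateau covering 8 to 13 clusters, whereas the smaller data set narrows to 6 and 7 clusters around its maximum at 6.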

It is therefore apparent that the Girvan-Newman algorithm does not scale well with our network since
the communities are too tightly knit together – a testimony to the “community feeling” of RFC authors.
4) Conclusions
When analyzing the core network of RFC authors, the Girvan-Newman algorithm produces graphical
results which give the impression of being similar to other methods. It is useful to find those clusters of
nodes with highest betweenness, even when zooming onto communities which might have initially
appeared to be tightly knit. The Modularity factor Q is calculated by the NetDraw software using the
algorithm, as a measure of the quality of clustering, and this allows us to find the most natural type of
clustering for the data.
This algorithm is consequently very efficient at detecting communities and the most likely grouping of
those communities, even when the initial data set is as restricted as the RFC authors list. It yields more
accurate results when used with smaller social networks.


H. Hiclus of Geo-distances
Hiclus stands for Hierarchical Clustering, here applied to geodesic distances. This algorithm was
developed by Johnson [14] and is used by NetDraw to generate n clustering possibilities, where
n ranges from 2 clusters to the total number of nodes analyzed.
1) Method/Theory
The Hiclus of geodesic distance is a measure of cohesion in subgroups within the network calculated
by algorithms defined as follows:

With N nodes that need to be clustered and an N x N distance (or similarity) matrix:

1. Assign each node to its own cluster, with the distances (similarities) between clusters defined
as the distances (similarities) between the items they contain
2. Find the most similar pair of clusters and merge them into a single cluster
3. Compute distances (similarities) between the new cluster and each of the old clusters
4. Repeat steps 2 and 3 until all nodes are clustered into a single cluster of size N.

The geodesic distance in this context makes the assumption that the graph is a three-dimensional
object and that the number of links on the shortest path between two nodes is the distance between them.
For example, adjacent nodes have a distance of one. Going from one node to another by stepping through a
third node gives a distance of two, etc.
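
The clustering steps and the geodesic distance described above can be combined in a minimal Python sketch. It assumes a connected, undirected network held as a dictionary of neighbour sets, and it stops at a requested number of clusters rather than merging down to a single cluster. The `linkage` parameter (complete-link with `max`, single-link with `min`) and the function names are assumptions of this sketch; the variant used by NetDraw is not stated here.

```python
from collections import deque


def geodesic_distances(graph):
    """Shortest path length, in links, between every pair of nodes."""
    distances = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for other in graph[node]:
                if other not in dist:
                    dist[other] = dist[node] + 1
                    queue.append(other)
        distances[source] = dist
    return distances


def hiclus(graph, n_clusters, linkage=max):
    """Merge the closest clusters until only n_clusters remain."""
    distances = geodesic_distances(graph)
    # Step 1: every node starts in its own cluster
    clusters = [frozenset([node]) for node in graph]

    def cluster_distance(first, second):
        return linkage(distances[a][b] for a in first for b in second)

    while len(clusters) > n_clusters:
        # Step 2: find the closest pair (the first such pair wins a tie)
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] | clusters[j]
        # Step 3: distances to the merged cluster are recomputed on demand
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Geodesic distances are small integers, so ties between candidate merges are frequent and the outcome can depend on the order in which nodes are listed. This is one reason why results at a given number of clusters should be read with care.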
2) Results
A large set of results is calculated by the program and is saved as new attributes for each node. This
can therefore be plotted using “group by attribute”. The Hiclus of 5 clusters is shown in Figure 17.

Figure 17: Hiclus of geodesic distance selecting 5 clusters


Figure 18: Hiclus of geodesic distance selecting 7 clusters

The Hiclus of 7 clusters (Figure 18) appears more meaningful since enough groups are formed which
show real clustering. As the number of groups is increased (8, 9, 10, etc.), groups can be seen
splitting. The Hiclus of 15 clusters is shown in Figure 19.

Figure 19: Hiclus of geodesic distance selecting 15 clusters – groups are splitting up into individuals


Analysis of these graphs makes it possible to find out which nodes are more likely to break off from a
cluster, and in which order. Since NetDraw generates node data until Hiclus N, where N is the number
of nodes in the network, it is possible to gauge the order in which groups will split up.
For example, taking the graph of Figure 19 and redefining node colors and shapes according to the
Hiclus of Geodesic distance with 5 clusters, it is possible to see how the 5 original clusters split up into
15 clusters, some clusters being single pendants or dyads. This is shown in Figure 20.

Figure 20: Hiclus of geodesic distance selecting 15 clusters compared with node attributes for 5 clusters

3) Scaling up
Since the algorithm as implemented in NetDraw involves clustering from 2 to the total number of
nodes in the graph, this type of analysis does not scale well except if using powerful processing
resources. An analysis of 64 authors (initial subset) yielded the above results. Increasing the sample size
to 171 authors served only to crowd the graph to the point of making it less legible.
If all constraints are removed, more than 3 000 authors have to be processed and this has been found
to generate superfluous results. This type of analysis is therefore better used for smaller subsets of
nodes.
4) Conclusions
The Hiclus of Geodesic distance analysis yields results where a division of the graph is undertaken
from 2 to N clusters, where N is the total number of nodes in the graph. Successive plots, for example
Hiclus of 5 and Hiclus of 15, are possible, and if the node shape is defined according to its clustering in
Hiclus 5, it is possible to see the clustering in Hiclus 15 and the varied make-up of the resulting clusters.
Whilst pendants will be the first to break from a cluster, cliques are likely to be the last clusters to
divide themselves. This can be seen clearly by comparing the graphs for Hiclus 5 and Hiclus 15. This is
essentially a very useful method to gauge the stability of a group of people.


I. Ego Networks
Each of the analyses presented thus far assumes that all nodes are active
throughout the period of activity from which the data was mined. In the case of RFCs, this was
unfortunately sometimes not the case. For example, Jon Postel passed away in 1998 and this left a huge
gap in the RFC space, not only because of his hierarchic position in the social network but also because
he was such a pleasant and hardworking individual. This kind of influence could however not be
measured mathematically.
If one resorts to strictly looking at relationships as defined from data mining, a mathematical measure
of an individual’s influence in a network can be calculated in NetDraw. Social network researchers
call this approach “Ego Networks”.
1) Method/Theory
The Ego network of a node with geodesic distance 1 consists of all nodes immediately linked to that
node. When the geodesic distance is increased to 2, nodes connected to those nodes are included in the
graphic, and so forth for higher Geodesic distances.
NetDraw allows the user to select more than one node’s ego network to find out the geodesic relation
between them, depending on each individual ego network’s reach.
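
The definition above translates directly into a bounded breadth-first search. The Python sketch below is illustrative only: the network is assumed to be held as a dictionary of neighbour sets, and the function names are hypothetical rather than NetDraw features.

```python
from collections import deque


def ego_network(graph, ego, max_distance=1):
    """Return {node: geodesic distance} for nodes within max_distance of ego."""
    reach = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if reach[node] == max_distance:
            continue   # do not expand beyond the requested distance
        for other in graph[node]:
            if other not in reach:
                reach[other] = reach[node] + 1
                queue.append(other)
    return reach


def combined_ego_networks(graph, egos, max_distance=1):
    """Union of several ego networks, and the nodes common to all of them."""
    networks = [set(ego_network(graph, ego, max_distance)) for ego in egos]
    return set.union(*networks), set.intersection(*networks)
```

The set of common nodes returned by `combined_ego_networks` identifies the intermediaries through which two egos are related when they are not directly linked.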
2) Results
• Complete Data set of 3 480 authors and 17 000+ links

In order to illustrate the concept of Ego networks, we have plotted the ego network of a well-known
RFC author, Dr. Vinton Cerf, as shown in Figure 21.

Figure 21: Ego Network of V. Cerf. (geodesic distance = 1)


Input file discrepancies (Metcalfe_R. and Metcalfe_B) are treated in Section V.A.1.a.

This diagram, just like every other result obtained using NetDraw, can be exported to MAGE and
rotated, zoomed-in and otherwise manipulated in 3D. An example screenshot is shown in Figure 22.

Figure 22: Ego Network of V. Cerf. (geodesic distance = 1) as seen in MAGE 3-D

Another use of the Ego network analysis as applied in NetDraw is the analysis of connection paths
between two nodes having a geodesic distance greater than 1.
It is possible to plot the Ego network for another author, for example Randy Bush, and relate it to Dr.
Cerf’s Ego network, whilst keeping a maximum geodesic distance of 1. This is shown in Figure 23,
overleaf.

NetDraw allows for any combination of geodesic distance & simultaneous node selection (or
de-selection, in order to note the “holes” in the network), and further analysis is possible on this
sub-network alone, through K-Core, Girvan-Newman, or indeed any other analysis as described above. This
makes for a very extensive combination of analyses and the possible generation of interesting social
patterns within the network.

3) Scaling up
In NetDraw, it is possible to select a geodesic distance of 2 (or more), in order to find out the nodes
connected to the nodes connected to Dr. Cerf’s node – the 2nd degree of separation. Since the RFC
community is well networked, the resultant graph is much more crowded, as seen in Figure 24. The
analysis is therefore limited by readability of the resulting graphs.


Figure 23: Ego Network of V. Cerf. (geodesic distance = 1) relating to R. Bush’s network

Figure 24: Extended Ego Network of V. Cerf. (geodesic distance = 2)


4) Conclusions
The Ego Network analysis is very useful in determining the structure of nodes directly linked to a
node, and in turn, the structure of nodes connected to those nodes. It is a useful tool to determine the
extent of a node’s social networking reach as well as the social structure between two or more nodes.
When used on a data set consisting of a group of people in an organization, it is therefore possible to
evaluate an individual’s social influence and immediate surroundings.


J. Geometrical Analysis using yEd
yEd [15] is a free Java-based graph editor which can be used to generate drawings and to apply
automatic layouts to the graphs comparable to those generated by NetDraw. The strength of yEd lies in
its ability to re-map complex graph structures into entirely new layouts which might bring more sense to
the input data and help detect hierarchies or pseudo-hierarchies within a social network.

Other layout algorithms make use of geometry to produce Orthogonal or Organic layouts, Tree and
Circular layouts including multi-radial and plain disc layout which can detail interconnected rings and
star topologies.

NetDraw was used earlier to provide a number of layouts, but yEd’s algorithms are more powerful in
re-routing edges (links) to provide a cleaner layout topology, especially when using edge routing, an
option which makes edges align with each other.
It is important to note that NetDraw and yEd have entirely different purposes. NetDraw is used to
analyze a network to detect clusters, ego networks etc. yEd is a graph editor used solely to display a
network in a variety of topologies. Indeed, most users utilize yEd solely to produce clearer graphs for
knowledge representation, software engineering, database schematics, process and workflow illustration
and family trees.

1) Method/Theory
yEd accepts several input file data formats including GraphML, YGF, GML (a popular text-based
format), TGF and XML formats. Unfortunately, none of these formats is compatible with any of the
formats in which NetDraw data can be exported.
The yEd graph therefore had to be built using the integrated graphic editor by a click, drag and drop
process to create nodes and link them together. The input data was manually read from the .vna file
generated by the NetDraw software when saving the NetDraw graph.
As a result, 64 nodes and several hundred links were created manually using point and click. Each
node was also labeled accordingly.
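
As an alternative to point and click, a text input file can be written programmatically. The Python sketch below serialises a list of co-authorship pairs into TGF (Trivial Graph Format), one of the formats listed above: one line per node giving an identifier and a label, a line containing only "#", then one line per link giving the two identifiers and an optional label. It assumes that the pairs have already been extracted from the NetDraw data into a Python list; the author labels and the function name are hypothetical.

```python
def to_tgf(coauthorships):
    """Serialise (author, author, shared RFC count) triples as TGF text."""
    names = sorted({name for a, b, _ in coauthorships for name in (a, b)})
    ids = {name: index for index, name in enumerate(names, start=1)}

    lines = [f"{ids[name]} {name}" for name in names]   # node section
    lines.append("#")                                   # section separator
    for a, b, shared_rfcs in coauthorships:             # link section
        lines.append(f"{ids[a]} {ids[b]} {shared_rfcs}")
    return "\n".join(lines) + "\n"


# Hypothetical author labels, for illustration only
example = [("Author_A", "Author_B", 3), ("Author_B", "Author_C", 1)]
print(to_tgf(example), end="")
```

TGF carries labels only, so visual attributes such as link thickness would still have to be set in the editor after import.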

Rectangular boxes were chosen because they can contain an author’s full name, but it is
also possible to modify node attributes such as shapes, colors and sizes, whilst also modifying link
thickness, arrowheads, etc. In this respect, yEd has features similar to NetDraw. The only drawback is
that it is impossible to change node attributes automatically, although this can change under certain
conditions when performing specific types of layout analysis, according to the special demands of the
resulting graphic.

Arrowheads denote a link’s direction. All arrowheads were removed in order to clear up the clutter
generated by so many nodes in such a small topological space. It is important to note that even with
arrowheads removed, links keep a direction. Whilst some layout algorithms do not make use of this
information, others, such as the hierarchical layout algorithm, establish layer order by using the direction
of the links. This might be confusing when no arrowheads are present and might lead to erroneous
results.
Link thickness was defined for each link, according to the number of RFCs written together by a pair
of authors. 5 levels of arrow thickness were chosen manually.


2) Results
A large number of layout permutations is possible using yEd. Each type of layout allows for several
parameters to be modified, sometimes producing vastly different results.

a) Circle Layout
It is possible to select from layouts which appear similar to those obtained using NetDraw. One
such layout is the Circle layout where a plot similar to the one shown in Figure 1 can be created.
Nonetheless, yEd offers more layout options to plot the circle.

For example, the circle plot layout can be transformed into a disk, where some nodes appear in the
center of the circle, and others, namely cutpoints, appear outside the circle. Those cutpoints are defined
as the base for all pendants. Some manual housekeeping (shortening of some links, coloring of cutpoints
and pendants) results in the graph shown in Figure 25.

Figure 25: yEd plot of network (subset of authors having published 20+ RFCs) in disk layout.


b) Disk Layout with organic edges
Starting with the network shown in Figure 25, the links connecting the nodes can be altered into
organic links, whilst the layout of the nodes remains untouched. This algorithm routes the links so as to
ensure that they do not overlap nodes and keeps a specifiable minimal distance between them.
The algorithm is based on a force directed layout paradigm. Nodes act as repulsive forces on links in
order to guarantee a certain (user-defined) minimal distance between nodes and links. The links tend to
contract themselves. Using “simulated-annealing” leads to link layouts, which are calculated for each
link in turn. The resulting graph is visually attractive in that nodes are not overlapped by links, although
since some links overlap each other, it is sometimes difficult to follow their routing. The result is shown
in Figure 26.

Figure 26: yEd plot of network (subset of authors having published 20+ RFCs) in disk layout and organic links


c) Disk layout with orthogonal edges
Starting with the network shown in Figure 25, the links connecting the nodes can be altered into
orthogonal links, whilst the layout of the nodes remains untouched. This algorithm can route the links of
the network using only vertical and horizontal line segments, while keeping the positions of the nodes in
the network fixed. The routed links will usually not cross through any nodes and not overlap any other
links. The resulting network is shown in Figure 27.
yEd’s channel edge router provides a similar routing topology for the links, with a few minor
alterations.
It is interesting to note that yEd’s orthogonal edge router and orthogonal channel edge router
algorithms can be used on any type of network topology without displacing the initial node position. It is
therefore possible to “clean up” any type of network graph through a combination of node positioning
and link positioning.

Figure 27: yEd plot of network (subset of authors having published 20+ RFCs) in disk layout and orthogonal links


d) Organic Layout
Selection of the Organic Layout produces undirected graphs containing no overlap between nodes.
Processing the resulting graph through the edge router’s organic layout also makes sure that no overlap
occurs between links and nodes. The type of layout generated has essential similarities with the layout
obtained in NetDraw’s output of Spring Embedding using Geodesic Distances, Node Repulsion and
Equal Edge Lengths analysis. The organic layout box in yEd also allows for the defining of a preferred
link length. The resulting network can be seen in Figure 28. Whilst readability is improved over the
NetDraw output shown in Figure 6, clusters might be slightly less noticeable by eye because all nodes
are evenly spaced. yEd allows for the manual definition of clusters, whereby each cluster can be laid out
with a different algorithm. This is seen later.

Figure 28: yEd plot of network (subset of authors having published 20+ RFCs) in organic layout and organic links

e) Orthogonal Layout
This type of layout produces compact drawings with no node overlaps, few crossings, and few bends.
All links are routed in an orthogonal style: only vertical and horizontal line segments will be used. This
enhances readability of the resulting graph. As with every other type of layout in yEd, this option offers
a selection of preferences which radically modify the results, each with its own advantages.


One such option is the use of “Node Boxes”, where nodes are resized according to the number and
position of their neighbors to reduce the overall number of bends in the links. Readability of the graph is
improved and a by-product of this algorithm is that it tends to cluster more intensely tied nodes together.
An example of this stylish layout is shown in Figure 29. A tradeoff is that node size might be
misinterpreted as being linked to the importance of a node in the network whilst it is clearly not the
case.

Figure 29: yEd plot of network (subset of authors having published 20+ RFCs) using variable size Node Boxes and
orthogonal layout, with grid size 15.

Correlating results obtained using NetDraw with results obtained with yEd generates interesting
results. For instance, since the above diagram appears to show clear clustering of nodes which are
closely connected together, a clustering algorithm from NetDraw can be used and applied to the nodes
in Figure 29.
The results from the Girvan-Newman algorithm (Section G) generated a diagram shown in Figure 15,
with 6 clear clusters appearing to be the most optimal network clustering.
Applying these results to the nodes shown in Figure 29 and selecting the option of “face
maximization”, generates the graphic shown in Figure 30.


Figure 30: yEd plot of network (subset of authors having published 20+ RFCs) using variable size Node Boxes and
orthogonal layout, with grid size 15, cross-linked with Girvan-Newman Clustering algorithm data

The clusters are shown and appear to validate the data generated through the Girvan-Newman
algorithm. Further combinations of analysis and graphical display are possible, although not all
combinations bring further cognitive advantages to the analysis.
For example, other options using this layout also allow for mixing the orthogonal layout algorithm
with a tree sub-algorithm where larger sub-trees are processed using a special tree algorithm. Whilst
each of these algorithms generates good-looking graphs, no significant further insight into our
input RFC network is gained beyond that obtained by the means described earlier.

f) Hierarchical Layout
An important option in yEd is plotting the graph using hierarchical layout. This includes a set of
algorithms which can be permutated to generate a vast array of hierarchical graphs.
Establishment of a hierarchy of nodes necessitates the use of link direction. However, the network of
RFC authors involves two-way collaboration between common authors of an RFC, with no explicit
hierarchy or precedence of one author over another. Drawing the graph with reciprocal arrows vitiates
the hierarchical layout analysis: the algorithm still tries to establish a clear hierarchy, and reciprocal
arrows are not treated as a single double-ended arrow, but rather as a cyclic process. Removing arrowheads does not
remove link direction. Such layout is therefore flawed since the anticipated result, that which determines
the most significant nodes in the network (for example, nodes denoting highest RFC authorship or
highest betweenness), is not the result attained.
When the “top to bottom” option is selected, nodes are placed in hierarchically arranged layers and
this gives an illusion of hierarchy when there actually is none. Nevertheless, the aesthetic outlook of
the resulting layout is helpful in providing a clear view of the network, especially when selecting
orthogonal edge routing. Whilst at first glance, clustering of cliques seems apparent, this is actually not
the case. Many long links exist, linking distant nodes. By selecting hierarchical optimal ranking, layer
assignment is done in such a way that the overall sum of the layer distances of all edges in the layout is
minimal. The resulting graph is shown in Figure 31.

Figure 31: Hierarchical Layout of network (subset of authors having published 20+ RFCs) with orthogonal links.
Options are hierarchical optimal ranking,


Surprising, visually creative but analytically ineffective results are obtained with the polyline option
which creates a vast number of bends and parallel links.
Various other options are available to concentrate or disperse, increase or decrease the number of
links, layout and direction of the network. Each permutation of options produces graphs which are
equally visually attractive, but with no analytical value.
As a result, the hierarchical layout might not be most suited to display the results obtained in our
analysis – although the minimum bends in the links make the network very readable.

g) Using groups with a mix of layouts
Clusters found using NetDraw can be defined as a group in yEd, and a mix of layouts applied for the
groups themselves and the links between the groups. This gives rise to nested graphs. For example, it is
possible to define 6 node clusters from the results of the Girvan-Newman algorithm. Nodes in each
cluster can be grouped, and groups laid out independently from the rest of the network. Many
combinations of types of layout can be used to reach various results.
Cross-group links could be routed orthogonally, organically, randomly, or could be removed
altogether. In this case, the resulting diagram of node organization within each group is shown in Figure
32.

Figure 32: Grouping of nodes (subset of authors having published 20+ RFCs) cross-linked with NetDraw’s Girvan-
Newman Clustering algorithm data and layout using disk algorithm, with removal of inter-group links


3) Scaling Up

The sub-networks plotted in Figures 25-32 reach a nodal upper limit for effective micro-analysis due
to the restricted space available on a single A4 page. Scaling the network size up by removing
constraints on the input data set is possible but is bound by two principal limits:
- all data from NetDraw need to be input manually either using point & click, or writing a
data file in text format; RFC authors are so closely linked together that this introduces an
exponential number of links as soon as further nodes are added;
- as more nodes are displayed on screen, clutter takes over. Node size might need to be
reduced, the diagram zoomed-out, and labels therefore rendered unreadable.

Micro-analysis transforms itself into macro-analysis. On a network as cross-linked as the RFC authors,
macro-analysis of data using yEd does not yield additional key results. However, networks containing
more defined sub-groups and less cross-group connectivity will likely yield satisfactory results with
macro-analysis.

4) Conclusions
yEd is a powerful piece of software which can be used to generate new network topologies in a social
network. Whilst many of its features are similar to the features provided by NetDraw, yEd is different in
that it is a graph editor, whilst NetDraw performs network analysis.

Its weaknesses:
- None of its input file formats matches the formats exported by NetDraw. It
is therefore difficult to share data between the two types of software
- Except in specific cases, nodes do not support automatic attributes which could have
been generated by analysis – attributes need to be set up manually

Its strengths:
- Nodes can be replaced by user-defined icons, thus giving rise to the possibility of very
impressive visual styles
- Edge routing generates particularly clean graphical results
- Hierarchical, organic and orthogonal layouts are not offered in NetDraw. The results
attained using yEd are therefore complementary to those reached using other software
- Tree layouts, as well as star and spoke layouts, generate very clear results which might pinpoint
more “hidden” information within a data set or social network
- Most components of the resulting graph can be labeled extensively
- The workspace can be customized by docking sub-menus as desired
- Graphs can be nested: for example part of a graph can be displayed using one algorithm
and another part using another algorithm best fitting its needs

In this section, we have shown a few possible uses of yEd in the context of social network analysis.
Combined with NetDraw software, yEd provides a powerful free starter pack which can be used in the
world of Social Network Analysis.


V. DISCUSSION

A. Limits
The analysis presented here is bounded by many limits. Many assumptions had to be made in order to
keep the analysis to a sustainable size. These assumptions are likely to introduce discrepancies in the
results. For the sake of awareness, the limits of the analysis are detailed in this section.
1) Input file discrepancies

a) Naming conventions for individual authors
As is sometimes customary in Anglo-Saxon countries, some names are not always transcribed in their
original form. “Richard” may be quoted as “Dick”; “Robert” as “Bob”; “Anne” as “Ann” etc.
Similarly, some foreign names may be spelled differently depending on the period. This is particularly
understandable when the RFC database is pure ASCII and many names use characters that go beyond
the scope of ASCII, for example replacing “ü” with “ue” or “u”.
Both kinds of name inconsistency might introduce several apparently different instances of a given author.
This was not corrected in the several cases found in the database because we did not have the ability to
crosscheck whether “John Doe” and “Jon Doe”, or “Bob Smith” and “Robert Smith”, were the same individual.
Such manual crosschecking would be too time-consuming. The errors introduced in the results were
found to be small enough to ignore. The rationale behind this is that an author would use one
spelling of their name in most cases. Erroneous spelling would therefore be the exception rather than the
norm.
An example of such discrepancy can be seen in Figure 21 where both Metcalfe_B. (Bob) and
Metcalfe_R. (Robert) are shown.
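A partial, automated aid to the crosschecking described above can be sketched as follows. This is not the method used in this study: the nickname table, the transliteration table and the function `name_key` are all hypothetical, and two names sharing a key should only be treated as candidates for manual review, since distinct individuals can share a key.

```python
import unicodedata

# Hypothetical nickname table: a real one would need manual curation.
NICKNAMES = {"dick": "richard", "bob": "robert", "ann": "anne"}
# Common ASCII transliterations of characters outside ASCII.
TRANSLITERATIONS = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss"}


def name_key(full_name: str) -> str:
    """Return a comparison key for an author name such as 'Bob Smith'."""
    text = unicodedata.normalize("NFC", full_name).strip().lower()
    for char, ascii_form in TRANSLITERATIONS.items():
        text = text.replace(char, ascii_form)
    # Drop any remaining accents, e.g. "é" becomes "e".
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    parts = text.split()
    if parts:
        # Map a known nickname in first position to its formal form.
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


print(name_key("Bob Smith") == name_key("Robert Smith"))
print(name_key("Müller") == name_key("Mueller"))
```

The sketch deliberately leaves some cases unresolved: a name written with “u” instead of “ue” still yields a different key, as does “Jon” versus “John”, which is why human judgment remains necessary.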

b) Naming conventions for organizations
The naming integrity in the RFC database is imperfect. Whilst in some cases the full name of an
organization is given, there are also several equally frequent occurrences where the acronym of the same
organization is used.
For example: IAB vs. Internet Architecture Board vs. Internet Activities Board; IETF vs. Internet
Engineering Task Force; IANA vs. Internet Assigned Numbers Authority; ISOC vs. Internet Society;
IESG vs. Internet Engineering Steering Group etc.
We felt that these discrepancies were sparse enough not to cause major data corruption. Furthermore,
our study centered on individuals and not on organizations. The decision was therefore taken to ignore
those discrepancies, though it is worth noting that several RFCs have one of the above organizations as
their sole or joint author.
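Had these discrepancies been corrected, one simple approach would be an alias table mapping every known spelling to a single acronym. The sketch below is an assumption about how this could be done, not part of the study; the table contains only the examples quoted above, and `canonical_organization` is a hypothetical helper.

```python
# Hypothetical alias table built from the examples in the text.
ORGANIZATION_ALIASES = {
    "IAB": ["Internet Architecture Board", "Internet Activities Board"],
    "IETF": ["Internet Engineering Task Force"],
    "IANA": ["Internet Assigned Numbers Authority"],
    "ISOC": ["Internet Society"],
    "IESG": ["Internet Engineering Steering Group"],
}

# Reverse lookup: every known spelling (lower-cased) maps to its acronym.
_LOOKUP = {}
for acronym, full_names in ORGANIZATION_ALIASES.items():
    _LOOKUP[acronym.lower()] = acronym
    for full_name in full_names:
        _LOOKUP[full_name.lower()] = acronym


def canonical_organization(author_field: str):
    """Return the acronym if the field names a known organization, else None."""
    return _LOOKUP.get(author_field.strip().lower())


print(canonical_organization("Internet Activities Board"))
print(canonical_organization("Bob Smith"))
```

A `None` result marks the field as a probable individual, which would also allow organizational authors to be separated from the individuals on which the study centers.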

c) Reporting on work with third parties
Several RFCs report on work undertaken by or in collaboration with third parties who might not be
named in the RFC itself. Some of these RFCs have an author identifying himself or herself as the editor
of the document. It is unknown whether this editor also contributed to the work presented,
and the team which performed the work might be identified only through acronyms such as those described
above, or might simply be unidentified. “Conversation with” a third party also constitutes the subject of several early RFCs.
In all cases, the name recorded is that of the RFC author. Manually treating each case in turn is much
too time-consuming, and whilst some individuals might have benefited from this reporting, we felt that
this inconsistency was marginal enough to be ignored.

d) RFC status
RFC status is explained in RFC 2026. It is important to remember that not all RFCs are standards track
documents, and that not all standards track documents reach the level of Internet Standard.
RFCs therefore fall into different statuses: Internet Standards Track (Proposed Standard, Draft Standard,
Internet Standard); Non-Standards Track Maturity Levels (Experimental, Informational, Historic); Best
Current Practice (BCP) and Unknown.
Our study does not take note of the current status because it is assumed that at the time the RFC was
written, it was current. The "usefulness" and scope of an RFC’s importance cannot be scientifically
evaluated. All RFCs are therefore considered in this study on an equal footing. Again, this might
introduce inaccuracies in the results, although the RFC sample size is so large that these are statistically
minimal. Furthermore, it is worth noting that all RFCs are equal when it comes to networking between
authors, whether the RFC reaches standards track or not.

2) Use of Network Analysis Software

a) 2-D vs. 3-D
Mentioning 3-dimensional display of data always attracts much attention. Although NetDraw
displays networks in 2 dimensions, its ability to export data which can be readily displayed in 3D by the
MAGE software is a real asset. However, the utility of displaying the data in 3D is
dependent on the input data set. Some network topologies will not show well in 3D. The question of 2D
vs. 3D is one which can only be answered through trial and error.
Although an in-depth discussion about 2D vs. 3D is outside the scope of this document, current
scientific knowledge points to 3-dimensional cognitive processes requiring more complex processing
for the human brain than 2-dimensional ones. Fixed 3-dimensional display of data adds complexity to the
brain’s visual recognition and might therefore be less useful than 2-dimensions except when presented
in an interactive way, such as the reader being able to rotate the 3-dimensional space about an axis. In
fixed displays, adding a third dimension to a planar representation might be counterproductive by
adding complexity.
Much data about the 2D vs. 3D cognitive model is available elsewhere on the Internet.

b) Large input data sets
It is said that a diagram speaks a thousand words. Large input data sets can indeed be analyzed using
the tools presented in this essay. However, space restrictions on an A4 page make it difficult to show the
results in a legible manner. It is therefore usually only possible to undertake macro-analysis on the
network (removing individual node labels, for example), and restrict micro-analysis to smaller data sets.
A mix of micro and macro analysis would be very useful in the future, for example being able to zoom in
towards specific parts of the network and isolate them using point and click. For the time being, both
NetDraw and yEd have a low usability factor in this scenario.
Large input data sets also require an increased amount of computing power and memory. Provided
adequate computing resources and memory are available, it would be possible to carry out a much more
targeted analysis.

c) Informal data vs. Formal data
The data source, namely IETF RFCs, constitutes only a subset of every development and collaborative
work ever undertaken to make the Internet what it is today. Whilst formally only a subset of authors are
included as the authors of an RFC, most RFCs are discussed extensively in working groups when at
Internet Draft stage, and informal communication provides much of the input towards the final RFC. By
choosing co-authorship as the only link between authors and as the total data set
for our analysis, we are unable to incorporate the informal data generated in the discussions.
This introduces a limit which vitiates the hypothesis that this research can find the “Father of the
Internet”. Indeed, since the research uses only a subset of the people involved in the
Internet’s development, it is clear that only a subset of contributors to that development is
displayed in the graphs.
Informal data is nearly impossible to track. Increasing the reliability of the results would have to
involve a mining of every email ever sent in the realm of the IETF standards process and this is clearly
impossible. Working group mailing lists do have archives, but these are so dissimilar in style,
completeness and social interaction protocols that a significant dose of Artificial Intelligence would be
required to mine noteworthy data.

d) Restricted data sets
Even with the subset data source consisting solely of IETF RFCs, the data mining process used further
reduces the input data set since it is too basic to extract acknowledgments from the RFC’s text. The only
parameters currently mined are the names of the authors of each RFC. Processing of this data gives
the number of publications by author and an author’s links with other authors.
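The processing step just described can be illustrated with a minimal sketch. It assumes the mined data is available as a mapping from an RFC identifier to its list of author names; the identifiers and names in `SAMPLE` are placeholders, not real RFCs or authors, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_publications_and_links(rfc_authors):
    """Return (publications per author, co-authorship count per author pair)."""
    publications = Counter()
    links = Counter()
    for authors in rfc_authors.values():
        unique = sorted(set(authors))  # ignore duplicates, fix the pair order
        publications.update(unique)
        links.update(combinations(unique, 2))
    return publications, links


# Placeholder data: neither the RFC identifiers nor the names are real.
SAMPLE = {
    "RFC-X1": ["Alpha_A.", "Beta_B."],
    "RFC-X2": ["Alpha_A."],
    "RFC-X3": ["Beta_B.", "Alpha_A.", "Gamma_C."],
}

publications, links = count_publications_and_links(SAMPLE)
print(publications.most_common())
print(links.most_common())
```

Each pair of co-authors becomes one link whose count is the number of RFCs they share, which is the kind of node and edge data a network drawing tool needs as input.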
An important dimension missing from the data set is the concept of time. Some RFCs were written in
the 1970s, some in the 1980s, some in the 1990s etc. Modifying the data mining process to incorporate dates
would enable NetDraw analysis by target date, which could then show social networks as they existed in
each period of time. Comparison of those networks might provide a good idea of the “nomadic”
behavior of some authors, a possible explanation for the differing faithfulness shown by some nodes
when dividing the network of authors into clusters, as seen in Section G, the Girvan-Newman algorithm.
The Internet has evolved, and so have the social networks of people building it.
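If publication years were mined, the networks could be split by period along the following lines. This is only a sketch under the assumption that each record is a (year, list of authors) pair; the records shown are placeholders, and grouping by decade is one arbitrary choice of period.

```python
from collections import defaultdict
from itertools import combinations


def links_by_decade(records):
    """Group co-authorship links by the decade in which the RFC was published."""
    grouped = defaultdict(set)
    for year, authors in records:
        decade = (year // 10) * 10
        grouped[decade].update(combinations(sorted(set(authors)), 2))
    return dict(grouped)


def partners(author, links):
    """Return the co-authors of one author within a set of links."""
    return {b if a == author else a for a, b in links if author in (a, b)}


# Placeholder records: years and names are illustrative only.
RECORDS = [
    (1972, ["Alpha_A.", "Beta_B."]),
    (1978, ["Beta_B.", "Alpha_A."]),
    (1985, ["Alpha_A.", "Gamma_C."]),
]

by_decade = links_by_decade(RECORDS)
for decade in sorted(by_decade):
    print(decade, partners("Alpha_A.", by_decade[decade]))
```

An author whose set of partners changes from one period to the next would be a candidate for the “nomadic” behavior mentioned above.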
Another dimension missing from the data set is the significance of an RFC, the current assumption
being that every RFC is as “important” as every other RFC – and this is clearly not the case. Perhaps a
concept of “RFC weight” could be developed to measure the impact of each RFC on the Internet’s
technical development.
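How such a weight could enter the publication count is sketched below. The weights themselves are purely hypothetical, since no objective measure of an RFC's impact is proposed here; with no weights supplied, the sketch falls back to the equal-footing assumption used in this study.

```python
from collections import defaultdict


def weighted_output(rfc_authors, rfc_weights, default_weight=1.0):
    """Sum, for each author, the weights of the RFCs they authored."""
    totals = defaultdict(float)
    for rfc, authors in rfc_authors.items():
        weight = rfc_weights.get(rfc, default_weight)
        for author in set(authors):
            totals[author] += weight
    return dict(totals)


# Placeholder data: identifiers, names and weights are illustrative only.
AUTHORS = {"RFC-X1": ["Alpha_A.", "Beta_B."], "RFC-X2": ["Alpha_A."]}

print(weighted_output(AUTHORS, {}))               # every RFC counts as 1.0
print(weighted_output(AUTHORS, {"RFC-X1": 3.0}))  # RFC-X1 counts three times
```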
The more restricted a data set, the more restricted the results.

3) Reliance on RFC database to find a father
This study makes exclusive use of the RFC database to examine the Internet’s evolutionary process. Of course,
the basic assumption of “Anything not in an RFC does not exist” is as absurd as “Anything invented
before the first RFC does not exist”. A great many inventions, for example WYSIWYG and the Mouse,
the World Wide Web, search engines, peer-to-peer computing and other applications also make the
Internet what it is today. In fact, the biggest strength of today’s Internet is that you can throw any type of
traffic at it and it will carry it, since it is independent of the physical, link and application layers.
None of the above applications was covered by an RFC. Does this mean that none of the inventors of those
applications have any kind of paternity claim over the Internet?
RFCs are not the Alpha and Omega of the Internet’s existence. For example, a large amount of early
work was published as Internet Experiment Notes (IENs), a set of more than 200 documents and reports
preceding the first Internet RFC (RFC675)[16]. Our analysis misses this data.
Perhaps a wider, more inclusive, cross-standard, cross activity, cross-invention and cross-layer search
and analysis would be required? Scio me nihil scire (I know myself to know nothing) [17].

B. Opportunities
Social Network Analysis opens a new door to further understanding groups of people. In the context
of RFC authors, any method circumventing the limits described above would increase the accuracy of
the analysis and therefore the accuracy of the results.

As we have seen in this essay, a major analytical shortfall of our research is that it does not take
chronological perspective and timelines into consideration. Any ongoing research process spanning
several discoveries introduces a hierarchical chronology of innovations and publications. For instance,
combining this study with an analysis making use of the RFC Citation index would yield further
information about the influence RFC authors have had over their peers and over the development of
today’s Internet. The Citation index would need to be data mined from the existing RFC Index [8].

The treatment of these results would ease the limit described in Section V.A.1.d above, as well as
allow for the sketching of NetDraw and yEd graphs, a non-exhaustive list being suggested as follows:
• RFC timelines
• A hierarchical tree of RFCs which might lead to a hierarchical tree of RFC authors
• The Ego network of an RFC, which might lead to an Ego network of authors with indirect
influence, rather than the current direct influence analysis shown in Section IV.I.
• The branching of RFCs which have been unsuccessful in generating traction – some might be
lost opportunities, some might be rising stars, some might be dormant, some might be
alternative processes and some might be dead ducks. The interest in this analysis is generated
from the question: did history make full use of knowledge available at the time?
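As an illustration of how a citation index could lead from RFCs to authors with indirect influence, consider the sketch below. It assumes that citations have already been mined into a mapping from each citing RFC to the RFCs it cites; no assumption is made about the layout of the RFC Index file itself, and all identifiers and names are placeholders.

```python
from collections import defaultdict


def author_influence(cites, authors):
    """Map each cited author to the set of authors whose RFCs cite their work.

    cites:   citing RFC -> list of cited RFCs (assumed already mined)
    authors: RFC -> list of author names
    """
    influence = defaultdict(set)
    for citing, cited_rfcs in cites.items():
        for cited in cited_rfcs:
            for source in authors.get(cited, []):
                for target in authors.get(citing, []):
                    if source != target:  # ignore self-citation
                        influence[source].add(target)
    return dict(influence)


# Placeholder data: identifiers and names are illustrative only.
AUTHORS_BY_RFC = {
    "RFC-X1": ["Alpha_A."],
    "RFC-X2": ["Beta_B."],
    "RFC-X3": ["Gamma_C.", "Alpha_A."],
}
CITES = {"RFC-X2": ["RFC-X1"], "RFC-X3": ["RFC-X2"]}

influence = author_influence(CITES, AUTHORS_BY_RFC)
for author in sorted(influence):
    print(author, sorted(influence[author]))
```

The resulting author-to-author pairs are directed, unlike co-authorship links, so they could be drawn as a hierarchical layout rather than as an undirected network.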

Clearly, a cross-discipline fusion of analytical methods, using statistical techniques, Artificial
Intelligence, graph theory, chaos theory, fuzzy logic and social network analysis of a cross-layer input
set of data would enhance the accuracy of results – as would having access to vast manpower and
computing resources. The opportunity to derive data on this subject from the sources currently available
on the Internet is almost limitless, but the intent of the work presented in this essay is not to reach highly
conclusive and accurate results. Rather, it is to provide a somewhat rhetorical example of what could be
achieved with very limited computing resources (namely one laptop) and software freely available out
there on the Internet.

VI. FURTHER WORK
Taking into account the opportunities described above, the door is clearly open to many paths for
further work. For example:
• Use methods of social analysis on:
o ISOC bottom-up structure
o ICANN and its constituencies, starting perhaps with the At-Large structures
o W3C recommendations and its consensus-based standards tracks
o WSIS/UN IGF bottom-up processes and at large involvement at global level
o Elements of Internet Governance
• Use of methods of social analysis in a political party, to ensure a smooth information flow and
correct leadership process, including the processes leading to presidential races and elections
• Use of such methods in any organization whose decision structure is based on the concept of
bottom-up processes, whether by consensus or vote

VII. CONCLUSION
In this exercise, we have demonstrated the worth of Social Analysis and its usefulness in light of new
Knowledge Management practices. By combining some Competitive Intelligence Data Mining
techniques with Social Network Analysis, we have introduced new parameters which can be used to
verify and display the degree of satisfactory consensus building in an organization. This could include
the organization of working groups or combining of entities having a different social, contextual and
historical background. This could also include the sharing of data within the context of an
organization’s Knowledge Management.
The Internet and its governance is possibly one of the most complex societal systems ever to evolve.
Its complex mesh of working communities will require co-ordination in the future in order for
governance to be able to tackle future challenges successfully and to make sure that its decision process
is as inclusive as possible whilst being streamlined enough to actually reach decisions. Since the
Internet’s place in people’s lives is increasing year on year, novel scientific tools which could help in its
governance & development should be available for anyone to use.

This essay has provided an insight into what some of these future tools might look like and how
useful they could be.

As for the question, “who is the father of the Internet” – since we have proven that RFC authors have
a habit of working as a community, this would be impossible to determine without a DNA sample. Will
the real father(s) and mother(s) please stand up?

More specifically, we have shown that:

- There are many “fathers” of the Internet. They are all closely linked together into a
network of authors comprised of many cliques and clusters which appear to be as
interlinked as the interlinking of networks in the Internet’s network of networks
- Many RFC documents are written by a single author, although this constitutes
a minority in the community
- The most prolific authors tend to form clear clusters, inter-linked to other clusters by
key individuals
- Jon Postel, having held the position of RFC Editor, was one of those key individuals
- Joyce K. Reynolds is also a very prominent author, with many RFCs co-authored with
Jon Postel – in fact, she also acted as RFC editor and helped with IANA management
- Robert Braden is a key character in the RFC structure of authors, as shown by his high
centrality. Admittedly, he chaired the IRTF End-to-End Research Group which
developed many key RFCs, and served as the RFC co-editor for the IETF.

ACKNOWLEDGMENT
The author thanks E. Boutin (University of Toulon, France) [18] and Dr. Brian Dickens (National
Institute of Standards and Technology, Gaithersburg, MD, USA) for their valuable feedback and
corrections in the dissertation of this paper, V. Cerf [19] for his kind feedback about early Internet
research and R. Bush [20] for having allowed his name to be used in examples on Ego networks. The
author would also like to dedicate this essay to Tim Gartside (ISOC Sphere Labels project) [21] who
provided some of the initial inspiration for this research but who left us tragically before its conclusion.


REFERENCES

[1] Crocker, S., “Host Software”, RFC Repository, IETF Online Secretariat. Available:
http://www.ietf.org/rfc/rfc0001.txt?number=1
[2] RFC Editor Web Page. Available: http://www.rfc-editor.org/
[3] Kleinrock, L., “Communication Nets; Stochastic Message Flow and Delay”, McGraw-Hill Book Company, New York,
1964. (Out of Print) Reprinted by Dover Publications, 1972. (Published in Russian, 1971, Published in Japanese, 1975.)
[4] Licklider, J. C. R., "Topics for Discussion at the Forthcoming Meeting, Memorandum For: Members and Affiliates of the
Intergalactic Computer Network". Washington, D.C.: Advanced Research Projects Agency, 23 April 1963.
[5] Engelbart, D. C., et al., "SRI-ARC. A technical session presentation at the Fall Joint Computer Conference in San
Francisco, Dec. 9, 1968" (NLS demo ’68: The computer mouse debut), 11 film reels and 6 video tapes (100 min.), Engelbart
Collection, Stanford University Library, Menlo Park (CA) (some footage available on the Internet)
[6] Cerf, V. and Kahn, R., “A Protocol for Packet Network Intercommunication”, IEEE Trans on Communications, Vol 22-5,
May 1974.
[7] Metcalfe, R., et al., Xerox Corporation, “Multipoint data communication system with collision detection”, U.S. Patent
4,063,220, 31 March 1975.
[8] RFC Index. Available: ftp://ftp.rfc-editor.org/in-notes/rfc-ref.txt
[9] NetDraw Network Visualization. Available: http://www.analytictech.com/Netdraw/netdraw.htm
[10] Torgerson, W. S., “Multidimensional scaling: I. Theory and method.” Psychometrika,
17:401-419.
[11] 3D Analysis: The Mage Page. Available: http://www.sbb.duke.edu/kinemage/magepage.php
[12] Einstein, A., "Die Grundlage der allgemeinen Relativitätstheorie", Annalen der Physik 49, 1916.
[13] Girvan, M. and Newman, M.E., “Community structure in social and biological networks.”, Proc. Natl. Acad. Sci. USA,
99, 7821-7826, 2002.
[14] Johnson, S.C., "Hierarchical Clustering Schemes" Psychometrika, 2:241-254, 1967.
[15] yEd Graph Editor. Available: http://www.yworks.com/en/products_yed_about.html
[16] Internet Experiment Note (IEN). Available: http://www.postel.org/ien/txt/ien-index.txt
[17] Attributed to Socrates’s Apology, which Plato handed down
[18] Boutin, Eric, Personal Web Page: http://i3m.univ-tln.fr/imprimer.php3?id_article=88
[19] Cerf, Vinton, Web Page (no affiliation): http://en.wikipedia.org/wiki/Vint_Cerf
[20] Bush, Randy, Personal Web Page: https://archive.psg.com/
[21] Gartside, Tim, Web Page: http://wiki.chapters.isoc.org/tiki-index.php?page=Tim+Gartside&bl=y

Olivier M.J. Crépin-Leblond has been an Internet user since 1988. He received a B.Eng. Honours degree in Computer Systems and
Electronics from King’s College, London, UK, in 1990, a Ph.D. in Digital Communications from Imperial College, London, UK, in 1997, and a
Specialized Masters Degree in Competitive Intelligence and Knowledge Management from CERAM Business School in Nice-Sophia Antipolis, France, in
2007.
Over the years, he has been involved in many Internet and telecom projects, founded Global Information Highway Ltd in 1995 and is available as a
consultant in Telecom matters. Current interests range from IPv6 deployment, Network Neutrality, Internet Governance and Green Internet to all aspects of
Strategy, Intelligence and Knowledge Management in the 21st Century, especially for bottom-up consensus-based organisations.
He is a member of the IET and senior member of the IEEE, Board member of the English chapter of ISOC and of ICANN’s European At-Large
Organisation (EURALO). In 2010 he is also a Nominations Committee member for ICANN.
Full details available on: http://www.gih.com/ocl.html
