Crawling Big Data in a New Frontier for Socioeconomic Research: Testing with Social Tagging
1. Crawling Big Data in a New Frontier
for Socioeconomic Research:
Testing with Social Tagging
JUAN DIEGO BORRERO, jdiego@uhu.es
ESTRELLA GUALDA, estrella@uhu.es
University of Huelva
Seminários CIEO - Universidade do Algarve
Faro, 31 October 2012
2. Table of Contents
• 1. Introduction
• 2. Theoretical perspective
– Web 2.0 and Collaborative tagging
– Tagging and Folksonomy
– The collective knowledge inherent in social tags
– Tagging and Social networks
– Social Web and its impact on Information Retrieval (IR) and Recommender Systems (RS)
• 3. Methodology
– 3.1. Data Collection procedure
– 3.2. Analysis procedure. SNA
• 4. Results
– 4.1. Centralization: Authority
– 4.2. Node Tags: Users producing Tags
• 5. Discussion
– 5.1. Centrality and Power
– 5.2. Central Tags: Users producing Tags
• 6. Conclusions and future research
3. 1. Introduction
What puzzles?
1. The era of Big Data and Social Media has begun!
E.g., Twitter, Facebook, Tumblr, Delicious, YouTube,
Flickr, Wikipedia…
2. Will it transform how we study human communication
and social relations?
3. Will it alter what ‘research’ means?
Some or all of the above?
4. 1. Introduction
What puzzles?
1. Big Data is notable not because of its size, but
because of its relationality to other data. Big Data is
fundamentally networked. Its value comes from the
patterns that can be derived by making connections
between pieces of data, about an individual, about
individuals in relation to others, about groups of
people, or simply about the structure of information
itself.
2. Big Data is important because it refers to an analytic
phenomenon playing out in academia.
3. Big Data is important because of its popular
salience.
5. 1. Introduction
Tagging
• New technologies have made it possible for
a wide range of people to produce, share,
interact with, and organize data.
• People can classify the huge amount of
information at their disposal in the form of
tags.
6. 1. Introduction
Tagging in Delicious
Keywords, freely chosen by users or suggested by Delicious, employed to annotate various types of digital content
Source: www.delicious.com
7. 1. Introduction
Social Tagging Systems
Many users add metadata in the form of tags
Source: http://bvdt.tuxic.nl/index.php/the-wisdom-of-the-crowds-in-the-audiovisual-archive-domain/
Resulting collective tag structure
Source: http://blog.hubspot.com/blog/tabid/6307/bid/7372/9-Reasons-Why-Your-Social-Media-Strategy-Isn-t-Working.aspx/
Source: http://www.idonato.com/2009/05/27/fun-with-tag-clouds/
9. 1. Introduction
Our Assumption
• Big Data offers the humanistic disciplines a new
way to do quantitative work, and with it another
kind of objective method of analysis.
• In reality, however, working with Big Data is still
subjective.
• For this reason, it is crucial to begin asking questions
about the analytic assumptions, methodological
frameworks, and underlying biases embedded in
the Big Data phenomenon.
10. 1. Introduction
Our Objectives
1. To propose a methodology for using big data
from Web 2.0 in social research,
2. To apply it to extract data automatically from the
Delicious social bookmarking website, and
3. To show the type of results that this kind of
analysis can offer to social scientists.
4. We focus our study on the 'globalization of
agriculture' community, and pay special
attention to social network analysis (SNA).
11. 2. Theoretical perspective
Web 2.0… and collaborative tagging
Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform (O’Reilly, 2007).
Collaborative, or social, tagging is the activity in Web 2.0 of annotating digital resources with keywords, i.e. tags (Golder and Huberman, 2006; Trant, 2009).
Source: http://www.laurenwood.org/anyway/2007/11/web-20-buzzwords/
12. 2. Theoretical perspective
… collaborative tagging
Collaborative, or social, tagging is the activity in Web 2.0 of annotating digital resources with keywords, i.e. tags (Golder and Huberman, 2006; Trant, 2009).
Digital resources: webpages, photos, videos…
A collaborative tagging system is mainly composed of three interconnected components: users, tags, and resources (Smith, 2008).
13. 2. Theoretical perspective
… collaborative tagging and folksonomy
Social tagging systems aggregate the tags of all users and describe the resources in a so-called folksonomy (Vander Wal, 2004).
Problems:
– Synonyms: global warming = climate change
– Term variations: globalization = globalisation; poor = poors
14. 2. Theoretical perspective
… folksonomy and collective knowledge
Bottom-up process: the tags of many different users are aggregated and the resulting collective tag structure (such as a tag cloud) depicts the collective knowledge of Web users (Cress et al., 2012).
Source: http://blog.cimmyt.org/?p=6052
15. 2. Theoretical perspective
Tagging and social networks
The structure of social tagging websites can be viewed as a network of three different node types: the users U, the resources R (websites, i.e. URLs) and the tags T that the users deploy to tag the websites.
A particular class of networks is the bipartite networks, whose nodes are divided into two sets, e.g. users and tags.
An opinion network (Maslov and Zhang, 2001; Blattner et al., 2007) is a network in which users connect to the objects that they gather.
Figure 1. A Bipartite Network made of three users U=(u,u’,u’’), three tags T=(t,t’,t’’) and two kinds of links: between users RU (straight lines), and between users and tags RT (dashed lines)
Source: Authors
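As a minimal sketch of the bipartite view described above, the following Python snippet stores user–tag links as two adjacency maps, one per node set. The user names, the tags and the helper `build_bipartite` are invented for illustration and are not taken from the Delicious data set.

```python
from collections import defaultdict

# Hypothetical user-tag links (invented for illustration).
links = [
    ("u1", "economics"),
    ("u1", "environment"),
    ("u2", "economics"),
    ("u3", "trade"),
]

def build_bipartite(pairs):
    """Return two adjacency maps: user -> set of tags, tag -> set of users."""
    tags_of = defaultdict(set)
    users_of = defaultdict(set)
    for user, tag in pairs:
        tags_of[user].add(tag)
        users_of[tag].add(user)
    return dict(tags_of), dict(users_of)

tags_of, users_of = build_bipartite(links)
# Users who applied the tag "economics"
print(sorted(users_of["economics"]))
```

Keeping one map per node set makes both directions of the network (which tags a user applies, which users share a tag) cheap to query.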
16. 2. Theoretical perspective
Social web and its impact on Information
Retrieval (IR) and Recommender Systems (RS)
1. Social IR (i.e. IR that uses folksonomies) creates
algorithms for folksonomies in order to identify which
information is relevant and to identify communities
according to their needs. From this point of view, this
paper aims to present a methodology for retrieving big
data from Web 2.0 environments.
2. We introduce social tagging as a basis for
recommendations built on a ternary relation
between users, resources, and tags, in order to discover
latent patterns linked to the activity of collaborative
tagging, which could be fundamental for providing
effective recommendations to different actors.
17. 3. Methodology
• Data set from: Delicious (www.delicious.com).
• Delicious = social bookmarking system in which:
– Content is created, annotated and viewed by its
users.
– Classification is non-hierarchical: users can tag
each of their bookmarks on the Delicious website,
which provides knowledge about the bookmarked URL.
– The nature of the system is collective. Users can:
• view bookmarks added or annotated by other users.
• organize existing tags into groups (tag bundles).
18. 3.1. Data Collection procedure
We collected annotations made in social bookmarking services.
Each annotation has at least four parts:
• 1. Link to the resource (website…)
• 2. One or more tags
• 3. User who makes the annotation
• 4. Moment/time when the annotation is made
• This article focuses on the co-occurrence of users, resources
and tags, i.e. the triple (user, resource, tag).
Dataset collected: $U = \{u_1, u_2, \ldots, u_K\}$, $R = \{r_1, r_2, \ldots, r_M\}$, and $T = \{t_1, t_2, \ldots, t_N\}$
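The four-part annotation and the sets U, R and T can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Annotation` type, the helper `to_triples` and the sample records are hypothetical and do not come from the collected data.

```python
from typing import NamedTuple, Tuple

class Annotation(NamedTuple):
    user: str               # user who makes the annotation
    resource: str           # link to the resource (URL)
    tags: Tuple[str, ...]   # one or more tags
    timestamp: str          # moment when the annotation is made

# Invented sample records (not from the Delicious data set).
annotations = [
    Annotation("u1", "http://example.org/a", ("food", "trade"), "2011-04-22"),
    Annotation("u2", "http://example.org/a", ("food",), "2011-05-01"),
    Annotation("u1", "http://example.org/b", ("economics",), "2011-05-21"),
]

def to_triples(records):
    """Expand each annotation into one (user, resource, tag) triple per tag."""
    return [(a.user, a.resource, t) for a in records for t in a.tags]

triples = to_triples(annotations)
U = {u for u, _, _ in triples}   # set of users
R = {r for _, r, _ in triples}   # set of resources
T = {t for _, _, t in triples}   # set of tags
print(len(triples), len(U), len(R), len(T))
```

Each triple corresponds to one tagging, so the length of `triples` plays the role of the number of taggings in the data set.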
19. 3.1. Process to retrieve the data
Figure 2. Data Collection Procedure
Source: Authors
(A) Starting point. Identify the search attributes.
An authoritative source is used as the baseline to find keywords
connected to the idea of ‘globalization of agriculture’:
– Wikipedia definition of “critics of globalization”
(popular, high reputation)
– Other starting points (future)
– Main concepts selected manually (researcher expertise)
from the website homepages, tag clouds or topics.
– Identified the 5 seed keywords (globalization +
agriculture, food, organic, and GMO)
– Other concepts rejected
(B) Web crawling was carried out with a Perl program,
gathering the sample of users, URLs and tags
- For globalization+agriculture; globalization+food;
globalization+organic; globalization+GMO
- Between 22 April 2011 and 21 May 2011 (one complete
month)
- Results: 10,220 taggings that involved 851 users on
1,077 URLs and 1,720 tags.
(C) A Haskell program reduced the amount of data
by cutting the URLs and using keywords, including the
identification of synonyms, the elimination of words with
capital letters and derivatives such as words in plural.
(D) Dataset for analysis
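Step (C) can be illustrated with a short Python sketch. It is not the authors' Haskell program: the synonym table, the reading of "cutting the URLs" as keeping only the host name, the treatment of capital letters as lower-casing, and the plural rule are all assumptions made for illustration, as is the example URL.

```python
from urllib.parse import urlsplit

# Hypothetical synonym table; the authors' actual list is not given.
SYNONYMS = {"globalisation": "globalization", "climatechange": "globalwarming"}

def shorten_url(url):
    """Assumption: 'cutting' a URL means keeping only its host name."""
    return urlsplit(url).netloc.lower()

def normalise_tags(tags):
    """Lower-case tags, map synonyms, and fold plurals onto existing singulars."""
    lowered = [SYNONYMS.get(t.lower(), t.lower()) for t in tags]
    vocabulary = set(lowered)
    result = []
    for tag in lowered:
        singular = tag[:-1]
        # Strip a final "s" only when the singular form also occurs,
        # so that words such as "economics" are left untouched.
        if tag.endswith("s") and singular in vocabulary:
            tag = singular
        result.append(tag)
    return result

print(shorten_url("http://www.nytimes.com/2011/05/01/some-article.html"))
print(normalise_tags(["Farmers", "farmer", "Economics", "globalisation"]))
```

Folding a plural only when its singular is already in the vocabulary is a deliberately conservative choice: it merges "farmers" with "farmer" without damaging tags that merely end in "s".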
21. Table 1. Keywords Used in the topic “Globalization of agriculture”
Search attributes used (I+II) | Number of resulting tags | More frequent tags / Main tags
Globalization (I) + agriculture (II) | 1,116 | Food (268), economics (176), environment (145), politics (85), trade (81), sustainability (70)
Globalization (I) + food (II) | 1,682 | Economy (180), economics (171), environment (122), sustainability (78), politics (60)
Globalization (I) + organic (II) | 22 | Business (3), fair-trade (3)
Globalization (I) + GMO (II) | 54 | Food (13), agriculture (12)
Source: Authors
22. 3.2. Analysis procedure: SNA
Network analysis
• Node centrality: identification of the nodes that are more “central” than
others.
Network-level property = an idea of the node’s social power based on how well it
“connects” to the network.
• Degree of a node = number of direct connections an individual has with
others in the group.
Highest degree = the node exerts influence (or authority).
In-degree = number of incoming ties, which reflects the popularity of a website. As a
result, the prominent, well-connected members (those with a high degree of
centrality) are usually the opinion leaders.
Out-degree = number of outgoing ties, which determines whether a particular user is an
active or passive participant within the network.
Software: Pajek (large data sets). While a user of the Delicious bookmarking
system is simply using Delicious, the analysis looks for latent structures and
the power that emerges from the network…
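The degree measures defined above can be computed directly from a list of user → URL links. The sketch below uses plain Python rather than Pajek, and the edge list is invented for illustration.

```python
from collections import Counter

# Invented user -> URL bookmarking links (one tuple per tagging).
edges = [
    ("u1", "nytimes.com"),
    ("u1", "bbc.co.uk"),
    ("u2", "nytimes.com"),
    ("u3", "nytimes.com"),
]

def degrees(user_url_edges):
    """Return (out-degree per user, in-degree per URL)."""
    out_degree = Counter(user for user, _ in user_url_edges)
    in_degree = Counter(url for _, url in user_url_edges)
    return out_degree, in_degree

out_degree, in_degree = degrees(edges)
print(in_degree.most_common(1))   # the most "popular" URL
print(out_degree.most_common(1))  # the most active user
```

In this user → URL network only URLs receive links and only users send them, so in-degree ranks websites and out-degree ranks users.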
23. Figure 3. Hyperlink Network Energy Kamada-Kawai Map.
Bipartite Network user–URL
Source: Authors, using Pajek
24. Results 4.1. Centralization (Authority)
Centralization: user–URL
URL’s in-degree: sum of total inbound links
User’s out-degree: sum of total outbound links
Network highly centralized within a few nodes:
Only 10 URLs out of 526 (1.90%) account for 32.29% of links to URLs.
These 10 URLs received 3,290 inbound links out of a total of 10,219.
Only 10 users out of 851 (1.17%) account for 14.05% of links to URLs.
These 10 users produced 1,436 outbound links out of a total of 10,219.
The 10 most central websites: nine of them were media-based (online newspapers such as
The New York Times, BBC, The Guardian, Washington Post, Financial Times, Reason,
The Nation, Spiegel and The Economist) (Table 2)
Identification of users with a greater degree of centrality:
The user Mritiunjoy plays a very important role in the network.
Mritiunjoy joined Delicious on 12 March 2007 and to date has 10,020 links and
is following 38 users.
Mritiunjoy Mohanty is a professor at the Indian Institute of Management Calcutta, India,
and his research interests are the political economy of growth and development.
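The concentration figures above are shares of all links held by the best-connected nodes. A minimal Python sketch of that calculation follows; the function name `top_k_share` and the toy degree values are invented, and only the last line uses figures reported on this slide.

```python
def top_k_share(degree_values, k):
    """Percentage of all links held by the k best-connected nodes."""
    ranked = sorted(degree_values, reverse=True)
    total = sum(ranked)
    return 100.0 * sum(ranked[:k]) / total if total else 0.0

# Invented in-degree values for four URLs.
toy_in_degrees = [6, 2, 1, 1]
print(top_k_share(toy_in_degrees, 1))

# Figures reported above for users: 1,436 of 10,219 outbound links.
print(round(100.0 * 1436 / 10219, 2))
```

Applied to the full list of user out-degrees with k = 10, the same function would yield the share of outbound links produced by the ten most active users.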
30. Figure 8. Hyperlink Network. 851 users arranged in rank order by
number of outbound links and 1,077 URLs arranged in rank order
by number of inbound links
Source: Authors
Why are a few users and websites better connected
than the majority?
31. Value of identified nodes (websites) due to:
• The links that they receive (their
instrumental nature)
• The profile of these organizations
(newspapers that channel large quantities of
resources, i.e. information) (quality of the
links) = central URLs with authority.
32. Results. 4.2. Node Tags: Users producing Tags
• Collective tag structure (excluding the search
keywords globalization, agriculture, food,
organic, and GMO) produced with Wordle.
• Sizes of the terms in the tag clouds are
proportional to their weights; only the 25
highest-weighted tags are shown.
• Tag clouds: identifying the topical groupings in
a tag network
– Identification of topics around globalization of
agriculture
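The weights behind such a tag cloud can be sketched in a few lines of Python. This assumes that a tag's weight is simply its frequency, which the slides do not state explicitly; the sample tags and the helper `tag_weights` are invented, and the drawing itself is left to a tool such as Wordle.

```python
from collections import Counter

# Search keywords excluded from the cloud, as listed above.
SEED_KEYWORDS = {"globalization", "agriculture", "food", "organic", "gmo"}

def tag_weights(tags, top_n=25):
    """Count tags, drop the search keywords, keep the top_n most frequent."""
    counts = Counter(
        t.lower() for t in tags if t.lower() not in SEED_KEYWORDS
    )
    return counts.most_common(top_n)

# Invented sample of tags (not from the Delicious data set).
sample = ["economics", "Food", "environment", "economics",
          "GMO", "trade", "economics", "environment"]
print(tag_weights(sample, top_n=2))
```

The resulting (tag, weight) pairs are the input a tag cloud generator needs to scale each term.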
33. Figure 9. Tag Cloud for the Agriculture Globalization
Network Identified in the Delicious Data Set
Source: Authors, using Wordle
The resulting main key topics were economics and the environment.
Main keywords used by Delicious users to describe or characterise the topic
‘globalization of agriculture’.
34. The 50 most frequent tags (tags used more than 20 times)
Tag | Count | Tag | Count | Tag | Count
Economics | 350 | World | 47 | BBC | 30
Environment | 274 | Global | 46 | Future | 30
Sustainability | 153 | Capitalism | 45 | Geography | 30
Politics | 152 | Green | 43 | Water | 30
Economy | 144 | Research | 42 | Nutrition | 29
Trade | 131 | Crisis | 41 | Government | 27
Business | 99 | International | 41 | Wto | 27
Poverty | 97 | Oil | 38 | Agribusiness | 26
Culture | 84 | Prices | 37 | Ecology | 25
Farming | 84 | Activism | 35 | Europe | 25
Africa | 83 | News | 35 | Globalwarming | 23
Health | 78 | Science | 35 | Reference | 22
Development | 76 | Hunger | 34 | Technology | 22
Energy | 76 | Usa | 34 | Biofuel | 21
India | 65 | Inflation | 32 | Corporations | 21
China | 59 | History | 31 | Farmers | 21
Policy | 55 | Local | 31 | |
35. Discussion: 5.1. Centrality and Power
In this Delicious network on the globalization of agriculture, The New York Times
far surpasses the other URLs (1,203 inbound links, followed by the BBC
website with 674).
The most cited, recommended or considered websites on a topic
occupy a central place and play an important role in the
dissemination of news, events, trending topics, ideology, culture, etc.
Identification of key collective actors (represented here through URLs) allows
a better comprehension of leadership, influence processes, and power-related
structures.
For social practitioners, it is a good way to identify key informants in a
community through whom to disseminate useful and important information.
Very unequal distribution of power among the URLs cited by users in the topic
globalization of agriculture.
- Important accumulation of inlinks.
ADVANTAGES OF THIS TYPE OF KNOWLEDGE
FOR RESEARCHING AND INTERVENING
36. Discussion. 5.1. Centrality and Power
• FOCUS ON users: identification of key actors that
disseminate and share URLs, such as the previously cited
Mritiunjoy
– Determine where the key elements that structure the network
emerge from.
• Why is ‘that’ actor so important in the network of
globalization of agriculture?
– Key actors in this type of network could configure and
reconfigure the evolution of the network (TIME), and
structure and even manipulate the type of interchange of
resources in Delicious or in similar bookmarking sites.
• Is it by chance? Do the most prominent actors on a
website like Delicious fit a profile of very
active and participative people? Do they usually work
(or have a hobby) in this area, and is this why they
accumulate and tag so many URLs in Delicious?
– Further steps of the research.
37. 5.2. Central Tags: Users producing Tags
• Tags suggested by the website + new tags added in a creative way
• ‘Tag cloud’: visual approach to the language used by users
• From a total of 1700 tags two words were the main ones.
• Each user could label a URL with an unlimited number of tags
(average 12 tags per user, max 433 and min 2).
• The most frequently used tags were ‘economics’ (350 citations
out of 1700 tags, 20.6%) and ‘environment’ (273, 16%).
• Other very frequent tags were sustainability (153), politics (152),
economy (144), trade (131), business (99), poverty (97), culture (84),
farming (84), africa (83), health (78), and development (76);
together these 13 tags represent, in relative terms, one out of four
tags applied around the topic (25.9%).
Questions:
• Reasons for the prominence of the first two tags around the
globalization of agriculture.
• Are some of the 1700 tags found used interchangeably?
– Why is the word economics used sometimes, and economy
at other times?
– Are they used in the same way when classifying the URLs?
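One possible way to approach the last question, offered here only as a sketch and not as the authors' method, is to compare the sets of URLs that two tags were applied to, for example with the Jaccard similarity. The triples and both helper functions below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def urls_by_tag(triples):
    """Map each tag to the set of URLs it was applied to."""
    index = {}
    for _user, url, tag in triples:
        index.setdefault(tag, set()).add(url)
    return index

# Invented (user, URL, tag) triples.
triples = [
    ("u1", "a.example", "economics"),
    ("u2", "a.example", "economy"),
    ("u2", "b.example", "economy"),
    ("u3", "c.example", "economics"),
]
index = urls_by_tag(triples)
print(jaccard(index["economics"], index["economy"]))
```

A value close to 1 would suggest that the two tags classify largely the same URLs, while a value close to 0 would suggest that users apply them to different resources.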
38. Conclusions: achieved goals
• Presenting this methodology for using big data from Web 2.0 in
socioeconomic research, illustrated with a social
bookmarking site (Delicious), is:
• A first step towards the development of empirical techniques
capable of automatically differentiating groups of
individuals with common interests, and individuals who
occupy a more central position.
• A first stone in the difficult process of understanding and
discovering patterns in the processes that characterize users
tagging URLs for collaborative reasons.
• Utility: discovering latent patterns = providing effective
recommendations to different actors.
• Understanding a community of more than a thousand links.
• Retrieval and analysis of information: complex, but made easier by
working in interdisciplinary teams
39. Other topics for Researching: Future
• Improvements are necessary in retrieval methods and in the
implementation of Information Retrieval and Recommender Systems
techniques
• Influence of the first tags on the following ones. Role of innovation and
creativity in tagging
• Evolution and usage of language around an issue over time.
• Ideological and terminological approaches in the national/international
arena
• Use of some tags in classifying URLs and the differences among users in
the way they use some words/tags
– Distinction between scientists and other professionals or users?
– Identify users with the same tagging patterns, or URLs that were similarly
labelled: study structural equivalences
• Other possible studies based on retrieving the pages and carrying out content
analysis
• Why are some labels present or absent?
• Are there “traditions” or “fashions” in tagging in Web 2.0?
• Comparing results from Delicious with those from other social bookmarking sites
• Study users in depth (if possible)
• And other explorations, other starting points, other bookmarking sites, other
indicators, complementary to those used in this illustration
40. Possible Applications
• Producing and manipulating public opinion (when recommending and
describing websites) and markets
– If we know the interests of users belonging to a network, we could also
make recommendations
• Recommender Systems: the shift to a ternary relation between
users, resources, and tags is more complex to manage.
• Important for researchers interested in formulating strategies for
intervention and mobilisation; practitioners and
companies could also make use of this.
• Discovering the central elements in a network (users and
URLs), together with the tags used by users, could be key to
designing future strategies for the dissemination of messages and to
achieving more success in communications, for instance by making use of
important keywords to attract more attention.
• Implementation of Information Retrieval and Recommender Systems
techniques in social commerce and social media contexts.
• Applications in advertising, mobilising, etc.
• Security, social studies, market studies, consumers
• Time: longitudinal analysis
• Etc.