Improving Flickr discovery through Wikipedias
1. Index Introduction Folksonomies Linguistic issues Introducing Flickrpedia Concluding Remarks
Federico Gobbo
{federico.gobbo}@uninsubria.it
Università degli Studi dell’Insubria
Varese, Italy
(cc) Some rights reserved.
1 Introduction
  Why folksonomies are interesting
2 Folksonomies
  Why do folksonomies differ?
3 Linguistic issues
  Augmented folksonomies through natural language
4 Introducing Flickrpedia
  Multilingual diversity as the source of knowledge
5 Concluding Remarks
Why folksonomies are interesting
A key question of information retrieval today
How can we add meaningful metadata to web content, so as to
increase the utility of information by improving the precision of
information retrieval for search engines?
Folksonomies, a tentative answer. What are they?
folksonomy = folks + taxonomy
A folksonomy is made of tags (or labels): usually single-word
metadata attached to online items (documents, photos, videos,
etc.) in order to add contextual meaning to the items themselves.
Folksonomies are a tentative effort toward the goal of improving
the precision of information retrieval.
Why do folksonomies differ?
Folksonomies and traditional taxonomies
Unlike traditional taxonomies, folksonomies have no explicit
hierarchy between tags, nor are tags mutually exclusive. For
example, the photo of a cat may be tagged as ‘cat’, ‘european’ and
‘animal’, but nothing says that all cats are animals: tags can be
seen as common facets of the item itself (Schmitz 2006). There is
no central authority, and this is the main reason why folksonomies
are becoming more and more popular among web resource users.
The two different scopes of folksonomies
Each tag has two different scopes at the same time:
personomy, the user-defined one (Quintarelli 2005);
consensus, the socially shared meaning.
Consensus is becoming more and more important, as shown by the
widespread use of tag suggestion interfaces in web applications.
Folksonomies and the Long Tail (see the video!)
The key concept of serendipity
Consensus permits serendipity, i.e. users dig through the web via
tags, finding new, unexpected and useful content that is not easily
accessible via traditional search engines.
Tags are used as filters, i.e. a query on several tags returns the
items tagged with any of the given tags (or with all of them,
depending on the application) (Golder and Huberman 2006).
The purpose of this paper is to improve serendipity by allowing
people to dig through folksonomies regardless of the natural
language(s) they master.
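The two filter modes mentioned above (items tagged with any of the given tags, or with all of them) can be sketched as follows. This is a minimal illustration in Python, not code taken from any specific application; the `filter_items` function and the sample photos are hypothetical.

```python
def filter_items(items, query_tags, mode="any"):
    """Return the sorted names of the items matching the query tags.

    items: dict mapping an item name to its set of tags (hypothetical data model).
    mode:  "any" keeps items sharing at least one tag with the query,
           "all" keeps only items carrying every query tag.
    """
    wanted = set(query_tags)
    if mode == "any":
        return sorted(name for name, tags in items.items() if wanted & tags)
    if mode == "all":
        return sorted(name for name, tags in items.items() if wanted <= tags)
    raise ValueError("mode must be 'any' or 'all'")


# Hypothetical sample data, echoing the cat example of the earlier slide.
PHOTOS = {
    "photo1": {"cat", "european", "animal"},
    "photo2": {"cat", "garden"},
    "photo3": {"dog", "animal"},
}

any_match = filter_items(PHOTOS, ["cat", "animal"], mode="any")
all_match = filter_items(PHOTOS, ["cat", "animal"], mode="all")
```

With this sample, the "any" query matches all three photos, while the "all" query matches only `photo1`.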
Augmented folksonomies through natural language
Tags as linguistic objects
Tags are words, i.e. alphabetical strings meaningful in some
natural language. There is no controlled language. In particular,
the following features go unrecognized:
synonymy (different word strings, analogous meaning);
homography (identical word string, totally different meaning);
different encoding strategies are possible (e.g.
‘28-03-2008’, ‘2008March3’, ‘3rd March 2008’);
misspellings are very frequent, so standard NLP techniques are
ruled out.
Guy and Tonkin (2006) even advocated tag literacy education.
The linguistic divide in folksonomies
Multilingualism is an issue not yet fully explored in folksonomies.
Tags are written in a human language, and users are inclined to
write in the languages they are comfortable with.
It is certainly desirable for a user not comfortable with English
or another big language (in terms of presence on the web) to search
and find tags using a search engine interface in his or her own
tongue, while the engine searches the corresponding tags in English
and in other major human languages.
Multilingual diversity as the source of knowledge
How to overcome the linguistic divide?
A proposal: a dedicated web application that extracts the
language-tag pairs in every available language before passing the
tags to the folksonomy search engine.
The claim is an improvement in serendipity: when searching in 20
natural languages at the same time, some interesting data will be
found that a single-language search would leave undiscovered.
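The proposal above can be sketched as a two-step pipeline: expand the user's tag into language-tag pairs, then query the folksonomy with all of them at once. The Python sketch below is only an illustration under stated assumptions: the translation table and the toy index are hypothetical stand-ins for Wikipedia and for a folksonomy search engine, and the function names are invented for this example.

```python
# Hypothetical translation table; in the proposal these pairs would be
# extracted from Wikipedia rather than hard-coded.
TRANSLATIONS = {
    ("de", "Flugzeug"): {"en": "Airplane", "fr": "Avion", "eu": "Hegazkin"},
}

# Hypothetical toy folksonomy: item name -> set of lower-cased tags.
TOY_INDEX = {
    "p1": {"flugzeug"},
    "p2": {"airplane", "sky"},
    "p3": {"avion"},
    "p4": {"flower"},
}


def expand_tag(tag, lang, translations):
    """Return the language-tag pairs for a tag, including the original pair."""
    pairs = {lang: tag}
    pairs.update(translations.get((lang, tag), {}))
    return pairs


def search_any(index, tags):
    """Return the sorted names of the items carrying at least one of the tags."""
    wanted = {t.lower() for t in tags}
    return sorted(name for name, item_tags in index.items() if wanted & item_tags)


pairs = expand_tag("Flugzeug", "de", TRANSLATIONS)
single_language = search_any(TOY_INDEX, ["Flugzeug"])
multilingual = search_any(TOY_INDEX, pairs.values())
```

On the toy index, the single-language query finds only `p1`, while the multilingual query also finds `p2` and `p3`, which is the kind of gain the claim refers to.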
Flickr and its API
Flickr is one of the most popular web applications for photos (a
search for ‘flowers’ currently returns more than 2 million photos).
Photos are freely tagged by users, so it can be considered a
folksonomy.
Open-source API libraries are available for the major programming
languages, and anyone can query the Flickr repository using an
authentication key issued on request.
http://www.flickr.com/services/api
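A tag query against the Flickr API might be assembled as in the following Python sketch. It only builds the request URL and performs no network call. The endpoint and the parameter names (`flickr.photos.search`, `tags`, `tag_mode`, `per_page`) are written here as assumptions to be checked against the API documentation linked above; `build_search_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder for the key issued on request.

```python
from urllib.parse import urlencode

# Assumed REST endpoint; verify against the Flickr API documentation.
FLICKR_REST = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, tags, tag_mode="any", per_page=60):
    """Build (but do not send) a tag search request URL.

    tags:     list of tag strings, sent as one comma-separated value.
    tag_mode: "any" for an OR combination of tags, "all" for an AND.
    """
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "tags": ",".join(tags),
        "tag_mode": tag_mode,
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)


url = build_search_url("YOUR_API_KEY", ["Flugzeug", "Airplane"])
```

Keeping URL construction separate from the HTTP call makes the query easy to inspect before any request is sent.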
Flickrpedia = Flickr + Wikipedias
Flickrpedia is built on a Ruby API and on the Ruby on Rails
development framework (Thomas 2005, Thomas and
Heinemeier-Hansson 2005). Users query Flickr by entering a tag
and specifying its natural language.
The system crawls the Wikipedia in the corresponding language
and looks for an appropriate page. With the help of regular
expressions, Flickrpedia parses the web page and extracts, from
the appropriate box on that page, the language pairs for the same
topic in other languages.
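The regular-expression step might look like the Python sketch below (Flickrpedia itself is written in Ruby; this is not its code). The HTML fragment is a hypothetical sample modelled on Wikipedia's box of links to other languages; the real markup may differ and changes over time, so the pattern is an assumption, and `extract_language_pairs` is an invented name.

```python
import re
from urllib.parse import unquote

# Hypothetical fragment of the language box on the German 'Flugzeug' page.
SAMPLE_BOX = """
<li class="interwiki-en"><a href="http://en.wikipedia.org/wiki/Airplane">English</a></li>
<li class="interwiki-fr"><a href="http://fr.wikipedia.org/wiki/Avion">Français</a></li>
<li class="interwiki-eu"><a href="http://eu.wikipedia.org/wiki/Hegazkin">Euskara</a></li>
"""

# Captures the language code and the page title of each interlanguage link.
INTERWIKI = re.compile(
    r'<li class="interwiki-([a-z-]+)">\s*<a href="[^"]*/wiki/([^"#]+)"'
)


def extract_language_pairs(html):
    """Return a dict mapping language code -> page title found in the box."""
    pairs = {}
    for lang, title in INTERWIKI.findall(html):
        # Titles in URLs are percent-encoded and use underscores for spaces.
        pairs[lang] = unquote(title).replace("_", " ")
    return pairs


language_pairs = extract_language_pairs(SAMPLE_BOX)
```

On the sample fragment this yields the pairs en/Airplane, fr/Avion and eu/Hegazkin. Regular expressions are fragile against markup changes, so a production version would probably prefer an HTML parser or a structured API.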
14. How Flickrpedia works
The German user enters the query ‘Flugzeug’ (German) in Flickrpedia.
The system crawls the German Wikipedia,
parsing with the help of regular expressions.
The language pairs found: Airplane (English), Avion (French),
Hegazkin (Basque), ...
The German user obtains the desired photos from Flickr!
15. The web page box for “alternate languages” in Wikipedia
An example: the German word ‘Flugzeug’
The results of the German word ‘Flugzeug’
As of 11 April 2007, Flickr finds fewer than 10,000 photos, while
Flickrpedia finds more than 20,000 for the same query, including
many unexpected and relevant photos.
17. Don’t trust me: try it yourself!
Word searched: ‘Flugzeug’, i.e. airplane in German
http://buffy.sciva.uninsubria.it/~rl608838/search
Flickrpedia until now
Flickrpedia only has to store the list of wikipedias, one per
available natural language (currently 85). Large, extemporaneous
shared information repositories such as Flickr can be managed
through other semi-structured information repositories such as the
wikipedias.
Flickrpedia, if refined beyond its current prototype phase, may
help users with poor knowledge of major languages to retrieve
information using only their lesser-used languages.
Further direction of Flickrpedia
Flickrpedia is far from perfect: homographs are still not handled,
even though wikipedias have disambiguation pages, and it is not
clear which wikipedias to choose in order to optimize serendipity.
For now, the parsed wikipedias are the largest ones by number of
wiki pages, but this does not guarantee any increase in
serendipity.
Finally, the Flickr API imposes a severe limit: at most 20 tags can
be included in a single query request, and at most 60 thumbnails
are returned.
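One possible way to live with the 20-tag limit is to split the multilingual tag list into batches and issue one request per batch. The Python sketch below shows only the batching step; `batch_tags` is a hypothetical helper, the generated tags are dummy values, and merging the results of several requests is left out.

```python
def batch_tags(tags, limit=20):
    """Split a tag list into batches of at most `limit` tags.

    Duplicates are dropped first (keeping the first occurrence), since
    the same string may be the translation in several languages.
    """
    unique = list(dict.fromkeys(tags))
    return [unique[i:i + limit] for i in range(0, len(unique), limit)]


# Dummy tags standing in for 45 distinct translations of one query.
many_tags = ["tag%d" % i for i in range(45)]
batches = batch_tags(many_tags)
batch_sizes = [len(batch) for batch in batches]
```

Here 45 distinct tags are split into batches of 20, 20 and 5, so three requests would be needed instead of one.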
Beyond Flickrpedia
This approach isn’t limited to Flickr as the underlying folksonomy.
Our research direction is towards generalization, i.e. users should
be able to choose the appropriate folksonomy when performing
multilingual queries.
It remains to be shown how to apply this approach to folksonomies
whose semantic referents are not photos: an airplane or a flower
remains the same, more or less, in almost every human language.
The real underlying problem is how to measure serendipity:
specific and precise metrics for serendipity are needed.
Thank you. Any questions?
Download these slides at the following permalink:
http://purl.org/net/fgobbo
(cc) F. Gobbo 2007. Published in Italy.
Attribution – NonCommercial – ShareAlike 2.5