Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. In proceedings of the International World Wide Web Conference, Rio de Janeiro, Brazil, May 2013.
WWW2013: Web Usage Mining with Semantic Analysis
1. Web Usage Mining with Semantic Analysis
Laura Hollink, VU University Amsterdam
Peter Mika, Yahoo! Labs Barcelona
Roi Blanco, Yahoo! Labs Barcelona
2. Analysis of web user behavior
What are typical use cases? Are these carried out in a particular order?
Which use cases are not satisfied? And to which other sites do users
go?
3. Analysis of web user behavior
Figure: excerpt of raw transaction logs, in which queries (for example "moneyball trailer", "money ball movie", "oakland as") are interleaved with clicked sites (for example movies.yahoo.com, www.imdb.com, en.wikipedia.org).
Transaction logs: sessions of queries and clicks
4. Analysis of web user behavior
Are these use cases typical for all movies? Recent movies? Only for
Moneyball?
5. Why are these questions difficult to answer?
Sparsity of the event space
‣ 64% of queries are unique within a year
‣ even the most frequent patterns have extremely low support
To illustrate: top 12 most frequent sessions observed in our data:
6. Tasks
Question 1: what are typical use cases?
‣ Task 1: find sequences of events in the data that are more
frequent (have a higher support) than a threshold.
Question 2: what use cases are not satisfied?
‣ Task 2: learn to predict website abandonment from
queries and clicks.
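Task 1 can be illustrated with a minimal sketch. It assumes that each session is already a list of event labels and that a pattern is a contiguous run of events; the function name, the `max_len` and `min_support` parameters, and the example labels are hypothetical and not taken from the slides.

```python
from collections import Counter


def frequent_patterns(sessions, max_len=3, min_support=0.5):
    """Return contiguous event sequences whose support exceeds min_support.

    Support is the fraction of sessions containing the pattern at least once.
    """
    sessions = [tuple(s) for s in sessions]
    counts = Counter()
    for session in sessions:
        seen = set()  # count each pattern at most once per session
        for n in range(1, max_len + 1):
            for i in range(len(session) - n + 1):
                seen.add(session[i:i + n])
        counts.update(seen)
    total = len(sessions)
    return {p: c / total for p, c in counts.items() if c / total > min_support}


# Hypothetical sessions of abstracted events (query type + modifier, or clicked site).
example_sessions = [
    ["film", "film+trailer", "movies.yahoo.com"],
    ["film+trailer", "movies.yahoo.com"],
    ["film", "www.imdb.com"],
    ["film+trailer", "movies.yahoo.com", "www.imdb.com"],
]
patterns = frequent_patterns(example_sessions, max_len=2, min_support=0.5)
```

Counting each pattern once per session keeps support a per-session fraction, so a long session that repeats the same query does not inflate it.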
8. Data processing and linking steps
1. link queries to entities
2. select types of entities (classes)
3. detect modifier words (download, trailer, cast, date, etc.)
4. identify navigational queries
5. identify ‘losing’ queries.
Example session: "oakland as bradd pitt movie" -> "moneyball" -> movies.yahoo.com -> "oakland as" -> wikipedia.org
9. 1. Linking queries to entities in the LOD cloud
• We link one entity to each query.
• The intent of about 40% of unique Web queries is to find a particular entity
[Pound, WWW2008].
• We link to Freebase (which has a lot of movie-related information) and DBpedia
(because Wikipedia is widely used)
10. 2. Select one type per entity
• We use the Freebase API to get the semantic “types” of
each query URI
• Freebase ‘Notable types API’ is not official and not
documented.
• For repeatability and transparency, we have created our
own heuristics to select one type for each entity:
1. no internal or administrative types,
2. prefer established domains (‘Commons’) over user-defined schemas
(‘Bases’)
3. aggregate specific types into more general types
a) subtypes of location -> location
b) subtypes of award winners and nominees -> award_winner_nominee
c) prefer movie-related types over other types: film, actor,
artist, tv_program, tv_actor and location (order of decreasing
preference).
Figure: one entity linked to several candidate types.
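The heuristics on this slide can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the prefix lists, the award type identifiers, and the rule that the last path segment serves as the type label are assumptions made for the example.

```python
# Assumed prefixes for internal/administrative types and user-defined 'Bases'.
ADMIN_PREFIXES = ("/common/", "/type/", "/freebase/")
BASE_PREFIXES = ("/base/", "/user/")
# Assumed identifiers for award winner/nominee types.
AWARD_TYPES = {"/award/award_winner", "/award/award_nominee"}
# Order of decreasing preference, as listed on the slide.
PREFERENCE = ["film", "actor", "artist", "tv_program", "tv_actor", "location"]


def generalize(type_id):
    """Map a specific type identifier to a more general label."""
    if type_id.startswith("/location/"):
        return "location"
    if type_id in AWARD_TYPES:
        return "award_winner_nominee"
    return type_id.rsplit("/", 1)[-1]  # assumption: last path segment is the label


def select_type(type_ids):
    """Pick one type label for an entity, or None if no type is usable."""
    # Heuristic 1: drop internal or administrative types.
    candidates = [t for t in type_ids if not t.startswith(ADMIN_PREFIXES)]
    # Heuristic 2: prefer 'Commons' types; fall back to 'Bases' if nothing else is left.
    commons = [t for t in candidates if not t.startswith(BASE_PREFIXES)]
    labels = [generalize(t) for t in (commons or candidates)]
    # Heuristic 3: aggregate, then prefer movie-related types in the given order.
    for preferred in PREFERENCE:
        if preferred in labels:
            return preferred
    return labels[0] if labels else None
```

For example, `select_type(["/common/topic", "/film/actor", "/people/person"])` yields `"actor"` under these assumptions, because the administrative type is dropped and `actor` ranks above any non-movie type.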
11. 3. Detect modifier words in queries
Top 100 most frequent words that appear in the query log before or after
entity names [Mika ISWC2009, Pantel WWW2012].
movie, movies, theater, cast, quotes, free, theaters, watch, 2011, new, tv,
show, dvd, online, sex, video, cinema, trailer, list, theatre . . .
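A minimal sketch of this counting step, assuming each query has already been linked to an entity name and that matching is done on whitespace-separated, lowercased tokens. The function name and the example pairs are hypothetical.

```python
from collections import Counter


def modifier_counts(linked_queries):
    """Count words that appear before or after the entity name in each query.

    linked_queries: iterable of (query, entity_name) pairs.
    """
    counts = Counter()
    for query, entity in linked_queries:
        q_tokens = query.lower().split()
        e_tokens = entity.lower().split()
        n = len(e_tokens)
        if n == 0:
            continue
        for i in range(len(q_tokens) - n + 1):
            if q_tokens[i:i + n] == e_tokens:
                # Everything outside the entity mention is a candidate modifier.
                counts.update(q_tokens[:i] + q_tokens[i + n:])
                break
    return counts


# Hypothetical (query, entity name) pairs.
pairs = [
    ("moneyball trailer", "moneyball"),
    ("watch moneyball online", "moneyball"),
    ("captain america trailer", "captain america"),
    ("brad pitt", "brad pitt"),
]
counts = modifier_counts(pairs)
top_modifiers = [word for word, _ in counts.most_common(100)]
```

Matching on whole tokens rather than substrings avoids treating part of a longer word as an entity mention.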
12. 4. Identifying navigational queries
• A navigational query is a query entered with the intention of navigating to a
particular website.
• A common heuristic is to treat a query as navigational when it
matches the domain name of a clicked result.
• The “official homepage” is the value of dbpedia:homepage, dbpedia:url, and
foaf:homepage.
netflix login -> www.netflix.com
banana -> www.bananas.org
European Parliament -> europarl.europa.eu
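The two ideas above, string matching against the clicked domain and checking the click against an entity's official homepage, can be contrasted in a small sketch. The helper names are hypothetical, the domain label is taken naively as the second-to-last host label (no public-suffix handling), and the homepage values passed in are assumed to come from the properties named above.

```python
from urllib.parse import urlparse


def host_of(url_or_host):
    """Return the lowercased host without a leading 'www.'."""
    target = url_or_host if "://" in url_or_host else "//" + url_or_host
    host = (urlparse(target).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def matches_domain(query, clicked_host):
    """String heuristic: the query contains, or spells out, the domain label."""
    labels = host_of(clicked_host).split(".")
    label = labels[-2] if len(labels) >= 2 else labels[0]
    tokens = query.lower().split()
    return label in tokens or "".join(tokens) == label


def clicks_official_homepage(entity_homepages, clicked_host):
    """Semantic check: the clicked host is one of the entity's homepage hosts."""
    return host_of(clicked_host) in {host_of(h) for h in entity_homepages}
```

Under this particular sketch, `matches_domain("netflix login", "www.netflix.com")` is true, while "banana" and "European Parliament" do not match their clicked domains by string comparison alone. A homepage lookup can still recognize the latter if the linked entity lists that site as its homepage.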
13. 5. Identify ‘losing’ queries
• A ‘losing’ query is the query that leads a user to abandon a service in favor
of another service.
• Common definition: A user repeats the same query and clicks on another
result in the list.
• Our broader, semantic definition:
14. Evaluation
1. Linking to entities and types
2. Detection of frequent usage patterns
3. Prediction of website abandonment
Applied to the movie domain
• Sample of server logs of Yahoo! Search in the US
from June 2011, split into sessions.
• Only sessions that contain at least one visit to any
of 16 popular movie sites.
• 1.7 million sessions, containing over 5.8 million
queries and over 6.8 million clicks.
15. Evaluation of links to entities and types
• Compare manually created <query, entity> and <entity, type> pairs to
automatically created links.
• 2 samples: the 50 most frequent queries and 50 random queries.
Examples:
• Ambiguous query: “Green Lantern” - the movie or the fictional character?
• Wrong type: Oil peak is a serious game subject?
16. Evaluation of links to entities and types
Figure: three plots (Queries, Entities, Types), each showing frequency of occurrence.
17. Frequent usage patterns I
• Freebase:release_date property of entities.
Figure panels: Recent movies, Older movies.
19. Frequent usage patterns III
• A comparison of websites.
• Most frequent query types that lead to a click on a website.
Figure: bar charts for Website 1, Website 2 and Website 3, showing per query type (for example /film, /film/actor, /location, /tv_program) the proportion of queries that lead to a click on the website (scale 0.0 to 0.6).
Figure: proportion of queries for Website A and Website B.
20. Predicting website abandonment
• 3 classification tasks:
Given a (part of a) session in which a user is lost/gained, predict...
1. ...whether a user will be gained for a given website.
2. ...given that the session includes a given website, whether this website is in
the losing or gaining position.
3. ...given that the session includes two given websites, which one is in the
gaining position.
• Gradient Boosted Decision Trees.
21. Discussion and future work
• Mining patterns of entire queries suffers from data sparsity.
• We interpret the structure and semantics of the queries, using openly
available, up-to-date information on the Web.
• give a “semantic” definition of navigational and ‘losing’ queries
• find patterns of user behavior
• predict website abandonment
• This is the beginning:
• Use more properties of entities, more features.
• Detect more complex patterns.
• Explore other linked open datasets.