1. KDD for Personalization
PKDD 2001 Tutorial
September 6, 2001
Bamshad Mobasher - DePaul University, Chicago
Bettina Berendt - Humboldt University Berlin
Myra Spiliopoulou - Leipzig Graduate School of Management
Web Personalization
• The Problem
– dynamically serve customized content (pages, products,
recommendations, etc.) to users based on their profiles,
preferences, or expected interests
• Personalization v. Customization
– In customization, the user controls and customizes the site
or the product based on his/her preferences
– usually manual, but sometimes semi-automatic based on
a given user profile
– Personalization is done automatically based on the
user’s actions, the user’s profile, and (possibly) the
profiles of others with “similar” profiles
PKDD 2001 Tutorial: “KDD for Personalization” [I-2]
2. Customization Example
my.yahoo.com
Personalization Example
amazon.com
3. A simplified scheme for personalization
user selects information object(s)
  what kind?
  - document etc.
  - query
  how?
  - request, specification
  - rating
system recommends other information object(s) related to the selected one(s)
  why?
  - similarity (syntactic/semantic)
  - co-occurrence in other users' navigation histories
  - co-occurrence in the user's other navigation histories
Know Thy Customer: Knowledge is Power
"Relationships based on customer insight propel an organization from
simply treating customers efficiently to treating them relative to their
needs, preferences, and value potential. ...
Knowing the customer is paramount in today's marketplace where the
customer has more options, greater flexibility and higher expectations.
..."
John [...] N[...]s (Accenture), in [...]
4. Customer knowledge implies
1.) Acquisition of customer data
2.) Analysis of customer data
3.) Action in accordance with the gained insights
Acquisition of customer data
Customer data are recordings of
• preferences
• transactions
• pre-sales contacts
• after-sales support
• demographic information
Some of these data
– may be purchased from third parties
– may be lying in multiple separate databases that serve completely
different purposes
– are of varying quality
with respect to error rates, reliability, coverage, representativeness
=> Data Preparation
5. Analysis of customer data
Data analysis should provide feedback on questions like:
• Which users will become customers?
• Which customers will return again?
• Who is more likely to respond to a promotion action?
• Who would be interested in cross-sale/up-sale suggestions?
closely related to questions like:
• Is the Web site appropriately designed to serve the organisation's
goals?
• Are the customers satisfied?
• Are the customers satisfied enough to come again?
• Are the customers satisfied enough to become promoters of the site?
=> Data Mining
Action in accordance with the gained insights
• Alignment of the marketing policy
• Alignment of the supply chain, including after-sales support
• Adjustment of the web site
– Static site re-design
– Browsing/Navigation suggestions
– Recommendations on the page
– Intelligent assistance
– Personalized layout and content
The time lag between insight and action should be minimized.
6. The action should create value
• for the customer
• for the organisation
A short excursion on value creation
In B2[C?] e-commerce, it is not sufficient to
• offer an existing product through the Internet
• digitize part/all of the merchandizing chain
• introduce a brilliant new product in the market
The product must bring value to
• win the customer: Customer Conversion
• retain the customer: Customer Retention
7. The model of Kuhlen considers the following types of value [32]
(1) Comparative
(2) Improving efficiency
(3) Improving effectivity
(4) Integrative
(5) Organisational
(6) Strategic
(7) Innovative
From acquisition to action
• There is no lack of data.
– Clickstream data accumulate at a tremendous pace.
– Demographic data can be acquired.
– Customer profiles are available or can be acquired.
• There is no lack of methodologies for data analysis.
• The ability to exploit the data increases at a much slower pace
and the number of personalized Web sites is not really large.
• The tolerable elapse time between acquisition and action is low
[1?].
8. Personalization: An HCI perspective
= does personalization increase usability?
A Web site’s usability is high if users
- achieve their goals / perform their tasks in little time,
- do so with a low error rate,
- experience high subjective satisfaction.
Usability testing:
- qualitative and quantitative methods
- experts and "normal" users
- questionnaires and experiments
Usability is a special concern on the Web because
unlike with other products / software, "users experience
usability first and pay later" (Nielsen [49]).
Data Preparation for Personalization
9. Web Usage Mining
• Discovery of meaningful patterns from data
generated by client-server transactions on one or
more Web servers
• Typical Sources of Data
– automatically generated data stored in server access
logs, referrer logs, agent logs, and client-side cookies
– e-commerce and product-oriented user events (e.g.,
shopping cart changes, ad or product click-throughs,
etc.)
– user profiles and/or user ratings
– meta-data, page attributes, page content, site structure
What’s in a Typical Server Log?
<ip_addr> <base_url> - <date> <method> <file> <protocol> <code> <bytes> <referrer> <user_agent>
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:23 -0600] "GET /Calls/Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:24 -0600] "GET /Calls/Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:25 -0600] "GET /Calls/Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:31 -0600] "GET / HTTP/1.0" 200 4980 "" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:33:11 -0600] "GET /CP.html HTTP/1.0" 200 3218 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
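A log entry in this layout can be split into its fields with a regular expression. The sketch below is illustrative only: the pattern, the function name parse_entry and the dictionary keys are assumptions chosen to mirror the field names on the slide, and real server logs may deviate from this exact layout.

```python
import re
from datetime import datetime

# Field names mirror the slide's format line; the regex itself is an assumption.
LOG_PATTERN = re.compile(
    r'(?P<ip_addr>\S+) (?P<base_url>\S+) (?P<ident>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_entry(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp (with UTC offset) and the status code to typed values.
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["code"] = int(entry["code"])
    return entry

# First entry of the sample log, joined onto one line.
SAMPLE = ('203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:21 -0600] '
          '"GET /Calls/OWOM.html HTTP/1.0" 200 3942 '
          '"http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology&maxhits=20&cat=dir" '
          '"Mozilla/4.5 [en] (Win98; I)"')
entry = parse_entry(SAMPLE)
```

The referrer of this entry shows that the visitor arrived from a search engine query, which is the kind of information later preprocessing steps rely on.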
10. The Web Usage Mining Process
Raw Usage Data -> Preprocessing -> Preprocessed Clickstream Data -> Pattern Discovery -> Rules, Patterns, and Statistics -> Pattern Analysis -> "Interesting" Rules, Patterns, and Statistics
Additional input: Content and Structure Data
Usage Data Preprocessing
Raw Usage Data -> Data Cleaning -> User/Session Identification -> Page View Identification -> Path Completion -> Server Session File -> Episode Identification -> Episode File
Other elements in the diagram: Site Structure and Content, Usage Statistics
11. Data Preprocessing for Web Usage Mining
• Data cleaning
– remove irrelevant references and fields in server logs
– remove references due to spider navigation
– remove erroneous references
– add missing references due to caching (done after
sessionization)
• Data integration
– synchronize data from multiple server logs
– integrate e-commerce and application server data
– integrate meta-data (e.g., content labels)
– integrate demographic / registration data
Data Preparation for Web Usage Mining
(Cooley, Mobasher, Srivastava, 1999 [15])
• Data Transformation
– user identification
– sessionization / episode identification
– pageview identification
• a pageview is a set of page files and associated objects
that contribute to a single display in a Web Browser
• Data Reduction
– sampling and dimensionality reduction (ignoring
certain pageviews / items)
• Identifying User Transactions (i.e., sets or sequences
of pageviews possibly with associated weights)
12. User and Session Identification: Need for
Reliable Usage Data
• Validity of results in Web usage mining is affected by
the ability to:
– distinguish among different users to a site
– reconstruct the activities of the users within the site
• Difficulties in obtaining reliable usage data
– proxy servers and anonymizers
– rotating IP addresses for connections through ISPs
– missing references due to caching
– inability of servers to distinguish among different visits
Identifying Users and Sessions
• Server log L is a list of log entries each containing
timestamp, host identifier, URL request (including
URL stem and query), referrer, agent, cookie, etc.
• User identification and sessionization
– user activity log is a sequence of log entries in L
belonging to the same user
– user identification is the process of partitioning L into
a set of user activity logs
– the goal of sessionization is to further partition each
user activity log into sequences of entries
corresponding to each user visit
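User identification as a partition of L can be sketched in a few lines. The example assumes that log entries are dictionaries with hypothetical keys "host", "agent" and "time", and it uses the host/agent pair as the user key; that key is only one possible choice and, as the slides note elsewhere, not a reliable one.

```python
from collections import defaultdict

def identify_users(log, key=lambda e: (e["host"], e["agent"])):
    """Partition the log into user activity logs, one time-ordered list per user key."""
    activity = defaultdict(list)
    for entry in sorted(log, key=lambda e: e["time"]):
        activity[key(entry)].append(entry)
    return dict(activity)

log = [
    {"host": "203.30.5.145", "agent": "Mozilla/4.5", "time": 2, "url": "/b"},
    {"host": "203.252.234.33", "agent": "Mozilla/4.06", "time": 1, "url": "/"},
    {"host": "203.30.5.145", "agent": "Mozilla/4.5", "time": 1, "url": "/a"},
]
users = identify_users(log)
```

Every entry ends up in exactly one user activity log, so the result is a partition of L; sessionization then splits each of these lists further.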
13. Sessionization Heuristics
• Real v. Constructed Sessions
– Conceptually, the log L is partitioned into an ordered
collection of “real” sessions R
– Each heuristic h partitions L into an ordered collection
of “constructed sessions” Ch
– The ideal heuristic h*: Ch* = R
• Two Basic Types of Sessionization Heuristics
– Time-oriented heuristics
– Navigation-oriented heuristics
Time-Oriented Heuristics
• Consider boundaries on time spent on individual
pages or on the entire site during a single visit
– Boundaries can be based on a maximum session
length or maximum time allowable for each pageview
– Additional granularity can be obtained by using
different boundaries for different (types of) pageviews
h1: Given t0, the timestamp of the first request in a
constructed session S, and a threshold θ, a request with
timestamp t is assigned to S iff t - t0 ≤ θ.
h2: Given t1, the timestamp of a request in a constructed
session S, and a threshold δ, the next request with
timestamp t2 is assigned to S iff t2 - t1 ≤ δ.
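The two time-oriented heuristics differ only in the reference timestamp they compare against. The following sketch assumes that the requests of one user are given as a sorted list of timestamps in seconds; the function names are hypothetical.

```python
def sessionize_h1(timestamps, theta):
    """h1: a request joins the session iff it is within theta of the session's first request."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][0] <= theta:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def sessionize_h2(timestamps, delta):
    """h2: a request joins the session iff it is within delta of the previous request."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= delta:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

requests = [0, 500, 1700, 2000, 4000]        # seconds, one user
by_h1 = sessionize_h1(requests, theta=1800)  # 30-minute total session length
by_h2 = sessionize_h2(requests, delta=600)   # 10-minute page-stay limit
```

On this input h1 yields [[0, 500, 1700], [2000], [4000]] and h2 yields [[0, 500], [1700, 2000], [4000]], which shows how the same log is cut differently by the two heuristics.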
14. Navigation-Oriented Heuristics
• Take the linkage between pages into account
– “linkage” can be based on site topology (e.g., split a
session at a request that could not have been reached
from previous requests in the session)
– or can be usage-based (using referrers in log entries)
• usually more restrictive than topology-based heuristics
and more difficult to implement in frame-based sites
href: Given two consecutive requests p and q, with p
belonging to constructed session S. Then q is assigned
to S, if the referrer for q was previously invoked in S, or if
the referrer for q is “undefined” and tq - tp ≤ ∆ (time delay
∆ is to allow for proper loading of frameset pages).
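A minimal sketch of the referrer-based heuristic follows. It assumes that the requests of one user are given in time order as (url, referrer, timestamp) tuples, with None standing for an undefined referrer; the function name and the data layout are illustrative assumptions.

```python
def sessionize_href(requests, delta):
    """Assign each request to the current session if its referrer was already
    requested in that session, or if the referrer is undefined and the request
    follows the previous one within delta; otherwise start a new session."""
    sessions = []
    prev_time = None
    for url, referrer, t in requests:
        current = sessions[-1] if sessions else None
        same_session = current is not None and (
            (referrer is not None and referrer in current)
            or (referrer is None and t - prev_time <= delta)
        )
        if same_session:
            current.append(url)
        else:
            sessions.append([url])
        prev_time = t
    return sessions

requests = [
    ("/", None, 0), ("/a", "/", 5), ("/b", "/a", 20),
    ("/x", "http://ext", 30), ("/y", "/x", 40),
    ("/z", None, 45), ("/w", None, 500),
]
sessions = sessionize_href(requests, delta=10)
```

Here the request for /x starts a new session because its referrer was never requested in the first one, while /z stays in the second session only because it arrives within the time delay.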
Measures for Sessionization Accuracy
(Berendt, Mobasher, Spiliopoulou, 2001 [7])
• A heuristic h maps entries in the log L into
elements of constructed sessions, such that:
– (a) each entry in L is mapped to exactly one element
of a constructed session
– (b) the mapping is order-preserving
• Measures quantify the successful mappings of real
sessions to constructed sessions
– a measure M evaluates a heuristic h based on the
differences between Ch and R
– each measure assigns to h a value M(h) ∈ [0,1] so
that M(h*) = 1
15. Measures for Sessionization Accuracy
• Categorical and Gradual Measures
– categorical measures: based on the number of real
sessions that are reconstructed by the heuristics
– gradual measures: based on the degree to which the
real sessions are reconstructed by the heuristics
Categorical Measures
• Based on the notion of “complete reconstruction”
– a real session is completely reconstructed if all its
elements are contained in the same constructed
session
– the measure Mcr(h) is the ratio of the number of
completely reconstructed real sessions in Ch to the
total number of real sessions |R|
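The complete-reconstruction measure can be computed directly from the two collections of sessions. The sketch assumes that each session is a list of log-entry identifiers; the function name m_cr is chosen for illustration.

```python
def m_cr(real_sessions, constructed_sessions):
    """Share of real sessions whose elements all lie in one constructed session."""
    constructed = [set(c) for c in constructed_sessions]
    complete = sum(
        1 for r in real_sessions
        if any(set(r) <= c for c in constructed)
    )
    return complete / len(real_sessions)

real = [[1, 2, 3], [4, 5], [6]]
constructed = [[1, 2], [3, 4, 5], [6]]
score = m_cr(real, constructed)
```

In this example the first real session is split across two constructed sessions, so only two of the three real sessions count as completely reconstructed.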
16. Categorical Measures
• Derived categorical measures:
– Mcrs considers only completely reconstructed real
sessions whose first element is also the first element of
a constructed session
– Mcre considers only completely reconstructed real
sessions whose last element is also the last element of
a constructed session
– Mcrse considers only completely reconstructed real
sessions with correct starts and ends
• in absence of overlapping real sessions for individual
users, this gives the number of constructed sessions
that are identical to corresponding real sessions
Gradual Measures
• Allow for measuring partial overlaps between real
and constructed sessions
– degree of overlap between real sessions r and
constructed session c, dego(r,c), is the number of
elements they have in common divided by total
number of elements in r.
– degree of overlap for a real session r is the maximum
dego(r,c) over all constructed sessions c.
– the measure Mo(h) is the average degree of overlap
over all real sessions
– if a real session is completely reconstructed, its
overlap degree is 1
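The degree of overlap and the resulting measure can be sketched as follows, again assuming sessions are lists of log-entry identifiers (function names are hypothetical).

```python
def deg_overlap(r, c):
    """Number of shared elements divided by the size of the real session r."""
    r, c = set(r), set(c)
    return len(r & c) / len(r)

def m_o(real_sessions, constructed_sessions):
    """Average, over all real sessions, of the best overlap with any constructed session."""
    return sum(
        max(deg_overlap(r, c) for c in constructed_sessions)
        for r in real_sessions
    ) / len(real_sessions)

real = [[1, 2, 3], [4, 5], [6]]
constructed = [[1, 2], [3, 4, 5], [6]]
overlap_score = m_o(real, constructed)
```

The split session [1, 2, 3] still earns an overlap of 2/3, so the heuristic is credited for partial success instead of scoring zero as under the categorical measure.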
17. Gradual Measures
• To take the size of constructed session into account,
we define the degree of similarity
– degs(r,c) = | r ∩ c | / | r ∪ c |
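– Ms(h) is the average degree of similarity over all real
sessions
– if a constructed session is identical to a real session, the
similarity degree is 1; a real session that is completely
reconstructed inside a larger constructed session scores below 1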
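The similarity degree penalizes constructed sessions that are larger than the real session they cover. A small sketch, assuming sessions are lists of log-entry identifiers and that Ms takes the best-matching constructed session for each real session (by analogy with the overlap measure):

```python
def deg_similarity(r, c):
    """|r ∩ c| / |r ∪ c| for a real session r and a constructed session c."""
    r, c = set(r), set(c)
    return len(r & c) / len(r | c)

def m_s(real_sessions, constructed_sessions):
    """Average, over all real sessions, of the best similarity with any constructed session."""
    return sum(
        max(deg_similarity(r, c) for c in constructed_sessions)
        for r in real_sessions
    ) / len(real_sessions)

real = [[1, 2, 3], [4, 5], [6]]
constructed = [[1, 2], [3, 4, 5], [6]]
similarity_score = m_s(real, constructed)
```

The real session [4, 5] is fully contained in the constructed session [3, 4, 5], yet its similarity is only 2/3 because the constructed session also contains a foreign element.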
Which Measures?
• The choice of the measures depends on the goals of
usage analysis, for example:
– “complete reconstruction” may be appropriate for
clustering and association-based analyses (it correctly
shows set of pages accessed together)
• it also preserves sequential order of accesses, so it can
be used for the analysis of users’ navigational behavior
– Mcrs: useful for analyzing access to entry points
– Mcre: useful for analyzing access to exit points
– overlap-based measures can be useful for comparing
overall effectiveness of sessionization heuristics in
grouping pages or objects
18. Which Sessionization Heuristics?
• The choice of sessionization heuristic depends on
the characteristics of the data
– if individual users visit the site in short but temporally
dense sessions, h2 may perform better than h1
– in cases when timestamps are not reliable (e.g., using
integrated data across many log files), href may be a
better choice for sessionization
– referrer-based heuristics tend to perform worse in
highly dynamic, frame-based sites
Comparison of Sessionization Heuristics
[Figure: bar chart comparing h1-30, h2-10 and h-ref on the measures M_cr, M_crs, M_cre, M_crse, M_o and M_s; vertical axis from 0.50 to 1.00]
• cookies used to identify unique users
• server generated session variable used to identify “real” sessions
• site was frame-based and highly dynamic
• thresholds of 30 and 10 minutes were used for h1 and h2, respectively
• href performed poorly, due to propagated errors in misclassified frameset references
• 30% of users had multiple IP addresses (coming from behind proxy servers)
19. Mechanisms for User Identification
Method | Description | Privacy Concerns | Advantages | Disadvantages
IP Address + Agent | Assume each unique IP address/Agent pair is a unique user | Low | Always available. No additional technology required. | Not guaranteed to be unique. Defeated by rotating IPs.
Embedded Session Ids | Use dynamically generated pages to associate ID with every hyperlink | Low to medium | Always available. Independent of IP addresses. | Cannot capture repeat visitors. Additional overhead for dynamic pages.
Registration | User explicitly logs in to the site. | Medium | Can track individuals, not just browsers. | Many users won't register. Not available before registration.
Cookie | Save ID on the client machine. | Medium to high | Can track repeat visits from same browser. | Can be turned off by users.
Software Agents | Program loaded into browser and sends back usage data. | High | Accurate usage data for a single site. | Likely to be rejected by users.
Impact of User Identification Heuristics
These experiments show the impact of using the IP+Agent heuristic for user
identification on sessionization heuristics (as compared to cookies)
[Figure: two bar charts, h1-30-real vs. h1-30-ipa and h-ref-real vs. h-ref-ipa, over the measures M_cr, M_crs, M_cre, M_crse, M_o and M_s; vertical axes from 0.30 to 1.00]
20. Inferring User Transactions from Sessions
• Observation: reference lengths follow an exponential
distribution
• Page types correlate with reference lengths
• Page types: navigational, content, or hybrid
• Can automatically classify pages as navigational or content
using statistical modeling
• A transaction can be defined as an intra-session path ending in a
content page, or as a set of content pages in a session
[Figure: histogram of page reference lengths (secs), with regions marked for navigational pages and content pages]
Path Completion
• Refers to the problem of inferring missing user
references due to caching.
• Effective path completion requires extensive
knowledge of the link structure within the site
• Referrer information in server logs can also be used
in disambiguating the inferred paths.
• Problem gets much more complicated in frame-
based sites.
21. Path Completion - An Example
A user's navigation path:
A => B => D => E => D => B => C
Resulting server log:
URL  Referrer
A    --
B    A
D    B
E    D
C    B
[Figure: site graph with pages A, B, C, D, E, F]
• There may be multiple candidates for completing the path.
For example consider the two paths : E => D => B => C and
E => D => B => A => C.
• In this case, the referrer field allows us to partially
disambiguate. But, what about: E => D => B => A => B => C?
• One heuristic: always take the path that requires the fewest
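One simple way to complete a path from the logged requests is to backtrack through the pages already seen until the referrer of the next request is found. The sketch below is an illustration under stated assumptions, not the tutorial's algorithm: it assumes back-button navigation only, ignores the site's link structure, and always backtracks to the most recent occurrence of the referrer, which adds the fewest pages.

```python
def complete_path(requests):
    """requests: list of (url, referrer) pairs in log order; referrer is None if missing.
    Returns the navigation path with cached backward steps re-inserted."""
    path = []
    for url, referrer in requests:
        if path and referrer is not None and path[-1] != referrer:
            # Search the history backwards for the most recent visit to the referrer.
            for back in range(len(path) - 2, -1, -1):
                if path[back] == referrer:
                    # Pages shown from the cache while the user went back.
                    path.extend(reversed(path[back:-1]))
                    break
        path.append(url)
    return path

logged = [("A", None), ("B", "A"), ("D", "B"), ("E", "D"), ("C", "B")]
completed = complete_path(logged)
```

For the log of the example this returns A, B, D, E, D, B, C, i.e. the user's actual navigation path including the two page views served from the cache.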
Integrating E-Commerce Events
• Either product oriented or visit oriented
• Not necessarily a one-to-one correspondence with
user actions
• Used to track and analyze conversion of browsers to
buyers
• Major difficulty for E-commerce events is defining
and implementing the events for a site
– however, in contrast to clickstream data, getting
reliable preprocessed data is not a problem
• Another major challenge is the successful
integration with clickstream data
22. Product-Oriented Events
• Product View
– Occurs every time a product is displayed on a
pageview
– Typical Types: Image, Link, Text
• Product Click-through
– Occurs every time a user “clicks” on a product to get
more information
• Category click-through
• Product detail or extra detail (e.g. large image) click-
through
• Advertisement click-through
Product-Oriented Events
• Shopping Cart Changes
– Shopping Cart Add or Remove
– Shopping Cart Change - quantity or other feature (e.g.
size) is changed
• Product Buy or Bid
– Separate buy event occurs for each product in the
shopping cart
– Auction sites can track bid events in addition to the
product purchases
23. Content and Structure Preprocessing
• Processing the content and structure of the site is
often essential for successful usage analysis
• Two primary tasks:
– determine what constitutes a unique page file (i.e.,
pageview)
– represent content and structure of the pages in a
quantifiable form
Content and Structure Preprocessing
• Basic elements in content and structure processing
– creation of a site map
• captures linkage and frame structure of the site
• also needs to identify script templates for dynamically
generated pages
– extracting important content elements in pages
• meta-information, keywords, internal and external links,
etc.
– identifying and classifying pages based on their
content and structural characteristics
24. Quantifying Content and Structure
• Static Pages
– All of the information is contained within the HTML files for
a site
– Each file can be parsed to get a list of links, frames,
images, and text
– Files can be obtained through the file system, or HTTP
requests from an automated agent (site spider)
Quantifying Content and Structure
• Dynamic Pages
– Pages do not exist until they are created due to a
specific request
– Relevant information can come from a variety of
sources: Templates, databases, scripts, HTML, etc.
– Three methods of obtaining content and structure
information:
• Series of HTTP requests from a site mapping tool
• Compile information from internal sources
• Content server tools
25. Integrating content and structure I
Domain knowledge: content
- purpose: group pages by their content
- method: analyze text, meta-tags, and/or URL (query string)
- grouping by classification or clustering
Concept hierarchies
Example of a content-based concept hierarchy:
Entertainment
  Performing Arts
  Music
    Artists
    Genres
      Blues
      Jazz
      New Age
      ...
    New Releases
    ...
  ...
Integrating content and structure II
Content profiles from feature clusters
1. vector space model: each unique word in corpus = one dimension,
each page(view) is a vector with a non-zero weight for each word
in that page(view), zero weight for other words
2. feature - pageview matrix (note: "feature" = word,
"pageview" because of frames)
music jazz artist ...
pv1 1.00 0.80 0.05
pv2 1.00 0.00 0.70
...
3. features as weighted vectors of pageviews
jazz = [ <pv1,0.80>, <pv2,0.00>, ... ]
4. group features -> feature clusters -> content profiles
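Step 3 amounts to transposing the feature-pageview matrix. A small sketch using the weights from the matrix above; the dictionary layout and the function name are assumptions made for illustration.

```python
# Rows of the feature-pageview matrix: one weight per word for each pageview.
pageviews = {
    "pv1": {"music": 1.00, "jazz": 0.80, "artist": 0.05},
    "pv2": {"music": 1.00, "jazz": 0.00, "artist": 0.70},
}

def feature_vectors(pageviews):
    """Transpose the matrix: each feature becomes a list of (pageview, weight) pairs."""
    features = {}
    for pv, weights in pageviews.items():
        for word, weight in weights.items():
            features.setdefault(word, []).append((pv, weight))
    return features

features = feature_vectors(pageviews)
```

The entry for "jazz" is then [("pv1", 0.80), ("pv2", 0.00)], matching the vector shown in step 3; clustering these vectors (step 4) is not covered by the sketch.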
26. Integrating content and structure III
Structure
- purpose: group pages by their hyperlink structure
- ex. page types in Pirolli et al. [54] and Cooley et al. [15]:
head, navigation, content, look-up, personal
- ex. path distance to a reference page
A.html -> B.html -> C.html
(distance from A.html: dA = 1 for B.html, dA = 2 for C.html)
- structure as weighted vector of page(view)s
S = [ <A.html,0>, <B.html,1>, <C.html,0>, ... ] (only B is a content page)
S = [ <A.html,0>, <B.html,1>, <C.html,3>, ... ] (path distances)
- grouping by classification or clustering
Relating content and structure to mined usage I :
Content/structure mining as pre-/post-processing steps
Ex. online catalog search (Berendt & Spiliopoulou [8, 6]):
1. service-based concept hierarchy: which query options?
Info on schools
  indiv. school
  list of schools
    1 parameter: Location, Name, ...
    2 parameters: Location+Name, ...
    3 parameters: ...
  ...
27. Relating content and structure to mined usage I
2. discovering and comparing navigation patterns in classified pages
part of a resulting WUM navigation pattern: [figure not reproduced]
Relating content and structure to mined usage I
Ex. WebSIFT Information Filter (from Cooley [14]):
Mined knowledge | Domain knowledge source | Interesting belief example
general usage statistics | site structure | The head page is not the most common entry point
general usage statistics | site content | A page designed to provide content is being used as a navigation page
frequent itemsets | site structure | A set of pages is frequently accessed together, but not directly linked
usage clusters | site content | A usage cluster contains pages from multiple content categories
=> discover patterns at different levels of abstraction, discover
deviations from intended usage
28. Relating content and structure to mined usage II :
Usage, content, and structure mining as 3 ways
of deriving a common kind of representation
Mobasher, Dai, Luo, Sun, & Zhu [44]
- a vector of tuples <pageview,weight>:
usage: sessions / visits, or parts of them (past + current)
content: features
structure: pages and their characteristics
- unordered or ordered collections
=> identify clusters that are similar,
where similarity is by usage, content, or structure
Pattern Discovery for Personalization
Myra Spiliopoulou, HHL
29. We identify the following aspects of the personalization services, which can
be envisaged as the result of pattern discovery:
Visibility
• personal recommendation
• silent dynamic adjustment
• static page/site adjustment
Service element
• (link to) page
• application object
Matching based on
• user profiles
• user ratings
• user behaviour
• content of objects
Acquisition [...]
• all steps on-line
• off-line pattern discovery & on-line matching
Pattern Discovery: Adaptive web sites
The approach of Perkowitz & Etzioni [52, 53]
The IndexFinder consists of three phases:
1. Log processing: establishment of sessions as sets of page requests
2. Cluster mining: grouping of co-occurring non-linked pages with the help
of the site graph
3. Conceptual clustering
– The representative concept of each cluster is identified.
– Cluster members not adhering to this concept are removed from
the cluster.
– Pages adhering to this concept and not appearing in the cluster
are attached to the cluster.
30. For each cluster, the IndexFinder presents to the Web designer
• an index page with links to all pages of the cluster
The Web designer decides
– whether the new page should be established
– what its label should be
– where it should be located in the site
According to our categorization
Visibility: static page/site adjustment
Service element: page containing a single application object
Matching based on: user behaviour and page content
Off-line pattern discovery
Pattern Discovery for Recommendations
The collaborative filtering approach
Main idea: The objects suggested to a user are those preferred by users
similar to her.
1. The user's transaction is matched against logged transactions.
2. The matches are ranked.
3. The best (set of) match(es) are selected.
4. The objects that were shown in the selected transactions are
ranked, excluding objects already seen.
5. The objects with the highest rank are shown to the user.
All steps on-line
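The five steps of the collaborative filtering approach on this slide can be sketched as a nearest-neighbour lookup. The example is a hedged illustration: transactions are assumed to be plain sets of object identifiers, similarity is measured with the Jaccard coefficient (the slide does not name a measure), and the function name and parameters k and top_n are hypothetical.

```python
def collaborative_recommend(active, logged, k=2, top_n=3):
    """Recommend objects from the k logged transactions most similar to the active one."""
    active = set(active)

    def similarity(transaction):
        t = set(transaction)
        union = active | t
        return len(active & t) / len(union) if union else 0.0

    # Steps 1-2: match the active transaction against the log and rank the matches.
    ranked = sorted(logged, key=similarity, reverse=True)
    # Step 3: select the best matches.
    best = [t for t in ranked[:k] if similarity(t) > 0]
    # Step 4: rank the objects of the selected transactions, excluding those already seen.
    scores = {}
    for t in best:
        for obj in set(t) - active:
            scores[obj] = scores.get(obj, 0.0) + similarity(t)
    # Step 5: return the highest-ranked objects.
    return sorted(scores, key=lambda o: (-scores[o], o))[:top_n]

logged = [["a", "b", "c"], ["a", "d"], ["x", "y"], ["a", "b", "c", "e"]]
suggestions = collaborative_recommend(["a", "b"], logged)
```

Because every step runs against the full log at request time, the cost grows with the number of logged transactions, which is the motivation for the off-line approaches discussed next.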
31. Pattern Discovery for Recommendations
The Data Mining approach
Main idea: User similarity can be defined in terms of behaviour,
interests, preferences that can be modelled off-line
1. Pattern discovery over the log data
2. The contents of the user's transaction are matched against
the discovered patterns.
3. The matches are ranked.
4. The objects associated with the best matches are ranked,
excluding objects already seen.
5. The objects with the highest rank are shown to the user.
so that
=> The voluminous log is processed off-line.
=> On-line matching is performed only against the patterns.
Pattern Discovery: Recommendations on correlated items
The approach of Vucetic and Obradovic [?0]
The recommendation problem is defined as:
Given the ratings of the active user on a set of items, which will
be her ratings on the remaining items?
Main idea: The ratings of an item can be predicted from the ratings
on correlated items.
According to our categorization
Visibility: personal recommendation
Service element: application object
Matching based on: ratings of correlated items
Off-line discovery of predictors for the impact of item correlation on ratings
32. Methodology
• The rating of an item given another item is approximated using
a linear function (named "expert").
• The average correlation among pairs of items is approximated using
random sampling over the user ratings.
• A weighting scheme is proposed to deal with the fact that users with
similar preferences may provide different ratings for the same set of
items.
In this scheme:
– The linear experts for all pairs of items can be computed off-line.
– The ratings for an active user are predicted from the set of pairs of
items rather than the set of user ratings.
Pattern Discovery: Repeat-buying theory for personalization
The approach of Geyer-Schulz et al. [2?]
Main idea:
=> Recommendations are based on correlated products.
=> Correlations can be identified with Ehrenberg's repeat-buying theory,
=> after adjusting it to the particularities of anonymous user sessions.
According to our categorization
Visibility: recommendation of information products
Service element: application object or URL
Matching based on: user preferences for application objects
Off-line discovery of correlated application objects
33. Ehrenberg's repeat-buying theory
– predicts buyer behaviour from (a) penetration and (b) average
purchase frequency of an item
– by providing a reference model that characterizes repeated
co-occurring purchases of items as random or not random
where
penetration refers to the preference of a customer for a brand
average purchase frequency refers to repeated purchases of the
item, ignoring characteristics of the item, amount of the item and
size of the purchase as a whole.
Assumptions of [2?]
– The probability of r co-occurrences of two products in subsequent
purchases follows a logarithmic series distribution.
– Subsequent purchases of the same customer(s) can be observed as
equivalent to a set of purchase sessions during the log period.
Methodology
• Computation of the frequency distributions of all co-occurrences of
product pairs, counting one co-occurrence per session only
• Elimination of distributions with a small number of observations
• Elimination of the [...] percentile of the repeat-buy pairs
• Computation of the co-occurrence predictor for each pair
so that outliers for the predictor can be observed as correlated items.
34. Pattern Discovery: Association mining for personalization
Basic Idea: match left-hand side of rules with the active user
session and recommend items in the rule’s consequent
Essential to store patterns in efficient data structures
• searching all rules in real time is computationally
inefficient
Ordering of accessed pages is not taken into account
Good recommendation accuracy, but the main problem is
“coverage”
• high support thresholds lead to low coverage and may
eliminate important, but infrequent items from consideration
• low support thresholds result in very large model sizes and
computationally expensive pattern discovery phase
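The basic idea, matching rule antecedents against the active session, can be sketched as below. Rules are assumed to be (antecedent, consequent, confidence) triples, and the function name and the scoring by maximum confidence are illustrative assumptions. The sketch scans all rules linearly, which is precisely the inefficiency the slide warns about; it shows the matching logic, not a production data structure.

```python
def recommend(rules, active_session, top_n=3):
    """Return items from the consequents of all rules whose antecedent is
    contained in the active session, ordered by the best rule confidence."""
    session = set(active_session)
    scores = {}
    for antecedent, consequent, confidence in rules:
        if set(antecedent) <= session:
            for item in consequent:
                if item not in session:  # do not recommend pages already visited
                    scores[item] = max(scores.get(item, 0.0), confidence)
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

rules = [
    ({"A", "B"}, {"C"}, 0.8),
    ({"A"}, {"D"}, 0.6),
    ({"E"}, {"F"}, 0.9),
    ({"B"}, {"A", "C"}, 0.5),
]
recommended = recommend(rules, ["A", "B"])
```

Note that the session is treated as a set, so the order in which the pages were accessed plays no role, in line with the remark on the slide.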
Association Mining - Basic Concepts
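We start with a set I of items and a set D of transactions.
A transaction T is a set of items (a subset of I):
$I = \{i_1, i_2, \ldots, i_m\}, \qquad T \subseteq I$
An Association Rule is an implication on itemsets X and Y,
denoted by X ==> Y, where
$X \subseteq I, \quad Y \subseteq I, \quad X \cap Y = \emptyset$
The rule meets a minimum confidence of c, meaning that at least
c% of the transactions in D that contain X also contain Y. In
addition, the itemset X ∪ Y must satisfy a minimum support of s.
Writing |Z| for the number of transactions in D that contain
itemset Z, and |D| for the total number of transactions:
$s \le \frac{|X \cup Y|}{|D|}, \qquad c \le \frac{|X \cup Y|}{|X|}$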
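The two thresholds can be checked directly on a small transaction set. The following Python sketch computes support and confidence as fractions; the function names and the toy sessions are assumptions made for illustration.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]

def support(itemset: Transaction, transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x: Transaction, y: Transaction,
               transactions: List[Transaction]) -> float:
    """Among the transactions that contain X, the fraction that also contain Y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0  # the rule never fires; treat its confidence as 0
    return sum(1 for t in with_x if y <= t) / len(with_x)

# Hypothetical toy data: four sessions over pages A-D.
D = [frozenset(t) for t in (("A", "B", "C"), ("A", "B"), ("A", "C"), ("B", "D"))]
X, Y = frozenset({"A"}), frozenset({"B"})

print(support(X | Y, D))    # 0.5: 2 of 4 sessions contain both A and B
print(confidence(X, Y, D))  # 2/3: 2 of the 3 sessions with A also contain B
```

A rule X ==> Y would be kept only if both values reach the chosen minimum support and minimum confidence.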
35. Pattern Discovery Associate / associated items and users
The approach of Lin, Alvarez & Ruiz [3?]
Main idea
=> Users are associated to each other in terms of how they rate items.
=> Items are associated to each other with respect to user preferences.
Associations among items can be found off-line.
Associations to the active user can be found on-line.
According to our categorization
Visibility: Personal recommendation
Service element: application object
Matching based on: associations among items and among users
On-line discovery of assoc. rules with given RHS
Methodology
• Recommendations are subject to minimum confidence and minimum
number of rules constraints.
• The miner discovers association rules iteratively, until the desired
number of rules is extracted.
The support cutoff is adjusted in each iteration.
• Rules concern both items and users
User1: like AND User2: dislike ==> TargetUser: like
Item1: like AND Item2: like ==> TargetItem: like
• Candidate items are computed from associations involving users
similar to the active user. (on-line)
• Scores of items are computed from associations … user
preferences. (off-line)
• The candidate items with the highest scores are suggested to the active
user. (on-line)
36. Pattern Discovery Association mining for personalization
The approach of Mobasher, et al, 2001 [45]
Main Idea: avoid offline generation of all association rules;
generate recommendations directly from itemsets
• discovered frequent itemsets are stored in an “itemset
graph” (an extension of the lexicographic tree structure of
Agrawal, et al 1999 [2])
• recommendation generation can be done in constant time
by doing a directed search to a limited depth
According to our categorization
Visibility: Personal recommendations or silent dynamic adjustment
Service element: pageview
Matching based on: user behaviour
Methodology:
• Construct Frequent Itemset Graph
– each node at depth d in the graph corresponds to an
itemset I of size d and is linked to the itemsets of size d+1
(at level d+1) that contain I
– the single root node at level 0 corresponds to the empty
itemset
• frequent itemsets are matched against a user's active
session S by performing a search of the graph to depth |S|
• a recommendation r is an item such that S ∪ {r} is a frequent
itemset at level |S|+1; its recommendation score is the
confidence of the rule S ==> r
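The matching step can be sketched in Python as follows. This is a simplified illustration, not the tutorial's implementation: a dictionary keyed by itemset stands in for the graph (an explicit graph would follow child links instead of scanning), the brute-force counting is only suitable for toy data, and all names and sessions are assumptions.

```python
from itertools import combinations

def frequent_itemsets(sessions, min_count):
    """Brute-force support counts for all itemsets (toy data only: the
    enumeration is exponential in session length). The empty itemset,
    i.e. the root at level 0, is counted once per session."""
    counts = {}
    for session in sessions:
        items = sorted(set(session))
        for size in range(len(items) + 1):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {k: c for k, c in counts.items() if c >= min_count}

def recommend(active_session, freq):
    """Items r with S ∪ {r} frequent, scored by confidence(S ==> r)."""
    s = frozenset(active_session)
    if s not in freq:          # S itself is not frequent: no node to start from
        return []
    recs = []
    for itemset, count in freq.items():
        if len(itemset) == len(s) + 1 and s < itemset:  # child of S, one level down
            (r,) = itemset - s
            recs.append((r, count / freq[s]))
    return sorted(recs, key=lambda pair: (-pair[1], pair[0]))

sessions = [["A", "B", "C"], ["A", "B"], ["A", "B", "D"], ["A", "C"], ["B", "C"]]
freq = frequent_itemsets(sessions, min_count=2)
print(recommend(["A"], freq))   # [('B', 0.75), ('C', 0.5)]
```

No association rules are materialized here: the confidence is obtained from the two stored support counts at lookup time.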
37. Pattern Discovery Sequence mining for personalization
Main Idea: take the ordering of accessed items into account
Two basic approaches
• use contiguous sequences (e.g., Web navigational patterns)
• use general sequential patterns
Contiguous sequential patterns are often modeled as
Markov chains and used for prefetching (i.e., predicting
the next user access based on previously accessed pages)
In the context of recommendations, they can achieve higher
accuracy than other methods, but it may be difficult to obtain
reasonable coverage
Pattern Discovery Sequence mining for personalization
Markov chain representation often leads to high space
complexity due to model sizes
Some Solutions
• selective Markov Models (Deshpande, Karypis, 2000 [17])
use various pruning strategies to reduce the number of states
(e.g., support or confidence pruning, error pruning)
• longest repeating subsequences (Pitkow, Pirolli, 1999 [])
similar to support pruning, used to focus only on significant
navigational paths
• increased coverage can be achieved by using all-Kth-order
models (i.e., using all possible sizes for user histories)
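The all-Kth-order idea can be illustrated with a small Python sketch: next-page counts are stored for every context length up to a maximum order, and prediction falls back from the longest matching context to shorter ones. No pruning is applied, and the function names and sessions are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train(sessions, max_order):
    """Count next-page frequencies for every context of length 1..max_order."""
    model = defaultdict(Counter)
    for pages in sessions:
        for i in range(1, len(pages)):
            for k in range(1, max_order + 1):
                if i - k < 0:
                    break                      # not enough history for order k
                model[tuple(pages[i - k:i])][pages[i]] += 1
    return model

def predict(model, history, max_order):
    """Use the longest matching context; fall back to shorter ones."""
    for k in range(min(max_order, len(history)), 0, -1):
        context = tuple(history[-k:])
        if context in model:
            page, count = model[context].most_common(1)[0]
            return page, count / sum(model[context].values())
    return None                                # no context matched: no coverage

sessions = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"],
            ["E", "B", "D"], ["E", "B", "D"]]
model = train(sessions, max_order=2)
print(predict(model, ["A", "B"], 2))   # second-order context (A, B) matches
print(predict(model, ["X", "B"], 2))   # falls back to first-order context (B,)
```

The fallback is what raises coverage: a history that never occurred at order 2 can still be served by the order-1 model, at the price of storing contexts of every length.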
38. Pattern Discovery Sequence mining for personalization
The approach of Gaul & Schmidt-Thieme [2?]
Main idea
=> Recommendations are based on frequent patterns of past behaviour.
=> A recommender is a predictor for a class of events.
=> The constellation of the recommenders for all classes returns the
best recommendations for a given user history.
According to our categorization
Visibility: Recommendation
Service element: URLs, site objects
Matching based on: navigation histories and URL proximity
Off-line training of classifiers (local recommender systems)
Generic framework
• with measures for the quality of a recommendation, taking the
distance between candidate URLs into account
• distinguishing between dynamic and static recommenders that
do/do not take user histories into account
• combining local recommender systems, each of which predicts a
class of events
where a class can be one user history, a group of histories or the whole
data set.
Thereby, a navigation history is
– a set of events
– a sequence of events
– a more complex structure of co-occurring events
39. Pattern Discovery Usage profiles for personalization
The approach of Mobasher et al. [43, 42]
Two types of usage profiles
– clusters of similar user transactions, enhanced by a weighting
scheme that removes pages with support less than a min value
– clusters of pages accessed together
Aggregating the members of a cluster into one representative profile
According to our categorization
Visibility: Personal recommendation or silent dynamic adjustment
Service element: pageview
Matching based on: user behaviour, also page content
Off-line discovery of aggregate profiles
Aims
=> achieve similar performance to on-line collaborative filtering
=> using a minimal number of pageviews for the active user
Methodology
• Preprocessing phase
– Assignment of weights to the pageviews
– Significance testing, based on page stay time
– Normalization of pageview weights
• PACT: Profile Aggregation based on Clustering Techniques
1. Clustering of usage data to establish the aggregate profiles
2. Materialization of the profiles as vectors of (page, weight) pairs
3. Scan of the user's history by means of a sliding window that
allows only a set of page accesses to be considered in the profile
4. Matching the user session with a profile
5. Match ranking
40. A Framework for Personalization Based on
Aggregate Profiles
Offline Phase
A Framework for Personalization Based on
Aggregate Profiles
[Figure: Online Phase. Input from the batch process: Usage Profiles, Content Profiles]
• Match current user’s activity against the discovered profiles
• Each recommended item is assigned a score based on
– matching criteria and quality of aggregate profiles
– “information value” of the item based on domain knowledge
41. Aggregate Profiles Based on Clustering
Transactions (PACT) (Mobasher, et al, [42, 43])
• Input
– set of relevant pageviews in preprocessed log
$P = \{p_1, p_2, \ldots, p_n\}$
– set of user transactions
$T = \{t_1, t_2, \ldots, t_m\}$
– each transaction is a pageview vector
$t = \langle w(p_1, t), w(p_2, t), \ldots, w(p_n, t) \rangle$
Aggregate Profiles Based on Clustering
Transactions (PACT)
• Transaction Clusters
– each cluster contains a set of transaction vectors
– for each cluster compute centroid as cluster
representative
$\vec{c} = \langle u_1^c, u_2^c, \ldots, u_n^c \rangle$
• Aggregate Usage Profiles
– a set of pageview-weight pairs: for transaction cluster
C, select each pageview $p_i$ such that $u_i^c$ (in the cluster
centroid) is greater than a pre-specified threshold
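A minimal Python sketch of this step, assuming the centroid is the component-wise mean of the transaction vectors in a cluster and that transactions are stored sparsely as pageview-to-weight dictionaries. The function name, the binary weights, and the threshold of 0.5 are illustrative assumptions.

```python
def pact_profile(cluster, threshold):
    """cluster: list of sparse transaction vectors (pageview -> weight).
    Returns the pageview-weight pairs whose mean weight exceeds `threshold`,
    ordered by decreasing weight."""
    n = len(cluster)
    totals = {}
    for transaction in cluster:
        for pageview, weight in transaction.items():
            totals[pageview] = totals.get(pageview, 0.0) + weight
    centroid = {p: total / n for p, total in totals.items()}  # mean weight u_i
    kept = [(p, u) for p, u in centroid.items() if u > threshold]
    return dict(sorted(kept, key=lambda pu: (-pu[1], pu[0])))

# Hypothetical cluster of three transactions with binary weights.
cluster = [
    {"A.html": 1.0, "B.html": 1.0, "C.html": 1.0},
    {"A.html": 1.0, "B.html": 1.0},
    {"A.html": 1.0, "D.html": 1.0},
]
profile = pact_profile(cluster, threshold=0.5)
for pageview, weight in profile.items():
    print(f"{weight:.2f} {pageview}")
```

With binary weights the centroid weight of a pageview is simply the fraction of transactions in the cluster that contain it.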
42. Example Aggregate Profiles
• Example Profiles based on the PACT method
– Based on data from the Association for Consumer
Research Site:
1.00 Call for Papers
0.67 ACR News Special Topics
0.67 CFP: Journal of Psychology and Marketing I
0.67 CFP: Journal of Psychology and Marketing II
0.67 CFP: Journal of Consumer Psychology II
0.67 CFP: Journal of Consumer Psychology I

1.00 CFP: Winter 2000 SCP Conference
1.00 Call for Papers
0.36 CFP: ACR 1999 Asia-Pacific Conference
0.30 ACR 1999 Annual Conference
0.25 ACR News Updates
0.24 Conference Update
Hypergraph-Based Clustering
(Han, Karypis, Kumar, Mobasher, 1998 [26])
• Construct a hypergraph from sets of related items
– Each hyperedge represents a frequent itemset
– Weight of each hyperedge can be based on the characteristics
of frequent itemsets or association rules (e.g., support,
confidence, interest, etc.)
43. Hypergraph-Based Clustering
• Recursively partition hypergraph so that each partition
contains only highly connected items
– Given a hypergraph we find a k-way partitioning such
that the weight of the hyperedges that are cut is
minimized
– The fitness of a partition is measured in terms of the ratio
of the weights of cut edges to the weights of uncut edges
within the partition
– The connectivity of a vertex measures the percentage of edges
within the partition with which the vertex is associated;
it is used for filtering partitions
– Vertices from partial edges can be added back to
clusters based on a user-specified overlap factor
Profiles Based on Hypergraph Clusters
(Mobasher, Cooley, Srivastava, 1999 [41])
• Input
– input for clustering is the set of large itemsets from
association rule module
– each itemset is a hyperedge (weights are a function of
the interest of the itemset)
$\mathrm{Interest}(I) = \frac{\mathrm{support}(I)}{\prod_{i \in I} \mathrm{support}(i)}$
– In practice, the log of interest can be used to prevent a few
highly frequent patterns from totally dominating
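The interest formula can be evaluated as in the Python sketch below, which assumes that relative supports (fractions of transactions) are available in a dictionary keyed by itemset. The function names and the support values are hypothetical.

```python
import math

def interest(itemset, support):
    """support(I) divided by the product of the supports of its single items.
    A value above 1 means the items co-occur more often than they would
    if they were independent."""
    singles = math.prod(support[frozenset({i})] for i in itemset)
    return support[frozenset(itemset)] / singles

def hyperedge_weight(itemset, support):
    """Log of interest, which damps the influence of extreme values."""
    return math.log(interest(itemset, support))

# Hypothetical relative supports.
support = {
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.5,
    frozenset({"A", "B"}): 0.3,
}
print(round(interest({"A", "B"}, support), 3))          # 0.3 / (0.4 * 0.5) = 1.5
print(round(hyperedge_weight({"A", "B"}, support), 3))
```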
44. Profiles Based on Hypergraph Clusters
• Aggregate Profiles (Item/Pageview Clusters)
– clustering program directly outputs a set of
overlapping pageview clusters
– the weight associated with pageview p in a cluster
C is based on the connectivity value of p in
hypergraph partition:
$\mathrm{conn}(p, C) = \frac{|\{e \mid e \subseteq C,\ p \in e\}|}{|\{e \mid e \subseteq C\}|}$
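The connectivity value counts, among the hyperedges that lie entirely inside the cluster, the share that contain the pageview. A small Python sketch follows; the function name and the example hyperedges are assumptions.

```python
def connectivity(p, cluster, hyperedges):
    """Fraction of the hyperedges fully contained in `cluster` that contain p."""
    inside = [e for e in hyperedges if e <= cluster]
    if not inside:
        return 0.0
    return sum(1 for e in inside if p in e) / len(inside)

cluster = frozenset({"A", "B", "C"})
hyperedges = [
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"A", "B", "C"}),
    frozenset({"C", "D"}),      # not inside the cluster, so it is ignored
]
for page in sorted(cluster):
    print(page, round(connectivity(page, cluster, hyperedges), 2))
```

Here A appears in all three hyperedges inside the cluster, while B and C each appear in two of them.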
Recommendation Engine for Using
Aggregate Profiles
• Match user’s activity against discovered profiles
– a sliding window over the active session to capture the
current user’s “short-term” history depth
– profiles and the active session are treated as vectors
– matching score is computed based on the similarity
between vectors (e.g., normalized cosine similarity)
• Recommendation scores are based on
– matching score to aggregate profiles
– “information value” of the recommended item (e.g., link
distance of the recommendation to the active session)
– recommendations are contributed by multiple profiles
45. Active Session Window
• Example: Session window of size 5
A.html => B.html => C.html => D.html => E.html => D.html => F.html
[Figure labels: active user session; session window]
• Associating weight with items in the active session:
– assigned by site owner based on perceived importance
– based on recency (recent pages weighted higher) or
time spent on pages
– based on page types (e.g., content v. navigational)
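A sliding session window can be kept with a bounded queue. The Python sketch below assumes the window holds the most recent pageviews; the linear recency weighting (position divided by window length) is only one hypothetical scheme among the options listed above, and the function name is illustrative.

```python
from collections import deque

def window_vector(clickstream, size, recency=False):
    """Keep the last `size` pageviews and turn them into a page -> weight vector.
    With recency=False every page in the window gets unit weight; with
    recency=True later positions get linearly higher weights (an assumed
    scheme), and a page seen twice keeps its highest weight."""
    window = deque(maxlen=size)        # older pageviews drop out automatically
    for page in clickstream:
        window.append(page)
    vector = {}
    for position, page in enumerate(window, start=1):
        weight = position / len(window) if recency else 1.0
        vector[page] = max(vector.get(page, 0.0), weight)
    return list(window), vector

clicks = ["A.html", "B.html", "C.html", "D.html", "E.html", "D.html", "F.html"]
window, unit = window_vector(clicks, size=5)
print(window)   # ['C.html', 'D.html', 'E.html', 'D.html', 'F.html']
print(window_vector(clicks, size=5, recency=True)[1])
```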
Example: Recommendations Based on PACT
Current User Session U: A.html => B.html => C.html => E.html
Assume a session window size of 3 (B.html, C.html, E.html) and unit
weights, using (cosine) similarity between the active session and
each profile.
Example profiles:
PROFILE 0
-------------
1.00 D.html
0.50 A.html
0.50 C.html
0.50 E.html
PROFILE 1
-------------
1.00 A.html
0.50 B.html
0.50 C.html
0.50 D.html
0.50 E.html
0.50 F.html
PROFILE 2
-------------
0.75 B.html
0.75 F.html
0.50 A.html
0.50 C.html
0.25 D.html
Similarities:
Sim(U, P0) = (0.5+0.5) / SQRT(1.75*3) = 0.44
Sim(U, P1) = (0.5+0.5+0.5) / SQRT(2.25*3) = 0.58
Sim(U, P2) = (0.75+0.5) / SQRT(1.69*3) = 0.56
Candidate Recommendations (score = SQRT(similarity * weight)):
P0: D.html (SQRT(0.44*1.00) = 0.66)
    A.html (SQRT(0.44*0.50) = 0.47)
P1: A.html (SQRT(0.58*1.00) = 0.76)
    D.html (SQRT(0.58*0.50) = 0.54)
    F.html (SQRT(0.58*0.50) = 0.54)
P2: F.html (SQRT(0.56*0.75) = 0.65)
    A.html (SQRT(0.56*0.50) = 0.53)
    D.html (SQRT(0.56*0.25) = 0.37)
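The example can be recomputed with a few lines of Python, evaluating the two formulas as written: cosine similarity between the session window and each profile, and a recommendation score of SQRT(similarity * weight). Keeping the best score per page across profiles is an assumed merge rule, and the function names are illustrative.

```python
import math

def cosine(session, profile):
    """Cosine similarity between two sparse page -> weight vectors."""
    dot = sum(w * profile.get(page, 0.0) for page, w in session.items())
    norm = math.sqrt(sum(w * w for w in session.values())
                     * sum(w * w for w in profile.values()))
    return dot / norm if norm else 0.0

def recommend(session, profiles):
    """Score pages outside the window by sqrt(similarity * weight) and keep,
    for each page, the best score over all profiles (an assumed merge rule)."""
    best = {}
    for profile in profiles:
        sim = cosine(session, profile)
        for page, weight in profile.items():
            if page not in session:
                best[page] = max(best.get(page, 0.0), math.sqrt(sim * weight))
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))

U = {"B.html": 1.0, "C.html": 1.0, "E.html": 1.0}   # window of size 3, unit weights
P0 = {"D.html": 1.00, "A.html": 0.50, "C.html": 0.50, "E.html": 0.50}
P1 = {"A.html": 1.00, "B.html": 0.50, "C.html": 0.50,
      "D.html": 0.50, "E.html": 0.50, "F.html": 0.50}
P2 = {"B.html": 0.75, "F.html": 0.75, "A.html": 0.50,
      "C.html": 0.50, "D.html": 0.25}

for name, profile in (("P0", P0), ("P1", P1), ("P2", P2)):
    print(name, round(cosine(U, profile), 2))
for page, score in recommend(U, [P0, P1, P2]):
    print(page, round(score, 2))
```

The square root keeps scores in [0, 1] while letting both the profile match and the item weight within the profile contribute.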
46. Integration of Content Profiles
(Mobasher, et al., 2000 [44])
• Cluster features over the n-dimensional space of pageviews
• For each feature cluster derive a content profile by
collecting pageviews in which these features appear as
significant (represented as overlapping collections of
pageview-weight pairs)
Weight | Pageview ID | Significant Features (stems)
1.00 | CFP: One World One Market | world challeng busi co manag global
0.63 | CFP: Int'l Conf. on Marketing & Development | challeng co contact develop intern
0.35 | CFP: Journal of Global Marketing | busi global
0.32 | CFP: Journal of Consumer Psychology | busi manag global

Weight | Pageview ID | Significant Features (stems)
1.00 | CFP: Journal of Psych. & Marketing | psychologi consum special market
1.00 | CFP: Journal of Consumer Psychology I | psychologi journal consum special market
0.72 | CFP: Journal of Global Marketing | journal special market
0.61 | CFP: Journal of Consumer Psychology II | psychologi journal consum special
0.50 | CFP: Society for Consumer Psychology | psychologi consum special
0.50 | CFP: Conf. on Gender, Market., Consumer Behavior | journal consum market
Integration of Content Profiles
• Integration with Recommendation Engine
– Usage and content profiles have similar representation,
so they can be used by the recommendation engine in
the same way
• Item weights in profiles must be normalized, so content
and usage profiles can be compared on the same scale
– One approach: match active user session with all
profiles (both content and usage); then use the maximal
recommendation score for candidate recommendations
– Another approach: use content profiles for generating
recommendations only if no matching usage profile
(with sufficient confidence) is found
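The two integration strategies can be sketched in Python as follows. The sketch works on already computed recommendation scores (page to score) rather than on profile matches, and `min_score` is a hypothetical stand-in for "sufficient confidence"; both are simplifying assumptions.

```python
from typing import Dict

Scores = Dict[str, float]

def merge_by_max(usage: Scores, content: Scores) -> Scores:
    """First approach: pool both profile types, keep the maximal score per page."""
    merged = dict(usage)
    for page, score in content.items():
        merged[page] = max(merged.get(page, 0.0), score)
    return merged

def usage_then_content(usage: Scores, content: Scores, min_score: float) -> Scores:
    """Second approach: fall back to content-based recommendations only if no
    usage-based recommendation reaches min_score (a hypothetical cutoff)."""
    confident = {page: s for page, s in usage.items() if s >= min_score}
    return confident if confident else dict(content)

# Hypothetical, already normalized scores.
usage = {"A.html": 0.76, "D.html": 0.30}
content = {"D.html": 0.55, "G.html": 0.40}

print(merge_by_max(usage, content))
print(usage_then_content(usage, content, min_score=0.5))
print(usage_then_content(usage, content, min_score=0.9))
```

Either strategy presupposes that usage and content scores are on the same scale, which is why the item weights in the profiles must be normalized first.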
47. Evaluating Personalization
Evaluating usability: goals / tasks?
Recall operational definition:
A Web site’s usability is high if users
- achieve their goals / perform their tasks in little time,
- do so with a low error rate,
- experience high subjective satisfaction.
Depending on the site, relevant goals / tasks may be to:
- stay in the site, return to the site, buy... => E-metrics
- locate content (search),
- learn,
- ...
48. Evaluating usability: methodological caveats
Questionnaire data:
self-reports are often biased;
observation of behavior in experiments advisable
Comparisons of sites with/without personalization,
or before/after personalization introduced,
with respect to "normal user behavior" (server logs):
usually a quasi-experiment
- many uncontrolled variables (e.g., user intentions)
- possibly several differences between sites/site versions
=> causal attribution of success to personalization
becomes difficult
Evaluating usability: results I
CyberBehavior Research Center 1999 survey
- 81% of 694 respondents have visited a personalized site
- 64% of those found it useful: helpful, time saving
- perceived usefulness changes with product
(books > music > inf.technol. > news/articles > other)
- main problems: privacy, ineffectiveness when behavior
did not reflect user "personally" (e.g., buying a gift)
- concern that possible choices may be limited
- little difference of opinion between personalization
occurring in response to behavior or to solicited input
49. Evaluating usability: results II
Belkin [3], reviewing studies of recommendations
in IR systems carried out at Rutgers Univ. since 1995:
- measures of performance and subj. satisfaction
- relevance feedback worked well, but better with both
increased knowledge of how it worked, and with
increased control by the user of its suggestions:
- relevance feedback + term suggestion performed better
than, and was preferred to, pure relevance feedback
- users preferred to save effort:
were willing to hand over the subsidiary task of term
selection to a system they trusted
Evaluating usability: results III
Nielsen Net Ratings 1999
registered visitors of portal sites,
i.e., those who can customize,
- spend > 3 times longer at home portal than others
- view 3-4 times more pages
50. Why are results scarce? Possible reasons
"In essence, web design is a problem in user interface design.
However, ... few web designers can afford to subject their
web sites to formal usability testing in special labs."
Perkowitz & Etzioni [52]: Adaptive web sites: an AI challenge.
"Web personalization is much over-rated and mainly used as
a poor excuse for not designing a navigable website."
Nielsen [47]: Personalization is over-rated.
"Personalization costs. ... You’re more likely to get a good
return on your efforts ... by fixing other problems, such as
difficulty in locating content."
Lighthouse on the Web [36], quoting from
Mainspring and User Interface Engineering
Can other results be transferred?
Research on adaptive educational software since ~ 1970
- usually, user control helpful for learning;
adaptive interfaces particularly helpful for novices
- interfaces changing over time: difficult to learn
- adaptive presentation (more info depending on user
knowledge) improves comprehension and reduces
reading time
- adaptive link annotation
- can reduce no. of visited pages + learning time
- encourages novices to navigate non-sequentially
- enables users to rate the difficulty of a page better
51. Can other results be transferred? (contd.)
- adaptive link ordering improves user performance
in information search tasks
- but unstable order of options is confusing for novices
so hiding is better for novices
- for novices, direct guidance is useful
("next" link is most popular choice)
- the more users agree with the system’s suggestions,
the better their test results
(surveys in [11,12])
Further factors affecting subjective satisfaction
- user control (general guideline for software development)
- must match user’s interests at the moment
- users don’t want extra work: "paradox of the active user"
- users don’t like to be recognized too soon
- users want to be anonymous, at least at certain times
- users want openness / disclosure
- people don’t want relationships with corporations,
but with other people
- be specific without being exclusive
- consider information structure on Web
(non-monetary rewards better than differential pricing)
respect the user!