1. Introduction
Web Usage Log Case Study
Conclusion
Gaining New Insights into Usage Log Data
via Explorative Visualisation
Markus Kirchberg, Ryan K L Ko, and Bu Sung Lee
Hewlett-Packard Labs (HP Labs) Singapore
Contact: Markus.Kirchberg@hp.com
Business Analytics 2011 – A SAS Forum Event
– May 25th, 2011 –
M. Kirchberg et.al. @ (SAS) Business Analytics 2011 Gaining New Insights into Usage Log Data Slide 1 / 27
Outline.
1 Introduction
Usage Log Analysis
Explorative Visualisation
2 Web Usage Log Case Study
Basics
Relevance
3 Conclusion
Introduction
Background and Motivation
Cloud computing, MPP/map-reduce, data explosion, semantic
technologies, ... → increased interest in data analytics.
Logged data: generated by almost all systems/services in use.
Capabilities to extract value from logs ≅ key distinguishing factor.
Current approaches (e.g., link & usage log analysis) need
revision.
Typically time is considered as an orthogonal factor.
Limitation of the potential impact of the measured importance.
Real-world events, topics or keywords are not consistently
interpreted over time.
Focus: Extract meaningful information (e.g., usage patterns or
relevance indicators) and relate it to users / real-world events.
Sample Events
Usage Log Analysis – Basics
Usage Log Types (It’s more than just Web server logs!):
Network / Firewall Logs (bandwidth per msg type, inbound vs outbound,
Intranet vs Internet, ...)
Medical Device Usage Logs (proper usage, treatment improvement, ...)
Vehicle Usage Logs (ERP, road monitoring, accident prevention /
investigation, ...)
Database Usage Logs (auditing, consistency, recovery, performance
optimisation, ...)
Web, ftp, mail, ... server usage logs (usage statistics, relevancy,
advertising, ...)
Call Center Usage Logs, Social Networking Usage Logs, ...
Purposes: Data enrichment, identification of redundant data, data
cleaning, detection of hidden patterns, statistical verification, usage
context / relevancy, marketing / advertisement placement, ...
Explorative Visualisation
‘Data science is the future and there cannot be data science without
data visualization and vice versa.’ David McCandless @ TED, July 2010
≅ Graphics that give important clues and observations of patterns
and consistent trends.
Useful to prove the existence or understanding of a certain
phenomenon;
Assist with modelling findings as mathematics, algorithms or other
formalisms that can reproduce such trends.
Proven to be of great value in analysing and exploring big data.
Web Usage Log Case Study
Basics
M. Kirchberg, R. K L Ko, B. S. Lee. From Linked Data to Relevant
Data – Time is the Essence. In Proceedings of the 1st International
Workshop on Usage Analysis and the Web of Data (USEWOD) held
in conjunction with the 20th International World Wide Web
Conference (WWW), 2011. (Best Paper Award)
How to Obtain MEANINGFUL Web Usage Data?
Usage Log Analysis
Non-invasive; implicitly collected; potential source of privacy
concerns!
Challenges: up to 90% of data is rubbish; lack of relevancy notion.
Social Tagging / Annotations
Requires explicit user inputs; limited to social networking sites.
Proven useful to define better folksonomies; but lack of use cases.
Explicit User Feedback (Like/Unlike, Rate Up/Down) in the GUI
Requires new GUIs and explicit user inputs.
Proven useful for location-dependent search; long-tail queries.
Case Study: (Linked) Data Sets & their Usage Logs
Semantic Web Dog Food (SWDF): Web/Semantic Web
publications, people and organisations.
Usage logs cover 2 years from Nov 01, 2008 to Dec 14, 2010 [1].
Log Size   # Resources   # Accessed Resources   Days   Hits   # Successful Hits
2GB        > 100,000     40,322                 720    8.1m   7.1m
DBpedia: twin of Wikipedia; focal points of the Web of data.
Usage logs covering Jul 01, 2009 & Feb 01, 2010[1]
(avg of 1m hits/day; 6m accessed resources).
SWDF serves a specific purpose; DBpedia is general-purpose.
Case Study Evaluation Framework: Log-to-Database
1 Evaluate log entries & remove hits with 4xx/5xx HTTP status codes.
SWDF: Very clean and conforms to CLF.
DBpedia: > 1,000 non-UTF8 / non-CLF-conformant entries.
2 Map log entry fields to a specifically designed PostgreSQL DB.
3 Post-process DB entries:
URIs and matching HTML/RDF representations;
Bots, spiders, crawlers, ... (user agent field, access to
robots.txt, high frequency accesses); and
Access types – Plain/HTML vs. Semantic vs. Search vs. SPARQL.
4 Basic analysis of usage log data.
5 Relevance-driven usage log analysis.
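A minimal sketch of the filtering in steps 1 and 3, not the authors' implementation. It assumes each log line carries the CLF fields followed by quoted referrer and user-agent fields. The regular expression, the `BOT_HINTS` keyword list and all function names are illustrative assumptions. The PostgreSQL mapping and the high-frequency-access check are left out.

```python
import re

# Assumed line layout: host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

BOT_HINTS = ("bot", "spider", "crawler")  # hypothetical keyword list


def parse_line(line):
    """Return a dict of fields, or None for entries that do not conform."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    return entry


def is_successful(entry):
    """Step 1: keep only hits without a 4xx/5xx status code."""
    return entry["status"] < 400


def looks_like_bot(entry):
    """Part of step 3: user-agent keywords or a request for robots.txt."""
    agent = (entry.get("agent") or "").lower()
    wants_robots = entry["resource"].endswith("/robots.txt")
    return wants_robots or any(hint in agent for hint in BOT_HINTS)


def clean(lines):
    """Keep well-formed, successful, non-bot hits."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries
            if e and is_successful(e) and not looks_like_bot(e)]
```

Calling `clean(open_log_file)` on an iterable of raw lines yields the entries that would go on to the database step; non-conforming lines are silently skipped here, whereas a real pipeline would probably count and report them.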
Case Study: Basic Statistics & Findings
Top hits excluding bots & spiders are 10% of those overall.
Adequate filtering is vital to obtain better insights.
However, filtering alone is not enough to arrive at a useful notion of relevance.
Möller et al. [2] on a possible metric to determine relevance: ‘[...] In the
case of the Dog Food dataset, the hypothesis is that requests for data
from specific conferences would be noticeably higher around the time
when the event took place. [...] Contrary to our expectations, there are
no significantly higher access rates around the time of the event. [...]’.
Web Usage Log Case Study
Relevance
Web-site: http://usewod2011.thekirchbergs.info/
Relevance – Basics
SWDF/DBpedia data sets provide clues pointing to concepts of
relevance of Web resources with time and events in reality.
Consider two spaces in which semantic data are communicated:
Real Space: where real-world events take place at unique time windows.
The same semantics of an event (e.g., National Day) can recur
frequently with the same objectives and content; BUT different time
windows are needed to understand temporal and situational context/meaning.
Web Space: Description of Real Space events in the form of linked data.
Without a time window, it is more difficult to give ‘meaning’ to a set of
keywords/topics/Web data describing a Real Space event.
Study representations of events in Real Space recorded as
linked data in Web Space.
Time windows + exploratory graphics → meaningful change.
Relevance ≅ Time window, traffic & linked resources.
Case Study: Key Contributions
Present evidence that Web usage logs can lead to a notion of
relevance.
Essential: Consider not only interlinking of weighted resources:
Whether users make use of links (use versus mere existence),
How users utilise links (browsing depth, browsing patterns, ...), and
How the usage changes over time.
Conclude that time is indeed a key factor to be considered.
Propose new approach by combining link and usage analysis for
events based on time-windowed views over usage logs.
Event ≅ A situation that creates a need in a user to search or
browse for related information which, in turn, triggers a
visit to a Web resource that is associated with
topics and keywords via the Web 3.0.
Case Study: Measuring Relevance
Web Travel Footprint (WTF) of an IP Address:
≅ Road network on a map with footprint being the user’s trail.
Characteristics from linking ‘referrer’ to ‘resource requested’:
1 Fan – Linkages between a data resource and other data resources.
Spread of influence of a resource; eliminates unused resources.
2 Depth – how ‘deep’ a user surfs into the Web-site.
Measure about ‘curiosity’ w.r.t. a certain set of resources.
Characteristics from counting a link’s hits within a time window:
1 Weight – Number of times a path was accessed.
Relevancy based on all three characteristics – not in isolation.
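One possible reading of the three characteristics, sketched under assumptions: hits are already restricted to one time window and given as hypothetical `(ip, referrer, resource)` tuples, with an empty referrer for direct entries. The function names and the exact definitions of fan and depth are illustrative and not taken from the paper.

```python
from collections import Counter, defaultdict


def footprint(hits, ip):
    """Weight: how often each referrer -> resource edge was used by one IP."""
    return Counter((ref, res) for who, ref, res in hits if who == ip and ref)


def fan(edges, resource):
    """Fan: number of distinct resources linked with `resource` by used edges."""
    neighbours = set()
    for ref, res in edges:
        if ref == resource:
            neighbours.add(res)
        elif res == resource:
            neighbours.add(ref)
    return len(neighbours)


def depth(edges, start):
    """Depth: longest chain of hops reachable from `start`, ignoring revisits."""
    children = defaultdict(set)
    for ref, res in edges:
        children[ref].add(res)

    def walk(node, seen):
        return max((1 + walk(nxt, seen | {nxt})
                    for nxt in children[node] if nxt not in seen), default=0)

    return walk(start, {start})
```

Because only edges that actually occur in the log enter the footprint, resources that are linked but never visited drop out, which matches the "eliminates unused resources" remark above.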
[Figure: Web Travel Footprint (WTF) of an IP Address]
Case Study: Kandinsky Graphs (KGs)
≅ Sum of all WTFs of visitors’ access paths & linkage of the
resources within the site at a particular time window.
Exploratory graph summing up (1) how deep users have travelled
into/within a site; (2) how resources are linked to each other; and
(3) which resources are highly relevant – at a given time window.
Technically: GraphViz dot files as circo layouts.
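A minimal sketch of how such a dot file could be produced, assuming the summed footprints are available as a mapping from `(referrer, resource)` edges to hit counts. The function name and the mapping of weight to line thickness are illustrative choices, not the styling used in the original graphs.

```python
def kandinsky_dot(edge_weights, name="KG"):
    """Render weighted edges as a Graphviz digraph using the circo layout."""
    heaviest = max(edge_weights.values(), default=1)
    lines = [f'digraph "{name}" {{', "  layout=circo;"]
    for (src, dst), weight in sorted(edge_weights.items()):
        # Scale line thickness with the edge weight (illustrative choice).
        width = 1 + 4 * weight / heaviest
        lines.append(
            f'  "{src}" -> "{dst}" [label="{weight}", penwidth={width:.1f}];'
        )
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned string to a file such as `kg.dot` and running `circo -Tpng kg.dot -o kg.png` should render it with Graphviz's circular layout engine.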
Case Study: Kandinsky Graphs for WWW 2010
Recurring Top Relevant Resources in the SWDF Web-site   Paper Due   Before Conf   During Conf   After Conf
http://data.semanticweb.org/conference/www/2009         2           2             1             3
http://data.semanticweb.org/conference/iswc/2009        1           1             2             2
http://data.semanticweb.org/papers                      3           3             3             4
http://data.semanticweb.org/index.html                  -           -             -             1
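A per-window ranking like the one above can be approximated as follows. This sketch ranks by raw hit count as a simple stand-in, whereas the ranks in the table come from the Kandinsky Graphs. The `(day, resource)` hit format, the window labels and the date ranges are hypothetical inputs.

```python
from collections import Counter
from datetime import date


def window_of(day, windows):
    """Return the label of the first window containing `day`, else None."""
    for label, (start, end) in windows.items():
        if start <= day <= end:
            return label
    return None


def rank_per_window(hits, windows, top=3):
    """Rank resources by hit count inside each time window."""
    counts = {label: Counter() for label in windows}
    for day, resource in hits:
        label = window_of(day, windows)
        if label is not None:
            counts[label][resource] += 1
    return {label: [res for res, _ in counter.most_common(top)]
            for label, counter in counts.items()}
```

Hits that fall outside every window are ignored, so the windows need not cover the whole log period.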
Case Study: DIFF-Kandinsky Graphs for WWW 2010
KGs capture relevance for each time window.
DIFF-KGs capture changes between time windows:
Relevance(TimeWindow2) − Relevance(TimeWindow1)
whereby weights are calculated using division.
Emphasise new hits; remove/penalise edges with similar hits.
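The slide does not give the exact formula, so the following is only one plausible sketch of a division-based comparison. It assumes each window's graph is a mapping from edges to hit counts. Treating unseen edges as weight 1 and the `tolerance` threshold are assumptions made for illustration.

```python
def diff_kandinsky(earlier, later, tolerance=0.25):
    """Compare edge weights of two time windows by division.

    Edges whose weight ratio stays within `tolerance` of 1 are dropped,
    so edges with new or strongly increased hits stand out.
    """
    diff = {}
    for edge, new_weight in later.items():
        # Treat edges unseen in the earlier window as weight 1 (assumption).
        ratio = new_weight / max(earlier.get(edge, 0), 1)
        if abs(ratio - 1.0) > tolerance:
            diff[edge] = ratio
    return diff
```

Edges that disappear entirely in the later window are not reported by this sketch; a fuller version might list them separately.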
Conclusion
Real Space Web/Cyber Space Real Space
Did you notice something? No annotations!
Results/observations of relevance in active and purposeful Web-sites
could only be achieved because of the fundamental linkage of time
windows to the study of semantics in linked data.
Small but crucial step towards identification of data relevant to
real-life events from previously deemed contextless data.
Summary & Future Work
Argue: Sum of WTFs & linkage of a site’s resources
(time-windowed) gives insights into what constitutes relevance.
Important properties include: Fan, depth of traversals & weight.
Lessons:
Clean your data thoroughly!
Visualisation helps to gain new perspectives.
Visualisation is great for semi- & unstructured big data.
Future Work:
Extend notion of relevance to multiple data nodes.
Determine relevance value programmatically.
Extend to other types of Usage Logs.
Thank You!
Questions and/or Comments?
Contact: Markus.Kirchberg@hp.com
[1] Berendt, B., Hollink, L., Hollink, V., Luczak-Rösch, M., Möller, K. H., and Vallet, D.
USEWOD2011 – 1st International Workshop on Usage Analysis and the Web of Data.
In 20th International World Wide Web Conference (WWW) (Hyderabad, India, 2011).
[2] Möller, K., Hausenblas, M., Cyganiak, R., Handschuh, S., and Grimnes, G. A.
Learning from linked open data usage: Patterns & metrics.
In Proceedings of the Web Science Conference (WebSci) (2010).