Advisor: Dr. Michele C. Weigle.
Committee: Dr. Michael L. Nelson, Dr. Ravi Mukkamala
Slides 32-34 contain videos, which are embedded as YouTube videos in this SlideShare.
Direct links to videos:
Treemap: http://www.youtube.com/watch?v=BJDrxQEEFYM
Timecloud: http://www.youtube.com/watch?v=YYkI6aBO0to
Bubble Chart, Image Plot and Timeline: http://www.youtube.com/watch?v=j94clxqKQk8
Abstract:
Archive-It, a subscription service from the Internet Archive, allows users to create,
maintain, and view digital collections of web resources. The current interface of
Archive-It is largely text-based, supporting drill-down navigation using lists of URIs.
While this interface provides good searching capabilities, it is not very efficient for
browsing. In the absence of keywords, a user has to spend a large amount of time trying
to locate a webpage of interest. In order to provide a better visual experience to
the user, we have studied the underlying characteristics of Archive-It collections and
implemented six different visualizations (treemap, time cloud, bubble chart, image
plot, timeline and wordle), each highlighting one or more of the underlying characteristics
of the collection. Archive-It supports grouping of webpages into categories;
however, it does not enforce this grouping. As a result, there are many collections with
missing or improper grouping. For such collections, we present a method of grouping
webpages based on a set of pre-defined rules.
MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It
1. Visualizing Digital Collections at Archive-It
Kalpesh Padia
Director: Michele C. Weigle
Committee: Michael L. Nelson, Ravi Mukkamala
7/20/2012 MS Thesis - August 2012 1
2. Agenda
Introduction
Motivation
Related Work
Collection Retrieval and Processing
Visualizations
Case Studies
Future Work
Conclusion
5. Archive-It
http://archive-it.org/
6. Archive-It Collection Hierarchy
Root: Collection Title
Level 1: Category 1 ... Category n
Level 2: Web page 1 ... Web page n
Level 3 (Leaf Nodes): Archived Version 1 ... Archived Version n
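The hierarchy on this slide can be sketched as a small nested data structure. The class and field names below (Collection, Category, WebPage, archived_versions) are hypothetical and chosen only for illustration; they are not taken from Archive-It or from the thesis implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebPage:
    """Level 2 node: one web page (URI) in the collection."""
    uri: str
    # Level 3 leaf nodes: identifiers of the archived versions of this page.
    archived_versions: List[str] = field(default_factory=list)


@dataclass
class Category:
    """Level 1 node: a curator-defined group of web pages."""
    name: str
    pages: List[WebPage] = field(default_factory=list)


@dataclass
class Collection:
    """Root node: the collection, identified here by its title."""
    title: str
    categories: List[Category] = field(default_factory=list)

    def count_versions(self) -> int:
        """Total number of leaf nodes (archived versions) in the tree."""
        return sum(
            len(page.archived_versions)
            for category in self.categories
            for page in category.pages
        )


# Placeholder data, not a real Archive-It collection.
collection = Collection(
    title="Example collection",
    categories=[
        Category(
            name="Category 1",
            pages=[
                WebPage(
                    uri="http://example.org/",
                    archived_versions=["version-1", "version-2"],
                )
            ],
        )
    ],
)
total = collection.count_versions()
```

Walking the tree from the root down to the leaves in this way is what a drill-down interface or a treemap layout would need as input.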
13. Drawbacks
No visual feedback
Discovering individual pages is difficult
Optional metadata and categorization
Collection structure known only to curator
14. Contribution
Interactive visualizations
Treemap
Time cloud
Bubble chart
Image plot
Wordle
Timeline
Temporal exploration of collections
Uncover collection structure
16. Microsoft Pivot
http://www.microsoft.com/silverlight/pivotviewer/
17. Page History Explorer
A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in
Proceedings of the 17th international conference on World Wide Web, 2008.
18. 3D Wall
http://www.webarchive.org.uk/ukwa/wall/Blogs
19. Treemap
Johnson and Shneiderman, “Space-Filling Approach to the Visualization of Hierarchical Information Structures,” in
Proceedings of the 2nd conference on Visualization '91
20. Series Browser
M. Whitelaw, “Visualising Archival Collections: The Visible Archive
Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009.
21. A1 Explorer
M. Whitelaw, “Visualising Archival Collections: The Visible Archive
Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009.
22. EASY
Scharnhorst et al., “Looking at a digital research data archive - Visual
interfaces to EASY,” in CoRR, 2012, http://arxiv.org/abs/1204.3200
23. Wordle
Jonathan Feinberg, http://wordle.net/, Dogear
48. Informal User Evaluation
Alex Thurman, Columbia University Libraries
Feedback on
ease of browsing and obtaining information
user-friendliness of the interface
whether they prefer a textual or graphical interface
most effective visualization
effectiveness of the rule-based categorization in exploring archives
49. Feedback
Effective visualizations:
Treemap – color coding useful for identifying newer additions
Image plot – screenshots with mouse-over wordles allow for good navigation
Timeline – useful for visualizing development of groups in collection
Suggestions
Broader timescale for treemaps
Include stop words from other languages
50. FUTURE WORK AND CONCLUSION
51. Future Work
N-Gram wordles
Term expansion
Krovetz stemmer (dictionary based stemmer)
Integration with Archive-It
Detailed user evaluation
Implementation for other archives
59. Conclusion
Identified metrics for collections
Visualizations
Treemap
Time cloud
Bubble chart
Image plot
Wordle
Timeline
Rule-based categorization
61. Time Span
Time span:
Small: 1 Day - 2 Weeks
Medium: 2 Weeks - 4 Months
Large: > 4 Months
http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/
62. Groups
Groups:
Small: 1
Medium: 2-5
Large: > 5
http://www.archive-it.org/collections/1068
63. URI Domains
URI Domains:
Small: 1 - 10
Medium: 11 - 20
Large: > 20
http://www.archive-it.org/collections/2836
64. Number of Web Pages
# of Web Pages:
Small: 1 - 10
Medium: 11 - 99
Large: > 99
http://www.archive-it.org/collections/2836
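The four metric slides above (time span, groups, URI domains, number of web pages) each sort a collection into Small, Medium, or Large. A minimal sketch of that bucketing follows. It rests on assumptions the slides do not state: exactly 2 weeks (14 days) counts as Small, 4 months is approximated as 120 days, and the function names bucket and classify_collection are hypothetical.

```python
from datetime import timedelta


def bucket(value: int, small_max: int, medium_max: int) -> str:
    """Map a count to 'Small', 'Medium', or 'Large' using inclusive upper bounds."""
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"


def classify_collection(time_span: timedelta, groups: int,
                        uri_domains: int, web_pages: int) -> dict:
    """Classify one collection on the four metrics from the slides."""
    return {
        # Assumption: 2 weeks = 14 days (inclusive), 4 months ~ 120 days.
        "time_span": bucket(time_span.days, 14, 120),
        "groups": bucket(groups, 1, 5),             # 1 / 2-5 / > 5
        "uri_domains": bucket(uri_domains, 10, 20),  # 1-10 / 11-20 / > 20
        "web_pages": bucket(web_pages, 10, 99),      # 1-10 / 11-99 / > 99
    }


# Made-up numbers for illustration, not measurements of a real collection.
example = classify_collection(
    timedelta(days=200), groups=3, uri_domains=25, web_pages=8
)
```

Keeping the thresholds as arguments to a single helper makes it easy to adjust a boundary without touching the classification logic.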
65. Jigsaw
Stasko et al., IEEE VAST 2007
66. Themeriver
Wei et al., in SIGKDD, 2010
68. Bubble Chart
http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
69. Image Plot with Wordle
http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
70. Timeline
http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
Editor's Notes
Collections used for development
Present a good mix of various metrics
Many Archive-It collections are not curated well
Lack categorization
Improper categorization
Suggest categorization for
Organizing collections
Properly categorizing articles with existing categorization
If the domain is a news web site, such as cnn, abc, or bbc, put the page into the news web site group
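A rule of this kind could look roughly like the sketch below. The rule table, the group name, and the function categorize are hypothetical; only the cnn/abc/bbc example comes from the note above, and the thesis may well implement its pre-defined rules differently. Matching on any host label is a deliberate simplification, so a host such as abc.example.org would also match.

```python
from urllib.parse import urlparse

# Hypothetical rule table: group name -> domain labels that trigger it.
RULES = {
    "News web sites": {"cnn", "abc", "bbc"},
}


def categorize(uri: str, rules: dict = RULES,
               default: str = "Uncategorized") -> str:
    """Return the first group whose rule matches a label of the URI's host."""
    host = urlparse(uri).hostname or ""
    labels = host.lower().split(".")
    for group, names in rules.items():
        if any(label in names for label in labels):
            return group
    return default


news = categorize("http://www.cnn.com/world")
other = categorize("http://amigosdemujeres.org/")
```

Pages that match no rule fall into a default group here, so they stay visible instead of being silently dropped.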
Domains and subdomains
Number of articles or sites
Extend stacked bar charts to represent independent values as images
All values in a sample given equal weight
The height of each stack represents the size of each sample
Categories are samples, articles are values
Hover over each article to reveal number of mementos, timespan and wordle summarizing the article's content