If We're Not There Yet, How Far Do We Have To Go? Web Metadata at The University of Melbourne
1. If we’re not there yet, how far do we have to go?
A review of web metadata at The University of Melbourne
Eve Young, Metadata Coordinator
Information Acquisition and Organisation Section
Information Division
Baden Hughes, Research Fellow
Department of Computer Science and Software Engineering
The University of Melbourne
Young & Hughes, DC-ANZ 2005 1
2. Overview
Background
Web publishing policies circa 1999, 2001
Research projects
Towards standardization
Dublin Core
UniMelb administrative metadata
Broad scale compliance analysis
UniMelb web environment
DC Metadata on the UniMelb Web
UniMelb Metadata on the UniMelb Web
Reflections and challenges for the future
3. Before Metadata on UniM site
Existing standard (1999) not widely adopted
9 metadata tags
expiryDate, maintainer, authoriser, author, description,
keywords, lastModified, distribution, contentType
Operational and implementation issues
Difficulty finding information
Suspected non-compliance
Investigate and analyze
Manual research
4. Expiry Tag Analysis
Expiry tag functionality important
Analysis into non-compliance (608 pages)
Only 27% of pages audited were compliant
Of the remainder of pages reviewed, 441
had no date, or NA as value
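The audit categories above (compliant, no date, NA as value) can be sketched as a small classifier. This is an illustration only: the function name `classify_expiry`, the category labels, and the ISO `YYYY-MM-DD` date format are assumptions, since the slides do not state how expiry values were written or how compliance was judged.

```python
from datetime import date, datetime


def classify_expiry(value, today=None):
    """Classify an expiry tag value.

    Returns one of 'missing', 'na', 'invalid', 'expired' or 'current'.
    The YYYY-MM-DD format is an assumption made for this sketch.
    """
    today = today or date.today()
    # No tag at all, or an empty value, counts as a missing date.
    if value is None or not value.strip():
        return "missing"
    text = value.strip()
    # Placeholder values such as "NA" carry no usable date.
    if text.upper() in {"NA", "N/A"}:
        return "na"
    try:
        expiry = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return "invalid"
    return "expired" if expiry < today else "current"
```

Tallying these labels over a set of audited pages would give counts comparable to the "no date, or NA" figure reported above.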
5. A to Z Index: Compliance Audit
Audit of metadata on 78 web pages
Highest compliance 84.6% (content type)
Lowest 11.5% (expiry date)
More unknown than known maintainers
Default-value tags had a high degree of compliance
Page-specific tags (keywords) had the lowest
6. Metadata Working Group
Advise on implementation of a uniform approach to
the creation of metadata
Membership drew on expertise from across the
university - academics, IT, web, metadata, and
library
Reviewed metadata standards: DC, IMS, AGLS
Reviewed metadata use in large information-rich
organizations, e.g. Australian Government, UK Government,
UNSW libraries
7. UniMelb metadata standard
19 elements (meta tags) to describe and
manage a resource
2003 revised standard endorsed by
Information Strategy Committee.
Requirement on all University of Melbourne
web pages
8. Why Dublin Core (besides this being a DC-ANZ conference)?
ISO 15836
15 elements, simple
International consensus
Well supported
Offers semantic interoperability
Extensible
Easy to implement in our environment
9. University of Melbourne DC Metadata Elements
DC.Title
DC.Creator
DC.Subject
DC.Description
DC.Publisher
DC.Contributor
DC.Rights
DC.Date
DC.Date.Modified
DC.Language
DC.Format
DC.Identifier
10. University of Melbourne Administrative Metadata Elements
UM.Creator.Email
UM.Authoriser.Name
UM.Authoriser.Title
UM.Maintainer.Name
UM.Maintainer.Department
UM.Maintainer.Email
UM.Date.ReviewDue
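A page can be checked against the 12 DC and 7 UM elements listed above with a short script. The sketch below assumes the elements are carried as `<meta name="..." content="...">` tags and that names match case-insensitively. The class `MetaCollector` and function `missing_elements` are hypothetical names introduced for illustration, not part of the University's tooling.

```python
from html.parser import HTMLParser

# Element names as listed on the slides; carrying them in
# <meta name="..."> tags is an assumption of this sketch.
DC_ELEMENTS = [
    "DC.Title", "DC.Creator", "DC.Subject", "DC.Description",
    "DC.Publisher", "DC.Contributor", "DC.Rights", "DC.Date",
    "DC.Date.Modified", "DC.Language", "DC.Format", "DC.Identifier",
]
UM_ELEMENTS = [
    "UM.Creator.Email", "UM.Authoriser.Name", "UM.Authoriser.Title",
    "UM.Maintainer.Name", "UM.Maintainer.Department",
    "UM.Maintainer.Email", "UM.Date.ReviewDue",
]


class MetaCollector(HTMLParser):
    """Collect name/content pairs from every <meta> tag in a page."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        # Ignore tags with no name or with an empty content value.
        if name and content and content.strip():
            self.found[name.lower()] = content.strip()


def missing_elements(html, required=DC_ELEMENTS + UM_ELEMENTS):
    """Return the required element names that the page does not carry."""
    collector = MetaCollector()
    collector.feed(html)
    return [name for name in required if name.lower() not in collector.found]
```

A page is fully compliant under this sketch when `missing_elements(page)` returns an empty list. Whether an element with a default or placeholder value should count as present is a policy decision the sketch leaves open.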
11. Broad Scale Compliance Analysis
Full crawl of the University of Melbourne web
presence in March 2005
Crawled using the Internet Archive's Heritrix suite
(an open-source, extensible, web-scale,
archival-quality web crawler)
A total of 57 GB of data was retrieved from
www.unimelb.edu.au and its associated
sub-domains over a period of 146 hours
1.4 million documents were retrieved
12. The UniMelb Web Environment
[Chart: Format Demographics of UniMelb Web. Legend (MIME types): text/html, image/jpeg, image/gif, application/pdf, text/plain, application/msword, application/msexcel, application/mspowerpoint, application/postscript, others]
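A format breakdown of this kind can be derived by tallying the content type recorded for each crawled document. The sketch below assumes the content types are already available as a list of strings (how they are extracted from the crawler output is not covered here). The function name `format_demographics` and the `top` parameter are hypothetical.

```python
from collections import Counter


def format_demographics(content_types, top=9):
    """Return (mime_type, share_percent) rows, folding the tail into 'others'."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    counts = Counter(
        ct.split(";")[0].strip().lower() for ct in content_types if ct
    )
    total = sum(counts.values())
    if total == 0:
        return []
    leaders = counts.most_common(top)
    rows = [(mime, 100.0 * n / total) for mime, n in leaders]
    rest = total - sum(n for _, n in leaders)
    if rest:
        rows.append(("others", 100.0 * rest / total))
    return rows
```

With `top=9` the output has the same shape as the chart legend: nine named MIME types and a single "others" bucket.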
13. Observations
HTML is no longer the dominant format
UniMelb’s metadata creation processes are primarily oriented towards
creating Dublin Core-extended metadata as simple HTML meta
tags
Yet pure HTML content is in fact no longer the dominant format
Web-accessibility of “non-native” document types
Many MIME types are not addressed by the UniMelb guidelines
for metadata creation, yet they do offer some potential for
restricted metadata inclusion
Emerging document types such as XML and RDF do not easily
allow for the embedding of metadata internal to the resource.
The emergence of dynamic documents
Analysis of “All Other” categories shows many (~38%) of these
documents are dynamic, generated server side on demand by
PHP, ASP, JSP etc.
No thought currently given to inclusion of metadata in automatically
generated documents of this type
14. DC Metadata on the UniMelb Web
[Bar chart: Usage of DC Elements. Y axis: % Coverage (0.0 to 90.0). X axis: DC Metadata Element (DC.Subject, DC.Publisher, DC.Language, DC.Contributor, DC.DateModified, DC.Identifier, DC.Format, DC.Title, DC.Description, DC.Rights, DC.Creator, DC.Date, Overall Average). Series: % HTML pages with metadata in <HEAD>; % HTML pages with metadata in <BODY> element; total % HTML pages containing metadata in either <HEAD> or <BODY>]
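The three series in this chart (coverage in `<HEAD>`, in `<BODY>`, and in either) can be computed per element from the crawled HTML. The following is a minimal sketch under the assumption that pages use explicit `<head>` and `<body>` tags and carry elements as `<meta name="...">`. `SectionMetaParser` and `coverage` are hypothetical names, and this is not the authors' actual analysis code.

```python
from html.parser import HTMLParser


class SectionMetaParser(HTMLParser):
    """Record which <meta name=...> tags occur inside <head> and inside <body>."""

    def __init__(self):
        super().__init__()
        self.section = None
        self.names = {"head": set(), "body": set()}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag
        elif tag == "meta" and self.section:
            name = dict(attrs).get("name")
            if name:
                self.names[self.section].add(name.lower())

    def handle_endtag(self, tag):
        if tag == self.section:
            self.section = None


def coverage(pages, element):
    """Percentage of pages carrying `element` in head, in body, and in either."""
    key = element.lower()
    head = body = either = 0
    for html in pages:
        parser = SectionMetaParser()
        parser.feed(html)
        in_head = key in parser.names["head"]
        in_body = key in parser.names["body"]
        head += in_head
        body += in_body
        either += in_head or in_body
    n = len(pages)
    if n == 0:
        return {"head": 0.0, "body": 0.0, "either": 0.0}
    return {
        "head": 100.0 * head / n,
        "body": 100.0 * body / n,
        "either": 100.0 * either / n,
    }
```

Because a page may carry the same element in both places, the "either" figure can be smaller than the sum of the other two.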
15. Observations
Alignment with broad Dublin Core norms
These figures are generally in line with the
findings of broad scale Dublin Core-oriented
metadata communities
OAI (Ward, 2003)
OLAC (Hughes, 2004)
16. UM Metadata on the UniMelb Web
[Bar chart: Usage of UM Elements. Y axis: % Coverage (0.0 to 80.0). X axis: UM Metadata Element (per element, plus Overall Average). Series: % HTML pages with metadata in <HEAD>; % HTML pages with metadata in <BODY> element; total % HTML pages containing metadata in either <HEAD> or <BODY>]
17. Observations
Differences between core Dublin Core and institutional metadata
institutional metadata is more regularly contributed, despite the
automatic creation of some DC by content creation applications
Correlation with manual inspection statistics
these experiments suggest trends detected in earlier focused
studies such as Zajacek (2002a, 2002b) are valid.
Differences between metadata included in <HEAD> vs <BODY>
elements
for institutional metadata, there is a strong tendency to include
metadata in the <BODY> element, where it is immediately visible
on the page, rather than in the <HEAD> element, which may
reflect the emphasis of the training materials
18. Reflections and Challenges 1
% coverage of HTML sources is relatively low, but it does
account for a large number of documents (650K total in this
survey)
Many documents are non-compliant for identifiable reasons – eg
exclusion of metadata in template based pages such as those within
the learning management system
External search engines no longer use meta tag information
but instead perform full-text indexing (see Richardson, 2004)
Benefit to general web searchers of institutional metadata creation is
almost zero
May still retain currency for other administrative purposes eg the
authorization of web content publication.
Need to distinguish between the institution's need for web content
management (and how metadata facilitates this goal) and the web
search experience in general, and to decouple the two.
19. Reflections and Challenges 2
Potential impact of institution wide Content Management System
Existing metadata standards failed to address distributed content
creation (or underestimated the pervasive effect of “publish to web”
type technologies to all staff)
Opportunity to increase compliance with new generation tools and
practices.
Revisiting motivation for web metadata: search assistance or
administrative processes?
Changes to work practices required for web publishing authorisation
“Compliance audit” service
for run-time verification of metadata compliance, with a
“watermarking” service that automatically gives compliant
pages an imprimatur in the absence of manual inspection.
Requires the formalisation of University of Melbourne metadata as a
true Dublin Core application profile with an associated formal
schema, and the creation of controlled vocabularies for extensions.
20. Reflections and Challenges 3
Large number of pages which will be updated only at an irregular
interval
Substantially increasing the coverage of institutional metadata in the
short to medium term may require the deployment of an automated
metadata creation service such as DCdot (Powell, 2000) or an
augmentation service such as OLACdot (Hughes, 2005).
Early experiments with DCdot show significant promise, but need to
be more carefully evaluated in light of recent research in the area
(Greenberg, 2005).
Training of critical importance
Significant effort was invested in training key personnel and in
propagating the institutional standards and training notes online,
but only a small number of face-to-face classes have been held.
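The idea of automated metadata creation mentioned above can be illustrated by deriving a few candidate DC values from information a page already contains. This sketch is not the algorithm used by DCdot or OLACdot. The function `suggest_dc`, the helper class `_PageFacts`, and the choice of sources (`<title>` for DC.Title, the `lang` attribute of `<html>` for DC.Language, the served content type for DC.Format) are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class _PageFacts(HTMLParser):
    """Gather the <title> text and the <html lang="..."> attribute."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.lang = dict(attrs).get("lang") or self.lang
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def suggest_dc(html, content_type="text/html"):
    """Suggest candidate DC values; a human should still review them."""
    facts = _PageFacts()
    facts.feed(html)
    facts.close()
    suggestions = {"DC.Format": content_type}
    # Collapse runs of whitespace inside the title text.
    title = " ".join("".join(facts.title_parts).split())
    if title:
        suggestions["DC.Title"] = title
    if facts.lang:
        suggestions["DC.Language"] = facts.lang
    return suggestions
```

Elements such as DC.Subject or the UM authoriser and maintainer fields cannot be derived this way and would still need manual entry or a lookup against administrative records.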
21. Conclusion
UniMelb was identified as one of the leading universities with
regard to metadata implementation (Ivanova, 2004)
Empirical evidence suggests that The University of Melbourne
still faces significant challenges
Compliance in the age of moving standards - over a 2 year
period the evolution of external standards, web content creation
tools, and web content demography is significant
A strong basis for institutional metadata was formed by the
adoption of Dublin Core
However, the disparate content creation environment and rapidly changing
composition of web content have induced a less than satisfactory
application of these standards.
Automated metadata creation and assessment, which form a
significant component of future work, may address this problem in
part
22. Questions / Comments
http://eprints.unimelb.edu.au/archive/00000983
Eve Young
Metadata Coordinator
Information Acquisition and Organisation Section
Information Division
e.young@unimelb.edu.au
Baden Hughes
Research Fellow
Department of Computer Science and Software Engineering
badenh@cs.mu.oz.au