Discovery: Towards a (meta)data ecology for education and research
1. Towards a (meta)data ecology for UK education
and research
Joy Palmer, @joypalmer
Mimas, University of Manchester
2. Our business drivers vary and use cases
are distinct, but our ultimate aim is the
same. We want our content and services
to be instrumental in teaching, learning
and research.
And to achieve this we need to make
sure they are discovered…
3.
4. From ‘infrastructure’ to ‘ecosystem’
…given the paradigms of the web, that
aim is most likely to be achieved if
content is discoverable through popular
search engines as well as through
specialised services and
aggregations, and if they can be exposed
through social platforms ranging from
scholarly reference management systems
to Facebook and Twitter.
8. microdata
“it’s not about whether
microdata is going to
win, but that semantics
published along with our
html is going to drive new
functionality in the
applications we use
everyday”
(Ed Summers, 2012)
9. Open Data & Metadata
A tactical approach
'Open metadata creates the opportunity for
enhancing impact through the release of
descriptive data about library, archival and
museum resources. It allows such data to be
made freely available and innovatively reused to
serve researchers, teachers, students, service
providers and the wider community in the UK and
internationally.'
15. In short…
• Adopt an ‘open by default’ mindset
• All metadata releases should adopt standard
open licenses
• In the vast majority of cases, ODC-PDDL or CC0
is appropriate
• Avoid home grown variations
16. And it means being open to machines
Technical Principles
The Discovery ecosystem is…
Heterogeneous
Resource-orientated (not service)
Built on aggregations of metadata
Distributed
Reliant on persistent global identifiers
Striving to work well with global
search engines
17. This is all very well in principle…
Openness is a means and not an
end…
18. Open data is of no value to
the ecosystem if it isn’t used
19. Recasting the value chain
New business models
New value propositions
New paradigms
New purpose?
I was asked to deliver a presentation that focused on what it is that we share. You only have to look at the project and programme descriptions of those represented here today to see that we come from very diverse domain and disciplinary contexts: library infrastructure, linked data approaches, mass digitization, open education resource development, prototype service application development, geodata, bib data, paradata, and so forth.
In 2008 JISC convened a group of experts representing key stakeholder bodies and universities across the UK. They were tasked with answering the question ‘what would we do today if we had to start from scratch?’ Among this group there was significant disagreement over the services and infrastructure models required to achieve this vision. But there was agreement that opening up data, and freeing it from silos, would be a key enabler to achieving the vision… (I’ll come back to that)
There was also agreement on the pivotal role of the wider, wilder web. As the RDTF work was translated into the Discovery initiative, it became clear that we needed to talk in terms of an ‘ecosystem’ as opposed to an ‘infrastructure’, because the latter suggested that the initiative was aiming to impose an overarching infrastructure model over the entire museums, libraries and archives (and JISC) discovery space. To a large degree, today is about determining how far we can operate as a healthy and thriving ecosystem, where components of our content or applications interact as a system, linked together by the flow of data and transactions.
But this is not to oversimplify matters. As most of you are well aware, there are many (apparently) competing theories about how to enable discovery in the dataspace. I’m going to touch on a few here, because I think it’s important to acknowledge the complexity of our strategic backdrop: the complexity we’re all confronting as we make decisions about the discovery and use of our data.
In the last year the concepts of Big Data and the Cloud have dominated tech news headlines. As a global society we are creating more and more data, not simply content and metadata, but the para or transactional data that represents our interactions with entities: a blog post, a holiday we’ve just bought, an image we’re decoding, a discussion with one another (if we know how to render it). These datasets have become so large that they can’t be handled using traditional relational database models; big data requires massively parallel software running on vast numbers of servers. We can, arguably, take these massive sets of structured and unstructured data in multiple formats and use powerful processing tools, including semantic text-mining, image matching and analytics, to render them meaningful and useful. This includes paradata.
And then of course there is the heady possibility of Linked Data, a model which promises the ability to semantically ‘mesh’ data at scale through application of the RDF triple model. This is another way in which we can break down the barriers and meaningfully link content together. This image of Tim Berners-Lee (TBL), the ‘father’ of Linked Data, is pretty out of date; you can tell this because the Linked Open Data cloud he has behind him in this TED talk is quite small. The cloud appears to grow exponentially each month, and here is a later, more current version. This is a model being explored and applied by the BBC, The National Archives, the British Library, medical journals, and several of the projects represented here. The possibilities of ‘Linked Data’ as the semantic web done right are beguiling in this context. Could Linked Data be the answer, the way to knit together content to create a meaningful ‘web of data’ for researchers?
http://richard.cyganiak.de/2007/10/lod/
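For anyone who has not met the RDF triple model mentioned above, here is a minimal Python sketch of the idea, not a real Linked Data toolkit. Every statement is a (subject, predicate, object) triple. The two "datasets", the example.org URIs and the helper names `merge` and `creator_names` are all invented for illustration; only the Dublin Core and FOAF property URIs are real vocabulary terms.

```python
# Hypothetical illustration of the RDF triple model using plain Python sets.
# The example.org URIs and both datasets are invented for this sketch.

DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

WORK = "http://example.org/work/1"
PERSON = "http://example.org/person/7"

# One provider describes a work and points at its creator by URI.
library_triples = {
    (WORK, DC_TITLE, "An Example Title"),
    (WORK, DC_CREATOR, PERSON),
}

# Another provider independently describes the same person URI.
archive_triples = {
    (PERSON, FOAF_NAME, "A. N. Author"),
}


def merge(*graphs):
    """Combine graphs; with triples, merging is simply set union."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def creator_names(graph, work):
    """Follow the creator link from a work to any names known for that creator."""
    creators = {o for s, p, o in graph if s == work and p == DC_CREATOR}
    return sorted(o for s, p, o in graph if s in creators and p == FOAF_NAME)


if __name__ == "__main__":
    web_of_data = merge(library_triples, archive_triples)
    print(creator_names(web_of_data, WORK))
```

The point of the sketch is that neither dataset can name the work's creator on its own; the answer only appears once the two are merged, and the merge works because both sides used the same global identifier for the person.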
But of course, any data theory on that scale will have its detractors. We appear to like creating polarizing dialogue around these issues. In February Roy Tennant, an outspoken tech expert who has worked with OCLC among others, posted a provocative piece on the ‘death’ of Linked Data. This post is well worth a read, not so much to help you join those debunking Linked Data, but for the balanced discussion in the comments section, which covers the potential of microdata and the symbiosis between the microdata approach, linked data, and Linked Data.
The return of structured data
Schema.org is a collection of schemas in the form of HTML tags that developers can use to mark up their pages. Microdata is an HTML specification that allows developers to nest semantics within the existing content of web pages using machine-readable tags.
Ed Summers: “it’s not about whether microdata is going to win, but that semantics published along with our html is going to drive new functionality in the applications we use everyday”
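To make ‘semantics published along with our html’ a little more concrete, here is a rough Python sketch, using only the standard library, of how an application might read microdata out of a page. The page fragment, the book and author values, and the `extract` helper are invented for illustration; `itemscope`, `itemtype` and `itemprop` are the microdata attributes, and `Book`, `name` and `author` are schema.org terms. It deliberately ignores nested items, so treat it as a toy rather than a conforming microdata parser.

```python
# Hypothetical sketch: pulling schema.org microdata out of an HTML fragment.
# The page content is invented; nested items are not handled.
from html.parser import HTMLParser

PAGE = """
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">An Example Book</span> by
  <span itemprop="author">A. N. Author</span>
</div>
"""


class ItempropCollector(HTMLParser):
    """Record the item type and the text of each element carrying itemprop."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current = None  # itemprop name awaiting its text, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemtype" in attributes:
            self.item_type = attributes["itemtype"]
        if "itemprop" in attributes:
            self._current = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.properties[self._current] = text
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract(html):
    """Return (item type, {property: text}) for a simple microdata fragment."""
    collector = ItempropCollector()
    collector.feed(html)
    return collector.item_type, collector.properties


if __name__ == "__main__":
    print(extract(PAGE))
```

A human reader sees only ‘An Example Book by A. N. Author’; the same markup tells a machine that this is a book, which string is its name and which is its author.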
Obviously, the people in this room represent a community that extends far beyond museums, libraries and archives. But in terms of our shared goal of having our content discoverable or useable via the web, this tactic is nonetheless relevant to us all, even if our challenges in achieving ‘openness’ differ.
• Approach taken by Discovery (screenshot of website pages & PDF)
• Open Metadata Licensing Principles
• CC-0 vs CC-BY
• Community consensus/building a critical mass or movement (again, minimizes risk)
We recommend that institutions and agencies should proceed on the presumption that their metadata is by default made freely available for use and reuse, unless explicitly precluded by third party rights or licences.
We strongly advocate that all metadata releases require licensing, for which institutions and agencies should adopt a standard open licensing framework that is suited to their purposes. Reference to permissible usage under the terms of a standard open licence will promote confident and appropriate use.
When licensing open metadata in the majority of circumstances, the standard Open Data Commons Public Domain Dedication & Licence (ODC-PDDL), the broadly similar Creative Commons CC0 licence or the UK Open Government Licence (OGL) will be appropriate.
Avoidance of variations to such standard licences will make it easier to combine data from different resources and will reduce repeated requirement for legal advice.
Highlight/list key points from principles here…
We need to open data from a technical perspective, exposing it in ways that ensure it is findable by Google and reusable by developers.
1. Discovery is heterogeneous. The Discovery ecosystem is a heterogeneous environment, encompassing a wide variety of users, resources and types of resources, domains, technologies and business models. Discovery balances the need for a degree of homogeneity to serve management and interoperability requirements, with a recognition of the importance of variety in any ecosystem.
2. Discovery is resource-oriented. Discovery is innately resource-oriented. It is a principle of Discovery that metadata resources may have intrinsic value, and that the ‘opening up’ of these to all will create more value as they are used, enhanced and combined with other resources.
5. Discovery is built on aggregations of metadata. Metadata aggregation is a foundational aspect of the Discovery vision. This might seem somewhat in opposition to the principle that Discovery is distributed (3); however, the Web is sufficiently unrestrictive that it allows both distribution and aggregation as useful strategies in certain contexts. Dempsey uses the terms diffusion and concentration to describe these two approaches and indicates how they are complementary.
3. Discovery is distributed. The Web is starting to be realised as a network where nodes are both client and server, functioning in potentially many different interactions with other nodes. This allows for, and even encourages, the possibility that systems operating in the Discovery ecosystem can be both providers of information resources and services at the same time that they consume and use other, remote resources and services. The idea of the Application Programming Interface (API), and principles of modular systems design, are important concepts for Discovery.
4. Discovery relies on persistent global identifiers. The resource-oriented architecture encourages the identification of information entities. In the Discovery ecosystem, such entities are typically metadata records, although there is growing interest in experimenting with a finer granularity of metadata in a Linked Data context. In any information system, such entities are uniquely identified. As Discovery deals with open data, such identifiers must be globally unique for the distribution of resources and services to work. The default global identifier scheme for the Web is the HTTP URI; however, there are other important schemes in use in the Discovery ecosystem.
6. Discovery works well with global search engines. Search Engine Optimisation (SEO) is the process of exploiting an understanding of the functions and algorithms of the major global search engines. With such an understanding, Web content providers can present web resources in such a way that they gain the optimum ranking in the indexes created by those search engines. SEO is a fully developed industry in the commercial sector, but many of its principles and techniques are well known and applicable to the Discovery ecosystem.
7. Discovery balances consensus with agility. Consensus on technical and information standards is what allows information systems in the Discovery ecosystem to interoperate. Discovery favours open standards, but is also pragmatic about the adoption of less open standards where they are in mainstream use. While there are, undoubtedly, benefits to be gained from consolidation, standardisation and consistency of approach in the Discovery ecosystem, it is also understood that there are domains and communities of practice within the ecosystem which take different approaches, use different standard
The point I am aiming to make here is that there is a combined set of criteria that makes these principles translate into successful, sustainable services. Some of the projects represented here today will be engaging with all or a few of these areas, but it’s a framework we’re beginning to formalise at the higher level to help us advise on project approaches to discoverability… The key areas…
Open material has no value if it isn’t used.
We need not only open materials or content, but tools to deliver value, and people to take them up (RP).
We are at a key moment where we are transitioning from just ‘getting the data’ (and building the applications) to a real data ecosystem in which data is transformed, shared, and integrated: we replace data pipelines with data cycles.
http://www.flickr.com/photos/smthng/2133360085/sizes/o/in/photostream/
Business case for open data
Many of the projects here are helping us find this out.
Open data creates the opportunity for enhancing impact; it allows data to be made freely available and innovatively reused.
To achieve the vision we need to publish more data openly and unambiguously.
Reduce barriers, and work towards a thriving ecosystem.
Break out of silos and walled gardens.
The Discovery initiative and this movement more broadly are about embracing and facilitating the growth of new business models, not only rethinking our value proposition but also reflecting on our very purpose. We hope you will join us, not in blind pursuit of an ideal but rather by contributing to the community dialogue about rationale and business case and consequently to the shared reservoir of open metadata.
New business models: Mendeley and bibliographic data; Talis investment in Linked Data.
OER, Jorum, Copac, etc.: articulating new value propositions beyond centralised aggregation.
Publishers: collaborating to compete.
KB+ becomes a mechanism to change practice among publishers who are charging libraries different rates for journal subscriptions; it is also an opportunity to start to do other things, for instance to leverage analytics and usage data to inform decisions about purchase and, for users, to inform decisions on what to read.
This is about recasting the value chain.