The document describes the Humanities Networked Infrastructure (HuNI) project. HuNI aims to create a virtual laboratory that integrates 28 Australian cultural datasets and enables new forms of humanities research. It will harvest data from partner organizations, transform it into a searchable format and linked open data, and develop tools for researchers to discover, analyze, annotate, and share collections across the integrated datasets. The project is led by Deakin University with funding from NeCTAR and contributions from partner organizations.
2. CRICOS Provider Code: 00113B
NATIONAL E-RESEARCH COLLABORATION TOOLS AND RESOURCES (NeCTAR)
NeCTAR is a $47 million Australian Government project, conducted as part of the Super Science initiative and financed by the Education Investment Fund. The University of Melbourne, chosen by the Commonwealth Government, is the lead agent.
4. • Ensure that Australian cultural datasets, and the research associated with them, become part of the emerging international Linked Open Data environment
• Enable research enquiries to move easily from "what is?" to "where is?"
• Support the role of annotation and metadata in discovering new knowledge, or as the means to elucidate new knowledge
• Position data as both a subject and an object of analysis in the humanities
• Contribute to debates around standards for development and implementation
HuNI BROAD BENEFITS
5. • Enable humanities researchers to work with cultural datasets
more efficiently and effectively, and on a larger scale;
• Encourage the systematic sharing of research data between
humanities researchers (including the cultural dataset
curators themselves), the community and cultural
institutions;
• Encourage a greater level of cross-disciplinary and
interdisciplinary research, both within the
humanities/creative arts and between the
humanities/creative arts and other disciplines, and the wider
public;
• Support innovative methodologies such as network
analysis, game theory and ‘virtual history’ that rely on large-
scale datasets
HuNI: SPECIFIC BENEFITS
6. 1. Organisational level: the goals and processes of the institutions involved
2. Semantic level: the meaning of the exchanged digital resources
3. Technical level: implementing data interoperability requires both data integration and data exchange processes, as well as enabling effective use of the data that becomes available
(Pasquale Pagano, 'Data Interoperability', GRDI2020)
4. Project level: the advent of more complex 'big humanities' projects requires multiple, multi-disciplinary personnel, which in turn entails organising different workflows and expectations: e.g. the challenge of developing a comprehensive or consortial approach, a common definition of project method, etc.
INTEROPERABILITY
7. 1. A PARTNERSHIP
… a Deakin-led consortium
• Cultural data providers (10) – project co-operators
• Humanities software developer (1) – project co-developers
• eResearch organisations (2) – lead development agencies
8. HUNI PARTNER DATASETS
• Media (film, cinema, theatre, newspapers, magazines, advertising, music, live performances): AMHD, MAP, CAARP, Bonza, AFIRC, Circus Oz, AusStage
• Biographical (artists, designers, writers, significant people, scientists, Sydney demographics): DAAO, AustLit, AWR, ADB, DoS, EOAS
• Indigenous languages: AUSTLANG, Mura
17. Welcome to the Cinema and Audiences Research Project (CAARP) database: An online encyclopaedia of
cinema-going in Australia.
Data
This site contains information on film screenings and venues in Australia.
430,137 screenings
10,256 films
1,978 cinemas
1,649 companies
From 1846 to now
18. • NeCTAR investment of $1.33M
• Partner contributions of $480,000
• Partner in-kind contributions amounting to >$1M
A FISCAL COLLABORATION
19. COMMUNITY BUILDING
• Collated user stories (20)
• Online showcase events – the next one is 4 September 2013
• Live link to the latest alpha prototype on huni.net.au, with feedback buttons
• Wider beta launch at eResearch Australasia in October 2013
• Stay up to date through our monthly newsletter and blog feed
• Follow us on Twitter – @HuNIVL
20. The information design challenge: build an ontology, and use linked data and controlled vocabularies, so that data can be aligned and related.
• Reading the data. The characteristics of the data determine the ontological components selected and the major "entities" (aka "access points").
• Identified early as: people, organisations, events, relationships, places, dates, resources, and subjects.
• Components from existing ontologies are being reused or kept in our sights: CIDOC-CRM, FOAF, FRBR, FRBR-OO, BIBFRAME and PROV-O.
2. INTEGRATING MEANING
23. HUNI ONTOLOGY (all classes and object properties)
[Slide shows the full HuNI ontology graph: CIDOC-CRM classes (e.g. E21 Person, E39 Actor, E5 Event, E67 Birth, E69 Death, E53 Place, E52 Time-Span, E55 Type, E42 Identifier) aligned via subclass links with FRBR-OO classes (e.g. F1 Work, F2 Expression, F31 Performance, F10 Person, F11 Corporate Body), FOAF (Person, Group) and HuNI-specific SKOS classes (Occupation, Role, Collection, Item), connected by object properties such as P1 is identified by, P7 took place at, P98i was born, P4 has time-span and P2 has type.]
25. 3. HuNI DATA ARCHITECTURE
[Architecture diagram: on the partner side, each dataset (ADB, DAAO, CAARP, AFIRC, AusStage, …) updates and publishes its data; on the HuNI side, the Corbicula gateway harvests, transforms and ingests it, following data analysis and mapping, into two aggregates – a Solr Search Server (HuNI Data) and an RDF Triple Store (HuNI Linked Data). The HuNI Virtual Laboratory sits on top, supporting scholarly, public and citizen-researcher workflow tasks (data discovery: simple, advanced and deep SPARQL-based search; data analysis: save search results as a private collection, refine/expand, analyse and annotate; data sharing: export and share collections, analysis and search results) alongside admin tasks (registration and login, profile management, history recording, project management).]
26. A total of 28 Australian datasets are being harvested for integration into HuNI.
• Data gateway components, called HuNI Corbicula, are deployed on the NeCTAR Cloud to harvest the XML feed data and transform it into forms suitable for ingestion into two HuNI data aggregates: a Solr search server [HuNI Data] and a Jena RDF Triple Store [HuNI Linked Data]
DATA INTEGRATION
The harvesting process requires:
• Live data feeds deployed at the partner sites to publish updated partner data as XML
28. TECHNOLOGY STACK
• Front-end frameworks – AngularJS and Twitter Bootstrap (single-page web app)
• Tools hosting framework – OpenSocial via Apache Shindig
• Back-end framework – Spring MVC via Spring Roo
• Layer integration – RESTful web services
29. RESEARCH ACTIVITIES
A researcher with a HuNI account will be able to:
• Search the HuNI Data
• Save their search results as a private collection
• Refine their collection through additional searches
• Analyse and annotate their collection with their own assertions and commentary
• Export their collection for further analysis
• Publish and share their collection and research
30. RESEARCH ACTIVITIES 2
Scholarly researchers will also be able to perform a "deep search" of the graphs in the RDF Triple Store.
The large-scale aggregation of Linked Data makes explicit the relationships and connections between related records across all the partner datasets, enabling the researcher to construct more complex semantic queries.
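A hedged sketch of the kind of "deep search" query this enables, using the CIDOC-CRM terms shown on the ontology slide. The prefixes, graph shape and sample data are assumptions, not HuNI's actual endpoint; the naive matcher below just illustrates what the SPARQL pattern means.

```python
# Hypothetical SPARQL "deep search": find people and where they were born.
DEEP_SEARCH = """
PREFIX cidoc: <http://www.cidoc-crm.org/cidoc-crm/>
SELECT ?person ?place WHERE {
  ?person a cidoc:E21_Person ;
          cidoc:P98i_was_born ?birth .
  ?birth  cidoc:P7_took_place_at ?place .
}
"""

# Illustrative in-memory triples (subject, predicate, object).
triples = [
    ("ex:lawson", "rdf:type", "cidoc:E21_Person"),
    ("ex:lawson", "cidoc:P98i_was_born", "ex:birth1"),
    ("ex:birth1", "cidoc:P7_took_place_at", "ex:grenfell"),
]

def people_with_birthplace(triples):
    """Naive evaluation of the query pattern above."""
    persons = {s for s, p, o in triples
               if p == "rdf:type" and o == "cidoc:E21_Person"}
    births = {s: o for s, p, o in triples if p == "cidoc:P98i_was_born"}
    places = {s: o for s, p, o in triples if p == "cidoc:P7_took_place_at"}
    return [(p, places[births[p]]) for p in persons
            if p in births and births[p] in places]
```

The point of the RDF aggregate is that such a query spans all partner datasets at once, which a per-dataset keyword search cannot do.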
39. 4. THE PROJECT
• Project director / community liaison (20%)
• Project manager (100%)
• Technical coordinator (100%)
• Information services coordinator (90%)
• Community engagement (30%)
• Communication coordinator (20%)
• Administrative support (20%)
• Software developer(s)
[Governance chart: NeCTAR Directorate, HuNI Steering Committee, Team HuNI, Technical Working Group, Expert Advisory Group, Expert Data Group.]
42. HuNI: a virtual laboratory for the humanities
http://huni.net.au • @HuNIVL
Editor's Notes
Components of the CIDOC-CRM, FOAF and FRBR-OO ontologies have been reused for the integration of the initial datasets. This is a means to encode people, their existence (birth and death events), their occupations and their associations with organisations. More components have been added to record two further events, i.e. creation and production events, and to record works and expressions. Work is underway to plug in SKOS and structure vocabularies, using the data supplied (in EAC-type schemas) to manage the range of terminology, e.g. recreational, vocational, professional and occupational. This draft is based on a portion of the data analysed and a "mud map" (based on an assessment of data available through web interfaces). See the draft as a line diagram. A view of the ontology generated in the tool Protégé reveals FRBR-OO as an extension of CIDOC-CRM.
Draft v0.3, using the initial datasets: limitations in using FOAF to handle personal names (which are culturally situated) have been found. The CIDOC component E41_Appellation and its subclasses will now be used; collections are being dealt with; and further events are being added, e.g. E87_Curation_Activity to reflect actions of selection and collection development. Under discussion are: the inclusion of E90_Symbolic_Object to deal with citations (which are not feasible to strip apart and process but provide useful contextual information for an entity); the creation of "Floruit" as a time-related entity for E21_Person and E74_Group; categorising the datasets and collections as E89_Propositional_Object; and F3_Manifestation_Product_Type to deal with the disambiguation of portable and web formats of works.
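The encoding of people described above (birth and death events, occupations) might be sketched as a small triple-builder. The URIs, helper name and property spellings below are illustrative assumptions following standard CIDOC-CRM naming, not the HuNI codebase.

```python
# Hedged sketch: map a person record onto CIDOC-CRM-style components
# (E21 Person, a birth event, a death event, an occupation term).
def person_to_triples(uri, name, born=None, died=None, occupation=None):
    t = [(uri, "rdf:type", "cidoc:E21_Person"),
         (uri, "cidoc:P1_is_identified_by", name)]
    if born:  # birth modelled as its own event with a time-span
        t += [(uri, "cidoc:P98i_was_born", f"{uri}/birth"),
              (f"{uri}/birth", "cidoc:P4_has_time-span", born)]
    if died:  # likewise for death
        t += [(uri, "cidoc:P100i_died_in", f"{uri}/death"),
              (f"{uri}/death", "cidoc:P4_has_time-span", died)]
    if occupation:  # vocabulary term, per the SKOS work described above
        t.append((uri, "huni:hasOccupation", occupation))
    return t

triples = person_to_triples("ex:lawson", "Henry Lawson",
                            born="1867", died="1922", occupation="writer")
```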
This section of the HuNI ontology shows the "joins" and class relationships where the CIDOC-CRM and FRBR-OO ontologies align. The yellow-green bubbles record the CIDOC entities and the red bubbles record the FRBR entities. Bidirectional arrows indicate a "sameAs" relationship; unidirectional arrows indicate a sub-class relationship.
The integration of partner data into HuNI requires two technical components:
1. Live data feeds (at partner sites). Three technology options are available for the partners to publish their data as XML: jOAI, OAIcat and, for those who are not exposing their data via the OAI-PMH harvesting protocol, a custom-built solution that requires very little work to integrate at a provider's site. We are not harvesting all the data – only the primary entity classes (and as much of the uniquely identifying information as possible for each class) that are common "touch points" across many of the partner data sites: people, places, events and objects. Therefore, the lowest common denominator for making the partner data harvestable is a flat XML file per class entity, together with the uniquely identifying information. For example, for the person class entity, uniquely identifying information will include first name, last name, date of birth/death, bio and occupation.
2. A data gateway component called Corbicula. This technology is being deployed to harvest updated content from the partner XML data feeds and transform the data into forms suitable for ingestion into: a Solr search server (this aggregation of harvested XML records is referred to as 'HuNI Data'); and a Jena RDF Triple Store (this aggregation of stored RDF graphs is referred to as 'HuNI Linked Data').
Based on the data architecture set out in the original RFP, there is a requirement to harvest, transform and ingest data from each of the partner datasets into some sort of Linked Data store. Very early in the technical decision-making process it was agreed that RDF (Resource Description Framework) – a metadata modelling specification – would be the lingua franca, and that all the technical components would be developed to work with this Linked Data specification. So we began by:
• Making some of the partner datasets harvestable by HuNI: developing a harvest feed for those data providers who were technically able to publish their data in a standard export format/schema (EAC-CPF)
• Constructing the HuNI ontology and mapping partner data to this common data model. A number of standard cultural heritage ontologies were selected for examination because of their perceived close semantic fit to the nature and types of data in each of the 28 data sources: CIDOC-CRM, FOAF, FRBR-OO and PROV-O
• Deploying a data gateway component – called Corbicula – on the NeCTAR Cloud, which harvests and transforms the updated XML data from the partner feeds and ingests it into the RDF Triple Store. Once the mappings for a given data source are known, XSLT scripts are written to interpret the XML records and re-express (transform) them as RDF graphs, essentially capturing the relationships/links between records from all integrated datasets
But the integration into RDF has proven to be semantically and technically complex, because:
• The publishing format necessary to allow us to do the mappings is too high a technical barrier for most data custodians
• The data analysis and mapping to a common data model is proving time-consuming and complex
• The gateway component that harvests and transforms the data into RDF using XSLT has performance and memory issues
• The SPARQL-based search interface developments – where people can search and query the graphs – were proving too slow
As a result, after 10 months of development, only 6 partner data sources have completed their integration journey into the RDF Triple Store, and the search UI isn't very performant. So back in May it was flagged that there is a real project risk that we will not be able to fully transform all the partner data into Linked Data, and that only a small subset of partner datasets will be discoverable through the lab. This was a real problem, given the main objective of HuNI.
You're probably wondering why there are two data aggregates – why we mixed the data architectures. It was purely a project risk-management decision: harvesting, mapping, transforming and ingesting into Linked Data is complex and time-consuming, and there was a real danger that we wouldn't have a sufficient Linked Data layer on which to build the lab. So, in order to deliver some cross-dataset search capability within the project timeframe, we introduced a new development strand which sees the accelerated harvesting and integration of data into the Solr aggregate. The decision has been made to continue populating the RDF store with partner data for the remainder of 2013, and to work on the UI in 2014.
Populating the Solr search server is easy: HuNI periodically harvests the updated XML records from the partner feeds, processes the XML content via a suitable transform, and submits the transformed XML data to the Solr search server. The transformation of partner XML records into HuNI Linked Data, by contrast, is complex and time-consuming, and we've faced a number of technical issues – not surprising, since we're using a combination of largely unproven technologies at the scale required for HuNI deployment.
First, the harvested data had to be cleaned and mapped to a core HuNI ontology. A range of cultural heritage ontologies were examined as the starting point for building this core ontology framework. This has been an iterative process, determined by the nature of each data source and by the main types of data found in each source. The following standard ontologies are being aligned to create the HuNI Ontology:
• People and Organisations (using the CIDOC-CRM and FOAF ontologies)
• Items, Collections and Resources (using the PROV-O, CIDOC-CRM, FOAF and FRBR-OO ontologies)
• Events and Relations (using the PROV-O, CIDOC-CRM, FOAF and FRBR-OO ontologies)
• Place and Subject (using the PROV-O, CIDOC-CRM, FOAF and FRBR-OO ontologies)
Once the mappings to a common data model are known, the data needs to be technically transformed and ingested. This is made possible through the HuNI gateway component called Corbicula, which performs the following steps:
• Periodically harvests updated XML records from the source provider feeds
• Uses XSLT to interpret the XML records and re-express (transform) them as RDF graphs
• Stores the RDF graphs
The search feature needs to be based on the linked data, to take advantage of the semantic integration provided by the RDF aggregation.
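The "suitable transform" step for the Solr aggregate can be sketched as follows: re-express a harvested partner record as a Solr `<add><doc>` update message. The field names and record shape are illustrative assumptions, not HuNI's actual schema; actual submission would POST this XML to Solr's /update handler.

```python
import xml.etree.ElementTree as ET

def to_solr_doc(record_xml):
    """Transform one flat partner record into a Solr add/doc update message."""
    rec = ET.fromstring(record_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Emit the record id plus each child element as a Solr field.
    for name, value in [("id", rec.get("id"))] + [(f.tag, f.text) for f in rec]:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

update = to_solr_doc('<person id="adb-1234"><name>Henry Lawson</name></person>')
```

This is deliberately shallow: unlike the XSLT-to-RDF path, no ontology mapping is needed, which is why the notes describe the Solr strand as the quick route to cross-dataset search.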
But of course this is a VL project and not a data integration project
Support the non-linear research methods practised by humanities researchers. HuNI is about inclusivity, not exclusivity – using third-party authentication for login – because, for a community to form around HuNI, its user base needs to extend beyond scholarly researchers. It is also worth noting that any member of the general public interested in Australian culture can run a search across the related databases (the HuNI Data) and share their search results online – not just scholarly researchers. There are discovery limitations: whilst context is given for each record found, what isn't available are the known relationships between related records across the disparate data sources – so we're currently working on a 'Social Linked Data' feature.
Equipped with a full set of known facets and related data fields for each record type, researchers should be able to interact with, and construct complex queries of, the large-scale aggregation of Linked Data.
Link will be made available on huni.net.au soon
The lab is being designed to support the non-linear research methods practised in the humanities and creative arts, and will support a workflow centred around discovery, analysis and sharing. As part of the discovery interface a researcher will be able to:
• Run a free-text search across the aggregate and display their results
• Perform an advanced faceted browse of the aggregate by filtering their results by dataset and by the entity classes defined in the ontology: people, works, events, organisations, occupations/roles, times, places, collections, languages, objects
• Narrow their search parameters at the start of their search by browsing within pre-defined access points. These are likely to be people, works and events, since these entity classes are representative across all 28 data sources. Following the initial browse, the user can then filter their search results by dataset and the remaining entity classes
• Run a SPARQL query to interrogate the underlying Linked Data
The discovery interface is also going to enable serendipitous discovery, i.e. the ability to present information to users before they know what they want to search for: "You might also be interested in…" (based on the semantic relationships captured in the ontology). The notion of a generous interface is being included (based on some pre-defined daily query feeds), to give the researcher a sense of what is discoverable: "On this day…", most popular searches, most popular records. The result sets will be displayed in a number of forms, with a list being the default and map and timeline views optional. All search results will be displayed with hyperlinks that allow navigation to the source entity, and will show the connections between records as per the ontology mappings.
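The faceted browse described here can be sketched with a simple facet count and filter over search results. The records and field names below are illustrative assumptions, not HuNI's actual Solr schema (real facet counting would be delegated to Solr itself).

```python
from collections import Counter

# Illustrative search results drawn from the cross-dataset aggregate.
results = [
    {"title": "Henry Lawson", "dataset": "ADB", "entity": "person"},
    {"title": "While the Billy Boils", "dataset": "AustLit", "entity": "work"},
    {"title": "Grenfell", "dataset": "ADB", "entity": "place"},
]

def facet(results, field):
    """Return value -> count for one facet field (e.g. dataset, entity)."""
    return Counter(r[field] for r in results)

def filter_by(results, field, value):
    """Narrow the result set to one facet value."""
    return [r for r in results if r[field] == value]
```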
The LORE tool (developed at UQ) will be made available in the lab, where researchers will be able to: display existing connections between relevant records held within their virtual collection; and add further links between particular records, with commentary describing the relationship between them.
Researchers will have the option to export their Virtual Collection as a .csv file so they can undertake further computational analysis outside the HuNI lab, within their preferred tool environment. Whilst the lab will include a Tool Integration Framework specifying how third-party tools can integrate with the lab and work with HuNI data, we recognise that tools come and go, and that researchers create their own relationships with their tools of choice. So offering an export function is crucial.
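The .csv export might look like the sketch below: flatten each saved record into one row for use in an external tool. The field names and sample collection are illustrative assumptions, not HuNI's actual export schema.

```python
import csv
import io

# Illustrative Virtual Collection: one dict per saved record.
collection = [
    {"id": "adb-1234", "title": "Henry Lawson", "dataset": "ADB"},
    {"id": "austlit-9", "title": "While the Billy Boils", "dataset": "AustLit"},
]

def export_csv(collection):
    """Serialise a virtual collection to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "title", "dataset"])
    writer.writeheader()
    writer.writerows(collection)
    return buf.getvalue()

csv_text = export_csv(collection)
```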
Researchers will have the option to share their virtual collection, and their analysis findings, with other researchers via Facebook, Twitter and email.
The development of HuNI is being managed as a project. It has a collaborative governance structure in place so that all key project decisions are made as part of a consultative process, and uses the PRINCE2 methodology to help manage the project. There is a question of consortial project management, and a need to create best-practice exemplars at the project-management level. Staff are spread across four states; communication is via Skype or Google Hangouts, with some issues around discomfort with these communication technologies, etc.