Tools for Data Manipulation - UKAD Open Refine Workshop
1. Adrian Stevenson, Senior Technical Coordinator, Jisc Manchester
Tools for Data Manipulation
UKAD Open Refine Workshop, Jisc London, 18th March 2016
2. Tools for Data Manipulation - Workshop resources at http://data.archiveshub.ac.uk/workshops/ukad2016/ 2
Workshop Resources
Available from:
http://data.archiveshub.ac.uk/workshops/ukad2016/readme.html
Link to Open Refine and plugins
Link to example data used for workshop
Link to completed Open Refine project from today’s
workshop
3. Open Refine
OpenRefine (formerly Google Refine) is a powerful tool for
working with messy data: cleaning it; transforming it from
one format into another; and extending it with web
services and external data.
Main Uses:
• Explore data
• Clean and transform data
• Reconcile and match data
4. Installing and running Open Refine
Download from:
http://openrefine.org/download.html
Run it, then in a web browser go to: http://127.0.0.1:3333/
Select ‘Create Project’ and browse for the Archives Hub
example CSV data file
Note: May need to clear browser cache to see new projects
5. Clean and Transform - Facets and Clustering
Strip white space
Transform to upper case, title case
Edit cells > Split multi-valued cells, or Edit column > Split into several columns
Facet on label
Order by count
Cluster and rename rows
Undo
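As a rough illustration of what these steps do, here is a minimal Python sketch run outside Open Refine. It trims whitespace, counts values the way a text facet ordered by count would, and groups variants under a simplified key-collision fingerprint. The sample labels and the `fingerprint` helper are assumptions made for this example; Open Refine's own fingerprint method applies more normalisation than this.

```python
import string
from collections import Counter, defaultdict

# Hypothetical sample values standing in for a 'label' column.
labels = ["  Webb, Beatrice", "webb, beatrice ", "Beatrice Webb", "Shaw, George Bernard"]

def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort the unique tokens."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

# Strip white space, then count values as a text facet ordered by count would.
stripped = [v.strip() for v in labels]
facet = Counter(stripped).most_common()

# Group values that share a fingerprint; groups with several variants are cluster candidates.
clusters = defaultdict(set)
for v in stripped:
    clusters[fingerprint(v)].add(v)
candidates = {k: sorted(vs) for k, vs in clusters.items() if len(vs) > 1}
```

Here the three spellings of the Webb name land in one candidate cluster, which is the point at which a single preferred form would be chosen in the clustering dialog.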
6. Clean - Remove Duplicate rows
Sort on column with duplicates and reorder permanently
Facet duplicates to check
Watch for Open Refine switching from rows to records view
Edit cells > Blank Down
Facet by blank
Remove all matching
The essence of Open Refine is using facets and filters to isolate
rows, then invoking commands that affect all of those rows together
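The sort, blank down and remove sequence can be sketched in Python as follows. This is only an illustration of the logic, not what Open Refine runs internally; the sample rows and the `remove_duplicates` helper are assumptions made for the example.

```python
# Hypothetical rows of (person URI, name); sample values only.
rows = [
    ("person/2", "Shaw, George Bernard"),
    ("person/1", "Webb, Beatrice"),
    ("person/1", "Webb, Beatrice"),
]

def remove_duplicates(rows, key_index=0):
    """Keep the first row of each run of equal keys, after sorting on the key."""
    # 1. Sort so that rows with the same key sit next to each other.
    ordered = sorted(rows, key=lambda r: r[key_index])
    # 2. "Blank down" equivalent: a row is a repeat if its key equals the key above it.
    # 3. Drop the repeats, which mirrors faceting by blank and removing the matches.
    kept, previous = [], object()
    for row in ordered:
        if row[key_index] != previous:
            kept.append(row)
        previous = row[key_index]
    return kept

deduplicated = remove_duplicates(rows)
```

Note that the sketch keys on a single column, so rows that repeat the key but differ in other columns are dropped as well.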
9. Triples
Triple statements
» ‘Things’ have ‘properties’ with ‘values’
» Subject – Predicate – Object
Repository – Provides Access To – Archival Resource
Jane Austen – Is Author Of – Pride and Prejudice
Triples are the basis of RDF and Linked Data
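A triple can be modelled very simply in code. The sketch below represents the two example statements from this slide as Python named tuples; the `Triple` type and `describe` helper are illustrative names, and real RDF would use URIs rather than plain strings.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One statement: a subject, a predicate and an object."""
    subject: str
    predicate: str
    obj: str

# The two example statements from the slide, as plain strings.
triples = [
    Triple("Repository", "Provides Access To", "Archival Resource"),
    Triple("Jane Austen", "Is Author Of", "Pride and Prejudice"),
]

def describe(t):
    """Render a triple in Subject - Predicate - Object order."""
    return f"{t.subject} - {t.predicate} - {t.obj}"

statements = [describe(t) for t in triples]
```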
10. owl:sameAs
Hub Person - owl:sameAs - VIAF Person
<http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer>
owl:sameAs
<http://viaf.org/viaf/86607236> .
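Written out in full, this statement is a single N-Triples line in which `owl:sameAs` expands to its full URI. The following Python sketch builds that line from the two URIs on this slide; the `same_as_ntriple` helper is a hypothetical name used only for illustration.

```python
# Full URI that the owl:sameAs prefixed name stands for.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_ntriple(subject_uri, object_uri):
    """Return one N-Triples line asserting that two URIs identify the same thing."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."

hub = "http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer"
viaf = "http://viaf.org/viaf/86607236"
line = same_as_ntriple(hub, viaf)
```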
11. Matching Names to VIAF
May need to join columns together, for example to give a more
consistent name form, e.g. using:
cells["FamilyName"].value + ", " + cells["GivenName"].value + ", " + cells["Dates"].value
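For comparison, the same join can be sketched in Python. Unlike the expression above, this version skips blank cells rather than failing on them, which is a design choice for the example and not how the Open Refine expression behaves. The `join_name` helper and the sample row are assumptions.

```python
def join_name(row, columns=("FamilyName", "GivenName", "Dates"), sep=", "):
    """Join the non-blank values of the given columns into one name string."""
    parts = [(row.get(c) or "").strip() for c in columns]
    return sep.join(p for p in parts if p)

# Hypothetical row using the column names from the expression above.
row = {"FamilyName": "Webb", "GivenName": "Martha Beatrice", "Dates": "1858-1943"}
full = join_name(row)

# A row with a missing date still yields a usable name form.
partial = join_name({"FamilyName": "Webb", "GivenName": "Martha Beatrice", "Dates": None})
```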
12. Matching Names to VIAF
VIAF reconciliation service details at:
http://iphylo.blogspot.co.uk/2013/04/reconciling-author-names-using-open.html
May need to add as a ‘standard service’ under Reconcile >
Start reconciling. Service URL is:
http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php
Other recon services e.g. LCSH at:
https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
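To give a sense of what Open Refine sends to a standard reconciliation service, the Python sketch below builds a batch request URL with a `queries` parameter holding a JSON object of keyed queries. It only constructs the URL and makes no network call. The helper names and the `limit` value are assumptions, and the service at this address may no longer be running.

```python
import json
from urllib.parse import urlencode

# Service URL as given on the slide; availability is not guaranteed.
SERVICE_URL = "http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php"

def build_queries(names, limit=3):
    """Build the batch object: one keyed query (q0, q1, ...) per name."""
    return {f"q{i}": {"query": name, "limit": limit} for i, name in enumerate(names)}

def build_request_url(names, service_url=SERVICE_URL):
    """Return the request URL carrying the JSON-encoded batch of queries."""
    return service_url + "?" + urlencode({"queries": json.dumps(build_queries(names))})

url = build_request_url(["Webb, Martha Beatrice, 1858-1943"])
```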
13. RDF Export
Download RDF Refine Extension from http://refine.deri.ie/
Unzip
Open Project > Browse workspace directory
Create ‘extensions’ folder (if it doesn’t exist)
Copy the unzipped RDF Refine folder into the ‘extensions’ folder
Restart Open Refine
Need to create column with VIAF URIs for export:
"http://viaf.org/viaf/"+cell.recon.match.id
14. Matching Subjects to LCSH
Click the RDF button in the top right corner, select ‘Add reconciliation
service’, then ‘Based on SPARQL endpoint’.
Add following parameters:
Name: LCSH
Endpoint URL: http://sparql.freeyourmetadata.org/
Graph URI: http://id.loc.gov/authorities/subjects
Type: Virtuoso
Label properties: check only skos:prefLabel
15. Martha Beatrice Webb
Place of birth: Gloucester, England
Place of death: Liphook, Hampshire, England
Life dates: 1858-1943
Epithet: social reformer and historian
Family name: Webb
Image
from: Beatrice Webb letters
Beatrice Webb (1858 - 1943). Fabian Socialist, social reformer, writer,
historian, diarist. Wife, collaborator and assistant of Sidney Webb,
later Lord Passfield. Together they contributed to the radical
ideology first of the Liberal Party and later of the Labour Party.
from: Beatrice Webb, A summer holiday in Scotland, 1884.
Beatrice Webb (1858-1943), nee Potter, social reformer and diarist.
Married to Sidney Webb, pioneers of social science. She was
involved in many spheres of political and social activity including the
Labour Party, Fabianism, social observation, investigations into
poverty, development of socialism, the foundation of the National
Health Service and post war welfare state, the London School of
Biographical Notes
Works
Our Partnership
My Apprenticeship
The case for the factory acts
Beatrice Webb’s diaries; edited by Margaret Cole
The Diary
Knows
http://dbpedia.org/page/George_Bernard_Shaw
http://dbpedia.org/page/Sidney_Webb,_1st_Baron_Passfield
16. Contact
Adrian Stevenson
Senior Technical Coordinator
Jisc Manchester
http://www.jisc.ac.uk
adrian.stevenson@jisc.ac.uk
http://www.twitter.com/adrianstevenson
https://www.linkedin.com/in/adrianstevenson
17. CC License
This presentation is available under Creative Commons Non
Commercial-Share Alike:
http://creativecommons.org/licenses/by-nc/2.0/uk/
Editor's Notes
Hub used mainly for linked data project where we wanted to match to VIAF. Will come to later in the workshop.
Review options on import screen
Talk through the example data and the purpose of the columns
Facet
Mention that a facet on duplicates for person URI doesn’t necessarily mean you want to remove the rows, as the Arc Res URIs may be different. Depends on what you want to do.
More tutorials
http://kb.refinepro.com/2011/08/remove-duplicate.html
http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Deduplicate_entries
Explain why might want to reconcile to VIAF.
Other recon services at https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
http://www.w3.org/DesignIssues/LinkedData.html
If any of the cells in the columns are blank, the merge will fail for that row. To fix, create a facet of blank cells with "Text Facet" ⇒ "Customized Facets" ⇒ "Facet by Blank". Then use "Edit Cells" ⇒ "Transform ..." and enter a string with a space: ' '. This also has its limitations, as some names have an inconsistent number of commas.
Talk through faceting of judgement. How to check and accept reconciled rows.
Explain that this is why we have included the Hub URI and ArcRes URI: for manual checking.
Mock-up of the Linking Lives interface shows the way data is brought together.