2. What is LinkedGov?
A community project aiming to make public data more usable:
Cleaning
Improving access
Enriching
Linking
@danpaulsmith
3. Data flow
[Diagram: existing data (CSV, Excel, XML) > Import & core components > LinkedGov database (data is stored as machine-readable data) > Cleaning tasks, Question site, data.linkedgov.org]
4. What is Google Refine?
“A power tool for working with messy data”
“cleaning it up”,
“transforming it”,
“extending it”,
“and linking it”
6. Spreadsheet software
Spreadsheet software | Google Refine
Single-cell editing | Bulk-editing
Create & input data | Use & transform existing data
Document-based | Data-based
 | Allows extensions to be installed
15. What a machine understands
before
(CSV, TSV, Excel)
Column Column Column Column Column Column Column
Row number word number word date number number
Row number word number word date number number
Row number word number word date number number
Row number word number word date number number
Row number word number word date number number
16. What a machine understands
after
(machine-readable format)
Temp Name Gas/hour Postcode Date Water/hour Height
Building Celsius string kWh Postcode date m3 metres
Building Celsius string kWh Postcode date m3 metres
Building Celsius string kWh Postcode date m3 metres
Building Celsius string kWh Postcode date m3 metres
Building Celsius string kWh Postcode date m3 metres
17. The power of linking
Linking values: Latitude & longitude, Postcodes, Dates, Measurements
Datasets: NHS geo data, GP Surgery address data, NHS events data, GP Surgery energy use data
18. Data flow
[Diagram: existing data (CSV, Excel, XML) > Import & core components > LinkedGov database (data exists as linked data) > Cleaning tasks, Question site, data.linkedgov.org]
Me. Recent graduate. I have been building interfaces and visualisations for the last two years on government projects themed on transparency, big data, open data and linked machine-readable data. This is a presentation on an interface I’ve been building for LinkedGov recently.
When you’re looking for public data, it can be quite hard to find (you need to create accounts, you arrive at broken download links, searches fail due to a lack of metadata). Once you’ve found the data, it can be in the wrong format (so you then begin the time-consuming process of converting that data into a format you can work with). Then, once you’ve started working with the data, you can find it to be mysterious and lacking in explanation. So! LinkedGov makes life easier by: 1. Cleaning data (spelling mistakes, formats…). 2. Improving access (format of choice, APIs, high-quality metadata). 3. Enriching data (labels and descriptions for the data at a fine-grained level, using online vocabularies to describe what the data contains). 4. Linking datasets to each other.
The purple block here is Google Refine, with which data is imported. The imported data is then cleaned and enriched by the LinkedGov extension. The final step of the import process is to store the data in LinkedGov’s database in a machine-understandable format. With the data stored, we can then do a few things: 1. Create “cleaning tasks” for the community that help fix errors in the data. 2. Power a “question site” that lets non-technical users form queries against datasets. 3. Power a technical search site aimed at developers that helps them find the data they want.
Free. Open source. Runs in the web browser.
This is what Refine looks like. A little bit like spreadsheet software – you have columns and rows. Though you don’t have any toolbars allowing you to edit the style, insert charts, generate reports… That’s because…
Refine has some key differences from spreadsheet software. Spreadsheet software focuses on single-cell editing and inputting of data; Refine focuses on editing hundreds of rows & columns at the same time. ------ Spreadsheet software is largely for creating and capturing data; Refine is for users to reshape and transform existing data. ------- Spreadsheet software is very document-based, allowing you to style the data, use multiple pages or insert media; Refine is data-based, only allowing you to alter the structure and values of the data. ------ Refine also allows people to build extensions for it!
However. Cleaning and transforming data *is* complicated. A non-technical person will get confused. Google Refine is designed for programmers / frequent data-wranglers… It would be useful if the people who create or own the data were able to clean the data themselves (they, after all, should know the most about it).
Hides the technical stuff! Instead, asks the user questions about their data… Creates clean, formatted, machine-readable data.
So what are we asking the user? We ask them “can you spot any of these things in your data?”. Why do we ask these things? These four types of data are a good starting ground for linking datasets as they are common across most datasets. --------- If multiple datasets contain the same time span – you can try to compare them to see if there’s anything that connects. If multiple datasets contain the same measurements (e.g. kilowatt-hours) – it’s a good starting point to see if any of them relate. If multiple datasets contain latitude and longitude values – you can gather and compare data spatially and begin to plot things on maps, which everybody seems to love. If multiple datasets contain postcodes, and any of them match, you automatically have a number of different types of information for each postcode. --------- These questions come in the form of “wizards”, which lead the user through a small number of tasks: asking them to select a column and specify how the data is currently formatted, and then they press “Done”!
There are also a few other wizards: The “columns to rows” & “rows to columns” wizards help the user reshape their data in a way that helps us store the data. These are currently the most problematic wizards with regard to the wording and conveying the benefit or reason behind asking the user to do this. The “blank” values wizard BLANKS out any values in the data that represent “NULL” values – each dataset has its own; I’ve come across dashes, full stops and words like “missing” or “none”. The “codes and symbols” wizard asks the user to replace any codes or symbols with what they actually mean. For example, in some NHS data, a column was filled with lots of A’s, C’s, D’s and P’s – after googling about, I found out that they actually meant Active, Closed, Dormant and Proposed. So having their actual meaning present in the data is obviously a lot more helpful to people trying to use the data.
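As a rough sketch of what the “blank” values and “codes and symbols” wizards do to a column (this is not the extension’s actual code; the names NULL_MARKERS, STATUS_CODES, blank_out, decode and clean_column are made up for illustration, and the placeholder values and code meanings are just the ones mentioned above):

```python
# Hypothetical placeholder values and code meanings, taken from the examples in the talk.
NULL_MARKERS = {"", "-", ".", "missing", "none"}
STATUS_CODES = {"A": "Active", "C": "Closed", "D": "Dormant", "P": "Proposed"}


def blank_out(value):
    """Return None when the cell only holds a placeholder for 'no value'."""
    if value is None or value.strip().lower() in NULL_MARKERS:
        return None
    return value.strip()


def decode(value, meanings):
    """Replace a known code with its meaning; leave anything else untouched."""
    if value is None:
        return None
    return meanings.get(value, value)


def clean_column(cells, meanings):
    """Apply the blank-values step first, then the codes-and-symbols step."""
    return [decode(blank_out(cell), meanings) for cell in cells]


cleaned = clean_column(["A", "-", "P", "missing", "D", " C "], STATUS_CODES)
print(cleaned)
# ['Active', None, 'Proposed', None, 'Dormant', 'Closed']
```

In the real wizards the user supplies the placeholders and the code meanings, since each dataset uses its own.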
So, this is what Refine looks like before the extension has been installed… and after the LinkedGov extension is installed. The main addition to the interface is a new panel called the “Typing” panel, which houses the wizards. So, I’ll just walk you through a couple of wizards… Imagine I have some dates in my data and I click on the Date & Time wizard…
The wizard appears and it asks me to select any columns that contain dates… So I select two columns “open date” and “close date” by clicking on their headers…
We ask the user to specify each date part for each column – as the values could be in any combination: year-month-day, year-month, day-month, month-day…. You can see the column contains a day, month and year – but in a mixture of formats. You have words, dashes and slashes as separators… which the user doesn’t have to worry about. They then press “Finish” and the magic happens. The values are all formatted properly using the ISO standard, they are also linked to an online definition and breakdown of that specific date, and finally they are stored as machine-readable linked data.
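To give a feel for the formatting step, here is a minimal sketch, assuming the user has told us the order of the date parts and that every value has a day, a month and a year. It is not the extension’s code; to_iso, MONTH_NAMES and MONTHS are names I’ve made up, and it ignores the linking and storage steps entirely.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number       # full name, e.g. "january"
    MONTHS[name[:3]] = number   # three-letter abbreviation, e.g. "jan"


def to_iso(value, order=("day", "month", "year")):
    """Format one date cell as ISO 8601 (YYYY-MM-DD), given the order of its parts."""
    # Split on any run of separators: spaces, dashes, slashes, dots or commas.
    tokens = [t for t in re.split(r"[\s\-/.,]+", value.strip()) if t]
    if len(tokens) != len(order):
        raise ValueError(f"expected {len(order)} parts in {value!r}")
    parts = {}
    for name, token in zip(order, tokens):
        if name == "month" and not token.isdigit():
            month = MONTHS.get(token.lower())
            if month is None:
                raise ValueError(f"unknown month {token!r}")
            parts[name] = month
        else:
            parts[name] = int(token)
    return date(parts["year"], parts["month"], parts["day"]).isoformat()


print(to_iso("12-Jan-2011"))   # 2011-01-12
print(to_iso("12/01/2011"))    # 2011-01-12
print(to_iso("2011/01/12", order=("year", "month", "day")))  # 2011-01-12
```

The separators are thrown away up front, which is why the user only has to say which part is which.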
This is the measurements wizard. I select the “Avg. Temp” column. It then asks me to search for a measurement type by typing into a text box, which searches an online database of measurements. I click “Finish” after I’ve found the right measurement – “Celsius” – and then the measurements are stored using their online definition, which comes bundled with Wikipedia-like information such as alternative names, a description or related measurements (e.g. centimeters, meters, kilometers). So not only is the measurement being stored as an actual measurement, but because we’re using an online database to define it, it comes bundled with a lot of other relevant and potentially useful information for the end user.
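A small sketch of the idea, assuming a tiny in-memory catalogue in place of the online database of measurements (Unit, CATALOGUE, search_units and type_column are all hypothetical names, and the catalogue entries are just sample data):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A tiny stand-in for one entry in an online database of measurements."""
    label: str
    symbol: str
    description: str
    alternative_names: tuple = ()
    related: tuple = ()


# Hypothetical catalogue; a real wizard would query an online service instead.
CATALOGUE = [
    Unit("Celsius", "°C", "Scale and unit of temperature.",
         alternative_names=("degree Celsius", "centigrade"),
         related=("Kelvin", "Fahrenheit")),
    Unit("Metre", "m", "Unit of length.",
         alternative_names=("meter",),
         related=("Centimetre", "Kilometre")),
]


def search_units(text):
    """Return units whose label or alternative names contain the typed text."""
    needle = text.strip().lower()
    if not needle:
        return []
    return [
        unit for unit in CATALOGUE
        if needle in unit.label.lower()
        or any(needle in name.lower() for name in unit.alternative_names)
    ]


def type_column(cells, unit):
    """Pair every numeric cell with the chosen unit definition."""
    return [(float(cell), unit) for cell in cells]


celsius = search_units("cels")[0]
avg_temp = type_column(["18.5", "21"], celsius)
```

Each stored value now carries the whole definition with it, so the alternative names and related measurements come along for free.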
Here’s an example of what a machine understands about the data before and after using our extension. After saving a file in spreadsheet software, a machine, at best, only understands that the data is a bunch of columns and rows, containing numbers, words and dates. The ability for machines to understand the data is the magic that powers the question site, the dataset directory and makes linking datasets together a breeze.
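To make the “before” picture concrete, here is a minimal sketch of the kind of guessing a generic reader can do with a raw row. The function guess_type, the two date patterns and the sample row are my own assumptions for illustration, not something taken from Refine:

```python
from datetime import datetime


def guess_type(cell):
    """Classify a raw cell the way a generic reader might: number, date or word."""
    text = cell.strip()
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    # Only two date layouts are tried here; real data has many more.
    for pattern in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            datetime.strptime(text, pattern)
            return "date"
        except ValueError:
            continue
    return "word"


# Made-up sample row.
row = ["18.5", "Town Hall", "42", "SW1A 2AA", "2011-01-12", "3.2", "12"]
print([guess_type(cell) for cell in row])
# ['number', 'word', 'number', 'word', 'date', 'number', 'number']
```

Nothing here knows that 18.5 is a temperature or that “SW1A 2AA” is a postcode, which is exactly the gap the wizards fill.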
After using the wizards, machines are able to understand a little bit more about the data. Now machines have a more in-depth understanding of what the data actually means. The guesswork and inaccuracy are removed when searching and querying the data.
An example of how datasets can link… The red dataset contains latitude/longitudes. The blue dataset contains postcodes and latitude/longitudes. The green dataset contains postcodes and dates. And the orange dataset contains dates and measurements… All four datasets can be linked together by those linkable values. When you’re able to start linking datasets together like this – NEW information is created from a NEWLY acquired sense of UNDERSTANDING of those datasets.
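A rough sketch of that chain of links, using plain dictionaries and made-up records (link_on and the sample data are hypothetical, and matching on exact equal values is a simplifying assumption):

```python
from collections import defaultdict


def link_on(key, left, right):
    """Join two lists of records on a shared key, keeping every matching pair."""
    index = defaultdict(list)
    for record in right:
        index[record[key]].append(record)
    return [
        {**a, **b}
        for a in left
        for b in index.get(a[key], [])
    ]


# Made-up sample records, loosely following three of the datasets on the slide.
addresses = [{"surgery": "Surgery A", "postcode": "AB1 2CD", "lat": 51.5, "lon": -0.1}]
events = [{"postcode": "AB1 2CD", "date": "2011-01-12", "event": "Flu clinic"}]
energy = [{"date": "2011-01-12", "gas_kwh": 120.0}]

by_postcode = link_on("postcode", addresses, events)
by_date = link_on("date", by_postcode, energy)
print(by_date[0]["surgery"], by_date[0]["event"], by_date[0]["gas_kwh"])
```

Each join only works because the postcodes and dates were cleaned into one consistent format first. Note that joining on a date alone only says two things happened on the same day, so a real link would usually need more than one shared value.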
So that’s what the LinkedGov extension is and does. I’ll briefly finish off with what happens to the machine-readable data. Cleaning tasks can now be created for the community – asking them to use their expertise and judgement to correct problematic data. For example, a column may contain cryptic codes that represent types of NHS walk-in clinics. So a task may be to decode one of these values and replace it with what it actually means.
Here’s a screenshot of an example task – It’s asking the user to try to fix a value that contains two dashes instead of a decimal point. The user has the options to say “Yes I can fix this”, ”Refer this to an expert”, “It’s actually fine” etc.
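A minimal sketch of how a task like this one could be generated, assuming we only look for the “two dashes instead of a decimal point” pattern (make_task, DOUBLE_DASH_NUMBER and the task fields are hypothetical; the three options are the ones from the screenshot):

```python
import re

# A number whose decimal point has been replaced by two dashes, e.g. "7--25".
DOUBLE_DASH_NUMBER = re.compile(r"^(\d+)--(\d+)$")


def make_task(row, column, value):
    """Return a cleaning task for a suspicious value, or None if it looks fine."""
    match = DOUBLE_DASH_NUMBER.match(value.strip())
    if match is None:
        return None
    return {
        "row": row,
        "column": column,
        "value": value,
        "suggestion": f"{match.group(1)}.{match.group(2)}",
        "options": ["Yes I can fix this", "Refer this to an expert", "It's actually fine"],
    }


cells = ["12.5", "7--25", "9"]
tasks = []
for row, value in enumerate(cells):
    task = make_task(row, "Avg. Temp", value)
    if task is not None:
        tasks.append(task)
```

The suggestion is only a hint; the point of the task is that a person makes the final call.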
The question site
The question site is aimed at non-technical users. It allows them to form queries to retrieve data, without requiring any knowledge of query languages. They form the question in a human-readable way, using a mixture of selectable question fragments together with free text input. An example: Give me ALL … GP SURGERIES … in … LONDON…
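As a sketch of the idea only, here is a question built from fragments and run against a few in-memory records. The Question class, its fields and the sample records are my own assumptions; the real site queries the stored data rather than a Python list.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """A question assembled from a selectable fragment plus free text."""
    subject: str   # selectable fragment, e.g. "GP SURGERIES"
    place: str     # free text, e.g. "LONDON"

    def as_text(self):
        """The human-readable form the user sees."""
        return f"Give me ALL {self.subject} in {self.place}"

    def run(self, records):
        """Filter in-memory records; a real site would query the data store."""
        return [
            r for r in records
            if r["type"].upper() == self.subject.upper()
            and r["city"].upper() == self.place.upper()
        ]


# Made-up sample records.
records = [
    {"name": "Surgery A", "type": "GP surgeries", "city": "London"},
    {"name": "Surgery B", "type": "GP surgeries", "city": "Leeds"},
    {"name": "Clinic C", "type": "Walk-in clinics", "city": "London"},
]

question = Question("GP SURGERIES", "LONDON")
answers = question.run(records)
```

The user never sees the filter, only the sentence from as_text().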
And finally, the data site.
The data site is targeted at the developer community and is powered by the enriching parts of the data, such as: their metadata; what types of data are actually in the datasets (postcodes, dates, measurements); what they could potentially link to…
So that’s where we are so far. Feedback & questions?