Lincoln2014 ddj (ppt)


Tony Hirst



1. An Intro to Data Journalism. Tony Hirst (@psychemedia), Computing & Communications, The Open University

2. What is journalism?

3. [sensemaking]

4. What is data?

5. [a particular type of source]

6. What is data journalism?

7. find stories tell stories

8. find stories

9. “Conversations with data”

10. ouseful.info - A Wrangling Example With OpenRefine: Making “Oven Ready Data”

11. Data Distributions Outliers

12. Trends and (anti)correlations...

13.

14. Data makes most sense when contextualised

15.

16. Data only makes sense when contextualised

17. [statistics] (the art of looking at one number in the context of other numbers)

18. tell stories

19. BE CAREFUL…. 82 + 4 + 6 ≠ 100%

20.

21. “When we create a graph, we design it to tell a story. To do this, we must first figure out what the story is. Next, we must make sure that the story is presented simply, clearly, and accurately, and that the most important parts will demand the most attention. When we communicate verbally, there are times when we need to raise our voices to emphasize important points. Similarly, when we communicate graphically, we must find ways to make the important parts stand out visually.” http://www.perceptualedge.com/articles/visual_business_intelligence/sometimes_we_must_raise_our_voices.pdf

22. https://www.youtube.com/watch?v=oP3c1h8v2ZQ

23. https://www.youtube.com/watch?v=lYpX4l2UeZg

24. “When we create a graph, we design it to tell a story. To do this, we must first figure out what the story is. Next, we must make sure that the story is presented simply, clearly, and accurately, and that the most important parts will demand the most attention. When we communicate verbally, there are times when we need to raise our voices to emphasize important points. Similarly, when we communicate graphically, we must find ways to make the important parts stand out visually.” http://www.perceptualedge.com/articles/visual_business_intelligence/sometimes_we_must_raise_our_voices.pdf

25.

26. [Google spreadsheets]

27.

28.

29.

30.

31.

32.

33.

34. How else can we look at data?

35.

36.

37. How do we ask questions of data?

38.

39.

40. underspend filetype:xls site:gov.uk Search limits

41. Structured queries underspend filetype:xls site:gov.uk SQL: select webPages where text like “%underspend%” and filetype=“xls” and domain=“gov.uk”

42. Count things

43. How do we interpret the answers?

44. Look for outliers Top 3… …bottom 3

45. Libraries

46. Look for similarities & differences

47.

48.

49.

50.

51. Look for trends

52.

53.

54.

55. Look for patterns & structure

56.

57.

58.

59.

60.

61. Data can confirm what we think we know

62. Data can surprise us and force us to rethink what we think we know

63. SchoolOfData.org

Editor's Notes

One take on what data journalism may or may not be… a lecture presented to journalism students at the University of Lincoln, UK, February 2014.
Let’s start with an easy(?!) question - what is journalism? One way of answering that question is to list some of the functions, or attributes, associated with it – informing, educating, holding to account, watchdog function, campaigning, contextualising.
Sensemaking seems to me to be an important part of it… In part contextualisation, in part identifying the bits that make the difference, the bits that make it important, the bits that make it news that people need to know.
Second question: what is data? National statistics, sports results, polls, financial figures, health data, school league tables, etc. etc. Is a book data? Or a speech? What if I split a speech up into separate words, count the occurrences of each unique word and then display the result as a “tag cloud”, or word frequency diagram?
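As a rough illustration of the word-counting idea, here is a minimal Python sketch. The sample sentence and the function name `word_frequencies` are made up for the example; a real speech would simply be a longer string, and the counts are the numbers a tag cloud would be drawn from.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each unique word occurs, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical snippet standing in for the transcript of a speech.
speech = "Data can surprise us. Data can confirm what we think we know."
counts = word_frequencies(speech)

# The most frequent words are the ones a tag cloud would draw largest.
for word, n in counts.most_common(3):
    print(word, n)
```

Treating the text as a table of (word, count) pairs is what turns a speech into something we can question as data.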
One way of thinking about data is that it is a particular sort of source, or a source that can respond to a particular style of questioning in a particular way. Another take on this is that many “data sources” are experts on a particular topic, experts that know a lot about a very particular class of facts.
So what is data journalism? One way is to think of it as a process, as exemplified by Paul Bradshaw’s inverted pyramid of data journalism. I see it more as a conversation in which data is one of the conversants. The conversational view also allows us to think about process, but more important, for me, is that in a conversation, it gets personal…
The inverted pyramid gives us one way of considering the data journalistic process, or at least identifying some of the steps involved in a data investigation. But there are many other ways of conceptualising the process – for example, finding stories and telling stories…
When it comes to finding stories, do we want to find stories in a dataset we are provided with, or use data to help draw out a story lead we have already been tipped off to?
One of the ways I like to work with data is to have a conversation with it – asking questions of it and then further questions based on the responses I get.
Sometimes it looks at first as if we have data in a form where we might be able to do something with it – then we realise it needs cleaning and reshaping. For example, in this case we have percentage signs contaminating numbers, data organised in separate sections (so how do we get a “well behaved” view over data from all the wards?) and different sorts of data: votes polled per candidate versus the size of the electorate in a particular ward, for example. Walkthrough: http://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/
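The walkthrough linked above uses OpenRefine. Purely as an illustration of the same kind of cleaning step, here is a small Python sketch that strips percentage signs and thousands separators so the values can be treated as numbers. The ward names and figures are invented, and `clean_number` is a hypothetical helper, not part of any tool mentioned here.

```python
def clean_number(raw: str) -> float:
    """Turn a scraped string such as ' 34.5% ' or '1,234' into a float.

    Percentages are returned as the number before the % sign (34.5, not 0.345).
    """
    cleaned = raw.strip().rstrip("%").replace(",", "")
    return float(cleaned)


# Invented rows: (ward, electorate as scraped, turnout as scraped)
rows = [
    ("Ward A", "4,120", "34.5%"),
    ("Ward B", "3,870", "41.2%"),
]

# Reshape into one "well behaved" record per ward with numeric fields.
tidy = [
    {"ward": ward, "electorate": clean_number(electorate), "turnout_pct": clean_number(turnout)}
    for ward, electorate, turnout in rows
]
```

Once every section of the original sheet has been passed through the same cleaning step, the records can be stacked into a single table covering all the wards.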
Maps can be used to pull out different sorts of relationships – for example, plotting markers in the centre of each MP’s ward coloured by the total value of travel expenses claimed in a particular area, we can easily see whether or not an MP is claiming an amount significantly different to MPs in neighbouring wards. In this case – travel expenses – we might expect (at first glance at least) a homophily effect – folk a similar distance away from Westminster should presumably make similar sorts of travel claim? At second glance, we might then start to refine our questioning – does ward size (in terms of geographical area) or rurality have an effect? Does an MP travel to and from home more than neighbours (or perhaps claim more in terms of accommodation in London)?
Sometimes we need to provide quite a lot of explanation when it comes to making sense of even a simple data visualisation – “what am I supposed to be looking at?”
Contextualisation can take many forms – Trinity Mirror Group have a data unit that produces partially packaged data stories and lines for regional titles, who can then add local colour, knowledge, interpretation and spin to the resulting story.
For many readers, it may be that data ONLY makes sense when appropriately contextualised. In passing, it’s also worth noting that the data you don’t collect sometimes affects the interpretation of the data you do… For example: http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/diary-data-sleuth-when-the-data-you-dont-collect-affects-the-data-you-do
In passing, it’s worth mentioning that one thing statistics does is help provide context. Is this number a big number in the greater scheme of things? Is this thing likely to happen by chance or is there a meaningful causal relationship between this thing and another thing? The chart in the corner is a reminder of how surprising probabilities can be. The chart shows the probability (y-axis) that two people share a birthday (the number of people is given on the x-axis). The chart shows that if there are 23 or more people in a room, there is more than a 50/50 chance that two of them will share a birthday (that is, share the same birth day and month, though not necessarily the same birth year). How many people are in the room? If it’s more than 23 – I bet that at least two people share a birthday (at least in terms of day and month).
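The birthday claim is easy to check for yourself. The sketch below assumes 365 equally likely birthdays and ignores leap years, which is a simplification; the function name is just for illustration. It works out the chance that everyone has a different birthday and subtracts that from 1.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday.

    Assumes every day of the year is equally likely and ignores 29 February.
    """
    if people > days:
        return 1.0
    p_all_different = 1.0
    for i in range(people):
        # The (i+1)th person must avoid the i birthdays already taken.
        p_all_different *= (days - i) / days
    return 1.0 - p_all_different


for n in (10, 22, 23, 30):
    print(n, round(shared_birthday_probability(n), 3))
```

With 22 people the probability is still a little under a half; the 23rd person tips it over.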
The other way of using data is to tell stories. But what does that even mean…?
Polls are a common source of stories based on data, either polls that are commissioned by a publisher with a view to generating a story, or polls commissioned by a lobbying group or PR firm to promote not only stories around a particular issue, but stories that follow a line favourable to the organisation that commissioned the poll (or detrimental to positions that whoever commissioned the poll is campaigning against). When presented with a press release written around a PR company commissioned poll, look to the raw data to see where the numbers that appear in the press release quotes actually come from. In the above example, I could claim that 96% of people (creative reading of the numbers) did not appear to disagree with the idea that press behaviour should be independently regulated (creative reading of the question; the repeated negatives also serve to further confuse what is, or isn’t, actually being claimed…). And when reading raw results, or quoting from them, take care which numbers you quote. Sometimes the presentation of the results can lead to you misreading them or the way they add up. Sometimes, two or more polls may be commissioned around the same topic and appear to give contradictory results. For an example of this, see: http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/two-can-play-game-when-polls-collide
Many polling organisations publish press releases featuring “highlight” results from a poll. The more reputable ones also publish copies of the poll or survey questions and the results that were returned. YouGov polls often split results down by political persuasion or newspaper preference, as well as demographically segmenting responses by gender, age or region. The majority of polling organisations publish the data via PDFs rather than “as data”, for example, in the form of spreadsheet datatables. Tools such as Tabula (URL) are making it increasingly easy to extract the data contained within PDFs into actual datatables. Your local techie should also be able to “scrape” the data from a PDF document and put it into a data form. For examples of how to scrape data as well as images from PDF documents, see: scraping data tables from PDFs: extracting images from PDFs: Even if you feel as if you can’t do this yourself, you should make yourself aware of what is possible and achievable by people who have the skills to perform these tasks.
Stephen Few has written several excellent books about creating data visualisations and data dashboards, although you shouldn’t necessarily believe everything he says! This quote gets across the idea that just as we use emphasis and tone in written communication, we can and should make use of emphasis and tone in charts. Many newspapers are starting to make use of charts that show several datapoints (for example, several bars in a bar chart) but highlight the one or two that are the focus of a particular storyline, the other points or bars being used to provide context. In chart design, “less is more” often works (this reflects a principle attributed to data visualisation guru Edward Tufte of using “least ink” when creating charts).
This video - showing part of a lecture by science fiction writer Kurt Vonnegut – shows how simple lines can tell archetypal stories. Note how the narration sets the scene - the axes are explained, then the line is constructed. When the x-axis represents time, remember that someone riding the line as it was constructed does not necessarily know what the future holds. When you see a line chart with time as an x-axis, remember that it shows a trace of a story that unfolded over time. Another powerful example of this can be found on YouTube – search for house price rollercoaster to find an animation in which house price values over time are visualised as an animated roller coaster ride…
This second clip shows Hans Rosling, the Swedish health statistician made famous by his “data performances”, narrating an animated data visualisation rendered using a dynamic bubble chart technique that he popularised via his Gapminder website. Note how the first 30 seconds of the clip are spent explaining the set up of the chart – what the axes mean, what the bubbles represent. When you see a rich data driven interactive on a website, how much coaching and contextualisation is provided to help the user/reader make sense of it? If you turn the sound off on the Rosling clip, how much sense do the moving bubbles make in terms of the story they tell without the benefit of Rosling’s narration? Can you tell where to focus your attention to pull out a meaningful storyline? Are there many possible storylines that can be pulled out? What tricks does Rosling use to focus your attention on – and illustrate – the story he is telling? Is there any sleight of hand in terms of not commenting on what some of the other bubbles are doing (is he using, or could he potentially use, misdirection to focus your attention away from possible stories he does not want you to pull out of the data)? For more examples of Rosling’s compelling performances, see the recent OU/BBC Two co-production “Don’t Panic – The Truth About Population Change” available on the Gapminder website: http://www.gapminder.org/videos/dont-panic-the-facts-about-population/
Few suggests that graphical communication requires stylistic devices that emphasise particular aspects of a graphic. Hans Rosling achieves this by pointing to items of interest and reinforcing the emphasis with both his narration and the use of overlays on the graphic itself. So how can we go about drawing emphasis within a static graphic or chart, such as one might find in a print publication?
To show one way of emphasising particular elements of a graphic, let’s produce a quick chart of our own. The first thing we need is some data – I’m going to use some data from the Winter Olympics, a grab of the medal table from the back end of the first week of the 2014 games. The question I want to explore is the extent to which the countries leading the medal table as measured by number of gold medals awarded also lead a ranking in which the table is ordered according to the total number of medals awarded. The data I’m going to use comes from a Wikipedia page. The medal table is contained within an HTML table. To get the data out of the page we are going to screenscrape the HTML table that contains the data. There are a variety of tools for doing this, from browser extensions, to scraper applications such as import.io, to environments such as Scraperwiki that provide a range of developer tools configured to support screenscraping based data collection. But the tool I’m going to use is…
…Google (spread)sheets, and in particular a formula that will import a particular HTML table – in this case, the 2nd table in the page – from a specified URL, in this case the URL of the Wikipedia page containing the medal table. The formula? =importhtml("URL", "table", tableNumber) On entering the formula, the spreadsheet will pull the data in from the Wikipedia page and make it available as spreadsheet data. We can now use the spreadsheet to create charts within the sheet itself. If the data in the Wikipedia page is updated, the data in the spreadsheet will be updated whenever the spreadsheet is refreshed.
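To give a feel for what a formula like this is doing behind the scenes, here is a rough Python sketch that pulls the Nth table out of a chunk of HTML using only the standard library. It is not how Google Sheets implements the formula; the class and function names are invented, the sample HTML is made up, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html: str, table_number: int):
    """Return the table_number-th table (counting from 1) as rows of strings."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[table_number - 1]


# Made-up page with two tables; the second one plays the role of the medal table.
page = (
    "<table><tr><td>ignore me</td></tr></table>"
    "<table><tr><th>Nation</th><th>Gold</th></tr>"
    "<tr><td>Country A</td><td>7</td></tr></table>"
)
medal_rows = import_html_table(page, 2)
```

In practice the HTML would be fetched from the page URL first; the point is simply that “the 2nd table in the page” is something a program can find by counting table tags.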
Whilst we could generate charts within the spreadsheet, I’m actually going to use an online tool called datawrapper (available at datawrapper.de). Datawrapper charts are starting to make an appearance in many online news reports, such as those published by the Guardian and Trinity Mirror’s ampp3d, so being familiar with this tool - and what you can do with it – could be a useful skill to have. To get the data in to datawrapper you can upload a CSV file, or paste a copy of the data in to the upload area. I’ve taken the latter approach, highlighting and copying the table from the spreadsheet and then pasting it in to datawrapper.
Having uploaded the data, we can configure several properties for each column. In many cases datawrapper should be able to detect what sort of content is contained within each column (for example, whether it is a number or a text field). If necessary, we can apply a limited amount of processing to the contents of a specified column. We can also choose to hide one or more columns from the displayed view. In this case, I am going to hide the Rank, Silver and Bronze columns.
We now get to choose the chart type – I’m going to go for a horizontal bar chart and select the default datawrapper style.
Different chart types have different configuration options. I’m going to choose to automatically sort the bars based on the selected value – notice the buttons in the chart that allow us to select whether to display the Gold medal count or the Total medal count.
Now we get to add some emphasis – remember emphasis? This is an example of how to show emphasis in a chart… In this case, I’m going to emphasise the top 2 positions in the Gold medal ranking – the “point” of the piece is to explore the extent to which these positions hold, or don’t hold, when we rank the table by total medal count. At this point, we can also give the chart a title, and add some provenance information describing and pointing to the source of the data.
Here’s an example of the final chart, with the ranking (automatically) sorted according to total medal count. Note how the order and positioning of the two highlighted countries has changed. The difference is further exemplified when switching between the Gold and Total counts by the use of animation – the highlighted bars draw the eye and allow you to better see how their relative positions change across each of the two ranking schemes.
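The comparison the chart makes can also be sketched in a few lines of Python. The medal counts below are invented purely for illustration (they are not the 2014 figures), and the helper names are hypothetical: we rank once by gold medals, once by total medals, and then see where the top two from the gold ranking end up.

```python
# Invented medal counts, for illustration only.
medals = [
    {"nation": "Country A", "gold": 7, "silver": 3, "bronze": 2},
    {"nation": "Country B", "gold": 6, "silver": 9, "bronze": 8},
    {"nation": "Country C", "gold": 5, "silver": 2, "bronze": 1},
]


def total(row):
    """Total medal count for one row of the table."""
    return row["gold"] + row["silver"] + row["bronze"]


def ranking(rows, key):
    """Nation names ordered from highest to lowest according to `key`."""
    return [r["nation"] for r in sorted(rows, key=key, reverse=True)]


by_gold = ranking(medals, key=lambda r: r["gold"])
by_total = ranking(medals, key=total)

# The two nations we would highlight in the chart: top 2 by gold medals.
highlighted = by_gold[:2]

# (position by gold, position by total) for each highlighted nation.
positions = {
    nation: (by_gold.index(nation) + 1, by_total.index(nation) + 1)
    for nation in highlighted
}
print(positions)
```

Keeping the highlighted set fixed while the sort order changes is what lets the reader follow the same two bars across both rankings.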
Having created the chart, you can now save it to your datawrapper account. An embed code for the chart is provided so that you can embed the chart within your own web page.
Bar charts are a very effective way of displaying particular sorts of information, such as counts. But what other ways are there of displaying data?
Datawrapper provides a variety of chart types, including: horizontal and vertical (column) bar charts; grouped bars that collate different bars according to groups (for example, election on election percentage of the vote for different political parties); stacked column charts (for example, for a selection of countries we could display a column showing the total number of medals constructed by stacking the individual gold, silver and bronze medal counts for those countries); line charts, which are widely used for plotting some value on the vertical y-axis against time on the horizontal x-axis; pie charts, to show proportions of a whole, and variants thereof, such as the donut chart (a pie chart with the middle cut out); simple data tables (never underestimate the power of a table – they can be really useful for showing specific values, and can be very powerful when allowing the user to sort the table either by ascending or descending values in particular columns); and maps, which as we shall see, can draw out very powerful relationships across data elements.
We’ve also seen some other “basic” charts that can be useful for displaying the distribution of data elements: the block histogram shows a count on the y-axis of data elements falling within particular ranges of values on the x-axis; the scatterplot allows us to plot two values against each other, for example height versus weight. These charts can provide us with clues about possible correlations or relationships between the two values. Some scatterplot tools further allow us to colour each point according to group membership so that we can look to see whether numbers are clustered or grouped according to group membership.
Visualising data is a powerful way of asking questions of data – what data points you choose to display and how you display them represent the framing of the question. What the data looks like is the response, but a response that often takes careful reading. The data source has drawn you the answer – you need to turn it into words that you can use to formulate further questions to check your understanding of the answer first provided. (Each question (each chart) typically leads to another… or more than one other…) Asking questions that have a graphical answer is one way of querying a data source – but are there other approaches? Let’s explore that a little more – what do we mean by asking questions of data?
A database that most of us use every day is the Google web search engine. We put in a key term or phrase and Google finds the web pages deemed most relevant to the query you have made, ranked according to a variety of criteria (and it could well be that who you actually are affects the ranking). Sometimes we may know what websites we actually want to search over. Google Custom Search Engines provide one way of defining your own search engine that just searches over the part of the web that you are interested in. One of the custom search engines I have developed searches over websites that act as wire services for press releases: https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc This allows us to track down the source of many a news item and explore the extent to which a given news story has just churned a press release. See also: http://blog.ouseful.info/2014/02/06/polling-the-news/ This post also describes how to create a bookmarklet that allows you to highlight a quote in a news report and search for press releases that contain that quote.
Here’s an example of the search engine in action – I’ve used a bookmarklet that takes a highlighted quote from a news story and passes it to the custom search engine, allowing me to easily see the source of the quote, and the story itself. I’ve also started defining another related custom search engine that allows us to search news sites and polling companies for stories about, and sources of, polls and surveys: https://www.google.com/cse/publicurl?cx=016419300868826941330:ewbi9skvnmq
Custom search engines are a powerful tool for helping us develop focused web search tools that limit results to a particular part of the web we are interested in, either by location or topic. We can also use (advanced) search limits in ‘everyday’ web queries using the major web search engines. For example, the query shown on this slide searches for the word underspend appearing in Excel spreadsheets (filetype:xls) that can be found on UK government websites (or more specifically, websites hosted on the gov.uk domain (site:gov.uk)). Another query limit combination I have found useful is: confidential filetype:ppt This can turn up presentations that have been delivered at closed corporate events but that have leaked on to the web…
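If you find yourself reusing the same search limits, it can help to see that a query like this is just a search term plus a set of name:value pairs. The Python sketch below shows one way of putting such a query together and turning it into a search URL; `build_query` is a made-up helper name, and the URL pattern assumes the standard Google search address with a `q` parameter.

```python
from urllib.parse import urlencode


def build_query(terms: str, **limits: str) -> str:
    """Join search terms with limits such as filetype:xls or site:gov.uk."""
    parts = [terms] + [f"{name}:{value}" for name, value in limits.items()]
    return " ".join(parts)


query = build_query("underspend", filetype="xls", site="gov.uk")

# URL-encode the query so it can be pasted into a browser or shared as a link.
url = "https://www.google.com/search?" + urlencode({"q": query})
print(query)
print(url)
```

Swapping in `build_query("confidential", filetype="ppt")` gives the second query mentioned above.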
Even if you don’t consider yourself a geek or database expert, writing advanced search queries using search limits is but a small step away from writing queries over databases themselves. One of the most widely used languages for querying databases is SQL. The above slide shows a simple, made up SQL query that, run over a very simple search engine database, could have a similar effect to the search engine query. The idea is that we select those webPages where the text content of the webpage contains the word underspend anywhere – the % signs denote wildcard characters so the underspend word can appear preceded or followed by any number of arbitrary characters. We also want the query to be limited to pages that have a particular filetype and domain. Far more complicated queries can be written over far more complex databases. What’s important is that you develop an idea of what sorts of database structure and query are possible, not necessarily that you can run and query such databases yourself. For more examples, see: Asking Questions of Data – Garment Factories Data Expedition – http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/ and Asking Questions of Data – Some Simple One-Liners – http://schoolofdata.org/2013/05/13/asking-questions-of-data-some-simple-one-liners/
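The made up query can be tried out for real against a toy database. The sketch below uses Python’s built-in sqlite3 module with an in-memory database; the webPages table, its columns and the three rows are all invented to match the slide, and the query is written out as full SQL (with the SELECT column and AND keywords the slide version glosses over).

```python
import sqlite3

# A throwaway in-memory database standing in for a "very simple search engine".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webPages (url TEXT, text TEXT, filetype TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO webPages VALUES (?, ?, ?, ?)",
    [
        # Invented example rows.
        ("https://example.gov.uk/budget.xls", "Reported underspend for Q3", "xls", "gov.uk"),
        ("https://example.gov.uk/report.pdf", "Note on underspend", "pdf", "gov.uk"),
        ("https://example.com/sheet.xls", "Departmental underspend", "xls", "example.com"),
    ],
)

# The % signs are wildcards: match 'underspend' anywhere in the text.
rows = conn.execute(
    """
    SELECT url FROM webPages
    WHERE text LIKE '%underspend%'
      AND filetype = 'xls'
      AND domain = 'gov.uk'
    """
).fetchall()
print(rows)
```

Only the first row satisfies all three conditions, just as only Excel files on gov.uk sites would come back from the search engine query.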
One of the simplest, but often one of the most useful, things we can do is to count things. You just need to be creative in what you count! One of the nice features of working with database query languages such as SQL is that we can write queries that count the number of responses and allow us to rank results on that basis. For example, in a database of public spending transactions with different companies, we could count the number of transactions with a particular company, sum the value of transactions carried out with a particular company, or find the companies with the largest total amount spent with them.
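Here is what that sort of counting query might look like, again as a sketch using Python’s sqlite3 module. The transactions table, supplier names and amounts are invented for the example; a real spending dataset would have many more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (company TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        # Invented spending records.
        ("Supplier A", 1200.0),
        ("Supplier B", 300.0),
        ("Supplier A", 800.0),
        ("Supplier C", 5000.0),
        ("Supplier B", 150.0),
    ],
)

# Count the transactions and sum the spend per company, biggest spend first.
summary = conn.execute(
    """
    SELECT company, COUNT(*) AS n_transactions, SUM(amount) AS total_spend
    FROM transactions
    GROUP BY company
    ORDER BY total_spend DESC
    """
).fetchall()

for company, n_transactions, total_spend in summary:
    print(company, n_transactions, total_spend)
```

Changing the ORDER BY clause to rank on the transaction count rather than the total is a different question of the same data, and may well produce a different league table.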
As has already been mentioned, a key part of the journalistic exercise is putting things into context.When working with data, interpreting what the data says often depends on understanding the context and more importantly, the caveats, that arise by virtue of asking a particular question of a particular dataset that has been collected in a particular way under particular conditions.That said, given a particular data set, are there any obvious questions we can ask of it?
When results are ranked, as for example in the case of league tables, there are often easy-picking stories to be had around the top three and bottom three positions. In national rankings, local news stories can be identified if your local schools or council appear at either of those extremes.
For contextualisation purposes, it often makes sense to look at distributions. Many summary statistics report the mean value, but looking at measures of variation, or spread, about the mean, as well as the position of the median value, can often change the context of a story.
If the lecture room has 20 students in it, each on an income of a £6,000 maintenance loan per year, the total income is £120,000 and the mean income is £6,000. If an academic in the room is on £40,000, the total income for the room is £160,000. With 21 people in the room, the mean income is now just a little over £7,500. If we define a poverty level as a mean income below £10,000, the members of the room are, on average, in poverty. If a senior academic such as a professor on an income over £65,000 wanders into the room, the total income goes to over £225,000. With 22 people now in the room, the mean income is now over £10,000: the room is out of poverty. The median income, however, is still £6,000.
As well as top, bottom, mean and median, we should also look to outliers. If Bill Gates or Mark Zuckerberg walks into a bar, the average net worth of people in that bar is likely to go up to a level of previously unimagined wealth.
Here are several reasons why you should pay attention to outliers: they may be ‘dirty’ or incorrect data points that need to be corrected and that may well raise questions about data quality; the outlier may truly be an outlier, a remarkable point and a story in its own right; the outlier may skew other measures, such as mean values or other summary statistics. In such cases, it may make sense to use other measures or to rerun the summary statistic without including the outlier values to get a better feel for how the other members of the distribution relate to each other.
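The lecture-room arithmetic can be checked with a few lines of Python. This is only a sketch of the example in these notes: the professor's income is given as "over £65,000", so the code assumes exactly £65,000 as a lower bound, and the name POVERTY_LINE is just a label for the example's £10,000 threshold.

```python
from statistics import mean, median

# Incomes from the lecture-room example. The professor's income is stated only
# as "over £65,000", so 65_000 is an assumed lower bound.
students = [6_000] * 20
with_academic = students + [40_000]
with_professor = with_academic + [65_000]

POVERTY_LINE = 10_000  # the example's illustrative threshold on mean income

for label, incomes in [
    ("students only", students),
    ("plus academic", with_academic),
    ("plus professor", with_professor),
]:
    avg = mean(incomes)
    mid = median(incomes)
    status = "below" if avg < POVERTY_LINE else "at or above"
    print(
        f"{label}: n={len(incomes)}, total=£{sum(incomes):,}, "
        f"mean=£{avg:,.0f}, median=£{mid:,.0f} ({status} the threshold)"
    )
```

The mean moves each time a higher earner walks in, while the median stays where most of the room actually is. Rerunning the mean over the students alone is the "without the outliers" version of the same statistic.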
This rather dense graphic is a view over local council spending data in my local area as it relates to spend on libraries. The separate charts show the accumulated spend over a period of time with different suppliers. The intention of the display was to provide an at-a-glance view of accumulated spend with different companies across different directorates and spending areas, to see whether any companies had a significant spend compared to other companies.
The table at the bottom shows the top of a league table of companies with the largest accumulated spend by directorate and expense type.
At first glance, the spend on phone lines with different suppliers seems to outweigh the spend on books. How can that be? Are the librarians spending their time calling premium rate phone lines?
If we guess at 20 libraries and a 6 month spend period, then assume that the phone lines correspond to broadband data bills, do the monthly payments per library still seem outrageous? These assumptions are testable via questions to the relevant authorities, of course, but they demonstrate the care we need to take when trying to understand why a number that appears to be large is that large.
See also: Local Council Spending Data – Time Series Charts http://blog.ouseful.info/2013/11/06/local-council-spending-data-time-series-charts/
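The back-of-envelope check suggested above amounts to dividing an accumulated total by the number of sites and months it covers. A minimal sketch follows, in which the function name is made up for illustration and the example total is hypothetical, not a figure from the council data; the defaults of 20 libraries and 6 months are the guesses made in these notes.

```python
def monthly_spend_per_site(total_spend, sites=20, months=6):
    """Spread an accumulated total evenly across sites and months."""
    if sites <= 0 or months <= 0:
        raise ValueError("sites and months must be positive")
    return total_spend / (sites * months)


# Hypothetical accumulated total, NOT a figure from the council spending data.
example_total = 60_000
print(f"£{monthly_spend_per_site(example_total):,.2f} per library per month")
```

A large headline number can look quite ordinary once it is expressed per library per month, which is the point of testing the assumptions before building a story on it.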
As well as looking for outliers, we should also look for similarities between things we expect to be different and differences between things we expect to be the same, or at least, similar.
Looking again at some of my local council’s spending data, I noticed that a search on “music” pulled back what appeared to be a shift in responsibility between directorates for spend on school music service provision.
An obvious question that follows is: if the service did change hands (something we can check), was there a resulting difference in the way that the directorates were spending? Could we, for example, identify whether any projects got dropped (or at least, renamed out of scope!)?
This forensic approach can also be used to track the consequences of a shift in control of a service, if we know it has happened. When a service changes hands, we can keep a note of the fact and then, a year on, look for evidence of whether treatment of the service has changed, at least in its consequences for spending.
See also: What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? http://blog.ouseful.info/2013/11/03/what-role-if-any-does-spending-data-have-to-play-in-local-council-budget-consultations/
When asking questions of data, one question can often lead to another.
For example, a query over my local council spending data about amounts spent with the local newspaper, the Isle of Wight County Press, identified a variety of expense types associated with those spending transactions. One such expense type was Advertising & Publicity. This led to me steering the conversation I was having with this expert (data) source on council spending onto a slightly different tack: so who else have you been spending advertising and publicity budgets with?
If you are in the position of paying energy supply bills – electricity and gas – you’ll probably be familiar with the idea that payments are set so that you tend to overpay on a monthly basis. After collecting the interest on your overpayments, the utility companies may eventually get round to sending you a small repayment to cover the excess (ex- of any interest, of course…).
Is the same true at the council level?
One thing I noticed in my local council’s spend with the supplier Southern Electric was that there appeared to be more than a few “negative payments”. So where were these coming from? The chart shown in this slide has positive payments made by date (not ordered on an evenly spaced timeline) in black, and the magnitude of negative payments shown in red. Where a red triangle sits over a black dot, this shows that a positive and a negative payment of the same amount were made on the same day. Why’s that?
Some days show several negative payments – again, what’s happening? There’s not necessarily anything suspicious going on, but what story does this chart appear to tell us, particularly in terms of the similarities in amount of certain positive and negative spends?
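The two patterns picked out in the chart (a positive and a negative payment of the same amount on the same day, and days with several negative payments) can also be searched for directly in the transaction data. The following is a minimal sketch only: the rows are made up for illustration and are not taken from the council's spending data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up rows for illustration only: (date, amount in pounds).
payments = [
    ("2013-04-02", 1250.00),
    ("2013-04-02", -1250.00),
    ("2013-04-02", 310.40),
    ("2013-05-07", -88.15),
    ("2013-05-07", -42.00),
]


def matched_reversals(rows):
    """Return (date, amount) pairs where +amount and -amount share a date."""
    by_date = defaultdict(lambda: {"pos": set(), "neg": set()})
    for date, amount in rows:
        pence = round(amount * 100)  # compare in whole pence, not floats
        if pence > 0:
            by_date[date]["pos"].add(pence)
        elif pence < 0:
            by_date[date]["neg"].add(-pence)
    return sorted(
        (date, pence / 100)
        for date, sides in by_date.items()
        for pence in sides["pos"] & sides["neg"]
    )


def days_with_several_negatives(rows, minimum=2):
    """Return the dates that carry at least `minimum` negative payments."""
    counts = Counter(date for date, amount in rows if amount < 0)
    return sorted(date for date, n in counts.items() if n >= minimum)


print(matched_reversals(payments))
print(days_with_several_negatives(payments))
```

Each date or amount this turns up is a question to put to the council rather than a finding in itself.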
Just by the by, this chart refines the question I’m asking of the spend with Southern Electric, asking for more information about positive and negative payments made on the gas and electricity accounts separately.
As well as similarities and differences, data can tell us tales about trends…
Regular releases from the ONS – the Office for National Statistics – provide bread and butter news stories on a regular basis according to a known schedule.
For example, monthly job seeker figures get a monthly write-up in OnTheWight, the hyperlocal news blog local to me. The report makes a comparison between the current figures and figures from the previous month and from the same month of the previous year. The aim is that we can see how the numbers have changed month on month, and year on year.
I started to explore a simple script that would take data directly from the ONS and produce assets that could be reused in a news story – for example, to produce a table showing the change in figures over recent months.
I also started to explore ways in which we could automate the production of prose from the data [code: https://gist.github.com/psychemedia/7536017]. For example, the following phrase was generated automatically from monthly figures:
The total number of people claiming Job Seeker's Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.
The words up and down were selected based on a simple if-then rule that compared figures to see which was the greater. The numbers and dates are pulled in from the data. The other words are canned phrases.
The automated production of text from data is something that has received attention from several companies, particularly in the area of baseball reports and financial reporting. See for example: http://blog.ouseful.info/2013/05/22/notes-on-narrative-science-and-automated-insight/
Being able to define sentences and natural language constructions that can be used as templates to display data in textual form is a skill that could well feed into specialist areas of data driven reporting. Identifying the patterns in the data that can be mapped onto natural language explanations of those patterns in a reliable way is another area in which wordsmiths, statisticians and developers may have to work together in the future.
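The if-then rule and canned phrases described above can be sketched in a few lines of Python. This is an illustration of the idea rather than the code in the linked gist: the function names are made up here, and the "unchanged" branch is an assumption added to cover the case where the two figures are equal.

```python
def change_phrase(current, previous):
    """Choose 'up', 'down' or 'unchanged' by comparing the two figures."""
    if current > previous:
        return f"up {current - previous} from {previous}"
    if current < previous:
        return f"down {previous - current} from {previous}"
    return f"unchanged from {previous}"  # assumed wording, not in the notes


def jsa_sentence(area, month, value, last_month, last_year):
    """Fill the canned sentence; last_month and last_year are (label, value)."""
    lm_label, lm_value = last_month
    ly_label, ly_value = last_year
    return (
        "The total number of people claiming Job Seeker's Allowance (JSA) "
        f"{area} in {month} was {value}, "
        f"{change_phrase(value, lm_value)} in {lm_label}, "
        f"and {change_phrase(value, ly_value)} in {ly_label}."
    )


sentence = jsa_sentence(
    "on the Isle of Wight",
    "October",
    2781,
    ("September, 2013", 2687),
    ("October, 2012", 3158),
)
print(sentence)
```

With the figures quoted above, this reproduces the example sentence: only the comparison words and the numbers vary, everything else is template.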
If we plot a line chart with some quantity against a time axis, we can often see increasing or decreasing trends over time. If we are looking for constant rates of increase in some value, it often makes sense to use a logarithmic scale to display that value on the y-axis. Periodic trends can also be seen as ‘waves’ appearing in the line over time, but other displays can draw out periodicity or seasonality in a more visually compelling way.
For example, in these charts – of jobless figures on the Isle of Wight once again – we have months ordered along the horizontal x-axis and the number of job allowance claimants on the vertical y-axis. The separate coloured lines represent different years. On the left, we use a legend to identify the lines; on the right is an example of labelling the lines directly.
The lines show strong seasonality in behaviour. The Isle of Wight being a tourist destination, job seeker figures tend to fall over the summer months. Putting lines for several years on the same axis allows us to compare annual cycles over time.
Another trend we can try to pull out is change over years for each given month. Here, the horizontal x-axis blocks out the months, as before, but within each month we have an ordered range of years. The line within each block thus represents the year-on-year change in numbers within a given month.
The step change within each month suggests that the way the figures were calculated changed significantly several years ago.
Further reading: a good guide to statistics as used by government, including a description of the way that “seasonal adjustments” are handled, is provided by the House of Commons Library’s Statistical Literacy Guide http://www.parliament.uk/business/publications/research/briefing-papers/SN04944/statistical-literacy-guide
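The two chart layouts described in these notes are the same monthly figures grouped in two different ways: one line per year with points ordered by month, or one block per month with points ordered by year. A rough sketch of that regrouping follows, using only the three claimant figures quoted earlier in these notes; the function names are made up for illustration and the plotting itself is left out.

```python
from collections import defaultdict

# The only monthly figures quoted in these notes: (year, month, claimants).
figures = [(2012, 10, 3158), (2013, 9, 2687), (2013, 10, 2781)]


def lines_by_year(rows):
    """One series per year, points ordered by month (seasonal comparison)."""
    series = defaultdict(list)
    for year, month, value in sorted(rows, key=lambda r: (r[0], r[1])):
        series[year].append((month, value))
    return dict(series)


def blocks_by_month(rows):
    """One block per month, points ordered by year (year-on-year change)."""
    blocks = defaultdict(list)
    for year, month, value in sorted(rows, key=lambda r: (r[1], r[0])):
        blocks[month].append((year, value))
    return dict(blocks)


print(lines_by_year(figures))
print(blocks_by_month(figures))
```

Each series in the first grouping would be drawn as one coloured line; each list in the second grouping would be drawn as a short line inside its month's block.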
As well as the patterns we can see over time by plotting data against a time axis, we can also look for patterns in space…
In part because they are so recognisable to the majority of people as an idea as well as an artefact, maps are widely used in many publications.
I have already mentioned how the use of a map to compare travel claims by MPs based on their constituency locations provided a way of making a particular sort of comparison between MPs (in particular, a comparison based on geographical location).
But we can take the idea of a map more generally, as a spatial distribution of points that are related in some way, with strong relations represented as spatial proximity.
Things that are close together on the page are taken to be close together in some sort of space, a space which may be conceptual or social, not just (or not even) geographic.
Take this map, for example, a map of Twitter users commonly followed by a sample of followers of @UL_journalism.
The map has been laid out so that Twitter users who are heavily interlinked are grouped closely together (for the most part, at least). A network statistic has been used in an attempt to colour clusters of nodes with high interconnection. The coloured regions thus represent a first attempt at identifying different groupings of Twitter user. You will note how the spatial layout algorithm and the grouping/colouring algorithm complement each other well – they both seem to tell a similar story, where the story is that certain groups of individuals are somehow alike.
About the technique: http://schoolofdata.org/2014/02/14/mapping-social-positioning-on-twitter/
Let’s have a closer look at some of the regions…
This area seems to be Twitter accounts that relate in large part to the University of Lincoln and its related organisations and activities.
This area of the map contains accounts associated with Lincoln more generally. Such a map may be useful for identifying companies that are used by students and as such may be useful leads for advertising agents looking to sell adverts appearing in university magazines or poster areas.
This area of the map actually conflates several different groupings, at least on my reading of it. In fact, it may make sense to try to find clusters within this group on its own and then recolour accordingly.
So what groups can I see? Bottom left there looks to be Lincoln local media outlets. Moving counter-clockwise between the 6 and 3 o’clock positions we see photography related users moving up into celebrities. As we move further up towards the twelve o’clock position, we see news sites, both “popular” and more industry related (@journalismnews, for example).
That there does not appear to be a strong independent cluster of journalists and industry related sites suggests that, from the sampled followers of UL_Journalism at least, there isn’t necessarily a very strong notion of following these industry lights…
One of the things to mention about data mapping and visualisation techniques is that they often tell us things we already (think we) know; in that sense, they are not news. But they may also tell us things we know in new, visually appealing ways. And by making use of such ‘confirmatory’ visualisations and displays we can build confidence within an audience that they know how to interpret these sorts of representation.
As the audience becomes comfortable reading the charts and making sense of data, when there is something new or surprising in the data, the surprise manifests itself in the reading of the data or chart.
For journalists working with data, developing a sense of familiarity with how to interpret and read data when it is just confirming what you already know helps to refine your senses for spotting things that are odd, noteworthy, or newsworthy.
Taking a little bit of time each day to read charts as if they were stories, and to look behind the data to find original sources, such as polls or data-containing news releases, and then compare the original release with the way it is reported, paying particular attention to the points that are highlighted and how the data is contextualised, will help you develop some of the skills you will need if you want to be able to identify, develop and treat some of the stories that your specialist source, the data, can provide you with, if only you ask…
And finally, a couple of handy books and resources on data journalism if you’re interested in reading more generally around the subject…
