Presentation at the eHumanities group at the Meertens Institute (Amsterdam) on Thursday 18 April 2013.
Analysing media coverage across several types of media outlets is a challenging task for (media) historians. One specific kind of media coverage research investigates the coverage of political debates and how the representation of topics and people changes over time. The PoliMedia project (http://www.polimedia.nl) aims to showcase the potential of cross-media analysis for research in the humanities by 1) curating automatically detected semantic links between four data sets of different media types, and 2) developing a demonstrator application that allows researchers to deploy such an interlinked collection for quantitative and qualitative analysis of media coverage of debates in the Dutch parliament.
These two goals reflect the two perspectives on the development of a search system such as PoliMedia: data-driven and user-driven. In this presentation, Laura Hollink (VU) will present the data-driven perspective of linking different datasets and the research questions that arise in achieving this linkage: how can different types of datasets be combined, and what kind of research questions does the data make possible? Max Kemman (EUR) will present the user-driven perspective: what benefits can scholars gain from the linking of these datasets? What are the user requirements for the PoliMedia search system, and how was the system evaluated with scholars in an eye tracking study?
2. Who are we?
Laura Hollink
• Assistant professor at VU
• Modeling, linking and enrichment of data
• Data-driven research
• @laurahollink
Max Kemman
• Junior researcher at EUR
• Human-Computer Interaction
• User-driven research
• @MaxJ_K
eHumanities group - PoliMedia 2
PoliMedia team
Henri Beunders (EUR)
Jaap Blom (NISV)
Laura Hollink (VU)
Geert-Jan Houben (TU Delft)
Damir Juric (TU Delft)
Max Kemman (EUR)
Martijn Kleppe (EUR)
Johan Oomen (NISV)
Funded by CLARIN-NL
4. The research questions
• How is a person, subject or process covered & visualised by the media?
• How do debates and arguments develop over a longer period of time?
• Analysing the changing ideas, arguments and presentation in different media
7. Goal: explicit links to different media types in one system
8. PoliMedia system
PoliMedia Portal
• Browse: debate and date
• Search: debate and person
Linked collections
• Newspapers (KB)
• Television (Sound and Vision)
• Radio (KB)
• Staten Generaal Digitaal (KB)
Data-driven (Laura) & user-driven (Max)
10. Debate data
Handelingen der Staten-Generaal, or Dutch Hansard, from 1945-1995
Some provenance:
1. Transcripts are made of the complete debates of the Dutch parliament.
2. Published online by the government on http://www.statengeneraaldigitaal.nl/ (1818-1995) and http://officielebekendmakingen.nl/ (from 1995)
3. The PoliticalMashup project has translated government PDF and TXT files into XML, including URIs as identifiers, see http://politicalmashup.nl/
4. We build on that.
11. Structure of the debate data
Debate
• Metadata
• Topic 1
• Topic 2
• Speaker 1 / Content
• Speaker 2 / Content
• Speaker 3 / Content
• Speaker 1 / Content
Including:
• who, when, what
• identifiers for subparts of the debate
• chronological order of speakers
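The debate structure above could be modelled roughly as follows. This is only a sketch: the class and field names, the identifier scheme, the placeholder date and the nesting of speeches under topics are assumptions made for illustration, not the actual PoliticalMashup XML schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Speech:
    speech_id: str  # identifier for this subpart of the debate (hypothetical scheme)
    speaker: str    # who
    content: str    # what


@dataclass
class Topic:
    topic_id: str
    title: str
    # Speeches are stored in the order in which they were given.
    speeches: List[Speech] = field(default_factory=list)


@dataclass
class Debate:
    debate_id: str
    debate_date: date  # when
    topics: List[Topic] = field(default_factory=list)

    def speakers_in_order(self) -> List[str]:
        """Chronological order of speakers across all topics."""
        return [s.speaker for t in self.topics for s in t.speeches]


# Toy instance mirroring the slide; the date is a placeholder.
debate = Debate(
    debate_id="debate-1",
    debate_date=date(1975, 1, 1),
    topics=[
        Topic("debate-1.t1", "Topic 1", [
            Speech("debate-1.t1.s1", "Speaker 1", "..."),
            Speech("debate-1.t1.s2", "Speaker 2", "..."),
        ]),
        Topic("debate-1.t2", "Topic 2", [
            Speech("debate-1.t2.s1", "Speaker 3", "..."),
            Speech("debate-1.t2.s2", "Speaker 1", "..."),
        ]),
    ],
)
order = debate.speakers_in_order()
```

Giving every subpart its own identifier is what makes it possible to link a newspaper article to a single speech rather than to the debate as a whole.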
12. Media data
• Newspaper articles
– at the National Library of the Netherlands
– Many newspapers 1950-1995
– Text + images of newspaper layout
• Radio bulletins
– Transcripts of ANP news
• Newscasts
– in the Academia collection of the Netherlands Institute for Sound and Vision
15. Linked Data
• Data openly accessible in a semantic Web standard
• Easy to combine with other semantic Web data
• E.g. DBpedia data on politicians and parties.
16. Linking Debates to Newspaper articles that cover them
• Challenges:
– How to link documents that are so different in nature?
– Can we use the structure of the debates: people, chronological order of speeches, introductions to each new topic, etc.?
– How can we do this efficiently, using the access mechanisms of the archives?
17. Linking approach
Input: Debates
1. Detect topics in speeches → Topics
2. Detect Named Entities in speeches → Named Entities
3. Create queries from Topics, Named Entities, Name of speaker and Date of debate → Queries
4. Search newspaper archive → Candidate articles
5. Rank candidate articles → Links between speeches and articles
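The linking approach can be sketched end to end as below. Everything here is a simplified stand-in: word counts replace the topic model, capitalised words replace real Named Entity detection, an in-memory list replaces the newspaper archive, and the 7-day date window, the overlap-based ranking and all names and texts are assumptions made for this example only.

```python
import re
from collections import Counter
from datetime import date, timedelta
from typing import Dict, List

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "for", "must"}


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def detect_topics(text: str, n: int = 3) -> List[str]:
    """Stand-in for a topic model: the n most frequent non-stopwords."""
    counts = Counter(w for w in tokens(text) if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]


def detect_named_entities(text: str) -> List[str]:
    """Crude stand-in for NER: capitalised words that are not stopwords."""
    found = re.findall(r"\b[A-Z][a-z]+\b", text)
    return sorted({w.lower() for w in found if w.lower() not in STOPWORDS})


def create_query(topics: List[str], entities: List[str], speaker: str) -> List[str]:
    # Merge all ingredients and drop duplicates while keeping order.
    return list(dict.fromkeys(topics + entities + tokens(speaker)))


def search_archive(archive: List[Dict], query: List[str],
                   debate_date: date, window_days: int = 7) -> List[Dict]:
    """Candidate articles: published shortly after the debate, sharing a query term."""
    end = debate_date + timedelta(days=window_days)
    return [a for a in archive
            if debate_date <= a["date"] <= end and set(query) & set(tokens(a["text"]))]


def rank_candidates(candidates: List[Dict], query: List[str]) -> List[Dict]:
    """Rank by the number of distinct query terms found in the article."""
    return sorted(candidates,
                  key=lambda a: len(set(query) & set(tokens(a["text"]))),
                  reverse=True)


def link_speech(speech: Dict, archive: List[Dict]) -> List[str]:
    query = create_query(detect_topics(speech["text"]),
                         detect_named_entities(speech["text"]),
                         speech["speaker"])
    ranked = rank_candidates(search_archive(archive, query, speech["date"]), query)
    return [a["id"] for a in ranked]


# Made-up speech and articles, for illustration only.
speech = {"speaker": "Jansen", "date": date(1975, 3, 4),
          "text": "The housing shortage in Rotterdam is urgent. Housing policy must change."}
archive = [
    {"id": "a1", "date": date(1975, 3, 5), "text": "Jansen warns of housing shortage in Rotterdam"},
    {"id": "a2", "date": date(1975, 3, 6), "text": "Rotterdam harbour strike continues"},
    {"id": "a3", "date": date(1975, 3, 5), "text": "Football results from the weekend"},
    {"id": "a4", "date": date(1974, 1, 10), "text": "Housing shortage debate"},
]
links = link_speech(speech, archive)
```

In this toy run the article that mentions the speaker and the speech's subject ranks first, the off-topic article is never retrieved, and the article from the previous year falls outside the date window.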
18. Detect topics
The MALLET topic model package
• Unsupervised analysis of text
• “a Topic consists of a cluster of words that frequently occur together”
• [see http://mallet.cs.umass.edu/topics.php]
• Input:
– Text
– Number of iterations
– Number of topics/clusters
• Output:
– Words that cluster around one topic.
• Example:
– Text: a speech in a debate from 1975
– number of iterations: 2000
– number of topics: 1
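As a rough sketch, the example settings above could be passed to MALLET's `train-topics` command like this. The helper only assembles the argument list and does not run anything; the location of the `mallet` launcher and the file names are assumptions, and the input is expected to be a file prepared beforehand with MALLET's import step.

```python
from typing import List


def mallet_train_topics_args(input_file: str,
                             num_topics: int,
                             num_iterations: int,
                             topic_keys_file: str,
                             mallet_bin: str = "bin/mallet") -> List[str]:
    """Assemble a `mallet train-topics` command line (paths are hypothetical)."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,                    # text, already imported into MALLET format
        "--num-topics", str(num_topics),          # number of topics/clusters
        "--num-iterations", str(num_iterations),  # number of iterations
        "--output-topic-keys", topic_keys_file,   # words that cluster around each topic
    ]


# Settings from the example: one speech, 2000 iterations, 1 topic.
args = mallet_train_topics_args("speech-1975.mallet", 1, 2000, "speech-1975-keys.txt")
```

The resulting list could be handed to `subprocess.run(args, check=True)` on a machine where MALLET is installed.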
19. Create Queries
Queries are composed from:
• Named Entities from the speech
• Named Entities from the debate intro
• Topics from the speech
• Topics from the debate intro
• Name of speaker
• Date of debate
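A minimal sketch of how these ingredients might be combined into one query. The `QueryIngredients` class, the two switches and the shape of the returned dictionary are hypothetical; the switches are there so that different combinations of ingredients can be compared.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class QueryIngredients:
    speech_entities: List[str]  # Named Entities from the speech
    intro_entities: List[str]   # Named Entities from the debate intro
    speech_topics: List[str]    # Topics from the speech
    intro_topics: List[str]     # Topics from the debate intro
    speaker: str                # Name of speaker
    debate_date: date           # Date of debate


def build_query(ing: QueryIngredients,
                use_topics: bool = True,
                use_debate_intro: bool = True) -> Dict:
    terms = list(ing.speech_entities)
    if use_topics:
        terms += ing.speech_topics
    if use_debate_intro:
        terms += ing.intro_entities
        if use_topics:
            terms += ing.intro_topics
    terms.append(ing.speaker)
    # Drop duplicate terms while keeping their order.
    return {"terms": list(dict.fromkeys(terms)),
            "date": ing.debate_date.isoformat()}


# Made-up values, for illustration only.
ing = QueryIngredients(["Rotterdam"], ["Den Haag"], ["housing"], ["budget"],
                       "Jansen", date(1975, 3, 4))
entities_only = build_query(ing, use_topics=False, use_debate_intro=False)
speech_only = build_query(ing, use_debate_intro=False)
everything = build_query(ing)
```

With the toy values, `entities_only` contains just the speech entities and the speaker, while `everything` also draws on the debate intro.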
20. Evaluation
• Experiment 1: NEs in speech
• Experiment 2: NEs + topics in speech
• Experiment 3: NEs + topics in speech and debate
21. Results
• A linked open data set of Dutch parliamentary debates.
• With links to URLs of newspaper articles and radio bulletins at the Royal Library.
• A system that supports researchers in finding the data to answer their questions.
22. User-driven
What do scholars want?
• Why user research?
• Understanding the user [1, 2]
– Acceptance
– Performance
– Capabilities
– Weaknesses
• Goal
– Creating a system that is intuitive and helpful to the users
[1] Y. Liu, A. Osvalder, and M. Karlsson, “Considering the importance of user profiles in interface design,” May 2010.
[2] J. Preece, Y. Rogers, and H. Sharp, Interaction Design: Beyond Human-Computer Interaction. Wiley, 2002.
23. User research in the development process
• Examine search behaviour of users
– Survey regarding search strategies
– Interviews
• User wishes → user requirements
• Wireframes → Prototype
• Evaluation → New prioritization of remaining user requirements
• Final version
24. Survey
General search strategies
• N=294
• Popular search engines
[Chart: frequency of use per search engine. Scale: Very often, Often, Regularly, Sometimes, Never, Don’t know it. Search engines: Google, Google Images, Google Scholar, YouTube, JSTOR, KB, Flickr, EBSCO, Nationaal Archief, Web of Knowledge, Uitzending Gemist, Yahoo!, Bing, Academia.nl, Europeana, Scopus, Microsoft Academic Search, EUscreen, Arkyves]
26. Survey
Conclusions
• Google is the dominant search engine
• This has two consequences
1. People compare other search systems to their experience with Google
2. The search task is mainly performed by using keywords
27. Interviews
• N=5
• Quantitative (n=2) as well as qualitative (n=4)
• Main themes
– How do people search currently?
– What could be improved about current search systems?
– What should PoliMedia offer, given its goals?
• Results
– 39 user wishes
– Prioritized internally
• 19 user wishes deemed out of scope
• 20 user requirements
28. Interviews
Findings
• Key issue is to provide a good overview of data
– Why are search results retrieved
– How are search results ranked
• Assumptions of relevance
– Does a higher frequency of keywords indicate higher relevance to the query?
– Do longer segments (speeches and articles) indicate higher importance?
• Many wishes, more or less out of scope, to make current research easier
– Sentiment-metadata
– Context metadata
– Ability to export to own software
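To make the two relevance assumptions above concrete, here is a small sketch of a ranking that follows them literally: segments are ordered by keyword frequency first and by length second. This illustrates what the interviewees assumed, not how PoliMedia ranks its results; the function names and sample texts are made up.

```python
import re
from typing import List, Tuple


def keyword_frequency(text: str, keywords: List[str]) -> int:
    """Count how often any of the keywords occurs in the text."""
    toks = re.findall(r"\w+", text.lower())
    return sum(toks.count(k.lower()) for k in keywords)


def rank_segments(segments: List[str], keywords: List[str]) -> List[Tuple[int, str]]:
    """Order segments by keyword frequency, then by length (longest first)."""
    scored = [(keyword_frequency(s, keywords), s) for s in segments]
    return sorted(scored, key=lambda pair: (pair[0], len(pair[1])), reverse=True)


segments = [
    "housing",
    "budget talks",
    "housing and more housing policy",
    "a long speech that mentions housing only once",
]
ranked = rank_segments(segments, ["housing"])
```

Showing the score next to each result is one way to tell users why a result was retrieved and why it is ranked where it is.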
29. Wireframes
Search interface
• Clear and immediate keyword-search
• Support for Booleans and (some) Google-search operators
• Separate advanced-search
30. Wireframes
Search results
• Keyword search remains prominent
• User-chosen ranking of results
• Keyword highlighting
• Overview of related media
• Support for filtering
31. Wireframes
Debate page
• Keyword search remains prominent
• Overview of people in debate
• Easy access to related material
33. Evaluation
• Eye tracking evaluation of the search system
– Search system was still in development
• N=24
– History
– Political communication
• Goals
– Gain understanding of distribution of attention
– Collect general feedback on interface
34. Evaluation
Eye tracking
• Search bar received little attention after search results were displayed
• Facets received a lot of attention
• Page-search (CTRL+F) mainly received attention on debate page view
Viewing duration
Task          Search bar   Facets   Search results   Page-search
Known item    17%          22%      60%              2%
Exploratory   6%           12%      80%              2%
35. Evaluation
Usability feedback
• The ranking of search results was an issue for users
• The year-filter should be a slider
• The debate page should be greatly improved
– Better identification for speaker, party, topic, relevance to query
– Provide filters on debate-page as well
45. Conclusion
• PoliMedia: data- or user-driven?
• Continuous interplay
– Users gave input for usefulness of links
– Data limits what features we can offer to users
• Collection quality and usability are both critical to users [3]
[3] Xie, I. (2006). Evaluation of digital libraries: Criteria and problems from users’ perspectives. Library & Information Science Research, 28(3), 433–452. doi:10.1016/j.lisr.2006.06.002
Editor's notes
Create explicit links.
Go to archives, look up original data, decide whether there is a link to a debate.
Many systems, cross media analysis is difficult.
Debates.
used to check models, summarize the corpus, and guide exploration of its contents
Manual evaluation of relevance of media items to political speech:
? = unsure about relevance
0 = not relevant
1 = partially relevant
2 = relevant
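One possible way to tally judgements on this scale is sketched below. The ratings in the example are invented for illustration and are not results from the project; the function name is likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List

# The rating scale from the note above.
LABELS = {
    "?": "unsure about relevance",
    "0": "not relevant",
    "1": "partially relevant",
    "2": "relevant",
}


def summarise_ratings(ratings: List[str]) -> Dict[str, int]:
    """Count how many media items received each rating."""
    unknown = set(ratings) - set(LABELS)
    if unknown:
        raise ValueError(f"ratings outside the scale: {sorted(unknown)}")
    counts = Counter(ratings)
    return {label: counts.get(code, 0) for code, label in LABELS.items()}


# Invented ratings for five linked media items.
summary = summarise_ratings(["2", "1", "0", "2", "?"])
```

Keeping "?" as its own category, rather than folding it into "not relevant", preserves the distinction between an item judged irrelevant and an item the annotator could not judge.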
Context metadata:
• Roles of people
• Links to external databases
• Types of documents
• Types of presentation (dramatic, humorous, etc.)