Information Quality Assessment in the WIQ-EI EU Project
1. November 17th, 2011
www.know-center.at
Information Quality in
Social Media
Presentation at UNSL
Elisabeth Lex
2. Agenda
The Know-Center
The WIQ-EI project
Why Information Quality on the Web?
Selected Results
Conclusion
3. The Know Center – We are...
Austria’s Competence Center for Knowledge Management
and Knowledge Technologies
Link between Science and Industry
A multi-disciplinary team of 40+ Scientists and Developers
Over 575 publications since 2001
100 Master theses, 26 PhD theses, 4 habilitations
Editors of 2 Journals: Journal of Universal Knowledge
Management, Journal of Universal Computer Science
Organizer of the International Conference on Knowledge
Management and Knowledge Technologies (I-KNOW)
4. The Know Center
2 Areas of Research:
Knowledge Relationship Discovery:
Detecting semantic entities, semantic relations in
unstructured data
Cross-language and cross-domain search and retrieval
Automatic analysis of information structure and quality
User interfaces for visual analysis of large information
repositories
Knowledge Services:
Web 2.0, Collective Intelligence and Social Network Analysis
Semantic Technologies, Semantic Web, Semantic Retrieval
Communication and Collaboration Technologies
Mobile Technologies
5. The WIQ-EI Project - Goals
Web Information Quality Evaluation Initiative
3 Objectives:
Development of Web Content Information Quality Measures
Plagiarism Detection and Authorship Attribution
Multilingual Opinion and Sentiment Mining
Derive algorithms, tools and test data sets
6. The WIQ-EI Project - Implementation
On a global scale:
Researcher exchanges between organisations from
European (Austria, Germany, Spain, Greece) and
non European countries with expertise in topic
relevant fields (Argentina, Mexico, India)
Carry out research secondments, training and
dissemination activities, challenges, workshops
10. Introduction
On the Web - large amount of potentially useful content
Navigating is challenging
Web is changing: User Generated Content, Social Media
- Social media up to date
- Wide audience, highly dynamic
- Open to (almost) anyone
- Powerful, e.g. for media resonance analysis
Information Quality of Social Media is questionable!
11. What is Information Quality?
A multi-dimensional concept [Klein, 2001]
Different Types of Information Quality (IQ) [Knight2005]
E.g. [Wang1996]:
Intrinsic IQ: Accuracy, Objectivity, Believability,
Reputation
Accessibility IQ: Accessibility, Security
Contextual IQ: Relevancy, Value-Added, Timeliness,
Completeness, Amount of Information, Presence of Author
information [Katerattanakul1999]
Representational IQ: Interpretability, Ease of
Understanding, Concise Representation, Consistent
Representation
15. Information Quality – Link to Information
Retrieval, Data Mining
The Information Retrieval Process
Text Mining
- Information Extraction
- Clustering
- ...
Enables retrieval of core information from unstructured text
Faceted Search
17. Information Quality – Link to Information
Retrieval, Data Mining
The Information Retrieval Process
Text Mining
Faceted Search
IQ Dimensions:
- Objectivity
- Accuracy
- ...
18. Our work – Focus on Media Domain
Goal: Assess intrinsic Information Quality in social
media, traditional media, arbitrary Web content
Several IQ dimensions:
Objectivity
Emotionality
Credibility
Readability
In-depth versus Shallow
Expert versus Non-Expert
Personal versus Official
19. Agenda
The Know-Center
Why Information Quality in Media Domain?
Selected Results
Conclusion
20. Results
Information Quality Dimension: Objectivity
Task: Objectivity Classification in Blogs
Use features based on style properties
Dataset: TREC Blogs08 - 83 blogs, 12844 blog posts
Results:
Accuracy of 87% for Objectivity
Classification in Blogs
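To make "features based on style properties" concrete, here is a minimal Python sketch. The slide does not list the features that were actually used, so the three ratios below (first-person pronouns, exclamation marks, average word length) and the name `style_features` are illustrative assumptions only, not the features behind the 87% result.

```python
import re

# Assumption: first-person pronouns hint at subjective writing.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def style_features(text):
    """Return a few simple, hypothetical style ratios for one blog post."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty posts
    n_chars = max(len(text), 1)
    return {
        "first_person_ratio": sum(t in FIRST_PERSON for t in tokens) / n_tokens,
        "exclamation_ratio": text.count("!") / n_chars,
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
    }


subjective = style_features("I love this! We had so much fun, my friends!")
objective = style_features("The council approved the budget on Tuesday.")
```

Feature vectors of this kind could then be passed to any supervised classifier trained on posts labelled objective or subjective.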
21. Results
Information Quality Dimension: Credibility
Rank blogs by credibility
Compare blogs with credible source:
Quantity structure
Content similarity: Nouns, Verbs + Adjectives
Dataset: APA news articles, crawled blogs
Results:
Average precision of 83% for blog credibility ranking
Correlation between quantity structures of blogs and news
e.g. Query “Frankreich”, Pearson Correlation Coeff: 0.79
[Juffinger, Granitzer, Lex 2009] Blog credibility ranking by exploiting verified content. In Proc. of WICOW at WWW 2009.
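The correlation reported above can be illustrated with a short Python sketch. It assumes that a "quantity structure" is the number of documents matching a query per time unit (for example per day); the counts below are made-up illustration data, not values from the APA or blog corpora.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily counts of documents matching one query.
news_per_day = [2, 5, 9, 4, 1]
blogs_per_day = [1, 4, 7, 5, 2]
r = pearson(news_per_day, blogs_per_day)
```

A value of `r` close to 1 means the blog posts rise and fall with the news coverage of the same query.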
22. Results
Web Genre and Quality Classification
ECML/PKDD Discovery Challenge 2010
Task 1: Web Genre and Quality Facets
News/Editorial, Educational, Discussion, Commercial, Personal/Leisure, Web Spam
Bias, Trustworthiness, Neutrality
Task 2: English Content Quality: Combination of Facets
Quality Score
Task 3: Multilingual Content Quality: German, French
Dataset: English, German, French Web hosts: NLP
Features, Content Features, Terms, Links
Approach: Ensemble Classifier Approach (J48, CFC, SVM)
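The slide does not say how the outputs of the three classifiers were combined. As one common option, and purely as an assumption, the Python sketch below uses majority voting over per-classifier labels; `majority_vote` is a hypothetical helper, and the labels are taken from the genre facets listed above.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    `predictions` holds one label per classifier. On a tie, the label
    that appears first in the list wins.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical outputs of three classifiers for one Web host.
host_label = majority_vote(["Educational", "News/Editorial", "Educational"])
```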
24. Results
Web Genre and Quality Classification
Challenges:
Unbalanced and low-quality training data (training data also
contained Hungarian, Czech, ... hosts)
News and Educational hard to separate
Too little training data for German and French hosts
Results:
Method performs best for Educational/Research (NDCG 0.688),
Commercial (0.694), and Personal/Leisure (0.583)
English quality task: NDCG 0.844
Multilingual quality task: Use topic independent features from English
hosts
German: NDCG 0.792
French: NDCG 0.823
[Lex et al., 2010]. Assessing the quality of Web content. In Proceedings of the ECML/PKDD Discovery Challenge.
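For readers unfamiliar with the NDCG figures above, the following Python sketch shows how the measure is computed for one ranked list. It assumes the common formulation with linear gain and a log2 rank discount; the slides do not state which NDCG variant the challenge used.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks from 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of hosts, in the order a system ranked them.
perfect = ndcg([3, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 3])
```

A perfect ranking yields 1.0, and any ranking that places less relevant hosts above more relevant ones scores lower.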
25. Agenda
The Know-Center
Why Information Quality in Social Media?
Selected Results
Conclusion
26. Conclusions
Summary
Information Quality (IQ) consists of multiple dimensions
Depends on Use Case
BUT: Several dimensions are commonly agreed
upon
IQ dimensions can be combined in one quality score
Supervised Classification often used to assess IQ
However, training data needed!
Simple style based features suited to assess IQ
dimensions
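One simple way to combine IQ dimensions into a single quality score is a weighted average, sketched below in Python. The slides do not give the combination that was actually used, so the function name, the dimension scores and the weights are all assumptions for illustration.

```python
def quality_score(dimension_scores, weights):
    """Weighted average of per-dimension scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in dimension_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        score * weights[name] for name, score in dimension_scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical scores and use-case-specific weights for one document.
scores = {"objectivity": 1.0, "credibility": 0.5}
weights = {"objectivity": 1.0, "credibility": 1.0}
overall = quality_score(scores, weights)
```

Changing the weights is one way to reflect that the importance of each dimension depends on the use case.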