The document summarizes a study that evaluated the coverage, completeness, and correctness of six automated web accessibility evaluation tools against the WCAG 2.0 guidelines. The study found that:
1) The tools covered 23–50% of the success criteria across the guidelines.
2) Completeness (the percentage of actual violations found) ranged from 14% to 38% across tools, and higher completeness correlated with lower correctness.
3) Tools performed better on less accessible websites and on WCAG 2.0 A-level success criteria.
1. Benchmarking Web Accessibility Evaluation Tools:
Measuring the Harm of Sole Reliance on Automated Tests
10th International Cross-Disciplinary Conference on Web Accessibility (W4A 2013)
Markel Vigo, University of Manchester (UK)
Justin Brown, Edith Cowan University (Australia)
Vivienne Conway, Edith Cowan University (Australia)
http://dx.doi.org/10.6084/m9.figshare.701216
3. Evidence
W4A 2013, 13 May 2013
Webmasters are familiar with accessibility guidelines
Lazar et al., 2004. Improving web accessibility: a study of webmaster perceptions. Computers in Human Behavior 20(2), 269–288
4. Hypothesis I
Assuming guidelines do a good job...
H1: Awareness of accessibility guidelines is not that widespread.
5. Evidence II
Webmasters put compliance logos on non-compliant websites
Gilbertson and Machin, 2012. Guidelines, icons and marketable skills: an accessibility evaluation of 100 web development company homepages. W4A 2012
6. Hypothesis II
Assuming webmasters are not trying to cheat...
H2: There is a lack of awareness of the negative effects of overreliance on automated tools.
7. Expanding on H2
Why we rely on automated tests
• It's easy
• In some scenarios it seems like the only option: web observatories, real-time...
• We don't know how harmful they can be
8. Expanding on H2
Knowing the limitations of tools
• If we can measure these limitations, we can raise awareness
• Inform developers and researchers
• We ran a study with 6 tools
• Computed coverage, completeness and correctness w.r.t. WCAG 2.0
9. Method
Computed Metrics
• Coverage: whether a given Success Criterion (SC) is reported at least once
• Completeness: true_positives / actual_violations
• Correctness: true_positives / (true_positives + false_positives)
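The two ratio metrics can be sketched directly from their definitions. This is a minimal illustration, assuming a tool report and the ground truth are both sets of (page, success criterion) violation pairs; the data structures and example values are hypothetical, not from the study.

```python
# Completeness and correctness over illustrative violation sets.
# A violation is modelled as a (page, success_criterion) tuple.

def completeness(report, ground_truth):
    """True positives over all actual violations in the ground truth."""
    true_positives = len(report & ground_truth)
    return true_positives / len(ground_truth)

def correctness(report, ground_truth):
    """True positives over all reported violations (true + false positives)."""
    true_positives = len(report & ground_truth)
    return true_positives / len(report)

ground_truth = {("p1", "1.1.1"), ("p1", "2.4.4"), ("p2", "1.4.3"), ("p2", "3.1.1")}
report = {("p1", "1.1.1"), ("p2", "1.4.3"), ("p2", "9.9.9")}  # one false positive

print(completeness(report, ground_truth))  # 2 of 4 actual violations -> 0.5
print(correctness(report, ground_truth))   # 2 of 3 reported violations
```

A tool can thus look thorough (high completeness) while flagging many non-problems (low correctness), which is exactly the trade-off the results later report.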
10. Method
Stimuli
Vision Australia (www.visionaustralia.org.au)
• Non-profit
• Non-government
• Accessibility resource
Prime Minister (www.pm.gov.au)
• Federal Government
• Should abide by the Transition Strategy
Transperth (www.transperth.wa.gov.au)
• Government affiliated
• Used by people with disabilities
11. Method
Obtaining the "Ground Truth"
Ad-hoc sampling → Manual evaluation → Agreement → Ground truth
12. Method
Computing Metrics
[Flow diagram: for every page in the sample, each tool T1–T6 produces a report R1–R6; every report is compared with the ground truth (GT) to compute per-tool metrics M1–M6.]
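The per-tool evaluation loop from the diagram can be sketched as follows. This is an assumed reconstruction: each tool is modelled as a function from a page to its set of reported violations, and the ground truth maps each page to its actual violations; all names are illustrative.

```python
# Sketch of the metric-computation pipeline: for every page in the sample,
# get each tool's report, compare it with the ground truth (GT), and
# aggregate per-tool completeness and correctness.

def run_pipeline(pages, tools, ground_truth):
    metrics = {}
    for name, tool in tools.items():
        tp = fp = actual = 0
        for page in pages:
            report = tool(page)       # R_i: report from tool T_i
            gt = ground_truth[page]   # compare with the GT
            tp += len(report & gt)
            fp += len(report - gt)
            actual += len(gt)
        metrics[name] = {
            "completeness": tp / actual,
            "correctness": tp / (tp + fp) if tp + fp else 1.0,
        }
    return metrics
```

Aggregating counts across pages first, then dividing, keeps a tool from being rewarded or punished disproportionately on pages with very few violations.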
14. Results
Coverage
• 650 WCAG 2.0 Success Criteria violations (A and AA)
• 23–50% of SC are covered by automated tests
• Coverage varies across guidelines and tools
15. Results
Completeness per tool
• Completeness ranges from 14% to 38%
• Variable across tools and principles
16. Results
Completeness per type of SC
• How conformance levels influence completeness
• Wilcoxon Signed Rank: W=21, p<0.05
• Completeness levels are higher for 'A level' SC
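For readers unfamiliar with the statistic on this slide, here is a from-scratch sketch of the Wilcoxon signed-rank W on paired scores (e.g. each tool's completeness on A-level vs. AA-level SC). This is purely illustrative of how W is computed; the study presumably used a standard statistics package, and the input values below are made up.

```python
# Wilcoxon signed-rank statistic W for paired samples:
# rank the absolute non-zero differences (average ranks for ties),
# sum the ranks of positive and negative differences, take the minimum.

def wilcoxon_w(paired_a, paired_b):
    diffs = [a - b for a, b in zip(paired_a, paired_b) if a != b]
    ranked = sorted(enumerate(diffs), key=lambda t: abs(t[1]))
    ranks = {}
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j][1]) == abs(ranked[i][1]):
            j += 1
        avg = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        for k in range(i, j):
            ranks[ranked[k][0]] = avg
        i = j
    w_plus = sum(r for idx, r in ranks.items() if diffs[idx] > 0)
    w_minus = sum(r for idx, r in ranks.items() if diffs[idx] < 0)
    return min(w_plus, w_minus)

print(wilcoxon_w([1, 2, 4], [3, 2, 1]))  # diffs -2 and 3 -> W = 1
```

A small W relative to the number of pairs indicates that the differences lean consistently in one direction, which is what licenses the "higher for A-level SC" conclusion.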
17. Results
Completeness vs. accessibility
• How accessibility levels influence completeness
• ANOVA: F(2,10)=19.82, p<0.001
• The less accessible a page is, the higher the completeness
19. Results
Correctness
• Tools with lower completeness scores exhibit higher correctness (93–96%)
• Tools that obtain higher completeness yield lower correctness (66–71%)
• Tools with higher completeness are also the most incorrect ones
20. Implications
Coverage
• We corroborate that 50% is the upper limit for automating guidelines
• Natural Language Processing?
  – Language: 3.1.2 Language of parts
  – Domain: 3.3.4 Error prevention
21. Implications
Completeness I
• Automated tests do a better job...
  ...on non-accessible sites
  ...on 'A level' success criteria
• Automated tests aim at catching stereotypical errors
22. Implications
Completeness II
• Strengths of tools can be identified across WCAG principles and SC
• A method to inform decision making
• Maximising completeness in our sample of pages:
  – On all tools: 55% (+17 percentage points)
  – On non-commercial tools: 52%
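The "maximising completeness" idea above amounts to pooling the true positives found by several tools. A minimal sketch, assuming tool reports and the ground truth are illustrative sets of violation identifiers (the numbers below are made up, not the study's 55%/52% figures):

```python
# Combined completeness of a set of tools: the union of their true
# positives over all actual violations. False positives do not help.

def combined_completeness(reports, ground_truth):
    found = set()
    for report in reports:
        found |= report & ground_truth  # only true positives add coverage
    return len(found) / len(ground_truth)

gt = set(range(10))
t1 = {0, 1, 2, 11}  # three true positives plus one false positive
t2 = {2, 3, 4}
t3 = {5, 6}
print(combined_completeness([t1, t2, t3], gt))  # 7 of 10 -> 0.7
```

Because different tools catch different violations, the union can exceed any single tool's completeness, which is why combining tools gains the extra percentage points reported on the slide.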
24. Follow up
Contact: @markelvigo | markel.vigo@manchester.ac.uk
Presentation DOI: http://dx.doi.org/10.6084/m9.figshare.701216
Datasets: http://www.markelvigo.info/ds/bench12/index.html