The document summarizes a study that evaluated the coverage, completeness, and correctness of six automated web accessibility evaluation tools against the WCAG 2.0 guidelines. The study found that:
1) The tools covered 23–50% of the success criteria across the guidelines.
2) Completeness (the percentage of actual violations found) ranged from 14% to 38% across tools, and higher completeness correlated with lower correctness.
3) Tools performed better on less accessible websites and on WCAG 2.0 A-level success criteria.
1. Benchmarking Web Accessibility Evaluation Tools:
Measuring the Harm of Sole Reliance on Automated Tests
10th International Cross-Disciplinary Conference on Web Accessibility (W4A 2013)
Markel Vigo, University of Manchester (UK)
Justin Brown, Edith Cowan University (Australia)
Vivienne Conway, Edith Cowan University (Australia)
http://dx.doi.org/10.6084/m9.figshare.701216
3. Evidence
W4A 2013, 13 May 2013
Webmasters are familiar with accessibility guidelines
Lazar et al., 2004. Improving web accessibility: a study of webmaster perceptions. Computers in Human Behavior 20(2), 269–288
4. Hypothesis I
Assuming guidelines do a good job...
H1: Awareness of accessibility guidelines is not that widespread.
5. Evidence II
Webmasters put compliance logos on non-compliant websites
Gilbertson and Machin, 2012. Guidelines, icons and marketable skills: an accessibility evaluation of 100 web development company homepages. W4A 2012
6. Hypothesis II
Assuming webmasters are not trying to cheat...
H2: There is a lack of awareness of the negative effects of overreliance on automated tools.
7. Expanding on H2
Why we rely on automated tests
• It's easy
• In some scenarios it seems like the only option: web observatories, real-time...
• We don't know how harmful they can be
8. Expanding on H2
Knowing the limitations of tools
• If we can measure these limitations, we can raise awareness
• Inform developers and researchers
• We ran a study with 6 tools
• Computed coverage, completeness and correctness w.r.t. WCAG 2.0
9. Method
Computed Metrics
• Coverage: whether a given Success Criterion (SC) is reported at least once
• Completeness: true_positives / actual_violations
• Correctness: true_positives / (true_positives + false_positives)
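The two ratio metrics can be sketched directly from their definitions. This is a minimal illustration, assuming a tool report and the ground truth are both sets of (page, success criterion) violation pairs; the data structures and example values are hypothetical, not from the study.

```python
# Completeness and correctness over illustrative violation sets.
# A violation is modelled as a (page, success_criterion) tuple.

def completeness(report, ground_truth):
    """True positives over all actual violations in the ground truth."""
    true_positives = len(report & ground_truth)
    return true_positives / len(ground_truth)

def correctness(report, ground_truth):
    """True positives over all reported violations (true + false positives)."""
    true_positives = len(report & ground_truth)
    return true_positives / len(report)

ground_truth = {("p1", "1.1.1"), ("p1", "2.4.4"), ("p2", "1.4.3"), ("p2", "3.1.1")}
report = {("p1", "1.1.1"), ("p2", "1.4.3"), ("p2", "9.9.9")}  # one false positive

print(completeness(report, ground_truth))  # 2 of 4 actual violations -> 0.5
print(correctness(report, ground_truth))   # 2 of 3 reported violations
```

A tool can thus look thorough (high completeness) while flagging many non-problems (low correctness), which is exactly the trade-off the results later report.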
10. Method
Stimuli
Vision Australia (www.visionaustralia.org.au)
• Non-profit
• Non-government
• Accessibility resource
Prime Minister (www.pm.gov.au)
• Federal Government
• Should abide by the Transition Strategy
Transperth (www.transperth.wa.gov.au)
• Government affiliated
• Used by people with disabilities
11. Method
Obtaining the "Ground Truth"
Ad-hoc sampling → Manual evaluation → Agreement → Ground truth
12. Method
Computing Metrics
[Flow diagram: for every page in the sample, each tool T1–T6 produces a report R1–R6; every report is compared with the ground truth (GT) to compute per-tool metrics M1–M6.]
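The per-tool evaluation loop from the diagram can be sketched as follows. This is an assumed reconstruction: each tool is modelled as a function from a page to its set of reported violations, and the ground truth maps each page to its actual violations; all names are illustrative.

```python
# Sketch of the metric-computation pipeline: for every page in the sample,
# get each tool's report, compare it with the ground truth (GT), and
# aggregate per-tool completeness and correctness.

def run_pipeline(pages, tools, ground_truth):
    metrics = {}
    for name, tool in tools.items():
        tp = fp = actual = 0
        for page in pages:
            report = tool(page)       # R_i: report from tool T_i
            gt = ground_truth[page]   # compare with the GT
            tp += len(report & gt)
            fp += len(report - gt)
            actual += len(gt)
        metrics[name] = {
            "completeness": tp / actual,
            "correctness": tp / (tp + fp) if tp + fp else 1.0,
        }
    return metrics
```

Aggregating counts across pages first, then dividing, keeps a tool from being rewarded or punished disproportionately on pages with very few violations.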
14. Results
Coverage
• 650 WCAG 2.0 Success Criteria violations (A and AA)
• 23–50% of SC are covered by automated tests
• Coverage varies across guidelines and tools
15. Results
Completeness per tool
• Completeness ranges from 14% to 38%
• Variable across tools and principles
16. Results
Completeness per type of SC
• How conformance levels influence completeness
• Wilcoxon Signed Rank: W=21, p<0.05
• Completeness levels are higher for 'A level' SC
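For readers unfamiliar with the statistic on this slide, here is a from-scratch sketch of the Wilcoxon signed-rank W on paired scores (e.g. each tool's completeness on A-level vs. AA-level SC). This is purely illustrative of how W is computed; the study presumably used a standard statistics package, and the input values below are made up.

```python
# Wilcoxon signed-rank statistic W for paired samples:
# rank the absolute non-zero differences (average ranks for ties),
# sum the ranks of positive and negative differences, take the minimum.

def wilcoxon_w(paired_a, paired_b):
    diffs = [a - b for a, b in zip(paired_a, paired_b) if a != b]
    ranked = sorted(enumerate(diffs), key=lambda t: abs(t[1]))
    ranks = {}
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j][1]) == abs(ranked[i][1]):
            j += 1
        avg = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        for k in range(i, j):
            ranks[ranked[k][0]] = avg
        i = j
    w_plus = sum(r for idx, r in ranks.items() if diffs[idx] > 0)
    w_minus = sum(r for idx, r in ranks.items() if diffs[idx] < 0)
    return min(w_plus, w_minus)

print(wilcoxon_w([1, 2, 4], [3, 2, 1]))  # diffs -2 and 3 -> W = 1
```

A small W relative to the number of pairs indicates that the differences lean consistently in one direction, which is what licenses the "higher for A-level SC" conclusion.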
17. Results
Completeness vs. accessibility
• How accessibility levels influence completeness
• ANOVA: F(2,10)=19.82, p<0.001
• The less accessible a page is, the higher the completeness
19. Results
Correctness
• Tools with lower completeness scores exhibit higher correctness (93–96%)
• Tools that obtain higher completeness yield lower correctness (66–71%)
• Tools with higher completeness are also the most incorrect ones
20. Implications
Coverage
• We corroborate that 50% is the upper limit for automating guidelines
• Natural Language Processing?
  – Language: 3.1.2 Language of parts
  – Domain: 3.3.4 Error prevention
21. Implications
Completeness I
• Automated tests do a better job...
  ...on non-accessible sites
  ...on 'A level' success criteria
• Automated tests aim at catching stereotypical errors
22. Implications
Completeness II
• Strengths of tools can be identified across WCAG principles and SC
• A method to inform decision making
• Maximising completeness in our sample of pages:
  – On all tools: 55% (+17 percentage points)
  – On non-commercial tools: 52%
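The "maximising completeness" idea above amounts to pooling the true positives found by several tools. A minimal sketch, assuming tool reports and the ground truth are illustrative sets of violation identifiers (the numbers below are made up, not the study's 55%/52% figures):

```python
# Combined completeness of a set of tools: the union of their true
# positives over all actual violations. False positives do not help.

def combined_completeness(reports, ground_truth):
    found = set()
    for report in reports:
        found |= report & ground_truth  # only true positives add coverage
    return len(found) / len(ground_truth)

gt = set(range(10))
t1 = {0, 1, 2, 11}  # three true positives plus one false positive
t2 = {2, 3, 4}
t3 = {5, 6}
print(combined_completeness([t1, t2, t3], gt))  # 7 of 10 -> 0.7
```

Because different tools catch different violations, the union can exceed any single tool's completeness, which is why combining tools gains the extra percentage points reported on the slide.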
24. Follow up
Contact: @markelvigo | markel.vigo@manchester.ac.uk
Presentation DOI: http://dx.doi.org/10.6084/m9.figshare.701216
Datasets: http://www.markelvigo.info/ds/bench12/index.html