As in many other scientific domains where computer-based tools need to be evaluated, medical imaging often requires the expensive generation of manual ground truth. Some tasks require medical doctors in order to guarantee high-quality, valid results, whereas other tasks, such as the image modality classification described in this text, can be performed with sufficiently high quality by general domain experts.
1. Ground truth generation in medical imaging: a crowdsourcing-based iterative approach
Antonio Foncubierta-Rodríguez
Henning Müller
2. Introduction
• Medical image production is growing rapidly in scientific and clinical environments
• If images are easily accessible, they can be reused for:
  • Clinical decision support
  • Training of young physicians
  • Relevant document retrieval for researchers
• Modality classification improves the retrieval and accessibility of images
3. Motivation and dataset
• ImageCLEF dataset:
  • Over 300,000 images from the open access biomedical literature
  • Over 30 hierarchically defined modalities
• Manual classification is expensive and time-consuming
• How can this be done more efficiently?
4. Classification Hierarchy
[Diagram: the hierarchically defined ImageCLEF modality classes: radiology (conventional diagnostic: ultrasound, MRI, CT, 2D X-ray, angiography, PET, SPECT, infrared, combined modalities); visible light photography (endoscopy, skin, organs); microscopy (light, transmission electron, fluorescence, phase contrast, interference contrast, dark field); signals and waves (EEG, ECG/EKG, EMG); generic illustrations (tables and forms, program listings, statistical figures/graphs/charts, flowcharts, system overviews, gene sequences, chromatography/gel, chemical structures, mathematical formulae, non-clinical photos, hand-drawn sketches); 3D reconstructions; compound figures.]
6. Iterative workflow
• Avoid manual classification as much as possible
• Iterative approach:
1. Create a small training set
• Manual classification into 34 categories
2. Use an automatic tool that learns from training set
3. Evaluate results
• Manual classification into right/wrong categories
4. Improve training set
5. Repeat from step 2 (a sketch of this loop follows below)
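A minimal Python sketch of this loop, under stated assumptions: `classifier`, `crowd_annotate`, and `crowd_verify` are hypothetical stand-ins for the automatic tool and the two crowdsourcing tasks; none of these names come from the original work.

```python
# Minimal sketch of the iterative workflow. The classifier and the two
# crowdsourcing helpers are hypothetical stand-ins for the automatic
# tool and the manual tasks described on this slide.

def iterative_ground_truth(images, classifier, crowd_annotate, crowd_verify,
                           seed_size=1000, rounds=3):
    """Grow a labelled set by alternating automatic classification
    with crowd verification of the predicted labels."""
    # Step 1: small seed set, manually classified into the 34 categories
    training = {img: crowd_annotate(img) for img in images[:seed_size]}

    for _ in range(rounds):                 # Step 5: repeat from step 2
        # Step 2: the automatic tool learns from the current training set
        classifier.fit(list(training), list(training.values()))

        # Steps 3-4: workers only approve or refuse each prediction
        # (a binary task); approved labels improve the training set
        for img in images:
            if img not in training:
                label = classifier.predict(img)
                if crowd_verify(img, label):
                    training[img] = label

    return training
```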
7. Crowdsourcing in medical imaging
• Crowdsourcing reduces the time and cost of annotation
• Medical image annotation is often done by
  • Medical doctors
  • Domain experts
• Can unknown users provide valid annotations?
  • Quality?
  • Speed?
8. User Groups
• Experiments were performed with three different user groups:
  • 1 medical doctor (MD)
  • 18 known experts
  • 2,470 contributors from open crowdsourcing
9. Crowdsourcing platform
• The CrowdFlower platform was chosen for the experiments
  • Integrated interface for job design
  • Complete set of management tools: gold creation, internal interface, statistics, raw data
  • Hub feature: jobs can be announced in several crowdsourcing pools:
    • Amazon MTurk
    • Get Paid
    • Zoombucks
10. Experiment: Initial training set generation
• 1,000 images
• Limited to the 18 known experts
• Aim: test the crowdsourcing interface
11. Experiment: Automated classification verification
• 300,000 images
• Binary task: approve or refuse the classification
• Aim: evaluate speed and difficulty of the verification task
12. Experiments: trustability
• Aim: compare the expected accuracy of the user groups
• 3,415 images were classified by the medical doctor
• The two user groups were required to reclassify these images
• A random subset of 1,661 images was used as gold standard (a sketch of the accuracy computation follows below)
  • Feedback on wrong classifications was given to the known experts to detect ambiguities
  • Feedback on 847 of the gold images was muted for the crowd
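As an illustration, accuracy against the gold standard could be computed as below; the dictionary layout (image id to label) is an assumption for this sketch, not taken from the original work.

```python
# Sketch of the trustability comparison. `gold` maps image ids to the
# medical doctor's reference labels, `answers` maps image ids to one
# user group's labels; both layouts are illustrative assumptions.

def accuracy_against_gold(answers, gold):
    """Fraction of a group's answers that match the gold standard."""
    judged = [img for img in answers if img in gold]
    if not judged:
        return 0.0
    return sum(answers[img] == gold[img] for img in judged) / len(judged)

# Usage: accuracy_against_gold(crowd_answers, gold_subset)
```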
13. Results: user self assessment
• Users were required to state how sure they were of their choice
• This allows discarding untrusted data from trusted sources (see the sketch below)
• Confidence rate:
  • Medical doctor: 100 %
  • Known experts group: 95.04 %
  • Crowd group: 85.56 %
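A short sketch of how such a self-assessment could be used as a filter; the boolean `sure` flag is an illustrative encoding of the questionnaire answer, not the format used in the experiments.

```python
# Sketch of filtering by self-reported confidence. Each annotation is
# assumed to be a dict carrying a boolean "sure" flag; this encoding
# is illustrative only.

def trusted_subset(annotations):
    """Discard untrusted data from otherwise trusted sources."""
    return [a for a in annotations if a["sure"]]

def confidence_rate(annotations):
    """Share of answers whose authors said they were sure, in percent."""
    if not annotations:
        return 0.0
    return 100.0 * len(trusted_subset(annotations)) / len(annotations)
```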
17. Results: Automatic classification verification
• Verification by experts
  • 1,000 images were verified
  • Agreement among annotators: 100 % (a sketch of this measure follows below)
• Speed: users answered twice as fast as in the full classification task
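A possible way to compute this agreement figure, under the assumption that verdicts are grouped per image (an illustrative layout, not from the slides):

```python
# Sketch of the inter-annotator agreement measure: the share of images
# on which all annotators returned the same verdict. The mapping from
# image id to a list of verdicts is an illustrative assumption.

def full_agreement(verdicts_per_image):
    """Fraction of images on which every annotator gave the same answer."""
    if not verdicts_per_image:
        return 0.0
    unanimous = sum(1 for v in verdicts_per_image.values() if len(set(v)) == 1)
    return unanimous / len(verdicts_per_image)
```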
18. Conclusions
• The iterative approach reduces the amount of manual work
  • Only a small subset is fully manually annotated
  • Automatic classification verification is faster
• Significant differences among user groups
  • Faster crowd annotations, due to the larger number of contributors
  • Poorer crowd annotations in the most specific classes
  • Comparable performance among user groups in the broad categories
19. Future work
• Experiments can be redesigned to fit the crowd behaviour:
  • A smaller number of (good) contributors has previously led to CAD-comparable performance
• Selection of contributors:
  • Historical performance on the platform?
  • A selection/training phase within the job?
20. Thanks for your attention!
Antonio Foncubierta-Rodríguez and Henning Müller, “Ground truth generation in medical imaging: a crowdsourcing-based iterative approach”, in Workshop on Crowdsourcing for Multimedia, ACM Multimedia, Nara, Japan, 2012.
Contact: antonio.foncubierta@hevs.ch