As in many other scientific domains where computer-based tools need to be evaluated, medical imaging often requires the expensive generation of manual ground truth. Some tasks require medical doctors in order to guarantee high-quality, valid results, whereas other tasks, such as the image modality classification described in this text, can be performed with sufficiently high quality by general domain experts.
1. Ground truth generation in medical imaging: a crowdsourcing-based iterative approach
Antonio Foncubierta-Rodríguez
Henning Müller
2. Introduction
• Medical image production is growing rapidly in scientific and clinical environments
• If images are easily accessible, they can be reused for:
  • Clinical decision support
  • Training of young physicians
  • Relevant document retrieval for researchers
• Modality classification improves the retrieval and accessibility of images
3. Motivation and dataset
• ImageCLEF dataset:
  • Over 300,000 images from the open access biomedical literature
  • Over 30 hierarchically defined modalities
• Manual classification is expensive and time-consuming
• How can this be done more efficiently?
4. Classification Hierarchy
[Diagram: the hierarchically defined ImageCLEF modality classes: radiology (conventional diagnostic: ultrasound, MRI, CT, 2D X-ray, angiography, PET, SPECT, infrared, combined modalities); visible light photography (endoscopy, skin, organs); microscopy (light, transmission electron, fluorescence, phase contrast, interference contrast, dark field); signals and waves (EEG, ECG/EKG, EMG); generic illustrations (tables and forms, program listings, statistical figures/graphs/charts, flowcharts, system overviews, gene sequences, chromatography/gel, chemical structures, mathematical formulae, non-clinical photos, hand-drawn sketches); 3D reconstructions; compound figures.]
6. Iterative workflow
• Avoid manual classification as much as possible
• Iterative approach:
1. Create a small training set
• Manual classification into 34 categories
2. Use an automatic tool that learns from training set
3. Evaluate results
• Manual classification into right/wrong categories
4. Improve training set
5. Repeat from step 2 (a sketch of this loop follows below)
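A minimal Python sketch of this loop, under stated assumptions: `classifier`, `crowd_annotate`, and `crowd_verify` are hypothetical stand-ins for the automatic tool and the two crowdsourcing tasks; none of these names come from the original work.

```python
# Minimal sketch of the iterative workflow. The classifier and the two
# crowdsourcing helpers are hypothetical stand-ins for the automatic
# tool and the manual tasks described on this slide.

def iterative_ground_truth(images, classifier, crowd_annotate, crowd_verify,
                           seed_size=1000, rounds=3):
    """Grow a labelled set by alternating automatic classification
    with crowd verification of the predicted labels."""
    # Step 1: small seed set, manually classified into the 34 categories
    training = {img: crowd_annotate(img) for img in images[:seed_size]}

    for _ in range(rounds):                 # Step 5: repeat from step 2
        # Step 2: the automatic tool learns from the current training set
        classifier.fit(list(training), list(training.values()))

        # Steps 3-4: workers only approve or refuse each prediction
        # (a binary task); approved labels improve the training set
        for img in images:
            if img not in training:
                label = classifier.predict(img)
                if crowd_verify(img, label):
                    training[img] = label

    return training
```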
7. Crowdsourcing in medical imaging
• Crowdsourcing reduces the time and cost of annotation
• Medical image annotation is often done by
  • Medical doctors
  • Domain experts
• Can unknown users provide valid annotations?
  • Quality?
  • Speed?
8. User Groups
• Experiments were performed with three different user groups:
  • 1 medical doctor (MD)
  • 18 known experts
  • 2,470 contributors from open crowdsourcing
9. Crowdsourcing platform
• The CrowdFlower platform was chosen for the experiments
  • Integrated interface for job design
  • Complete set of management tools: gold creation, internal interface, statistics, raw data
  • Hub feature: jobs can be announced in several crowdsourcing pools:
    • Amazon MTurk
    • Get Paid
    • Zoombucks
10. Experiment: Initial training set generation
• 1,000 images
• Limited to the 18 known experts
• Aim: test the crowdsourcing interface
11. Experiment: Automated classification verification
• 300,000 images
• Binary task: approve or refuse the classification
• Aim: evaluate speed and difficulty of the verification task
12. Experiments: trustability
• Aim: compare the expected accuracy of the user groups
• 3,415 images were classified by the medical doctor
• The two user groups were required to reclassify these images
• A random subset of 1,661 images was used as gold standard (a sketch of the accuracy computation follows below)
  • Feedback on wrong classifications was given to the known experts to detect ambiguities
  • Feedback on 847 of the gold images was muted for the crowd
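As an illustration, accuracy against the gold standard could be computed as below; the dictionary layout (image id to label) is an assumption for this sketch, not taken from the original work.

```python
# Sketch of the trustability comparison. `gold` maps image ids to the
# medical doctor's reference labels, `answers` maps image ids to one
# user group's labels; both layouts are illustrative assumptions.

def accuracy_against_gold(answers, gold):
    """Fraction of a group's answers that match the gold standard."""
    judged = [img for img in answers if img in gold]
    if not judged:
        return 0.0
    return sum(answers[img] == gold[img] for img in judged) / len(judged)

# Usage: accuracy_against_gold(crowd_answers, gold_subset)
```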
13. Results: user self assessment
• Users were required to state how sure they were of their choice
• This allows discarding untrusted data from trusted sources (see the sketch below)
• Confidence rate:
  • Medical doctor: 100 %
  • Known experts group: 95.04 %
  • Crowd group: 85.56 %
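A short sketch of how such a self-assessment could be used as a filter; the boolean `sure` flag is an illustrative encoding of the questionnaire answer, not the format used in the experiments.

```python
# Sketch of filtering by self-reported confidence. Each annotation is
# assumed to be a dict carrying a boolean "sure" flag; this encoding
# is illustrative only.

def trusted_subset(annotations):
    """Discard untrusted data from otherwise trusted sources."""
    return [a for a in annotations if a["sure"]]

def confidence_rate(annotations):
    """Share of answers whose authors said they were sure, in percent."""
    if not annotations:
        return 0.0
    return 100.0 * len(trusted_subset(annotations)) / len(annotations)
```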
17. Results: Automatic classification verification
• Verification by experts
  • 1,000 images were verified
  • Agreement among annotators: 100 % (a sketch of this measure follows below)
• Speed: users answered twice as fast as in the full classification task
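A possible way to compute this agreement figure, under the assumption that verdicts are grouped per image (an illustrative layout, not from the slides):

```python
# Sketch of the inter-annotator agreement measure: the share of images
# on which all annotators returned the same verdict. The mapping from
# image id to a list of verdicts is an illustrative assumption.

def full_agreement(verdicts_per_image):
    """Fraction of images on which every annotator gave the same answer."""
    if not verdicts_per_image:
        return 0.0
    unanimous = sum(1 for v in verdicts_per_image.values() if len(set(v)) == 1)
    return unanimous / len(verdicts_per_image)
```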
18. Conclusions
• The iterative approach reduces the amount of manual work
  • Only a small subset is fully manually annotated
  • Automatic classification verification is faster
• Significant differences among user groups
  • Faster crowd annotations, due to the larger number of contributors
  • Poorer crowd annotations in the most specific classes
  • Comparable performance among user groups in the broad categories
19. Future work
• Experiments can be redesigned to fit the crowd behaviour:
  • A smaller number of (good) contributors has previously led to CAD-comparable performance
• Selection of contributors:
  • Historical performance on the platform?
  • A selection/training phase within the job?
20. Thanks for your attention!
Antonio Foncubierta-Rodríguez and Henning Müller, “Ground truth generation in medical imaging: a crowdsourcing-based iterative approach”, in Workshop on Crowdsourcing for Multimedia, ACM Multimedia, Nara, Japan, 2012.
Contact: antonio.foncubierta@hevs.ch