1. What Makes Transfer Learning Work for Medical Images: Feature Reuse and Other Factors
Christos Matsoukas1,2,3, Johan Fredin Haslum1,2,3, Moein Sorkhei1,2, Magnus Söderberg3, Kevin Smith1,2
1 KTH Royal Institute of Technology, Stockholm, Sweden
2 Science for Life Laboratory, Stockholm, Sweden
3 AstraZeneca, Gothenburg, Sweden
Presenter: Mithunjha Anandakumar
2. What is Transfer Learning?
[Diagram: knowledge gained by a model trained on the source domain is transferred to a model for the target domain.]
Transfer learning: reuse knowledge gained in one domain, the source domain, to improve performance in another, the target domain.
3. Source domain vs Target domain
Source Domain (ImageNet) vs. Target Domain (Medical Images):
• Natural images with a clear global subject vs. large images of a bodily region of interest, where variations in local textures identify pathologies
• Millions of images vs. fewer images*, which are often larger
• 1000 classes vs. fewer classes
Image credits : https://www.researchgate.net/figure/Examples-of-pictures-randomly-sampled-from-the-Tiny-ImageNet-dataset_fig1_354590544
Content credits: Raghu, M., Zhang, C., Kleinberg, J., & Bengio, S. (2019). Transfusion: Understanding transfer learning for medical imaging. Advances in neural information processing systems, 32.
* Rareness of disease, ethical concerns, expense of acquisition
4. Contribution of the paper
• Shows that the benefits of TL increase with:
• reduced data size
• smaller distance between the source and target domains
• models with fewer inductive biases
• models with more capacity (to a lesser extent)
• Shows that the benefits of TL correlate with feature reuse.
• Shows that there are feature-independent benefits of pretraining, such as faster training.
5. Related work
• Raghu et al. (2019), Transfusion – summary and contributions:
- 2 datasets:
- a large dataset: CheXpert
- a private dataset: retinal fundus images
- Architectures:
- ResNet
- Inception
- Contributions:
- TL gives little benefit (attributed to over-parameterization and weight statistics, not to feature reuse)
- TL speeds up training
6. Methodology
• Datasets
• N = 3,662 – high-resolution diabetic retinopathy images; classification, 5 classes
• N = 10,239 – a mammography dataset; task: detect the presence of masses
• N = 25,331 – dermoscopic images (ISIC); classification, 9 classes
• N = 224,316 – chest X-rays (CheXpert); classification, 14 classes
• N = 327,680 – patches of H&E-stained WSIs of lymph node sections (PatchCamelyon); classification, 2 classes
8. Methodology
• Initialization – to isolate the contribution of feature reuse from that of the weight statistics
1. Random Initialization (RI): Kaiming initialization
2. Weight statistics transfer (ST): weights are sampled from a normal distribution whose mean and std are taken from an IMAGENET-pretrained model
3. Weight Transfer (WT): the IMAGENET-pretrained weights are transferred directly
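A minimal sketch of the three schemes in PyTorch is given below; the helper names and the per-tensor statistics are illustrative assumptions, not the paper's exact code.

import torch
import torch.nn as nn
from torchvision import models

def random_init(model):
    # RI: Kaiming initialization of conv and linear layers.
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    return model

def stats_transfer_init(model, pretrained):
    # ST: redraw every parameter from N(mu, sigma^2), where mu and sigma are
    # the mean and std of the corresponding IMAGENET-pretrained tensor.
    with torch.no_grad():
        for p, p_pre in zip(model.parameters(), pretrained.parameters()):
            p.normal_(mean=p_pre.mean().item(), std=p_pre.std().item())
    return model

# WT: weight transfer is simply loading the IMAGENET-pretrained weights.
pretrained = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model_ri = random_init(models.resnet50(weights=None))
model_st = stats_transfer_init(models.resnet50(weights=None), pretrained)
model_wt = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)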
10. When is TL to the medical domain beneficial and how important is feature reuse?
Relative increase in performance: WT / RI
Relative gain attributed to feature reuse: (WT − ST) / RI
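As a toy numerical example of how these two quantities are read (the scores below are made up, not from the paper):

# Hypothetical test scores for the three initializations of the same architecture.
ri, st, wt = 0.80, 0.82, 0.88
relative_increase = wt / ri           # overall gain from transfer learning: 1.10
feature_reuse_gain = (wt - st) / ri   # gain attributable to feature reuse: 0.075
print(relative_increase, feature_reuse_gain)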
11. Which layers benefit from feature reuse?
Transferring weights (WT) up to block n and initializing the remaining m blocks with ST.
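A minimal sketch of this partial-transfer (WT-ST) initialization, assuming the blocks of interest are addressable by name (the block names and helper are illustrative, not the paper's code):

import torch

def wt_st_init(model, pretrained, block_names, n_wt_blocks):
    # Copy pretrained weights into the first n blocks (WT) and re-sample the
    # remaining blocks from the pretrained layer statistics (ST).
    modules = dict(model.named_modules())
    modules_pre = dict(pretrained.named_modules())
    with torch.no_grad():
        for i, name in enumerate(block_names):
            for p, p_pre in zip(modules[name].parameters(),
                                modules_pre[name].parameters()):
                if i < n_wt_blocks:
                    p.copy_(p_pre)                        # WT
                else:
                    p.normal_(mean=p_pre.mean().item(),   # ST
                              std=p_pre.std().item())
    return model

# e.g. for a torchvision RESNET50, treating the stem and the four stages as blocks:
# wt_st_init(model, pretrained, ["conv1", "layer1", "layer2", "layer3", "layer4"], n_wt_blocks=2)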
12. What properties of TL are revealed via feature similarity?
Feature similarity resulting from transfer learning (WT) before and after fine-tuning.
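Feature similarity between layers is commonly measured with centered kernel alignment (CKA); whether the paper uses exactly this variant is an assumption here. A minimal linear-CKA sketch:

import torch

def linear_cka(X, Y):
    # Linear CKA between two activation matrices of shape (n_samples, dim).
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2     # ||X^T Y||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

# e.g. compare one layer's activations on the same batch before vs. after fine-tuning:
# similarity = linear_cka(feats_before, feats_after)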
13. What properties of TL are revealed via feature similarity?
Feature similarity between ST and WT initialized models after fine-tuning.
14. Which transferred weights change?
L2 distance between the initial weights of each network and the weights after fine-tuning.
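A minimal sketch of this measurement, assuming a snapshot of the state dict was taken at initialization (names are placeholders):

import copy
import torch

def per_layer_l2_shift(init_state, tuned_state):
    # L2 distance between each parameter at initialization and after fine-tuning.
    return {name: torch.norm(tuned_state[name].float() - p0.float()).item()
            for name, p0 in init_state.items()
            if p0.dtype.is_floating_point}

# init_state = copy.deepcopy(model.state_dict())   # before training
# ... fine-tune the model ...
# shifts = per_layer_l2_shift(init_state, model.state_dict())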
15. Which transferred weights change?
Impact of resetting a layer's weights to their initial values: re-initialization robustness
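A minimal sketch of re-initialization robustness: reset one layer of the fine-tuned model back to its initial values and measure the drop in performance (evaluate() and the layer naming are placeholders, not the paper's code):

import copy

def reinit_robustness(model, init_state, layer_name, evaluate):
    # Large performance drops mark "critical" layers; small drops indicate the
    # layer's weights barely needed to change during fine-tuning.
    probe = copy.deepcopy(model)
    state = probe.state_dict()
    for name, p0 in init_state.items():
        if name.startswith(layer_name):
            state[name] = p0.clone()
    probe.load_state_dict(state)
    return evaluate(model) - evaluate(probe)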
16. What is the impact of TL for different model capacities?
17. What is the impact of TL on convergence speed?
18. Contribution of the paper
• Shows that the benefits of TL increase with:
• reduced data size
• smaller distance between the source and target domains
• models with fewer inductive biases
• models with more capacity (to a lesser extent)
• Shows that the benefits of TL correlate with feature reuse.
• Shows that there are feature-independent benefits of pretraining, such as faster training.
What is transfer learning?
The feature reuse hypothesis assumes that weights learned in the source domain yield features that can readily be used in the target domain.
The lack of large public datasets has led to the widespread adoption of transfer learning from IMAGENET to improve performance on medical tasks.
Transfer learning is typically performed by taking an architecture, along with its IMAGENET pretrained weights, and then fine-tuning it on the target task.
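A minimal sketch of this standard recipe in PyTorch; the class count and hyperparameters are placeholders, not the paper's settings.

import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_target_classes = 5  # placeholder, e.g. diabetic retinopathy grades

# Take an IMAGENET-pretrained backbone, replace its classifier head with one
# sized for the medical target task, then fine-tune end to end.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

optimizer = optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# ... standard training loop on the target dataset ...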
Why does TL improve performance? Good initialization, faster training, or feature reuse?
NeurIPS 2019
Raghu et al. showed that the actual values of the weights are not always necessary for good transfer learning performance. One can achieve similar performance by initializing the network using its weight statistics. In this setting, transfer amounts to providing a good range of values for randomly initializing the network, eliminating feature reuse as a factor.
Many other works have shown that transfer learning does not significantly help with medical images.
DEIT (Data-efficient Image Transformer) – a pure transformer architecture
Swin – self-attention with a hierarchical structure
Inductive biases – locality, translational equivariance, hierarchical scale
Inception – processes the signal in parallel at multiple scales before propagating it to the next layer
TL is least beneficial for CNN architectures on large datasets.
DEIT (which lacks inductive biases) sees a boost from TL even on large datasets, more so than SWIN.
All models show gains from TL on small datasets.
ISIC closely resembles IMAGENET: higher gains, even for CNN models.
SWIN falls in between DEIT and CNNs.
Because DEIT lacks inductive biases, even a large dataset is insufficient to learn better features than those transferred from IMAGENET.
For large datasets, CNNs exhibit a relatively flat line throughout the network – no significant benefit over stats transfer.
For smaller datasets, a linearly increasing trend implies that every layer benefits from feature reuse.
DEIT shows sharp jumps in the early layers – local attention is learned in the early layers, and learning local features requires a large amount of data.
SWIN shows properties of both DEIT and CNNs – on small and IMAGENET-like data it behaves like DEIT, but with large datasets it resembles CNNs.
On average, ViTs (which lack inductive biases) benefit from feature reuse, but mostly in the early layers.
CNNs benefit from feature reuse to a lesser extent, but consistently throughout the network layers – reflecting the hierarchical nature of the architecture.
Red indicates high feature reuse – no change in the features after fine-tuning.
For DEIT, we see that feature similarity is strongest in the early- to mid-layers. In later layers, the trained model adapts to the new task and drifts away from the IMAGENET features.
RESNET50 after transfer learning shows broader feature similarity – with the exception of the final layers, which must adapt to the new task.
A common trend shared by both ViTs and CNNs is that when more data is available, the transition point from feature reuse to feature adaptation shifts towards earlier layers because the network has sufficient data to adapt more of the transferred IMAGENET features to the new task.
ViTs
Here, we find that early layers of ST-initialized models are similar to features from the first half of the WT-initialized models. We see that if the network is denied these essential pre-trained weights, it attempts to learn them rapidly using only a few layers (due to lack of data), resulting in poor performance.
CNNs
From the bottom row of Figure 3 we further observe that CNNs seem to learn similar features from different initializations, suggesting that their inductive biases may naturally lead to these features (although the final layers used for classification diverge). We also observe a trend where, given more data, the ST initialization is able to learn some novel mid- to high-level features not found in IMAGENET.
The general trend is that transferred weights (WT) remain in the same vicinity after fine-tuning, more so when transfer learning gains are strongest
As the network is progressively initialized more with ST, the transferred weights tend to “stick” less well.
Certain layers, however, undergo substantial changes regardless – early layers in ViTs (the patchifier) and INCEPTION, and the first block at each scale in RESNET50. These are the first layers to encounter the data, or a scale change.
Our main finding is that networks with weight transfer (WT) undergo few critical changes, indicating feature reuse.
When transfer learning is least effective (RESNET on CHEXPERT and PATCHCAMELYON) the gap in robustness between WT and ST is at its smallest. Interestingly, in ViTs with partial weight transfer (WT-ST), critical layers often appear at the transition between WT and ST. Rather than change the transferred weights, the network quickly adapts. But following this adaptation, no critical layers appear. As the data size increases, ViTs make more substantial early changes to adapt to the raw input (or partial WT). Transferred weights in CNNs, on the other hand, tend to be less “sticky” than ViTs. We see the same general trend where WT is the most robust, but unlike ViTs where WT was robust throughout the network, RESNET50 exhibits poor robustness at the final layers responsible for classification, and also periodically within the network at critical layers where the scale changes, as observed by [44].
We can observe a slight increase in TL performance as model size increases – the red curve dominates the other curves when the WT fraction is close to 1.
We observe that convergence speed monotonically increases with the number of WT layers.
Furthermore, we observe that CNNs converge faster at a roughly linear rate as we include more WT layers, while vision transformers see a rapid increase in convergence speed for the first half of the network but diminishing returns after that.
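One simple way to quantify convergence speed (an assumption; the paper's exact metric may differ) is the first epoch at which the validation score reaches a fixed fraction of its best value:

def epochs_to_converge(val_scores, fraction=0.95):
    # Return the first epoch whose validation score reaches `fraction` of the best score.
    target = fraction * max(val_scores)
    for epoch, score in enumerate(val_scores, start=1):
        if score >= target:
            return epoch
    return len(val_scores)

# epochs_to_converge([0.60, 0.72, 0.80, 0.83, 0.84]) -> 3  (0.80 >= 0.95 * 0.84 = 0.798)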
Why does TL improve performance? Good initialization, faster training, or feature reuse?