Integrating deep learning into medical imaging pipelines has the potential to transform diagnostics. However, current vision transformer (ViT) models in ophthalmology often generalize poorly, and all open-source optical coherence tomography (OCT) ViTs are limited to 2D imaging. Moreover, catastrophic forgetting, whereby models lose previously acquired knowledge when trained on new data, remains an obstacle. We therefore aim to develop a 3D ViT framework for volumetric OCT scans.
Questions:
1. How can existing pre-trained 2D ViTs be adapted to produce feature embeddings from volumetric OCT data?
2. Does incorporating 3D representations improve performance in challenging tasks, such as distinguishing genetic macular dystrophies from geographic atrophy?
3. How do newly developed 3D ViT backbones compare with standard 2D models on downstream tasks?
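As an illustration of Question 1, one approach known from the video-model literature is weight inflation: a pre-trained 2D patch-embedding kernel is copied along a new depth axis and divided by the depth, so that a volume of identical slices yields the same embedding as the original 2D model. The sketch below is a toy, single-channel example in plain Python. The function names and values are hypothetical, and this is not necessarily the adaptation strategy the project will adopt.

```python
def inflate_kernel(kernel_2d, depth):
    """Copy a 2D kernel along a new depth axis, scaling each weight by 1/depth."""
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def embed_patch_2d(patch, kernel_2d):
    """Dot product of one 2D patch with a 2D kernel (one output channel)."""
    return sum(p * w
               for p_row, w_row in zip(patch, kernel_2d)
               for p, w in zip(p_row, w_row))


def embed_patch_3d(block, kernel_3d):
    """Dot product of one 3D block (depth x height x width) with a 3D kernel."""
    return sum(embed_patch_2d(sl, k) for sl, k in zip(block, kernel_3d))


# Toy values, chosen only for illustration (assumptions, not project data).
kernel_2d = [[0.5, -1.0], [2.0, 0.25]]
patch = [[1.0, 2.0], [3.0, 4.0]]
depth = 4

kernel_3d = inflate_kernel(kernel_2d, depth)
block = [patch] * depth  # a volume made of identical slices

out_2d = embed_patch_2d(patch, kernel_2d)
out_3d = embed_patch_3d(block, kernel_3d)
```

With this initialization the 3D model starts from the 2D model's behaviour on depth-constant input, and subsequent training on OCT volumes can then learn how the weights should differ across depth.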
Objectives:
1. Develop a ViT pre-trained on volumetric OCT data.
2. Employ low-overhead adaptation methods to accommodate domain shifts with minimal computational cost.
3. Incorporate dynamic architectures to facilitate domain adaptation while preventing catastrophic forgetting.
4. Compare the new 3D ViT with existing open-source ViTs on critical OCT-based tasks (e.g., disease classification and systemic health prediction).
5. Fine-tune the 3D backbone for specialized clinical applications.
6. Reduce model size through distillation while preserving performance, enabling broader applicability in resource-limited environments.
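Objective 6 does not specify a distillation formulation. A common choice is soft-target distillation, in which a smaller student model is trained to match the temperature-softened output distribution of a larger teacher. The following is a minimal sketch assuming a classification head; the function names, the default temperature, and the example logits are illustrative assumptions rather than the project's method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a three-class problem.
teacher = [2.0, 0.5, -1.0]
student_close = [1.8, 0.6, -0.9]
student_far = [-1.0, 0.5, 2.0]

loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss(student_close, teacher)
loss_far = distillation_loss(student_far, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice this term is usually combined with a standard supervised loss on the ground-truth labels.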
Rationale:
Clinicians naturally evaluate entire OCT volumes to form a single, comprehensive assessment, yet most deep-learning OCT models rely on 2D slices alone. By extending ViTs to 3D and leveraging continual learning strategies, this project seeks to approximate clinical decision-making more closely. The resulting models could not only improve diagnostic accuracy in ophthalmology but also serve as a versatile foundation for broader medical imaging applications.