Context-Aware Vision Language Foundation Models for Ocular Disease Screening in Retinal Images

Berger, Lucie; Lamard, Mathieu; Zhang, Philippe; Borderie, Laurent; Guilcher, Alexandre Le; Massin, Pascale; Cochener, Béatrice; Quellec, Gwenolé; Matta, Sarah

Abstract: Foundation models are large-scale versatile systems trained on vast quantities of diverse data to learn generalizable representations. Their adaptability with minimal fine-tuning makes them particularly promising for medical imaging, where data variability and domain shifts are major challenges. Currently, two types of foundation models dominate the literature: self-supervised models and more recent vision-language models. In this study, we advance the application of vision-language foundation (VLF) models for ocular disease screening using the OPHDIAT dataset, which includes nearly 700,000 fundus photographs from a French diabetic retinopathy (DR) screening network. This dataset provides extensive clinical data (patient-specific information such as diabetic health conditions and treatments), labeled diagnostics, ophthalmologists' text-based findings, and multiple retinal images for each examination. Building on the FLAIR model, a VLF model for retinal pathology classification, we propose novel context-aware VLF models (e.g., jointly analyzing multiple images from the same visit or taking advantage of past diagnoses and contextual data) to fully leverage the richness of the OPHDIAT dataset and enhance robustness to domain shifts. Our approaches were evaluated on both in-domain data (a testing subset of OPHDIAT) and out-of-domain data (public datasets) to assess their generalization performance. Our model demonstrated improved in-domain performance for DR grading, achieving an area under the curve (AUC) ranging from 0.851 to 0.9999, and generalized well to ocular disease detection on out-of-domain data (AUC: 0.631-0.913).
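
For readers unfamiliar with the AUC metric reported in the abstract, the following is a minimal sketch of how the area under the ROC curve can be computed for a binary task. It uses the pairwise-ranking interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; this is not the authors' evaluation code and the data are not from OPHDIAT.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease present, 0 = disease absent.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The quadratic pairwise loop is chosen for clarity; for large test sets a sort-based (rank-sum) formulation gives the same value more efficiently.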

Comments:	4 pages
Subjects:	Image and Video Processing (eess.IV)
Cite as:	arXiv:2503.15212 [eess.IV]
	(or arXiv:2503.15212v1 [eess.IV] for this version)
	https://doi.org/10.48550/arXiv.2503.15212
