Surgical Text-to-Image Generation

Nwoye, Chinedu Innocent; Bose, Rupak; Elgohary, Kareem; Arboit, Lorenzo; Carlino, Giorgio; Lavanchy, Joël L.; Mascagni, Pietro; Padoy, Nicolas

doi:10.1016/j.patrec.2025.02.002

arXiv:2407.09230 (cs)

[Submitted on 12 Jul 2024 (v1), last revised 21 Mar 2025 (this version, v3)]

Abstract: Acquiring surgical data for research and development is significantly hindered by high annotation costs and by practical and ethical constraints. Synthetically generated images could offer a valuable alternative. In this work, we explore adapting text-to-image generative models to the surgical domain using the CholecT50 dataset, which provides surgical images annotated with action triplets (instrument, verb, target). We investigate several language models and find that T5 offers more distinct features for differentiating surgical actions on triplet-based textual inputs and shows stronger alignment between long and triplet-based captions. To address the challenges of training text-to-image models solely on triplet-based captions, without additional inputs and supervisory signals, we discover that triplet text embeddings are instrument-centric in the latent space. Leveraging this insight, we design an instrument-based class balancing technique to counteract data imbalance and skewness, improving training convergence. Extending Imagen, a diffusion-based generative model, we develop Surgical Imagen to generate photorealistic and activity-aligned surgical images from triplet-based textual prompts. We assess the model on quality, alignment, reasoning, and knowledge, achieving FID and CLIP scores of 3.7 and 26.8%, respectively. A human expert survey shows that participants were highly challenged by the realistic characteristics of the generated samples, demonstrating Surgical Imagen's effectiveness as a practical alternative to real data collection.
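The abstract does not spell out how the instrument-based class balancing is computed. As a rough illustration of the general idea only, and not the authors' implementation, the following minimal sketch assumes plain inverse-frequency weighting over the instrument field of each (instrument, verb, target) triplet. The function name `instrument_balanced_weights` and the toy triplets are hypothetical.

```python
from collections import Counter


def instrument_balanced_weights(triplets):
    """Return one sampling weight per triplet.

    Assumption: each weight is inversely proportional to how often the
    triplet's instrument occurs, so every instrument receives the same
    total sampling mass. Weights are normalized to sum to 1.
    """
    counts = Counter(instrument for instrument, _verb, _target in triplets)
    raw = [1.0 / counts[instrument] for instrument, _verb, _target in triplets]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical toy captions: "grasper" dominates, "hook" is rare.
samples = [
    ("grasper", "retract", "gallbladder"),
    ("grasper", "grasp", "gallbladder"),
    ("grasper", "retract", "liver"),
    ("hook", "dissect", "gallbladder"),
]
weights = instrument_balanced_weights(samples)
```

With these weights, the three grasper samples together carry the same sampling mass as the single hook sample. The list could be passed to a weighted sampler such as `random.choices(samples, weights=weights, k=n)` when drawing training batches.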

Comments:	13 pages, 13 figures, 3 tables, published in Pattern Recognition Letters 2025, project page at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.09230 [cs.CV]
	(or arXiv:2407.09230v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.09230
Journal reference:	Pattern Recognition Letters, Volume 190, April 2025, Pages 73-80
Related DOI:	https://doi.org/10.1016/j.patrec.2025.02.002

Submission history

From: Chinedu Nwoye
[v1] Fri, 12 Jul 2024 12:49:11 UTC (1,900 KB)
[v2] Tue, 30 Jul 2024 16:40:23 UTC (1,935 KB)
[v3] Fri, 21 Mar 2025 09:57:02 UTC (2,212 KB)
