Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic

Shah, Monika; Sarkhel, Somdeb; Venugopal, Deepak

doi:10.1109/BigData62323.2024.10825003

arXiv:2503.13847 (cs)

[Submitted on 18 Mar 2025]

Abstract: Multimodal systems have highly complex processing pipelines and are pretrained on large datasets before being fine-tuned for specific tasks such as visual captioning. However, it is hard to disentangle what the model learns during fine-tuning from what it already knows from pretraining. In this work, we learn a probabilistic model using Hybrid Markov Logic Networks (HMLNs) over the training examples by relating symbolic knowledge (extracted from the caption) to visual features (extracted from the image). For a generated caption, we quantify the influence of training examples based on the HMLN distribution using probabilistic inference. We evaluate two types of inference procedures on the MSCOCO dataset for different types of captioning models. Our results show that for BLIP2 (a model that uses an LLM), fine-tuning may have a smaller influence on the knowledge the model has acquired, since it may have more general knowledge for visual captioning than models that do not use an LLM.
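
To make the idea of an HMLN distribution concrete, here is a toy sketch of a log-linear model in which each feature multiplies the truth value of a symbolic atom by a real-valued term. It is not the authors' implementation. The two atoms (read them as "caption mentions object A" and "caption mentions object B"), the weights, and the similarity values are all hypothetical, and the distribution is computed by brute-force enumeration, which only works for a handful of atoms.

```python
import math
from itertools import product


def log_score(world, weights, numeric_terms):
    """Unnormalized log-probability of one truth assignment.

    Each hybrid feature is (truth value of a symbolic atom) times
    (a real-valued term, e.g. an image-text similarity), scaled by a weight.
    """
    return sum(w * x * t for w, x, t in zip(weights, world, numeric_terms))


def world_distribution(weights, numeric_terms):
    """Enumerate every 0/1 assignment and normalize exp(log_score)."""
    worlds = list(product([0, 1], repeat=len(weights)))
    scores = [math.exp(log_score(x, weights, numeric_terms)) for x in worlds]
    z = sum(scores)  # partition function
    return {x: s / z for x, s in zip(worlds, scores)}


# Hypothetical values, chosen only for illustration.
weights = [1.5, 0.5]      # one weight per hybrid feature
similarity = [0.9, 0.2]   # assumed visual-text similarity per atom

dist = world_distribution(weights, similarity)
for world, p in sorted(dist.items()):
    print(world, round(p, 3))
```

In this toy setting, an atom backed by a large similarity term and a large weight pulls probability mass toward assignments where it is true. Real models have far too many atoms to enumerate, which is why approximate probabilistic inference is needed in practice.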

Comments:	2024 IEEE International Conference on Big Data (BigData), 10 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.13847 [cs.CV]
	(or arXiv:2503.13847v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.13847
Related DOI:	https://doi.org/10.1109/BigData62323.2024.10825003

Submission history

From: Monika Shah
[v1] Tue, 18 Mar 2025 02:39:26 UTC (10,773 KB)
