Showing 1–13 of 13 results for author: Bazzani, L

Searching in archive cs.
  1. arXiv:2503.04444  [pdf, other]

    cs.CV

    ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task

    Authors: Vittorio Pippi, Matthieu Guillaumin, Silvia Cascianelli, Rita Cucchiara, Maximilian Jaritz, Loris Bazzani

    Abstract: Large Multimodal Models (LMMs) are powerful tools that are capable of reasoning and understanding multimodal information beyond text and language. Despite their entrenched impact, the development of LMMs is hindered by the higher computational requirements compared to their unimodal counterparts. One of the main causes of this is the large amount of tokens needed to encode the visual input, which…

    Submitted 6 March, 2025; originally announced March 2025.

  2. arXiv:2502.08254  [pdf, other]

    cs.CV

    UniCoRN: Unified Commented Retrieval Network with LMMs

    Authors: Maximilian Jaritz, Matthieu Guillaumin, Sabine Sternig, Loris Bazzani

    Abstract: Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer with language to more complex visual questions, but without the inherent ability to retrieve relevant entities to support their answers. We aim to address t…

    Submitted 12 February, 2025; originally announced February 2025.

  3. arXiv:2411.17490  [pdf, other]

    cs.CV

    Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

    Authors: Ziwei Wang, Sameera Ramasinghe, Chenchen Xu, Julien Monteil, Loris Bazzani, Thalaiyasingam Ajanthan

    Abstract: Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies…

    Submitted 17 March, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

  4. arXiv:2410.08211  [pdf, other]

    cs.CV cs.AI cs.CL

    LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

    Authors: Anh-Quan Cao, Maximilian Jaritz, Matthieu Guillaumin, Raoul de Charette, Loris Bazzani

    Abstract: Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets wi…

    Submitted 10 October, 2024; originally announced October 2024.

  5. arXiv:2402.18842  [pdf, other]

    cs.CV

    ViewFusion: Towards Multi-View Consistency via Interpolated Denoising

    Authors: Xianghui Yang, Yan Zuo, Sameera Ramasinghe, Loris Bazzani, Gil Avraham, Anton van den Hengel

    Abstract: Novel-view synthesis through diffusion models has demonstrated remarkable potential for generating diverse and high-quality images. Yet, the independent process of image generation in these prevailing methods leads to challenges in maintaining multiple-view consistency. To address this, we introduce ViewFusion, a novel, training-free algorithm that can be seamlessly integrated into existing pre-tr…

    Submitted 28 February, 2024; originally announced February 2024.

    Comments: CVPR 2024, homepage: https://wi-sc.github.io/ViewFusion.github.io/

  6. arXiv:2305.05947  [pdf, other]

    cs.CV

    iEdit: Localised Text-guided Image Editing with Weak Supervision

    Authors: Rumeysa Bodur, Erhan Gundogdu, Binod Bhattarai, Tae-Kyun Kim, Michael Donoser, Loris Bazzani

    Abstract: Diffusion models (DMs) can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability in the output space of the generated images. We propose a novel learning method for text-guided image editing, namely iEdit, that generates images conditioned on a source image and a textual edit prompt. As a fully-annotated dataset with tar…

    Submitted 10 May, 2023; originally announced May 2023.

  7. arXiv:2204.12293  [pdf, other]

    cs.CV

    Contrastive Language-Action Pre-training for Temporal Localization

    Authors: Mengmeng Xu, Erhan Gundogdu, Maksim Lapin, Bernard Ghanem, Michael Donoser, Loris Bazzani

    Abstract: Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by the compute device memory constraints and lack of temporal annotations at large-scale. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the vid…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 18 pages, 4 figures

  8. arXiv:2103.13061  [pdf, other]

    cs.CV

    Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

    Authors: Amaia Salvador, Erhan Gundogdu, Loris Bazzani, Michael Donoser

    Abstract: Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives, as well as the availability of vast amounts of digital cooking recipes and food images to train machine learning models. In this work, we revisit existing approaches for cross-modal recipe retrieval and propose a simplified end-to-end model based on well established and high perf…

    Submitted 24 March, 2021; originally announced March 2021.

    Comments: CVPR 2021

  9. arXiv:1810.04101  [pdf, other]

    cs.CV

    Image Captioning as Neural Machine Translation Task in SOCKEYE

    Authors: Loris Bazzani, Tobias Domhan, Felix Hieber

    Abstract: Image captioning is an interdisciplinary research problem that stands between computer vision and natural language processing. The task is to generate a textual description of the content of an image. The typical model used for image captioning is an encoder-decoder deep network, where the encoder captures the essence of an image while the decoder is responsible for generating a sentence describin…

    Submitted 15 October, 2018; v1 submitted 9 October, 2018; originally announced October 2018.

  10. arXiv:1609.09251  [pdf, other]

    cs.CV

    Kernel Methods on Approximate Infinite-Dimensional Covariance Operators for Image Classification

    Authors: Hà Quang Minh, Marco San Biagio, Loris Bazzani, Vittorio Murino

    Abstract: This paper presents a novel framework for visual object recognition using infinite-dimensional covariance operators of input features in the paradigm of kernel methods on infinite-dimensional Riemannian manifolds. Our formulation provides in particular a rich representation of image features by exploiting their non-linear correlations. Theoretically, we provide a finite-dimensional approximation o…

    Submitted 29 September, 2016; originally announced September 2016.

    Comments: 18 double-column pages

  11. arXiv:1603.08199  [pdf, other]

    cs.CV

    Recurrent Mixture Density Network for Spatiotemporal Visual Attention

    Authors: Loris Bazzani, Hugo Larochelle, Lorenzo Torresani

    Abstract: In many computer vision tasks, the relevant information to solve the problem at hand is mixed with irrelevant, distracting information. This has motivated researchers to design attentional models that can dynamically focus on parts of images or videos that are salient, e.g., by down-weighting irrelevant pixels. In this work, we propose a spatiotemporal attentional model that learns where to look in…

    Submitted 11 February, 2017; v1 submitted 27 March, 2016; originally announced March 2016.

    Comments: ICLR 2017

  12. arXiv:1409.3964  [pdf, other]

    cs.CV

    Self-taught Object Localization with Deep Networks

    Authors: Loris Bazzani, Alessandro Bergamo, Dragomir Anguelov, Lorenzo Torresani

    Abstract: This paper introduces self-taught object localization, a novel approach that leverages deep convolutional networks trained for whole-image recognition to localize objects in images without additional human supervision, i.e., without using any ground-truth bounding boxes for training. The key idea is to analyze the change in the recognition scores when artificially masking out different regions of…

    Submitted 2 February, 2016; v1 submitted 13 September, 2014; originally announced September 2014.

    Comments: WACV 2016

  13. arXiv:1109.3737  [pdf, other]

    cs.AI

    Learning where to Attend with Deep Architectures for Image Tracking

    Authors: Misha Denil, Loris Bazzani, Hugo Larochelle, Nando de Freitas

    Abstract: We discuss an attentional model for simultaneous object tracking and recognition that is driven by gaze data. Motivated by theories of perception, the model consists of two interacting pathways: identity and control, intended to mirror the what and where pathways in neuroscience models. The identity pathway models object appearance and performs classification using deep (factored)-Restricted Boltz…

    Submitted 16 September, 2011; originally announced September 2011.