-
INQUIRE: A Natural World Text-to-Image Retrieval Benchmark
Authors:
Edward Vendrow,
Omiros Pantazis,
Alexander Shepard,
Gabriel Brostow,
Kate E. Jones,
Oisin Mac Aodha,
Sara Beery,
Grant Van Horn
Abstract:
We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io
Submitted 11 November, 2024; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Deep learning-based ecological analysis of camera trap images is impacted by training data quality and quantity
Authors:
Peggy A. Bevan,
Omiros Pantazis,
Holly Pringle,
Guilherme Braga Ferreira,
Daniel J. Ingram,
Emily Madsen,
Liam Thomas,
Dol Raj Thanet,
Thakur Silwal,
Santosh Rayamajhi,
Gabriel Brostow,
Oisin Mac Aodha,
Kate E. Jones
Abstract:
Large image collections generated from camera traps offer valuable insights into species richness, occupancy, and activity patterns, significantly aiding biodiversity monitoring. However, the manual processing of these datasets is time-consuming, hindering analytical processes. To address this, deep neural networks have been adopted to automate image labelling, but the impact of classification error on ecological metrics remains unclear. Here, we analyse data from camera trap collections in an African savannah (82,300 images, 47 species) and an Asian sub-tropical dry forest (40,308 images, 29 species) to compare ecological metrics derived from expert-generated species identifications with those generated by deep learning classification models. We specifically assess the impact of deep learning model architecture, the proportion of label noise in the training data, and the size of the training dataset on three ecological metrics: species richness, occupancy, and activity patterns. Overall, ecological metrics derived from deep neural networks closely match those calculated from expert labels and remain robust to manipulations in the training pipeline. We found that the choice of deep learning model architecture does not impact ecological metrics, and ecological metrics related to the overall community (species richness, community occupancy) were resilient to up to 10% noise in the training dataset and a 50% reduction in the training dataset size. However, we caution that less common species are disproportionately affected by a reduction in deep neural network accuracy, and this has consequences for species-specific metrics (occupancy, diel activity patterns). To ensure the reliability of their findings, practitioners should prioritize creating large, clean training sets with balanced representation across species over exploring numerous deep learning model architectures.
Submitted 7 May, 2025; v1 submitted 26 August, 2024;
originally announced August 2024.
-
SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models
Authors:
Omiros Pantazis,
Gabriel Brostow,
Kate Jones,
Oisin Mac Aodha
Abstract:
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image and text pairs, and have been shown to sometimes exhibit impressive zero- and low-shot image classification performance. However, due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required. To combat this, a series of lightweight adaptation methods have been proposed to efficiently adapt such models when limited supervision is available. In this work, we show that while effective on internet-style datasets, even those remedies under-deliver on classification tasks with images that differ significantly from those commonly found online. To address this issue, we present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning. We report an average classification accuracy improvement of 10% in the low-shot setting when compared to existing methods, on a set of challenging visual classification tasks. Further, we present a fully automatic way of selecting an important blending hyperparameter for our model that does not require any held-out labeled validation data. Code for our project is available here: https://github.com/omipan/svl_adapter.
Submitted 7 October, 2022;
originally announced October 2022.
-
Matching Multiple Perspectives for Efficient Representation Learning
Authors:
Omiros Pantazis,
Mathew Salvaris
Abstract:
Representation learning approaches typically rely on images of objects captured from a single perspective that are transformed using affine transformations. Additionally, self-supervised learning, a successful paradigm of representation learning, relies on instance discrimination and self-augmentations which cannot always bridge the gap between observations of the same object viewed from a different perspective. Viewing an object from multiple perspectives aids holistic understanding of an object which is particularly important in situations where data annotations are limited. In this paper, we present an approach that combines self-supervised learning with a multi-perspective matching technique and demonstrate its effectiveness on learning higher quality representations on data captured by a robotic vacuum with an embedded camera. We show that the availability of multiple views of the same object combined with a variety of self-supervised pretraining algorithms can lead to improved object classification performance without extra labels.
Submitted 16 August, 2022;
originally announced August 2022.
-
Focus on the Positives: Self-Supervised Learning for Biodiversity Monitoring
Authors:
Omiros Pantazis,
Gabriel Brostow,
Kate Jones,
Oisin Mac Aodha
Abstract:
We address the problem of learning self-supervised representations from unlabeled image collections. Unlike existing approaches that attempt to learn useful features by maximizing similarity between augmented versions of each input image or by speculatively picking negative samples, we instead also make use of the natural variation that occurs in image collections that are captured using static monitoring cameras. To achieve this, we exploit readily available context data that encodes information such as the spatial and temporal relationships between the input images. We are able to learn representations that are surprisingly effective for downstream supervised classification, by first identifying high probability positive pairs at training time, i.e. those images that are likely to depict the same visual concept. For the critical task of global biodiversity monitoring, this results in image features that can be adapted to challenging visual species classification tasks with limited human supervision. We present results on four different camera trap image collections, across three different families of self-supervised learning methods, and show that careful image selection at training time results in superior performance compared to existing baselines such as conventional self-supervised training and transfer learning.
Submitted 13 August, 2021;
originally announced August 2021.
-
The Game Performance Index for Mobile Phones
Authors:
Hesham Dar,
James Kwan,
Yang Liu,
Omiros Pantazis,
Robert Sharp
Abstract:
With the recent increase in the quantity of high-fidelity games appearing on mobile devices and the recent trend of gaming-focused mobile devices, there is a new requirement for a clear and comprehensive measure of the quality of gaming performance on the mobile device platform. This paper proposes a conceptual framework for a set of performance measures for mobile devices based on user experience and user perception. It also presents a specific implementation and measurement use case which has been beneficial to Samsung Electronics when applied to our own product range, allowing us to better understand and quantify device performance. We believe that the methods outlined are potentially useful to the consumer, by providing an understandable public-facing score for device performance to guide consumers with purchasing decisions. The methods may also be useful to game developers, better enabling them to add new, richer game features based on the performance of the device.
Submitted 30 October, 2019;
originally announced October 2019.