Showing 1–50 of 82 results for author: Albanie, S

  1. arXiv:2507.06261  [pdf, ps, other]

    cs.CL cs.AI

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, et al. (3264 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…

    Submitted 11 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: 72 pages, 17 figures

  2. arXiv:2506.08227  [pdf, ps, other]

    cs.CV

    A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks

    Authors: Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie, Matthias Bethge

    Abstract: We investigate 17 benchmarks (e.g. SugarCREPE, VALSE) commonly used for measuring compositional understanding capabilities of vision-language models (VLMs). We scrutinize design choices in their construction, including data source (e.g. MS-COCO) and curation procedures (e.g. constructing negative images/captions), uncovering several inherent biases across most benchmarks. We find that blind heuris…

    Submitted 9 June, 2025; originally announced June 2025.

  3. arXiv:2506.05296  [pdf, ps, other]

    cs.AI cs.LG

    Control Tax: The Price of Keeping AI in Check

    Authors: Mikhail Terekhov, Zhen Ning David Liu, Caglar Gulcehre, Samuel Albanie

    Abstract: The rapid integration of agentic AI into high-stakes real-world applications requires robust oversight mechanisms. The emerging field of AI Control (AIC) aims to provide such an oversight mechanism, but practical adoption depends heavily on implementation overhead. To study this problem better, we introduce the notion of Control tax -- the operational and financial cost of integrating control meas…

    Submitted 14 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  4. arXiv:2504.07086  [pdf, other]

    cs.LG cs.CL

    A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

    Authors: Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge

    Abstract: Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathema…

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: Technical Report

  5. arXiv:2504.01849  [pdf, other]

    cs.AI cs.CY cs.LG

    An Approach to Technical AGI Safety and Security

    Authors: Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Samuel Albanie, Scott Emmons, Sebastian Farquhar, Sébastien Krier, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Vladimir Mikulik, Zachary Kenton, Dave Orr, et al. (5 additional authors not shown)

    Abstract: Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims…

    Submitted 2 April, 2025; originally announced April 2025.

  6. arXiv:2502.09696  [pdf, other]

    cs.CV

    ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

    Authors: Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexandru Coca, Mikah Dang, et al. (9 additional authors not shown)

    Abstract: Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for l…

    Submitted 6 March, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

    Comments: 20 pages, 13 figures

  7. arXiv:2501.14249  [pdf, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  8. arXiv:2412.13602  [pdf, ps, other]

    cs.CL

    GAMEBoT: Transparent Assessment of LLM Reasoning in Games

    Authors: Wenye Lin, Jonathan Roberts, Yunhan Yang, Samuel Albanie, Zongqing Lu, Kai Han

    Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these…

    Submitted 1 June, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

    Comments: 9 pages, ACL 2025

  9. arXiv:2412.06745  [pdf, ps, other]

    cs.LG cs.CL cs.CV

    ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

    Authors: Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge

    Abstract: Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capab…

    Submitted 17 June, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

  10. arXiv:2412.06712  [pdf, other]

    cs.LG cs.CL cs.CV

    How to Merge Your Multimodal Models Over Time?

    Authors: Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth, Ameya Prabhu, Zeynep Akata, Samuel Albanie, Matthias Bethge

    Abstract: Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: Technical Report. Code at https://github.com/ExplainableML/fomo_in_flux

  11. arXiv:2411.18674  [pdf, other]

    cs.CV cs.LG

    Active Data Curation Effectively Distills Large-Scale Multimodal Models

    Authors: Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem, Talfan Evans, Samuel Albanie, Federico Tombari, Yongqin Xian, Alessio Tonioni, Olivier J. Hénaff

    Abstract: Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. O…

    Submitted 5 May, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

    Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

  12. arXiv:2411.05000  [pdf, other]

    cs.CL

    Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

    Authors: Jonathan Roberts, Kai Han, Samuel Albanie

    Abstract: As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has trad…

    Submitted 23 April, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

    Comments: Accepted at ICLR 2025

  13. arXiv:2408.14471  [pdf, other]

    cs.CV cs.CL cs.LG

    A Practitioner's Guide to Continual Multimodal Pretraining

    Authors: Karsten Roth, Vishaal Udandarao, Sebastian Dziadzio, Ameya Prabhu, Mehdi Cherti, Oriol Vinyals, Olivier Hénaff, Samuel Albanie, Matthias Bethge, Zeynep Akata

    Abstract: Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practi…

    Submitted 6 December, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

    Comments: Technical Report. 52 pages. Shorter version published at the NeurIPS 2024 Dataset & Benchmarks track

  14. arXiv:2408.11817  [pdf, other]

    cs.CV

    GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

    Authors: Jonathan Roberts, Kai Han, Samuel Albanie

    Abstract: Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the…

    Submitted 29 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: V2: Fixed references formatting

  15. arXiv:2407.04622  [pdf, other]

    cs.LG

    On scalable oversight with weak LLMs judging strong LLMs

    Authors: Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah

    Abstract: Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI a…

    Submitted 12 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: 15 pages (53 including appendices). V2: minor correction to Figure 3; add Figure A.9 comparing open vs assigned consultancy; add a reference

  16. arXiv:2406.06560  [pdf, other]

    cs.CL cs.AI

    Inverse Constitutional AI: Compressing Preferences into Principles

    Authors: Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins

    Abstract: Feedback data is widely used for fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences, where human or AI annotators select the "better" of two options, are particularly common. Such preferences are used to train (reward) models or to rank models with aggregate statistics. For many applications it is desirable to understand annotator preferences in addition to modelling…

    Submitted 21 April, 2025; v1 submitted 2 June, 2024; originally announced June 2024.

    Comments: Accepted at ICLR 2025, v2 is camera-ready version; Main changes from v1: extended experiments, additional baselines

  17. arXiv:2406.03428  [pdf, other]

    cs.LG

    HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits

    Authors: Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip Torr, João F. Henriques, Jakob N. Foerster

    Abstract: Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: Data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Findings

  18. arXiv:2405.10266  [pdf, other]

    cs.CV cs.CL

    A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision

    Authors: Charles Raude, K R Prajwal, Liliane Momeni, Hannah Bull, Samuel Albanie, Andrew Zisserman, Gül Varol

    Abstract: In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR), and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new…

    Submitted 16 May, 2024; originally announced May 2024.

  19. arXiv:2405.08807  [pdf, other]

    cs.CV

    SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation

    Authors: Jonathan Roberts, Kai Han, Neil Houlsby, Samuel Albanie

    Abstract: Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we pres…

    Submitted 5 December, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: Accepted at NeurIPS 2024 (Datasets and Benchmarks Track)

  20. arXiv:2404.09932  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    Foundational Challenges in Assuring Alignment and Safety of Large Language Models

    Authors: Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, et al. (17 additional authors not shown)

    Abstract: This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.

    Submitted 5 September, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

  21. arXiv:2404.04125  [pdf, other]

    cs.CV cs.CL cs.LG

    No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

    Authors: Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge

    Abstract: Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream conce…

    Submitted 29 October, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Short version accepted at DPFM, ICLR'24; Full paper at NeurIPS'24

  22. arXiv:2402.19472  [pdf, other]

    cs.LG cs.CV

    Efficient Lifelong Model Evaluation in an Era of Rapid Progress

    Authors: Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

    Abstract: Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge: the high cost of evaluating a growing number of mode…

    Submitted 23 November, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: Accepted as a conference paper at NeurIPS'24

  23. arXiv:2402.19106  [pdf, other]

    eess.AS cs.IR cs.SD

    A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

    Authors: Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke

    Abstract: Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio in…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 9 pages, 2 figures, 9 tables, Accepted at ICASSP 2024

  24. arXiv:2312.12490  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    InstructVideo: Instructing Video Diffusion Models with Human Feedback

    Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

    Abstract: Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredient…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Project page: https://instructvideo.github.io/

  25. arXiv:2311.14656  [pdf, other]

    cs.CV cs.AI

    Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs

    Authors: Jonathan Roberts, Timo Lüddecke, Rehan Sheikh, Kai Han, Samuel Albanie

    Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, urban development, and disaster response. We conduct a series of experiments exploring various vision capabilities o…

    Submitted 16 January, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: V3: Fixed typo in Fig.1; V2: Minor formatting changes and added missing subfigure captions

  26. arXiv:2310.08577  [pdf, other]

    cs.CV cs.CL cs.LG

    Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models

    Authors: Vishaal Udandarao, Max F. Burg, Samuel Albanie, Matthias Bethge

    Abstract: Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specif…

    Submitted 6 December, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

  27. arXiv:2308.10402  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC

    Simple Baselines for Interactive Video Retrieval with Questions and Answers

    Authors: Kaiqu Liang, Samuel Albanie

    Abstract: To date, the majority of video retrieval systems have been optimized for a "single-shot" scenario in which the user submits a query in isolation, ignoring previous interactions with the system. Recently, there has been renewed interest in interactive systems to enhance retrieval, but existing approaches are complex and deliver limited gains in performance. In this work, we revisit this topic and p…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: ICCV 2023, project page: https://github.com/kevinliang888/IVR-QA-baselines

  28. arXiv:2308.09351  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    RLIPv2: Fast Scaling of Relational Language-Image Pre-training

    Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

    Abstract: Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging mode…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted to ICCV 2023. Code and models: https://github.com/JacobYuan7/RLIPv2

  29. arXiv:2306.07968  [pdf, other]

    cs.CL cs.AI

    arXiVeri: Automatic table verification with GPT

    Authors: Gyungin Shin, Weidi Xie, Samuel Albanie

    Abstract: Without accurate transcription of numerical data in scientific documents, a scientist cannot draw accurate conclusions. Unfortunately, the process of copying numerical data from one paper to another is prone to human error. In this paper, we propose to meet this challenge through the novel task of automatic table verification (AutoTV), in which the objective is to verify the accuracy of numerical…

    Submitted 13 June, 2023; originally announced June 2023.

    Comments: Tech report

  30. arXiv:2306.00020  [pdf, other]

    cs.CL cs.AI cs.LG

    GPT4GEO: How a Language Model Sees the World's Geography

    Authors: Jonathan Roberts, Timo Lüddecke, Sowmen Das, Kai Han, Samuel Albanie

    Abstract: Large language models (LLMs) have shown remarkable capabilities across a broad range of tasks involving question answering and the generation of coherent text and code. Comprehensively understanding the strengths and weaknesses of LLMs is beneficial for safety, downstream applications and improving performance. In this work, we investigate the degree to which GPT-4 has acquired factual geographic…

    Submitted 30 May, 2023; originally announced June 2023.

  31. arXiv:2304.14376  [pdf, other]

    cs.CV

    Zero-shot Unsupervised Transfer Instance Segmentation

    Authors: Gyungin Shin, Samuel Albanie, Weidi Xie

    Abstract: Segmentation is a core computer vision competency, with applications spanning a broad range of scientifically and economically valuable domains. To date, however, the prohibitive cost of annotation has limited the deployment of flexible segmentation models. In this work, we propose Zero-shot Unsupervised Transfer Instance Segmentation (ZUTIS), a framework that aims to meet this challenge. The key…

    Submitted 27 April, 2023; originally announced April 2023.

    Comments: Accepted to CVPRW 2023. Code: https://github.com/NoelShin/zutis

  32. arXiv:2304.11619  [pdf, other]

    cs.CV cs.AI cs.LG

    SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models

    Authors: Jonathan Roberts, Kai Han, Samuel Albanie

    Abstract: Interpreting remote sensing imagery enables numerous downstream applications ranging from land-use planning to deforestation monitoring. Robustly classifying this data is challenging due to the Earth's geographic diversity. While many distinct satellite and aerial image classification datasets exist, there is yet to be a benchmark curated that suitably covers this diversity. In this work, we intro…

    Submitted 23 April, 2023; originally announced April 2023.

  33. arXiv:2304.10970  [pdf, other]

    cs.LG

    Can GPT-4 Perform Neural Architecture Search?

    Authors: Mingkai Zheng, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu, Samuel Albanie

    Abstract: We investigate the potential of GPT-4 to perform Neural Architecture Search (NAS) -- the task of designing effective neural architectures. Our proposed approach, GPT-4 Enhanced Neural archItectUre Search (GENIUS), leverages the generative capabilities of GPT-4 as a black-box optimiser to quickly navigate the architecture search spac…

    Submitted 1 August, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

  34. arXiv:2304.00521  [pdf, other]

    cs.DL cs.LG

    Large Language Models are Few-shot Publication Scoopers

    Authors: Samuel Albanie, Liliane Momeni, João F. Henriques

    Abstract: Driven by recent advances in AI, we passengers are entering a golden age of scientific discovery. But golden for whom? Confronting our insecurity that others may beat us to the most acclaimed breakthroughs of the era, we propose a novel solution to the long-standing personal credit assignment problem to ensure that it is golden for us. At the heart of our approach is a pip-to-the-post algorithm that…

    Submitted 2 April, 2023; originally announced April 2023.

    Comments: SIGBOVIK 2023

  35. arXiv:2303.08817  [pdf, other]

    cs.CV

    DeepMIM: Deep Supervision for Masked Image Modeling

    Authors: Sucheng Ren, Fangyun Wei, Samuel Albanie, Zheng Zhang, Han Hu

    Abstract: Deep supervision, which involves extra supervisions to the intermediate features of a neural network, was widely used in image classification in the early deep learning era since it significantly reduces the training difficulty and eases the optimization like avoiding gradient vanish over the vanilla training. Nevertheless, with the emergence of normalization techniques and residual connection, de…

    Submitted 16 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Code and models are available at https://github.com/OliverRensu/DeepMIM

  36. arXiv:2211.16198  [pdf, other]

    cs.CV cs.CL cs.MM

    SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

    Authors: Vishaal Udandarao, Ankush Gupta, Samuel Albanie

    Abstract: Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, r…

    Submitted 15 August, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: Accepted at ICCV2023

  37. arXiv:2211.08954  [pdf, other]

    cs.CV

    Weakly-supervised Fingerspelling Recognition in British Sign Language Videos

    Authors: K R Prajwal, Hannah Bull, Liliane Momeni, Samuel Albanie, Gül Varol, Andrew Zisserman

    Abstract: The goal of this work is to detect and recognize sequences of letters signed using fingerspelling in British Sign Language (BSL). Previous fingerspelling recognition methods have not focused on BSL, which has a very different signing alphabet (e.g., two-handed instead of one-handed) to American Sign Language (ASL). They also use manual annotations for training. In contrast to previous methods, our…

    Submitted 16 November, 2022; originally announced November 2022.

    Comments: Appears in: British Machine Vision Conference 2022 (BMVC 2022)

  38. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  39. arXiv:2211.01786  [pdf, other]

    cs.CL cs.AI cs.LG

    Crosslingual Generalization through Multitask Finetuning

    Authors: Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel

    Abstract: Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks wi…

    Submitted 29 May, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: 9 main pages (119 with appendix), 16 figures and 11 tables

  40. arXiv:2209.11228  [pdf, other]

    cs.CV cs.AI cs.LG

    NamedMask: Distilling Segmenters from Complementary Foundation Models

    Authors: Gyungin Shin, Weidi Xie, Samuel Albanie

    Abstract: The goal of this work is to segment and name regions of images without access to pixel-level labels during training. To tackle this task, we construct segmenters by distilling the complementary strengths of two foundation models. The first, CLIP (Radford et al. 2021), exhibits the ability to assign names to image content but lacks an accessible representation of object structure. The second, DINO…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: Tech report. Code: https://github.com/NoelShin/namedmask

  41. arXiv:2209.01814  [pdf, other]

    cs.CV

    RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

    Authors: Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang

    Abstract: The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains und…

    Submitted 16 November, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: Accepted to NeurIPS 2022 as a Spotlight paper

  42. arXiv:2208.02802  [pdf, other]

    cs.CV

    Automatic dense annotation of large-vocabulary sign language videos

    Authors: Liliane Momeni, Hannah Bull, K R Prajwal, Samuel Albanie, Gül Varol, Andrew Zisserman

    Abstract: Recently, sign language researchers have turned to sign language interpreted TV broadcasts, comprising (i) a video of continuous signing and (ii) subtitles corresponding to the audio content, as a readily available and large-scale source of training data. One key challenge in the usability of such data is the lack of sign annotations. Previous work exploiting such weakly-aligned data only found sp…

    Submitted 4 August, 2022; originally announced August 2022.

    Comments: ECCV 2022 Camera Ready

  43. arXiv:2206.07045  [pdf, other]

    cs.CV cs.AI cs.LG

    ReCo: Retrieve and Co-segment for Zero-shot Transfer

    Authors: Gyungin Shin, Weidi Xie, Samuel Albanie

    Abstract: Semantic segmentation has a broad range of applications, but its real-world impact has been significantly limited by the prohibitive annotation costs necessary to enable deployment. Segmentation methods that forgo supervision can side-step these costs, but exhibit the inconvenient requirement to provide labelled examples from the target distribution to assign concept names to predictions. An alter…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: Tech report. Code: https://github.com/NoelShin/reco

  44. Scaling up sign spotting through sign language dictionaries

    Authors: Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman

    Abstract: The focus of this work is $\textit{sign spotting}$ - given a video of an isolated sign, our task is to identify $\textit{whether}$ and $\textit{where}$ it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) $\textit{watching}$ existing footage which is sparsely labelled using…

    Submitted 9 May, 2022; originally announced May 2022.

    Comments: Appears in: 2022 International Journal of Computer Vision (IJCV). 25 pages. arXiv admin note: substantial text overlap with arXiv:2010.04002

    Journal ref: International Journal of Computer Vision (2022)

  45. arXiv:2203.17265  [pdf, other]

    cs.LG

    A 23 MW data centre is all you need

    Authors: Samuel Albanie, Dylan Campbell, João F. Henriques

    Abstract: The field of machine learning has achieved striking progress in recent years, witnessing breakthrough results on language modelling, protein folding and nitpickingly fine-grained dog breed classification. Some even succeeded at playing computer games and board games, a feat both of engineering and of setting their employers' expectations. The central contribution of this work is to carefully exami…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: SIGBOVIK 2022

  46. arXiv:2203.12614  [pdf, other]

    cs.CV

    Unsupervised Salient Object Detection with Spectral Cluster Voting

    Authors: Gyungin Shin, Samuel Albanie, Weidi Xie

    Abstract: In this paper, we tackle the challenging task of unsupervised salient object detection (SOD) by leveraging spectral clustering on self-supervised features. We make the following contributions: (i) We revisit spectral clustering and demonstrate its potential to group the pixels of salient objects; (ii) Given mask proposals from multiple applications of spectral clustering on image features computed…

    Submitted 23 March, 2022; originally announced March 2022.

    Comments: 14 pages, 5 figures

  47. arXiv:2201.02495  [pdf, other]

    cs.CV cs.AI cs.CL

    Sign Language Video Retrieval with Free-Form Textual Queries

    Authors: Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, Gül Varol

    Abstract: Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written quer…

    Submitted 15 September, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

    Comments: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022

  48. arXiv:2112.12777  [pdf, other]

    cs.CV

    Cross Modal Retrieval with Querybank Normalisation

    Authors: Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie

    Abstract: Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem" in which a small number of gallery embeddings…

    Submitted 18 April, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

    Comments: Accepted at CVPR 2022

  49. arXiv:2112.09418  [pdf, other]

    eess.AS cs.IR cs.SD

    Audio Retrieval with Natural Language Queries: A Benchmark Study

    Authors: A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie

    Abstract: The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like…

    Submitted 27 January, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: Submitted to Transactions on Multimedia. arXiv admin note: substantial text overlap with arXiv:2105.02192

    Journal ref: IEEE Transactions on Multimedia 2022

  50. arXiv:2111.03635  [pdf, other]

    cs.CV

    BBC-Oxford British Sign Language Dataset

    Authors: Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland, Andrew Zisserman

    Abstract: In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL). BOBSL is an extended and publicly released dataset based on the BSL-1K dataset introduced in previous work. We describe the motivation for the dataset, together with statistics and available annotations. We conduct experiments to provide baselines for the…

    Submitted 5 November, 2021; originally announced November 2021.