-
LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
Authors:
Anthony Fuller,
Yousef Yassin,
Junfeng Wen,
Daniel G. Kyrollos,
Tarek Ibrahim,
James R. Green,
Evan Shelhamer
Abstract:
Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.
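The selector/extractor split described above can be pictured with a small sketch. It assumes the selector has already produced one score per high-resolution patch; the function names, the top-k rule, and the array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patchify(image, patch):
    """Cut an (H, W, C) image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel), then flatten.
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def select_patches(image, scores, k, patch):
    """Keep only the k highest-scoring patches (hypothetical top-k rule).

    `scores` has shape (H // patch, W // patch) and stands in for the output
    of a low-resolution selector; only the returned tokens would be passed
    to the high-resolution extractor.
    """
    tokens = patchify(image, patch)
    top = np.argsort(scores.ravel())[::-1][:k]  # indices of the k largest scores
    top = np.sort(top)                          # restore raster order
    return top, tokens[top]

# Toy input: an 8x8 single-channel image split into four 4x4 patches.
image = np.arange(64, dtype=np.float32).reshape(8, 8, 1)
scores = np.array([[0.1, 0.9],
                   [0.8, 0.2]])
kept_idx, kept_tokens = select_patches(image, scores, k=2, patch=4)
```

With k patches kept out of N, the extractor attends over k tokens instead of N, which is where savings of the kind the abstract reports would come from.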
Submitted 23 May, 2025;
originally announced May 2025.
-
Corporate Needs You to Find the Difference: Revisiting Submodular and Supermodular Ratio Optimization Problems
Authors:
Elfarouk Harb,
Yousef Yassin,
Chandra Chekuri
Abstract:
We study the problem of minimizing or maximizing the average value $ f(S)/|S| $ of a submodular or supermodular set function $ f: 2^V \to \mathbb{R} $ over non-empty subsets $ S \subseteq V $. This generalizes classical problems such as Densest Subgraph (DSG), Densest Supermodular Set (DSS), and Submodular Function Minimization (SFM). Motivated by recent applications, we introduce two broad formulations: Unrestricted Sparsest Submodular Set (USSS) and Unrestricted Densest Supermodular Set (UDSS), which allow for negative and non-monotone functions.
We show that DSS, SFM, USSS, UDSS, and the Minimum Norm Point (MNP) problem are equivalent under strongly polynomial-time reductions, enabling algorithmic crossover. In particular, viewing these through the lens of the MNP in the base polyhedron, we connect Fujishige's theory with dense decomposition, and show that both Fujishige-Wolfe's algorithm and the heuristic \textsc{SuperGreedy++} act as universal solvers for all these problems, including submodular function minimization.
Theoretically, we explain why \textsc{SuperGreedy++} is effective beyond DSS, including for tasks like submodular minimization and minimum $ s $-$ t $ cut. Empirically, we test several solvers, including the Fujishige-Wolfe algorithm, on over 400 experiments across seven problem types and large-scale real/synthetic datasets. Surprisingly, general-purpose convex and flow-based methods outperform task-specific baselines, demonstrating that with the right framing, general optimization techniques can be both scalable and state-of-the-art for submodular and supermodular ratio problems.
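To make the ratio objective $ f(S)/|S| $ concrete, the sketch below brute-forces the Densest Subgraph special case, where $ f(S) $ counts the edges induced by $ S $ (a supermodular function). It is exponential-time and only illustrates the objective; it is not Fujishige-Wolfe, \textsc{SuperGreedy++}, or any solver from the paper, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def induced_edges(edges, subset):
    """f(S): number of edges with both endpoints inside S."""
    s = set(subset)
    return sum(1 for u, v in edges if u in s and v in s)

def densest_subset(vertices, edges):
    """Maximize f(S)/|S| over all non-empty subsets by exhaustive search."""
    best_set, best_ratio = None, None
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            ratio = Fraction(induced_edges(edges, subset), k)
            if best_ratio is None or ratio > best_ratio:
                best_set, best_ratio = subset, ratio
    return best_set, best_ratio

# Toy graph: a 4-clique on {0, 1, 2, 3} plus a pendant vertex 4 attached to 3.
vertices = [0, 1, 2, 3, 4]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
best_set, best_ratio = densest_subset(vertices, edges)
print(best_set, best_ratio)  # (0, 1, 2, 3) 3/2
```

The clique has density 6/4, which beats the whole graph at 7/5, so adding the pendant vertex lowers the average.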
Submitted 22 May, 2025;
originally announced May 2025.
-
Simpler Fast Vision Transformers with a Jumbo CLS Token
Authors:
Anthony Fuller,
Yousef Yassin,
Daniel G. Kyrollos,
Evan Shelhamer,
James R. Green
Abstract:
We introduce a simple enhancement of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Since there is only one Jumbo token, its cost is minimal, and because we share this FFN across layers, its parameter count is controlled. Jumbo significantly improves over ViT+Registers on ImageNet-1K and ImageNet-21K. These gains are largest at small sizes / high speeds, e.g., ViT-nano+Jumbo outperforms ViT-nano+Registers by 13%. In fact, our Jumbo models are so efficient that they outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs, such as support for token dropping and other modalities. Accordingly, we demonstrate that Jumbo excels in these two settings via masked autoencoding and on a suite of time series benchmarks. Code and weights available: https://github.com/antofuller/jumbo
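The split, attend, reassemble, and wide-FFN steps described above can be sketched in NumPy. This is a minimal illustration under simplifying assumptions (single-head attention with no learned projections, no normalization layers, patch-token FFN omitted, hypothetical names and sizes); the linked repository holds the authors' actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head attention with identity projections (a simplification)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def jumbo_block(jumbo, patches, w1, w2):
    """One block: split the wide token, attend jointly, reassemble, wide FFN.

    jumbo:   (J * D,) wide CLS token
    patches: (N, D) patch tokens
    w1, w2:  weights of the FFN applied only to the wide token
    """
    n, d = patches.shape
    j = jumbo.shape[0] // d
    pieces = jumbo.reshape(j, d)                      # split to patch width
    seq = np.concatenate([pieces, patches], axis=0)   # (J + N, D)
    seq = seq + self_attention(seq)                   # residual attention
    jumbo_out = seq[:j].reshape(j * d)                # reassemble wide token
    patches_out = seq[j:]
    hidden = np.maximum(jumbo_out @ w1, 0.0)          # wide FFN with ReLU
    jumbo_out = jumbo_out + hidden @ w2
    return jumbo_out, patches_out

rng = np.random.default_rng(0)
d, j, n = 8, 4, 16
jumbo = rng.normal(size=j * d)
patches = rng.normal(size=(n, d))
w1 = rng.normal(size=(j * d, 4 * j * d)) * 0.02
w2 = rng.normal(size=(4 * j * d, j * d)) * 0.02
jumbo_out, patches_out = jumbo_block(jumbo, patches, w1, w2)
```

Because the wide FFN runs on a single token per image, its cost is small next to the per-patch work, which matches the abstract's cost argument.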
Submitted 23 May, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
Authors:
Anthony Fuller,
Daniel G. Kyrollos,
Yousef Yassin,
James R. Green
Abstract:
High-resolution images offer more information about scenes that can improve model accuracy. However, the dominant model architecture in computer vision, the vision transformer (ViT), cannot effectively leverage larger images without finetuning -- ViTs poorly extrapolate to more patches at test time, although transformers offer sequence length flexibility. We attribute this shortcoming to the current patch position encoding methods, which create a distribution shift when extrapolating.
We propose a drop-in replacement for the position encoding of plain ViTs that restricts attention heads to fixed fields of view, pointed in different directions, using 2D attention masks. Our novel method, called LookHere, provides translation-equivariance, ensures attention head diversity, and limits the distribution shift that attention heads face when extrapolating. We demonstrate that LookHere improves performance on classification (avg. 1.6%), against adversarial attack (avg. 5.4%), and decreases calibration error (avg. 1.5%) -- on ImageNet without extrapolation. With extrapolation, LookHere outperforms the current SoTA position encoding method, 2D-RoPE, by 21.7% on ImageNet when trained at $224^2$ px and tested at $1024^2$ px. Additionally, we release a high-resolution test set to improve the evaluation of high-resolution image classifiers, called ImageNet-HR.
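One way to picture a 2D attention mask with a fixed, directed field of view is sketched below. The abstract does not specify the mask geometry, so the half-plane rule used here (a key is visible if it lies in the head's direction from the query) is an assumption made for illustration, as is the function name.

```python
import numpy as np

def directional_mask(height, width, direction):
    """Boolean (H*W, H*W) mask: entry [q, k] is True if query q may attend to key k.

    `direction` is a (d_row, d_col) vector. A key is visible when its offset
    from the query has a non-negative dot product with that vector
    (hypothetical half-plane field of view).
    """
    rows, cols = np.divmod(np.arange(height * width), width)
    d_row = rows[None, :] - rows[:, None]   # key row minus query row
    d_col = cols[None, :] - cols[:, None]   # key column minus query column
    dr, dc = direction
    return (d_row * dr + d_col * dc) >= 0

# A head that looks to the right on a 2x3 patch grid.
right = directional_mask(2, 3, (0, 1))
# One mask per head, each pointed in a different direction.
heads = [directional_mask(2, 3, v) for v in [(0, 1), (0, -1), (1, 0), (-1, 0)]]
```

The mask depends only on the relative offset between query and key, so it is unchanged when both are shifted together, and the same rule extends to a larger patch grid at test time.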
Submitted 29 October, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.