
Showing 1–50 of 93 results for author: Álvarez, J M

Searching in archive cs.
  1. arXiv:2506.14808  [pdf, ps, other]

    cs.LG

    PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models

    Authors: Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andres Bruhn, Jose M. Alvarez

    Abstract: Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this instability to varying prompts. We therefore investigate which prompt variations VLMs are mo…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Accepted to CVPR 2025

  2. arXiv:2506.06664  [pdf, ps, other]

    cs.RO cs.CV

    Generalized Trajectory Scoring for End-to-end Multimodal Planning

    Authors: Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, Jose M. Alvarez

    Abstract: End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approac…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: The 1st place solution of the End-to-end Driving Track at the CVPR 2025 Autonomous Grand Challenge

  3. arXiv:2506.06659  [pdf, other]

    cs.RO cs.AI cs.CV

    DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning

    Authors: Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M. Alvarez, Zuxuan Wu

    Abstract: In complex driving environments, autonomous vehicles must navigate safely. Relying on a single predicted path, as in regression-based approaches, usually does not explicitly assess the safety of the predicted trajectory. Selection-based methods address this by generating and scoring multiple trajectory candidates and predicting the safety score for each, but face optimization challenges in precise…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: 15 pages, 6 figures

  4. arXiv:2505.24498  [pdf, ps, other]

    cs.LG

    Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem

    Authors: Andres Fernandez, Juan Azcarreta, Cagdas Bilen, Jesus Monge Alvarez

    Abstract: Recent work in online speech spectrogram inversion effectively combines Deep Learning with the Gradient Theorem to predict phase derivatives directly from magnitudes. Then, phases are estimated from their derivatives via least squares, resulting in a high quality reconstruction. In this work, we introduce three innovations that drastically reduce computational cost, while maintaining high quality:…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted at InterSpeech 2025

  5. arXiv:2504.19819  [pdf, other]

    cs.CV

    Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video

    Authors: Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu

    Abstract: Neural Radiance Fields (NeRF) have demonstrated superior capability to represent 3D geometry but require accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they…

    Submitted 28 April, 2025; originally announced April 2025.

  6. arXiv:2504.04348   

    cs.CV

    OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

    Authors: Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez

    Abstract: The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through co…

    Submitted 16 April, 2025; v1 submitted 5 April, 2025; originally announced April 2025.

    Comments: Mistaken resubmission. The original version is at arXiv:2405.01533

  7. arXiv:2504.02168  [pdf, other]

    cs.CV cs.AI cs.LG

    MDP: Multidimensional Vision Model Pruning with Latency Constraint

    Authors: Xinglong Sun, Barath Lakshmanan, Maying Shen, Shiyi Lan, Jingde Chen, Jose M. Alvarez

    Abstract: Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where mult…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: Accepted at CVPR 2025

  8. arXiv:2503.12820  [pdf, other]

    cs.CV

    Hydra-MDP++: Advancing End-to-End Driving via Expert-Guided Hydra-Distillation

    Authors: Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, Jose M. Alvarez

    Abstract: Hydra-MDP++ introduces a novel teacher-student knowledge distillation framework with a multi-head decoder that learns from human demonstrations and rule-based experts. Using a lightweight ResNet-34 network without complex components, the framework incorporates expanded evaluation metrics, including traffic light compliance (TL), lane-keeping ability (LK), and extended comfort (EC) to address unsaf…

    Submitted 17 March, 2025; originally announced March 2025.

  9. arXiv:2503.12030  [pdf, other]

    cs.RO cs.CV

    Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training

    Authors: Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez

    Abstract: End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment, which struggle with quick reactions to other agents in closed-loop environments and risk generating kinematically infeasible plans due to the gap between open-loop t…

    Submitted 15 March, 2025; originally announced March 2025.

  10. arXiv:2503.11650  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Centaur: Robust End-to-End Autonomous Driving with Test-Time Training

    Authors: Chonghao Sima, Kashyap Chitta, Zhiding Yu, Shiyi Lan, Ping Luo, Andreas Geiger, Hongyang Li, Jose M. Alvarez

    Abstract: How can we rely on an end-to-end autonomous vehicle's complex decision-making system during deployment? One common solution is to have a "fallback layer" that checks the planned trajectory for rule violations and replaces it with a pre-defined safe action if necessary. Another approach involves adjusting the planner's decisions to minimize a pre-defined "cost function" using additional system…

    Submitted 14 March, 2025; originally announced March 2025.

  11. arXiv:2503.03957  [pdf, other]

    cs.RO cs.CV

    Enhancing Autonomous Driving Safety with Collision Scenario Integration

    Authors: Zi Wang, Shiyi Lan, Xinglong Sun, Nadine Chang, Zhenxin Li, Zhiding Yu, Jose M. Alvarez

    Abstract: Autonomous vehicle safety is crucial for the successful deployment of self-driving cars. However, most existing planning methods rely heavily on imitation learning, which limits their ability to leverage collision data effectively. Moreover, collecting collision or near-collision data is inherently challenging, as it involves risks and raises ethical and practical concerns. In this paper, we propo…

    Submitted 5 March, 2025; originally announced March 2025.

  12. arXiv:2502.03658  [pdf, other]

    cs.LG cs.CV

    Advancing Weight and Channel Sparsification with Enhanced Saliency

    Authors: Xinglong Sun, Maying Shen, Hongxu Yin, Lei Mao, Pavlo Molchanov, Jose M. Alvarez

    Abstract: Pruning aims to accelerate and compress models by removing redundant parameters, identified by specifically designed importance scores which are usually imperfect. This removal is irreversible, often leading to subpar performance in pruned models. Dynamic sparse training, while attempting to adjust sparse structures during training for continual reassessment and refinement, has several limitations…

    Submitted 5 February, 2025; originally announced February 2025.

    Comments: Accepted at WACV 2025

  13. Counterfactual Situation Testing: From Single to Multidimensional Discrimination

    Authors: Jose M. Alvarez, Salvatore Ruggieri

    Abstract: We present counterfactual situation testing (CST), a causal data mining framework for detecting individual discrimination in a dataset of classifier decisions. CST answers the question "what would have been the model outcome had the individual, or complainant, been of a different protected status?" It extends the legally-grounded situation testing (ST) of Thanh et al. (2011) by operationalizing…

    Submitted 7 April, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

  14. arXiv:2502.01211  [pdf, other]

    cs.LG stat.ML

    Privilege Scores

    Authors: Ludwig Bothmann, Philip A. Boustani, Jose M. Alvarez, Giuseppe Casalicchio, Bernd Bischl, Susanne Dandl

    Abstract: Bias-transforming methods of fairness-aware machine learning aim to correct a non-neutral status quo with respect to a protected attribute (PA). Current methods, however, lack an explicit formulation of what drives non-neutrality. We introduce privilege scores (PS) to measure PA-related privilege by comparing the model predictions in the real world with those in a fair world in which the influence…

    Submitted 3 February, 2025; originally announced February 2025.

  15. arXiv:2501.14818  [pdf, other]

    cs.CV cs.AI cs.LG

    Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

    Authors: Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, Nadine Chang, Karan Sapra, Amala Sanjay Deshmukh, Tuomas Rintamaki, Matthieu Le, Ilia Karmanov, Lukas Voegtle, Philipp Fischer, De-An Huang, Timo Roman, Tong Lu, Jose M. Alvarez, Bryan Catanzaro, Jan Kautz, Andrew Tao, et al. (2 additional authors not shown)

    Abstract: Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, s…

    Submitted 20 January, 2025; originally announced January 2025.

  16. arXiv:2412.01941  [pdf, other]

    cs.CV

    Global Average Feature Augmentation for Robust Semantic Segmentation with Transformers

    Authors: Alberto Gonzalo Rodriguez Salgado, Maying Shen, Philipp Harzig, Peter Mayer, Jose M. Alvarez

    Abstract: Robustness to out-of-distribution data is crucial for deploying modern neural networks. Recently, Vision Transformers, such as SegFormer for semantic segmentation, have shown impressive robustness to visual corruptions like blur or noise affecting the acquisition device. In this paper, we propose Channel Wise Feature Augmentation (CWFA), a simple yet efficient feature augmentation technique to imp…

    Submitted 13 December, 2024; v1 submitted 2 December, 2024; originally announced December 2024.

  17. arXiv:2410.23910  [pdf, other]

    cs.CV

    Uncertainty Estimation for 3D Object Detection via Evidential Learning

    Authors: Nikita Durasov, Rafid Mahmood, Jiwoong Choi, Marc T. Law, James Lucas, Pascal Fua, Jose M. Alvarez

    Abstract: 3D object detection is an essential task for computer vision applications in autonomous vehicles and robotics. However, models often struggle to quantify detection reliability, leading to poor performance on unfamiliar scenes. We introduce a framework for quantifying uncertainty in 3D object detection by leveraging an evidential learning loss on Bird's Eye View representations in the 3D detector.…

    Submitted 31 October, 2024; originally announced October 2024.

  18. arXiv:2409.13860  [pdf, other]

    cs.CV

    SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation

    Authors: Maying Shen, Nadine Chang, Sifei Liu, Jose M. Alvarez

    Abstract: In recent years, the data collected for artificial intelligence has grown to an unmanageable amount. Particularly within industrial applications, such as autonomous vehicles, model training computation budgets are being exceeded while model performance is saturating -- and yet more data continues to pour in. To navigate the flood of data, we propose a framework to select the most semantically dive…

    Submitted 20 September, 2024; originally announced September 2024.

  19. arXiv:2407.07276  [pdf, other]

    cs.CV cs.AI

    Exploring Camera Encoder Designs for Autonomous Driving Perception

    Authors: Barath Lakshmanan, Joshua Chen, Shiyi Lan, Maying Shen, Zhiding Yu, Jose M. Alvarez

    Abstract: The cornerstone of autonomous vehicles (AV) is a solid perception system, where camera encoders play a crucial role. Existing works usually leverage pre-trained Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) designed for general vision tasks, such as image classification, segmentation, and 2D detection. Although those well-known architectures have achieved state-of-the-art accur…

    Submitted 9 July, 2024; originally announced July 2024.

  20. arXiv:2406.06978  [pdf, other]

    cs.CV

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Authors: Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, Yu-Gang Jiang, Jose M. Alvarez

    Abstract: We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment…

    Submitted 29 August, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: The 1st place solution of End-to-end Driving at Scale at the CVPR 2024 Autonomous Grand Challenge

  21. arXiv:2406.04484  [pdf, ps, other]

    cs.CV

    Step Out and Seek Around: On Warm-Start Training with Incremental Data

    Authors: Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jose M. Alvarez

    Abstract: Data often arrives in sequence over time in real-world deep learning applications such as autonomous driving. When new training data is available, training the model from scratch undermines the benefit of leveraging the learned knowledge, leading to significant training costs. Warm-starting from a previously trained checkpoint is the most intuitive way to retain knowledge and advance learning. How…

    Submitted 6 June, 2024; originally announced June 2024.

  22. arXiv:2405.18902  [pdf, other]

    cs.LG cs.AI stat.ML

    A Causal Framework for Evaluating Deferring Systems

    Authors: Filippo Palomba, Andrea Pugnana, José Manuel Alvarez, Salvatore Ruggieri

    Abstract: Deferring systems extend supervised Machine Learning (ML) models with the possibility to defer predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems, which…

    Submitted 7 April, 2025; v1 submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted at AISTATS 2025

  23. arXiv:2405.17187  [pdf, other]

    cs.CV cs.AI cs.RO

    Memorize What Matters: Emergent Scene Decomposition from Multitraverse

    Authors: Yiming Li, Zehong Wang, Yue Wang, Zhiding Yu, Zan Gojcic, Marco Pavone, Chen Feng, Jose M. Alvarez

    Abstract: Humans naturally retain memories of permanent elements, while ephemeral moments often slip through the cracks of memory. This selective retention is crucial for robotic perception, localization, and mapping. To endow robots with this capability, we introduce 3D Gaussian Mapping (3DGM), a self-supervised, camera-only offline mapping framework grounded in 3D Gaussian Splatting. 3DGM converts multitr…

    Submitted 29 May, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: Project page: https://3d-gaussian-mapping.github.io; Code and data: https://github.com/NVlabs/3DGM

  24. arXiv:2405.13693  [pdf, other]

    cs.LG

    Mutatis Mutandis: Revisiting the Comparator in Discrimination Testing

    Authors: Jose M. Alvarez, Salvatore Ruggieri

    Abstract: Testing for discrimination consists of deriving a profile, known as the comparator, similar to the profile making the discrimination claim, known as the complainant, and comparing the outcomes of these two profiles. An important aspect for establishing discrimination is evidence, often obtained via discrimination testing tools that implement the complainant-comparator pair. In this work, we revisi…

    Submitted 1 October, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  25. arXiv:2405.01533  [pdf, other]

    cs.CV

    OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

    Authors: Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez

    Abstract: The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through co…

    Submitted 16 April, 2025; v1 submitted 2 May, 2024; originally announced May 2024.

  26. arXiv:2404.14908  [pdf, other]

    cs.CV

    Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation

    Authors: Hoang Chuong Nguyen, Tianyu Wang, Jose M. Alvarez, Miaomiao Liu

    Abstract: This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024

  27. arXiv:2404.01990  [pdf, other]

    cs.CV

    What is Point Supervision Worth in Video Instance Segmentation?

    Authors: Shuaiyi Huang, De-An Huang, Zhiding Yu, Shiyi Lan, Subhashree Radhakrishnan, Jose M. Alvarez, Abhinav Shrivastava, Anima Anandkumar

    Abstract: Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos. Conventional VIS methods rely on densely-annotated object masks which are expensive. We reduce the human annotations to only one point for each object in a video frame during training, and obtain high-quality mask predictions close to fully supervised models. Our proposed train…

    Submitted 1 April, 2024; originally announced April 2024.

  28. arXiv:2403.09230  [pdf, other]

    cs.CV

    Improving Distant 3D Object Detection Using 2D Box Supervision

    Authors: Zetong Yang, Zhiding Yu, Chris Choy, Renhao Wang, Anima Anandkumar, Jose M. Alvarez

    Abstract: Improving the detection of distant 3D objects is an important yet challenging task. For camera-based 3D perception, the annotation of 3D bounding boxes relies heavily on LiDAR for accurate depth information. As such, the distance of annotation is often limited due to the sparsity of LiDAR points on distant objects, which hampers the capability of existing detectors for long-range scenarios. We address t…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  29. arXiv:2401.13408  [pdf, other]

    cs.AI cs.CY cs.HC

    Causal Perception

    Authors: Jose M. Alvarez, Salvatore Ruggieri

    Abstract: Perception occurs when two individuals interpret the same information differently. Despite being a known phenomenon with implications for bias in decision-making, as individual experience determines interpretation, perception remains largely overlooked in machine learning (ML) research. Modern decision flows, whether partially or fully automated, involve human experts interacting with ML applicati…

    Submitted 22 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: arXiv admin note: text overlap with arXiv:2305.09535 by other authors

  30. arXiv:2401.03844  [pdf, other]

    cs.CV

    Fully Attentional Networks with Self-emerging Token Labeling

    Authors: Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez

    Abstract: Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method contains a two-stage training framew…

    Submitted 8 January, 2024; originally announced January 2024.

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5585-5595

  31. arXiv:2312.03031  [pdf, other]

    cs.CV

    Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

    Authors: Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez

    Abstract: End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observ…

    Submitted 2 June, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024

  32. arXiv:2312.01696  [pdf, other]

    cs.CV

    BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

    Authors: Zhenxin Li, Shiyi Lan, Jose M. Alvarez, Zuxuan Wu

    Abstract: Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization, depicting 3D scenes accurately and comprehensively. This…

    Submitted 24 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

  33. arXiv:2311.14671  [pdf, other]

    cs.CV

    SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

    Authors: Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang

    Abstract: In-context segmentation aims at segmenting novel images using a few labeled example images, termed as "in-context examples", exploring content similarities between examples and the target. The resulting models can be generalized seamlessly to novel segmentation tasks, significantly reducing the labeling and training costs compared with conventional pipelines. However, in-context segmentation is mo…

    Submitted 22 July, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: ECCV-24 camera-ready

  34. arXiv:2310.19731  [pdf, other]

    cs.CV cs.AI cs.LG

    ViR: Towards Efficient Vision Retention Backbones

    Authors: Ali Hatamizadeh, Michael Ranzinger, Shiyi Lan, Jose M. Alvarez, Sanja Fidler, Jan Kautz

    Abstract: Vision Transformers (ViTs) have attracted a lot of popularity in recent years, due to their exceptional capabilities in modeling long-range spatial dependencies and scalability for large scale training. Although the training parallelism of self-attention mechanism plays an important role in retaining great performance, its quadratic complexity baffles the application of ViTs in many scenarios whic…

    Submitted 26 January, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Introduction of Vision Retention Networks (ViR) for Efficient Visual Modeling

  35. arXiv:2309.05192  [pdf, other]

    cs.CV

    Towards Viewpoint Robustness in Bird's Eye View Segmentation

    Authors: Tzofi Klinghoffer, Jonah Philion, Wenzheng Chen, Or Litany, Zan Gojcic, Jungseock Joo, Ramesh Raskar, Sanja Fidler, Jose M. Alvarez

    Abstract: Autonomous vehicles (AV) require that neural networks used for perception be robust to different viewpoints if they are to be deployed across many types of vehicles without the repeated cost of data collection and labeling for each. AV companies typically focus on collecting data from diverse scenarios and locations, but not camera rig configurations, due to cost. As a result, only a small number…

    Submitted 10 September, 2023; originally announced September 2023.

    Comments: ICCV 2023. Project Page: https://nvlabs.github.io/viewpoint-robustness

  36. arXiv:2308.02236  [pdf, other]

    cs.CV

    FB-BEV: BEV Representation from Forward-Backward View Transformations

    Authors: Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, Jose M. Alvarez

    Abstract: View Transformation Module (VTM), where transformations happen between multi-view image features and Bird-Eye-View (BEV) representation, is a crucial step in camera-based BEV perception systems. Currently, the two most prominent VTM paradigms are forward projection and backward projection. Forward projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV features without post-pr…

    Submitted 17 August, 2023; v1 submitted 4 August, 2023; originally announced August 2023.

    Comments: Accepted to ICCV 2023, camera-ready version

  37. arXiv:2307.15398  [pdf, other]

    cs.LG cs.CY

    The Initial Screening Order Problem

    Authors: Jose M. Alvarez, Antonio Mastropietro, Salvatore Ruggieri

    Abstract: We investigate the role of the initial screening order (ISO) in candidate screening. The ISO refers to the order in which the screener searches the candidate pool when selecting $k$ candidates. Today, it is common for the ISO to be the product of an information access system, such as an online platform or a database query. The ISO has been largely overlooked in the literature, despite its impact o…

    Submitted 2 January, 2025; v1 submitted 28 July, 2023; originally announced July 2023.

    Comments: Forthcoming in the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM'25)

  38. arXiv:2307.04106  [pdf, other]

    cs.CV

    Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View

    Authors: Jiayu Yang, Enze Xie, Miaomiao Liu, Jose M. Alvarez

    Abstract: Recent vision-only perception models for autonomous driving achieved promising results by encoding multi-view image features into Bird's-Eye-View (BEV) space. A critical step and the main bottleneck of these methods is transforming image features into the BEV coordinate frame. This paper focuses on leveraging geometry information, such as depth, to model such feature transformation. Existing works…

    Submitted 11 July, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

  39. arXiv:2307.01492  [pdf, other]

    cs.CV cs.RO

    FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

    Authors: Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, Jose M. Alvarez

    Abstract: This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, which is held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and the CVPR 2023 Workshop on Vision-Centric Autonomous Driving. Our proposed solution FB-OCC builds upon FB-BEV, a cutting-edge camera-based bird's-eye view perception design using forward-backward projection.…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: Outstanding Champion and Innovation Award in the 3D Occupancy Prediction Challenge (CVPR23)

  40. arXiv:2306.06189  [pdf, other]

    cs.CV cs.AI cs.LG

    FasterViT: Fast Vision Transformers with Hierarchical Attention

    Authors: Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

    Abstract: We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-…

    Submitted 1 April, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

    Comments: ICLR'24 Accepted Paper

  41. Domain Adaptive Decision Trees: Implications for Accuracy and Fairness

    Authors: Jose M. Alvarez, Kristen M. Scott, Salvatore Ruggieri, Bettina Berendt

    Abstract: In uses of pre-trained machine learning models, it is a known issue that the target population in which the model is being deployed may not have been reflected in the source population with which the model was trained. This can result in a biased model when deployed, leading to a reduction in model performance. One risk is that, as the population changes, certain demographic groups will be under-s…

    Submitted 31 May, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: *Both authors contributed equally to this work. Accepted at FAccT '23

    Journal ref: FAccT '23: the 2023 ACM Conference on Fairness, Accountability, and Transparency Chicago IL USA June 12 - 15, 2023

  42. arXiv:2302.12251  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

    Authors: Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M. Alvarez, Sanja Fidler, Chen Feng, Anima Anandkumar

    Abstract: Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a…

    Submitted 25 March, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: CVPR 2023 Highlight (10% of accepted papers, 2.5% of submissions)

  43. arXiv:2302.11944  [pdf, other]

    stat.ML cs.CY cs.LG

    Counterfactual Situation Testing: Uncovering Discrimination under Fairness given the Difference

    Authors: Jose M. Alvarez, Salvatore Ruggieri

    Abstract: We present counterfactual situation testing (CST), a causal data mining framework for detecting discrimination in classifiers. CST aims to answer in an actionable and meaningful way the intuitive question "what would have been the model outcome had the individual, or complainant, been of a different protected status?" It extends the legally-grounded situation testing of Thanh et al. (2011) by oper…

    Submitted 16 October, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Journal ref: Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization; Boston, USA; October 30 - November 1, 2023

  44. arXiv:2301.03992  [pdf, other]

    cs.CV cs.LG cs.MM

    Vision Transformers Are Good Mask Auto-Labelers

    Authors: Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar

    Abstract: We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding…

    Submitted 10 January, 2023; originally announced January 2023.

  45. arXiv:2211.02206  [pdf, other]

    cs.CV

    Soft Masking for Cost-Constrained Channel Pruning

    Authors: Ryan Humble, Maying Shen, Jorge Albericio Latorre, Eric Darve, Jose M. Alvarez

    Abstract: Structured channel pruning has been shown to significantly accelerate inference time for convolutional neural networks (CNNs) on modern hardware, with a relatively minor loss of network accuracy. Recent works permanently zero these channels during training, which we observe to significantly hamper final accuracy, particularly as the fraction of the network being pruned increases. We propose Soft Mas…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Accepted by ECCV 2022

  46. arXiv:2210.06659  [pdf, other]

    cs.CV

    Structural Pruning via Latency-Saliency Knapsack

    Authors: Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez

    Abstract: Structural pruning can simplify network architecture and improve inference speed. We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget on the target device. For filter importance ranking, HALP leverages latency lookup table to tr…

    Submitted 18 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted by NeurIPS 2022. arXiv admin note: substantial text overlap with arXiv:2110.10811

  47. arXiv:2210.01234  [pdf, other]

    cs.LG cs.AI cs.CV

    Optimizing Data Collection for Machine Learning

    Authors: Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law

    Abstract: Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to NeurIPS 2022

  48. arXiv:2207.01778  [pdf, other]

    cs.CV

    Object-Level Targeted Selection via Deep Template Matching

    Authors: Suraj Kothawade, Donna Roy, Michele Fenzi, Elmar Haussmann, Jose M. Alvarez, Christoph Angerer

    Abstract: Retrieving images with objects that are semantically similar to objects of interest (OOI) in a query image has many practical use cases. A few examples include fixing failures like false negatives/positives of a learned model or mitigating class imbalance in a dataset. The targeted selection task requires finding the relevant data from a large-scale pool of unlabeled data. Manual mining at this sc…

    Submitted 4 July, 2022; originally announced July 2022.

    Comments: In Proceedings of the Intelligent Vehicles Symposium, IV 2022

  49. arXiv:2207.01725  [pdf, other]

    cs.CV cs.LG

    How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

    Authors: Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law

    Abstract: Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with…

    Submitted 13 July, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Accepted to CVPR 2022

  50. arXiv:2205.14971  [pdf, other]

    cs.CV cs.LG

    Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions

    Authors: Shuxuan Guo, Yinlin Hu, Jose M. Alvarez, Mathieu Salzmann

    Abstract: Knowledge distillation facilitates the training of a compact student network by using a deep teacher one. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method driven by the 6D pose estimation task. To this end, we observe that most modern 6D pose estimation…

    Submitted 28 November, 2022; v1 submitted 30 May, 2022; originally announced May 2022.