-
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
Authors:
Shadi Hamdan,
Chonghao Sima,
Zetong Yang,
Hongyang Li,
Fatma Güney
Abstract:
How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms.
Submitted 9 June, 2025;
originally announced June 2025.
-
Centaur: Robust End-to-End Autonomous Driving with Test-Time Training
Authors:
Chonghao Sima,
Kashyap Chitta,
Zhiding Yu,
Shiyi Lan,
Ping Luo,
Andreas Geiger,
Hongyang Li,
Jose M. Alvarez
Abstract:
How can we rely on an end-to-end autonomous vehicle's complex decision-making system during deployment? One common solution is to have a "fallback layer" that checks the planned trajectory for rule violations and replaces it with a pre-defined safe action if necessary. Another approach involves adjusting the planner's decisions to minimize a pre-defined "cost function" using additional system predictions such as road layouts and detected obstacles. However, these pre-programmed rules or cost functions cannot learn and improve with new training data, often resulting in overly conservative behaviors. In this work, we propose Centaur (Cluster Entropy for Test-time trAining using Uncertainty) which updates a planner's behavior via test-time training, without relying on hand-engineered rules or cost functions. Instead, we measure and minimize the uncertainty in the planner's decisions. For this, we develop a novel uncertainty measure, called Cluster Entropy, which is simple, interpretable, and compatible with state-of-the-art planning algorithms. Using data collected at prior test-time time-steps, we perform an update to the model's parameters using a gradient that minimizes the Cluster Entropy. With only this sole gradient update prior to inference, Centaur exhibits significant improvements, ranking first on the navtest leaderboard with notable gains in safety-critical metrics such as time to collision. To provide detailed insights on a per-scenario basis, we also introduce navsafe, a challenging new benchmark, which highlights previously undiscovered failure modes of driving models.
Submitted 14 March, 2025;
originally announced March 2025.
-
AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Authors:
AgiBot-World-Contributors,
Qingwen Bu,
Jisong Cai,
Li Chen,
Xiuqi Cui,
Yan Ding,
Siyuan Feng,
Shenyuan Gao,
Xindong He,
Xuan Hu,
Xu Huang,
Shu Jiang,
Yuxin Jiang,
Cheng Jing,
Hongyang Li,
Jialu Li,
Chiming Liu,
Yi Liu,
Yuxiang Lu,
Jianlan Luo,
Ping Luo,
Yao Mu,
Yuehan Niu,
Yixuan Pan,
Jiangmiao Pang, et al. (27 additional authors not shown)
Abstract:
We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on this data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, in both in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming the prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.
Submitted 30 April, 2025; v1 submitted 9 March, 2025;
originally announced March 2025.
-
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Authors:
Shaoyuan Xie,
Lingdong Kong,
Yuhao Dong,
Chonghao Sima,
Wenwei Zhang,
Qi Alfred Chen,
Ziwei Liu,
Liang Pan
Abstract:
Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.
Submitted 7 January, 2025;
originally announced January 2025.
-
LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs
Authors:
Vincent Emonet,
Jerven Bolleman,
Severine Duvaud,
Tarcisio Mendes de Farias,
Ana Claudia Sima
Abstract:
We introduce a Retrieval-Augmented Generation (RAG) system for translating user questions into accurate federated SPARQL queries over bioinformatics knowledge graphs (KGs) leveraging Large Language Models (LLMs). To enhance accuracy and reduce hallucinations in query generation, our system utilises metadata from the KGs, including query examples and schema information, and incorporates a validation step to correct generated queries. The system is available online at chat.expasy.org.
Submitted 10 February, 2025; v1 submitted 8 October, 2024;
originally announced October 2024.
-
A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications
Authors:
Jerven Bolleman,
Vincent Emonet,
Adrian Altenhoff,
Amos Bairoch,
Marie-Claude Blatter,
Alan Bridge,
Severine Duvaud,
Elisabeth Gasteiger,
Dmitry Kuznetsov,
Sebastien Moretti,
Pierre-Andre Michel,
Anne Morgat,
Marco Pagni,
Nicole Redaschi,
Monique Zahn-Zabal,
Tarcisio Mendes de Farias,
Ana Claudia Sima
Abstract:
Background. In the last decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning, if a sufficiently large number of examples are provided and published in a common, machine-readable and standardized format across different resources.
Findings. We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs) collected for several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1000 example questions and queries, including 65 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.
Conclusions. We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services.
Submitted 8 October, 2024;
originally announced October 2024.
-
Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving
Authors:
Kairui Ding,
Boyuan Chen,
Yuchen Su,
Huan-ang Gao,
Bu Jin,
Chonghao Sima,
Wuqiang Zhang,
Xiaohui Li,
Paul Barsch,
Hongyang Li,
Hao Zhao
Abstract:
End-to-end architectures in autonomous driving (AD) face a significant challenge in interpretability, impeding human-AI trust. Human-friendly natural language has been explored for tasks such as driving explanation and 3D captioning. However, previous works primarily focused on the paradigm of declarative interpretability, where the natural language interpretations are not grounded in the intermediate outputs of AD systems, making the interpretations only declarative. In contrast, aligned interpretability establishes a connection between language and the intermediate outputs of AD systems. Here we introduce Hint-AD, an integrated AD-language system that generates language aligned with the holistic perception-prediction-planning outputs of the AD model. By incorporating the intermediate outputs and a holistic token mixer sub-network for effective feature adaptation, Hint-AD achieves desirable accuracy, with state-of-the-art results in driving language tasks including driving explanation, 3D dense captioning, and command prediction. To facilitate further study of the driving explanation task on nuScenes, we also introduce a human-labeled dataset, Nu-X. Code, dataset, and models will be publicly available.
Submitted 10 September, 2024;
originally announced September 2024.
-
SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question Answering over a Life Science Knowledge Graph
Authors:
Julio C. Rangel,
Tarcisio Mendes de Farias,
Ana Claudia Sima,
Norio Kobayashi
Abstract:
The recent success of Large Language Models (LLMs) in a wide range of Natural Language Processing applications opens the path towards novel Question Answering Systems over Knowledge Graphs (KGs) leveraging LLMs. However, one of the main obstacles preventing their implementation is the scarcity of training data for the task of translating questions into corresponding SPARQL queries, particularly in the case of domain-specific KGs. To overcome this challenge, in this study, we evaluate several strategies for fine-tuning the OpenLLaMA LLM for question answering over life science knowledge graphs. In particular, we propose an end-to-end data augmentation approach for extending a set of existing queries over a given knowledge graph towards a larger dataset of semantically enriched question-to-SPARQL query pairs, enabling fine-tuning even for datasets where these pairs are scarce. In this context, we also investigate the role of semantic "clues" in the queries, such as meaningful variable names and inline comments. Finally, we evaluate our approach over the real-world Bgee gene expression knowledge graph and we show that semantic clues can improve model performance by up to 33% compared to a baseline with random variable names and no comments included.
Submitted 7 February, 2024;
originally announced February 2024.
-
DriveLM: Driving with Graph Visual Question Answering
Authors:
Chonghao Sima,
Katrin Renz,
Kashyap Chitta,
Li Chen,
Hanxue Zhang,
Chengen Xie,
Jens Beißwenger,
Ping Luo,
Andreas Geiger,
Hongyang Li
Abstract:
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps. Starting from the localization of key objects, humans estimate object interactions before taking actions. The key insight is that with our proposed task, Graph VQA, where we model graph-structured reasoning through perception, prediction and planning question-answer pairs, we obtain a suitable proxy task to mimic the human reasoning process. We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent baseline performs end-to-end autonomous driving competitively in comparison to state-of-the-art driving-specific architectures. Notably, its benefits are pronounced when it is evaluated zero-shot on unseen objects or sensor configurations. We hope this work can be the starting point to shed new light on how to apply VLMs for autonomous driving. To facilitate future research, all code, data, and models are available to the public.
Submitted 16 January, 2025; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
Authors:
Linyan Huang,
Zhiqi Li,
Chonghao Sima,
Wenhai Wang,
Jingdong Wang,
Yu Qiao,
Hongyang Li
Abstract:
Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert). However, the presence of the domain gap between LiDAR and camera features, coupled with the inherent incompatibility in temporal fusion, significantly hinders the effectiveness of distillation-based enhancements for apprentices. Motivated by the success of uni-modal distillation, an apprentice-friendly expert model would predominantly rely on camera features, while still achieving comparable performance to multi-modal models. To this end, we introduce VCD, a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision. The multi-modal expert VCD-E adopts an identical structure as that of the camera-only apprentice in order to alleviate the feature disparity, and leverages LiDAR input as a depth prior to reconstruct the 3D scene, achieving the performance on par with other heterogeneous multi-modal experts. Additionally, a fine-grained trajectory-based distillation module is introduced with the purpose of individually rectifying the motion misalignment for each object in the scene. With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS.
Submitted 24 October, 2023;
originally announced October 2023.
-
Scene as Occupancy
Authors:
Chonghao Sima,
Wenwen Tong,
Tai Wang,
Li Chen,
Silei Wu,
Hanming Deng,
Yi Gu,
Lewei Lu,
Ping Luo,
Dahua Lin,
Hongyang Li
Abstract:
Human drivers can easily describe a complex traffic scene through their visual system. Such an ability of precise perception is essential for the driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with semantic labels per cell, termed 3D Occupancy, would be desirable. Compared to the bounding-box form, a key insight behind occupancy is that it could capture the fine-grained details of critical obstacles in the scene, and thereby facilitate subsequent tasks. Prior or concurrent literature mainly concentrates on a single scene completion task, whereas we argue that this occupancy representation has the potential for broader impact. In this paper, we propose OccNet, a multi-view vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy. At the core of OccNet is a general occupancy embedding to represent the 3D physical world. Such a descriptor could be applied to a wide span of driving tasks, including detection, segmentation and planning. To validate the effectiveness of this new representation and our proposed algorithm, we propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes. Empirical experiments show that there is an evident performance gain across multiple tasks, e.g., motion planning could witness a collision rate reduction of 15%-58%, demonstrating the superiority of our method.
Submitted 26 June, 2023; v1 submitted 5 June, 2023;
originally announced June 2023.
-
OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping
Authors:
Huijie Wang,
Tianyu Li,
Yang Li,
Li Chen,
Chonghao Sima,
Zhenbo Liu,
Bangjun Wang,
Peijin Jia,
Yuting Wang,
Shengyin Jiang,
Feng Wen,
Hang Xu,
Ping Luo,
Junchi Yan,
Wei Zhang,
Hongyang Li
Abstract:
Accurately depicting the complex traffic scene is a vital component for autonomous vehicles to execute correct judgments. However, existing benchmarks tend to oversimplify the scene by solely focusing on lane perception tasks. Observing that human drivers rely on both lanes and traffic signals to operate their vehicles safely, we present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure. The objective of the presented dataset is to advance research in understanding the structure of road scenes by examining the relationship between perceived entities, such as traffic elements and lanes. Leveraging existing datasets, OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes. It comprises three primary sub-tasks, including the 3D lane detection inherited from OpenLane, accompanied by corresponding metrics to evaluate the model's performance. We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.
Submitted 28 October, 2023; v1 submitted 20 April, 2023;
originally announced April 2023.
-
Sparse Dense Fusion for 3D Object Detection
Authors:
Yulu Gao,
Chonghao Sima,
Shaoshuai Shi,
Shangzhe Di,
Si Liu,
Hongyang Li
Abstract:
With the prevalence of multimodal learning, camera-LiDAR fusion has gained popularity in 3D object detection. Although multiple fusion approaches have been proposed, they can be classified as either sparse-only or dense-only based on the feature representation in the fusion module. In this paper, we analyze them in a common taxonomy and thereafter observe two challenges: 1) sparse-only solutions preserve the 3D geometric prior and yet lose rich semantic information from the camera, and 2) dense-only alternatives retain the semantic continuity but miss the accurate geometric information from LiDAR. By analyzing these two formulations, we conclude that the information loss is inevitable due to their design scheme. To compensate for the information loss in either manner, we propose Sparse Dense Fusion (SDF), a complementary framework that incorporates both sparse-fusion and dense-fusion modules via the Transformer architecture. Such a simple yet effective sparse-dense fusion structure enriches semantic texture and exploits spatial structure information simultaneously. Through our SDF strategy, we assemble two popular methods with moderate performance and outperform the baseline by 4.3% in mAP and 2.5% in NDS, ranking first on the nuScenes benchmark. Extensive ablations demonstrate the effectiveness of our method and empirically align with our analysis.
Submitted 9 April, 2023;
originally announced April 2023.
-
Planning-oriented Autonomous Driving
Authors:
Yihan Hu,
Jiazhi Yang,
Li Chen,
Keyu Li,
Chonghao Sima,
Xizhou Zhu,
Siqi Chai,
Senyao Du,
Tianwei Lin,
Wenhai Wang,
Lewei Lu,
Xiaosong Jia,
Qiang Liu,
Jifeng Dai,
Yu Qiao,
Hongyang Li
Abstract:
Modern autonomous driving systems are characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning. We introduce Unified Autonomous Driving (UniAD), a comprehensive, up-to-date framework that incorporates full-stack driving tasks in one network. It is carefully devised to leverage the advantages of each module, and to provide complementary feature abstractions for agent interaction from a global perspective. Tasks communicate through unified query interfaces to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of this philosophy is demonstrated by substantially outperforming previous state-of-the-art methods in all aspects. Code and models are public.
Submitted 23 March, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe
Authors:
Hongyang Li,
Chonghao Sima,
Jifeng Dai,
Wenhai Wang,
Lewei Lu,
Huijie Wang,
Jia Zeng,
Zhiqi Li,
Jiazhi Yang,
Hanming Deng,
Hao Tian,
Enze Xie,
Jiangwei Xie,
Li Chen,
Tianyu Li,
Yang Li,
Yulu Gao,
Xiaosong Jia,
Si Liu,
Jianping Shi,
Dahua Lin,
Yu Qiao
Abstract:
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important. BEV perception has several advantages, as representing surrounding scenes in BEV is intuitive and fusion-friendly, and representing objects in BEV is most desirable for subsequent modules such as planning and/or control. The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground truth annotations in the BEV grid; (c) how to formulate the pipeline to incorporate features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios. In this survey, we review the most recent works on BEV perception and provide an in-depth analysis of different solutions. Moreover, several systematic designs of the BEV approach from industry are described as well. Furthermore, we introduce a full practical guidebook to improve the performance of BEV perception tasks, covering camera, LiDAR and fusion inputs. Finally, we point out future research directions in this area. We hope this report will shed some light for the community and encourage more research effort on BEV perception. We keep an active repository to collect the most recent work and provide a toolbox for bag of tricks at https://github.com/OpenDriveLab/Birds-eye-view-Perception
Submitted 27 September, 2023; v1 submitted 12 September, 2022;
originally announced September 2022.
-
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
Authors:
Zhiqi Li,
Wenhai Wang,
Hongyang Li,
Enze Xie,
Chonghao Sima,
Tong Lu,
Qiao Yu,
Jifeng Dai
Abstract:
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-of-the-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
Submitted 13 July, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
Authors:
Li Chen,
Chonghao Sima,
Yang Li,
Zehan Zheng,
Jiajie Xu,
Xiangwei Geng,
Hongyang Li,
Conghui He,
Jianping Shi,
Yu Qiao,
Junchi Yan
Abstract:
Methods for 3D lane detection have been recently proposed to address the issue of inaccurate lane layouts in many autonomous driving scenarios (uphill/downhill, bump, etc.). Previous work struggled in complex cases due to their simple designs of the spatial transformation between front view and bird's eye view (BEV) and the lack of a realistic dataset. Towards these issues, we present PersFormer: an end-to-end monocular 3D lane detector with a novel Transformer-based spatial feature transformation module. Our model generates BEV features by attending to related front-view local regions with camera parameters as a reference. PersFormer adopts a unified 2D/3D anchor design and an auxiliary task to detect 2D/3D lanes simultaneously, enhancing the feature consistency and sharing the benefits of multi-task learning. Moreover, we release one of the first large-scale real-world 3D lane datasets: OpenLane, with high-quality annotation and scenario diversity. OpenLane contains 200,000 frames, over 880,000 instance-level lanes, 14 lane categories, along with scene tags and the closed-in-path object annotations to encourage the development of lane detection and more industrial-related autonomous driving methods. We show that PersFormer significantly outperforms competitive baselines in the 3D lane detection task on our new OpenLane dataset as well as Apollo 3D Lane Synthetic dataset, and is also on par with state-of-the-art algorithms in the 2D task on OpenLane. The project page is available at https://github.com/OpenPerceptionX/PersFormer_3DLane and OpenLane dataset is provided at https://github.com/OpenPerceptionX/OpenLane.
Submitted 19 July, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
BV equivalence with boundary
Authors:
Francisco Manuel Castela Simão,
Alberto S. Cattaneo,
Michele Schiavina
Abstract:
An extension of the notion of classical equivalence in the Batalin–(Fradkin)–Vilkovisky (BV) and (BFV) framework for local Lagrangian field theory on manifolds possibly with boundary is discussed. Equivalence is phrased in both a strict and a lax sense, distinguished by the compatibility between the BV data for a field theory and its boundary BFV data, necessary for quantisation. In this context, the first- and second-order formulations of non-Abelian Yang–Mills and of classical mechanics on curved backgrounds, all of which admit a strict BV-BFV description, are shown to be pairwise equivalent as strict BV-BFV theories. This in particular implies that their BV-complexes are quasi-isomorphic. Furthermore, Jacobi theory and one-dimensional gravity coupled with scalar matter are compared as classically-equivalent reparametrisation-invariant versions of classical mechanics, but such that only the latter admits a strict BV-BFV formulation. They are shown to be equivalent as lax BV-BFV theories and to have isomorphic BV cohomologies. This shows that strict BV-BFV equivalence is a strictly finer notion of equivalence of theories.
Submitted 7 March, 2023; v1 submitted 11 September, 2021;
originally announced September 2021.
-
Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data
Authors:
Ana Claudia Sima,
Tarcisio Mendes de Farias,
Maria Anisimova,
Christophe Dessimoz,
Marc Robinson-Rechavi,
Erich Zbinden,
Kurt Stockinger
Abstract:
The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available.
In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets.
Submitted 14 June, 2021; v1 submitted 28 April, 2021;
originally announced April 2021.
-
Optical and mechanical properties of nanofibrillated cellulose: towards a robust platform for next-generation green technologies
Authors:
Claudia D. Simao,
Juan S. Reparaz,
Markus. R. Wagner,
Bartlomiej Graczykowski,
Martin Kreuzer,
Yasser B. Ruiz-Blanco,
Yamila Garcia,
Jani-Markus Malho,
Alejandro R. Goni,
Jouni Ahopelto,
Clivia M. Sotomayor Torres
Abstract:
Nanofibrillated cellulose, a polymer that can be obtained from one of the most abundant biopolymers in Nature, is being increasingly explored due to its outstanding properties for packaging and device applications. Still, open challenges in engineering its intrinsic properties remain to be addressed. The results obtained show the precise determination of significant properties, such as elastic properties and interactions, that are compared with similar works and, moreover, demonstrate that nanofibrillated cellulose properties can be reversibly controlled, supporting the extended potential of nanofibrillated cellulose as a robust platform for green-technology applications.
Submitted 1 April, 2015;
originally announced April 2015.
-
Order quantification of hexagonal periodic arrays fabricated by in situ solvent-assisted nanoimprint lithography of block copolymers
Authors:
Claudia Simao,
Worawut Khunsin,
Nikolaos Kehagias,
Mathieu Salaun,
Marc Zelsmann,
Michael A. Morris,
Clivia M. Sotomayor Torres
Abstract:
Directed self-assembly of block copolymer polystyrene-b-polyethylene oxide (PS-b-PEO) thin film was achieved by one-pot methodology of solvent vapour assisted nanoimprint lithography (SAIL).
Submitted 10 March, 2014;
originally announced March 2014.