Showing 1–12 of 12 results for author: Zhao, S Z

  1. arXiv:2506.13757

    cs.CV

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Authors: Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

    Abstract: Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and actio…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Website link: https://autovla.github.io/

  2. arXiv:2505.00690

    cs.CV cs.AI cs.RO

    Towards Autonomous Micromobility through Scalable Urban Simulation

    Authors: Wayne Wu, Honglin He, Chaoyuan Zhang, Jack He, Seth Z. Zhao, Ran Gong, Quanyi Li, Bolei Zhou

    Abstract: Micromobility, which utilizes lightweight mobile machines moving in urban public spaces, such as delivery robots and mobility scooters, emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstac…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: CVPR 2025 Highlight. Project page: https://metadriverse.github.io/urban-sim/

  3. arXiv:2504.05700

    cs.CV

    Pose-Aware Weakly-Supervised Action Segmentation

    Authors: Seth Z. Zhao, Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, Behzad Dariush

    Abstract: Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised fra…

    Submitted 8 April, 2025; originally announced April 2025.

  4. arXiv:2503.10034

    cs.CV cs.RO

    V2X-ReaLO: An Open Online Framework and Dataset for Cooperative Perception in Reality

    Authors: Hao Xiang, Zhaoliang Zheng, Xin Xia, Seth Z. Zhao, Letian Gao, Zewei Zhou, Tianhui Cai, Yun Zhang, Jiaqi Ma

    Abstract: Cooperative perception enabled by Vehicle-to-Everything (V2X) communication holds significant promise for enhancing the perception capabilities of autonomous vehicles, allowing them to overcome occlusions and extend their field of view. However, existing research predominantly relies on simulated environments or static datasets, leaving the feasibility and effectiveness of V2X cooperative percepti…

    Submitted 13 March, 2025; originally announced March 2025.

  5. arXiv:2412.01812

    cs.CV

    V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

    Authors: Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z. Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, Xin Xia, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

    Abstract: Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus…

    Submitted 13 March, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

    Comments: Website link: https://mobility-lab.seas.ucla.edu/v2xpnp/

  6. arXiv:2410.04759

    cs.AI

    Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM

    Authors: Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z. Zhao, Zhiwen Wu, Jiaqi Ma

    Abstract: This work presents an interpretable decision-making framework for autonomous vehicles that integrates traffic regulations, norms, and safety guidelines comprehensively and enables seamless adaptation to different regions. While traditional rule-based methods struggle to incorporate the full scope of traffic rules, we develop a Traffic Regulation Retrieval (TRR) Agent based on Retrieval-Augmented G…

    Submitted 13 March, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

  7. arXiv:2408.11241

    cs.CV

    CooPre: Cooperative Pretraining for V2X Cooperative Perception

    Authors: Seth Z. Zhao, Hao Xiang, Chenfeng Xu, Xin Xia, Bolei Zhou, Jiaqi Ma

    Abstract: Existing Vehicle-to-Everything (V2X) cooperative perception methods rely on accurate multi-agent 3D annotations. Nevertheless, it is time-consuming and expensive to collect and annotate real-world data, especially for V2X systems. In this paper, we present a self-supervised learning framework for V2X cooperative perception, which utilizes the vast amount of unlabeled 3D V2X data to enhance the perc…

    Submitted 17 June, 2025; v1 submitted 20 August, 2024; originally announced August 2024.

  8. arXiv:2309.13570

    cs.CV

    Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset

    Authors: Zixun Huang, Keling Yao, Seth Z. Zhao, Chuanyu Pan, Allen Y. Yang

    Abstract: Robust 6DoF pose estimation with mobile devices is the foundation for applications in robotics, augmented reality, and digital twin localization. In this paper, we extensively investigate the robustness of existing RGBD-based 6DoF pose estimation methods against varying levels of depth sensor noise. We highlight that existing 6DoF pose estimation methods suffer significant performance discrepancie…

    Submitted 2 June, 2025; v1 submitted 24 September, 2023; originally announced September 2023.

  9. arXiv:2309.10121

    cs.CV

    Pre-training on Synthetic Driving Data for Trajectory Prediction

    Authors: Yiheng Li, Seth Z. Zhao, Chenfeng Xu, Chen Tang, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan

    Abstract: Accumulating substantial volumes of real-world driving data proves pivotal in the realm of trajectory forecasting for autonomous driving. Given the heavy reliance of current trajectory forecasting models on data-driven methodologies, we aim to tackle the challenge of learning general trajectory forecasting representations under limited data availability. We propose a pipeline-level solution to mit…

    Submitted 28 August, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

  10. arXiv:2309.09088

    cs.SD eess.AS

    Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition

    Authors: Haoming Guo, Seth Z. Zhao, Jiachen Lian, Gopala Anumanchipalli, Gerald Friedland

    Abstract: Vocoder models have recently achieved substantial progress in generating authentic audio comparable to human quality while significantly reducing memory requirement and inference time. However, these data-hungry generative models require large-scale audio data for learning good representations. In this paper, we apply contrastive learning methods in training the vocoder to improve the perceptual q…

    Submitted 18 December, 2023; v1 submitted 16 September, 2023; originally announced September 2023.

  11. arXiv:2302.05991

    cs.CV

    Digital Twin Tracking Dataset (DTTD): A New RGB+Depth 3D Dataset for Longer-Range Object Tracking Applications

    Authors: Weiyu Feng, Seth Z. Zhao, Chuanyu Pan, Adam Chang, Yichen Chen, Zekun Wang, Allen Y. Yang

    Abstract: Digital twin is a problem of augmenting real objects with their digital counterparts. It can underpin a wide range of applications in augmented reality (AR), autonomy, and UI/UX. A critical component in a good digital-twin system is real-time, accurate 3D object tracking. Most existing works solve 3D object tracking through the lens of robotic grasping, employ older generations of depth sensors, a…

    Submitted 11 April, 2023; v1 submitted 12 February, 2023; originally announced February 2023.

  12. arXiv:2202.07706

    cs.CV

    Misinformation Detection in Social Media Video Posts

    Authors: Kehan Wang, David Chan, Seth Z. Zhao, John Canny, Avideh Zakhor

    Abstract: With the growing adoption of short-form video by social media platforms, reducing the spread of misinformation through video posts has become a critical challenge for social media providers. In this paper, we develop methods to detect misinformation in social media posts, exploiting modalities such as video and text. Due to the lack of large-scale public data for misinformation detection in multi-…

    Submitted 30 July, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: We discovered an error in our dataset construction where retweets were not properly filtered. This resulted in test data leakage in training data, and the results reported are affected