-
CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Authors:
Xiaoqi Li,
Lingyun Xu,
Mingxu Zhang,
Jiaming Liu,
Yan Shen,
Iaroslav Ponomarenko,
Jiahui Xu,
Liang Heng,
Siyuan Huang,
Shanghang Zhang,
Hao Dong
Abstract:
In robotics, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To tackle these challenges, we introduce CrayonRobo, which leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space. Furthermore, by sequentially executing all key-frame steps, the model can complete long-horizon tasks. This approach not only helps the model explicitly understand the task objectives but also enhances its robustness on unseen tasks by providing easily interpretable prompts. We evaluate our method in both simulated and real-world environments, demonstrating its robust manipulation capabilities.
Submitted 4 May, 2025;
originally announced May 2025.
-
3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment
Authors:
Xiaoqi Li,
Jiaming Liu,
Nuowei Han,
Liang Heng,
Yandong Guo,
Hao Dong,
Yang Liu
Abstract:
The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentence-level category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.
Submitted 3 May, 2025;
originally announced May 2025.
-
MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation
Authors:
Rongyu Zhang,
Menghang Dong,
Yuan Zhang,
Liang Heng,
Xiaowei Chi,
Gaole Dai,
Li Du,
Yuan Du,
Shanghang Zhang
Abstract:
Multimodal Large Language Models (MLLMs) excel in understanding complex language and visual data, enabling generalist robotic systems to interpret instructions and perform embodied tasks. Nevertheless, their real-world deployment is hindered by substantial computational and storage demands. Recent insights into the homogeneous patterns in LLM layers have inspired sparsification techniques to address these challenges, such as early exit and token pruning. However, these methods often neglect the critical role of the final layers that encode the semantic information most relevant to downstream robotic tasks. Aligning with the recent breakthrough of the Shallow Brain Hypothesis (SBH) in neuroscience and the mixture of experts in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. We introduce a Spatial-Temporal Aware Router (STAR) for MoLe to selectively activate only parts of the layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive ability of LLMs lost in MoLe, we devise a Cognition Self-Knowledge Distillation (CogKD) framework. CogKD enhances the understanding of task demands and improves the generation of task-relevant action sequences by leveraging cognitive features. Extensive experiments conducted in both RLBench simulation and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance. Specifically, MoLe-VLA achieves an 8% improvement in the mean success rate across ten tasks while reducing computational costs by up to 5.6x compared to standard LLMs.
Submitted 14 April, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
A Self-Correcting Vision-Language-Action Model for Fast and Slow System Manipulation
Authors:
Chenxuan Li,
Jiaming Liu,
Guanqun Wang,
Xiaoqi Li,
Sixiang Chen,
Liang Heng,
Chuyan Xiong,
Jiaxin Ge,
Renrui Zhang,
Kaichen Zhou,
Shanghang Zhang
Abstract:
Recently, some studies have integrated Multimodal Large Language Models into robotic manipulation, constructing vision-language-action models (VLAs) to interpret multimodal information and predict SE(3) poses. While VLAs have shown promising progress, they may suffer from failures when faced with novel and complex tasks. To emulate human-like reasoning for more robust manipulation, we propose the self-corrected (SC-)VLA framework, which integrates a fast system for directly predicting actions and a slow system for reflecting on failed actions within a single VLA policy. For the fast system, we incorporate parameter-efficient fine-tuning to equip the model with pose prediction capabilities while preserving the inherent reasoning abilities of MLLMs. For the slow system, we propose a Chain-of-Thought training strategy for failure correction, designed to mimic human reflection after a manipulation failure. Specifically, our model learns to identify the causes of action failures, adaptively seek expert feedback, reflect on the current failure scenario, and iteratively generate corrective actions, step by step. Furthermore, a continuous policy learning method is designed based on successfully corrected samples, enhancing the fast system's adaptability to the current configuration. We compare SC-VLA with the previous SOTA VLA in both simulation and real-world tasks, demonstrating an efficient correction process and improved manipulation accuracy on both seen and unseen tasks.
Submitted 18 March, 2025; v1 submitted 27 May, 2024;
originally announced May 2024.
-
Content-Preserving Diffusion Model for Unsupervised AS-OCT image Despeckling
Authors:
Li Sanqian,
Higashita Risa,
Fu Huazhu,
Li Heng,
Niu Jingxuan,
Liu Jiang
Abstract:
Anterior segment optical coherence tomography (AS-OCT) is a non-invasive imaging technique that is highly valuable for ophthalmic diagnosis. However, speckles in AS-OCT images can often degrade the image quality and affect clinical analysis. As a result, removing speckles in AS-OCT images can greatly benefit automatic ophthalmology analysis. Unfortunately, challenges still exist in deploying effective AS-OCT image denoising algorithms, including collecting sufficient paired training data and the requirement to preserve consistent content in medical images. To address these practical issues, we propose an unsupervised AS-OCT despeckling algorithm via Content Preserving Diffusion Model (CPDM) with statistical knowledge. At the training stage, a Markov chain transforms clean images to white Gaussian noise by repeatedly adding random noise and removes the predicted noise in a reverse procedure. At the inference stage, we first analyze the statistical distribution of speckles and convert it into a Gaussian distribution, aiming to match the fast truncated reverse diffusion process. We then explore the posterior distribution of observed images as a fidelity term to ensure content consistency in the iterative procedure. Our experimental results show that CPDM significantly improves image quality compared to competitive methods. Furthermore, we validate the benefits of CPDM for subsequent clinical analysis, including ciliary muscle (CM) segmentation and scleral spur (SS) localization.
Submitted 30 June, 2023;
originally announced June 2023.
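As background for the training stage described in this abstract, the forward noising chain and the reverse denoising step can be written in the standard DDPM form. This is an assumption for illustration: the abstract does not state CPDM's exact parameterization, and the symbols $\beta_t$, $\alpha_t$, $\bar{\alpha}_t$, $\epsilon_\theta$, and $\sigma_t$ are the conventional ones rather than names taken from the paper.

```latex
% Forward process: repeatedly add Gaussian noise with variance schedule \beta_t
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad \alpha_t = 1-\beta_t,\quad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s

% Closed form for any step t, starting from a clean image x_0
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Reverse step: remove the noise predicted by the network \epsilon_\theta
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I)
```

Under this formulation, a truncated reverse process would start from an intermediate step $t < T$ rather than from pure noise, which is presumably why the speckle statistics are first converted to a Gaussian distribution.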
-
Continuous-time Radar-inertial Odometry for Automotive Radars
Authors:
Yin Zhi Ng,
Benjamin Choi,
Robby Tan,
Lionel Heng
Abstract:
We present an approach for radar-inertial odometry which uses a continuous-time framework to fuse measurements from multiple automotive radars and an inertial measurement unit (IMU). Adverse weather conditions do not have a significant impact on the operating performance of radar sensors, unlike that of camera and LiDAR sensors. Radar's robustness in such conditions and the increasing prevalence of radars on passenger vehicles motivate us to look at the use of radar for ego-motion estimation. A continuous-time trajectory representation is applied not only as a framework to enable heterogeneous and asynchronous multi-sensor fusion, but also to facilitate efficient optimization by being able to compute poses and their derivatives in closed form and at any given time along the trajectory. We compare our continuous-time estimates to those from a discrete-time radar-inertial odometry approach and show that our continuous-time method outperforms the discrete-time method. To the best of our knowledge, this is the first time a continuous-time framework has been applied to radar-inertial odometry.
Submitted 7 January, 2022;
originally announced January 2022.
-
Graph-Guided Deformation for Point Cloud Completion
Authors:
Jieqi Shi,
Lingyun Xu,
Liang Heng,
Shaojie Shen
Abstract:
For a long time, the point cloud completion task has been regarded as a pure generation task. After obtaining the global shape code through the encoder, a complete point cloud is generated using the shape prior learnt by the networks. However, such models are undesirably biased towards prior average objects and inherently limited in fitting geometric details. In this paper, we propose a Graph-Guided Deformation Network, which regards the input data and intermediate generation as controlling and supporting points, respectively, and models the optimization guided by a graph convolutional network (GCN) for the point cloud completion task. Our key insight is to simulate the least-squares Laplacian deformation process via mesh deformation methods, which brings adaptivity for modeling variation in geometric details. By this means, we also reduce the gap between the completion task and the mesh deformation algorithms. As far as we know, we are the first to refine the point cloud completion task by mimicking traditional graphics algorithms with GCN-guided deformation. We have conducted extensive experiments on the simulated indoor dataset ShapeNet, the outdoor dataset KITTI, and our self-collected autonomous driving dataset Pandar40. The results show that our method outperforms the existing state-of-the-art algorithms in the 3D point cloud completion task.
Submitted 11 November, 2021;
originally announced December 2021.
-
Clustering Aware Classification for Risk Prediction and Subtyping in Clinical Data
Authors:
Shivin Srivastava,
Siddharth Bhatia,
Lingxiao Huang,
Lim Jun Heng,
Kenji Kawaguchi,
Vaibhav Rajan
Abstract:
In data containing heterogeneous subpopulations, classification performance benefits from incorporating the knowledge of cluster structure in the classifier. Previous methods for such combined clustering and classification either 1) are classifier-specific and not generic, or 2) independently perform clustering and classifier training, which may not form clusters that can potentially benefit classifier performance. The question of how to perform clustering to improve the performance of classifiers trained on the clusters has received scant attention in previous literature, despite its importance in several real-world applications. In this paper, first, we theoretically analyze the generalization performance of classifiers trained on clustered data and find conditions under which clustering can potentially aid classification. This motivates the design of a simple k-means-based classification algorithm called Clustering Aware Classification (CAC) and its neural variant DeepCAC. DeepCAC effectively leverages deep representation learning to learn latent embeddings and finds clusters in a manner that makes the clustered data suitable for training classifiers for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of DeepCAC over previous methods for combined clustering and classification.
Submitted 3 January, 2023; v1 submitted 23 February, 2021;
originally announced February 2021.
-
Nighttime Stereo Depth Estimation using Joint Translation-Stereo Learning: Light Effects and Uninformative Regions
Authors:
Aashish Sharma,
Lionel Heng,
Loong-Fah Cheong,
Robby T. Tan
Abstract:
Nighttime stereo depth estimation is still challenging, as assumptions associated with daytime lighting conditions no longer hold. Nighttime is not only about low light and dense noise, but also about glow/glare, flares, non-uniform distribution of light, etc. One possible solution is to train a network on night stereo images in a fully supervised manner. However, obtaining proper disparity ground truths that are dense, independent of glare/glow, and cover sufficiently far depth ranges is extremely difficult. To address the problem, we introduce a network joining day/night translation and stereo. In training the network, our method does not require ground-truth disparities of the night images, or paired day/night images. We utilize a translation network that can render realistic night stereo images from day stereo images. We then train a stereo network on the rendered night stereo images using the available disparity supervision from the corresponding day stereo images, and simultaneously also train the day/night translation network. We handle the fake depth problem, which occurs due to the unsupervised/unpaired translation, for light effects (e.g., glow/glare) and uninformative regions (e.g., low-light and saturated regions), by adding structure-preservation and weighted-smoothness constraints. Our experiments show that our method outperforms the baseline methods on night images.
Submitted 8 October, 2020; v1 submitted 30 September, 2019;
originally announced September 2019.
-
A database linking piano and orchestral MIDI scores with application to automatic projective orchestration
Authors:
Léopold Crestel,
Philippe Esling,
Lena Heng,
Stephen McAdams
Abstract:
This article introduces the Projective Orchestral Database (POD), a collection of MIDI scores composed of pairs linking piano scores to their corresponding orchestrations. To the best of our knowledge, this is the first database of its kind, which performs piano or orchestral prediction, but more importantly which tries to learn the correlations between piano and orchestral scores. Hence, we also introduce the projective orchestration task, which consists in learning how to perform the automatic orchestration of a piano score. We show how this task can be addressed using learning methods and also provide methodological guidelines in order to properly use this database.
Submitted 19 October, 2018;
originally announced October 2018.
-
Real-Time Dense Mapping for Self-driving Vehicles using Fisheye Cameras
Authors:
Zhaopeng Cui,
Lionel Heng,
Ye Chuan Yeo,
Andreas Geiger,
Marc Pollefeys,
Torsten Sattler
Abstract:
We present a real-time dense geometric mapping algorithm for large-scale environments. Unlike existing methods which use pinhole cameras, our implementation is based on fisheye cameras, which have a larger field of view and benefit other tasks including Visual-Inertial Odometry, localization, and object detection around vehicles. Our algorithm runs on in-vehicle PCs at approximately 15 Hz, enabling vision-only 3D scene perception for self-driving vehicles. For each synchronized set of images captured by multiple cameras, we first compute a depth map for a reference camera using plane-sweeping stereo. To maintain both accuracy and efficiency, while accounting for the fact that fisheye images have a rather low resolution, we recover the depths using multiple image resolutions. We adopt the fast object detection framework YOLOv3 to remove potentially dynamic objects. At the end of the pipeline, we fuse the fisheye depth images into the truncated signed distance function (TSDF) volume to obtain a 3D map. We evaluate our method on large-scale urban datasets, and results show that our method works well even in complex environments.
Submitted 18 April, 2019; v1 submitted 17 September, 2018;
originally announced September 2018.
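The final fusion step mentioned in this abstract can be illustrated with the classic weighted running-average TSDF update for a single voxel. This is a minimal sketch, not the paper's implementation: the `Voxel` class, the `integrate` function, the unit per-observation weight, and the `trunc` and `max_weight` values are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    tsdf: float = 1.0    # normalized truncated signed distance in [-1, 1]
    weight: float = 0.0  # accumulated observation weight


def integrate(voxel, voxel_depth, measured_depth, trunc=0.5, max_weight=64.0):
    """Fuse one depth observation into a voxel (hypothetical helper).

    voxel_depth:    depth of the voxel centre along the camera ray
    measured_depth: depth-map value at the pixel the voxel projects to
    """
    sdf = measured_depth - voxel_depth  # positive in front of the surface
    if sdf < -trunc:
        return voxel  # voxel lies far behind the surface: leave it untouched
    d = min(1.0, sdf / trunc)  # truncate and normalize
    w = 1.0                    # assumed constant per-observation weight
    new_weight = voxel.weight + w
    voxel.tsdf = (voxel.tsdf * voxel.weight + d * w) / new_weight
    voxel.weight = min(new_weight, max_weight)  # cap so the map can still adapt
    return voxel


v = Voxel()
integrate(v, voxel_depth=2.0, measured_depth=2.25)  # first observation
integrate(v, voxel_depth=2.0, measured_depth=2.0)   # second observation
print(v.tsdf, v.weight)
```

Averaging over several observations is what makes the fused map tolerant of noise in individual depth maps; the surface is then the zero crossing of the stored values.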
-
Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System
Authors:
Lionel Heng,
Benjamin Choi,
Zhaopeng Cui,
Marcel Geppert,
Sixing Hu,
Benson Kuan,
Peidong Liu,
Rang Nguyen,
Ye Chuan Yeo,
Andreas Geiger,
Gim Hee Lee,
Marc Pollefeys,
Torsten Sattler
Abstract:
Project AutoVision aims to develop localization and 3D scene perception capabilities for a self-driving vehicle. Such capabilities will enable autonomous navigation in urban and rural environments, in day and night, and with cameras as the only exteroceptive sensors. The sensor suite employs many cameras for both 360-degree coverage and accurate multi-view stereo; the use of low-cost cameras keeps the cost of this sensor suite to a minimum. In addition, the project seeks to extend the operating envelope to include GNSS-less conditions which are typical for environments with tall buildings, foliage, and tunnels. Emphasis is placed on leveraging multi-view geometry and deep learning to enable the vehicle to localize and perceive in 3D space. This paper presents an overview of the project, and describes the sensor suite and current progress in the areas of calibration, localization, and perception.
Submitted 4 March, 2019; v1 submitted 14 September, 2018;
originally announced September 2018.
-
3D Visual Perception for Self-Driving Cars using a Multi-Camera System: Calibration, Mapping, Localization, and Obstacle Detection
Authors:
Christian Häne,
Lionel Heng,
Gim Hee Lee,
Friedrich Fraundorfer,
Paul Furgale,
Torsten Sattler,
Marc Pollefeys
Abstract:
Cameras are a crucial exteroceptive sensor for self-driving cars as they are low-cost and small, provide appearance information about the environment, and work in various weather conditions. They can be used for multiple purposes such as visual navigation and obstacle detection. We can use a surround multi-camera system to cover the full 360-degree field-of-view around the car. In this way, we avoid blind spots which can otherwise lead to accidents. To minimize the number of cameras needed for surround perception, we utilize fisheye cameras. Consequently, standard vision pipelines for 3D mapping, visual localization, obstacle detection, etc. need to be adapted to take full advantage of the availability of multiple cameras rather than treat each camera individually. In addition, processing of fisheye images has to be supported. In this paper, we describe the camera calibration and subsequent processing pipeline for multi-fisheye-camera systems developed as part of the V-Charge project. This project seeks to enable automated valet parking for self-driving cars. Our pipeline is able to precisely calibrate multi-camera systems, build sparse 3D maps for visual navigation, visually localize the car with respect to these maps, generate accurate dense maps, as well as detect obstacles based on real-time depth map extraction.
Submitted 31 August, 2017;
originally announced August 2017.
-
Accuracy of Range-Based Cooperative Localization in Wireless Sensor Networks: A Lower Bound Analysis
Authors:
Liang Heng,
Grace Xingxin Gao
Abstract:
Accurate location information is essential for many wireless sensor network (WSN) applications. A location-aware WSN generally includes two types of nodes: sensors whose locations are to be determined and anchors whose locations are known a priori. For range-based localization, sensors' locations are deduced from anchor-to-sensor and sensor-to-sensor range measurements. Localization accuracy depends on network parameters such as network connectivity and size. This paper provides a generalized theory that quantitatively characterizes the relation between network parameters and localization accuracy. We use the average degree as a connectivity metric and use geometric dilution of precision (DOP), equivalent to the Cramer-Rao bound, to quantify localization accuracy. We prove a novel lower bound on the expectation of average geometric DOP (LB-E-AGDOP) and derive a closed-form formula that relates LB-E-AGDOP to only three parameters: average anchor degree, average sensor degree, and number of sensor nodes. The formula shows that localization accuracy is approximately inversely proportional to the average degree, and a higher ratio of average anchor degree to average sensor degree yields better localization accuracy. Furthermore, the paper demonstrates a strong connection between LB-E-AGDOP and the best achievable accuracy. Finally, we validate the theory via numerical simulations with three different random graph models.
Submitted 14 March, 2014; v1 submitted 30 May, 2013;
originally announced May 2013.
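To make the geometric DOP metric in this abstract concrete, the sketch below computes it for the simplest case: one sensor in 2D ranging to anchors only. It is not the paper's LB-E-AGDOP, which covers cooperative networks with sensor-to-sensor ranges and takes an expectation over random graphs. The function name `gdop_2d` and the example coordinates are assumptions for illustration.

```python
import math


def gdop_2d(sensor, anchors):
    """Geometric DOP sqrt(trace((H^T H)^-1)) for range measurements in 2D.

    Each row of H is the unit vector pointing from an anchor to the sensor.
    """
    sxx = sxy = syy = 0.0
    for ax, ay in anchors:
        dx, dy = sensor[0] - ax, sensor[1] - ay
        r = math.hypot(dx, dy)
        if r == 0.0:
            raise ValueError("sensor coincides with an anchor")
        ux, uy = dx / r, dy / r
        sxx += ux * ux
        sxy += ux * uy
        syy += uy * uy
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 matrix H^T H
    if det <= 1e-12:
        raise ValueError("degenerate geometry: anchors are collinear with the sensor")
    # For a 2x2 matrix, trace(inverse) = trace / determinant.
    return math.sqrt((sxx + syy) / det)


two = gdop_2d((0.0, 0.0), [(1.0, 0.0), (0.0, 1.0)])
four = gdop_2d((0.0, 0.0), [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)])
print(two, four)  # about 1.414 with two anchors, 1.0 with four
```

In this toy geometry, doubling the number of anchors the sensor can range to lowers the DOP from about 1.414 to 1.0, which matches the abstract's qualitative point that higher connectivity improves achievable accuracy.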