
Showing 1–17 of 17 results for author: Dang, R

Searching in archive cs.
  1. arXiv:2506.05287  [pdf, ps, other]

    cs.CV

    EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

    Authors: Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang

    Abstract: The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing object's appearance and spatial attr…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: 32 pages

  2. arXiv:2506.00337  [pdf, other]

    cs.LG

    Channel-Imposed Fusion: A Simple yet Effective Method for Medical Time Series Classification

    Authors: Ming Hu, Jianfu Yin, Mingyu Dou, Yuqi Wang, Ruochen Dang, Siyi Liang, Cong Hu, Yao Wang, Bingliang Hu, Quan Wang

    Abstract: The automatic classification of medical time series signals, such as electroencephalogram (EEG) and electrocardiogram (ECG), plays a pivotal role in clinical decision support and early detection of diseases. Although Transformer based models have achieved notable performance by implicitly modeling temporal dependencies through self-attention mechanisms, their inherently complex architectures and o…

    Submitted 30 May, 2025; originally announced June 2025.

  3. arXiv:2505.16448  [pdf, ps, other]

    cs.AI

    Internal Bias in Reasoning Models leads to Overthinking

    Authors: Renfei Dang, Shujian Huang, Jiajun Chen

    Abstract: While current reasoning models possess strong exploratory capabilities, they are often criticized for overthinking due to redundant and unnecessary reflections. In this work, we reveal for the first time that overthinking in reasoning models may stem from their internal bias towards input texts. Upon encountering a reasoning problem, the model immediately forms a preliminary guess about the answer…

    Submitted 27 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  4. arXiv:2501.05031  [pdf, other]

    cs.CV cs.LG cs.RO

    ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

    Authors: Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing

    Abstract: The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition,…

    Submitted 13 March, 2025; v1 submitted 9 January, 2025; originally announced January 2025.

  5. arXiv:2412.14571  [pdf, other]

    cs.CV cs.AI eess.IV

    SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection

    Authors: Ruoyu Xu, Zhiyu Xiang, Chenwei Zhang, Hanzhi Zhong, Xijun Zhao, Ruina Dang, Peng Xu, Tianyu Pu, Eryun Liu

    Abstract: 3D object detection is one of the fundamental perception tasks for autonomous vehicles. Fulfilling such a task with a 4D millimeter-wave radar is very attractive since the sensor is able to acquire 3D point clouds similar to Lidar while maintaining robust measurements under adverse weather. However, due to the high sparsity and noise associated with the radar point clouds, the performance of the e…

    Submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  6. arXiv:2411.12503  [pdf, other]

    cs.RO

    ManiSkill-ViTac 2025: Challenge on Manipulation Skill Learning With Vision and Tactile Sensing

    Authors: Chuanyu Li, Renjun Dang, Xiang Li, Zhiyuan Wu, Jing Xu, Hamidreza Kasaei, Roberto Calandra, Nathan Lepora, Shan Luo, Hao Su, Rui Chen

    Abstract: This article introduces the ManiSkill-ViTac Challenge 2025, which focuses on learning contact-rich manipulation skills using both tactile and visual sensing. Expanding upon the 2024 challenge, ManiSkill-ViTac 2025 includes 3 independent tracks: tactile manipulation, tactile-vision fusion manipulation, and tactile sensor structure design. The challenge aims to push the boundaries of robotic manipul…

    Submitted 19 November, 2024; originally announced November 2024.

    Comments: Challenge webpage: https://ai-workshops.github.io/maniskill-vitac-challenge-2025/

  7. arXiv:2410.13185  [pdf, other]

    cs.AI cs.CL

    Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents

    Authors: Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xingxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, Deli Zhao, Yu Rong, Tian Feng, Lidong Bing

    Abstract: Effective research ideation is a critical step for scientific research. However, the exponential increase in scientific literature makes it challenging for researchers to stay current with recent advances and identify meaningful research directions. Recent developments in large language models (LLMs) suggest a promising avenue for automating the generation of novel research ideas. However, existin…

    Submitted 30 October, 2024; v1 submitted 16 October, 2024; originally announced October 2024.

    Comments: 10 pages, 5 figures, conference

  8. arXiv:2410.10589  [pdf, other]

    cs.CV

    MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

    Authors: Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen

    Abstract: Transferring visual-language knowledge from large-scale foundation models for video recognition has proved to be effective. To bridge the domain gap, additional parametric modules are added to capture the temporal information. However, zero-shot generalization diminishes with the increase in the number of specialized parameters, making existing works a trade-off between zero-shot and close-set per…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024 Camera Ready

  9. arXiv:2404.10241  [pdf, other]

    cs.CV cs.AI

    Vision-and-Language Navigation via Causal Learning

    Authors: Liuyi Wang, Zongtao He, Ronghao Dang, Mengjiao Shen, Chengju Liu, Qijun Chen

    Abstract: In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By de…

    Submitted 15 April, 2024; originally announced April 2024.

  10. arXiv:2403.03405  [pdf, other]

    cs.CV

    Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation

    Authors: Liuyi Wang, Zongtao He, Ronghao Dang, Huiyi Chen, Chengju Liu, Qijun Chen

    Abstract: Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios. However, existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments. In this paper, we tackle this challenge by proposing a unifie…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: 16 pages

  11. CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge

    Authors: Xiao Lin, Minghao Zhu, Ronghao Dang, Guangliang Zhou, Shaolong Shu, Feng Lin, Chengju Liu, Qijun Chen

    Abstract: Most of existing category-level object pose estimation methods devote to learning the object category information from point cloud modality. However, the scale of 3D datasets is limited due to the high cost of 3D data collection and annotation. Consequently, the category features extracted from these limited point cloud samples may not be comprehensive. This motivates us to investigate whether we…

    Submitted 24 February, 2024; originally announced February 2024.

    Comments: 14 pages, 4 figures, 9 tables

  12. arXiv:2310.05136  [pdf, other]

    cs.AI cs.CV

    InstructDET: Diversifying Referring Object Detection with Generalized Instructions

    Authors: Ronghao Dang, Jiangyan Feng, Haodong Zhang, Chongjian Ge, Lin Song, Lijun Gong, Chengju Liu, Qijun Chen, Feng Zhu, Rui Zhao, Yibing Song

    Abstract: We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For one image, we produce tremendous instructions that refer to every single object and diff…

    Submitted 11 March, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: 29 pages (including appendix). Published in ICLR

  13. arXiv:2309.00297  [pdf, other]

    cs.CV

    Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning

    Authors: Minghao Zhu, Xiao Lin, Ronghao Dang, Chengju Liu, Qijun Chen

    Abstract: As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from…

    Submitted 15 October, 2024; v1 submitted 1 September, 2023; originally announced September 2023.

    Comments: ACM MM 2023 Camera Ready

  14. A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation

    Authors: Liuyi Wang, Zongtao He, Jiagui Tang, Ronghao Dang, Naijia Wang, Chengju Liu, Qijun Chen

    Abstract: Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate the target region using verbal and visual cues. While significant advancements have been achieved recently, there are still two broad limitations: (1) The explicit information mining for significant guiding semantics concealed in both vision and language is still under-explored; (2) The previo…

    Submitted 29 May, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted by IJCAI 2023

    Journal ref: International Joint Conferences on Artificial Intelligence Organization 2023

  15. arXiv:2302.01520  [pdf, other]

    cs.RO cs.AI

    Multiple Thinking Achieving Meta-Ability Decoupling for Object Navigation

    Authors: Ronghao Dang, Lu Chen, Liuyi Wang, Zongtao He, Chengju Liu, Qijun Chen

    Abstract: We propose a meta-ability decoupling (MAD) paradigm, which brings together various object navigation methods in an architecture system, allowing them to mutually enhance each other and evolve together. Based on the MAD paradigm, we design a multiple thinking (MT) model that leverages distinct thinking to abstract various meta-abilities. Our method decouples meta-abilities from three aspects: input…

    Submitted 2 February, 2023; originally announced February 2023.

    Comments: 17 pages

  16. arXiv:2208.00553  [pdf, other]

    cs.AI cs.RO

    Search for or Navigate to? Dual Adaptive Thinking for Object Navigation

    Authors: Ronghao Dang, Liuyi Wang, Zongtao He, Shuai Su, Chengju Liu, Qijun Chen

    Abstract: "Search for" or "Navigate to"? When finding an object, the two choices always come up in our subconscious mind. Before seeing the target, we search for the target based on experience. After seeing the target, we remember the target location and navigate to. However, recently methods in object navigation field almost only consider using object association to enhance "search for" phase while neglect…

    Submitted 13 August, 2022; v1 submitted 31 July, 2022; originally announced August 2022.

    Comments: 12 pages, ready for AAAI 2023

  17. arXiv:2204.04421  [pdf, other]

    cs.CV cs.AI

    Unbiased Directed Object Attention Graph for Object Navigation

    Authors: Ronghao Dang, Zhuofan Shi, Liuyi Wang, Zongtao He, Chengju Liu, Qijun Chen

    Abstract: Object navigation tasks require agents to locate specific objects in unknown environments based on visual information. Previously, graph convolutions were used to implicitly explore the relationships between objects. However, due to differences in visibility among objects, it is easy to generate biases in object attention. Thus, in this paper, we propose a directed object attention (DOA) graph to…

    Submitted 7 July, 2022; v1 submitted 9 April, 2022; originally announced April 2022.

    Comments: 13 pages, accepted by ACM Multimedia 2022