Showing 1–50 of 962 results for author: Ding, Y

Searching in archive cs.
  1. arXiv:2505.10238  [pdf, ps, other]

    cs.CV

    MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

    Authors: Yanbo Ding

    Abstract: Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first fra…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.07209  [pdf, other]

    cs.CV

    Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models

    Authors: Yan Xie, Zequn Zeng, Hao Zhang, Yucheng Ding, Yi Wang, Zhengjue Wang, Bo Chen, Hongwei Liu

    Abstract: Concept Bottleneck Models (CBMs) try to make the decision-making process transparent by exploring an intermediate concept space between the input image and the output prediction. Existing CBMs just learn coarse-grained relations between the whole image and the concepts, less considering local image information, leading to two main drawbacks: i) they often produce spurious visual-concept relations,…

    Submitted 11 May, 2025; originally announced May 2025.

    Comments: CVPR 2025

  3. arXiv:2505.04172  [pdf, other]

    eess.IV cs.HC physics.med-ph

    A Dataset and Toolkit for Multiparameter Cardiovascular Physiology Sensing on Rings

    Authors: Jiankai Tang, Kegang Wang, Yingke Ding, Jiatong Ji, Zeyu Wang, Xiyuxing Zhang, Ping Chen, Yuanchun Shi, Yuntao Wang

    Abstract: Smart rings offer a convenient way to continuously and unobtrusively monitor cardiovascular physiological signals. However, a gap remains between the ring hardware and reliable methods for estimating cardiovascular parameters, partly due to the lack of publicly available datasets and standardized analysis tools. In this work, we present $τ$-Ring, the first open-source ring-based dataset designed f…

    Submitted 8 May, 2025; v1 submitted 7 May, 2025; originally announced May 2025.

  4. arXiv:2505.03827  [pdf, other]

    cs.LG cs.AI

    MISE: Meta-knowledge Inheritance for Social Media-Based Stressor Estimation

    Authors: Xin Wang, Ling Feng, Huijun Zhang, Lei Cao, Kaisheng Zeng, Qi Li, Yang Ding, Yi Dai, David Clifton

    Abstract: Stress haunts people in modern society, which may cause severe health issues if left unattended. With social media becoming an integral part of daily life, leveraging social media to detect stress has gained increasing attention. While the majority of the work focuses on classifying stress states and stress categories, this study introduces a new task aimed at estimating more specific stressors (li…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: WWW2025, Oral Presentation

  5. arXiv:2505.02836  [pdf, other]

    cs.CV

    Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

    Authors: Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li

    Abstract: Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often produci…

    Submitted 5 May, 2025; originally announced May 2025.

  6. arXiv:2505.02714  [pdf, other]

    cs.LG

    Less is More: Efficient Weight Farcasting with 1-Layer Neural Network

    Authors: Xiao Shou, Debarun Bhattacharjya, Yanna Ding, Chen Zhao, Rui Li, Jianxi Gao

    Abstract: Addressing the computational challenges inherent in training large-scale deep neural networks remains a critical endeavor in contemporary machine learning research. While previous efforts have focused on enhancing training efficiency through techniques such as gradient descent with momentum, learning rate scheduling, and weight regularization, the demand for further innovation continues to burgeon…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: Accepted to DASFAA '25

  7. arXiv:2505.02665  [pdf, other]

    cs.AI

    A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law

    Authors: Qianjun Pan, Wenkai Ji, Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He

    Abstract: This survey explores recent advancements in reasoning large language models (LLMs) designed to mimic "slow thinking" - a reasoning process inspired by human cognition, as described in Kahneman's Thinking, Fast and Slow. These models, like OpenAI's o1, focus on scaling computational resources dynamically during complex tasks, such as math reasoning, visual reasoning, medical diagnosis, and multi-ag…

    Submitted 8 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

  8. EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting

    Authors: Leyi Yan, Linda Wang, Sihang Liu, Yi Ding

    Abstract: Carbon intensity (CI) measures the average carbon emissions generated per unit of electricity, making it a crucial metric for quantifying and managing the environmental impact. Accurate CI predictions are vital for minimizing carbon footprints, yet the state-of-the-art method (CarbonCast) falls short due to its inability to address regional variability and lack of adaptability. To address these li…

    Submitted 6 May, 2025; v1 submitted 3 May, 2025; originally announced May 2025.

    Comments: 5 pages, 5 figures, 3 tables, In The 16th ACM International Conference on Future and Sustainable Energy Systems (E-ENERGY'25)

  9. arXiv:2505.00866  [pdf, other]

    cs.CV

    Are Minimal Radial Distortion Solvers Really Necessary for Relative Pose Estimation?

    Authors: Viktor Kocur, Charalambos Tzamos, Yaqing Ding, Zuzana Berger Haladova, Torsten Sattler, Zuzana Kukelova

    Abstract: Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion. Not modeling radial distortion leads to (significantly) worse results. Ho…

    Submitted 1 May, 2025; originally announced May 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2410.05984

  10. arXiv:2504.21480  [pdf, other]

    cs.CR cs.AI cs.SE

    A Comprehensive Study of Exploitable Patterns in Smart Contracts: From Vulnerability to Defense

    Authors: Yuchen Ding, Hongli Peng, Xiaoqi Li

    Abstract: With the rapid advancement of blockchain technology, smart contracts have enabled the implementation of increasingly complex functionalities. However, ensuring the security of smart contracts remains a persistent challenge across the stages of development, compilation, and execution. Vulnerabilities within smart contracts not only undermine the security of individual applications but also pose sig…

    Submitted 30 April, 2025; originally announced April 2025.

  11. arXiv:2504.21205  [pdf, other]

    cs.CR cs.AI

    SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories

    Authors: Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, Yizheng Chen

    Abstract: This paper introduces SecRepoBench, a benchmark to evaluate LLMs on secure code generation in real-world repositories. SecRepoBench has 318 code generation tasks in 27 C/C++ repositories, covering 15 CWEs. We evaluate 19 state-of-the-art LLMs using our benchmark and find that the models struggle with generating correct and secure code. In addition, the performance of LLMs to generate self-containe…

    Submitted 29 April, 2025; originally announced April 2025.

  12. arXiv:2504.19596  [pdf, other]

    eess.SP cs.LG

    Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities

    Authors: Xi Fu, Wei-Bang Jiang, Yi Ding, Cuntai Guan

    Abstract: Multimodal physiological signals, such as EEG, ECG, EOG, and EMG, are crucial for healthcare and brain-computer interfaces. While existing methods rely on specialized architectures and dataset-specific fusion strategies, they struggle to learn universal representations that generalize across datasets and handle missing modalities at inference time. To address these issues, we propose PhysioOmni, a…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 19 pages, 5 figures

  13. arXiv:2504.18904  [pdf, other]

    cs.RO

    RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning

    Authors: Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, et al. (12 additional authors not shown)

    Abstract: Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique challenges in scaling data and establishing evaluation protocols. Collecting real-world data is resource-intensive and inefficient, while benchmarking in real-world scenarios remains highly complex. Synthetic data and simulation off…

    Submitted 26 April, 2025; originally announced April 2025.

  14. arXiv:2504.17636  [pdf, other]

    cs.CV

    A Guide to Structureless Visual Localization

    Authors: Vojtech Panek, Qunjie Zhou, Yaqing Ding, Sérgio Agostinho, Zuzana Kukelova, Torsten Sattler, Laura Leal-Taixé

    Abstract: Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in…

    Submitted 24 April, 2025; originally announced April 2025.

    ACM Class: I.2.10; I.4.8; I.4.9

  15. arXiv:2504.17116  [pdf, other]

    quant-ph cs.AR

    OneAdapt: Adaptive Compilation for Resource-Constrained Photonic One-Way Quantum Computing

    Authors: Hezi Zhang, Jixuan Ruan, Dean Tullsen, Yufei Ding, Ang Li, Travis S. Humble

    Abstract: Measurement-based quantum computing (MBQC), a.k.a. one-way quantum computing (1WQC), is a universal quantum computing model, which is particularly well-suited for photonic platforms. In this model, computation is driven by measurements on an entangled state, which serves as an intermediate representation (IR) between program and hardware. However, compilers on previous IRs lack the adaptability t…

    Submitted 23 April, 2025; originally announced April 2025.

  16. arXiv:2504.16214  [pdf, other]

    cs.LG cs.AI cs.PL

    Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis

    Authors: Xiao Zhang, Yaoyao Ding, Yang Hu, Gennady Pekhimenko

    Abstract: Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations like fine-grained data pipelines and hardware-friendly memory layouts for these operators, wh…

    Submitted 30 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

    Comments: 17 pages, 24 figures

  17. arXiv:2504.16072  [pdf, ps, other]

    cs.CV cs.AI

    Describe Anything: Detailed Localized Image and Video Captioning

    Authors: Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui

    Abstract: Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targete…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: Project page: https://describe-anything.github.io/

  18. arXiv:2504.15585  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG

    A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

    Authors: Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Junyuan Mao, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Chengwei Liu, Yifan Zhang, Qiankun Li, et al. (57 additional authors not shown)

    Abstract: The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concer…

    Submitted 22 April, 2025; originally announced April 2025.

  19. arXiv:2504.15300  [pdf, other]

    cs.LG cs.DC cs.MA

    Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions

    Authors: Chaoyue Niu, Yucheng Ding, Junhui Lu, Zhengxiang Huang, Hang Zeng, Yutong Dai, Xuezhen Tu, Chengfei Lv, Fan Wu, Guihai Chen

    Abstract: The conventional cloud-based large model learning framework is increasingly constrained by latency, cost, personalization, and privacy concerns. In this survey, we explore an emerging paradigm: collaborative learning between on-device small model and cloud-based large model, which promises low-latency, cost-efficient, and personalized intelligent services while preserving user privacy. We provide…

    Submitted 17 April, 2025; originally announced April 2025.

  20. arXiv:2504.14894  [pdf, other]

    cs.RO eess.SY

    Never too Cocky to Cooperate: An FIM and RL-based USV-AUV Collaborative System for Underwater Tasks in Extreme Sea Conditions

    Authors: Jingzehua Xu, Guanwen Xie, Jiwei Tang, Yimian Ding, Weiyi Liu, Shuai Zhang, Yi Li

    Abstract: This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and…

    Submitted 21 April, 2025; originally announced April 2025.

  21. arXiv:2504.12984  [pdf, other]

    cs.LG cs.AI cs.PL

    Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving

    Authors: Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko

    Abstract: Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that…

    Submitted 25 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: 18 pages, 15 figures

  22. arXiv:2504.11434  [pdf, other]

    cs.CV

    Enhancing Out-of-Distribution Detection with Extended Logit Normalization

    Authors: Yifan Ding, Xixi Liu, Jonas Unger, Gabriel Eilertsen

    Abstract: Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. Recent advances have explored improved classification losses and representation learning strategies to enhance OOD detection. However, these methods are often tailored to specific post-hoc detection techniques, limiting their generalizability. In this work, we identify a critical issue in Logit Nor…

    Submitted 15 April, 2025; originally announced April 2025.

  23. arXiv:2504.10411  [pdf]

    cs.AR

    FPGA-Optimized Hardware Accelerator for Fast Fourier Transform and Singular Value Decomposition in AI

    Authors: Hong Ding, Chia Chao Kang, SuYang Xi, Zehang Liu, Xuan Zhang, Yi Ding

    Abstract: This research introduces an FPGA-based hardware accelerator to optimize the Singular Value Decomposition (SVD) and Fast Fourier transform (FFT) operations in AI models. The proposed design aims to improve processing speed and reduce computational latency. Through experiments, we validate the performance benefits of the hardware accelerator and show how well it handles FFT and SVD operations. With…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: 5 pages, 2 figures

  24. arXiv:2504.08150  [pdf, other]

    cs.LG

    Beyond Feature Importance: Feature Interactions in Predicting Post-Stroke Rigidity with Graph Explainable AI

    Authors: Jiawei Xu, Yonggeon Lee, Anthony Elkommos Youssef, Eunjin Yun, Tinglin Huang, Tianjian Guo, Hamidreza Saber, Rex Ying, Ying Ding

    Abstract: This study addresses the challenge of predicting post-stroke rigidity by emphasizing feature interactions through graph-based explainable AI. Post-stroke rigidity, characterized by increased muscle tone and stiffness, significantly affects survivors' mobility and quality of life. Despite its prevalence, early prediction remains limited, delaying intervention. We analyze 519K stroke hospitalization…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Jiawei Xu and Yonggeon Lee contributed equally to this work

  25. arXiv:2504.07793  [pdf, other]

    cs.LG cs.CV

    Revisiting Likelihood-Based Out-of-Distribution Detection by Modeling Representations

    Authors: Yifan Ding, Arturas Aleksandrauskas, Amirhossein Ahmadian, Jonas Unger, Fredrik Lindsten, Gabriel Eilertsen

    Abstract: Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning systems, particularly in safety-critical applications. Likelihood-based deep generative models have historically faced criticism for their unsatisfactory performance in OOD detection, often assigning higher likelihood to OOD data than in-distribution samples when applied to image data. In this work, we de…

    Submitted 10 April, 2025; originally announced April 2025.

  26. arXiv:2504.04438  [pdf, other]

    cs.NI

    DRAMA: A Dynamic Packet Routing Algorithm using Multi-Agent Reinforcement Learning with Emergent Communication

    Authors: Wang Zhang, Chenguang Liu, Yue Pi, Yong Zhang, Hairong Huang, Baoquan Rao, Yulong Ding, Shuanghua Yang, Jie Jiang

    Abstract: The continuous expansion of network data presents a pressing challenge for conventional routing algorithms. As the demand escalates, these algorithms are struggling to cope. In this context, reinforcement learning (RL) and multi-agent reinforcement learning (MARL) algorithms emerge as promising solutions. However, the urgency and importance of the problem are clear, as existing RL/MARL-based routi…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: This article has been accepted by IJCNN 2025

  27. arXiv:2504.03762  [pdf, other]

    eess.SP cs.LG

    Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer

    Authors: Muyun Jiang, Yi Ding, Wei Zhang, Kok Ann Colin Teo, LaiGuan Fong, Shuailei Zhang, Zhiwei Guo, Chenyu Liu, Raghavan Bhuvanakantham, Wei Khang Jeremy Sim, Chuan Huat Vince Foo, Rong Hui Jonathan Chua, Parasuraman Padmanabhan, Victoria Leong, Jia Lu, Balazs Gulyas, Cuntai Guan

    Abstract: Covert speech involves imagining speaking without audible sound or any movements. Decoding covert speech from electroencephalogram (EEG) is challenging due to a limited understanding of neural pronunciation mapping and the low signal-to-noise ratio of the signal. In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects, each…

    Submitted 2 April, 2025; originally announced April 2025.

  28. arXiv:2504.02880  [pdf]

    eess.IV cs.AI cs.CV

    Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms

    Authors: Junchi Zhou, Haozhou Wang, Yoichiro Kato, Tejasri Nampally, P. Rajalakshmi, M. Balram, Keisuke Katsura, Hao Lu, Yue Mu, Wanneng Yang, Yangmingrui Gao, Feng Xiao, Hongtao Chen, Yuhao Chen, Wenjuan Li, Jingwen Wang, Fenghua Yu, Jian Zhou, Wensheng Wang, Xiaochun Hu, Yuanzhu Yang, Yanfeng Ding, Wei Guo, Shouyang Liu

    Abstract: Developing computer vision-based rice phenotyping techniques is crucial for precision field management and accelerating breeding, thereby continuously advancing rice production. Among phenotyping tasks, distinguishing image components is a key prerequisite for characterizing plant growth and development at the organ scale, enabling deeper insights into eco-physiological processes. However, due to…

    Submitted 2 April, 2025; originally announced April 2025.

  29. arXiv:2504.02666  [pdf, other]

    cs.LG cs.CV

    BECAME: BayEsian Continual Learning with Adaptive Model MErging

    Authors: Mei Li, Yuxiang Lu, Qinyan Dai, Suizhi Huang, Yue Ding, Hongtao Lu

    Abstract: Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely…

    Submitted 3 April, 2025; originally announced April 2025.

  30. arXiv:2504.00420  [pdf, other]

    cs.RO cs.CV

    Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation

    Authors: Yuanqi Yao, Siao Liu, Haoming Song, Delin Qu, Qizhi Chen, Yan Ding, Bin Zhao, Zhigang Wang, Xuelong Li, Dong Wang

    Abstract: Building a lifelong robot that can effectively leverage prior knowledge for continuous skill acquisition remains significantly challenging. Despite the success of experience replay and parameter-efficient methods in alleviating catastrophic forgetting problem, naively applying these methods causes a failure to leverage the shared primitives between skills. To tackle these issues, we propose Primit…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Accepted to CVPR 2025

  31. arXiv:2504.00002  [pdf, other]

    cs.PF cs.AI cs.HC cs.NI

    Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices

    Authors: Xiao Yan, Yi Ding

    Abstract: Recent advancements in large language models (LLMs) have prompted interest in deploying these models on mobile devices to enable new applications without relying on cloud connectivity. However, the efficiency constraints of deploying LLMs on resource-limited devices present significant challenges. In this paper, we conduct a comprehensive measurement study to evaluate the efficiency tradeoffs betw…

    Submitted 10 March, 2025; originally announced April 2025.

  32. arXiv:2503.23350  [pdf, other]

    cs.AI

    A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models

    Authors: Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S. Yu, Qing Li

    Abstract: With the advancement of web techniques, they have significantly revolutionized various aspects of people's lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intel…

    Submitted 10 May, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

    Comments: Accepted by KDD 2025

  33. Alleviating LLM-based Generative Retrieval Hallucination in Alipay Search

    Authors: Yedan Shen, Kaixin Wu, Yuechen Ding, Jingyuan Wen, Hong Liu, Mingjie Zhong, Zhouhan Lin, Jia Xu, Linjian Mo

    Abstract: Generative retrieval (GR) has revolutionized document retrieval with the advent of large language models (LLMs), and LLM-based GR is gradually being adopted by the industry. Despite its remarkable advantages and potential, LLM-based GR suffers from hallucination and generates documents that are irrelevant to the query in some instances, severely challenging its credibility in practical application…

    Submitted 13 May, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: Accepted by SIGIR 2025

    Journal ref: SIGIR 2025

  34. arXiv:2503.20666  [pdf, other]

    cs.HC cs.CL

    TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

    Authors: Huimin Xu, Seungjun Yi, Terence Lim, Jiawei Xu, Andrew Well, Carlos Mery, Aidong Zhang, Yuji Zhang, Heng Ji, Keshav Pingali, Yan Leng, Ying Ding

    Abstract: Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-A…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: Submitted to the American Medical Informatics Association (AMIA) 2025 Annual Symposium, 10 pages

  35. arXiv:2503.18432  [pdf, other]

    cs.CL cs.AI cs.LG

    Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning

    Authors: Junsong Li, Jie Zhou, Yutao Yang, Bihao Zhan, Qianjun Pan, Yuyang Ding, Qin Chen, Jiang Bo, Xin Lin, Liang He

    Abstract: Automatic math correction aims to check students' solutions to mathematical problems via artificial intelligence technologies. Most existing studies focus on judging the final answer at the problem level, while they ignore detailed feedback on each step in a math problem-solving process, which requires abilities of semantic understanding and reasoning. In this paper, we propose a reinforcement lea…

    Submitted 24 March, 2025; originally announced March 2025.

  36. arXiv:2503.17924  [pdf, other]

    cs.DC cs.AI cs.LG

    WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training

    Authors: Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, Yuchen Hao, Yufei Ding

    Abstract: In this work, we present WLB-LLM, a workLoad-balanced 4D parallelism for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance at the pipeline parallelism and context parallelism levels. Then, to address the imbalance issue, at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variab…

    Submitted 22 March, 2025; originally announced March 2025.

    Comments: 12 pages, 16 figures

    ACM Class: I.2.11

  37. arXiv:2503.17656  [pdf, other]

    q-bio.QM cs.AI cs.LG

    NaFM: Pre-training a Foundation Model for Small-Molecule Natural Products

    Authors: Yuheng Ding, Yusong Wang, Bo Qiang, Jie Yu, Qi Li, Yiran Zhou, Zhenmin Liu

    Abstract: Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Nowadays, existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, such one-model-for-a-task paradigm often lacks generalizability and leaves sig…

    Submitted 8 May, 2025; v1 submitted 22 March, 2025; originally announced March 2025.

  38. arXiv:2503.16742  [pdf, other]

    cs.CV

    Digitally Prototype Your Eye Tracker: Simulating Hardware Performance using 3D Synthetic Data

    Authors: Esther Y. H. Lin, Yimin Ding, Jogendra Kundu, Yatong An, Mohamed T. El-Haddad, Alexander Fix

    Abstract: Eye tracking (ET) is a key enabler for Augmented and Virtual Reality (AR/VR). Prototyping new ET hardware requires assessing the impact of hardware choices on eye tracking performance. This task is compounded by the high cost of obtaining data from sufficiently many variations of real hardware, especially for machine learning, which requires large training datasets. We propose a method for end-to-…

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: 14 pages, 12 figures

  39. arXiv:2503.16710  [pdf, other]

    cs.CV

    4D Gaussian Splatting SLAM

    Authors: Yanyan Li, Youxu Fang, Zunjie Zhu, Kunyi Li, Yong Ding, Federico Tombari

    Abstract: Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establish a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in u…

    Submitted 20 March, 2025; originally announced March 2025.

  40. arXiv:2503.15558  [pdf, other]

    cs.AI cs.CV cs.LG cs.RO

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    Authors: NVIDIA: Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, et al. (22 additional authors not shown)

    Abstract: Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, wit…

    Submitted 2 April, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

  41. arXiv:2503.15211  [pdf, other]

    cs.CV

    GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector

    Authors: Zechuan Li, Hongshan Yu, Yihao Ding, Jinhao Qiao, Basim Azam, Naveed Akhtar

    Abstract: We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimi…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

  42. arXiv:2503.15208  [pdf, other]

    cs.CV

    DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

    Authors: Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, Hao Zhao

    Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangle…

    Submitted 19 March, 2025; originally announced March 2025.

  43. arXiv:2503.13858  [pdf, other]

    cs.CV cs.LG

    MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations

    Authors: Hongyu Ke, Jack Morris, Kentaro Oguchi, Xiaofei Cao, Yongkang Liu, Haoxin Wang, Yi Ding

    Abstract: 3D visual perception tasks, such as 3D detection from multi-camera images, are essential components of autonomous driving and assistance systems. However, designing computationally efficient methods remains a significant challenge. In this paper, we propose a Mamba-based framework called MamBEV, which learns unified Bird's Eye View (BEV) representations using linear spatio-temporal SSM-based atten…

    Submitted 25 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Journal ref: ICLR 2025

  44. arXiv:2503.13492  [pdf, other]

    eess.SP cs.AI cs.NE

    Event-Driven Implementation of a Physical Reservoir Computing Framework for superficial EMG-based Gesture Recognition

    Authors: Yuqi Ding, Elisa Donati, Haobo Li, Hadi Heidari

    Abstract: Wearable health devices have a strong demand in real-time biomedical signal processing. However traditional methods often require data transmission to centralized processing unit with substantial computational resources after collecting it from edge devices. Neuromorphic computing is an emerging field that seeks to design specialized hardware for computing systems inspired by the structure, functi…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 11 pages, 9 figures, journal

  45. arXiv:2503.12077  [pdf, other]

    cs.CV cs.AI

    V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

    Authors: Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, Yali Wang

    Abstract: Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description of user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, by a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a system…

    Submitted 15 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  46. arXiv:2503.11199  [pdf, other]

    cs.CV

    NF-SLAM: Effective, Normalizing Flow-supported Neural Field representations for object-level visual SLAM in automotive applications

    Authors: Li Cui, Yang Ding, Richard Hartley, Zirui Xie, Laurent Kneip, Zhenghua Yu

    Abstract: We propose a novel, vision-only object-level SLAM framework for automotive applications representing 3D shapes by implicit signed distance functions. Our key innovation consists of augmenting the standard neural representation by a normalizing flow network. As a result, achieving strong representation power on the specific class of road vehicles is made possible by compact networks with only 16-di…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: 9 pages, 5 figures, IROS 2024

  47. arXiv:2503.11081  [pdf, other]

    cs.RO cs.AI cs.CV

    MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation

    Authors: Pingrui Zhang, Xianqiang Gao, Yuhan Wu, Kehui Liu, Dong Wang, Zhigang Wang, Bin Zhao, Yan Ding, Xuelong Li

    Abstract: In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduc…

    Submitted 14 March, 2025; originally announced March 2025.

  48. arXiv:2503.10668  [pdf, other]

    cs.CL cs.AI

    Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words

    Authors: Hongyu Su, Yifeng Gao, Yifan Ding, Xingjun Ma

    Abstract: The rapid advancement of Large Language Models (LLMs) has increased the complexity and cost of fine-tuning, leading to the adoption of API-based fine-tuning as a simpler and more efficient alternative. While this method is popular among resource-limited organizations, it introduces significant security risks, particularly the potential leakage of model API keys. Existing watermarking techniques pa…

    Submitted 10 March, 2025; originally announced March 2025.

  49. arXiv:2503.10604  [pdf, other]

    cs.CV

    MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

    Authors: Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Haoqian Wang

    Abstract: Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods exhibit substantial performance deterioration under significant viewpoint deviations from training trajectories, while generation-based techniques struggle with temporal coherence a…

    Submitted 13 March, 2025; originally announced March 2025.

  50. Decoupled Doubly Contrastive Learning for Cross Domain Facial Action Unit Detection

    Authors: Yong Li, Menglin Liu, Zhen Cui, Yi Ding, Yuan Zong, Wenming Zheng, Shiguang Shan, Cuntai Guan

    Abstract: Despite the impressive performance of current vision-based facial action unit (AU) detection approaches, they are heavily susceptible to the variations across different domains and the cross-domain AU detection methods are under-explored. In response to this challenge, we propose a decoupled doubly contrastive adaptation (D$^2$CA) approach to learn a purified AU representation that is semantically…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted by IEEE Transactions on Image Processing 2025. A novel and elegant feature decoupling method for cross-domain facial action unit detection

    Journal ref: IEEE Transactions on Image Processing 2025