Showing 1–50 of 1,020 results for author: Weng

Searching in archive cs.
  1. arXiv:2506.13585  [pdf, ps, other]

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  2. arXiv:2506.12370  [pdf, ps, other]

    cs.DC cs.LG

    Efficient Unified Caching for Accelerating Heterogeneous AI Workloads

    Authors: Tianze Wang, Yifei Liu, Chen Chen, Pengfei Zuo, Jiawei Zhang, Qizhen Weng, Yin Chen, Zhenhua Han, Jieru Zhao, Quan Chen, Minyi Guo

    Abstract: Modern AI clusters, which host diverse workloads like data pre-processing, training and inference, often store the large-volume data in cloud storage and employ caching frameworks to facilitate remote data access. To avoid code-intrusion complexity and minimize cache space wastage, it is desirable to maintain a unified cache shared by all the workloads. However, existing cache management strategie…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: 15 pages, 17 figures

  3. arXiv:2506.11441  [pdf, ps, other]

    cs.AR cs.AI

    DPUV4E: High-Throughput DPU Architecture Design for CNN on Versal ACAP

    Authors: Guoyu Li, Pengbo Zheng, Jian Weng, Enshan Yang

    Abstract: Convolutional Neural Networks (CNNs) remain prevalent in computer vision applications, and FPGAs, known for their flexibility and energy efficiency, have become essential components in heterogeneous acceleration systems. However, traditional FPGAs face challenges in balancing performance and versatility due to limited on-chip resources. AMD's Versal ACAP architecture, tailored for AI applications,…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: 10 pages, 9 figures

  4. arXiv:2506.09928  [pdf, ps, other]

    cs.LG stat.ML

    Course Project Report: Comparing MCMC and Variational Inference for Bayesian Probabilistic Matrix Factorization on the MovieLens Dataset

    Authors: Ruixuan Xu, Xiangxiang Weng

    Abstract: This is a course project report with complete methodology, experiments, references and mathematical derivations. Matrix factorization [1] is a widely used technique in recommendation systems. Probabilistic Matrix Factorization (PMF) [2] extends traditional matrix factorization by incorporating probability distributions over latent factors, allowing for uncertainty quantification. However, computin…

    Submitted 12 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: 11 pages, 2 figures. This document is a course project report. Some derivations are presented in a simplified form. For more detailed discussions and comprehensive proofs, please refer to the references cited in this report. v2 replacement: we have modified the title to better match our content. We have also updated the references to be more complete, including the link to our code

  5. arXiv:2506.08003  [pdf, ps, other]

    cs.CV cs.AI

    Audio-Sync Video Generation with Multi-Stream Temporal Control

    Authors: Shuchen Weng, Haojie Zheng, Zheng Chang, Si Li, Boxin Shi, Xinlong Wang

    Abstract: Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., Podcasts or historical recordings). However, existing approaches fall short in gene…

    Submitted 9 June, 2025; originally announced June 2025.

  6. arXiv:2506.07985  [pdf, ps, other]

    cs.CV cs.LG

    Rethinking Crowd-Sourced Evaluation of Neuron Explanations

    Authors: Tuomas Oikarinen, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng

    Abstract: Interpreting individual neurons or directions in activation space is an important component of mechanistic interpretability. As such, many algorithms have been proposed to automatically produce neuron explanations, but it is often not clear how reliable these explanations are, or which methods produce the best explanations. This can be measured via crowd-sourced evaluations, but they can often be…

    Submitted 9 June, 2025; originally announced June 2025.

  7. arXiv:2506.05774  [pdf, ps, other]

    cs.LG

    Evaluating Neuron Explanations: A Unified Framework with Sanity Checks

    Authors: Tuomas Oikarinen, Ge Yan, Tsui-Wei Weng

    Abstract: Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods un…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: Published at ICML 2025

  8. arXiv:2506.05380  [pdf, ps, other]

    cs.CL

    EvidenceOutcomes: a Dataset of Clinical Trial Publications with Clinically Meaningful Outcomes

    Authors: Yiliang Zhou, Abigail M. Newbury, Gongbo Zhang, Betina Ross Idnay, Hao Liu, Chunhua Weng, Yifan Peng

    Abstract: The fundamental process of evidence extraction and synthesis in evidence-based medicine involves extracting PICO (Population, Intervention, Comparison, and Outcome) elements from biomedical literature. However, Outcomes, being the most complex elements, are often neglected or oversimplified in existing benchmarks. To address this issue, we present EvidenceOutcomes, a novel, large, annotated corpus…

    Submitted 2 June, 2025; originally announced June 2025.

  9. arXiv:2506.02609  [pdf]

    cs.AI

    A Time-Enhanced Data Disentanglement Network for Traffic Flow Forecasting

    Authors: Tianfan Jiang, Mei Wu, Wenchao Weng, Dewen Seng, Yiqian Lin

    Abstract: In recent years, traffic flow prediction has become a highlight in the field of intelligent transportation systems. However, due to the temporal variations and dynamic spatial correlations of traffic data, traffic prediction remains highly challenging. Traditional spatiotemporal networks, which rely on end-to-end training, often struggle to handle the diverse data dependencies of multiple traffic f…

    Submitted 3 June, 2025; originally announced June 2025.

  10. arXiv:2506.02537  [pdf, ps, other]

    cs.CV cs.AI

    VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

    Authors: Hao Yan, Handong Zheng, Hao Wang, Liang Yin, Xingchen Liu, Zhenbiao Cao, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, Chao Weng, Wei Chen, Yuliang Liu, Xiang Bai

    Abstract: Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perce…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 13 pages, 4 figures

  11. arXiv:2506.01372  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    AI Scientists Fail Without Strong Implementation Capability

    Authors: Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang

    Abstract: The emergence of Artificial Intelligence (AI) Scientist represents a paradigm shift in scientific discovery, with large language models (LLMs) taking the lead as the primary executor in the entire scientific workflow from idea generation to experiment implementation. Recent AI Scientist studies demonstrate sufficient capabilities for independent scientific discovery, with the generated research re…

    Submitted 9 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: Position

  12. arXiv:2506.00869  [pdf, ps, other]

    cs.CL

    What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

    Authors: Zhaotian Weng, Haoxuan Li, Kuan-Hao Huang, Jieyu Zhao

    Abstract: Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition an…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 12 pages

    ACM Class: I.7.0

  13. Timing is Important: Risk-aware Fund Allocation based on Time-Series Forecasting

    Authors: Fuyuan Lyu, Linfeng Du, Yunpeng Weng, Qiufang Ying, Zhiyan Xu, Wen Zou, Haolun Wu, Xiuqiang He, Xing Tang

    Abstract: Fund allocation has been an increasingly important problem in the financial domain. In reality, we aim to allocate the funds to buy certain assets within a certain future period. Naive solutions such as prediction-only or Predict-then-Optimize approaches suffer from goal mismatch. Additionally, the introduction of the SOTA time series forecasting model inevitably introduces additional uncertainty…

    Submitted 5 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted by KDD 2025 ADS Track

  14. arXiv:2505.24198  [pdf, ps, other]

    cs.RO

    Hold My Beer: Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control

    Authors: Yitang Li, Yuanhang Zhang, Wenli Xiao, Chaoyi Pan, Haoyang Weng, Guanqi He, Tairan He, Guanya Shi

    Abstract: Can your humanoid walk up and hand you a full cup of beer, without spilling a drop? While humanoids are increasingly featured in flashy demos like dancing, delivering packages, traversing rough terrain, fine-grained control during locomotion remains a significant challenge. In particular, stabilizing a filled end-effector (EE) while walking is far from solved, due to a fundamental mismatch in task…

    Submitted 3 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

  15. arXiv:2505.23486  [pdf, other]

    cs.AI

    Autoformalization in the Era of Large Language Models: A Survey

    Authors: Ke Weng, Lun Du, Sirui Li, Wangyue Lu, Haozhe Sun, Hengyu Liu, Tiancheng Zhang

    Abstract: Autoformalization, the process of transforming informal mathematical propositions into verifiable formal representations, is a foundational task in automated theorem proving, offering a new perspective on the use of mathematics in both theoretical and applied domains. Driven by the rapid progress in artificial intelligence, particularly large language models (LLMs), this field has witnessed substa…

    Submitted 29 May, 2025; originally announced May 2025.

  16. arXiv:2505.22280  [pdf, other]

    cs.CL cs.AI

    Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review

    Authors: Zihan Xu, Haotian Ma, Gongbo Zhang, Yihao Ding, Chunhua Weng, Yifan Peng

    Abstract: Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted by ACL 2025 Findings

  17. arXiv:2505.22016  [pdf, ps, other]

    cs.CV

    PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms

    Authors: Yifei Xia, Shuchen Weng, Siqi Yang, Jingqi Liu, Chengxuan Zhu, Minggui Teng, Zijian Jia, Han Jiang, Boxin Shi

    Abstract: Panoramic video generation enables immersive 360° content creation, valuable in applications that demand scene-consistent world exploration. However, existing panoramic video generation models struggle to leverage pre-trained generative priors from conventional text-to-video models for high-quality and diverse panoramic video generation, due to limited dataset scale and the gap in spatial feature…

    Submitted 28 May, 2025; originally announced May 2025.

  18. arXiv:2505.21538  [pdf, other]

    cs.CV cs.AI

    Caption This, Reason That: VLMs Caught in the Middle

    Authors: Zihan Weng, Lucas Gomez, Taylor Whittington Webb, Pouya Bashivan

    Abstract: Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a…

    Submitted 24 May, 2025; originally announced May 2025.

  19. arXiv:2505.21395  [pdf, ps, other]

    cs.LG

    Square$χ$PO: Differentially Private and Robust $χ^2$-Preference Optimization in Offline Direct Alignment

    Authors: Xingyu Zhou, Yulian Wu, Wenqian Weng, Francesco Orabona

    Abstract: In this paper, we theoretically study the offline alignment of language models with human preference feedback, under both preference label corruption and privacy protections. To this end, we propose Square$χ$PO, a simple one-line change to $χ$PO where the standard log-loss is replaced by a new square loss over probability. Thanks to the inherent properties of this new loss, we have advanced the st…

    Submitted 27 May, 2025; originally announced May 2025.

  20. arXiv:2505.21331  [pdf, ps, other]

    cs.DS cs.GT cs.LG cs.PF math.PR

    Scheduling with Uncertain Holding Costs and its Application to Content Moderation

    Authors: Caner Gocmen, Thodoris Lykouris, Deeksha Sinha, Wentao Weng

    Abstract: In content moderation for social media platforms, the cost of delaying the review of a piece of content is proportional to its view trajectory, which fluctuates and is a priori unknown. Motivated by such uncertain holding costs, we consider a queueing model where job states evolve based on a Markov chain with state-dependent instantaneous holding costs. We demonstrate that in the presence of such uncertain…

    Submitted 27 May, 2025; originally announced May 2025.

  21. arXiv:2505.21279  [pdf, ps, other]

    cs.AI

    XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration

    Authors: Shaoqing Zhang, Kehai Chen, Zhuosheng Zhang, Rumei Li, Rongxiang Weng, Yang Xiang, Liqiang Nie, Min Zhang

    Abstract: Recent advancements in vision-language models (VLMs) have spurred increased interest in Device-Control Agents (DC agents), such as utilizing in-the-wild device control to manage graphical user interfaces. Conventional methods for assessing the capabilities of DC agents, such as computing step-wise action accuracy and overall task success rates, provide a macroscopic view of DC agents' performance;…

    Submitted 27 May, 2025; originally announced May 2025.

  22. arXiv:2505.19641  [pdf, ps, other]

    cs.AI cs.CL

    SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

    Authors: Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He

    Abstract: Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challe…

    Submitted 4 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  23. arXiv:2505.19422  [pdf, other]

    cs.CV

    LlamaSeg: Image Segmentation via Autoregressive Mask Generation

    Authors: Jiru Deng, Tengjin Weng, Tianyu Yang, Wenhan Luo, Zhiheng Li, Wenhao Jiang

    Abstract: We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally i…

    Submitted 25 May, 2025; originally announced May 2025.

  24. arXiv:2505.18699  [pdf, other]

    cs.CV

    Affective Image Editing: Shaping Emotional Factors via Text Descriptions

    Authors: Peixuan Zhang, Shuchen Weng, Chengxuan Zhu, Binghao Tang, Zijian Jia, Si Li, Boxin Shi

    Abstract: In daily life, images as common affective stimuli have widespread applications. Despite significant progress in text-driven image editing, there is limited work focusing on understanding users' emotional requests. In this paper, we introduce AIEdiT for Affective Image Editing using Text descriptions, which evokes specific emotions by adaptively shaping multiple emotional factors across the entire…

    Submitted 24 May, 2025; originally announced May 2025.

  25. arXiv:2505.18190  [pdf, other]

    eess.SP cs.AI cs.LG

    PhySense: Sensor Placement Optimization for Accurate Physics Sensing

    Authors: Yuezhou Ma, Haixu Wu, Hang Zhou, Huikun Weng, Jianmin Wang, Mingsheng Long

    Abstract: Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placeme…

    Submitted 26 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

  26. arXiv:2505.17779  [pdf, ps, other]

    cs.CV cs.LG

    U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

    Authors: Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo

    Abstract: Ultrasound is a widely-used imaging modality critical to global healthcare, yet its interpretation remains challenging due to its varying image quality on operators, noises, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We i…

    Submitted 30 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

  27. arXiv:2505.17488  [pdf, other]

    cs.LG eess.SY

    ExARNN: An Environment-Driven Adaptive RNN for Learning Non-Stationary Power Dynamics

    Authors: Haoran Li, Muhao Guo, Yang Weng, Marija Ilic, Guangchun Ruan

    Abstract: Non-stationary power system dynamics, influenced by renewable energy variability, evolving demand patterns, and climate change, are becoming increasingly complex. Accurately capturing these dynamics requires a model capable of adapting to environmental factors. Traditional models, including Recurrent Neural Networks (RNNs), lack efficient mechanisms to encode external factors, such as time or envi…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: 5 pages, 3 figures, conference

  28. arXiv:2505.16761  [pdf, ps, other]

    cs.CV

    Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning

    Authors: Jian Liu, Jing Xu, Song Guo, Jing Li, Jingfeng Guo, Jiaao Yu, Haohan Weng, Biwen Lei, Xianghui Yang, Zhuo Chen, Fangqi Zhu, Tao Han, Chunchao Guo

    Abstract: Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present Mesh-RFT, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optim…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Under Review

  29. Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives

    Authors: Xingxing Weng, Chao Pang, Gui-Song Xia

    Abstract: Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress. The resulting models benefit from the absorption of extensive general knowledge and demonstrate strong performance a…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted by IEEE Geoscience and Remote Sensing Magazine

    Journal ref: IEEE Geoscience and Remote Sensing Magazine, Early Access, 2025

  30. arXiv:2505.13925  [pdf, ps, other]

    cs.RO cs.LG

    Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning

    Authors: Yunpeng Jiang, Jianshu Hu, Paul Weng, Yutong Ban

    Abstract: Symmetry is pervasive in robotics and has been widely exploited to improve sample efficiency in deep reinforcement learning (DRL). However, existing approaches primarily focus on spatial symmetries, such as reflection, rotation, and translation, while largely neglecting temporal symmetries. To address this gap, we explore time reversal symmetry, a form of temporal symmetry commonly found in roboti…

    Submitted 20 May, 2025; originally announced May 2025.

  31. arXiv:2505.13573  [pdf, ps, other]

    cs.GR cs.AI

    FreeMesh: Boosting Mesh Generation with Coordinates Merging

    Authors: Jian Liu, Haohan Weng, Biwen Lei, Xianghui Yang, Zibo Zhao, Zhuo Chen, Song Guo, Tao Han, Chunchao Guo

    Abstract: The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes into sequences. In this paper, we introduce a new metric Per-Token-Mesh-Entropy (PTME) to evaluate the existing mesh tokenizers theoretically without any train…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025, camera-ready version

  32. arXiv:2505.12398  [pdf, other]

    cs.CL cs.AI cs.LG

    Traversal Verification for Speculative Tree Decoding

    Authors: Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

    Abstract: Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token tre…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: Under review

  33. arXiv:2505.12380  [pdf, ps, other]

    cs.LG cs.DB cs.PL

    Graph-Reward-SQL: Execution-Free Reinforcement Learning for Text-to-SQL via Graph Matching and Stepwise Reward

    Authors: Han Weng, Boyi Liu, Yuanfeng Song, Dun Zeng, Yingxiang Yang, Yi Zhan, Longjie Cui, Xiaoming Yin, Yang Sun

    Abstract: Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. However, existing methods often rely on execution-based or LLM-based Bradley-Terry reward models. The former suffers from high execution latency caused by repeated database calls, whereas the latter imposes substantial GPU memory overhead, both of which significantly…

    Submitted 18 May, 2025; originally announced May 2025.

  34. arXiv:2505.12220  [pdf, ps, other]

    cs.LG

    Machine Learning Applications Related to Suicide in Military and Veterans: A Scoping Literature Review

    Authors: Yuhan Zhang, Yishu Wei, Yanshan Wang, Yunyu Xiao, COL Ronald K. Poropatich, Gretchen L. Haas, Yiye Zhang, Chunhua Weng, Jinze Liu, Lisa A. Brenner, James M. Bjork, Yifan Peng

    Abstract: Suicide remains one of the main preventable causes of death among active service members and veterans. Early detection and prediction are crucial in suicide prevention. Machine learning techniques have yielded promising results in this area recently. This study aims to assess and summarize current research and provides a comprehensive review regarding the application of machine learning techniques…

    Submitted 17 May, 2025; originally announced May 2025.

  35. arXiv:2505.11907  [pdf, ps, other]

    cs.CV

    Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

    Authors: Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu

    Abstract: The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this pape…

    Submitted 17 May, 2025; originally announced May 2025.

  36. arXiv:2505.11304  [pdf, other]

    cs.LG cs.AI

    Heterogeneity-Aware Client Sampling: A Unified Solution for Consistent Federated Learning

    Authors: Shudi Weng, Chao Ren, Ming Xiao, Mikael Skoglund

    Abstract: Federated learning (FL) commonly involves clients with diverse communication and computational capabilities. Such heterogeneity can significantly distort the optimization dynamics and lead to objective inconsistency, where the global model converges to an incorrect stationary point potentially far from the pursued optimum. Despite its critical impact, the joint effect of communication and computat…

    Submitted 16 May, 2025; originally announced May 2025.

  37. arXiv:2505.10864  [pdf, ps, other]

    cs.HC cs.CR

    Anti-Sensing: Defense against Unauthorized Radar-based Human Vital Sign Sensing with Physically Realizable Wearable Oscillators

    Authors: Md Farhan Tasnim Oshim, Nigel Doering, Bashima Islam, Tsui-Wei Weng, Tauhidur Rahman

    Abstract: Recent advancements in Ultra-Wideband (UWB) radar technology have enabled contactless, non-line-of-sight vital sign monitoring, making it a valuable tool for healthcare. However, UWB radar's ability to capture sensitive physiological data, even through walls, raises significant privacy concerns, particularly in human-robot interactions and autonomous systems that rely on radar for sensing human pr…

    Submitted 16 May, 2025; originally announced May 2025.

  38. arXiv:2505.10597  [pdf, other]

    cs.LG cs.AI cs.CL

    Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment

    Authors: Jiazheng Zhang, Wenqing Jing, Zizhuo Zhang, Zhiheng Xi, Shihan Dou, Rongxiang Weng, Jiahuan Li, Jingang Wang, Mingxu Chai, Shibo Hong, Tao Gui, Qi Zhang

    Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human values. However, noisy preferences in human feedback can lead to reward misgeneralization - a phenomenon where reward models learn spurious correlations or overfit to noisy preferences, which poses important challenges to the generalization of RMs. This paper systematically analyzes the characteristics of p…

    Submitted 18 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

  39. arXiv:2505.08216  [pdf, ps, other]

    cs.RO eess.SY

    Rethink Repeatable Measures of Robot Performance with Statistical Query

    Authors: Bowen Weng, Linda Capito, Guillermo A. Castillo, Dylan Khor

    Abstract: For a general standardized testing algorithm designed to evaluate a specific aspect of a robot's performance, several key expectations are commonly imposed. Beyond accuracy (i.e., closeness to a typically unknown ground-truth reference) and efficiency (i.e., feasibility within acceptable testing costs and equipment constraints), one particularly important attribute is repeatability. Repeatability…

    Submitted 13 May, 2025; originally announced May 2025.

  40. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  41. arXiv:2505.06883  [pdf, other]

    cs.RO cs.AI cs.LG

    FACET: Force-Adaptive Control via Impedance Reference Tracking for Legged Robots

    Authors: Botian Xu, Haoyang Weng, Qingzhou Lu, Yang Gao, Huazhe Xu

    Abstract: Reinforcement learning (RL) has made significant strides in legged robot control, enabling locomotion across diverse terrains and complex loco-manipulation capabilities. However, the commonly used position or velocity tracking-based objectives are agnostic to forces experienced by the robot, leading to stiff and potentially dangerous behaviors and poor control during forceful interactions. To addr…

    Submitted 19 May, 2025; v1 submitted 11 May, 2025; originally announced May 2025.

  42. arXiv:2505.06814  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge

    Authors: Bin Li, Shenxi Liu, Yixuan Weng, Yue Du, Yuhang Tian, Shoujun Zhou

    Abstract: Following the successful hosting of the 1st (NLPCC 2023 Foshan) CMIVQA and the 2nd (NLPCC 2024 Hangzhou) MMIVQA challenges, a new task has been introduced this year to further advance research in multi-modal, multilingual, and multi-hop medical instructional question answering (M4IVQA) systems, with a specific focus on medical instructional videos. The M4IVQA challenge focuses on evaluating model…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: 12 pages, 5 figures, 4 tables

  43. arXiv:2505.04974  [pdf, other

    cs.CV

    ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

    Authors: Wanjiang Weng, Xiaofeng Tan, Hongsong Wang, Pan Zhou

    Abstract: Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsisten…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 17 pages, 9 figures

  44. arXiv:2505.04965  [pdf, ps, other

    cs.CV

    DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

    Authors: Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, Zhongchao Shi, Gao Huang

    Abstract: Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grain…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Accepted by ICLR 2025

  45. arXiv:2505.04653  [pdf, ps, other

    cs.CL cs.AI cs.CV cs.LG

    Advancing Conversational Diagnostic AI with Multimodal Reasoning

    Authors: Khaled Saab, Jan Freyberg, Chunjong Park, Tim Strother, Yong Cheng, Wei-Hung Weng, David G. T. Barrett, David Stutz, Nenad Tomasev, Anil Palepu, Valentin Liévin, Yash Sharma, Roma Ruparel, Abdullah Ahmed, Elahe Vedadi, Kimberly Kanada, Cian Hughes, Yun Liu, Geoff Brown, Yang Gao, Sean Li, S. Sara Mahdavi, James Manyika, Katherine Chou, Yossi Matias , et al. (11 additional authors not shown)

    Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the abil…

    Submitted 6 May, 2025; originally announced May 2025.

  46. arXiv:2505.04150  [pdf, other

    cs.CV cs.LG

    Learning from Similarity Proportion Loss for Classifying Skeletal Muscle Recovery Stages

    Authors: Yu Yamaoka, Weng Ian Chan, Shigeto Seno, Soichiro Fukada, Hideo Matsuda

    Abstract: Evaluating the regeneration process of damaged muscle tissue is a fundamental analysis in muscle research to measure experimental effect sizes and uncover mechanisms behind muscle weakness due to aging and disease. The conventional approach to assessing muscle tissue regeneration involves whole-slide imaging and expert visual inspection of the recovery stages based on the morphological information…

    Submitted 8 May, 2025; v1 submitted 7 May, 2025; originally announced May 2025.

    Comments: MICCAI2024 workshop ADSMI in Morocco (oral) [Peer-reviewed]

  47. arXiv:2505.03654  [pdf, other

    cs.CV cs.AI

    ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant

    Authors: Yifan Xiang, Zhenxi Zhang, Bin Li, Yixuan Weng, Shoujun Zhou, Yangfan He, Keqin Li

    Abstract: Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. To this end, existing methods may face th…

    Submitted 19 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

    Comments: Work in progress

  48. arXiv:2504.18818  [pdf, other

    cs.LG

    Frequency-Integrated Transformer for Arbitrary-Scale Super-Resolution

    Authors: Xufei Wang, Fei Ge, Jinchen Zhu, Mingjian Zhang, Qi Wu, Jifeng Ren, Shizhuang Weng

    Abstract: Methods based on implicit neural representation have demonstrated remarkable capabilities in arbitrary-scale super-resolution (ASSR) tasks, but they neglect the potential value of the frequency domain, leading to sub-optimal performance. We propose a novel network called Frequency-Integrated Transformer (FIT) to incorporate and utilize frequency information to enhance ASSR performance. FIT employ…

    Submitted 26 April, 2025; originally announced April 2025.

    Comments: 11 pages, 8 figures

  49. arXiv:2504.18535  [pdf, other

    cs.CL cs.LG

    TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation

    Authors: Gwen Yidou Weng, Benjie Wang, Guy Van den Broeck

    Abstract: As large language models (LMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either tune or post-train LMs for each new attribute - ex…

    Submitted 25 April, 2025; originally announced April 2025.

  50. arXiv:2504.15414  [pdf, other

    cs.RO cs.LG

    Post-Convergence Sim-to-Real Policy Transfer: A Principled Alternative to Cherry-Picking

    Authors: Dylan Khor, Bowen Weng

    Abstract: Learning-based approaches, particularly reinforcement learning (RL), have become widely used for developing control policies for autonomous agents, such as locomotion policies for legged robots. RL training typically maximizes a predefined reward (or minimizes a corresponding cost/loss) by iteratively optimizing policies within a simulator. Starting from a randomly initialized policy, the empirica…

    Submitted 21 April, 2025; originally announced April 2025.