-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
Authors:
Zekun Li,
Rui Zhou,
Rahul Sajnani,
Xiaoyan Cong,
Daniel Ritchie,
Srinath Sridhar
Abstract:
Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in using these models to generate long movie-like videos with rich human-object interactions that include unrealistic human-scene interaction, lack of subject identity preservation, and require expensive training. We propose GenHSI,…
▽ More
Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in using these models to generate long movie-like videos with rich human-object interactions that include unrealistic human-scene interaction, lack of subject identity preservation, and require expensive training. We propose GenHSI, a training-free method for controllable generation of long human-scene interaction videos (HSI). Taking inspiration from movie animation, our key insight is to overcome the limitations of previous work by subdividing the long video generation task into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene, a user description, and multiple images of a person, we use these three stages to generate long-videos that preserve human-identity and provide rich human-scene interactions. Script writing converts complex human tasks into simple atomic tasks that are used in the pre-visualization stage to generate 3D keyframes (storyboards). These 3D keyframes are rendered and animated by off-the-shelf video diffusion models for consistent long video generation with rich contacts in a 3D-aware manner. A key advantage of our work is that we alleviate the need for scanned, accurate scenes and create 3D keyframes from single-view images. We are the first to generate a long video sequence with a consistent camera pose that contains arbitrary numbers of character actions without training. Experiments demonstrate that our method can generate long videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene. Visit our project homepage https://kunkun0w0.github.io/project/GenHSI/ for more information.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
VLM-Empowered Multi-Mode System for Efficient and Safe Planetary Navigation
Authors:
Sinuo Cheng,
Ruyi Zhou,
Wenhao Feng,
Huaiguang Yang,
Haibo Gao,
Zongquan Deng,
Liang Ding
Abstract:
The increasingly complex and diverse planetary exploration environment requires more adaptable and flexible rover navigation strategy. In this study, we propose a VLM-empowered multi-mode system to achieve efficient while safe autonomous navigation for planetary rovers. Vision-Language Model (VLM) is used to parse scene information by image inputs to achieve a human-level understanding of terrain…
▽ More
The increasingly complex and diverse planetary exploration environment requires more adaptable and flexible rover navigation strategy. In this study, we propose a VLM-empowered multi-mode system to achieve efficient while safe autonomous navigation for planetary rovers. Vision-Language Model (VLM) is used to parse scene information by image inputs to achieve a human-level understanding of terrain complexity. Based on the complexity classification, the system switches to the most suitable navigation mode, composing of perception, mapping and planning modules designed for different terrain types, to traverse the terrain ahead before reaching the next waypoint. By integrating the local navigation system with a map server and a global waypoint generation module, the rover is equipped to handle long-distance navigation tasks in complex scenarios. The navigation system is evaluated in various simulation environments. Compared to the single-mode conservative navigation method, our multi-mode system is able to bootstrap the time and energy efficiency in a long-distance traversal with varied type of obstacles, enhancing efficiency by 79.5%, while maintaining its avoidance capabilities against terrain hazards to guarantee rover safety. More system information is shown at https://chengsn1234.github.io/multi-mode-planetary-navigation/.
△ Less
Submitted 19 June, 2025;
originally announced June 2025.
-
On the optimal regret of collaborative personalized linear bandits
Authors:
Bruce Huang,
Ruida Zhou,
Lin F. Yang,
Suhas Diggavi
Abstract:
Stochastic linear bandits are a fundamental model for sequential decision making, where an agent selects a vector-valued action and receives a noisy reward with expected value given by an unknown linear function. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applyi…
▽ More
Stochastic linear bandits are a fundamental model for sequential decision making, where an agent selects a vector-valued action and receives a noisy reward with expected value given by an unknown linear function. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applying single agent algorithms independently ignores cross-agent similarity and learning opportunities. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We provide an information-theoretic lower bound that characterizes how the number of agents, the interaction rounds, and the degree of heterogeneity jointly affect regret. We then propose a new two-stage collaborative algorithm that achieves the optimal regret. Our analysis models heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps with a optimal regret bound $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-γ}\sqrt{n})$, $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the range of $(0, \frac{d}{m σ^2})$, $[\frac{d}{m^{2γ} σ^2}, \frac{d}{σ^2}]$ and $(\frac{d}{σ^2}, \infty)$ respectively, where $σ$ measures the level of heterogeneity, $m$ is the number of agents, and $γ\in[0, 1/2]$ is an absolute constant. In contrast, agents without collaboration achieve a regret bound $O(dm\sqrt{n})$ at best.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Safe Domains of Attraction for Discrete-Time Nonlinear Systems: Characterization and Verifiable Neural Network Estimation
Authors:
Mohamed Serry,
Haoyu Li,
Ruikun Zhou,
Huan Zhang,
Jun Liu
Abstract:
Analysis of nonlinear autonomous systems typically involves estimating domains of attraction, which have been a topic of extensive research interest for decades. Despite that, accurately estimating domains of attraction for nonlinear systems remains a challenging task, where existing methods are conservative or limited to low-dimensional systems. The estimation becomes even more challenging when a…
▽ More
Analysis of nonlinear autonomous systems typically involves estimating domains of attraction, which have been a topic of extensive research interest for decades. Despite that, accurately estimating domains of attraction for nonlinear systems remains a challenging task, where existing methods are conservative or limited to low-dimensional systems. The estimation becomes even more challenging when accounting for state constraints. In this work, we propose a framework to accurately estimate safe (state-constrained) domains of attraction for discrete-time autonomous nonlinear systems. In establishing this framework, we first derive a new Zubov equation, whose solution corresponds to the exact safe domain of attraction. The solution to the aforementioned Zubov equation is shown to be unique and continuous over the whole state space. We then present a physics-informed approach to approximating the solution of the Zubov equation using neural networks. To obtain certifiable estimates of the domain of attraction from the neural network approximate solutions, we propose a verification framework that can be implemented using standard verification tools (e.g., $α,\!β$-CROWN and dReal). To illustrate its effectiveness, we demonstrate our approach through numerical examples concerning nonlinear systems with state constraints.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
How Grounded is Wikipedia? A Study on Structured Evidential Support
Authors:
William Walden,
Kathryn Ricci,
Miriam Wanner,
Zhengping Jiang,
Chandler May,
Rongkun Zhou,
Benjamin Van Durme
Abstract:
Wikipedia is a critical resource for modern NLP, serving as a rich repository of up-to-date and citation-backed information on a wide variety of subjects. The reliability of Wikipedia -- its groundedness in its cited sources -- is vital to this purpose. This work provides a quantitative analysis of the extent to which Wikipedia *is* so grounded and of how readily grounding evidence may be retrieve…
▽ More
Wikipedia is a critical resource for modern NLP, serving as a rich repository of up-to-date and citation-backed information on a wide variety of subjects. The reliability of Wikipedia -- its groundedness in its cited sources -- is vital to this purpose. This work provides a quantitative analysis of the extent to which Wikipedia *is* so grounded and of how readily grounding evidence may be retrieved. To this end, we introduce PeopleProfiles -- a large-scale, multi-level dataset of claim support annotations on Wikipedia articles of notable people. We show that roughly 20% of claims in Wikipedia *lead* sections are unsupported by the article body; roughly 27% of annotated claims in the article *body* are unsupported by their (publicly accessible) cited sources; and >80% of lead claims cannot be traced to these sources via annotated body evidence. Further, we show that recovery of complex grounding evidence for claims that *are* supported remains a challenge for standard retrieval methods.
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
SPIRE: Conditional Personalization for Federated Diffusion Generative Models
Authors:
Kaan Ozkara,
Ruida Zhou,
Suhas Diggavi
Abstract:
Recent advances in diffusion models have revolutionized generative AI, but their sheer size makes on device personalization, and thus effective federated learning (FL), infeasible. We propose Shared Backbone Personal Identity Representation Embeddings (SPIRE), a framework that casts per client diffusion based generation as conditional generation in FL. SPIRE factorizes the network into (i) a high…
▽ More
Recent advances in diffusion models have revolutionized generative AI, but their sheer size makes on device personalization, and thus effective federated learning (FL), infeasible. We propose Shared Backbone Personal Identity Representation Embeddings (SPIRE), a framework that casts per client diffusion based generation as conditional generation in FL. SPIRE factorizes the network into (i) a high capacity global backbone that learns a population level score function and (ii) lightweight, learnable client embeddings that encode local data statistics. This separation enables parameter efficient finetuning that touches $\leq 0.01\%$ of weights. We provide the first theoretical bridge between conditional diffusion training and maximum likelihood estimation in Gaussian mixture models. For a two component mixture we prove that gradient descent on the DDPM with respect to mixing weights loss recovers the optimal mixing weights and enjoys dimension free error bounds. Our analysis also hints at how client embeddings act as biases that steer a shared score network toward personalized distributions. Empirically, SPIRE matches or surpasses strong baselines during collaborative pretraining, and vastly outperforms them when adapting to unseen clients, reducing Kernel Inception Distance while updating only hundreds of parameters. SPIRE further mitigates catastrophic forgetting and remains robust across finetuning learning rate and epoch choices.
△ Less
Submitted 13 June, 2025;
originally announced June 2025.
-
Social Networks: Enumerating Maximal Community Patterns in $c$-Closed Graphs
Authors:
Gabriela Bourla,
Kaixin Wang,
Fan Wei,
Runtian Zhou
Abstract:
Fox, Seshadhri, Roughgarden, Wei, and Wein (SICOMP 2020) introduced the model of $c$-closed graphs--a distribution-free model motivated by triadic closure, one of the most pervasive structural signatures of social networks. While enumerating maximal cliques in general graphs can take exponential time, it is known that in $c$-closed graphs, maximal cliques and maximal complete bipartite subgraphs c…
▽ More
Fox, Seshadhri, Roughgarden, Wei, and Wein (SICOMP 2020) introduced the model of $c$-closed graphs--a distribution-free model motivated by triadic closure, one of the most pervasive structural signatures of social networks. While enumerating maximal cliques in general graphs can take exponential time, it is known that in $c$-closed graphs, maximal cliques and maximal complete bipartite subgraphs can always be enumerated in polynomial time. These structures correspond to blow-ups of simple patterns: a single vertex or a single edge, with some vertices required to form cliques. In this work, we explore a natural extension: we study maximal blow-ups of arbitrary finite graphs $H$ in $c$-closed graphs. We prove that for any fixed graph $H$, the number of maximal blow-ups of $H$ in an $n$-vertex $c$-closed graph is always bounded by a polynomial in $n$. We further investigate the case of induced blow-ups and provide a precise characterization of the graphs $H$ for which the number of maximal induced blow-ups is also polynomially bounded in $n$. Finally, we study the analogue questions when $H$ ranges over an infinite family of graphs.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
Authors:
Shulun Chen,
Runlong Zhou,
Zihan Zhang,
Maryam Fazel,
Simon S. Du
Abstract:
We consider the gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of…
▽ More
We consider the gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of $$\tilde{O}\left(\left(\sum_{Δ_h(s,a)>0} \frac{H^2 \log K \land \mathtt{Var}_{\max}^{\text{c}}}{Δ_h(s,a)} +\sum_{Δ_h(s,a)=0}\frac{ H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{Δ_{\mathrm{min}}} + SAH^4 (S \lor H) \right) \log K\right),$$ where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Here, $Δ_h(s,a) =V_h^* (a) - Q_h^* (s, a)$ represents the suboptimality gap and $Δ_{\mathrm{min}} := \min_{Δ_h (s,a) > 0} Δ_h(s,a)$. The term $\mathtt{Var}_{\max}^{\text{c}}$ denotes the maximum conditional total variance, calculated as the maximum over all $(π, h, s)$ tuples of the expected total variance under policy $π$ conditioned on trajectories visiting state $s$ at step $h$. $\mathtt{Var}_{\max}^{\text{c}}$ characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gap and can be potentially adapted for other algorithms. To complement the study, we establish a lower bound of $$Ω\left( \sum_{Δ_h(s,a)>0} \frac{H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{Δ_h(s,a)}\cdot \log K\right),$$ demonstrating the necessity of dependence on $\mathtt{Var}_{\max}^{\text{c}}$ even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
Whole-Body Constrained Learning for Legged Locomotion via Hierarchical Optimization
Authors:
Haoyu Wang,
Ruyi Zhou,
Liang Ding,
Tie Liu,
Zhelin Zhang,
Peng Xu,
Haibo Gao,
Zongquan Deng
Abstract:
Reinforcement learning (RL) has demonstrated impressive performance in legged locomotion over various challenging environments. However, due to the sim-to-real gap and lack of explainability, unconstrained RL policies deployed in the real world still suffer from inevitable safety issues, such as joint collisions, excessive torque, or foot slippage in low-friction environments. These problems limit…
▽ More
Reinforcement learning (RL) has demonstrated impressive performance in legged locomotion over various challenging environments. However, due to the sim-to-real gap and lack of explainability, unconstrained RL policies deployed in the real world still suffer from inevitable safety issues, such as joint collisions, excessive torque, or foot slippage in low-friction environments. These problems limit its usage in missions with strict safety requirements, such as planetary exploration, nuclear facility inspection, and deep-sea operations. In this paper, we design a hierarchical optimization-based whole-body follower, which integrates both hard and soft constraints into RL framework to make the robot move with better safety guarantees. Leveraging the advantages of model-based control, our approach allows for the definition of various types of hard and soft constraints during training or deployment, which allows for policy fine-tuning and mitigates the challenges of sim-to-real transfer. Meanwhile, it preserves the robustness of RL when dealing with locomotion in complex unstructured environments. The trained policy with introduced constraints was deployed in a hexapod robot and tested in various outdoor environments, including snow-covered slopes and stairs, demonstrating the great traversability and safety of our approach.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query
Authors:
Wei Chow,
Yuan Gao,
Linfeng Li,
Xian Wang,
Qi Xu,
Hang Song,
Lingdong Kong,
Ran Zhou,
Yi Zeng,
Yidong Cai,
Botian Jiang,
Shilin Xu,
Jiajun Zhang,
Minghui Qiu,
Xiangtai Li,
Tianshu Yang,
Siliang Tang,
Juncheng Li
Abstract:
Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios freq…
▽ More
Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify existing models's limitation: focusing solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results
Authors:
Xiaohong Liu,
Xiongkuo Min,
Qiang Hu,
Xiaoyun Zhang,
Jie Guo,
Guangtao Zhai,
Shushi Wang,
Yingjie Zhou,
Lu Liu,
Jingxin Li,
Liu Yang,
Farong Wen,
Li Xu,
Yanwei Jiang,
Xilei Zhu,
Chunyi Li,
Zicheng Zhang,
Huiyu Duan,
Xiele Wu,
Yixuan Gao,
Yuqin Cao,
Jun Jia,
Wei Sun,
Jiezhang Cao,
Radu Timofte
, et al. (70 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. This challenge is to address a major challenge in the field of video and talking head processing. The challenge is divided into three tracks, including user generated video, AI generated video and talking he…
▽ More
This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. This challenge is to address a major challenge in the field of video and talking head processing. The challenge is divided into three tracks, including user generated video, AI generated video and talking head. The user-generated video track uses the FineVD-GC, which contains 6,284 user generated videos. The user-generated video track has a total of 125 registered participants. A total of 242 submissions are received in the development phase, and 136 submissions are received in the test phase. Finally, 5 participating teams submitted their models and fact sheets. The AI generated video track uses the Q-Eval-Video, which contains 34,029 AI-Generated Videos (AIGVs) generated by 11 popular Text-to-Video (T2V) models. A total of 133 participants have registered in this track. A total of 396 submissions are received in the development phase, and 226 submissions are received in the test phase. Finally, 6 participating teams submitted their models and fact sheets. The talking head track uses the THQA-NTIRE, which contains 12,247 2D and 3D talking heads. A total of 89 participants have registered in this track. A total of 225 submissions are received in the development phase, and 118 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Each participating team in every track has proposed a method that outperforms the baseline, which has contributed to the development of fields in three tracks.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation
Authors:
Guobin Zhu,
Rui Zhou,
Wenkang Ji,
Shiyu Zhao
Abstract:
Although Multi-Agent Reinforcement Learning (MARL) is effective for complex multi-robot tasks, it suffers from low sample efficiency and requires iterative manual reward tuning. Large Language Models (LLMs) have shown promise in single-robot settings, but their application in multi-robot systems remains largely unexplored. This paper introduces a novel LLM-Aided MARL (LAMARL) approach, which integ…
▽ More
Although Multi-Agent Reinforcement Learning (MARL) is effective for complex multi-robot tasks, it suffers from low sample efficiency and requires iterative manual reward tuning. Large Language Models (LLMs) have shown promise in single-robot settings, but their application in multi-robot systems remains largely unexplored. This paper introduces a novel LLM-Aided MARL (LAMARL) approach, which integrates MARL with LLMs, significantly enhancing sample efficiency without requiring manual design. LAMARL consists of two modules: the first module leverages LLMs to fully automate the generation of prior policy and reward functions. The second module is MARL, which uses the generated functions to guide robot policy training effectively. On a shape assembly benchmark, both simulation and real-world experiments demonstrate the unique advantages of LAMARL. Ablation studies show that the prior policy improves sample efficiency by an average of 185.9% and enhances task completion, while structured prompts based on Chain-of-Thought (CoT) and basic APIs improve LLM output success rates by 28.5%-67.5%. Videos and code are available at https://windylab.github.io/LAMARL/
△ Less
Submitted 3 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling
Authors:
Zhengyu Chen,
Yudong Wang,
Teng Xiao,
Ruochen Zhou,
Xuesheng Yang,
Wei Wang,
Zhifang Sui,
Jingang Wang
Abstract:
Recent advancements in improving the reasoning capabilities of Large Language Models have underscored the efficacy of Process Reward Models (PRMs) in addressing intermediate errors through structured feedback mechanisms. This study analyzes PRMs from multiple perspectives, including training methodologies, scalability, and generalization capabilities. We investigate the interplay between pre-train…
▽ More
Recent advancements in improving the reasoning capabilities of Large Language Models have underscored the efficacy of Process Reward Models (PRMs) in addressing intermediate errors through structured feedback mechanisms. This study analyzes PRMs from multiple perspectives, including training methodologies, scalability, and generalization capabilities. We investigate the interplay between pre-training and reward model training FLOPs to assess their influence on PRM efficiency and accuracy in complex reasoning tasks. Our analysis reveals a pattern of diminishing returns in performance with increasing PRM scale, highlighting the importance of balancing model size and computational cost. Furthermore, the diversity of training datasets significantly impacts PRM performance, emphasizing the importance of diverse data to enhance both accuracy and efficiency. We further examine test-time scaling strategies, identifying Monte Carlo Tree Search as the most effective method when computational resources are abundant, while Best-of-N Sampling serves as a practical alternative under resource-limited conditions. Notably, our findings indicate that PRMs trained on mathematical datasets exhibit performance comparable to those tailored for code generation, suggesting robust cross-domain generalization. Employing a gradient-based metric, we observe that PRMs exhibit a preference for selecting responses with similar underlying patterns, further informing their optimization.
△ Less
Submitted 24 May, 2025;
originally announced June 2025.
-
DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification
Authors:
Daoxi Cao,
Hangbei Cheng,
Yijin Li,
Ruolin Zhou,
Xuehan Zhang,
Xinyi Li,
Binwei Li,
Xuancheng Gu,
Jianan Zhang,
Xueyu Liu,
Yongfei Wu
Abstract:
Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a te…
▽ More
Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design. DSAGL explicitly addresses instance-level ambiguity and bag-level semantic consistency by generating multi-scale attention-based pseudo labels and guiding instance-level learning. A shared lightweight encoder (VSSMamba) enables efficient long-range dependency modeling, while a fusion-attentive module (FASA) enhances focus on sparse but diagnostically relevant regions. We further introduce a hybrid loss to enforce mutual consistency between the two streams. Experiments on CIFAR-10, NCT-CRC, and TCGA-Lung datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL baselines, achieving superior discriminative performance and robustness under weak supervision.
△ Less
Submitted 27 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Authors:
Ruizhe Shi,
Minhak Song,
Runlong Zhou,
Zihan Zhang,
Maryam Fazel,
Simon S. Du
Abstract:
We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we char…
▽ More
We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
Authors:
Yongheng Zhang,
Xu Liu,
Ruoxi Zhou,
Qiguang Chen,
Hao Fei,
Wenpeng Lu,
Libo Qin
Abstract:
Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. M…
▽ More
Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
Authors:
Wei Liu,
Ruochen Zhou,
Yiyun Deng,
Yuzhen Huang,
Junteng Liu,
Yuntian Deng,
Yizhe Zhang,
Junxian He
Abstract:
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present…
▽ More
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
Authors:
Tencent Hunyuan Team,
Ao Liu,
Botong Zhou,
Can Xu,
Chayse Zhou,
ChenChen Zhang,
Chengcheng Xu,
Chenhao Wang,
Decheng Wu,
Dengpeng Wu,
Dian Jiao,
Dong Du,
Dong Wang,
Feng Zhang,
Fengzong Lian,
Guanghui Xu,
Guanwei Zhang,
Hai Wang,
Haipeng Luo,
Han Hu,
Huilin Xu,
Jiajia Wu,
Jianchen Zhu,
Jianfeng Yan,
Jiaqi Zhu
, et al. (230 additional authors not shown)
Abstract:
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid response…
▽ More
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
△ Less
Submitted 22 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales
Authors:
Jun Cao,
Jiyi Li,
Ziwei Yang,
Renjie Zhou
Abstract:
There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with an aim to align these two modalities. However, small SLMs possess limited capacity and knowledge, often resulting in inaccurate i…
▽ More
There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with an aim to align these two modalities. However, small SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. On the other hand, Large language models (LLMs) have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. However, some studies indicate that LLMs still fall short compared to fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism for enhancing feature interaction and fusion, thereby augmenting the SLMs' ability to identify aspects and sentiments. We evaluated our method using two baseline models, numerous experiments highlight the superiority of our approach on three widely-used benchmarks, indicating its generalizability and applicability to most pre-trained models for MABSA.
△ Less
Submitted 23 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
SuperMapNet for Long-Range and High-Accuracy Vectorized HD Map Construction
Authors:
Ruqin Zhou,
Chenguang Dai,
Wanshou Jiang,
Yongsheng Zhang,
Hanyun Wang,
San Jiang
Abstract:
Vectorized HD map is essential for autonomous driving. Remarkable work has been achieved in recent years, but there are still major issues: (1) in the generation of the BEV features, single modality-based methods are of limited perception capability, while direct concatenation-based multi-modal methods fail to capture synergies and disparities between different modalities, resulting in limited ran…
▽ More
Vectorized HD map is essential for autonomous driving. Remarkable work has been achieved in recent years, but there are still major issues: (1) in the generation of the BEV features, single modality-based methods are of limited perception capability, while direct concatenation-based multi-modal methods fail to capture synergies and disparities between different modalities, resulting in limited ranges with feature holes; (2) in the classification and localization of map elements, only point information is used without the consideration of element infor-mation and neglects the interaction between point information and element information, leading to erroneous shapes and element entanglement with low accuracy. To address above issues, we introduce SuperMapNet for long-range and high-accuracy vectorized HD map construction. It uses both camera images and LiDAR point clouds as input, and first tightly couple semantic information from camera images and geometric information from LiDAR point clouds by a cross-attention based synergy enhancement module and a flow-based disparity alignment module for long-range BEV feature generation. And then, local features from point queries and global features from element queries are tightly coupled by three-level interactions for high-accuracy classification and localization, where Point2Point interaction learns local geometric information between points of the same element and of each point, Element2Element interaction learns relation constraints between different elements and semantic information of each elements, and Point2Element interaction learns complement element information for its constituent points. Experiments on the nuScenes and Argoverse2 datasets demonstrate superior performances, surpassing SOTAs over 14.9/8.8 mAP and 18.5/3.1 mAP under hard/easy settings, respectively. The code is made publicly available1.
△ Less
Submitted 4 June, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
EfficientLLM: Efficiency in Large Language Models
Authors:
Zhengqing Yuan,
Weixiang Sun,
Yixin Liu,
Huichi Zhou,
Rong Zhou,
Yiyang Li,
Zheyuan Zhang,
Wei Song,
Yue Huang,
Haolong Jia,
Keerthiram Murugesan,
Yu Wang,
Lifang He,
Jianfeng Gao,
Lichao Sun,
Yanfang Ye
Abstract:
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematica…
▽ More
Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
ELITE: Embedding-Less retrieval with Iterative Text Exploration
Authors:
Zhangyu Wang,
Siyuan Gao,
Rong Zhou,
Hao Wang,
Li Ning
Abstract:
Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant information from an external corpus. However, existing RAG systems often rely on embedding-based retrieval trained…
▽ More
Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant information from an external corpus. However, existing RAG systems often rely on embedding-based retrieval trained on corpus-level semantic similarity, which can lead to retrieving content that is semantically similar in form but misaligned with the question's true intent. Furthermore, recent RAG variants construct graph- or hierarchy-based structures to improve retrieval accuracy, resulting in significant computation and storage overhead. In this paper, we propose an embedding-free retrieval framework. Our method leverages the logical inferencing ability of LLMs in retrieval using iterative search space refinement guided by our novel importance measure and extend our retrieval results with logically related information without explicit graph construction. Experiments on long-context QA benchmarks, including NovelQA and Marathon, show that our approach outperforms strong baselines while reducing storage and runtime by over an order of magnitude.
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
Quantum Conflict Measurement in Decision Making for Out-of-Distribution Detection
Authors:
Yilin Dong,
Tianyun Zhu,
Xinde Li,
Jean Dezert,
Rigui Zhou,
Changming Zhu,
Lei Cao,
Shuzhi Sam Ge
Abstract:
Quantum Dempster-Shafer Theory (QDST) uses quantum interference effects to derive a quantum mass function (QMF) as a fuzzy metric type from information obtained from various data sources. In addition, QDST uses quantum parallel computing to speed up computation. Nevertheless, the effective management of conflicts between multiple QMFs in QDST is a challenging question. This work aims to address th…
▽ More
Quantum Dempster-Shafer Theory (QDST) uses quantum interference effects to derive a quantum mass function (QMF) as a fuzzy metric type from information obtained from various data sources. In addition, QDST uses quantum parallel computing to speed up computation. Nevertheless, the effective management of conflicts between multiple QMFs in QDST is a challenging question. This work aims to address this problem by proposing a Quantum Conflict Indicator (QCI) that measures the conflict between two QMFs in decision-making. Then, the properties of the QCI are carefully investigated. The obtained results validate its compliance with desirable conflict measurement properties such as non-negativity, symmetry, boundedness, extreme consistency and insensitivity to refinement. We then apply the proposed QCI in conflict fusion methods and compare its performance with several commonly used fusion approaches. This comparison demonstrates the superiority of the QCI-based conflict fusion method. Moreover, the Class Description Domain Space (C-DDS) and its optimized version, C-DDS+ by utilizing the QCI-based fusion method, are proposed to address the Out-of-Distribution (OOD) detection task. The experimental results show that the proposed approach gives better OOD performance with respect to several state-of-the-art baseline OOD detection methods. Specifically, it achieves an average increase in Area Under the Receiver Operating Characteristic Curve (AUC) of 1.2% and a corresponding average decrease in False Positive Rate at 95% True Negative Rate (FPR95) of 5.4% compared to the optimal baseline method.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Adaptively Point-weighting Curriculum Learning
Authors:
Wensheng Li,
Hao Wang,
Ruifeng Zhou,
Hanting Guan,
Chao Zhang,
Dacheng Tao
Abstract:
Curriculum learning (CL) is referred to as a training strategy that makes easy samples learned first and then fits hard samples. It imitates the process of humans learning knowledge, and has become a potential manner of effectively training deep networks. In this study, we develop the adaptively point-weighting (APW) curriculum learning algorithm, which adaptively assigns the weight to every train…
▽ More
Curriculum learning (CL) is referred to as a training strategy that makes easy samples learned first and then fits hard samples. It imitates the process of humans learning knowledge, and has become a potential manner of effectively training deep networks. In this study, we develop the adaptively point-weighting (APW) curriculum learning algorithm, which adaptively assigns the weight to every training sample not only based on its training error but also considering the current training state of the network. Specifically, in the early training phase, it increases the weights of easy samples to make the network rapidly capture the overall characteristics of the dataset; and in the later training phase, the weights of hard points rise to improve the fitting performance on the discrete local regions. Moreover, we also present the theoretical analysis on the properties of APW including training effectiveness, training feasibility, training stability, and generalization performance. The numerical experiments support the superiority of APW and demonstrate the validity of our theoretical findings.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
CloudQC: A Network-aware Framework for Multi-tenant Distributed Quantum Computing
Authors:
Ruilin Zhou,
Yuhang Gan,
Yi Liu,
Chen Qian
Abstract:
Distributed quantum computing (DQC) that allows a large quantum circuit to be executed simultaneously on multiple quantum processing units (QPUs) becomes a promising approach to increase the scalability of quantum computing. It is natural to envision the near-future DQC platform as a multi-tenant cluster of QPUs, called a Quantum Cloud. However, no existing DQC work has addressed the two key probl…
▽ More
Distributed quantum computing (DQC) that allows a large quantum circuit to be executed simultaneously on multiple quantum processing units (QPUs) becomes a promising approach to increase the scalability of quantum computing. It is natural to envision the near-future DQC platform as a multi-tenant cluster of QPUs, called a Quantum Cloud. However, no existing DQC work has addressed the two key problems of running DQC in a multi-tenant quantum cloud: placing multiple quantum circuits to QPUs and scheduling network resources to complete these jobs. This work is the first attempt to design a circuit placement and resource scheduling framework for a multi-tenant environment. The proposed framework is called CloudQC, which includes two main functional components, circuit placement and network scheduler, with the objectives of optimizing both quantum network cost and quantum computing time. Experimental results with real quantum circuit workloads show that CloudQC significantly reduces the average job completion time compared to existing DQC placement algorithms for both single-circuit and multi-circuit DQC. We envision this work will motivate more future work on network-aware quantum cloud.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
DEER: Deep Runahead for Instruction Prefetching on Modern Mobile Workloads
Authors:
Parmida Vahdatniya,
Julian Humecki,
Henry Kao,
Tony Li,
Ali Sedaghati,
Fang Su,
Ruoyu Zhou,
Alex Bi,
Reza Azimi,
Maziar Goudarzi
Abstract:
Mobile workloads incur heavy frontend stalls due to increasingly large code footprints as well as long repeat cycles. Existing instruction-prefetching techniques suffer from low coverage, poor timeliness, or high cost. We provide a SW/HW co-designed I-prefetcher; DEER uses profile analysis to extract metadata information that allow the hardware to prefetch the most likely future instruction cachel…
▽ More
Mobile workloads incur heavy frontend stalls due to increasingly large code footprints as well as long repeat cycles. Existing instruction-prefetching techniques suffer from low coverage, poor timeliness, or high cost. We provide a SW/HW co-designed I-prefetcher; DEER uses profile analysis to extract metadata information that allow the hardware to prefetch the most likely future instruction cachelines, hundreds of instructions earlier. This profile analysis skips over loops and recursions to go deeper into the future, and uses a return-address stack on the hardware side to allow prefetch on the return-path from large call-stacks. The produced metadata table is put in DRAM, pointed to by an in-hardware register; the high depth of the lookahead allows to preload the metadata in time and thus nearly no on-chip metadata storage is needed. Gem5 evaluation on real-world modern mobile workloads shows up to 45% reduction in L2 instruction-miss rate (19.6% on average), resulting in up to 8% speedup (4.7% on average). These gains are up to 4X larger than full-hardware record-and-replay prefetchers, while needing two orders of magnitude smaller on-chip storage.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
GenTorrent: Scaling Large Language Model Serving with An Overley Network
Authors:
Fei Fang,
Yifan Hua,
Shengze Wang,
Ruilin Zhou,
Yi Liu,
Chen Qian,
Xiaoxue Zhang
Abstract:
While significant progress has been made in research and development on open-source and cost-efficient large-language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availabilit…
▽ More
While significant progress has been made in research and development on open-source and cost-efficient large-language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availability, we propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure: 1) overlay network organization; 2) LLM communication privacy; 3) overlay forwarding for resource efficiency; and 4) verification of serving quality. This work presents the first systematic study of these fundamental problems in the context of decentralized LLM serving. Evaluation results from a prototype implemented on a set of decentralized nodes demonstrate that GenTorrent achieves a latency reduction of over 50% compared to the baseline design without overlay forwarding. Furthermore, the security features introduce minimal overhead to serving latency and throughput. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.
△ Less
Submitted 30 April, 2025; v1 submitted 26 April, 2025;
originally announced April 2025.
-
Lightweight Adapter Learning for More Generalized Remote Sensing Change Detection
Authors:
Dou Quan,
Rufan Zhou,
Shuang Wang,
Ning Huyan,
Dong Zhao,
Yunan Li,
Licheng Jiao
Abstract:
Deep learning methods have shown promising performances in remote sensing image change detection (CD). However, existing methods usually train a dataset-specific deep network for each dataset. Due to the significant differences in the data distribution and labeling between various datasets, the trained dataset-specific deep network has poor generalization performances on other datasets. To solve t…
▽ More
Deep learning methods have shown promising performances in remote sensing image change detection (CD). However, existing methods usually train a dataset-specific deep network for each dataset. Due to the significant differences in the data distribution and labeling between various datasets, the trained dataset-specific deep network has poor generalization performances on other datasets. To solve this problem, this paper proposes a change adapter network (CANet) for a more universal and generalized CD. CANet contains dataset-shared and dataset-specific learning modules. The former explores the discriminative features of images, and the latter designs a lightweight adapter model, to deal with the characteristics of different datasets in data distribution and labeling. The lightweight adapter can quickly generalize the deep network for new CD tasks with a small computation cost. Specifically, this paper proposes an interesting change region mask (ICM) in the adapter, which can adaptively focus on interested change objects and decrease the influence of labeling differences in various datasets. Moreover, CANet adopts a unique batch normalization layer for each dataset to deal with data distribution differences. Compared with existing deep learning methods, CANet can achieve satisfactory CD performances on various datasets simultaneously. Experimental results on several public datasets have verified the effectiveness and advantages of the proposed CANet on CD. CANet has a stronger generalization ability, smaller training costs (merely updating 4.1%-7.7% parameters), and better performances under limited training datasets than other deep learning methods, which also can be flexibly inserted with existing deep models.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Authors:
Jiageng Wu,
Bowen Gu,
Ren Zhou,
Kevin Xie,
Doug Snyder,
Yixing Jiang,
Valentina Carducci,
Richard Wyss,
Rishi J Desai,
Emily Alsentzer,
Leo Anthony Celi,
Adam Rodman,
Sebastian Schneeweiss,
Jonathan H. Chen,
Santiago Romero-Brufau,
Kueiyu Joshua Lin,
Jie Yang
Abstract:
Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. O…
▽ More
Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. With a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding.
The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
△ Less
Submitted 30 April, 2025; v1 submitted 28 April, 2025;
originally announced April 2025.
-
Optimal Static Fully Indexable Dictionaries
Authors:
Jingxun Liang,
Renfei Zhou
Abstract:
Fully indexable dictionaries (FID) store sets of integer keys while supporting rank/select queries. They serve as basic building blocks in many succinct data structures. Despite the great importance of FIDs, no known FID is succinct with efficient query time when the universe size $U$ is a large polynomial in the number of keys $n$, which is the conventional parameter regime for dictionary problem…
▽ More
Fully indexable dictionaries (FID) store sets of integer keys while supporting rank/select queries. They serve as basic building blocks in many succinct data structures. Despite the great importance of FIDs, no known FID is succinct with efficient query time when the universe size $U$ is a large polynomial in the number of keys $n$, which is the conventional parameter regime for dictionary problems. In this paper, we design an FID that uses $\log \binom{U}{n} + \frac{n}{(\log U / t)^{Ω(t)}}$ bits of space, and answers rank/select queries in $O(t + \log \log n)$ time in the worst case, for any parameter $1 \le t \le \log n / \log \log n$, provided $U = n^{1 + Θ(1)}$. This time-space trade-off matches known lower bounds for FIDs [Pǎtraşcu & Thorup STOC 2006; Pǎtraşcu & Viola SODA 2010] when $t \le \log^{0.99} n$.
Our techniques also lead to efficient succinct data structures for the fundamental problem of maintaining $n$ integers each of $\ell = Θ(\log n)$ bits and supporting partial-sum queries, with a trade-off between $O(t)$ query time and $n\ell + n / (\log n / t)^{Ω(t)}$ bits of space. Prior to this work, no known data structure for the partial-sum problem achieves constant query time with $n \ell + o(n)$ bits of space usage.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
Depth as Points: Center Point-based Depth Estimation
Authors:
Zhiheng Tu,
Xinjian Huang,
Yong He,
Ruiyang Zhou,
Bo Du,
Weitao Wu
Abstract:
The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection, imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Lev…
▽ More
The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection, imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Leveraging this method, we construct the virtual depth estimation dataset VirDepth, a large-scale, multi-task autonomous driving dataset. Subsequently, we propose CenterDepth, a lightweight architecture for monocular depth estimation that ensures high operational efficiency and exhibits superior performance in depth estimation tasks with highly imbalanced height-scale distributions. CenterDepth integrates global semantic information through the innovative Center FC-CRFs algorithm, aggregates multi-scale features based on object key points, and enables detection-based depth estimation of targets. Experiments demonstrate that our proposed method achieves superior performance in terms of both computational speed and prediction accuracy.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
A Self-supervised Learning Method for Raman Spectroscopy based on Masked Autoencoders
Authors:
Pengju Ren,
Ri-gui Zhou,
Yaochong Li
Abstract:
Raman spectroscopy serves as a powerful and reliable tool for analyzing the chemical information of substances. The integration of Raman spectroscopy with deep learning methods enables rapid qualitative and quantitative analysis of materials. Most existing approaches adopt supervised learning methods. Although supervised learning has achieved satisfactory accuracy in spectral analysis, it is still…
▽ More
Raman spectroscopy serves as a powerful and reliable tool for analyzing the chemical information of substances. The integration of Raman spectroscopy with deep learning methods enables rapid qualitative and quantitative analysis of materials. Most existing approaches adopt supervised learning methods. Although supervised learning has achieved satisfactory accuracy in spectral analysis, it is still constrained by costly and limited well-annotated spectral datasets for training. When spectral annotation is challenging or the amount of annotated data is insufficient, the performance of supervised learning in spectral material identification declines. In order to address the challenge of feature extraction from unannotated spectra, we propose a self-supervised learning paradigm for Raman Spectroscopy based on a Masked AutoEncoder, termed SMAE. SMAE does not require any spectral annotations during pre-training. By randomly masking and then reconstructing the spectral information, the model learns essential spectral features. The reconstructed spectra exhibit certain denoising properties, improving the signal-to-noise ratio (SNR) by more than twofold. Utilizing the network weights obtained from masked pre-training, SMAE achieves clustering accuracy of over 80% for 30 classes of isolated bacteria in a pathogenic bacterial dataset, demonstrating significant improvements compared to classical unsupervised methods and other state-of-the-art deep clustering methods. After fine-tuning the network with a limited amount of annotated data, SMAE achieves an identification accuracy of 83.90% on the test set, presenting competitive performance against the supervised ResNet (83.40%).
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
An Iterative Task-Driven Framework for Resilient LiDAR Place Recognition in Adverse Weather
Authors:
Xiongwei Zhao,
Xieyuanli Chen,
Xu Zhu,
Xingxiang Xie,
Haojie Bai,
Congcong Wen,
Rundong Zhou,
Qihao Sun
Abstract:
LiDAR place recognition (LPR) plays a vital role in autonomous navigation. However, existing LPR methods struggle to maintain robustness under adverse weather conditions such as rain, snow, and fog, where weather-induced noise and point cloud degradation impair LiDAR reliability and perception accuracy. To tackle these challenges, we propose an Iterative Task-Driven Framework (ITDNet), which integ…
▽ More
LiDAR place recognition (LPR) plays a vital role in autonomous navigation. However, existing LPR methods struggle to maintain robustness under adverse weather conditions such as rain, snow, and fog, where weather-induced noise and point cloud degradation impair LiDAR reliability and perception accuracy. To tackle these challenges, we propose an Iterative Task-Driven Framework (ITDNet), which integrates a LiDAR Data Restoration (LDR) module and a LiDAR Place Recognition (LPR) module through an iterative learning strategy. These modules are jointly trained end-to-end, with alternating optimization to enhance performance. The core rationale of ITDNet is to leverage the LDR module to recover the corrupted point clouds while preserving structural consistency with clean data, thereby improving LPR accuracy in adverse weather. Simultaneously, the LPR task provides feature pseudo-labels to guide the LDR module's training, aligning it more effectively with the LPR task. To achieve this, we first design a task-driven LPR loss and a reconstruction loss to jointly supervise the optimization of the LDR module. Furthermore, for the LDR module, we propose a Dual-Domain Mixer (DDM) block for frequency-spatial feature fusion and a Semantic-Aware Generator (SAG) block for semantic-guided restoration. In addition, for the LPR module, we introduce a Multi-Frequency Transformer (MFT) block and a Wavelet Pyramid NetVLAD (WPN) block to aggregate multi-scale, robust global descriptors. Finally, extensive experiments on Weather-KITTI, Boreas, and our proposed Weather-Apollo datasets demonstrate that, ITDNet outperforms existing LPR methods, achieving state-of-the-art performance in adverse weather.
△ Less
Submitted 19 June, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
Robust Gittins for Stochastic Scheduling
Authors:
Benjamin Moseley,
Heather Newman,
Kirk Pruhs,
Rudy Zhou
Abstract:
A common theme in stochastic optimization problems is that, theoretically, stochastic algorithms need to "know" relatively rich information about the underlying distributions. This is at odds with most applications, where distributions are rough predictions based on historical data. Thus, commonly, stochastic algorithms are making decisions using imperfect predicted distributions, while trying to…
▽ More
A common theme in stochastic optimization problems is that, theoretically, stochastic algorithms need to "know" relatively rich information about the underlying distributions. This is at odds with most applications, where distributions are rough predictions based on historical data. Thus, commonly, stochastic algorithms are making decisions using imperfect predicted distributions, while trying to optimize over some unknown true distributions. We consider the fundamental problem of scheduling stochastic jobs preemptively on a single machine to minimize expected mean completion time in the setting where the scheduler is only given imperfect predicted job size distributions. If the predicted distributions are perfect, then it is known that this problem can be solved optimally by the Gittins index policy. The goal of our work is to design a scheduling policy that is robust in the sense that it produces nearly optimal schedules even if there are modest discrepancies between the predicted distributions and the underlying real distributions. Our main contributions are:
(1) We show that the standard Gittins index policy is not robust in this sense. If the true distributions are perturbed by even an arbitrarily small amount, then running the Gittins index policy using the perturbed distributions can lead to an unbounded increase in mean completion time.
(2) We explain how to modify the Gittins index policy to make it robust, that is, to produce nearly optimal schedules, where the approximation depends on a new measure of error between the true and predicted distributions that we define.
Looking forward, the approach we develop here can be applied more broadly to many other stochastic optimization problems to better understand the impact of mispredictions, and lead to the development of new algorithms that are robust against such mispredictions.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Kimi-VL Technical Report
Authors:
Kimi Team,
Angang Du,
Bohong Yin,
Bowei Xing,
Bowen Qu,
Bowen Wang,
Cheng Chen,
Chenlin Zhang,
Chenzhuang Du,
Chu Wei,
Congcong Wang,
Dehao Zhang,
Dikang Du,
Dongliang Wang,
Enming Yuan,
Enzhe Lu,
Fang Li,
Flood Sung,
Guangda Wei,
Guokun Lai,
Han Zhu,
Hao Ding,
Hao Hu,
Hao Yang,
Hao Zhang
, et al. (70 additional authors not shown)
Abstract:
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-…
▽ More
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
△ Less
Submitted 23 June, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
Authors:
Runlong Zhou,
Yi Zhang
Abstract:
Language models often struggle with cross-mode knowledge retrieval -- the ability to access knowledge learned in one format (mode) when queried in another. We demonstrate that models trained on multiple data sources (e.g., Wikipedia and TinyStories) exhibit significantly reduced accuracy when retrieving knowledge in a format different from its original training mode. This paper quantitatively inve…
▽ More
Language models often struggle with cross-mode knowledge retrieval -- the ability to access knowledge learned in one format (mode) when queried in another. We demonstrate that models trained on multiple data sources (e.g., Wikipedia and TinyStories) exhibit significantly reduced accuracy when retrieving knowledge in a format different from its original training mode. This paper quantitatively investigates this phenomenon through a controlled study of random token sequence memorization across different modes. We first explore dataset rewriting as a solution, revealing that effective cross-mode retrieval requires prohibitively extensive rewriting efforts that follow a sigmoid-like relationship. As an alternative, we propose CASCADE, a novel pretraining algorithm that uses cascading datasets with varying sequence lengths to capture knowledge at different scales. Our experiments demonstrate that CASCADE outperforms dataset rewriting approaches, even when compressed into a single model with a unified loss function. This work provides both qualitative evidence of cross-mode retrieval limitations and a practical solution to enhance language models' ability to access knowledge independently of its presentational format.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Two-stage deep learning framework for the restoration of incomplete-ring PET images
Authors:
Yeqi Fang,
Rong Zhou
Abstract:
Positron Emission Tomography (PET) is an important molecular imaging tool widely used in medicine. Traditional PET systems rely on complete detector rings for full angular coverage and reliable data collection. However, incomplete-ring PET scanners have emerged due to hardware failures, cost constraints, or specific clinical needs. Standard reconstruction algorithms often suffer from performance d…
▽ More
Positron Emission Tomography (PET) is an important molecular imaging tool widely used in medicine. Traditional PET systems rely on complete detector rings for full angular coverage and reliable data collection. However, incomplete-ring PET scanners have emerged due to hardware failures, cost constraints, or specific clinical needs. Standard reconstruction algorithms often suffer from performance degradation with these systems because of reduced data completeness and geometric inconsistencies. We present a two-stage deep-learning framework that, without incorporating any time-of-flight (TOF) information, restores high-quality images from data with about 50% missing coincidences - double the loss levels previously addressed by CNN-based methods. The pipeline operates in two stages: a projection-domain Attention U-Net first predicts the missing sections of the sinogram by leveraging spatial context from neighbouring slices, after which the completed data are reconstructed with OSEM algorithm and passed to a U-Net-diffusion module that removes residual artefacts while reinstating high-frequency detail. Using 206 brain volumes from a public dataset, the result shows that our model successfully preserves most anatomical structures and tracer distribution features with PSNR of 30.92 dB and SSIM of 0.9708. We also achieve higher inference speed, thus providing an effective solution for incomplete-ring PET imaging.
△ Less
Submitted 4 June, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge
Authors:
Adam Schmidt,
Mert Asim Karaoglu,
Soham Sinha,
Mingang Jang,
Ho-Gun Ha,
Kyungmin Jung,
Kyeongmo Gu,
Ihsan Ullah,
Hyunki Lee,
Jonáš Šerých,
Michal Neoral,
Jiří Matas,
Rulin Zhou,
Wenlong He,
An Wang,
Hongliang Ren,
Bruno Silva,
Sandro Queirós,
Estêvão Lima,
João L. Vilaça,
Shunsuke Kikuchi,
Atsushi Kouno,
Hiroki Matsuzaki,
Tongtong Li,
Yulu Chen
, et al. (15 additional authors not shown)
Abstract:
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to a…
▽ More
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: https://zenodo.org/records/14803158 , and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models
Authors:
Wenkang Ji,
Huaben Chen,
Mingyang Chen,
Guobin Zhu,
Lufeng Xu,
Roderich Groß,
Rui Zhou,
Ming Cao,
Shiyu Zhao
Abstract:
The development of control policies for multi-robot systems traditionally follows a complex and labor-intensive process, often lacking the flexibility to adapt to dynamic tasks. This has motivated research on methods to automatically create control policies. However, these methods require iterative processes of manually crafting and refining objective functions, thereby prolonging the development…
▽ More
The development of control policies for multi-robot systems traditionally follows a complex and labor-intensive process, often lacking the flexibility to adapt to dynamic tasks. This has motivated research on methods to automatically create control policies. However, these methods require iterative processes of manually crafting and refining objective functions, thereby prolonging the development cycle. This work introduces \textit{GenSwarm}, an end-to-end system that leverages large language models to automatically generate and deploy control policies for multi-robot tasks based on simple user instructions in natural language. As a multi-language-agent system, GenSwarm achieves zero-shot learning, enabling rapid adaptation to altered or unseen tasks. The white-box nature of the code policies ensures strong reproducibility and interpretability. With its scalable software and hardware architectures, GenSwarm supports efficient policy deployment on both simulated and real-world multi-robot systems, realizing an instruction-to-execution end-to-end functionality that could prove valuable for robotics specialists and non-specialists alike.The code of the proposed GenSwarm system is available online: https://github.com/WindyLab/GenSwarm.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision
Authors:
Rulin Zhou,
Wenlong He,
An Wang,
Qiqi Yao,
Haijun Hu,
Jiankun Wang,
Xi Zhang an Hongliang Ren
Abstract:
Accurate tissue point tracking in endoscopic videos is critical for robotic-assisted surgical navigation and scene understanding, but remains challenging due to complex deformations, instrument occlusion, and the scarcity of dense trajectory annotations. Existing methods struggle with long-term tracking under these conditions due to limited feature utilization and annotation dependence. We present…
▽ More
Accurate tissue point tracking in endoscopic videos is critical for robotic-assisted surgical navigation and scene understanding, but remains challenging due to complex deformations, instrument occlusion, and the scarcity of dense trajectory annotations. Existing methods struggle with long-term tracking under these conditions due to limited feature utilization and annotation dependence. We present Endo-TTAP, a novel framework addressing these challenges through: (1) A Multi-Facet Guided Attention (MFGA) module that synergizes multi-scale flow dynamics, DINOv2 semantic embeddings, and explicit motion patterns to jointly predict point positions with uncertainty and occlusion awareness; (2) A two-stage curriculum learning strategy employing an Auxiliary Curriculum Adapter (ACA) for progressive initialization and hybrid supervision. Stage I utilizes synthetic data with optical flow ground truth for uncertainty-occlusion regularization, while Stage II combines unsupervised flow consistency and semi-supervised learning with refined pseudo-labels from off-the-shelf trackers. Extensive validation on two MICCAI Challenge datasets and our collected dataset demonstrates that Endo-TTAP achieves state-of-the-art performance in tissue point tracking, particularly in scenarios characterized by complex endoscopic conditions. The source code and dataset will be available at https://anonymous.4open.science/r/Endo-TTAP-36E5.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Optimal Non-Oblivious Open Addressing
Authors:
Michael A. Bender,
William Kuszmaul,
Renfei Zhou
Abstract:
A hash table is said to be open-addressed (or non-obliviously open-addressed) if it stores elements (and free slots) in an array with no additional metadata. Intuitively, open-addressed hash tables must incur a space-time tradeoff: The higher the load factor at which the hash table operates, the longer insertions/deletions/queries should take.
In this paper, we show that no such tradeoff exists:…
▽ More
A hash table is said to be open-addressed (or non-obliviously open-addressed) if it stores elements (and free slots) in an array with no additional metadata. Intuitively, open-addressed hash tables must incur a space-time tradeoff: The higher the load factor at which the hash table operates, the longer insertions/deletions/queries should take.
In this paper, we show that no such tradeoff exists: It is possible to construct an open-addressed hash table that supports constant-time operations even when the hash table is entirely full. In fact, it is even possible to construct a version of this data structure that: (1) is dynamically resized so that the number of slots in memory that it uses, at any given moment, is the same as the number of elements it contains; (2) supports $O(1)$-time operations, not just in expectation, but with high probability; and (3) requires external access to just $O(1)$ hash functions that are each just $O(1)$-wise independent.
Our results complement a recent lower bound by Bender, Kuszmaul, and Zhou showing that oblivious open-addressed hash tables must incur $Ω(\log \log \varepsilon^{-1})$-time operations. The hash tables in this paper are non-oblivious, which is why they are able to bypass the previous lower bound.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
Shape-Kit: A Design Toolkit for Crafting On-Body Expressive Haptics
Authors:
Ran Zhou,
Jianru Ding,
Chenfeng Gao,
Wanli Qian,
Benjamin Erickson,
Madeline Balaam,
Daniel Leithinger,
Ken Nakagaki
Abstract:
Driven by the vision of everyday haptics, the HCI community is advocating for "design touch first" and investigating "how to touch well." However, a gap remains between the exploratory nature of haptic design and technical reproducibility. We present Shape-Kit, a hybrid design toolkit embodying our "crafting haptics" metaphor, where hand touch is transduced into dynamic pin-based sensations that c…
▽ More
Driven by the vision of everyday haptics, the HCI community is advocating for "design touch first" and investigating "how to touch well." However, a gap remains between the exploratory nature of haptic design and technical reproducibility. We present Shape-Kit, a hybrid design toolkit embodying our "crafting haptics" metaphor, where hand touch is transduced into dynamic pin-based sensations that can be freely explored across the body. An ad-hoc tracking module captures and digitizes these patterns. Our study with 14 designers and artists demonstrates how Shape-Kit facilitates sensorial exploration for expressive haptic design. We analyze how designers collaboratively ideate, prototype, iterate, and compose touch experiences and show the subtlety and richness of touch that can be achieved through diverse crafting methods with Shape-Kit. Reflecting on the findings, our work contributes key insights into haptic toolkit design and touch design practices centered on the "crafting haptics" metaphor. We discuss in-depth how Shape-Kit's simplicity, though remaining constrained, enables focused crafting for deeper exploration, while its collaborative nature fosters shared sense-making of touch experiences.
△ Less
Submitted 30 March, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
Authors:
Wanhua Li,
Renping Zhou,
Jiawei Zhou,
Yingwei Song,
Johannes Herter,
Minghan Qin,
Gao Huang,
Hanspeter Pfister
Abstract:
Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture t…
▽ More
Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
△ Less
Submitted 31 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback
Authors:
Runlong Zhou,
Maryam Fazel,
Simon S. Du
Abstract:
Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This assumption fails to capture the non-transitive nature of populational human preferences. Nash learning from human feedback (NLHF), targeting non-transitive preferences…
▽ More
Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This assumption fails to capture the non-transitive nature of populational human preferences. Nash learning from human feedback (NLHF), targeting non-transitive preferences, is a problem of computing the Nash equilibrium (NE) of the two-player constant-sum game defined by the human preference. We introduce Extragradient preference optimization (EGPO), a novel algorithm for NLHF achieving last-iterate linear convergence to the NE of KL-regularized games and polynomial convergence to the NE of original games, while being robust to noise. Unlike previous approaches that rely on nested optimization, we derive an equivalent implementation using gradients of an online variant of the identity preference optimization (IPO) loss, enabling more faithful implementation for neural networks. Our empirical evaluations demonstrate EGPO's superior performance over baseline methods when training for the same number of epochs, as measured by pairwise win-rates using the ground truth preference. These results validate both the theoretical strengths and practical advantages of EGPO for language model alignment with non-transitive human preferences.
△ Less
Submitted 8 June, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Cost-Aware Optimal Pairwise Pure Exploration
Authors:
Di Wu,
Chengshuai Shi,
Ruida Zhou,
Cong Shen
Abstract:
Pure exploration is one of the fundamental problems in multi-armed bandits (MAB). However, existing works mostly focus on specific pure exploration tasks, without a holistic view of the general pure exploration problem. This work fills this gap by introducing a versatile framework to study pure exploration, with a focus on identifying the pairwise relationships between targeted arm pairs. Moreover…
▽ More
Pure exploration is one of the fundamental problems in multi-armed bandits (MAB). However, existing works mostly focus on specific pure exploration tasks, without a holistic view of the general pure exploration problem. This work fills this gap by introducing a versatile framework to study pure exploration, with a focus on identifying the pairwise relationships between targeted arm pairs. Moreover, unlike existing works that only optimize the stopping time (i.e., sample complexity), this work considers that arms are associated with potentially different costs and targets at optimizing the cumulative cost that occurred during learning. Under the general framework of pairwise pure exploration with arm-specific costs, a performance lower bound is derived. Then, a novel algorithm, termed CAET (Cost-Aware Pairwise Exploration Task), is proposed. CAET builds on the track-and-stop principle with a novel design to handle the arm-specific costs, which can potentially be zero and thus represent a very challenging case. Theoretical analyses prove that the performance of CAET approaches the lower bound asymptotically. Special cases are further discussed, including an extension to regret minimization, which is another major focus of MAB. The effectiveness and efficiency of CAET are also verified through experimental results under various settings.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning
Authors:
Guiyao Tie,
Zeli Zhao,
Dingjie Song,
Fuyang Wei,
Rong Zhou,
Yurou Dai,
Wen Yin,
Zhejian Yang,
Jiangyue Yan,
Yao Su,
Zhenhan Dai,
Yifeng Xie,
Yihan Cao,
Lichao Sun,
Pan Zhou,
Lifang He,
Hechang Chen,
Yu Zhang,
Qingsong Wen,
Tianming Liu,
Neil Zhenqiang Gong,
Jiliang Tang,
Caiming Xiong,
Heng Ji,
Philip S. Yu
, et al. (1 additional authors not shown)
Abstract:
The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific per…
▽ More
The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.
△ Less
Submitted 20 May, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
LiteChain: A Lightweight Blockchain for Verifiable and Scalable Federated Learning in Massive Edge Networks
Authors:
Handi Chen,
Rui Zhou,
Yun-Hin Chan,
Zhihan Jiang,
Xianhao Chen,
Edith C. H. Ngai
Abstract:
Leveraging blockchain in Federated Learning (FL) emerges as a new paradigm for secure collaborative learning on Massive Edge Networks (MENs). As the scale of MENs increases, it becomes more difficult to implement and manage a blockchain among edge devices due to complex communication topologies, heterogeneous computation capabilities, and limited storage capacities. Moreover, the lack of a standar…
▽ More
Leveraging blockchain in Federated Learning (FL) emerges as a new paradigm for secure collaborative learning on Massive Edge Networks (MENs). As the scale of MENs increases, it becomes more difficult to implement and manage a blockchain among edge devices due to complex communication topologies, heterogeneous computation capabilities, and limited storage capacities. Moreover, the lack of a standard metric for blockchain security becomes a significant issue. To address these challenges, we propose a lightweight blockchain for verifiable and scalable FL, namely LiteChain, to provide efficient and secure services in MENs. Specifically, we develop a distributed clustering algorithm to reorganize MENs into a two-level structure to improve communication and computing efficiency under security requirements. Moreover, we introduce a Comprehensive Byzantine Fault Tolerance (CBFT) consensus mechanism and a secure update mechanism to ensure the security of model transactions through LiteChain. Our experiments based on Hyperledger Fabric demonstrate that LiteChain presents the lowest end-to-end latency and on-chain storage overheads across various network scales, outperforming the other two benchmarks. In addition, LiteChain exhibits a high level of robustness against replay and data poisoning attacks.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English
Authors:
Runtao Zhou,
Guangya Wan,
Saadia Gabriel,
Sheng Li,
Alexander J Gates,
Maarten Sap,
Thomas Hartvigsen
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop…
▽ More
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
Authors:
Shiqi Chen,
Tongyao Zhu,
Ruochen Zhou,
Jinghan Zhang,
Siyang Gao,
Juan Carlos Niebles,
Mor Geva,
Junxian He,
Jiajun Wu,
Manling Li
Abstract:
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge from the lens of mechanistic interpretability, diving into the model's internal st…
▽ More
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge from the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing attention distribution over the image through out intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, particularly differing between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS based on inference-time confidence scores to sharpen the attention on highly relevant regions when confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.
△ Less
Submitted 4 March, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR
Authors:
Nian Shao,
Rui Zhou,
Pengyu Wang,
Xian Li,
Ying Fang,
Yujie Yang,
Xiaofei Li
Abstract:
In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to speech wavefo…
▽ More
In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to speech waveform with a neural vocoder or directly used for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that Mel-frequency presents speech in a more compact way and thus is easier to learn, which will benefit both speech quality and ASR. Experimental results on four English and one Chinese datasets demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online in https://audio.westlake.edu.cn/Research/CleanMel.html.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.