-
BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects
Authors:
Hongyang Li,
Sanjoy Dey,
Bum Chul Kwon,
Michael Danziger,
Michal Rosen-Tzvi,
Jianying Hu,
James Kozloski,
Ching-Huei Tsou,
Bharath Dandala,
Pablo Meyer
Abstract:
Large language models (LLMs) trained on text demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA l…
▽ More
Large language models (LLMs) trained on text demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT, GENA-LM have achieved high level of performance on genome-related biological tasks, these models do not encode biological functions in the presence of sequence variations. To address this problem, we pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as they underlie important biological functions. Specifically, we use ModernBERT to pre-train two different Biomedical Foundation Models (BMFM), namely, BMFM-DNA-REF in which the model is trained with sequences of varying lengths along with their reverse complements derived from the reference genome and BMFM-DNA-SNP in which the model is trained with sequences created using a novel representation scheme that encodes sequence variations. Our findings indicate that integrating sequence variations into DNALMs helps capture the biological functions as seen in improvements on all fine-tuning tasks. To explore the model's practical utility, we experimented with various strategies for SNP imputation on promoter detection task introduced in DNABERT-2. However, we acknowledge that the current benchmarks are limited in their ability to fully evaluate these models. To enable more comprehensive assessment in the future and encourage community contributions, we release our models through HuggingFace and the code to reproduce the results at https://github.com/BiomedSciAI/biomed-multi-omic
△ Less
Submitted 26 June, 2025;
originally announced July 2025.
-
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Authors:
Yana Wei,
Liang Zhao,
Jianjian Sun,
Kangheng Lin,
Jisheng Yin,
Jingcheng Hu,
Yinmin Zhang,
En Yu,
Haoran Lv,
Zejia Weng,
Jia Wang,
Chunrui Han,
Yuang Peng,
Qi Han,
Zheng Ge,
Xiangyu Zhang,
Daxin Jiang,
Vishal M. Patel
Abstract:
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimoda…
▽ More
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Circular Holographic MIMO Beamforming for Integrated Data and Energy Multicast Systems
Authors:
Qingxiao Huang,
Yizhe Zhao,
Jie Hu,
Kun Yang,
Yuguang Fang
Abstract:
Thanks to the application of metamaterials, holographic multiple-input multiple-output (H-MIMO) is expected to achieve a higher spatial diversity gain with lower hardware complexity. With the aid of a circular antenna arrangement of H-MIMO, integrated data and energy multicast (IDEM) can fully exploit the near-field channel to realize wider range of energy focusing and higher achievable rate. In t…
▽ More
Thanks to the application of metamaterials, holographic multiple-input multiple-output (H-MIMO) is expected to achieve a higher spatial diversity gain with lower hardware complexity. With the aid of a circular antenna arrangement of H-MIMO, integrated data and energy multicast (IDEM) can fully exploit the near-field channel to realize wider range of energy focusing and higher achievable rate. In this paper, we derive the closed-form near-field resolution function in 3D space and show the asymptotic spatial orthogonality of near-field channel for circular antenna array. We then investigate the beamforming designs for IDEM systems, where the minimum rate of data users (DUs) are maximized while guaranteeing the energy harvesting requirements for energy users (EUs). Specifically, the asymptotically optimal fully-digital beamformer is first obtained based on the spatial orthogonality. Then, the alternating optimization is adopted for the H-MIMO beamforming, where the digital beamformer is obtained in closed form and the analog beamformers of three different control modes are then obtained, respectively. Scaling schemes are also investigated to further improve the IDEM performance. Numerical results verify the correctness of the resolution function and asymptotic orthogonality. Moreover, the proposed beamforming schemes with very low complexity outperform benchmark schemes.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
Authors:
Yecheng Wu,
Junyu Chen,
Zhuoyang Zhang,
Enze Xie,
Jincheng Yu,
Junsong Chen,
Jinyi Hu,
Yao Lu,
Song Han,
Han Cai
Abstract:
We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer fo…
▽ More
We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
NTSFormer: A Self-Teaching Graph Transformer for Multimodal Cold-Start Node Classification
Authors:
Jun Hu,
Yufei He,
Yuan Li,
Bryan Hooi,
Bingsheng He
Abstract:
Cold-start node classification on multimodal graphs is challenging because cold-start nodes are isolated (i.e., no edges) and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to MLPs for cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limit…
▽ More
Cold-start node classification on multimodal graphs is challenging because cold-start nodes are isolated (i.e., no edges) and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to MLPs for cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a "student" prediction based only on self-information (i.e., the node's own features), and a "teacher" prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer's capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by a Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experimental results on public datasets show that NTSFormer achieves superior performance on multimodal cold-start node classification tasks.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Music2Palette: Emotion-aligned Color Palette Generation via Cross-Modal Representation Learning
Authors:
Jiayun Hu,
Yueyi He,
Tianyi Liang,
Changbo Wang,
Chenhui Li
Abstract:
Emotion alignment between music and palettes is crucial for effective multimedia content, yet misalignment creates confusion that weakens the intended message. However, existing methods often generate only a single dominant color, missing emotion variation. Others rely on indirect mappings through text or images, resulting in the loss of crucial emotion details. To address these challenges, we pre…
▽ More
Emotion alignment between music and palettes is crucial for effective multimedia content, yet misalignment creates confusion that weakens the intended message. However, existing methods often generate only a single dominant color, missing emotion variation. Others rely on indirect mappings through text or images, resulting in the loss of crucial emotion details. To address these challenges, we present Music2Palette, a novel method for emotion-aligned color palette generation via cross-modal representation learning. We first construct MuCED, a dataset of 2,634 expert-validated music-palette pairs aligned through Russell-based emotion vectors. To directly translate music into palettes, we propose a cross-modal representation learning framework with a music encoder and color decoder. We further propose a multi-objective optimization approach that jointly enhances emotion alignment, color diversity, and palette coherence. Extensive experiments demonstrate that our method outperforms current methods in interpreting music emotion and generating attractive and diverse color palettes. Our approach enables applications like music-driven image recoloring, video generating, and data visualization, bridging the gap between auditory and visual emotion experiences.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Authors:
Yun Wang,
Longguang Wang,
Chenghao Zhang,
Yongjian Zhang,
Zhanjie Zhang,
Ao Ma,
Chenyou Fan,
Tin Lun Lam,
Junjie Hu
Abstract:
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching…
▽ More
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at \textcolor{red}{https://github.com/cocowy1/SMoE-Stereo}.
△ Less
Submitted 6 July, 2025;
originally announced July 2025.
-
Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing
Authors:
Jinwei Hu,
Yi Dong,
Zhengtao Ding,
Xiaowei Huang
Abstract:
This paper presents a defense framework for enhancing the safety of large language model (LLM) empowered multi-agent systems (MAS) in safety-critical domains such as aerospace. We apply randomized smoothing, a statistical robustness certification technique, to the MAS consensus context, enabling probabilistic guarantees on agent decisions under adversarial influence. Unlike traditional verificatio…
▽ More
This paper presents a defense framework for enhancing the safety of large language model (LLM) empowered multi-agent systems (MAS) in safety-critical domains such as aerospace. We apply randomized smoothing, a statistical robustness certification technique, to the MAS consensus context, enabling probabilistic guarantees on agent decisions under adversarial influence. Unlike traditional verification methods, our approach operates in black-box settings and employs a two-stage adaptive sampling mechanism to balance robustness and computational efficiency. Simulation results demonstrate that our method effectively prevents the propagation of adversarial behaviors and hallucinations while maintaining consensus performance. This work provides a practical and scalable path toward safe deployment of LLM-based MAS in real-world, high-stakes environments.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
Hierarchical Testing with Rabbit Optimization for Industrial Cyber-Physical Systems
Authors:
Jinwei Hu,
Zezhi Tang,
Xin Jin,
Benyuan Zhang,
Yi Dong,
Xiaowei Huang
Abstract:
This paper presents HERO (Hierarchical Testing with Rabbit Optimization), a novel black-box adversarial testing framework for evaluating the robustness of deep learning-based Prognostics and Health Management systems in Industrial Cyber-Physical Systems. Leveraging Artificial Rabbit Optimization, HERO generates physically constrained adversarial examples that align with real-world data distributio…
▽ More
This paper presents HERO (Hierarchical Testing with Rabbit Optimization), a novel black-box adversarial testing framework for evaluating the robustness of deep learning-based Prognostics and Health Management systems in Industrial Cyber-Physical Systems. Leveraging Artificial Rabbit Optimization, HERO generates physically constrained adversarial examples that align with real-world data distributions via global and local perspective. Its generalizability ensures applicability across diverse ICPS scenarios. This study specifically focuses on the Proton Exchange Membrane Fuel Cell system, chosen for its highly dynamic operational conditions, complex degradation mechanisms, and increasing integration into ICPS as a sustainable and efficient energy solution. Experimental results highlight HERO's ability to uncover vulnerabilities in even state-of-the-art PHM models, underscoring the critical need for enhanced robustness in real-world applications. By addressing these challenges, HERO demonstrates its potential to advance more resilient PHM systems across a wide range of ICPS domains.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic
Authors:
Jianwei Tang,
Hong Yang,
Tengyue Chen,
Jian-Fang Hu
Abstract:
Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard due to the varying transition speeds of different actions. Secondly, the action characteristic is difficult to…
▽ More
Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard due to the varying transition speeds of different actions. Secondly, the action characteristic is difficult to be learned because of the similarity of some actions. These issues cause the predicted results to be unreasonable and inconsistent. As a result, we propose two memory banks, the Soft-transition Action Bank (STAB) and Action Characteristic Bank (ACB), to tackle the problems above. The STAB stores the action transition information. It is equipped with the novel soft searching approach, which encourages the model to focus on multiple possible action categories of observed motions. The ACB records action characteristic, which produces more prior information for predicting certain actions. To fuse the features retrieved from the two banks better, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms the previous state-of-the-art. The demo and code are available at https://hyqlat.github.io/STABACB.github.io/.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
Temporal Continual Learning with Prior Compensation for Human Motion Prediction
Authors:
Jianwei Tang,
Jiangxin Sun,
Xiaotong Lin,
Lifang Zhang,
Wei-Shi Zheng,
Jian-Fang Hu
Abstract:
Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent pr…
▽ More
Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent predictions is limited. In this paper, we introduce a novel multi-stage training framework called Temporal Continual Learning (TCL) to address the above challenges. To better preserve prior information, we introduce the Prior Compensation Factor (PCF). We incorporate it into the model training to compensate for the lost prior information. Furthermore, we derive a more reasonable optimization objective through theoretical derivation. It is important to note that our TCL framework can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of TCL. The code is available at https://github.com/hyqlat/TCL.
△ Less
Submitted 5 July, 2025;
originally announced July 2025.
-
Beyond Charging Anxiety: An Explainable Approach to Understanding User Preferences of EV Charging Stations Using Review Data
Authors:
Zifei Wang,
Emmanuel Abolarin,
Kai Wu,
Venkatarao Rebba,
Jian Hu,
Zhen Hu,
Shan Bao,
Feng Zhou
Abstract:
Electric vehicles (EVs) charging infrastructure is directly related to the overall EV user experience and thus impacts the widespread adoption of EVs. Understanding key factors that affect EV users' charging experience is essential for building a robust and user-friendly EV charging infrastructure. This study leverages about $17,000$ charging station (CS) reviews on Google Maps to explore EV user…
▽ More
Electric vehicles (EVs) charging infrastructure is directly related to the overall EV user experience and thus impacts the widespread adoption of EVs. Understanding key factors that affect EV users' charging experience is essential for building a robust and user-friendly EV charging infrastructure. This study leverages about $17,000$ charging station (CS) reviews on Google Maps to explore EV user preferences for charging stations, employing ChatGPT 4.0 for aspect-based sentiment analysis. We identify twelve key aspects influencing user satisfaction, ranging from accessibility and reliability to amenities and pricing. Two distinct preference models are developed: a micro-level model focused on individual user satisfaction and a macro-level model capturing collective sentiment towards specific charging stations. Both models utilize the LightGBM algorithm for user preference prediction, achieving strong performance compared to other machine learning approaches. To further elucidate the impact of each aspect on user ratings, we employ SHAP (SHapley Additive exPlanations), a game-theoretic approach for interpreting machine learning models. Our findings highlight the significant impact of positive sentiment towards "amenities and location", coupled with negative sentiment regarding "reliability and maintenance", on overall user satisfaction. These insights offer actionable guidance to charging station operators, policymakers, and EV manufacturers, empowering them to enhance user experience and foster wider EV adoption.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
Dexterous Teleoperation of 20-DoF ByteDexter Hand via Human Motion Retargeting
Authors:
Ruoshi Wen,
Jiajun Zhang,
Guangzeng Chen,
Zhongren Cui,
Min Du,
Yang Gou,
Zhigang Han,
Junkai Hu,
Liqun Huang,
Hao Niu,
Wei Xu,
Haoxiang Zhang,
Zhengming Zhu,
Hang Li,
Zeyu Ren
Abstract:
Replicating human--level dexterity remains a fundamental robotics challenge, requiring integrated solutions from mechatronic design to the control of high degree--of--freedom (DoF) robotic hands. While imitation learning shows promise in transferring human dexterity to robots, the efficacy of trained policies relies on the quality of human demonstration data. We bridge this gap with a hand--arm te…
▽ More
Replicating human--level dexterity remains a fundamental robotics challenge, requiring integrated solutions from mechatronic design to the control of high degree--of--freedom (DoF) robotic hands. While imitation learning shows promise in transferring human dexterity to robots, the efficacy of trained policies relies on the quality of human demonstration data. We bridge this gap with a hand--arm teleoperation system featuring: (1) a 20--DoF linkage--driven anthropomorphic robotic hand for biomimetic dexterity, and (2) an optimization--based motion retargeting for real--time, high--fidelity reproduction of intricate human hand motions and seamless hand--arm coordination. We validate the system via extensive empirical evaluations, including dexterous in-hand manipulation tasks and a long--horizon task requiring the organization of a cluttered makeup table randomly populated with nine objects. Experimental results demonstrate its intuitive teleoperation interface with real--time control and the ability to generate high--quality demonstration data. Please refer to the accompanying video for further details.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
HelixDesign-Antibody: A Scalable Production-Grade Platform for Antibody Design Built on HelixFold3
Authors:
Jie Gao,
Jing Hu,
Shanzhuo Zhang,
Kunrui Zhu,
Sheng Qian,
Yueyang Huang,
Xiaonan Zhang,
Xiaomin Fang
Abstract:
Antibody engineering is essential for developing therapeutics and advancing biomedical research. Traditional discovery methods often rely on time-consuming and resource-intensive experimental screening. To enhance and streamline this process, we introduce a production-grade, high-throughput platform built on HelixFold3, HelixDesign-Antibody, which utilizes the high-accuracy structure prediction mo…
▽ More
Antibody engineering is essential for developing therapeutics and advancing biomedical research. Traditional discovery methods often rely on time-consuming and resource-intensive experimental screening. To enhance and streamline this process, we introduce a production-grade, high-throughput platform built on HelixFold3, HelixDesign-Antibody, which utilizes the high-accuracy structure prediction model, HelixFold3. The platform facilitates the large-scale generation of antibody candidate sequences and evaluates their interaction with antigens. Integrated high-performance computing (HPC) support enables high-throughput screening, addressing challenges such as fragmented toolchains and high computational demands. Validation on multiple antigens showcases the platform's ability to generate diverse and high-quality antibodies, confirming a scaling law where exploring larger sequence spaces increases the likelihood of identifying optimal binders. This platform provides a seamless, accessible solution for large-scale antibody design and is available via the antibody design page of PaddleHelix platform.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
CI-VID: A Coherent Interleaved Text-Video Dataset
Authors:
Yiming Ju,
Jijin Hu,
Zhengxiong Luo,
Haoge Deng,
hanyu Zhao,
Li Du,
Chengwei Wu,
Donglin Hao,
Xinlong Wang,
Tengfei Pan
Abstract:
Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VI…
▽ More
Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset We release the CI-VID dataset and the accompanying code for data construction and evaluation at: https://github.com/ymju-BAAI/CI-VID
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
Enhanced Influence-aware Group Recommendation for Online Media Propagation
Authors:
Chengkun He,
Xiangmin Zhou,
Chen Wang,
Longbing Cao,
Jie Shao,
Xiaodong Li,
Guang Xu,
Carrie Jinqiu Hu,
Zahir Tari
Abstract:
Group recommendation over social media streams has attracted significant attention due to its wide applications in domains such as e-commerce, entertainment, and online news broadcasting. By leveraging social connections and group behaviours, group recommendation (GR) aims to provide more accurate and engaging content to a set of users rather than individuals. Recently, influence-aware GR has emer…
▽ More
Group recommendation over social media streams has attracted significant attention due to its wide applications in domains such as e-commerce, entertainment, and online news broadcasting. By leveraging social connections and group behaviours, group recommendation (GR) aims to provide more accurate and engaging content to a set of users rather than individuals. Recently, influence-aware GR has emerged as a promising direction, as it considers the impact of social influence on group decision-making. In earlier work, we proposed Influence-aware Group Recommendation (IGR) to solve this task. However, this task remains challenging due to three key factors: the large and ever-growing scale of social graphs, the inherently dynamic nature of influence propagation within user groups, and the high computational overhead of real-time group-item matching.
To tackle these issues, we propose an Enhanced Influence-aware Group Recommendation (EIGR) framework. First, we introduce a Graph Extraction-based Sampling (GES) strategy to minimise redundancy across multiple temporal social graphs and effectively capture the evolving dynamics of both groups and items. Second, we design a novel DYnamic Independent Cascade (DYIC) model to predict how influence propagates over time across social items and user groups. Finally, we develop a two-level hash-based User Group Index (UG-Index) to efficiently organise user groups and enable real-time recommendation generation. Extensive experiments on real-world datasets demonstrate that our proposed framework, EIGR, consistently outperforms state-of-the-art baselines in both effectiveness and efficiency.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
LearnAFE: Circuit-Algorithm Co-design Framework for Learnable Audio Analog Front-End
Authors:
Jinhai Hu,
Zhongyi Zhang,
Cong Sheng Leow,
Wang Ling Goh,
Yuan Gao
Abstract:
This paper presents a circuit-algorithm co-design framework for learnable analog front-end (AFE) in audio signal classification. Designing AFE and backend classifiers separately is a common practice but non-ideal, as shown in this paper. Instead, this paper proposes a joint optimization of the backend classifier with the AFE's transfer function to achieve system-level optimum. More specifically, t…
▽ More
This paper presents a circuit-algorithm co-design framework for learnable analog front-end (AFE) in audio signal classification. Designing AFE and backend classifiers separately is a common practice but non-ideal, as shown in this paper. Instead, this paper proposes a joint optimization of the backend classifier with the AFE's transfer function to achieve system-level optimum. More specifically, the transfer function parameters of an analog bandpass filter (BPF) bank are tuned in a signal-to-noise ratio (SNR)-aware training loop for the classifier. Using a co-design loss function LBPF, this work shows superior optimization of both the filter bank and the classifier. Implemented in open-source SKY130 130nm CMOS process, the optimized design achieved 90.5%-94.2% accuracy for 10-keyword classification task across a wide range of input signal SNR from 5 dB to 20 dB, with only 22k classifier parameters. Compared to conventional approach, the proposed audio AFE achieves 8.7% and 12.9% reduction in power and capacitor area respectively.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment
Authors:
Kai Zhou,
Shuhai Zhang,
Zeng You,
Jinwu Hu,
Mingkui Tan,
Fei Liu
Abstract:
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and the…
▽ More
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models' generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
User Concerns Regarding Social Robots for Mood Regulation: A Case Study on the "Sunday Blues"
Authors:
Zhuochao Peng,
Jiaxin Xu,
Jun Hu,
Haian Xue,
Laurens A. G. Kolks,
Pieter M. A. Desmet
Abstract:
While recent research highlights the potential of social robots to support mood regulation, little is known about how prospective users view their integration into everyday life. To explore this, we conducted an exploratory case study that used a speculative robot concept "Mora" to provoke reflection and facilitate meaningful discussion about using social robots to manage subtle, day-to-day emotio…
▽ More
While recent research highlights the potential of social robots to support mood regulation, little is known about how prospective users view their integration into everyday life. To explore this, we conducted an exploratory case study that used a speculative robot concept "Mora" to provoke reflection and facilitate meaningful discussion about using social robots to manage subtle, day-to-day emotional experiences. We focused on the "Sunday Blues," a common dip in mood that occurs at the end of the weekend, as a relatable context in which to explore individuals' insights. Using a video prototype and a co-constructing stories method, we engaged 15 participants in imagining interactions with Mora and discussing their expectations, doubts, and concerns. The study surfaced a range of nuanced reflections around the attributes of social robots like empathy, intervention effectiveness, and ethical boundaries, which we translated into design considerations for future research and development in human-robot interaction.
△ Less
Submitted 30 June, 2025;
originally announced July 2025.
-
QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration
Authors:
Junze Hu,
Xiangyu Jin,
Yizhe Zeng,
Yuling Liu,
Yunpeng Li,
Dan Du,
Kaiyu Xie,
Hongsong Zhu
Abstract:
We introduce QLPro, a vulnerability detection framework that systematically integrates LLMs and static analysis tools to enable comprehensive vulnerability detection across entire open-source projects.We constructed a new dataset, JavaTest, comprising 10 open-source projects from GitHub with 62 confirmed vulnerabilities. CodeQL, a state-of-the-art static analysis tool, detected only 24 of these vu…
▽ More
We introduce QLPro, a vulnerability detection framework that systematically integrates LLMs and static analysis tools to enable comprehensive vulnerability detection across entire open-source projects.We constructed a new dataset, JavaTest, comprising 10 open-source projects from GitHub with 62 confirmed vulnerabilities. CodeQL, a state-of-the-art static analysis tool, detected only 24 of these vulnerabilities while QLPro detected 41. Furthermore, QLPro discovered 6 previously unknown vulnerabilities, 2 of which have been confirmed as 0-days.
△ Less
Submitted 30 June, 2025;
originally announced June 2025.
-
Learning Efficient Robotic Garment Manipulation with Standardization
Authors:
Changshi Zhou,
Feng Luan,
Jiarui Hu,
Shaoqiang Meng,
Zhipeng Wang,
Yanchao Dong,
Yanmin Zhou,
Bin He
Abstract:
Garment manipulation is a significant challenge for robots due to the complex dynamics and potential self-occlusion of garments. Most existing methods of efficient garment unfolding overlook the crucial role of standardization of flattened garments, which could significantly simplify downstream tasks like folding, ironing, and packing. This paper presents APS-Net, a novel approach to garment manip…
▽ More
Garment manipulation is a significant challenge for robots due to the complex dynamics and potential self-occlusion of garments. Most existing methods of efficient garment unfolding overlook the crucial role of standardization of flattened garments, which could significantly simplify downstream tasks like folding, ironing, and packing. This paper presents APS-Net, a novel approach to garment manipulation that combines unfolding and standardization in a unified framework. APS-Net employs a dual-arm, multi-primitive policy with dynamic fling to quickly unfold crumpled garments and pick-and-place (p and p) for precise alignment. The purpose of garment standardization during unfolding involves not only maximizing surface coverage but also aligning the garment's shape and orientation to predefined requirements. To guide effective robot learning, we introduce a novel factorized reward function for standardization, which incorporates garment coverage (Cov), keypoint distance (KD), and intersection-over-union (IoU) metrics. Additionally, we introduce a spatial action mask and an Action Optimized Module to improve unfolding efficiency by selecting actions and operation points effectively. In simulation, APS-Net outperforms state-of-the-art methods for long sleeves, achieving 3.9 percent better coverage, 5.2 percent higher IoU, and a 0.14 decrease in KD (7.09 percent relative reduction). Real-world folding tasks further demonstrate that standardization simplifies the folding process. Project page: see https://hellohaia.github.io/APS/
△ Less
Submitted 28 June, 2025;
originally announced June 2025.
-
Uncovering smooth structures in single-cell data with PCS-guided neighbor embeddings
Authors:
Rong Ma,
Xi Li,
Jingyuan Hu,
Bin Yu
Abstract:
Single-cell sequencing is revolutionizing biology by enabling detailed investigations of cell-state transitions. Many biological processes unfold along continuous trajectories, yet it remains challenging to extract smooth, low-dimensional representations from inherently noisy, high-dimensional single-cell data. Neighbor embedding (NE) algorithms, such as t-SNE and UMAP, are widely used to embed hi…
▽ More
Single-cell sequencing is revolutionizing biology by enabling detailed investigations of cell-state transitions. Many biological processes unfold along continuous trajectories, yet it remains challenging to extract smooth, low-dimensional representations from inherently noisy, high-dimensional single-cell data. Neighbor embedding (NE) algorithms, such as t-SNE and UMAP, are widely used to embed high-dimensional single-cell data into low dimensions. But they often introduce undesirable distortions, resulting in misleading interpretations. Existing evaluation methods for NE algorithms primarily focus on separating discrete cell types rather than capturing continuous cell-state transitions, while dynamic modeling approaches rely on strong assumptions about cellular processes and specialized data. To address these challenges, we build on the Predictability-Computability-Stability (PCS) framework for reliable and reproducible data-driven discoveries. First, we systematically evaluate popular NE algorithms through empirical analysis, simulation, and theory, and reveal their key shortcomings, such as artifacts and instability. We then introduce NESS, a principled and interpretable machine learning approach to improve NE representations by leveraging algorithmic stability and to enable robust inference of smooth biological structures. NESS offers useful concepts, quantitative stability metrics, and efficient computational workflows to uncover developmental trajectories and cell-state transitions in single-cell data. Finally, we apply NESS to six single-cell datasets, spanning pluripotent stem cell differentiation, organoid development, and multiple tissue-specific lineage trajectories. Across these diverse contexts, NESS consistently yields useful biological insights, such as identification of transitional and stable cell states and quantification of transcriptional dynamics during development.
△ Less
Submitted 27 June, 2025;
originally announced June 2025.
-
GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
Authors:
Tianhao Chen,
Xin Xu,
Zijing Liu,
Pengxiang Li,
Xinyuan Song,
Ajay Kumar Jaiswal,
Fan Zhang,
Jishan Hu,
Yang Wang,
Hao Chen,
Shizhe Diao,
Shiwei Liu,
Yu Li,
Lu Yin,
Can Yang
Abstract:
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the…
▽ More
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
△ Less
Submitted 3 July, 2025; v1 submitted 27 June, 2025;
originally announced June 2025.
-
Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Authors:
Xin Xu,
Tianhao Chen,
Fan Zhang,
Wanlong Liu,
Pengxiang Li,
Ajay Kumar Jaiswal,
Yuchen Yan,
Jishan Hu,
Yang Wang,
Hao Chen,
Shiwei Liu,
Shizhe Diao,
Can Yang,
Lu Yin
Abstract:
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment:, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterat…
▽ More
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment:, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection
Authors:
Jiawei Hu
Abstract:
Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness u…
▽ More
Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. To address these challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that integrates fine-grained detail enhancement and attention-guided feature fusion. Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation within the YOLOv5 backbone. To better capture fine structural changes, we introduce a Detail-directional Fusion Module (DDFM) that introduces a directional asymmetric convolution in the second-lowest layer to enrich spatial details and fuses the second-lowest layer with low-level features to enhance semantic consistency. Furthermore, we propose two novel attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF) to improve contextual representation and reduce feature noise. Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms existing state-of-the-art methods in terms of both accuracy and robustness across diverse types of defects and scales.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
Authors:
Sonia K. Murthy,
Rosie Zhao,
Jennifer Hu,
Sham Kakade,
Markus Wulfmeier,
Peng Qian,
Tomer Ullman
Abstract:
Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth, maintaining trust, all while still being mindful of another person's feelings. These value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cogniti…
▽ More
Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth, maintaining trust, all while still being mindful of another person's feelings. These value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called "cognitive models" provide formal accounts of these trade-offs in humans, by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning "effort" in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs' training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.
△ Less
Submitted 6 July, 2025; v1 submitted 25 June, 2025;
originally announced June 2025.
-
Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models
Authors:
Kejia Chen,
Jiawen Zhang,
Jiacong Hu,
Yu Wang,
Jian Lou,
Zunlei Feng,
Mingli Song
Abstract:
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strate…
▽ More
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experimental results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page is available at: https://github.com/Thecommonirin/Qresafe.
△ Less
Submitted 25 June, 2025;
originally announced June 2025.
-
Distillation-Enabled Knowledge Alignment for Generative Semantic Communications in AIGC Provisioning Tasks
Authors:
Jingzhi Hu,
Geoffrey Ye Li
Abstract:
Due to the surging amount of AI-generated content (AIGC), its provisioning to edges and mobile users from the cloud incurs substantial traffic on networks. Generative semantic communication (GSC) offers a promising solution by transmitting highly compact information, i.e., prompt text and latent representations, instead of high-dimensional AIGC data. However, GSC relies on the alignment between th…
▽ More
Due to the surging amount of AI-generated content (AIGC), its provisioning to edges and mobile users from the cloud incurs substantial traffic on networks. Generative semantic communication (GSC) offers a promising solution by transmitting highly compact information, i.e., prompt text and latent representations, instead of high-dimensional AIGC data. However, GSC relies on the alignment between the knowledge in the cloud generative AI (GAI) and that possessed by the edges and users, and between the knowledge for wireless transmission and that of actual channels, which remains challenging. In this paper, we propose DeKA-g, a distillation-enabled knowledge alignment algorithm for GSC systems. The core idea is to distill the generation knowledge from the cloud-GAI into low-rank matrices, which can be incorporated by the edge and used to adapt the transmission knowledge to diverse wireless channel conditions. DeKA-g comprises two novel methods: metaword-aided knowledge distillation (MAKD) and variable-rate grouped SNR adaptation (VGSA). For MAKD, an optimized metaword is employed to enhance the efficiency of knowledge distillation, while VGSA enables efficient adaptation to diverse compression rates and SNR ranges. From simulation results, DeKA-g improves the alignment between the edge-generated images and the cloud-generated ones by 44%. Moreover, it adapts to compression rates with 116% higher efficiency than the baseline and enhances the performance in low-SNR conditions by 28%.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
Authors:
Xingyang Li,
Muyang Li,
Tianle Cai,
Haocheng Xi,
Shuo Yang,
Yujun Lin,
Lvmin Zhang,
Songlin Yang,
Jinbo Hu,
Kelly Peng,
Maneesh Agrawala,
Ion Stoica,
Kurt Keutzer,
Song Han
Abstract:
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal d…
▽ More
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning
Authors:
Ai Han,
Junxing Hu,
Pu Wei,
Zhiqian Zhang,
Yuhang Guo,
Jiawei Lu,
Zicheng Zhang
Abstract:
Multi-agent reinforcement learning (MARL) has emerged as a prominent paradigm for increasingly complex tasks. However, joint evolution across heterogeneous agents remains challenging due to cooperative inefficiency and training instability. In this paper, we propose the joint evolution dynamics for MARL called JoyAgents-R1, which first applies Group Relative Policy Optimization (GRPO) to the joint…
▽ More
Multi-agent reinforcement learning (MARL) has emerged as a prominent paradigm for increasingly complex tasks. However, joint evolution across heterogeneous agents remains challenging due to cooperative inefficiency and training instability. In this paper, we propose the joint evolution dynamics for MARL called JoyAgents-R1, which first applies Group Relative Policy Optimization (GRPO) to the joint training of heterogeneous multi-agents. By iteratively refining agents' large language models (LLMs) and memories, the method achieves holistic equilibrium with optimal decision-making and memory capabilities. Specifically, JoyAgents-R1 first implements node-wise Monte Carlo sampling on the behavior of each agent across entire reasoning trajectories to enhance GRPO sampling efficiency while maintaining policy diversity. Then, our marginal benefit-driven selection strategy identifies top-$K$ sampling groups with maximal reward fluctuations, enabling targeted agent model updates that improve training stability and maximize joint benefits through cost-effective parameter adjustments. Meanwhile, JoyAgents-R1 introduces an adaptive memory evolution mechanism that repurposes GRPO rewards as cost-free supervisory signals to eliminate repetitive reasoning and accelerate convergence. Experiments across general and domain-specific scenarios demonstrate that JoyAgents-R1 achieves performance comparable to that of larger LLMs while built on smaller open-source models.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
CAM-NET: An AI Model for Whole Atmosphere with Thermosphere and Ionosphere Extension
Authors:
Jiahui Hu,
Wenjun Dong
Abstract:
We present Compressible Atmospheric Model-Network (CAM-NET), an AI model designed to predict neutral atmospheric variables from the Earth's surface to the ionosphere with high accuracy and computational efficiency. Accurate modeling of the entire atmosphere is critical for understanding the upward propagation of gravity waves, which influence upper-atmospheric dynamics and coupling across atmosphe…
▽ More
We present Compressible Atmospheric Model-Network (CAM-NET), an AI model designed to predict neutral atmospheric variables from the Earth's surface to the ionosphere with high accuracy and computational efficiency. Accurate modeling of the entire atmosphere is critical for understanding the upward propagation of gravity waves, which influence upper-atmospheric dynamics and coupling across atmospheric layers. CAM-NET leverages the Spherical Fourier Neural Operator (SFNO) to capture global-scale atmospheric dynamics while preserving the Earth's spherical structure. Trained on a decade of datasets from the Whole Atmosphere Community Climate Model with thermosphere and ionosphere eXtension (WACCM-X), CAM-NET demonstrates accuracy comparable to WACCM-X while achieving a speedup of over 1000x in inference time, can provide one year simulation within a few minutes once trained. The model effectively predicts key atmospheric parameters, including zonal and meridional winds, temperature, and time rate of pressure. Inspired by traditional modeling approaches that use external couplers to simulate tracer transport, CAM-NET introduces a modular architecture that explicitly separates tracer prediction from core dynamics. The core backbone of CAM-NET focuses on forecasting primary physical variables (e.g., temperature, wind velocity), while tracer variables are predicted through a lightweight, fine-tuned model. This design allows for efficient adaptation to specific tracer scenarios with minimal computational cost, avoiding the need to retrain the entire model. We have validated this approach on the $O^2$ tracer, demonstrating strong performance and generalization capabilities.
△ Less
Submitted 1 July, 2025; v1 submitted 24 June, 2025;
originally announced June 2025.
-
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs
Authors:
Shuang Ao,
Yi Dong,
Jinwei Hu,
Sarvapali Ramchurn
Abstract:
Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this i…
▽ More
Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this issue, we propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment, improving safety while preserving performance. At its core, we introduce Empirical-DIEM (E-DIEM), a dimension-insensitive similarity metric that effectively detects safety misalignment in LoRA-adapted models. We conduct extensive experiments on LLMs fine-tuned with mixed of benign and malicious data, and purely benign datasets, evaluating SPLoRA across utility, safety, and reliability metrics. Results demonstrate that SPLoRA outperforms state-of-the-art safety alignment techniques, significantly reducing safety risks while maintaining or improving model performance and reliability. Additionally, SPLoRA reduces inference overhead, making it a scalable and efficient solution for deploying safer and more reliable LLMs. The code is available at https://github.com/AoShuang92/SPLoRA.
△ Less
Submitted 21 June, 2025;
originally announced June 2025.
-
Auto-Regressively Generating Multi-View Consistent Images
Authors:
JiaKui Hu,
Yuxiao Yang,
Jialun Liu,
Jinbo Wu,
Chen Zhao,
Yanye Lu
Abstract:
Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-vi…
▽ More
Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architecture designing and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model through the randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, thus significantly expanding the training data by several magnitudes. Experiments demonstrate the performance and versatility of our MV-AR, which consistently generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at https://github.com/MILab-PKU/MVAR.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
Enhancing Image Restoration Transformer via Adaptive Translation Equivariance
Authors:
JiaKui Hu,
Zhengjian Yao,
Lujia Jin,
Hangzhou He,
Yanye Lu
Abstract:
Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing…
▽ More
Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. Slide indexing maintains operator responses at fixed positions, with sliding window attention being a notable example, while component stacking enables the arrangement of translation-equivariant operators in parallel or sequentially, thereby building complex architectures while preserving translation equivariance. However, these strategies still create a dilemma in model design between the high computational cost of self-attention and the fixed receptive field associated with sliding window attention. To address this, we develop an adaptive sliding indexing mechanism to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs. The designed network, called the Translation Equivariance Adaptive Transformer (TEAFormer), is assessed across a variety of image restoration tasks. The results highlight its superiority in terms of effectiveness, training convergence, and generalization.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding
Authors:
Yaokun Zhong,
Siyu Jiang,
Jian Zhu,
Jian-Fang Hu
Abstract:
Semi-Supervised Video Paragraph Grounding (SSVPG) aims to localize multiple sentences in a paragraph from an untrimmed video with limited temporal annotations. Existing methods focus on teacher-student consistency learning and video-level contrastive loss, but they overlook the importance of perturbing query contexts to generate strong supervisory signals. In this work, we propose a novel Context…
▽ More
Semi-Supervised Video Paragraph Grounding (SSVPG) aims to localize multiple sentences in a paragraph from an untrimmed video with limited temporal annotations. Existing methods focus on teacher-student consistency learning and video-level contrastive loss, but they overlook the importance of perturbing query contexts to generate strong supervisory signals. In this work, we propose a novel Context Consistency Learning (CCL) framework that unifies the paradigms of consistency regularization and pseudo-labeling to enhance semi-supervised learning. Specifically, we first conduct teacher-student learning where the student model takes as inputs strongly-augmented samples with sentences removed and is enforced to learn from the adequately strong supervisory signals from the teacher model. Afterward, we conduct model retraining based on the generated pseudo labels, where the mutual agreement between the original and augmented views' predictions is utilized as the label confidence. Extensive experiments show that CCL outperforms existing methods by a large margin.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
TAB: Unified Benchmarking of Time Series Anomaly Detection Methods
Authors:
Xiangfei Qiu,
Zhe Li,
Wanghui Qiu,
Shiyan Hu,
Lekui Zhou,
Xingjian Wu,
Zhengyu Li,
Chenjuan Guo,
Aoying Zhou,
Zhenli Sheng,
Jilin Hu,
Christian S. Jensen,
Bin Yang
Abstract:
Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare. With the ongoing instrumentation of reality, more time series data will be available, leading also to growing demands for TSAD. While many TSAD methods already exist, new and better methods are still desirable. However, effective progress hinges on the availability of relia…
▽ More
Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare. With the ongoing instrumentation of reality, more time series data will be available, leading also to growing demands for TSAD. While many TSAD methods already exist, new and better methods are still desirable. However, effective progress hinges on the availability of reliable means of evaluating new methods and comparing them with existing methods. We address deficiencies in current evaluation procedures related to datasets and experimental settings and protocols. Specifically, we propose a new time series anomaly detection benchmark, called TAB. First, TAB encompasses 29 public multivariate datasets and 1,635 univariate time series from different domains to facilitate more comprehensive evaluations on diverse datasets. Second, TAB covers a variety of TSAD methods, including Non-learning, Machine learning, Deep learning, LLM-based, and Time-series pre-trained methods. Third, TAB features a unified and automated evaluation pipeline that enables fair and easy evaluation of TSAD methods. Finally, we employ TAB to evaluate existing TSAD methods and report on the outcomes, thereby offering a deeper insight into the performance of these methods. Besides, all datasets and code are available at https://github.com/decisionintelligence/TAB.
△ Less
Submitted 22 June, 2025;
originally announced June 2025.
-
DRO-Augment Framework: Robustness by Synergizing Wasserstein Distributionally Robust Optimization and Data Augmentation
Authors:
Jiaming Hu,
Debarghya Mukherjee,
Ioannis Ch. Paschalidis
Abstract:
In many real-world applications, ensuring the robustness and stability of deep neural networks (DNNs) is crucial, particularly for image classification tasks that encounter various input perturbations. While data augmentation techniques have been widely adopted to enhance the resilience of a trained model against such perturbations, there remains significant room for improvement in robustness agai…
▽ More
In many real-world applications, ensuring the robustness and stability of deep neural networks (DNNs) is crucial, particularly for image classification tasks that encounter various input perturbations. While data augmentation techniques have been widely adopted to enhance the resilience of a trained model against such perturbations, there remains significant room for improvement in robustness against corrupted data and adversarial attacks simultaneously. To address this challenge, we introduce DRO-Augment, a novel framework that integrates Wasserstein Distributionally Robust Optimization (W-DRO) with various data augmentation strategies to improve the robustness of the models significantly across a broad spectrum of corruptions. Our method outperforms existing augmentation methods under severe data perturbations and adversarial attack scenarios while maintaining the accuracy on the clean datasets on a range of benchmark datasets, including but not limited to CIFAR-10-C, CIFAR-100-C, MNIST, and Fashion-MNIST. On the theoretical side, we establish novel generalization error bounds for neural networks trained using a computationally efficient, variation-regularized loss function closely related to the W-DRO problem.
△ Less
Submitted 24 June, 2025; v1 submitted 21 June, 2025;
originally announced June 2025.
-
RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
Authors:
Junbo Qiao,
Miaomiao Cai,
Wei Li,
Yutong Liu,
Xudong Huang,
Gaoqi He,
Jiao Xie,
Jie Hu,
Xinghao Chen,
Shaohui Lin
Abstract:
Real-World Image Super-Resolution is one of the most challenging task in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success o…
▽ More
Real-World Image Super-Resolution is one of the most challenging task in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
△ Less
Submitted 23 June, 2025; v1 submitted 20 June, 2025;
originally announced June 2025.
-
Deep Reinforcement Learning Xiangqi Player with Monte Carlo Tree Search
Authors:
Berk Yilmaz,
Junyu Hu,
Jinsong Liu
Abstract:
This paper presents a Deep Reinforcement Learning (DRL) system for Xiangqi (Chinese Chess) that integrates neural networks with Monte Carlo Tree Search (MCTS) to enable strategic self-play and self-improvement. Addressing the underexplored complexity of Xiangqi, including its unique board layout, piece movement constraints, and victory conditions, our approach combines policy-value networks with M…
▽ More
This paper presents a Deep Reinforcement Learning (DRL) system for Xiangqi (Chinese Chess) that integrates neural networks with Monte Carlo Tree Search (MCTS) to enable strategic self-play and self-improvement. Addressing the underexplored complexity of Xiangqi, including its unique board layout, piece movement constraints, and victory conditions, our approach combines policy-value networks with MCTS to simulate move consequences and refine decision-making. By overcoming challenges such as Xiangqi's high branching factor and asymmetrical piece dynamics, our work advances AI capabilities in culturally significant strategy games while providing insights for adapting DRL-MCTS frameworks to domain-specific rule systems.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
CooperRisk: A Driving Risk Quantification Pipeline with Multi-Agent Cooperative Perception and Prediction
Authors:
Mingyue Lei,
Zewei Zhou,
Hongchen Li,
Jia Hu,
Jiaqi Ma
Abstract:
Risk quantification is a critical component of safe autonomous driving, however, constrained by the limited perception range and occlusion of single-vehicle systems in complex and dense scenarios. Vehicle-to-everything (V2X) paradigm has been a promising solution to sharing complementary perception information, nevertheless, how to ensure the risk interpretability while understanding multi-agent i…
▽ More
Risk quantification is a critical component of safe autonomous driving, however, constrained by the limited perception range and occlusion of single-vehicle systems in complex and dense scenarios. Vehicle-to-everything (V2X) paradigm has been a promising solution to sharing complementary perception information, nevertheless, how to ensure the risk interpretability while understanding multi-agent interaction with V2X remains an open question. In this paper, we introduce the first V2X-enabled risk quantification pipeline, CooperRisk, to fuse perception information from multiple agents and quantify the scenario driving risk in future multiple timestamps. The risk is represented as a scenario risk map to ensure interpretability based on risk severity and exposure, and the multi-agent interaction is captured by the learning-based cooperative prediction model. We carefully design a risk-oriented transformer-based prediction model with multi-modality and multi-agent considerations. It aims to ensure scene-consistent future behaviors of multiple agents and avoid conflicting predictions that could lead to overly conservative risk quantification and cause the ego vehicle to become overly hesitant to drive. Then, the temporal risk maps could serve to guide a model predictive control planner. We evaluate the CooperRisk pipeline in a real-world V2X dataset V2XPnP, and the experiments demonstrate its superior performance in risk quantification, showing a 44.35% decrease in conflict rate between the ego vehicle and background traffic participants.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
NTIRE 2025 Image Shadow Removal Challenge Report
Authors:
Florin-Alexandru Vasluianu,
Tim Seizinger,
Zhuyun Zhou,
Cailian Chen,
Zongwei Wu,
Radu Timofte,
Mingjia Li,
Jin Hu,
Hainuo Wang,
Hengxing Liu,
Jiarui Wang,
Qiming Hu,
Xiaojie Guo,
Xin Lu,
Jiarong Yang,
Yuanfei Bao,
Anya Hu,
Zihao Fan,
Kunyu Wang,
Jie Xiao,
Xi Wang,
Xueyang Fu,
Zheng-Jun Zha,
Yu-Fan Lin,
Chia-Ming Lee
, et al. (57 additional authors not shown)
Abstract:
This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were e…
▽ More
This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning
Authors:
Chunlei Li,
Jingyang Hou,
Yilei Shi,
Jingliang Hu,
Xiao Xiang Zhu,
Lichao Mou
Abstract:
Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual…
▽ More
Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models
Authors:
Bharath Dandala,
Michael M. Danziger,
Ella Barkan,
Tanwi Biswas,
Viatcheslav Gurev,
Jianying Hu,
Matthew Madgwick,
Akira Koseki,
Tal Kozlovski,
Michal Rosen-Zvi,
Yishai Shimoni,
Ching-Huei Tsou
Abstract:
Transcriptomic foundation models (TFMs) have recently emerged as powerful tools for analyzing gene expression in cells and tissues, supporting key tasks such as cell-type annotation, batch correction, and perturbation prediction. However, the diversity of model implementations and training strategies across recent TFMs, though promising, makes it challenging to isolate the contribution of individu…
▽ More
Transcriptomic foundation models (TFMs) have recently emerged as powerful tools for analyzing gene expression in cells and tissues, supporting key tasks such as cell-type annotation, batch correction, and perturbation prediction. However, the diversity of model implementations and training strategies across recent TFMs, though promising, makes it challenging to isolate the contribution of individual design choices or evaluate their potential synergies. This hinders the field's ability to converge on best practices and limits the reproducibility of insights across studies. We present BMFM-RNA, an open-source, modular software package that unifies diverse TFM pretraining and fine-tuning objectives within a single framework. Leveraging this capability, we introduce a novel training objective, whole cell expression decoder (WCED), which captures global expression patterns using an autoencoder-like CLS bottleneck representation. In this paper, we describe the framework, supported input representations, and training objectives. We evaluated four model checkpoints pretrained on CELLxGENE using combinations of masked language modeling (MLM), WCED and multitask learning. Using the benchmarking capabilities of BMFM-RNA, we show that WCED-based models achieve performance that matches or exceeds state-of-the-art approaches like scGPT across more than a dozen datasets in both zero-shot and fine-tuning tasks. BMFM-RNA, available as part of the biomed-multi-omics project ( https://github.com/BiomedSciAI/biomed-multi-omic ), offers a reproducible foundation for systematic benchmarking and community-driven exploration of optimal TFM training strategies, enabling the development of more effective tools to leverage the latest advances in AI for understanding cell biology.
△ Less
Submitted 17 June, 2025;
originally announced June 2025.
-
Gated Rotary-Enhanced Linear Attention for Long-term Sequential Recommendation
Authors:
Juntao Hu,
Wei Zhou,
Huayi Shen,
Xiao Du,
Jie Liao,
Junhao Wen,
Min Gao
Abstract:
In Sequential Recommendation Systems (SRSs), Transformer models show remarkable performance but face computation cost challenges when modeling long-term user behavior sequences due to the quadratic complexity of the dot-product attention mechanism. By approximating the dot-product attention, linear attention provides an efficient option with linear complexity. However, existing linear attention me…
▽ More
In Sequential Recommendation Systems (SRSs), Transformer models show remarkable performance but face computation cost challenges when modeling long-term user behavior sequences due to the quadratic complexity of the dot-product attention mechanism. By approximating the dot-product attention, linear attention provides an efficient option with linear complexity. However, existing linear attention methods face two limitations: 1) they often use learnable position encodings, which incur extra computational costs in long-term sequence scenarios, and 2) they may not consider the user's fine-grained local preferences and confuse these with the actual change of long-term interests. To remedy these drawbacks, we propose a long-term sequential Recommendation model with Gated Rotary Enhanced Linear Attention (RecGRELA). Specifically, we first propose a Rotary-Enhanced Linear Attention (RELA) module to model long-range dependency within the user's historical information using rotary position encodings. We then introduce a local short operation to incorporate local preferences and demonstrate the theoretical insight. We further introduce a SiLU-based Gated mechanism for RELA (GRELA) to help the model determine whether a user's behavior indicates local interest or a genuine shift in long-term preferences. Experimental results on four public datasets demonstrate that our RecGRELA achieves state-of-the-art performance compared to existing SRSs while maintaining low memory overhead.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
Combining Self-attention and Dilation Convolutional for Semantic Segmentation of Coal Maceral Groups
Authors:
Zhenghao Xi,
Zhengnan Lv,
Yang Zheng,
Xiang Liu,
Zhuang Yu,
Junran Chen,
Jing Hu,
Yaqi Liu
Abstract:
The segmentation of coal maceral groups can be described as a semantic segmentation process of coal maceral group images, which is of great significance for studying the chemical properties of coal. Generally, existing semantic segmentation models of coal maceral groups use the method of stacking parameters to achieve higher accuracy. It leads to increased computational requirements and impacts mo…
▽ More
The segmentation of coal maceral groups can be described as a semantic segmentation process of coal maceral group images, which is of great significance for studying the chemical properties of coal. Generally, existing semantic segmentation models of coal maceral groups use the method of stacking parameters to achieve higher accuracy. It leads to increased computational requirements and impacts model training efficiency. At the same time, due to the professionalism and diversity of coal maceral group images sampling, obtaining the number of samples for model training requires a long time and professional personnel operation. To address these issues, We have innovatively developed an IoT-based DA-VIT parallel network model. By utilizing this model, we can continuously broaden the dataset through IoT and achieving sustained improvement in the accuracy of coal maceral groups segmentation. Besides, we decouple the parallel network from the backbone network to ensure the normal using of the backbone network during model data updates. Secondly, DCSA mechanism of DA-VIT is introduced to enhance the local feature information of coal microscopic images. This DCSA can decompose the large kernels of convolutional attention into multiple scales and reduce 81.18% of parameters.Finally, we performed the contrast experiment and ablation experiment between DA-VIT and state-of-the-art methods at lots of evaluation metrics. Experimental results show that DA-VIT-Base achieves 92.14% pixel accuracy and 63.18% mIoU. Params and FLOPs of DA-VIT-Tiny are 4.95M and 8.99G, respectively. All of the evaluation metrics of the proposed DA-VIT are better than other state-of-the-art methods.
△ Less
Submitted 15 June, 2025;
originally announced June 2025.
-
Taming Stable Diffusion for Computed Tomography Blind Super-Resolution
Authors:
Chunlei Li,
Yilei Shi,
Haoxi Hu,
Jingliang Hu,
Xiao Xiang Zhu,
Lichao Mou
Abstract:
High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion mod…
▽ More
High-resolution computed tomography (CT) imaging is essential for medical diagnosis but requires increased radiation exposure, creating a critical trade-off between image quality and patient safety. While deep learning methods have shown promise in CT super-resolution, they face challenges with complex degradations and limited medical training data. Meanwhile, large-scale pre-trained diffusion models, particularly Stable Diffusion, have demonstrated remarkable capabilities in synthesizing fine details across various vision tasks. Motivated by this, we propose a novel framework that adapts Stable Diffusion for CT blind super-resolution. We employ a practical degradation model to synthesize realistic low-quality images and leverage a pre-trained vision-language model to generate corresponding descriptions. Subsequently, we perform super-resolution using Stable Diffusion with a specialized controlling strategy, conditioned on both low-resolution inputs and the generated text descriptions. Extensive experiments show that our method outperforms existing approaches, demonstrating its potential for achieving high-quality CT imaging at reduced radiation doses. Our code will be made publicly available.
△ Less
Submitted 13 June, 2025;
originally announced June 2025.
-
Efficient Long-Context LLM Inference via KV Cache Clustering
Authors:
Jie Hu,
Shengnan Wang,
Yutong He,
Ping Gong,
Jiawei Yi,
Juncheng Zhang,
Youhui Bai,
Renhai Chen,
Gong Zhang,
Cheng Li,
Kun Yuan
Abstract:
Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational ov…
▽ More
Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19$\times$ and reduces end-to-end latency by up to 2.72$\times$.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
WIP: Exploring the Value of a Debugging Cheat Sheet and Mini Lecture in Improving Undergraduate Debugging Skills and Mindset
Authors:
Andrew Ash,
John Hu
Abstract:
This work-in-progress research paper explores the efficacy of a small-scale microelectronics debugging education intervention utilizing quasi-experimental design in an introductory microelectronics course for third-year electrical and computer engineering (ECE) students. In the first semester of research, the experimental group attended a debugging "mini lecture" covering two common sources of cir…
▽ More
This work-in-progress research paper explores the efficacy of a small-scale microelectronics debugging education intervention utilizing quasi-experimental design in an introductory microelectronics course for third-year electrical and computer engineering (ECE) students. In the first semester of research, the experimental group attended a debugging "mini lecture" covering two common sources of circuit error and received a debugging cheat sheet with recommendations for testing and hypothesis formation. Across three debugging problems, students in the experimental group were faster by an average of 1:43 and had a 7 percent higher success rate than the control group. Both groups demonstrated a strong general growth mindset while the experimental group also displayed a shift in their debugging mindset by perceiving a greater value towards debugging. Though these differences are not yet statistically significant, the pilot results indicate that a mini-lecture and debugging cheat sheet are steps in the right direction toward improving students' readiness for debugging in the workplace.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Polymorphism Crystal Structure Prediction with Adaptive Space Group Diversity Control
Authors:
Sadman Sadeed Omee,
Lai Wei,
Sourin Dey,
Jianjun Hu
Abstract:
Crystalline materials can form different structural arrangements (i.e. polymorphs) with the same chemical composition, exhibiting distinct physical properties depending on how they were synthesized or the conditions under which they operate. For example, carbon can exist as graphite (soft, conductive) or diamond (hard, insulating). Computational methods that can predict these polymorphs are vital…
▽ More
Crystalline materials can form different structural arrangements (i.e. polymorphs) with the same chemical composition, exhibiting distinct physical properties depending on how they were synthesized or the conditions under which they operate. For example, carbon can exist as graphite (soft, conductive) or diamond (hard, insulating). Computational methods that can predict these polymorphs are vital in materials science, which help understand stability relationships, guide synthesis efforts, and discover new materials with desired properties without extensive trial-and-error experimentation. However, effective crystal structure prediction (CSP) algorithms for inorganic polymorph structures remain limited. We propose ParetoCSP2, a multi-objective genetic algorithm for polymorphism CSP that incorporates an adaptive space group diversity control technique, preventing over-representation of any single space group in the population guided by a neural network interatomic potential. Using an improved population initialization method and performing iterative structure relaxation, ParetoCSP2 not only alleviates premature convergence but also achieves improved convergence speed. Our results show that ParetoCSP2 achieves excellent performance in polymorphism prediction, including a nearly perfect space group and structural similarity accuracy for formulas with two polymorphs but with the same number of unit cell atoms. Evaluated on a benchmark dataset, it outperforms baseline algorithms by factors of 2.46-8.62 for these accuracies and improves by 44.8\%-87.04\% across key performance metrics for regular CSP. Our source code is freely available at https://github.com/usccolumbia/ParetoCSP2.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis
Authors:
Surya Jasper,
Minh Luu,
Evan Pan,
Aakash Tyagi,
Michael Quinn,
Jiang Hu,
David Kebo Houngninou
Abstract:
Hardware complexity continues to strain verification resources, motivating the adoption of machine learning (ML) methods to improve debug efficiency. However, ML-assisted debugging critically depends on diverse and scalable bug datasets, which existing manual or automated bug insertion methods fail to reliably produce. We introduce BugGen, a first of its kind, fully autonomous, multi-agent pipelin…
▽ More
Hardware complexity continues to strain verification resources, motivating the adoption of machine learning (ML) methods to improve debug efficiency. However, ML-assisted debugging critically depends on diverse and scalable bug datasets, which existing manual or automated bug insertion methods fail to reliably produce. We introduce BugGen, a first of its kind, fully autonomous, multi-agent pipeline leveraging Large Language Models (LLMs) to systematically generate, insert, and validate realistic functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms to ensure syntactic correctness and functional detectability. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour-over five times faster than typical manual expert insertion. Additionally, BugGen identified 104 previously undetected bugs in OpenTitan regressions, highlighting its utility in exposing verification coverage gaps. Compared against Certitude, BugGen demonstrated over twice the syntactic accuracy, deeper exposure of testbench blind spots, and more functionally meaningful and complex bug scenarios. Furthermore, when these BugGen-generated datasets were employed to train ML-based failure triage models, we achieved high classification accuracy (88.1%-93.2%) across different IP blocks, confirming the practical utility and realism of generated bugs. BugGen thus provides a scalable solution for generating high-quality bug datasets, significantly enhancing verification efficiency and ML-assisted debugging.
△ Less
Submitted 18 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.