-
DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models
Authors:
Xiwen Chen,
Wenhui Zhu,
Peijie Qiu,
Xuanzhao Dong,
Hao Wang,
Haiyu Wu,
Huayu Li,
Aristeidis Sotiras,
Yalin Wang,
Abolfazl Razi
Abstract:
Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose Diversity-aware Reward Adjustment (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR. GRPO, resulting in DRA-GRPO and DGA-DR. GRPO. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
Submitted 13 May, 2025;
originally announced May 2025.
-
ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications
Authors:
Tao Zhu,
Qi Yu,
Xinru Dong,
Shiyu Li,
Yue Liu,
Jinlong Jiang,
Lei Shu
Abstract:
Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at https://github.com/modadundun/ProDisc-VAD.
Submitted 4 May, 2025;
originally announced May 2025.
-
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Authors:
Meng-Hao Guo,
Jiajun Xu,
Yi Zhang,
Jiaxi Song,
Haoyang Peng,
Yi-Xuan Deng,
Xinzhi Dong,
Kiyohiro Nakayama,
Zhengyang Geng,
Chen Wang,
Bolin Ni,
Guo-Wei Yang,
Yongming Rao,
Houwen Peng,
Han Hu,
Gordon Wetzstein,
Shi-min Hu
Abstract:
Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts. In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (R-Bench), for assessing the reasoning capability of both language and multimodal models. R-Bench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing in both English and Chinese. These questions are meticulously curated to ensure rigorous difficulty calibration, subject balance, and cross-linguistic alignment, enabling the assessment to be an Olympiad-level multi-disciplinary benchmark. We evaluate widely used models, including OpenAI o1, GPT-4o, DeepSeek-R1, etc. Experimental results indicate that advanced models perform poorly on complex reasoning, especially multimodal reasoning. Even the top-performing model OpenAI o1 achieves only 53.2% accuracy on our multimodal evaluation. Data and code are publicly available.
Submitted 4 May, 2025;
originally announced May 2025.
-
Node2Vec-DGI-EL: A Hierarchical Graph Representation Learning Model for Ingredient-Disease Association Prediction
Authors:
Leifeng Zhang,
Xin Dong,
Shuaibing Jia,
Jianhua Zhang
Abstract:
Traditional Chinese medicine, as an essential component of traditional medicine, contains active ingredients that serve as a crucial source for modern drug development, holding immense therapeutic potential and development value. A multi-layered and complex network is formed from Chinese medicine to diseases and used to predict the potential associations between Chinese medicine ingredients and diseases. This study proposes an ingredient-disease association prediction model (Node2Vec-DGI-EL) based on hierarchical graph representation learning. First, the model uses the Node2Vec algorithm to extract node embedding vectors from the network as the initial features of the nodes. Next, the DGI algorithm learns deeper representations of the network nodes to enhance the model's expressive power. To improve prediction accuracy and robustness, an ensemble learning method is incorporated to achieve more accurate ingredient-disease association predictions. The effectiveness of the model is then evaluated through a series of theoretical verifications. The results demonstrated that the proposed model significantly outperformed existing methods, achieving an AUC of 0.9987 and an AUPR of 0.9545, thereby indicating superior predictive capability. Ablation experiments further revealed the contribution and importance of each module. Additionally, case studies explored potential associations, such as triptonide with hypertensive retinopathy and methyl ursolate with colorectal cancer. Molecular docking experiments validated these findings, showing that the triptonide-PGR and methyl ursolate-NFE2L2 pairs bind stably. In conclusion, the Node2Vec-DGI-EL model focuses on TCM datasets and effectively predicts ingredient-disease associations, overcoming the reliance on node semantic information.
Submitted 30 April, 2025;
originally announced May 2025.
-
Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA
Authors:
Xuanzhao Dong,
Wenhui Zhu,
Hao Wang,
Xiwen Chen,
Peijie Qiu,
Rui Yin,
Yi Su,
Yalin Wang
Abstract:
Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like reasoning behaviors during information retrieval, and (2) reliance on suboptimal medical corpora, which often results in the retrieval of irrelevant or noisy snippets. To overcome these challenges, we propose Discuss-RAG, a plug-and-play module designed to enhance the medical QA RAG system through collaborative agent-based reasoning. Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content. Additionally, a decision-making agent evaluates the retrieved snippets before their final integration. Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG, especially significantly improving answer accuracy by up to 16.67% on BioASQ and 12.20% on PubMedQA. The code is available at: https://github.com/LLM-VLM-GSL/Discuss-RAG.
Submitted 29 April, 2025;
originally announced April 2025.
-
Frequency Feature Fusion Graph Network For Depression Diagnosis Via fNIRS
Authors:
Chengkai Yang,
Xingping Dong,
Xiaofen Zong
Abstract:
Data-driven approaches for depression diagnosis have emerged as a significant research focus in neuromedicine, driven by the development of relevant datasets. Recently, graph neural network (GNN)-based models have gained widespread adoption due to their ability to capture brain channel functional connectivity from both spatial and temporal perspectives. However, their effectiveness is hindered by the absence of a robust temporal biomarker. In this paper, we introduce a novel and effective biomarker for depression diagnosis by leveraging the discrete Fourier transform (DFT) and propose a customized graph network architecture based on Temporal Graph Convolutional Network (TGCN). Our model was trained on a dataset comprising 1,086 subjects, which is over 10 times larger than previous datasets in the field of depression diagnosis. Furthermore, to align with medical requirements, we performed propensity score matching (PSM) to create a refined subset, referred to as the PSM dataset. Experimental results demonstrate that incorporating our newly designed biomarker enhances the representation of temporal characteristics in brain channels, leading to improved F1 scores in both the real-world dataset and the PSM dataset. This advancement has the potential to contribute to the development of more effective depression diagnostic tools. In addition, we used SHapley Additive exPlaination (SHAP) to validate the interpretability of our model, ensuring its practical applicability in medical settings.
Submitted 29 April, 2025;
originally announced April 2025.
-
Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization
Authors:
Shuai Gong,
Chaoran Cui,
Xiaolin Dong,
Xiushan Nie,
Lei Zhu,
Xiaojun Chang
Abstract:
Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data while preserving privacy. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. However, such a one-prompt-fits-all learning paradigm typically leads to performance degradation on personalized samples. Although the mixture of experts (MoE) offers a promising solution for specialization, existing MoE-based methods suffer from coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG, which treats multiple prompts as distinct experts. Unlike existing image-level routing designs, TRIP assigns different tokens within an image to specific experts. To ensure communication efficiency, TRIP incorporates a parameter-free routing mechanism based on token clustering and optimal transport. The instance-specific prompt is then synthesized by aggregating experts, weighted by the number of tokens assigned to each. Additionally, TRIP develops an unbiased learning strategy for prompt experts, leveraging the VLM's zero-shot generalization capability. Extensive experiments across four benchmarks demonstrate that TRIP achieves optimal generalization results, with communication of only 1K parameters per round. Our code is available at https://github.com/GongShuai8210/TRIP.
Submitted 29 April, 2025;
originally announced April 2025.
-
Bayesian Experimental Design for Model Discrepancy Calibration: An Auto-Differentiable Ensemble Kalman Inversion Approach
Authors:
Huchen Yang,
Xinghao Dong,
Jin-Long Wu
Abstract:
Bayesian experimental design (BED) offers a principled framework for optimizing data acquisition by leveraging probabilistic inference. However, practical implementations of BED are often compromised by model discrepancy, i.e., the mismatch between predictive models and true physical systems, which can potentially lead to biased parameter estimates. While data-driven approaches have been recently explored to characterize the model discrepancy, the resulting high-dimensional parameter space poses severe challenges for both Bayesian updating and design optimization. In this work, we propose a hybrid BED framework enabled by auto-differentiable ensemble Kalman inversion (AD-EKI) that addresses these challenges by providing a computationally efficient, gradient-free alternative to estimate the information gain for high-dimensional network parameters. The AD-EKI allows a differentiable evaluation of the utility function in BED and thus facilitates the use of standard gradient-based methods for design optimization. In the proposed hybrid framework, we iteratively optimize experimental designs, decoupling the inference of low-dimensional physical parameters handled by standard BED methods, from the high-dimensional model discrepancy handled by AD-EKI. The identified optimal designs for the model discrepancy enable us to systematically collect informative data for its calibration. The performance of the proposed method is studied by a classical convection-diffusion BED example, and the hybrid framework enabled by AD-EKI efficiently identifies informative data to calibrate the model discrepancy and robustly infers the unknown physical parameters in the modeled system. Besides addressing the challenges of BED with model discrepancy, AD-EKI also potentially fosters efficient and scalable frameworks in many other areas with bilevel optimization, such as meta-learning and structure optimization.
Submitted 28 April, 2025;
originally announced April 2025.
-
Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition
Authors:
Yin Tang,
Jiankai Li,
Hongyu Yang,
Xuan Dong,
Lifeng Fan,
Weixin Li
Abstract:
In an era where social media platforms abound, individuals frequently share images that offer insights into their intents and interests, impacting individual life quality and societal stability. Traditional computer vision tasks, such as object detection and semantic segmentation, focus on concrete visual representations, while intent recognition relies more on implicit visual clues. This poses challenges due to the wide variation and subjectivity of such clues, compounded by the problem of intra-class variety in conveying abstract concepts, e.g. "enjoy life". Existing methods seek to solve the problem by manually designing representative features or building prototypes for each class from global features. However, these methods still struggle to deal with the large visual diversity of each intent category. In this paper, we introduce a novel approach named Multi-grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking down intent recognition into visual clue composition and integrating multi-grained features. We adopt class-specific prototypes to alleviate data imbalance. We treat intent recognition as a multi-label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations. Demonstrated by a state-of-the-art performance on the Intentonomy and MDID datasets, our approach advances the accuracy of existing methods while also possessing good interpretability. Our work provides an attempt for future explorations in understanding complex and miscellaneous forms of human expression.
Submitted 25 April, 2025;
originally announced April 2025.
-
Modality Reliability Guided Multimodal Recommendation
Authors:
Xue Dong,
Xuemeng Song,
Na Zheng,
Sicheng Zhao,
Guiguang Ding
Abstract:
Multimodal recommendation suffers from a performance degradation issue: uni-modal recommendation sometimes achieves better performance. A possible reason is that unreliable item modality data hurts the fusion result. Several existing studies have introduced weights for different modalities to reduce the contribution of unreliable modality data in predicting the final user rating. However, they fail to provide appropriate supervision for learning the modality weights, making the learned weights imprecise. Therefore, we propose a modality reliability guided multimodal recommendation framework that uniquely learns the modality weights supervised by the modality reliability. Considering that there is no explicit label provided for modality reliability, we resort to automatically identifying it through the BPR recommendation objective. In particular, we define a modality reliability vector as the supervision label by the difference between modality-specific user ratings to positive and negative items, where a larger difference indicates a higher reliability of the modality as the BPR objective is better satisfied. Furthermore, to enhance the effectiveness of the supervision, we calculate the confidence level for the modality reliability vector, which dynamically adjusts the supervision strength and eliminates harmful supervision. Extensive experiments on three real-world datasets show the effectiveness of the proposed method.
Submitted 23 April, 2025;
originally announced April 2025.
-
BrainPrompt: Multi-Level Brain Prompt Enhancement for Neurological Condition Identification
Authors:
Jiaxing Xu,
Kai He,
Yue Tang,
Wei Li,
Mengcheng Lan,
Xia Dong,
Yiping Ke,
Mengling Feng
Abstract:
Neurological conditions, such as Alzheimer's Disease, are challenging to diagnose, particularly in the early stages where symptoms closely resemble healthy controls. Existing brain network analysis methods primarily focus on graph-based models that rely solely on imaging data, which may overlook important non-imaging factors and limit the model's predictive power and interpretability. In this paper, we present BrainPrompt, an innovative framework that enhances Graph Neural Networks (GNNs) by integrating Large Language Models (LLMs) with knowledge-driven prompts, enabling more effective capture of complex, non-imaging information and external knowledge for neurological disease identification. BrainPrompt integrates three types of knowledge-driven prompts: (1) ROI-level prompts to encode the identity and function of each brain region, (2) subject-level prompts that incorporate demographic information, and (3) disease-level prompts to capture the temporal progression of disease. By leveraging these multi-level prompts, BrainPrompt effectively harnesses knowledge-enhanced multi-modal information from LLMs, enhancing the model's capability to predict neurological disease stages and meanwhile offers more interpretable results. We evaluate BrainPrompt on two resting-state functional Magnetic Resonance Imaging (fMRI) datasets from neurological disorders, showing its superiority over state-of-the-art methods. Additionally, a biomarker study demonstrates the framework's ability to extract valuable and interpretable information aligned with domain knowledge in neuroscience.
Submitted 12 April, 2025;
originally announced April 2025.
-
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement
Authors:
Zhifan Ye,
Kejing Xia,
Yonggan Fu,
Xin Dong,
Jihoon Hong,
Xiangchi Yuan,
Shizhe Diao,
Jan Kautz,
Pavlo Molchanov,
Yingyan Celine Lin
Abstract:
State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths largely exceed the training sequence length, global channels are limited in adaptively extending their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training. Our code is available at https://github.com/GATECH-EIC/LongMamba.
Submitted 22 April, 2025;
originally announced April 2025.
-
Generative Framework for Personalized Persuasion: Inferring Causal, Counterfactual, and Latent Knowledge
Authors:
Donghuo Zeng,
Roberto Legaspi,
Yuewen Sun,
Xinshuai Dong,
Kazushi Ikeda,
Peter Spirtes,
Kun Zhang
Abstract:
We hypothesize that optimal system responses emerge from adaptive strategies grounded in causal and counterfactual knowledge. Counterfactual inference allows us to create hypothetical scenarios to examine the effects of alternative system responses. We enhance this process through causal discovery, which identifies the strategies informed by the underlying causal structure that govern system behaviors. Moreover, we consider the psychological constructs and unobservable noises that might be influencing user-system interactions as latent factors. We show that these factors can be effectively estimated. We employ causal discovery to identify strategy-level causal relationships among user and system utterances, guiding the generation of personalized counterfactual dialogues. We model the user utterance strategies as causal factors, enabling system strategies to be treated as counterfactual actions. Furthermore, we optimize policies for selecting system responses based on counterfactual data. Our results using a real-world dataset on social good demonstrate significant improvements in persuasive system outcomes, with increased cumulative rewards validating the efficacy of causal discovery in guiding personalized counterfactual inference and optimizing dialogue policies for a persuasive dialogue system.
Submitted 8 April, 2025;
originally announced April 2025.
-
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Authors:
Shizhe Diao,
Yu Yang,
Yonggan Fu,
Xin Dong,
Dan Su,
Markus Kliegl,
Zijia Chen,
Peter Belcak,
Yoshi Suhara,
Hongxu Yin,
Mostofa Patwary,
Yingyan Lin,
Jan Kautz,
Pavlo Molchanov
Abstract:
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
Submitted 17 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Bingchen Li,
Fengbin Guan,
Yizhen Shao,
Zihao Yu,
Xijun Wang,
Yiting Lu,
Wei Luo,
Suhang Yao,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yabin Zhang,
Ao-Xiang Zhang,
Tianwu Zhi,
Jianzhao Liu,
Yang Li,
Jingwen Xu,
Yiting Liao,
Yushen Zuo,
Mingyang Wu,
Renjie Li,
Shengyun Zhong
, et al. (88 additional authors not shown)
Abstract:
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-ChallengeCVPR-NTIRE2025.
Submitted 17 April, 2025;
originally announced April 2025.
-
Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction
Authors:
Dubing Chen,
Huan Zheng,
Jin Fang,
Xingping Dong,
Xianfei Li,
Wenlong Liao,
Tao He,
Pai Peng,
Jianbing Shen
Abstract:
We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and make distinct contributions across various modules in the VisionOcc framework. To effectively fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy by reinterpreting the formulation of vanilla RNNs. This reinterpretation leverages gradient descent on features to unify the integration of diverse temporal information, seamlessly embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on Occ3D benchmark, it achieves 1.4\%-4.8\% mIoU improvements and reduces memory consumption by 27\%-72\%.
Submitted 18 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Knowledge Acquisition on Mass-shooting Events via LLMs for AI-Driven Justice
Authors:
Benign John Ihugba,
Afsana Nasrin,
Ling Wu,
Lin Li,
Lijun Qian,
Xishuang Dong
Abstract:
Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presents the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities such as offenders, victims, locations, and criminal instruments that are vital for legal and investigative purposes. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. It is also observed that increasing the shot count enhances the performance of all models, but the gains are more substantial for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
Submitted 16 April, 2025;
originally announced April 2025.
-
Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models
Authors:
Maria Teleki,
Xiangjue Dong,
Haoran Liu,
James Caverlee
Abstract:
Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default.
Submitted 15 April, 2025;
originally announced April 2025.
-
Explicit and Implicit Representations in AI-based 3D Reconstruction for Radiology: A systematic literature review
Authors:
Yuezhe Yang,
Boyu Yang,
Yaqian Wang,
Yang He,
Xingbo Dong,
Zhe Jin
Abstract:
The demand for high-quality medical imaging in clinical practice and assisted diagnosis has made 3D reconstruction in radiological imaging a key research focus. Artificial intelligence (AI) has emerged as a promising approach to enhancing reconstruction accuracy while reducing acquisition and processing time, thereby minimizing patient radiation exposure and discomfort and ultimately benefiting clinical diagnosis. This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches based on their underlying principles. Explicit methods include point-based, volume-based, and Gaussian representations, while implicit methods encompass implicit prior embedding and neural radiance fields. Additionally, we examine commonly used evaluation metrics and benchmark datasets. Finally, we discuss the current state of development, key challenges, and future research directions in this evolving field. Our project is available at https://github.com/Bean-Young/AI4Med.
Submitted 15 April, 2025;
originally announced April 2025.
-
The Cambridge Report on Database Research
Authors:
Anastasia Ailamaki,
Samuel Madden,
Daniel Abadi,
Gustavo Alonso,
Sihem Amer-Yahia,
Magdalena Balazinska,
Philip A. Bernstein,
Peter Boncz,
Michael Cafarella,
Surajit Chaudhuri,
Susan Davidson,
David DeWitt,
Yanlei Diao,
Xin Luna Dong,
Michael Franklin,
Juliana Freire,
Johannes Gehrke,
Alon Halevy,
Joseph M. Hellerstein,
Mark D. Hill,
Stratos Idreos,
Yannis Ioannidis,
Christoph Koch,
Donald Kossmann,
Tim Kraska,
et al. (21 additional authors not shown)
Abstract:
On October 19 and 20, 2023, the authors of this report convened in Cambridge, MA, to discuss the state of the database research field, its recent accomplishments and ongoing challenges, and future directions for research and community engagement. This gathering continues a long-standing tradition in the database community, dating back to the late 1980s, in which researchers meet roughly every five years to produce a forward-looking report.
This report summarizes the key takeaways from our discussions. We begin with a retrospective on the academic, open source, and commercial successes of the community over the past five years. We then turn to future opportunities, with a focus on core data systems, particularly in the context of cloud computing and emerging hardware, as well as on the growing impact of data science, data governance, and generative AI.
This document is not intended as an exhaustive survey of all technical challenges or industry innovations in the field. Rather, it reflects the perspectives of senior community members on the most pressing challenges and promising opportunities ahead.
Submitted 15 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results
Authors:
Yuqian Fu,
Xingyu Qiu,
Bin Ren,
Yanwei Fu,
Radu Timofte,
Nicu Sebe,
Ming-Hsuan Yang,
Luc Van Gool,
Kaijin Zhang,
Qingpeng Nong,
Xiugang Dong,
Hong Gao,
Xiangsheng Zhou,
Jiancheng Pan,
Yanxing Liu,
Xiao He,
Jiahao Li,
Yuze Sun,
Xiaomeng Huang,
Zhenyu Zhang,
Ran Ma,
Yuhan Liu,
Zijian Zhuang,
Shuai Yi,
Yixiong Zou,
et al. (37 additional authors not shown)
Abstract:
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registered participants, received submissions from 42 teams, and concluded with 13 teams making valid final submissions. Participants approached the task from diverse perspectives, proposing novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings. In this report, we present an overview of the 1st NTIRE 2025 CD-FSOD Challenge, highlighting the proposed solutions and summarizing the results submitted by the participants.
Submitted 14 April, 2025;
originally announced April 2025.
-
MM-IFEngine: Towards Multimodal Instruction Following
Authors:
Shengyuan Ding,
Shenxi Wu,
Xiangyu Zhao,
Yuhang Zang,
Haodong Duan,
Xiaoyi Dong,
Pan Zhang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
Abstract:
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2$\%$), MIA (+7.6$\%$), and IFEval (+12.3$\%$). We have fully open-sourced the datasets (both SFT and DPO), evaluation code and training scripts at https://github.com/SYuan03/MM-IFEngine.
Submitted 27 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
Authors:
Jiazi Bu,
Pengyang Ling,
Yujie Zhou,
Pan Zhang,
Tong Wu,
Xiaoyi Dong,
Yuhang Zang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
Abstract:
Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential of pre-trained flow models. Specifically, HiFlow establishes a virtual reference flow within the high-resolution space that effectively captures the characteristics of low-resolution flow information, offering guidance for high-resolution generation through three key aspects: initialization alignment for low-frequency consistency, direction alignment for structure preservation, and acceleration alignment for detail fidelity. By leveraging this flow-aligned guidance, HiFlow substantially elevates the quality of high-resolution image synthesis of T2I models and demonstrates versatility across their personalized variants. Extensive experiments validate that HiFlow achieves superior high-resolution image quality compared with current state-of-the-art methods.
Submitted 8 April, 2025;
originally announced April 2025.
-
Development and Experimental Evaluation of a Vibration-Based Adhesion System for Miniature Wall-Climbing Robots
Authors:
Siqian Li,
Jung-Che Chang,
Xi Wang,
Xin Dong
Abstract:
In recent years, miniature wall-climbing robots have attracted widespread attention due to their significant potential in equipment inspection and in-situ repair applications. Traditional wall-climbing systems typically rely on electromagnetic, electrostatic, vacuum suction, or van der Waals forces for controllable adhesion. However, these conventional methods impose limitations when striving for both a compact design and high-speed mobility. This paper proposes a novel Vibration-Based Adhesion (VBA) technique, which utilizes a flexible disk vibrating near a surface to generate a strong and controllable attractive force without direct contact. By employing an electric motor as the vibration source, the constructed VBA system was experimentally evaluated, achieving an adhesion-to-weight ratio exceeding 51. The experimental results demonstrate that this adhesion mechanism not only provides a high normal force but also maintains minimal shear force, making it particularly suitable for high-speed movement and heavy load applications in miniature wall-climbing robots.
Submitted 6 April, 2025;
originally announced April 2025.
-
Divergent Paths: Separating Homophilic and Heterophilic Learning for Enhanced Graph-level Representations
Authors:
Han Lei,
Jiaxing Xu,
Xia Dong,
Yiping Ke
Abstract:
Graph Convolutional Networks (GCNs) are predominantly tailored for graphs displaying homophily, where similar nodes connect, but often fail on heterophilic graphs. The strategy of adopting distinct approaches to learn from homophilic and heterophilic components in node-level tasks has been widely discussed and proven effective both theoretically and experimentally. However, in graph-level tasks, research on this topic remains notably scarce. Addressing this gap, our research conducts an analysis on graphs whose node category IDs are available, distinguishing intra-category and inter-category components as embodiments of homophily and heterophily, respectively. We find that while GCNs excel at extracting information within categories, they frequently capture noise from inter-category components. Consequently, it is crucial to employ distinct learning strategies for intra- and inter-category elements. To alleviate this problem, we separately learn the intra- and inter-category parts by a combination of an intra-category convolution (IntraNet) and an inter-category high-pass graph convolution (InterNet). Our IntraNet is supported by sophisticated graph preprocessing steps and a novel category-based graph readout function. For the InterNet, we utilize a high-pass filter to amplify the node disparities, enhancing the recognition of details in the high-frequency components. The proposed approach, DivGNN, combines the IntraNet and InterNet with a gated mechanism and substantially improves classification performance on graph-level tasks, surpassing traditional GNN baselines in effectiveness.
Submitted 6 April, 2025;
originally announced April 2025.
-
SHapley Estimated Explanation (SHEP): A Fast Post-Hoc Attribution Method for Interpreting Intelligent Fault Diagnosis
Authors:
Qian Chen,
Xingjian Dong,
Zhike Peng,
Guang Meng
Abstract:
Despite significant progress in intelligent fault diagnosis (IFD), the lack of interpretability remains a critical barrier to practical industrial applications, driving the growth of interpretability research in IFD. Post-hoc interpretability has gained popularity due to its ability to preserve network flexibility and scalability without modifying model structures. However, these methods often yield suboptimal time-domain explanations. Recently, combining domain transform with SHAP has improved interpretability by extending explanations to more informative domains. Nonetheless, the computational expense of SHAP, exacerbated by increased dimensions from domain transforms, remains a major challenge. To address this, we propose patch-wise attribution and SHapley Estimated Explanation (SHEP). Patch-wise attribution reduces feature dimensions at the cost of explanation granularity, while SHEP simplifies subset enumeration to approximate SHAP, reducing complexity from exponential to linear. Together, these methods significantly enhance SHAP's computational efficiency, providing feasibility for real-time interpretation in monitoring tasks. Extensive experiments confirm SHEP's efficiency, interpretability, and reliability in approximating SHAP. Additionally, with open-source code, SHEP has the potential to serve as a benchmark for post-hoc interpretability in IFD. The code is available on https://github.com/ChenQian0618/SHEP.
Submitted 3 April, 2025;
originally announced April 2025.
-
Schrödinger Diffusion Driven Signal Recovery in 3T BOLD fMRI Using Unmatched 7T Observations
Authors:
Yujian Xiong,
Xuanzhao Dong,
Sebastian Waz,
Wenhui Zhu,
Negar Mallak,
Zhong-lin Lu,
Yalin Wang
Abstract:
Ultra-high-field (7 Tesla) BOLD fMRI offers exceptional detail in both spatial and temporal domains, along with robust signal-to-noise characteristics, making it a powerful modality for studying visual information processing in the brain. However, due to the limited accessibility of 7T scanners, the majority of neuroimaging studies are still conducted using 3T systems, which inherently suffer from reduced fidelity in both resolution and SNR. To mitigate this limitation, we introduce a new computational approach designed to enhance the quality of 3T BOLD fMRI acquisitions. Specifically, we project both 3T and 7T datasets, sourced from different individuals and experimental setups, into a shared low-dimensional representation space. Within this space, we employ a lightweight, unsupervised Schrödinger Bridge framework to infer a high-SNR, high-resolution counterpart of the 3T data, without relying on paired supervision. This methodology is evaluated across multiple fMRI retinotopy datasets, including synthetically generated samples, and demonstrates a marked improvement in the reliability and fit of population receptive field (pRF) models applied to the enhanced 3T outputs. Our findings suggest that it is feasible to computationally approximate 7T-level quality from standard 3T acquisitions.
Submitted 13 May, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Causal Discovery and Counterfactual Reasoning to Optimize Persuasive Dialogue Policies
Authors:
Donghuo Zeng,
Roberto Legaspi,
Yuewen Sun,
Xinshuai Dong,
Kazushi Ikeda,
Peter Spirtes,
Kun Zhang
Abstract:
Tailoring persuasive conversations to users leads to more effective persuasion. However, existing dialogue systems often struggle to adapt to dynamically evolving user states. This paper presents a novel method that leverages causal discovery and counterfactual reasoning for optimizing system persuasion capability and outcomes. We employ the Greedy Relaxation of the Sparsest Permutation (GRaSP) algorithm to identify causal relationships between user and system utterance strategies, treating user strategies as states and system strategies as actions. GRaSP identifies user strategies as causal factors influencing system responses, which inform Bidirectional Conditional Generative Adversarial Networks (BiCoGAN) in generating counterfactual utterances for the system. Subsequently, we use the Dueling Double Deep Q-Network (D3QN) model to utilize counterfactual data to determine the best policy for selecting system utterances. Our experiments with the PersuasionForGood dataset show measurable improvements in persuasion outcomes using our approach over baseline methods. The observed increase in cumulative rewards and Q-values highlights the effectiveness of causal discovery in enhancing counterfactual reasoning and optimizing reinforcement learning policies for online dialogue systems.
Submitted 19 March, 2025;
originally announced March 2025.
-
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
Authors:
Chuan Qin,
Xin Chen,
Chengrui Wang,
Pengmin Wu,
Xi Chen,
Yihang Cheng,
Jingyi Zhao,
Meng Xiao,
Xiangchao Dong,
Qingqing Long,
Boya Pan,
Han Wu,
Chengzan Li,
Yuanchun Zhou,
Hui Xiong,
Hengshu Zhu
Abstract:
In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic and evolving field. However, there is still a lack of an effective framework for the overall assessment of AI4Science, particularly from a holistic perspective on data quality and model capability. Therefore, in this study, we propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science from both scientific data and LLM perspectives. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions (Quality, FAIRness, Explainability, and Compliance), which are subdivided into 15 sub-dimensions. Drawing on data resource papers published between 2018 and 2023 in peer-reviewed journals, we present recommendation lists of AI-ready datasets for both Earth and Life Sciences, making a novel and original contribution to the field. Concurrently, to assess the capabilities of LLMs across multiple scientific disciplines, we establish 16 assessment dimensions based on five core indicators (Knowledge, Understanding, Reasoning, Multimodality, and Values), spanning Mathematics, Physics, Chemistry, Life Sciences, and Earth and Space Sciences. Using the developed benchmark datasets, we have conducted a comprehensive evaluation of over 20 representative open-source and closed-source LLMs. All the results are publicly available and can be accessed online at www.scihorizon.cn/en.
Submitted 12 March, 2025;
originally announced March 2025.
-
SMPR: A structure-enhanced multimodal drug-disease prediction model for drug repositioning and cold start
Authors:
Xin Dong,
Rui Miao,
Suyan Zhang,
Shuaibing Jia,
Leifeng Zhang,
Yong Liang,
Jianhua Zhang,
Yi Zhun Zhu
Abstract:
Repositioning drug-disease relationships has long been an active field of research. However, actual cases of biologically validated drug repositioning remain very limited, and existing models have not yet fully utilized the structural information of the drug. Furthermore, most repositioning models are only used to complete the relationship matrix, and their practicality is poor when dealing with drug cold start problems. This paper proposes a structure-enhanced multimodal relationship prediction model (SMPR). SMPR is based on the SMILES structure of the drug: it uses the Mol2VEC method to generate drug embeddings and learns disease embeddings through heterogeneous network graph neural networks. Ultimately, a drug-disease relationship matrix is constructed. In addition, to make the model easier to use, SMPR also provides a cold start interface, based on structural similarity and the repositioning results, to simply and quickly predict drug-related diseases. The repositioning ability and cold start capability of the model are verified from multiple perspectives. The AUC and ACUPR scores for repositioning reach 99% and 61%, respectively, and the AUC for cold start reaches 80%. In particular, the cold start Recall indicator can reach more than 70%, which means that SMPR is more sensitive to positive samples. Finally, case analysis is used to verify the practical value of the model, and visual analysis directly demonstrates the improvement that structural information brings to the model. For quick use, we also provide local deployment of the model and package it into an executable program.
Submitted 17 March, 2025;
originally announced March 2025.
-
DAgent: A Relational Database-Driven Data Analysis Report Generation Agent
Authors:
Wenyi Xu,
Yuren Mao,
Xiaolu Zhang,
Chao Zhang,
Xuemei Dong,
Mengfei Zhang,
Yunjun Gao
Abstract:
Relational database-driven data analysis (RDB-DA) report generation, which aims to generate data analysis reports after querying relational databases, has been widely applied in fields such as finance and healthcare. Typically, these tasks are manually completed by data scientists, making the process very labor-intensive and showing a clear need for automation. Although existing methods (e.g., Table QA or Text-to-SQL) have been proposed to reduce human dependency, they cannot handle complex analytical tasks that require multi-step reasoning, cross-table associations, and synthesizing insights into reports. Moreover, there is no dataset available for developing automatic RDB-DA report generation. To fill this gap, this paper proposes an LLM agent system for RDB-DA report generation tasks, dubbed DAgent; moreover, we construct a benchmark for automatic data analysis report generation, which includes a new dataset, DA-Dataset, and evaluation metrics. DAgent integrates planning, tools, and memory modules to decompose natural language questions into logically independent sub-queries, accurately retrieve key information from relational databases, and generate analytical reports that meet the requirements of completeness, correctness, and conciseness through multi-step reasoning and effective data integration. Experimental analysis on the DA-Dataset demonstrates DAgent's superiority in retrieval performance and analysis report generation quality, showcasing its strong potential for tackling complex database analysis report generation tasks.
△ Less
Submitted 1 April, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
Authors:
Xingguo Lv,
Xingbo Dong,
Liwen Wang,
Jiewen Yang,
Lei Zhao,
Bin Pu,
Zhe Jin,
Xuejun Li
Abstract:
Although domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at https://github.com/Yore0/TTDG-MGM.
Submitted 17 March, 2025;
originally announced March 2025.
-
GNNs as Predictors of Agentic Workflow Performances
Authors:
Yuanshuo Zhang,
Yuchen Hou,
Bohan Tang,
Shuo Chen,
Muhan Zhang,
Xiaowen Dong,
Siheng Chen
Abstract:
Agentic workflows invoked by Large Language Models (LLMs) have achieved remarkable success in handling complex tasks. However, optimizing such workflows is costly and inefficient in real-world applications due to extensive invocations of LLMs. To fill this gap, this position paper formulates agentic workflows as computational graphs and advocates Graph Neural Networks (GNNs) as efficient predictors of agentic workflow performances, avoiding repeated LLM invocations for evaluation. To empirically ground this position, we construct FLORA-Bench, a unified platform for benchmarking GNNs for predicting agentic workflow performances. With extensive experiments, we arrive at the following conclusion: GNNs are simple yet effective predictors. This conclusion supports new applications of GNNs and a novel direction towards automating agentic workflow optimization. All code, models, and data are available at https://github.com/youngsoul0731/Flora-Bench.
Submitted 14 March, 2025;
originally announced March 2025.
-
Type Information-Assisted Self-Supervised Knowledge Graph Denoising
Authors:
Jiaqi Sun,
Yujia Zheng,
Xinshuai Dong,
Haoyue Dai,
Kun Zhang
Abstract:
Knowledge graphs serve as critical resources supporting intelligent systems, but they can be noisy due to imperfect automatic generation processes. Existing approaches to noise detection often rely on external facts, logical rule constraints, or structural embeddings. These methods are often challenged by imperfect entity alignment, flexible knowledge graph construction, and overfitting on structures. In this paper, we propose to exploit the consistency between entity and relation type information for noise detection, resulting in a novel self-supervised knowledge graph denoising method that avoids those problems. We formalize type inconsistency noise as triples that deviate from the majority with respect to type-dependent reasoning along the topological structure. Specifically, we first extract a compact representation of a given knowledge graph via an encoder that models the type dependencies of triples. Then, the decoder reconstructs the original input knowledge graph based on the compact representation. It is worth noting that our proposal has the potential to address the problems of knowledge graph compression and completion, although this is not our focus. For the specific task of noise detection, the discrepancy between the reconstruction results and the input knowledge graph provides an opportunity for denoising, which is facilitated by the type consistency embedded in our method. Experimental validation demonstrates the effectiveness of our approach in detecting potential noise in real-world data.
Submitted 12 March, 2025;
originally announced March 2025.
-
Adaptive Backdoor Attacks with Reasonable Constraints on Graph Neural Networks
Authors:
Xuewen Dong,
Jiachen Li,
Shujun Li,
Zhichao You,
Qiang Qu,
Yaroslav Kholodov,
Yulong Shen
Abstract:
Recent studies show that graph neural networks (GNNs) are vulnerable to backdoor attacks. Existing backdoor attacks against GNNs use fixed-pattern triggers and lack reasonable trigger constraints, overlooking individual graph characteristics and resulting in insufficient evasiveness. To tackle the above issues, we propose ABARC, the first Adaptive Backdoor Attack with Reasonable Constraints, applicable to both graph-level and node-level tasks in GNNs. For graph-level tasks, we propose a subgraph backdoor attack independent of the graph's topology. It dynamically selects trigger nodes for each target graph and modifies node features with constraints based on graph similarity, feature range, and feature type. For node-level tasks, our attack begins with an analysis of node features, followed by selecting and modifying trigger features, which are then constrained by node similarity, feature range, and feature type. Furthermore, an adaptive edge-pruning mechanism is designed to reduce the impact of neighbors on target nodes, ensuring a high attack success rate (ASR). Experimental results show that even with reasonable constraints for attack evasiveness, our attack achieves a high ASR while incurring a marginal clean accuracy drop (CAD). When combined with the state-of-the-art defense randomized smoothing (RS) method, our attack maintains an ASR over 94%, surpassing existing attacks by more than 7%.
Submitted 12 March, 2025;
originally announced March 2025.
-
Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement
Authors:
Huidong Liang,
Haitz Sáez de Ocáriz Borde,
Baskaran Sripathmanathan,
Michael Bronstein,
Xiaowen Dong
Abstract:
Long-range dependencies are critical for effective graph representation learning, yet most existing datasets focus on small graphs tailored to inductive tasks, offering limited insight into long-range interactions. Current evaluations primarily compare models employing global attention (e.g., graph transformers) with those using local neighborhood aggregation (e.g., message-passing neural networks) without a direct measurement of long-range dependency. In this work, we introduce City-Networks, a novel large-scale transductive learning dataset derived from real-world city roads. This dataset features graphs with over $10^5$ nodes and significantly larger diameters than those in existing benchmarks, naturally embodying long-range information. We annotate the graphs using an eccentricity-based approach, ensuring that the classification task inherently requires information from distant nodes. Furthermore, we propose a model-agnostic measurement based on the Jacobians of neighbors from distant hops, offering a principled quantification of long-range dependencies. Finally, we provide theoretical justifications for both our dataset design and the proposed measurement - particularly by focusing on over-smoothing and influence score dilution - which establishes a robust foundation for further exploration of long-range interactions in graph neural networks.
Submitted 11 March, 2025;
originally announced March 2025.
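The eccentricity-based annotation mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual labeling procedure, so everything below is an assumption made for illustration: an unweighted toy path graph, exact eccentricity computed by breadth-first search, and equal-width binning into classes. The names `eccentricity` and `eccentricity_labels` are hypothetical.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest BFS hop distance from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def eccentricity_labels(adj, num_classes):
    """Bucket nodes into classes by eccentricity (equal-width bins, an assumption)."""
    ecc = {u: eccentricity(adj, u) for u in adj}
    lo, hi = min(ecc.values()), max(ecc.values())
    span = max(hi - lo, 1)  # avoid division by zero when all values are equal
    return {
        u: min((e - lo) * num_classes // span, num_classes - 1)
        for u, e in ecc.items()
    }


# Toy path graph 0 - 1 - 2 - 3 - 4 (adjacency lists).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ecc = {u: eccentricity(path, u) for u in path}
labels = eccentricity_labels(path, num_classes=2)
```

Because a node's eccentricity depends on its farthest node, a label derived from it cannot be predicted from a small local neighborhood, which is the property the abstract relies on. Exact eccentricity costs one BFS per node, so on graphs with more than $10^5$ nodes an approximation would likely be needed.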
-
Heterogeneous Graph Structure Learning through the Lens of Data-generating Processes
Authors:
Keyue Jiang,
Bohan Tang,
Xiaowen Dong,
Laura Toni
Abstract:
Inferring the graph structure from observed data is a key task in graph machine learning to capture the intrinsic relationship between data entities. While significant advancements have been made in learning the structure of homogeneous graphs, many real-world graphs exhibit heterogeneous patterns where nodes and edges have multiple types. This paper fills this gap by introducing the first approach for heterogeneous graph structure learning (HGSL). To this end, we first propose a novel statistical model for the data-generating process (DGP) of heterogeneous graph data, namely hidden Markov networks for heterogeneous graphs (H2MN). Then we formalize HGSL as a maximum a posteriori estimation problem parameterized by such DGP and derive an alternating optimization method to obtain a solution together with a theoretical justification of the optimization conditions. Finally, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate that our proposed method excels in learning structure on heterogeneous graphs in terms of edge type identification and edge weight recovery.
Submitted 11 March, 2025;
originally announced March 2025.
-
When Selection Meets Intervention: Additional Complexities in Causal Discovery
Authors:
Haoyue Dai,
Ignavier Ng,
Jianle Sun,
Zeyu Tang,
Gongxu Luo,
Xinshuai Dong,
Peter Spirtes,
Kun Zhang
Abstract:
We address the common yet often-overlooked selection bias in interventional studies, where subjects are selectively enrolled into experiments. For instance, participants in a drug trial are usually patients of the relevant disease; A/B tests on mobile applications target existing users only, and gene perturbation studies typically focus on specific cell types, such as cancer cells. Ignoring this bias leads to incorrect causal discovery results. Even when recognized, the existing paradigm for interventional causal discovery still fails to address it. This is because subtle differences in when and where interventions happen can lead to significantly different statistical patterns. We capture this dynamic by introducing a graphical model that explicitly accounts for both the observed world (where interventions are applied) and the counterfactual world (where selection occurs while interventions have not been applied). We characterize the Markov property of the model, and propose a provably sound algorithm to identify causal relations as well as selection mechanisms up to the equivalence class, from data with soft interventions and unknown targets. Through synthetic and real-world experiments, we demonstrate that our algorithm effectively identifies true causal relations despite the presence of selection bias.
Submitted 10 March, 2025;
originally announced March 2025.
-
RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
Authors:
Wenhui Zhu,
Xin Li,
Xiwen Chen,
Peijie Qiu,
Vamsi Krishna Vasa,
Xuanzhao Dong,
Yanxi Chen,
Natasha Lepore,
Oana Dumitrascu,
Yi Su,
Yalin Wang
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to both enhance retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms general-domain MLLMs by a large margin in the diagnosis of retinal diseases on 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT
Submitted 5 March, 2025;
originally announced March 2025.
-
PriFFT: Privacy-preserving Federated Fine-tuning of Large Language Models via Hybrid Secret Sharing
Authors:
Zhichao You,
Xuewen Dong,
Ke Cheng,
Xutong Mu,
Jiaxuan Fu,
Shiyang Ma,
Qiang Qu,
Yulong Shen
Abstract:
Fine-tuning large language models (LLMs) raises privacy concerns due to the risk of exposing sensitive training data. Federated learning (FL) mitigates this risk by keeping training samples on local devices, but privacy-preserving federated fine-tuning still faces the following problems. (i) Recent studies show that adversaries can still infer private information in FL. (ii) LLM parameters are shared publicly during federated fine-tuning, while developers are often reluctant to disclose these parameters, posing further security challenges. (iii) Existing works focus on secure inference of LLMs but do not consider privacy-preserving fine-tuning. Motivated by the above problems, we propose PriFFT, a privacy-preserving federated fine-tuning mechanism, to protect both the model parameters and users' privacy. Because LLMs have a large number of parameters, we present hybrid secret sharing combining arithmetic secret sharing (ASS) and function secret sharing (FSS) to build secure operations and implement secure layers and activation for privacy-preserving fine-tuning. To improve the efficiency of privacy-preserving federated fine-tuning of LLMs, we optimize several secure computation protocols based on FSS, including reciprocal calculation, tensor products, natural exponentiation, softmax, sigmoid, hyperbolic tangent, and dropout. The hybrid secret sharing enables PriFFT to apply our optimized FSS protocols while combining ASS protocols to support complex computation without extra communication. The optimized protocols reduce execution time by up to 62.5% and communication overhead by up to 70.7% compared to existing protocols. In addition, PriFFT reduces execution time and communication overhead in privacy-preserving fine-tuning by up to 59.1% and 77.0%, respectively, without an accuracy drop compared to existing secret sharing methods.
Submitted 13 May, 2025; v1 submitted 4 March, 2025;
originally announced March 2025.
-
Group Sparsity Methods for Compressive Space-Frequency Channel Estimation and Spatial Equalization in Fluid Antenna System
Authors:
Xuehui Dong,
Kai Wan,
Shuangyang Li,
Robert Caiming Qiu,
Giuseppe Caire
Abstract:
Fluid Antenna System (FAS) unlocks unprecedented flexibility in wireless channel optimization through spatial reconfigurability. However, its practical deployment is hindered by the coupled challenges posed by high-dimensional channel estimation and real-time position optimization. This paper bridges wireless propagation physics with compressed sensing theory to address these challenges through three aspects. First, we establish a group-sparse recovery framework for space-frequency characteristics (SFC) in FAS, formally characterizing leakage-induced sparsity degradation from limited aperture and bandwidth as a structured group-sparsity problem. By deriving dictionary-adapted group restricted isometry property (D-GRIP), we prove tight recovery bounds for a convex $\ell_1/\ell_2$-mixed norm optimization formulation that preserves leakage-aware sparsity patterns. Second, we develop a Descending Correlation Group Orthogonal Matching Pursuit (DC-GOMP) algorithm that systematically relaxes leakage constraints to reduce subcoherence. This approach enables robust SFC recovery with accelerated convergence and superior performance compared to conventional compressive sensing methods like OMP or GOMP. Third, we formulate spatial equalization (SE) as a mixed-integer linear programming (MILP) problem, ensuring optimality through the branch-and-bound method. To achieve real-time implementability while maintaining near-optimal performance, we complement this with a greedy algorithm.
Simulation results demonstrate that the proposed channel estimation algorithm effectively resolves energy misallocation and enables recovery of weak details, achieving superior recovery accuracy and convergence rate. The SE framework suppresses deep fading phenomena and reduces hardware deployment overhead while maintaining equivalent link reliability.
Submitted 3 March, 2025;
originally announced March 2025.
-
Visual-RFT: Visual Reinforcement Fine-Tuning
Authors:
Ziyu Liu,
Zeyi Sun,
Yuhang Zang,
Xiaoyi Dong,
Yuhang Cao,
Haodong Duan,
Dahua Lin,
Jiaqi Wang
Abstract:
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT to visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
Submitted 3 March, 2025;
originally announced March 2025.
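The IoU-based verifiable reward and the group-relative update mentioned in this abstract can be sketched as follows. This is an illustration under assumptions, not the Visual-RFT reward: it scores a single predicted box against a single ground-truth box, uses `(x1, y1, x2, y2)` corner coordinates, and normalizes rewards within a group by mean and population standard deviation in the way GRPO is commonly described. The names `iou` and `group_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


ground_truth = (0, 0, 2, 2)
# Three sampled responses for the same input, each predicting one box.
predictions = [(1, 1, 3, 3), (0, 0, 2, 2), (5, 5, 6, 6)]
rewards = [iou(p, ground_truth) for p in predictions]
advantages = group_advantages(rewards)
```

The reward is verifiable in the sense that it is computed directly from the predicted coordinates, with no learned reward model. The exact match receives the largest advantage and the disjoint box the smallest.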
-
Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning
Authors:
Zijian Li,
Shunxing Fan,
Yujia Zheng,
Ignavier Ng,
Shaoan Xie,
Guangyi Chen,
Xinshuai Dong,
Ruichu Cai,
Kun Zhang
Abstract:
Disentangled representation learning aims to uncover latent variables underlying the observed data, and generally speaking, rather strong assumptions are needed to ensure identifiability. Some approaches rely on sufficient changes on the distribution of latent variables indicated by auxiliary variables such as domain indices, but acquiring enough domains is often challenging. Alternative approaches exploit structural sparsity assumptions on the mixing procedure, but such constraints are usually (partially) violated in practice. Interestingly, we find that these two seemingly unrelated assumptions can actually complement each other to achieve identifiability. Specifically, when conditioned on auxiliary variables, the sparse mixing procedure assumption provides structural constraints on the mapping from estimated to true latent variables and hence compensates for potentially insufficient distribution changes. Building on this insight, we propose an identifiability theory with less restrictive constraints regarding distribution changes and the sparse mixing procedure, enhancing applicability to real-world scenarios. Additionally, we develop an estimation framework incorporating a domain encoding network and a sparse mixing constraint and provide two implementations based on variational autoencoders and generative adversarial networks, respectively. Experiment results on synthetic and real-world datasets support our theoretical results.
Submitted 1 March, 2025;
originally announced March 2025.
-
Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription
Authors:
Benjamin Gutteridge,
Matthew Thomas Jackson,
Toni Kukurin,
Xiaowen Dong
Abstract:
Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.
Submitted 27 February, 2025;
originally announced February 2025.
-
Towards More Accurate Full-Atom Antibody Co-Design
Authors:
Jiayang Wu,
Xingyi Zhang,
Xiangyu Dong,
Kun Xie,
Ziqi Liu,
Wensheng Gan,
Sibo Wang,
Le Song
Abstract:
Antibody co-design represents a critical frontier in drug development, where accurate prediction of both 1D sequence and 3D structure of complementarity-determining regions (CDRs) is essential for targeting specific epitopes. Despite recent advances in equivariant graph neural networks for antibody design, current approaches often fall short in capturing the intricate interactions that govern antibody-antigen recognition and binding specificity. In this work, we present Igformer, a novel end-to-end framework that addresses these limitations through innovative modeling of antibody-antigen binding interfaces. Our approach refines the inter-graph representation by integrating personalized propagation with global attention mechanisms, enabling comprehensive capture of the intricate interplay between local chemical interactions and global conformational dependencies that characterize effective antibody-antigen binding. Through extensive validation on epitope-binding CDR design and structure prediction tasks, Igformer demonstrates significant improvements over existing methods, suggesting that explicit modeling of multi-scale residue interactions can substantially advance computational antibody design for therapeutic applications.
Submitted 11 February, 2025;
originally announced February 2025.
-
FPGA-based Emulation and Device-Side Management for CXL-based Memory Tiering Systems
Authors:
Yiqi Chen,
Xiping Dong,
Zhe Zhou,
Zhao Wang,
Jie Zhang,
Guangyu Sun
Abstract:
The Compute Express Link (CXL) technology facilitates the extension of CPU memory through byte-addressable SerDes links and cascaded switches, creating complex heterogeneous memory systems where CPU access to various endpoints differs in latency and bandwidth. Effective tiered memory management is essential for optimizing system performance in such systems. However, designing an effective memory tiering system for CXL-extended heterogeneous memory faces challenges: 1) Existing evaluation methods, such as NUMA-based emulation and full-system simulations like GEM5, are limited in assessing hardware-based tiered memory management solutions and handling real-world workloads at scale. 2) Previous memory tiering systems struggle to simultaneously achieve high resolution, low overhead, and high flexibility and compatibility.
In this study, we first introduce HeteroBox, a configurable emulation platform that leverages real CXL-enabled FPGAs to emulate the performance of various CXL memory architectures. HeteroBox allows one to configure a memory space with multiple regions, each exhibiting distinct CPU-access latency and bandwidth. HeteroBox helps assess the performance of both software-managed and hardware-managed memory tiering systems with high efficiency and fidelity. Based on HeteroBox, we further propose HeteroMem, a hardware-managed memory tiering system that operates on the device side. HeteroMem creates an abstraction layer between the CPU and device memory, effectively monitoring data usage and migrating data to faster memory tiers, thus hiding device-side heterogeneity from the CPU. Evaluations with real-world applications show that HeteroMem delivers high performance while keeping heterogeneous memory management fully transparent to the CPU, achieving a 5.1\% to 16.2\% performance improvement over existing memory tiering solutions.
Submitted 14 March, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
GCDance: Genre-Controlled 3D Full Body Dance Generation Driven By Music
Authors:
Xinran Liu,
Xu Dong,
Diptesh Kanojia,
Wenwu Wang,
Zhenhua Feng
Abstract:
Generating high-quality full-body dance sequences from music is a challenging task as it requires strict adherence to genre-specific choreography. Moreover, the generated sequences must be both physically realistic and precisely synchronized with the beats and rhythm of the music. To overcome these challenges, we propose GCDance, a classifier-free diffusion framework for generating genre-specific dance motions conditioned on both music and textual prompts. Specifically, our approach extracts music features by combining high-level pre-trained music foundation model features with hand-crafted features for multi-granularity feature fusion. To achieve genre controllability, we leverage CLIP to efficiently embed genre-based textual prompt representations at each time step within our dance generation pipeline. Our GCDance framework can generate diverse dance styles from the same piece of music while ensuring coherence with the rhythm and melody of the music. Extensive experimental results obtained on the FineDance dataset demonstrate that GCDance significantly outperforms the existing state-of-the-art approaches, and it also achieves competitive results on the AIST++ dataset. Our ablation and inference time analyses demonstrate that GCDance provides an effective solution for high-quality music-driven dance generation.
Submitted 25 February, 2025;
originally announced February 2025.
-
TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
Authors:
Xiuwei Chen,
Sihao Lin,
Xiao Dong,
Zisheng Chen,
Meng Cao,
Jianhua Han,
Hang Xu,
Xiaodan Liang
Abstract:
Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enables global awareness with linear complexity. Nevertheless, training specialized subquadratic architectures from scratch for certain tasks is both resource-intensive and time-consuming. Motivated by this, we explore cross-architecture training to transfer the knowledge already available in existing Transformer models to the alternative Mamba architecture, an approach termed TransMamba. Our approach employs a two-stage strategy to expedite the training of new Mamba models, ensuring effectiveness across uni-modal and cross-modal tasks. To address architecture disparities, we project the intermediate features into an aligned latent space before transferring knowledge. On top of that, a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) is introduced for knowledge transfer without limitations on varying layer counts. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of the Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba boasts substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.
Submitted 20 February, 2025;
originally announced February 2025.
-
EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement
Authors:
Wenhui Zhu,
Xuanzhao Dong,
Xin Li,
Yujian Xiong,
Xiwen Chen,
Peijie Qiu,
Vamsi Krishna Vasa,
Zhangsihao Yang,
Yi Su,
Oana Dumitrascu,
Yalin Wang
Abstract:
Over the past decade, generative models have achieved significant success in enhancing fundus images. However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to downstream real-world clinical research (e.g., vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) Multi-dimensional, clinically aligned downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promotes comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at https://github.com/Retinal-Research/EyeBench
Submitted 19 February, 2025;
originally announced February 2025.
-
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Authors:
Zihan Liu,
Shuangrui Ding,
Zhixiong Zhang,
Xiaoyi Dong,
Pan Zhang,
Yuhang Zang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
Abstract:
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .
Submitted 18 February, 2025;
originally announced February 2025.