-
Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
Authors:
Lexiang Tang,
Xianwei Zhuang,
Bang Yang,
Zhiyuan Hu,
Hongxiang Li,
Lu Ma,
Jinghan Ru,
Yuexian Zou
Abstract:
Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference. Through systematic analysis, we identify three key pathological attention behaviors in LVLMs: (1) weak visual grounding, where attention to visual tokens is insufficient or misallocated, over-focusing on uninformative regions; (2) language prior dominance, where excessive attention to prior response tokens reinforces autoregressive patterns and impairs multimodal alignment; (3) prompt redundancy, where many attention heads fixate on system prompt tokens, disrupting the integration of image, instruction, and response content. To address these issues, we introduce two inference-time interventions: token-level attention intervention (TAI), which enhances focus on salient visual content, and head-level attention intervention (HAI), which suppresses over-attention to prompt and nearby text tokens. VisFlow operates without additional training or model modifications. Extensive experiments across models and benchmarks show that VisFlow effectively reduces hallucinations and improves visual factuality, with negligible computational cost.
Submitted 14 June, 2025;
originally announced June 2025.
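The two interventions named in the abstract above can be pictured as a reweighting of a single attention row. The sketch below is an illustration only, under assumed details: the function name `intervene`, the token-role labels, and the factors `boost` and `damp` are hypothetical and not taken from the paper, which may select salient tokens and heads quite differently.

```python
def intervene(row, roles, boost=1.5, damp=0.5):
    """Rescale one post-softmax attention row, then renormalize it to sum to 1.

    row   -- attention weights over the context tokens
    roles -- one label per token: "system", "image", "instruction", "response"
    boost -- assumed factor for image tokens (the token-level idea)
    damp  -- assumed factor for system-prompt tokens (the head-level idea)
    """
    scale = {"image": boost, "system": damp}
    scaled = [w * scale.get(r, 1.0) for w, r in zip(row, roles)]
    total = sum(scaled)
    return [w / total for w in scaled]


row = [0.4, 0.1, 0.1, 0.2, 0.2]
roles = ["system", "image", "image", "instruction", "response"]
new_row = intervene(row, roles)  # image tokens gain mass, system tokens lose it
```

In a real model such a rescaling would presumably be applied only to selected heads and layers; that selection is outside this sketch.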
-
Particle Builder -- Learn about the Standard Model while playing against an AI
Authors:
Mohammad Attar,
Andrew Carse,
Yeming Chen,
Thomas Green,
Jeong-Yeon Ha,
Yanbai Jin,
Amy McWilliams,
Theirry Panggabean,
Zhengyu Peng,
Lujin Sun,
Jing Ru,
Jiacheng She,
Jialin Wang,
Zilun Wei,
Jiayuan Zhu,
Lachlan McGinness
Abstract:
Particle Builder Online is a web-based education game designed for high school physics students. Students can play against an AI opponent or peers to familiarise themselves with the Standard Model of Particle Physics. The game is aimed at a high school level and tailored to the International Baccalaureate and the Australian Curriculum. Students from four schools in Canberra took pre/post-tests and a survey while completing a lesson where they played Particle Builder. Students' understanding of particle physics concepts improved significantly. Students found the game more enjoyable and effective than regular classroom lessons.
Submitted 27 May, 2025;
originally announced June 2025.
-
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Authors:
Xianwei Zhuang,
Yuxin Xie,
Yufan Deng,
Dongchao Yang,
Liming Liang,
Jinghan Ru,
Yuguo Yin,
Yuexian Zou
Abstract:
In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds upon our previous framework VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy combining iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, demonstrating significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs), exhibiting promising scalability. The codebase and model weights are publicly available at https://github.com/VARGPT-family/VARGPT-v1.1.
Submitted 3 April, 2025;
originally announced April 2025.
-
ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
Authors:
Yuguo Yin,
Yuxin Xie,
Wenyuan Yang,
Dongchao Yang,
Jinghan Ru,
Xianwei Zhuang,
Liming Liang,
Yuexian Zou
Abstract:
Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies in instance similarity matching across languages. We theoretically analyze the inconsistency in terms of both multilingual modal alignment direction error and weight error, and derive a theoretical upper bound on the weight error to quantify the inconsistency. Based on an analysis of this upper bound, we find that the inconsistency problem stems from the data distribution error caused by random sampling of languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets show that our scheme achieves state-of-the-art performance on recall and consistency metrics for eight mainstream languages, including English. Our code will be available at https://github.com/ATRI-ACL/ATRI-ACL.
Submitted 4 June, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Multi-Class Imbalanced Learning with Support Vector Machines via Differential Evolution
Authors:
Zhong-Liang Zhang,
Jie Yang,
Jian-Ming Ru,
Xiao-Xi Zhao,
Xing-Gang Luo
Abstract:
The support vector machine (SVM) is a powerful machine learning algorithm for classification tasks. However, the classical SVM was developed for binary problems under the assumption of balanced datasets, and multi-class imbalanced classification problems are more complex. In this paper, we propose an improved SVM via Differential Evolution (i-SVM-DE) method to deal with them. An improved SVM (i-SVM) model is proposed to handle data imbalance by combining a cost-sensitive technique with a modification of the separation margin in the constraints, which yields a parameter optimization problem. Using the one-versus-one (OVO) scheme, a multi-class problem is decomposed into a number of binary subproblems. A large optimization problem is then formed by concatenating the parameters of the binary subproblems. To find the optimal model effectively and learn the support vectors for each class simultaneously, an improved differential evolution (DE) algorithm is applied to solve this large optimization problem. Instead of using a validation set, we propose fitness functions to evaluate the learned model and obtain the optimal parameters during the DE search. A series of experiments is carried out to verify the benefits of the proposed method. The results indicate that i-SVM-DE is statistically superior to the baseline methods.
Submitted 20 February, 2025;
originally announced February 2025.
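The one-versus-one decomposition mentioned in the abstract above turns a problem with k classes into k(k-1)/2 binary subproblems, one per unordered pair of classes. A minimal sketch follows; the helper names are hypothetical, and the majority-vote aggregation is a common OVO convention assumed here, not necessarily the rule used in the paper.

```python
from itertools import combinations


def ovo_subproblems(classes):
    """Return every unordered pair of classes; each pair is one binary subproblem."""
    return list(combinations(classes, 2))


def ovo_predict(pairwise_winners):
    """Aggregate the winners reported by the binary classifiers by majority vote."""
    votes = {}
    for winner in pairwise_winners:
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)


pairs = ovo_subproblems(["a", "b", "c", "d"])  # 4 classes give 6 pairs
label = ovo_predict(["a", "a", "c", "a", "b", "c"])  # one winner per pair
```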
-
Do we really have to filter out random noise in pre-training data for language models?
Authors:
Jinghan Ru,
Yuxin Xie,
Xianwei Zhuang,
Yuguo Yin,
Zhihui Guo,
Zhiming Liu,
Qianli Ren,
Yuexian Zou
Abstract:
Web-scale pre-training datasets are the cornerstone of LLMs' success. However, text data curated from the Internet inevitably contains random noise caused by decoding errors or unregulated web content. In contrast to previous works that focus on low-quality or synthetic data, our study provides the first systematic investigation of such random noise through a cohesive "What-Why-How" framework. Surprisingly, we observed that the resulting increase in the loss of next-token prediction (NTP) was significantly lower than the proportion of random noise even when the model was scaled up to 2.7B. We provide a theoretical justification for this phenomenon, which also elucidates the success of multilingual models and can be applied to multimodal models. On the other hand, experiments show that the model's performance in downstream tasks is not based solely on the NTP loss, which means that random noise may result in degraded downstream performance. To address the potential adverse effects, we introduce a novel plug-and-play Local Gradient Matching loss, which explicitly enhances the denoising capability of the downstream task head by aligning the gradient of normal and perturbed features without requiring knowledge of the model's parameters. Additional experiments on 8 language and 14 vision benchmarks further validate its effectiveness.
Submitted 15 May, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Authors:
Xianwei Zhuang,
Yuxin Xie,
Yufan Deng,
Liming Liang,
Jinghan Ru,
Yuguo Yin,
Yuexian Zou
Abstract:
We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT innovatively extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while seamlessly accommodating mixed-modal input and output within a single model framework. Our VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. The three stages are designed to achieve alignment between visual and textual features, enhance instruction following for both understanding and generation, and improve visual generation quality, respectively. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports capabilities in autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks. The project page is at https://vargpt-1.github.io/
Submitted 21 January, 2025;
originally announced January 2025.
-
Imbalanced Open Set Domain Adaptation via Moving-threshold Estimation and Gradual Alignment
Authors:
Jinghan Ru,
Jun Tian,
Zhekai Du,
Chengwei Xiao,
Jingjing Li,
Heng Tao Shen
Abstract:
Multimedia applications are often associated with cross-domain knowledge transfer, where Unsupervised Domain Adaptation (UDA) can be used to reduce domain shifts. Open Set Domain Adaptation (OSDA) aims to transfer knowledge from a well-labeled source domain to an unlabeled target domain under the assumption that the target domain contains unknown classes. Existing OSDA methods consistently emphasize the covariate shift and ignore the potential label shift problem. The performance of OSDA methods degrades drastically under intra-domain class imbalance and inter-domain label shift, yet little attention has been paid to this issue in the community. In this paper, we explore Imbalanced Open Set Domain Adaptation (IOSDA), where covariate shift, label shift and category mismatch exist simultaneously. To alleviate the negative effects caused by label shift in OSDA, we propose Open-set Moving-threshold Estimation and Gradual Alignment (OMEGA), a novel architecture that improves existing OSDA methods on class-imbalanced data. Specifically, a novel unknown-aware target clustering scheme is proposed to form tight clusters in the target domain to reduce the negative effects of label shift and intra-domain class imbalance. Furthermore, moving-threshold estimation is designed to generate a specific threshold for each target sample rather than using one threshold for all. Extensive experiments on IOSDA, OSDA and OPDA benchmarks demonstrate that our method significantly outperforms existing state-of-the-art methods. Code and data are available at https://github.com/mendicant04/OMEGA.
Submitted 8 March, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
On the expected number of perfect matchings in cubic planar graphs
Authors:
Marc Noy,
Clément Requilé,
Juanjo Rué
Abstract:
A well-known conjecture by Lovász and Plummer from the 1970s asserted that a bridgeless cubic graph has exponentially many perfect matchings. It was solved in the affirmative by Esperet et al. (Adv. Math. 2011). On the other hand, Chudnovsky and Seymour (Combinatorica 2012) proved the conjecture in the special case of cubic planar graphs. In our work we consider random bridgeless cubic planar graphs with the uniform distribution on graphs with $n$ vertices. Under this model we show that the expected number of perfect matchings in labeled bridgeless cubic planar graphs is asymptotically $cγ^n$, where $c>0$ and $γ \approx 1.14196$ is an explicit algebraic number. We also compute the expected number of perfect matchings in (not necessarily bridgeless) cubic planar graphs and provide lower bounds for unlabeled graphs. Our starting point is a correspondence between counting perfect matchings in rooted cubic planar maps and the partition function of the Ising model in rooted triangulations.
Submitted 1 March, 2021; v1 submitted 28 May, 2020;
originally announced May 2020.
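The quantity studied in the abstract above, the number of perfect matchings of a cubic planar graph, can be computed by brute force for small examples. The sketch below is illustrative only and is not the paper's method (which works with generating functions); the function name and the two example graphs, K4 and the cube graph, are choices made here.

```python
def count_perfect_matchings(n, edges):
    """Count perfect matchings of a graph on vertices 0..n-1 by exhaustive search."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def rec(free):
        if not free:
            return 1
        u = min(free)  # the smallest unmatched vertex must be matched now
        rest = free - {u}
        return sum(rec(rest - {v}) for v in adj[u] if v in rest)

    return rec(frozenset(range(n)))


# K4, the smallest bridgeless cubic planar graph.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
# The cube graph: vertices are 3-bit numbers, edges flip exactly one bit.
cube = [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < (u ^ (1 << b))]

k4_count = count_perfect_matchings(4, k4)
cube_count = count_perfect_matchings(8, cube)
```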
-
Dynamic Programming for Graphs on Surfaces
Authors:
Juanjo Rué,
Ignasi Sau,
Dimitrios M. Thilikos
Abstract:
We provide a framework for the design and analysis of dynamic programming algorithms for surface-embedded graphs with n vertices and branchwidth at most k. Our technique applies to general families of problems where standard dynamic programming runs in 2^{O(k log k)} n steps. Our approach combines tools from topological graph theory and analytic combinatorics. In particular, we introduce a new type of branch decomposition called "surface cut decomposition", generalizing sphere cut decompositions of planar graphs introduced by Seymour and Thomas, which has nice combinatorial properties. Namely, the number of partial solutions that can be arranged on a surface cut decomposition can be upper-bounded by the number of non-crossing partitions on surfaces with boundary. It follows that partial solutions can be represented by a single-exponential (in the branchwidth k) number of configurations. This proves that, when applied to surface cut decompositions, dynamic programming runs in 2^{O(k)} n steps. In this way, we considerably extend the class of problems that can be solved in running times with a single-exponential dependence on branchwidth and unify/improve most previous results in this direction.
Submitted 25 April, 2011; v1 submitted 13 April, 2011;
originally announced April 2011.
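The gap between 2^{O(k log k)} and 2^{O(k)} in the abstract above mirrors a classical counting gap: the number of all set partitions of k elements (the Bell number) grows like 2^{Θ(k log k)}, whereas the number of non-crossing partitions of k points on a disk (the Catalan number) is at most 4^k. The sketch below only compares these two sequences for the disk case; it does not implement surface cut decompositions, and the function names are chosen here for illustration.

```python
from math import comb


def bell(k):
    """k-th Bell number via the Bell triangle: set partitions of k elements."""
    row = [1]
    for _ in range(k):
        nxt = [row[-1]]  # each row starts with the last entry of the previous row
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]


def catalan(k):
    """k-th Catalan number: non-crossing partitions of k points on a disk."""
    return comb(2 * k, k) // (k + 1)


# Ratio of table sizes if states are arbitrary vs. non-crossing partitions.
ratios = [bell(k) / catalan(k) for k in (5, 10, 15, 20)]
```

The ratio grows quickly with k, which is why restricting partial solutions to non-crossing configurations matters for the running time.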
-
Asymptotic Enumeration of Non-crossing Partitions on Surfaces
Authors:
Juanjo Rué,
Ignasi Sau,
Dimitrios M. Thilikos
Abstract:
We generalize the notion of non-crossing partition on a disk to general surfaces with boundary. For this, we consider a surface $Σ$ and introduce the number $C_Σ(n)$ of non-crossing partitions of a set of $n$ points lying on the boundary of $Σ$. Our proofs use bijective techniques arising from map enumeration, together with the symbolic method and singularity analysis on generating functions. An outcome of our results is that the exponential growth of $C_Σ(n)$ is the same as that of the $n$-th Catalan number, i.e., it does not change when we move from the case where $Σ$ is a disk to general surfaces with boundary.
Submitted 14 April, 2011; v1 submitted 13 April, 2011;
originally announced April 2011.
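For the base case in the abstract above, where the surface is a disk, the non-crossing partitions of $n$ boundary points are counted by the $n$-th Catalan number. The brute-force check below illustrates only this disk case; it says nothing about general surfaces, and the helper names are chosen here for illustration.

```python
from math import comb


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part  # `first` forms a new block
        for i in range(len(part)):  # or `first` joins an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def is_noncrossing(part):
    """True unless some a < b < c < d has a, c in one block and b, d in another."""
    block_of = {x: i for i, block in enumerate(part) for x in block}
    pts = sorted(block_of)
    n = len(pts)
    for a in range(n):
        for b in range(a + 1, n):
            for c in range(b + 1, n):
                for d in range(c + 1, n):
                    ba, bb, bc, bd = (block_of[pts[i]] for i in (a, b, c, d))
                    if ba == bc and bb == bd and ba != bb:
                        return False
    return True


def count_noncrossing(n):
    """Number of non-crossing partitions of n points placed around a disk."""
    return sum(is_noncrossing(p) for p in set_partitions(list(range(n))))


def catalan(n):
    return comb(2 * n, n) // (n + 1)


counts = [count_noncrossing(n) for n in range(1, 7)]
```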