-
Unified Multimodal Understanding via Byte-Pair Visual Encoding
Authors:
Wanpeng Zhang,
Yicheng Feng,
Hao Luo,
Yijiang Li,
Zihao Yue,
Sipeng Zheng,
Zongqing Lu
Abstract:
Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.
Submitted 30 June, 2025;
originally announced June 2025.
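The abstract above describes applying byte-pair encoding to visual tokens. As a rough illustration of the underlying merge idea only, here is a minimal sketch of plain frequency-based BPE over flattened sequences of integer token ids. It is not the authors' priority-guided scheme: the spatial-consistency term is omitted, and the function names (`most_frequent_pair`, `merge_pair`, `learn_merges`) and the `first_new_id` parameter are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent token pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    # most_common keeps first-seen order among ties, so the result is deterministic.
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges; return (merge table, encoded sequences)."""
    sequences = [list(s) for s in sequences]
    merges = {}
    next_id = first_new_id
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges[pair] = next_id
        sequences = [merge_pair(s, pair, next_id) for s in sequences]
        next_id += 1
    return merges, sequences
```

For example, `learn_merges([[1, 2, 1, 2, 3], [1, 2, 3]], 2, 100)` first merges the pair `(1, 2)` into token `100`, then `(100, 3)` into token `101`. A 2D-aware variant would also need to count vertically adjacent pairs, which this sketch does not attempt.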
-
MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Authors:
Yuxuan Luo,
Yuhui Yuan,
Junwen Chen,
Haonan Cai,
Ziyi Yue,
Yuwei Yang,
Fatima Zohra Daha,
Ji Li,
Zhouhui Lian
Abstract:
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
Submitted 13 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
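The abstract above measures factual fidelity by graph-edit distance between knowledge graphs. The sketch below shows one simplified way such a distance can be computed, under the assumption that entities are matched by exact label and every insertion or deletion costs 1. It is not the MMMG-Score formula, which the abstract does not give, and the function name and the triple format `(head, relation, tail)` are hypothetical.

```python
def labelled_graph_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Unit-cost edit distance between two graphs whose nodes are matched by label.

    With label-matched nodes, the only edits are node and edge insertions and
    deletions, so the distance is the size of the two symmetric differences.
    """
    nodes_a, nodes_b = set(nodes_a), set(nodes_b)
    edges_a, edges_b = set(edges_a), set(edges_b)
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


# Hypothetical reference KG and a KG extracted from a generated image.
reference_nodes = {"sun", "earth", "moon"}
reference_edges = {("earth", "orbits", "sun"), ("moon", "orbits", "earth")}
generated_nodes = {"sun", "earth"}
generated_edges = {("earth", "orbits", "sun")}

distance = labelled_graph_edit_distance(
    reference_nodes, reference_edges, generated_nodes, generated_edges
)
```

Here the generated graph is missing one entity and one relation, so `distance` is 2. General graph-edit distance without label matching is much harder to compute and would need a dedicated solver.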
-
SAR2Struct: Extracting 3D Semantic Structural Representation of Aircraft Targets from Single-View SAR Image
Authors:
Ziyu Yue,
Ruixi You,
Feng Xu
Abstract:
Translating synthetic aperture radar (SAR) images into interpretable forms for human understanding is the ultimate goal of SAR advanced information retrieval. Existing methods mainly focus on 3D surface reconstruction or local geometric feature extraction of targets, neglecting the role of structural modeling in capturing semantic information. This paper proposes a novel task: SAR target structure recovery, which aims to infer the components of a target and the structural relationships between its components, specifically symmetry and adjacency, from a single-view SAR image. By learning the structural consistency and geometric diversity across targets of the same type as observed in different SAR images, it aims to derive the semantic representation of a target directly from its 2D SAR image. To solve this challenging task, a two-step algorithmic framework based on structural descriptors is developed. Specifically, in the training phase, it first detects 2D keypoints from real SAR images, and then learns the mapping from these keypoints to 3D hierarchical structures using simulated data. During the testing phase, these two steps are integrated to infer the 3D structure from real SAR images. Experimental results validate the effectiveness of each step and demonstrate, for the first time, that a 3D semantic structural representation of aircraft targets can be directly derived from a single-view SAR image.
Submitted 7 June, 2025;
originally announced June 2025.
-
VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
Authors:
Zikang Wang,
Boyu Chen,
Zhengrong Yue,
Yi Wang,
Yu Qiao,
Limin Wang,
Yali Wang
Abstract:
Recent advances in video understanding have been driven by multimodal large language models (MLLMs). However, these MLLMs are good at analyzing short videos but struggle to understand videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents for retrieving extra contextual knowledge in a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots; that is, to answer a user question about a long video, it is critical to deeply understand its relevant shots, as a human would. Without such insight, these agents often mistakenly retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Unlike previous works, VideoChat-A1 can think deeply with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and looks into these shots in a coarse-to-fine partition. By multimodal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process, allowing it to interactively discover preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks; e.g., it achieves 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., Intern2.5VL-8B and InternVideo2.5-8B) by up to 10.8% and 6.2%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy, but with 7% of the input frames and 12% of the inference time on average.
Submitted 6 June, 2025;
originally announced June 2025.
-
SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
Authors:
Jianyi Wang,
Shanchuan Lin,
Zhijie Lin,
Yuxi Ren,
Meng Wei,
Zongsheng Yue,
Shangchen Zhou,
Hao Chen,
Yang Zhao,
Ceyuan Yang,
Xuefeng Xiao,
Chen Change Loy,
Lu Jiang
Abstract:
Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet incur a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding the window inconsistency observed under high-resolution VR when using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.
Submitted 5 June, 2025;
originally announced June 2025.
-
MiMo-VL Technical Report
Authors:
Xiaomi LLM-Core Team,
Zihao Yue,
Zhenru Lin,
Yifan Song,
Weikun Wang,
Shuhuai Ren,
Shuhao Gu,
Shicheng Li,
Peidian Li,
Liang Zhao,
Lei Li,
Kainan Bao,
Hao Tian,
Hailin Zhang,
Gang Wang,
Dawei Zhu,
Cici,
Chenhong He,
Bowen Ye,
Bowen Shen,
Zihan Zhang,
Zihan Jiang,
Zhixian Zheng,
Zhichao Song
, et al. (50 additional authors not shown)
Abstract:
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
Submitted 4 June, 2025;
originally announced June 2025.
-
Hybrid Latent Reasoning via Reinforcement Learning
Authors:
Zhenrui Yue,
Bowen Jin,
Huimin Zeng,
Honglei Zhuang,
Zhen Qin,
Jinsung Yoon,
Lanyu Shang,
Jiawei Han,
Dong Wang
Abstract:
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
Submitted 23 May, 2025;
originally announced May 2025.
-
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
Authors:
Chin-Jou Li,
Eunjung Yeo,
Kwanghee Choi,
Paula Andrea Pérez-Toro,
Masao Someki,
Rohan Kumar Das,
Zhengjun Yue,
Juan Rafael Orozco-Arroyave,
Elmar Nöth,
David R. Mortensen
Abstract:
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
Submitted 30 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
ModRWKV: Transformer Multimodality in Linear Time
Authors:
Jiale Kang,
Ziyin Yue,
Qingyu Yin,
Jiang Rui,
Weile Li,
Zening Lu,
Zhouran Ji
Abstract:
Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone, which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model's ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.
Submitted 20 May, 2025;
originally announced May 2025.
-
A system identification approach to clustering vector autoregressive time series
Authors:
Zuogong Yue,
Xinyi Wang,
Victor Solo
Abstract:
Clustering of time series based on their underlying dynamics continues to attract researchers because of its value in assisting complex system modelling. Most current time series clustering methods handle only scalar time series, treat them as white noise, or rely on domain knowledge for high-quality feature construction, where the autocorrelation pattern/feature is mostly ignored. Instead of relying on heuristic feature/metric construction, the system identification approach allows treating vector time series clustering by explicitly considering their underlying autoregressive dynamics. We first derive a clustering algorithm based on a mixture autoregressive model. Unfortunately, it turns out to have significant computational problems. We then derive a 'small-noise' limiting version of the algorithm, which we call k-LMVAR (Limiting Mixture Vector AutoRegression), that is computationally manageable. We develop an associated BIC criterion for choosing the number of clusters and model order. The algorithm performs very well in comparative simulations and also scales well computationally.
Submitted 20 May, 2025;
originally announced May 2025.
-
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking
Authors:
Sicheng Shen,
Dongcheng Zhao,
Linghao Feng,
Zeyang Yue,
Jindong Li,
Tenglong Li,
Guobin Shen,
Yi Zeng
Abstract:
Spiking Transformers have recently emerged as promising architectures for combining the efficiency of spiking neural networks with the representational power of self-attention. However, the lack of standardized implementations, evaluation pipelines, and consistent design choices has hindered fair comparison and principled analysis. In this paper, we introduce STEP, a unified benchmark framework for Spiking Transformers that supports a wide range of tasks, including classification, segmentation, and detection across static, event-based, and sequential datasets. STEP provides modular support for diverse components such as spiking neurons, input encodings, surrogate gradients, and multiple backends (e.g., SpikingJelly, BrainCog). Using STEP, we reproduce and evaluate several representative models, and conduct systematic ablation studies on attention design, neuron types, encoding schemes, and temporal modeling capabilities. We also propose a unified analytical model for energy estimation, accounting for spike sparsity, bitwidth, and memory access, and show that quantized ANNs may offer comparable or better energy efficiency. Our results suggest that current Spiking Transformers rely heavily on convolutional frontends and lack strong temporal modeling, underscoring the need for spike-native architectural innovations. The full code is available at: https://github.com/Fancyssc/STEP
Submitted 16 May, 2025;
originally announced May 2025.
-
SnapNCode: An Integrated Development Environment for Programming Physical Objects Interactions
Authors:
Xiaoyan Wei,
Zijian Yue,
Hsiang-Ting Chen
Abstract:
Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In response to this challenge, we introduce SnapNCode, a novel IDE for spatial programming. SnapNCode enables programmers to capture various states of physical objects through live video streams from cameras and directly insert these visual representations into their code. Moreover, users can augment physical objects by attaching code snippets to them, which are opportunistically triggered when the objects are observed by cameras. We conducted a user study (N=12) to assess the usability of SnapNCode. Feedback from participants indicates that the system is easy to use and holds promise for casual daily use and integration into a broader range of workflows.
Submitted 14 May, 2025;
originally announced May 2025.
-
Context-AI Tunes: Context-Aware AI-Generated Music for Stress Reduction
Authors:
Xiaoyan Wei,
Zebang Zhang,
Zijian Yue,
Hsiang-Ting Chen
Abstract:
Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that generates personalized music based on environmental inputs and the user's self-assessed stress level. A 2x2 within-subject experiment (N=26) was conducted with two independent variables: AI (AI, NoAI) and Environment (Busy Hub, Quiet Library). CAT's effectiveness in reducing stress was evaluated using the Visual Analog Scale for Stress (VAS-S). Results show that CAT is more effective than manually chosen music in reducing stress by adapting to user context.
Submitted 14 May, 2025;
originally announced May 2025.
-
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Authors:
LLM-Core Xiaomi,
Bingquan Xia,
Bowen Shen,
Cici,
Dawei Zhu,
Di Zhang,
Gang Wang,
Hailin Zhang,
Huaqiu Liu,
Jiebao Xiao,
Jinhao Dong,
Liang Zhao,
Peidian Li,
Peng Wang,
Shihua Yu,
Shimao Chen,
Weikun Wang,
Wenhan Ma,
Xiangwei Deng,
Yi Huang,
Yifan Song,
Zihan Jiang,
Bowen Ye,
Can Cai
, et al. (40 additional authors not shown)
Abstract:
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
Submitted 5 June, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Authors:
Bohan Wang,
Zhongqi Yue,
Fengda Zhang,
Shuo Chen,
Li'an Bi,
Junzhe Zhang,
Xue Song,
Kennard Yanting Chan,
Jiachun Pan,
Weijia Wu,
Mingze Zhou,
Wang Lin,
Kaihang Pan,
Saining Zhang,
Liyu Jia,
Wentao Hu,
Wei Zhao,
Hanwang Zhang
Abstract:
We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: (1) Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives. (2) We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Besides the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy gradient RL working in the visual tokens can significantly boost the visual generation benchmark, surpassing all the existing models by a large margin. Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. When combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM. Project Page: https://selftok-team.github.io/report/.
Submitted 27 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
ACORN: Adaptive Contrastive Optimization for Safe and Robust Fine-Grained Robotic Manipulation
Authors:
Zhongquan Zhou,
Shuhao Li,
Zixian Yue
Abstract:
Embodied AI research has traditionally emphasized performance metrics such as success rate and cumulative reward, overlooking critical robustness and safety considerations that emerge during real-world deployment. In actual environments, agents continuously encounter unpredicted situations and distribution shifts, causing seemingly reliable policies to experience catastrophic failures, particularly in manipulation tasks. To address this gap, we introduce four novel safety-centric metrics that quantify an agent's resilience to environmental perturbations. Building on these metrics, we present Adaptive Contrastive Optimization for Robust Manipulation (ACORN), a plug-and-play algorithm that enhances policy robustness without sacrificing performance. ACORN leverages contrastive learning to simultaneously align trajectories with expert demonstrations while diverging from potentially unsafe behaviors. Our approach efficiently generates informative negative samples through structured Gaussian noise injection, employing a double perturbation technique that maintains sample diversity while minimizing computational overhead. Comprehensive experiments across diverse manipulation environments validate ACORN's effectiveness, yielding improvements of up to 23% in safety metrics under disturbance compared to baseline methods. These findings underscore ACORN's significant potential for enabling reliable deployment of embodied agents in safety-critical real-world applications.
Submitted 10 May, 2025;
originally announced May 2025.
-
Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning
Authors:
Wang Lin,
Liyu Jia,
Wentao Hu,
Kaihang Pan,
Zhongqi Yue,
Wei Zhao,
Jingyuan Chen,
Fei Wu,
Hanwang Zhang
Abstract:
Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. The recursive visual tokens enable symbolic reasoning by a large language model. Building on these tokens, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
Submitted 22 April, 2025;
originally announced April 2025.
-
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Authors:
Kaihang Pan,
Wang Lin,
Zhongqi Yue,
Tenglong Ao,
Liyu Jia,
Wei Zhao,
Juncheng Li,
Siliang Tang,
Hanwang Zhang
Abstract:
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLMs and diffusion models, the state of the art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, and hence form an impossible language for an LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: https://DDT-LLaMA.github.io/.
Submitted 20 April, 2025;
originally announced April 2025.
-
Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
Authors:
Zichao Yue,
Chenhui Deng,
Zhiru Zhang
Abstract:
Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue. Pre-propagation GNNs (PP-GNNs) represent a new class of models that decouple feature propagation from training through pre-processing, addressing neighbor explosion in theory. Yet, their practical advantages and system-level optimizations remain underexplored. This paper provides a comprehensive characterization of PP-GNNs, comparing them with graph-sampling-based methods in training efficiency, scalability, and accuracy. While PP-GNNs achieve comparable accuracy, we identify data loading as the key bottleneck for training efficiency and input expansion as a major scalability challenge. To address these issues, we propose optimized data loading schemes and tailored training methods that improve PP-GNN training throughput by an average of 15$\times$ over the PP-GNN baselines, with speedups of up to two orders of magnitude compared to sampling-based GNNs on large graph benchmarks. Our implementation is publicly available at https://github.com/cornell-zhang/preprop-gnn.
Submitted 17 April, 2025;
originally announced April 2025.
-
Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?
Authors:
Hansi Zeng,
Kai Hui,
Honglei Zhuang,
Zhen Qin,
Zhenrui Yue,
Hamed Zamani,
Dana Alon
Abstract:
While metrics available during pre-training, such as perplexity, correlate well with model performance in scaling-law studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs, differing in their pre-training, will perform better after supervised fine-tuning (SFT). We construct a dataset using 50 1B-parameter LLM variants with systematically varied pre-training configurations, e.g., objectives or data, and evaluate them on diverse downstream tasks after SFT. We first conduct a study and demonstrate that conventional perplexity is a misleading indicator. As such, we introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%. Despite the inherent complexity of this task, we demonstrate the practical utility of our proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.
Submitted 16 April, 2025;
originally announced April 2025.
-
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
Authors:
Minghe Gao,
Xuqi Liu,
Zhongqi Yue,
Yang Wu,
Shuang Chen,
Juncheng Li,
Siliang Tang,
Fei Wu,
Tat-Seng Chua,
Yueting Zhuang
Abstract:
Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signals to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought (CoT) reward model automatically. It generates code for solving visual tasks and transforms the analysis of code blocks into the evaluation of CoT steps as training samples. Then, we train the SVIP-Reward model using a multi-head attention mechanism called TriAtt-CoT. The advantages of SVIP-Reward are evident throughout the entire process of MLLM. We also introduce a benchmark for CoT reward model training and testing. Experimental results demonstrate that SVIP-Reward improves MLLM performance across training and inference-time scaling, yielding better results on benchmarks while reducing hallucinations and enhancing reasoning ability.
Submitted 9 April, 2025;
originally announced April 2025.
-
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Authors:
Ye Wang,
Ziheng Wang,
Boshen Xu,
Yang Du,
Kejun Lin,
Zihan Xiao,
Zihao Yue,
Jianzhong Ju,
Liang Zhang,
Dingyi Yang,
Xiangnan Fang,
Zewen He,
Zhenbo Luo,
Wenxuan Wang,
Junqi Lin,
Jian Luan,
Qin Jin
Abstract:
Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.
Submitted 29 June, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
Authors:
Zhengrong Yue,
Shaobin Zhuang,
Kunchang Li,
Yanbo Ding,
Yali Wang
Abstract:
Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description of user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, by a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a systematic workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates their text prompts of key shot content. Via a concise video-to-shot prompting paradigm, it allows our V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively searches for the matched style model in a style tree. Via a robust tree-of-thought searching paradigm, it allows our V-Stylist to precisely specify vague style preference in the open user query. (3) Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi-round self-reflection paradigm, it allows our V-Stylist to adaptively adjust detail control, according to the style requirement. With such a distinct design of mimicking human professionals, our V-Stylist achieves a major breakthrough over the primary challenges for effective and automatic video stylization. Moreover, we further construct a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), which fills the gap to assess stylization of complex videos on open user queries. Extensive experiments show that V-Stylist achieves state-of-the-art performance, e.g., V-Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics, marking a significant advance in video stylization.
Submitted 15 March, 2025;
originally announced March 2025.
-
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Authors:
Boyu Chen,
Zhengrong Yue,
Siran Chen,
Zikang Wang,
Yang Liu,
Peng Li,
Yali Wang
Abstract:
Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream agent-based methods use external tools (e.g., search engines, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1. Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2. Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3. Action: Agents answer long video-related questions and exchange reasons. 4. Reflection: We evaluate the performance of each agent in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through this multi-round dynamic collaboration. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 13.3% compared with SOTA.
Submitted 31 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Authors:
Bowen Jin,
Hansi Zeng,
Zhenrui Yue,
Jinsung Yoon,
Sercan Arik,
Dong Wang,
Hamed Zamani,
Jiawei Han
Abstract:
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
Submitted 8 April, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Authors:
Yuqi Liu,
Bohao Peng,
Zhisheng Zhong,
Zihao Yue,
Fanbin Lu,
Bei Yu,
Jiaya Jia
Abstract:
Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.
Submitted 28 June, 2025; v1 submitted 9 March, 2025;
originally announced March 2025.
-
Optimizing Carbon Footprint in ICT through Swarm Intelligence with Algorithmic Complexity
Authors:
Vasileios Alevizos,
Nikitas Gerolimos,
Sabrina Edralin,
Clark Xu,
Akebu Simasiku,
Georgios Priniotakis,
George Papakostas,
Zongliang Yue
Abstract:
Global emissions from fossil fuel combustion and cement production were recorded in 2022, signaling a resurgence to pre-pandemic levels and providing an apodictic indication that emission peaks have not yet been achieved. Significant contributions to this upward trend are made by the Information and Communication Technology (ICT) industry due to its substantial energy consumption. This shows the need for further exploration of swarm intelligence applications to measure and optimize the carbon footprint within ICT. All causative factors are evaluated based on the quality of data collection; variations from each source are quantified; and an objective function related to carbon footprint in ICT energy management is optimized. Emphasis is placed on the asyndetic integration of data sources to construct a convex optimization problem. An apodictic necessity to prevent the erosion of accuracy in carbon footprint assessments is addressed. Complexity percentages ranged from 5.25% for the Bat Algorithm to 7.87% for Fast Bacterial Swarming, indicating significant fluctuations in resource intensity among algorithms. These findings suggest that we were able to quantify the environmental impact of various swarm algorithms.
Submitted 19 January, 2025;
originally announced January 2025.
-
Integrating Artificial Open Generative Artificial Intelligence into Software Supply Chain Security
Authors:
Vasileios Alevizos,
George A Papakostas,
Akebu Simasiku,
Dimitra Malliarou,
Antonis Messinis,
Sabrina Edralin,
Clark Xu,
Zongliang Yue
Abstract:
While new technologies emerge, human errors are always looming. As the software supply chain becomes increasingly complex and intertwined, the security of a service has become paramount to ensuring the integrity of products, safeguarding data privacy, and maintaining operational continuity. In this work, we conducted experiments applying promising open Large Language Models (LLMs) to two main software security challenges: source code language errors and deprecated code, with a focus on their potential to replace conventional static and dynamic security scanners that rely on predefined rules and patterns. Our findings suggest that while LLMs present some unexpected results, they also encounter significant limitations, particularly in memory complexity and the management of new and unfamiliar data patterns. Despite these challenges, the proactive application of LLMs, coupled with extensive security databases and continuous updates, holds the potential to fortify Software Supply Chain (SSC) processes against emerging threats.
Submitted 26 December, 2024;
originally announced December 2024.
-
Logarithmic Positional Partition Interval Encoding
Authors:
Vasileios Alevizos,
Nikitas Gerolimos,
Sabrina Edralin,
Clark Xu,
Akebu Simasiku,
Georgios Priniotakis,
George Papakostas,
Zongliang Yue
Abstract:
One requirement of maintaining digital information is storage. With the latest advances in the digital world, new emerging media types have required even more storage space to be kept than before. In fact, in many cases it is required to have larger amounts of storage to keep up with protocols that support more types of information at the same time. In contrast, compression algorithms have been integrated to facilitate the transfer of larger data. Numerical representations are construed as embodiments of information. However, this correct association of a sequence could feasibly be inverted to signify an elongated series of numerals. In this work, a novel mathematical paradigm was introduced to engineer a methodology reliant on iterative logarithmic transformations, finely tuned to numeric sequences. Through this fledgling approach, an intricate interplay of polymorphic numeric manipulations was conducted. By applying repeated logarithmic operations, the data were condensed into a minuscule representation. The approach surpassed the ZIP compression method approximately thirteenfold. Such extreme compaction, achieved through iterative reduction of expansive integers until they manifested as single-digit entities, conferred a novel sense of informational embodiment. Instead of relegating data to classical discrete encodings, this method transformed them into a quasi-continuous, logarithmically. By contrast, this introduced approach revealed that morphing data into deeply compressed numerical substrata beyond conventional boundaries was feasible. A holistic perspective emerges, validating that numeric data can be recalibrated into ephemeral sequences of logarithmic impressions. It was not merely a matter of reducing digits, but of reinterpreting data through a resolute numeric vantage.
Submitted 15 December, 2024;
originally announced December 2024.
-
Arbitrary-steps Image Super-resolution via Diffusion Inversion
Authors:
Zongsheng Yue,
Kang Liao,
Chen Change Loy
Abstract:
This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result. Compared to existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates superior or comparable performance to recent state-of-the-art approaches. The code and model are publicly available at https://github.com/zsyOAOA/InvSR.
Submitted 13 March, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness
Authors:
Qifan Yu,
Zhebei Shen,
Zhongqi Yue,
Yang Wu,
Wenqiao Zhang,
Yunfei Li,
Juncheng Li,
Siliang Tang,
Yueting Zhuang
Abstract:
Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and representative of the sample distribution (i.e., not an outlier). We further propose practical ways to score against each principle, which automatically adapt to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the "Less is More" philosophy in MLLM development.
Submitted 9 December, 2024;
originally announced December 2024.
-
Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
Authors:
Xinyu Hou,
Zongsheng Yue,
Xiaoming Li,
Chen Change Loy
Abstract:
In this work, we introduce a single parameter, $ω$, to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model's reverse process. Our approach does not require model retraining, architectural modifications, or additional computational overhead during inference, yet enables precise control over the level of detail in the generated outputs. Moreover, spatial masks or denoising schedules with varying $ω$ values can be applied to achieve region-specific or timestep-specific granularity control. Prior knowledge of image composition from control signals or reference images further facilitates the creation of precise $ω$ masks for granularity control on specific objects. To highlight the parameter's role in controlling subtle detail variations, the technique is named Omegance, combining "omega" and "nuance". Our method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models. The code is available at https://github.com/itsmag11/Omegance.
Submitted 26 November, 2024;
originally announced November 2024.
-
VideoOrion: Tokenizing Object Dynamics in Videos
Authors:
Yicheng Feng,
Yijiang Li,
Wanpeng Zhang,
Hao Luo,
Zihao Yue,
Sipeng Zheng,
Zongqing Lu
Abstract:
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos - the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
Submitted 18 March, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Authors:
Qifan Yu,
Wei Chow,
Zhongqi Yue,
Kaihang Pan,
Yang Wu,
Xiaoyang Wan,
Juncheng Li,
Siliang Tang,
Hanwang Zhang,
Yueting Zhuang
Abstract:
Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity.
Submitted 29 March, 2025; v1 submitted 24 November, 2024;
originally announced November 2024.
-
Cyclic Vision-Language Manipulator: Towards Reliable and Fine-Grained Image Interpretation for Automated Report Generation
Authors:
Yingying Fang,
Zihao Jin,
Shaojie Guo,
Jinda Liu,
Zhiling Yue,
Yijian Gao,
Junzhi Ning,
Zhi Li,
Simon Walsh,
Guang Yang
Abstract:
Despite significant advancements in automated report generation, the opaqueness of text interpretability continues to cast doubt on the reliability of the content produced. This paper introduces a novel approach to identify specific image features in X-ray images that influence the outputs of report generation models. Specifically, we propose the Cyclic Vision-Language Manipulator (CVLM), a module to generate a manipulated X-ray from an original X-ray and its report from a designated report generator. The essence of CVLM is that cycling manipulated X-rays to the report generator produces altered reports aligned with the alterations pre-injected into the reports for X-ray generation, hence the term "cyclic manipulation". This process allows direct comparison between original and manipulated X-rays, clarifying the critical image features driving changes in reports and enabling model users to assess the reliability of the generated texts. Empirical evaluations demonstrate that CVLM can identify more precise and reliable features compared to existing explanation methods, significantly enhancing the transparency and applicability of AI-generated reports.
Submitted 18 June, 2025; v1 submitted 7 November, 2024;
originally announced November 2024.
-
Enhancing Weakly Supervised Semantic Segmentation for Fibrosis via Controllable Image Generation
Authors:
Zhiling Yue,
Yingying Fang,
Liutao Yang,
Nikhil Baid,
Simon Walsh,
Guang Yang
Abstract:
Fibrotic Lung Disease (FLD) is a severe condition marked by lung stiffening and scarring, leading to respiratory decline. High-resolution computed tomography (HRCT) is critical for diagnosing and monitoring FLD; however, fibrosis appears as irregular, diffuse patterns with unclear boundaries, leading to high inter-observer variability and time-intensive manual annotation. To tackle this challenge, we propose DiffSeg, a novel weakly supervised semantic segmentation (WSSS) method that uses image-level annotations to generate pixel-level fibrosis segmentation, reducing the need for fine-grained manual labeling. Additionally, our DiffSeg incorporates a diffusion-based generative model to synthesize HRCT images with different levels of fibrosis from healthy slices, enabling the generation of the fibrosis-injected slices and their paired fibrosis location. Experiments indicate that our method significantly improves the accuracy of pseudo masks generated by existing WSSS methods, greatly reducing the complexity of manual labeling and enhancing the consistency of the generated masks.
Submitted 5 November, 2024;
originally announced November 2024.
-
Transferable Sequential Recommendation via Vector Quantized Meta Learning
Authors:
Zhenrui Yue,
Huimin Zeng,
Yang Zhang,
Julian McAuley,
Dong Wang
Abstract:
While sequential recommendation achieves significant progress on capturing user-item transition patterns, transferring such large-scale recommender systems remains challenging due to the disjoint user and item groups across domains. In this paper, we propose a vector quantized meta learning for transferable sequential recommenders (MetaRec). Without requiring additional modalities or shared information across domains, our approach leverages user-item interactions from multiple source domains to improve the target domain performance. To solve the input heterogeneity issue, we adopt vector quantization that maps item embeddings from heterogeneous input spaces to a shared feature space. Moreover, our meta transfer paradigm exploits limited target data to guide the transfer of source domain knowledge to the target domain (i.e., learn to transfer). In addition, MetaRec adaptively transfers from multiple source tasks by rescaling meta gradients based on the source-target domain similarity, enabling selective learning to improve recommendation performance. To validate the effectiveness of our approach, we perform extensive experiments on benchmark datasets, where MetaRec consistently outperforms baseline methods by a considerable margin.
Submitted 3 November, 2024;
originally announced November 2024.
-
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Authors:
Xiangyu Zeng,
Kunchang Li,
Chenting Wang,
Xinhao Li,
Tianxiang Jiang,
Ziang Yan,
Songze Li,
Yansong Shi,
Zhengrong Yue,
Yi Wang,
Yali Wang,
Yu Qiao,
Limin Wang
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt the existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction tuning task to explicitly incorporate the grounding supervision in the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined as VideoChat-T, by implementing a token shuffling to compress long video tokens and introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of visual representation. Meanwhile, we introduce TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, to perform detailed video descriptions with the corresponding timestamp predictions. This explicit temporal location prediction will guide the MLLM to correctly attend to the visual content when generating descriptions, and thus reduce the hallucination risk caused by the LLMs. Experimental results demonstrate that our TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLMs, achieving improvements of 5.6% and 6.8% on the benchmarks of Egoschema and VideoMME, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming the existing state-of-the-art MLLMs. After fine-tuning, it performs on par with the traditional supervised expert models.
Submitted 12 February, 2025; v1 submitted 25 October, 2024;
originally announced October 2024.
-
Inference Scaling for Long-Context Retrieval Augmented Generation
Authors:
Zhenrui Yue,
Honglei Zhuang,
Aijun Bai,
Kai Hui,
Rolf Jagerman,
Hansi Zeng,
Zhen Qin,
Dong Wang,
Xuanhui Wang,
Michael Bendersky
Abstract:
The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring the combination of multiple strategies beyond simply increasing the quantity of knowledge, including in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs' ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, which align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
Submitted 2 March, 2025; v1 submitted 5 October, 2024;
originally announced October 2024.
-
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors
Authors:
Aiping Zhang,
Zongsheng Yue,
Renjing Pei,
Wenqi Ren,
Xiaochun Cao
Abstract:
Diffusion-based image super-resolution (SR) methods have achieved remarkable success by leveraging large pre-trained text-to-image diffusion models as priors. However, these methods still face two challenges: the requirement for dozens of sampling steps to achieve satisfactory results, which limits efficiency in real scenarios, and the neglect of degradation models, which are critical auxiliary information in solving the SR problem. In this work, we introduced a novel one-step SR model, which significantly addresses the efficiency issue of diffusion-based SR methods. Unlike existing fine-tuning strategies, we designed a degradation-guided Low-Rank Adaptation (LoRA) module specifically for SR, which corrects the model parameters based on the pre-estimated degradation information from low-resolution images. This module not only facilitates a powerful data-dependent or degradation-dependent SR model but also preserves the generative prior of the pre-trained diffusion model as much as possible. Furthermore, we tailor a novel training pipeline by introducing an online negative sample generation strategy. Combined with the classifier-free guidance strategy during inference, it largely improves the perceptual quality of the super-resolution results. Extensive experiments have demonstrated the superior efficiency and effectiveness of the proposed model compared to recent state-of-the-art methods.
Submitted 25 September, 2024;
originally announced September 2024.
-
Train Once, Deploy Anywhere: Matryoshka Representation Learning for Multimodal Recommendation
Authors:
Yueqi Wang,
Zhenrui Yue,
Huimin Zeng,
Dong Wang,
Julian McAuley
Abstract:
Despite recent advancements in language and vision modeling, integrating rich multimodal knowledge into recommender systems continues to pose significant challenges. This is primarily due to the need for efficient recommendation, which requires adaptive and interactive responses. In this study, we focus on sequential recommendation and introduce a lightweight framework called full-scale Matryoshka representation learning for multimodal recommendation (fMRLRec). Our fMRLRec captures item features at different granularities, learning informative representations for efficient recommendation across multiple dimensions. To integrate item features from diverse modalities, fMRLRec employs a simple mapping to project multimodal item features into an aligned feature space. Additionally, we design an efficient linear transformation that embeds smaller features into larger ones, substantially reducing memory requirements for large-scale training on recommendation data. Combined with improved state space modeling techniques, fMRLRec scales to different dimensions and only requires one-time training to produce multiple models tailored to various granularities. We demonstrate the effectiveness and efficiency of fMRLRec on multiple benchmark datasets, which consistently achieves superior performance over state-of-the-art baseline methods. We make our code and data publicly available at https://github.com/yueqirex/fMRLRec.
Submitted 2 October, 2024; v1 submitted 25 September, 2024;
originally announced September 2024.
-
k-MLE, k-Bregman, k-VARs: Theory, Convergence, Computation
Authors:
Zuogong Yue,
Victor Solo
Abstract:
We develop hard clustering based on likelihood rather than distance and prove convergence. We also provide simulations and real data examples.
Submitted 10 September, 2024;
originally announced September 2024.
-
Unveiling Visual Biases in Audio-Visual Localization Benchmarks
Authors:
Liangyu Chen,
Zihao Yue,
Boshen Xu,
Qin Jin
Abstract:
Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, where the vision-only models outperform all audiovisual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
Submitted 25 August, 2024;
originally announced September 2024.
-
Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction
Authors:
Xiaodi Li,
Jianfeng Gui,
Qian Gao,
Haoyuan Shi,
Zhenyu Yue
Abstract:
Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.
Submitted 3 September, 2024; v1 submitted 30 August, 2024;
originally announced August 2024.
-
DRExplainer: Quantifiable Interpretability in Drug Response Prediction with Directed Graph Convolutional Network
Authors:
Haoyuan Shi,
Tao Xu,
Xiaodi Li,
Qian Gao,
Zhiwei Xiong,
Junfeng Xia,
Zhenyu Yue
Abstract:
Predicting the response of a cancer cell line to a therapeutic drug is pivotal for personalized medicine. Although numerous deep learning methods have been developed for drug response prediction, integrating diverse information about biological entities and predicting the directional response remain major challenges. Here, we propose a novel interpretable predictive model, DRExplainer, which leverages a directed graph convolutional network to enhance the prediction in a directed bipartite network framework. DRExplainer constructs a directed bipartite network integrating multi-omics profiles of cell lines, the chemical structure of drugs and known drug response to achieve directed prediction. Then, DRExplainer identifies the most relevant subgraph to each prediction in this directed bipartite network by learning a mask, facilitating critical medical decision-making. Additionally, we introduce a quantifiable method for model interpretability that leverages a ground truth benchmark dataset curated from biological features. In computational experiments, DRExplainer outperforms state-of-the-art predictive methods and another graph-based explanation method under the same experimental setting. Finally, the case studies further validate the interpretability and the effectiveness of DRExplainer in predicting novel drug responses. Our code is available at: https://github.com/vshy-dream/DRExplainer.
Submitted 27 March, 2025; v1 submitted 22 August, 2024;
originally announced August 2024.
-
MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
Authors:
Yanbo Ding,
Shaobin Zhuang,
Kunchang Li,
Zhengrong Yue,
Yu Qiao,
Yali Wang
Abstract:
Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES addresses this challenging task by developing a progressive workflow with three key components, including (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing benchmarks lack detailed descriptions of complex 3D spatial relationships of multiple objects. To fill this gap, we further construct a new benchmark of T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step of MUSES forward in bridging natural language, 2D image generation, and 3D world. Our codes are available at the following link: https://github.com/DINGYANB/MUSES.
Submitted 15 December, 2024; v1 submitted 20 August, 2024;
originally announced August 2024.
-
GEGA: Graph Convolutional Networks and Evidence Retrieval Guided Attention for Enhanced Document-level Relation Extraction
Authors:
Yanxu Mao,
Xiaohui Chen,
Peipei Liu,
Tiehan Cui,
Zuhui Yue,
Zheng Li
Abstract:
Document-level relation extraction (DocRE) aims to extract relations between entities from unstructured document text. Compared to sentence-level relation extraction, it requires more complex semantic understanding from a broader text context. Currently, some studies are utilizing logical rules within evidence sentences to enhance the performance of DocRE. However, in the data without provided evidence sentences, researchers often obtain a list of evidence sentences for the entire document through evidence retrieval (ER). Therefore, DocRE suffers from two challenges: firstly, the relevance between evidence and entity pairs is weak; secondly, there is insufficient extraction of complex cross-relations between long-distance multi-entities. To overcome these challenges, we propose GEGA, a novel model for DocRE. The model leverages graph neural networks to construct multiple weight matrices, guiding attention allocation to evidence sentences. It also employs multi-scale representation aggregation to enhance ER. Subsequently, we integrate the most efficient evidence information to implement both fully supervised and weakly supervised training processes for the model. We evaluate the GEGA model on three widely used benchmark datasets: DocRED, Re-DocRED, and Revisit-DocRED. The experimental results indicate that our model has achieved comprehensive improvements compared to the existing SOTA model.
Submitted 8 September, 2024; v1 submitted 31 July, 2024;
originally announced July 2024.
-
Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding
Authors:
Jiangtao Zhang,
Zongsheng Yue,
Hui Wang,
Qian Zhao,
Deyu Meng
Abstract:
Blind image deconvolution (BID) is a classic yet challenging problem in the field of image processing. Recent advances in deep image prior (DIP) have motivated a series of DIP-based approaches, demonstrating remarkable success in BID. However, due to the high non-convexity of the inherent optimization process, these methods are notorious for their sensitivity to the initialized kernel. To alleviate this issue and further improve their performance, we propose a new framework for BID that better considers the prior modeling and the initialization for blur kernels, leveraging a deep generative model. The proposed approach pre-trains a generative adversarial network-based kernel generator that aptly characterizes the kernel priors and a kernel initializer that facilitates a well-informed initialization for the blur kernel through latent space encoding. With the pre-trained kernel generator and initializer, one can obtain a high-quality initialization of the blur kernel, and enable optimization within a compact latent kernel manifold. Such a framework results in an evident performance improvement over existing DIP-based BID methods. Extensive experiments on different datasets demonstrate the effectiveness of the proposed method.
Submitted 20 July, 2024;
originally announced July 2024.
-
SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling
Authors:
Huizheng Wang,
Jiahao Fang,
Xinru Tang,
Zhiheng Yue,
Jinxi Li,
Yubin Qin,
Sihan Guan,
Qize Yang,
Yang Wang,
Chao Li,
Yang Hu,
Shouyi Yin
Abstract:
Benefiting from the self-attention mechanism, Transformer models have attained impressive contextual comprehension capabilities for lengthy texts. The requirements of high-throughput inference arise as the large language models (LLMs) become increasingly prevalent, which calls for large-scale token parallel processing (LTPP). However, existing dynamic sparse accelerators struggle to effectively handle LTPP, as they solely focus on separate stage optimization, and with most efforts confined to computational enhancements. By re-examining the end-to-end flow of dynamic sparse acceleration, we pinpoint an ever-overlooked opportunity that the LTPP can exploit the intrinsic coordination among stages to avoid excessive memory access and redundant computation. Motivated by our observation, we present SOFA, a cross-stage compute-memory efficient algorithm-hardware co-design, which is tailored to tackle the challenges posed by LTPP of Transformer inference effectively. We first propose a novel leading zero computing paradigm, which predicts attention sparsity by using log-based add-only operations to avoid the significant overhead of prediction. Then, a distributed sorting and a sorted updating FlashAttention mechanism are proposed with a cross-stage coordinated tiling principle, which enables fine-grained and lightweight coordination among stages, helping optimize memory access and latency. Further, we propose a SOFA accelerator to support these optimizations efficiently. Extensive experiments on 20 benchmarks show that SOFA achieves $9.5\times$ speed up and $71.5\times$ higher energy efficiency than Nvidia A100 GPU. Compared to 8 SOTA accelerators, SOFA achieves an average $15.8\times$ energy efficiency, $10.3\times$ area efficiency and $9.3\times$ speed up, respectively.
Submitted 14 July, 2024;
originally announced July 2024.
-
Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement
Authors:
Zijie Yue,
Miaojing Shi,
Hanli Wang,
Shuai Ding,
Qijun Chen,
Shanlin Yang
Abstract:
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To address this, self-supervised learning has recently gained attention; however, its performance is limited by the lack of ground-truth PPG signals. In this paper, we propose a novel self-supervised framework that successfully integrates popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features from these vision-text pairs and then estimate rPPG signals. We develop a series of generative and contrastive learning mechanisms to optimize the VLM, including a text-guided visual map reconstruction task, a vision-text contrastive learning task, and a frequency contrastive and ranking task. Overall, our method is the first to adapt VLMs to digest and align the frequency-related knowledge in the vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods.
Submitted 17 February, 2025; v1 submitted 11 July, 2024;
originally announced July 2024.