Showing 1–50 of 149 results for author: Yue, Z

Searching in archive cs.
  1. arXiv:2506.23639  [pdf, ps, other]

    cs.CV cs.AI

    Unified Multimodal Understanding via Byte-Pair Visual Encoding

    Authors: Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

    Abstract: Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates…

    Submitted 30 June, 2025; originally announced June 2025.

  2. arXiv:2506.10963  [pdf, ps, other]

    cs.CV cs.CL

    MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

    Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian

    Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effec…

    Submitted 13 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: 85 pages, 70 figures, code: https://github.com/MMMGBench/MMMG, project page: https://mmmgbench.github.io/

  3. arXiv:2506.06757  [pdf, ps, other]

    cs.CV

    SAR2Struct: Extracting 3D Semantic Structural Representation of Aircraft Targets from Single-View SAR Image

    Authors: Ziyu Yue, Ruixi You, Feng Xu

    Abstract: To translate synthetic aperture radar (SAR) image into interpretable forms for human understanding is the ultimate goal of SAR advanced information retrieval. Existing methods mainly focus on 3D surface reconstruction or local geometric feature extraction of targets, neglecting the role of structural modeling in capturing semantic information. This paper proposes a novel task: SAR target structure…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: 13 pages, 12 figures

  4. arXiv:2506.06097  [pdf, ps, other]

    cs.CV

    VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

    Authors: Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, Yali Wang

    Abstract: The recent advance in video understanding has been driven by multimodal large language models (MLLMs). But these MLLMs are good at analyzing short videos, while suffering from difficulties in understanding videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents for retrieving extra contextual knowledge in a long video. Ho…

    Submitted 6 June, 2025; originally announced June 2025.

  5. arXiv:2506.05301  [pdf, other]

    cs.CV

    SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

    Authors: Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang

    Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Draft Ver. Project page: https://iceclear.github.io/projects/seedvr2/

  6. arXiv:2506.03569  [pdf, ps, other]

    cs.CL

    MiMo-VL Technical Report

    Authors: Xiaomi LLM-Core Team: Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, et al. (50 additional authors not shown)

    Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 32 pages

  7. arXiv:2505.18454  [pdf, other]

    cs.CL

    Hybrid Latent Reasoning via Reinforcement Learning

    Authors: Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang

    Abstract: Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as th…

    Submitted 23 May, 2025; originally announced May 2025.

  8. arXiv:2505.14874  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

    Authors: Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen

    Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech.…

    Submitted 30 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: 5 pages, 1 figure, Accepted to Interspeech 2025

  9. arXiv:2505.14505  [pdf, ps, other]

    cs.CL cs.AI

    ModRWKV: Transformer Multimodality in Linear Time

    Authors: Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, Weile Li, Zening Lu, Zhouran Ji

    Abstract: Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework bui…

    Submitted 20 May, 2025; originally announced May 2025.

  10. arXiv:2505.14421  [pdf, ps, other]

    stat.ML cs.LG eess.SP eess.SY

    A system identification approach to clustering vector autoregressive time series

    Authors: Zuogong Yue, Xinyi Wang, Victor Solo

    Abstract: Clustering of time series based on their underlying dynamics continues to attract researchers due to its impacts on assisting complex system modelling. Most current time series clustering methods handle only scalar time series, treat them as white noise, or rely on domain knowledge for high-quality feature construction, where the autocorrelation pattern/feature is mostly ignored. Instead of relyi…

    Submitted 20 May, 2025; originally announced May 2025.

  11. arXiv:2505.11151  [pdf, ps, other]

    cs.NE

    STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking

    Authors: Sicheng Shen, Dongcheng Zhao, Linghao Feng, Zeyang Yue, Jindong Li, Tenglong Li, Guobin Shen, Yi Zeng

    Abstract: Spiking Transformers have recently emerged as promising architectures for combining the efficiency of spiking neural networks with the representational power of self-attention. However, the lack of standardized implementations, evaluation pipelines, and consistent design choices has hindered fair comparison and principled analysis. In this paper, we introduce STEP, a unified benchmark fra…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: 21 pages, 8 figures

  12. arXiv:2505.09882  [pdf, ps, other]

    cs.HC

    SnapNCode: An Integrated Development Environment for Programming Physical Objects Interactions

    Authors: Xiaoyan Wei, Zijian Yue, Hsiang-Ting Chen

    Abstract: Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: 18 pages, HCII 2025

  13. arXiv:2505.09872  [pdf, ps, other]

    cs.HC

    Context-AI Tunes: Context-Aware AI-Generated Music for Stress Reduction

    Authors: Xiaoyan Wei, Zebang Zhang, Zijian Yue, Hsiang-Ting Chen

    Abstract: Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that genera…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: 17 pages, HCII 2025

  14. arXiv:2505.07608  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

    Authors: LLM-Core Xiaomi: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, et al. (40 additional authors not shown)

    Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…

    Submitted 5 June, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  15. arXiv:2505.07538  [pdf, other]

    cs.CV

    Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

    Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang

    Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally dist…

    Submitted 27 May, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  16. arXiv:2505.06628  [pdf, ps, other]

    cs.RO

    ACORN: Adaptive Contrastive Optimization for Safe and Robust Fine-Grained Robotic Manipulation

    Authors: Zhongquan Zhou, Shuhao Li, Zixian Yue

    Abstract: Embodied AI research has traditionally emphasized performance metrics such as success rate and cumulative reward, overlooking critical robustness and safety considerations that emerge during real-world deployment. In actual environments, agents continuously encounter unpredicted situations and distribution shifts, causing seemingly reliable policies to experience catastrophic failures, particularl…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: 6 pages, 4 figures

  17. arXiv:2504.15932  [pdf, other]

    cs.CV

    Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

    Authors: Wang Lin, Liyu Jia, Wentao Hu, Kaihang Pan, Zhongqi Yue, Wei Zhao, Jingyuan Chen, Fei Wu, Hanwang Zhang

    Abstract: Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency…

    Submitted 22 April, 2025; originally announced April 2025.

  18. arXiv:2504.14666  [pdf, other]

    cs.CV

    Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

    Authors: Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang

    Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025 (Oral)

  19. arXiv:2504.13266  [pdf, ps, other]

    cs.LG

    Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs

    Authors: Zichao Yue, Chenhui Deng, Zhiru Zhang

    Abstract: Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue…

    Submitted 17 April, 2025; originally announced April 2025.

    Journal ref: Proceedings of the 8th MLSys Conference (MLSys), 2025

  20. arXiv:2504.12491  [pdf, other]

    cs.CL

    Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?

    Authors: Hansi Zeng, Kai Hui, Honglei Zhuang, Zhen Qin, Zhenrui Yue, Hamed Zamani, Dana Alon

    Abstract: While metrics available during pre-training, such as perplexity, correlate well with model performance at scaling-laws studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classificat…

    Submitted 16 April, 2025; originally announced April 2025.

  21. arXiv:2504.06606  [pdf, other]

    cs.CV

    Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

    Authors: Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, Yueting Zhuang

    Abstract: Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signal to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Th…

    Submitted 9 April, 2025; originally announced April 2025.

  22. arXiv:2503.13377  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Authors: Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin

    Abstract: Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that en…

    Submitted 29 June, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: Project Page: https://xuboshen.github.io/Time-R1/

  23. arXiv:2503.12077  [pdf, other]

    cs.CV cs.AI

    V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

    Authors: Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, Yali Wang

    Abstract: Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description of user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, by a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a system…

    Submitted 15 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  24. arXiv:2503.10200  [pdf, other]

    cs.CV

    LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

    Authors: Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang

    Abstract: Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools (e.g., search engine, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understa…

    Submitted 31 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

  25. arXiv:2503.09516  [pdf, other]

    cs.CL cs.AI cs.IR

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Authors: Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han

    Abstract: Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Searc…

    Submitted 8 April, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: 31 pages

  26. arXiv:2503.06520  [pdf, ps, other]

    cs.CV cs.MM

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Authors: Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia

    Abstract: Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting its out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforc…

    Submitted 28 June, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

  27. arXiv:2501.17166  [pdf, ps, other]

    cs.NE physics.comp-ph

    Optimizing Carbon Footprint in ICT through Swarm Intelligence with Algorithmic Complexity

    Authors: Vasileios Alevizos, Nikitas Gerolimos, Sabrina Edralin, Clark Xu, Akebu Simasiku, Georgios Priniotakis, George Papakostas, Zongliang Yue

    Abstract: Global emissions from fossil fuel combustion and cement production were recorded in 2022, signaling a resurgence to pre-pandemic levels and providing an apodictic indication that emission peaks have not yet been achieved. Significant contributions to this upward trend are made by the Information and Communication Technology (ICT) industry due to its substantial energy consumption. This shows the n…

    Submitted 19 January, 2025; originally announced January 2025.

  28. Integrating Artificial Open Generative Artificial Intelligence into Software Supply Chain Security

    Authors: Vasileios Alevizos, George A Papakostas, Akebu Simasiku, Dimitra Malliarou, Antonis Messinis, Sabrina Edralin, Clark Xu, Zongliang Yue

    Abstract: While new technologies emerge, human errors are always looming. Software supply chain is increasingly complex and intertwined, the security of a service has become paramount to ensuring the integrity of products, safeguarding data privacy, and maintaining operational continuity. In this work, we conducted experiments on the promising open Large Language Models (LLMs) into two main software security ch…

    Submitted 26 December, 2024; originally announced December 2024.

    Journal ref: 2024 5th International Conference on Data Analytics for Business and Industry (ICDABI)

  29. arXiv:2412.11236  [pdf, ps, other]

    cs.DS cs.IT eess.SP

    Logarithmic Positional Partition Interval Encoding

    Authors: Vasileios Alevizos, Nikitas Gerolimos, Sabrina Edralin, Clark Xu, Akebu Simasiku, Georgios Priniotakis, George Papakostas, Zongliang Yue

    Abstract: One requirement of maintaining digital information is storage. With the latest advances in the digital world, new emerging media types have required even more storage space to be kept than before. In fact, in many cases it is required to have larger amounts of storage to keep up with protocols that support more types of information at the same time. In contrast, compression algorithms have been in…

    Submitted 15 December, 2024; originally announced December 2024.

  30. arXiv:2412.09013  [pdf, other]

    cs.CV

    Arbitrary-steps Image Super-resolution via Diffusion Inversion

    Authors: Zongsheng Yue, Kang Liao, Chen Change Loy

    Abstract: This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep n…

    Submitted 13 March, 2025; v1 submitted 12 December, 2024; originally announced December 2024.

    Comments: Accepted by CVPR 2025. Project: https://github.com/zsyOAOA/InvSR

    MSC Class: NA; ACM Class: I.4.3

  31. arXiv:2412.06293  [pdf, other]

    cs.CV

    Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

    Authors: Qifan Yu, Zhebei Shen, Zhongqi Yue, Yang Wu, Wenqiao Zhang, Yunfei Li, Juncheng Li, Siliang Tang, Yueting Zhuang

    Abstract: Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective dat…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: 14 pages, 7 figures

  32. arXiv:2411.17769  [pdf, other]

    cs.CV

    Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

    Authors: Xinyu Hou, Zongsheng Yue, Xiaoming Li, Chen Change Loy

    Abstract: In this work, we introduce a single parameter $ω$, to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model's reverse process. Our approach does not require model retraining, architectural modifications, or additional computational overhead during inference, yet enables precise control over the level of detail…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Project page: https://itsmag11.github.io/Omegance/

  33. arXiv:2411.16156  [pdf, other]

    cs.CV cs.LG

    VideoOrion: Tokenizing Object Dynamics in Videos

    Authors: Yicheng Feng, Yijiang Li, Wanpeng Zhang, Hao Luo, Zihao Yue, Sipeng Zheng, Zongqing Lu

    Abstract: We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos - the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our met…

    Submitted 18 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

  34. arXiv:2411.15738  [pdf, other]

    cs.CV

    AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

    Authors: Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, Yueting Zhuang

    Abstract: Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing p…

    Submitted 29 March, 2025; v1 submitted 24 November, 2024; originally announced November 2024.

    Comments: Accepted by CVPR 2025

  35. arXiv:2411.05261  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    Cyclic Vision-Language Manipulator: Towards Reliable and Fine-Grained Image Interpretation for Automated Report Generation

    Authors: Yingying Fang, Zihao Jin, Shaojie Guo, Jinda Liu, Zhiling Yue, Yijian Gao, Junzhi Ning, Zhi Li, Simon Walsh, Guang Yang

    Abstract: Despite significant advancements in automated report generation, the opaqueness of text interpretability continues to cast doubt on the reliability of the content produced. This paper introduces a novel approach to identify specific image features in X-ray images that influence the outputs of report generation models. Specifically, we propose Cyclic Vision-Language Manipulator CVLM, a module to ge…

    Submitted 18 June, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

  36. Enhancing Weakly Supervised Semantic Segmentation for Fibrosis via Controllable Image Generation

    Authors: Zhiling Yue, Yingying Fang, Liutao Yang, Nikhil Baid, Simon Walsh, Guang Yang

    Abstract: Fibrotic Lung Disease (FLD) is a severe condition marked by lung stiffening and scarring, leading to respiratory decline. High-resolution computed tomography (HRCT) is critical for diagnosing and monitoring FLD; however, fibrosis appears as irregular, diffuse patterns with unclear boundaries, leading to high inter-observer variability and time-intensive manual annotation. To tackle this challenge,…

    Submitted 5 November, 2024; originally announced November 2024.

    Journal ref: 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI)

  37. arXiv:2411.01785  [pdf, other]

    cs.IR cs.AI

    Transferable Sequential Recommendation via Vector Quantized Meta Learning

    Authors: Zhenrui Yue, Huimin Zeng, Yang Zhang, Julian McAuley, Dong Wang

    Abstract: While sequential recommendation achieves significant progress on capturing user-item transition patterns, transferring such large-scale recommender systems remains challenging due to the disjoint user and item groups across domains. In this paper, we propose a vector quantized meta learning for transferable sequential recommenders (MetaRec). Without requiring additional modalities or shared inform…

    Submitted 3 November, 2024; originally announced November 2024.

    Comments: Accepted to BigData 2024

  38. arXiv:2410.19702  [pdf, other]

    cs.CV cs.AI cs.MM

    TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

    Authors: Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, Limin Wang

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos still remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt the existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequence, a…

    Submitted 12 February, 2025; v1 submitted 25 October, 2024; originally announced October 2024.

    Comments: Accepted by ICLR2025

  39. arXiv:2410.04343  [pdf, other]

    cs.CL

    Inference Scaling for Long-Context Retrieval Augmented Generation

    Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky

    Abstract: The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inferenc…

    Submitted 2 March, 2025; v1 submitted 5 October, 2024; originally announced October 2024.

    Comments: ICLR 2025

  40. arXiv:2409.17058  [pdf, other]

    cs.CV

    Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors

    Authors: Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, Xiaochun Cao

    Abstract: Diffusion-based image super-resolution (SR) methods have achieved remarkable success by leveraging large pre-trained text-to-image diffusion models as priors. However, these methods still face two challenges: the requirement for dozens of sampling steps to achieve satisfactory results, which limits efficiency in real scenarios, and the neglect of degradation models, which are critical auxiliary in…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: The code is available at https://github.com/ArcticHare105/S3Diff

  41. arXiv:2409.16627  [pdf, other]

    cs.IR

    Train Once, Deploy Anywhere: Matryoshka Representation Learning for Multimodal Recommendation

    Authors: Yueqi Wang, Zhenrui Yue, Huimin Zeng, Dong Wang, Julian McAuley

    Abstract: Despite recent advancements in language and vision modeling, integrating rich multimodal knowledge into recommender systems continues to pose significant challenges. This is primarily due to the need for efficient recommendation, which requires adaptive and interactive responses. In this study, we focus on sequential recommendation and introduce a lightweight framework called full-scale Matryoshka…

    Submitted 2 October, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: Accepted to EMNLP 2024 Findings

  42. arXiv:2409.06938  [pdf, other]

    stat.ML cs.LG

    k-MLE, k-Bregman, k-VARs: Theory, Convergence, Computation

    Authors: Zuogong Yue, Victor Solo

    Abstract: We develop hard clustering based on likelihood rather than distance and prove convergence. We also provide simulations and real data examples.

    Submitted 10 September, 2024; originally announced September 2024.

  43. arXiv:2409.06709  [pdf, other]

    cs.MM cs.AI cs.SD eess.AS

    Unveiling Visual Biases in Audio-Visual Localization Benchmarks

    Authors: Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin

    Abstract: Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding vi…

    Submitted 25 August, 2024; originally announced September 2024.

    Comments: Accepted by ECCV24 AVGenL Workshop

  44. arXiv:2408.17129  [pdf, ps, other]

    cs.LG cs.AI

    Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction

    Authors: Xiaodi Li, Jianfeng Gui, Qian Gao, Haoyuan Shi, Zhenyu Yue

    Abstract: Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this pap…

    Submitted 3 September, 2024; v1 submitted 30 August, 2024; originally announced August 2024.

  45. DRExplainer: Quantifiable Interpretability in Drug Response Prediction with Directed Graph Convolutional Network

    Authors: Haoyuan Shi, Tao Xu, Xiaodi Li, Qian Gao, Zhiwei Xiong, Junfeng Xia, Zhenyu Yue

    Abstract: Predicting the response of a cancer cell line to a therapeutic drug is pivotal for personalized medicine. Despite numerous deep learning methods that have been developed for drug response prediction, integrating diverse information about biological entities and predicting the directional response remain major challenges. Here, we propose a novel interpretable predictive model, DRExplainer, which l…

    Submitted 27 March, 2025; v1 submitted 22 August, 2024; originally announced August 2024.

  46. arXiv:2408.10605  [pdf, other]

    cs.CV cs.AI

    MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

    Authors: Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang

    Abstract: Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES addresses this challenging task by developing a progressive workflo…

    Submitted 15 December, 2024; v1 submitted 20 August, 2024; originally announced August 2024.

    Comments: AAAI 2025

  47. arXiv:2407.21384  [pdf, other]

    cs.CL cs.AI

    GEGA: Graph Convolutional Networks and Evidence Retrieval Guided Attention for Enhanced Document-level Relation Extraction

    Authors: Yanxu Mao, Xiaohui Chen, Peipei Liu, Tiehan Cui, Zuhui Yue, Zheng Li

    Abstract: Document-level relation extraction (DocRE) aims to extract relations between entities from unstructured document text. Compared to sentence-level relation extraction, it requires more complex semantic understanding from a broader text context. Currently, some studies are utilizing logical rules within evidence sentences to enhance the performance of DocRE. However, in the data without provided evi…

    Submitted 8 September, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  48. arXiv:2407.14816  [pdf, other]

    cs.CV

    Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding

    Authors: Jiangtao Zhang, Zongsheng Yue, Hui Wang, Qian Zhao, Deyu Meng

    Abstract: Blind image deconvolution (BID) is a classic yet challenging problem in the field of image processing. Recent advances in deep image prior (DIP) have motivated a series of DIP-based approaches, demonstrating remarkable success in BID. However, due to the high non-convexity of the inherent optimization process, these methods are notorious for their sensitivity to the initialized kernel. To alleviat…

    Submitted 20 July, 2024; originally announced July 2024.

    Comments: ECCV@2024. Code: https://github.com/jtaoz/GKPILE-Deconvolution

    ACM Class: I.4.4

  49. arXiv:2407.10416  [pdf, other]

    cs.AR

    SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling

    Authors: Huizheng Wang, Jiahao Fang, Xinru Tang, Zhiheng Yue, Jinxi Li, Yubin Qin, Sihan Guan, Qize Yang, Yang Wang, Chao Li, Yang Hu, Shouyi Yin

    Abstract: Benefiting from the self-attention mechanism, Transformer models have attained impressive contextual comprehension capabilities for lengthy texts. The requirements of high-throughput inference arise as the large language models (LLMs) become increasingly prevalent, which calls for large-scale token parallel processing (LTPP). However, existing dynamic sparse accelerators struggle to effectively ha…

    Submitted 14 July, 2024; originally announced July 2024.

  50. arXiv:2407.08507  [pdf, other]

    cs.CV

    Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement

    Authors: Zijie Yue, Miaojing Shi, Hanli Wang, Shuai Ding, Qijun Chen, Shanlin Yang

    Abstract: Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To tackle it, self-supervised learning has recently gai…

    Submitted 17 February, 2025; v1 submitted 11 July, 2024; originally announced July 2024.

    Comments: International Journal of Computer Vision