Showing 1–50 of 701 results for author: Bai, Y

  1. arXiv:2507.06121  [pdf, ps, other]

    cs.IR

    Unconditional Diffusion for Generative Sequential Recommendation

    Authors: Yimeng Bai, Yang Zhang, Sihao Ding, Shaohui Ruan, Han Yao, Danhui Guan, Fuli Feng, Tat-Seng Chua

    Abstract: Diffusion models, known for their generative ability to simulate data creation through noise-adding and denoising processes, have emerged as a promising approach for building generative recommenders. To incorporate user history for personalization, existing methods typically adopt a conditional diffusion framework, where the reverse denoising process of reconstructing items from noise is modified…

    Submitted 8 July, 2025; originally announced July 2025.

    ACM Class: H.3.3; H.3.5

  2. arXiv:2507.05177  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

    Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang

    Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for trans…

    Submitted 8 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

    Comments: Technical Report

  3. arXiv:2507.03038  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Cautious Next Token Prediction

    Authors: Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu, Mingyuan Zhang, Qihua Dong, Yu Yin, Sohrab Amirghodsi, Yun Fu

    Abstract: Next token prediction paradigm has been prevailing for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Findings of ACL 2025

  4. arXiv:2507.00185  [pdf]

    eess.IV cs.AI cs.CV

    Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

    Authors: Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

    Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundatio…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 42 pages, 3 composite figures, 4 tables

  5. arXiv:2506.22007  [pdf, ps, other]

    cs.CV

    RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

    Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, Abhinav Valada

    Abstract: We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works pre…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 8 pages, 6 figures

  6. arXiv:2506.21976  [pdf, ps, other]

    cs.LG cs.AI cs.CV cs.MA cs.RO

    SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model

    Authors: Shuhan Tan, John Lambert, Hong Jeon, Sakshum Kulshrestha, Yijing Bai, Jing Luo, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang

    Abstract: The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from p…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: Accepted to CVPR 2025

  7. arXiv:2506.21552  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.MM cs.RO

    Whole-Body Conditioned Egocentric Video Prediction

    Authors: Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik

    Abstract: We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional dif…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Project Page: https://dannytran123.github.io/PEVA

  8. arXiv:2506.20493  [pdf]

    eess.SY cs.GT

    Analyzing the Impact of Strategic Bidding on the Reserve Capacity via a Bi-Level Model

    Authors: Yun Xu, Yunxiao Bai, Yunyong Zhang, Peng Wang, Xuelin Wang, Jiqun Guo, Kaijun Xie, Rusheng Zhao

    Abstract: The growing integration of renewable energy sources necessitates adequate reserve capacity to maintain power balance. However, in market clearing, power companies with flexible resources may submit strategic bids to maximize profits, potentially compromising system reserves. This paper examines the effects of such strategic behavior by modeling the market as a bi-level problem. The upper level rep…

    Submitted 25 June, 2025; originally announced June 2025.

  9. arXiv:2506.18841  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

    Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li

    Abstract: Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on "teaching", which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this…

    Submitted 23 June, 2025; originally announced June 2025.

  10. arXiv:2506.15576  [pdf, ps, other]

    cs.IR

    DiscRec: Disentangled Semantic-Collaborative Modeling for Generative Recommendation

    Authors: Chang Liu, Yimeng Bai, Xiaoyan Zhao, Yang Zhang, Fuli Feng, Wenge Rong

    Abstract: Generative recommendation is emerging as a powerful paradigm that directly generates item predictions, moving beyond traditional matching-based approaches. However, current methods face two key challenges: token-item misalignment, where uniform token-level modeling ignores item-level granularity that is critical for collaborative signal learning, and semantic-collaborative signal entanglement, whe…

    Submitted 22 June, 2025; v1 submitted 18 June, 2025; originally announced June 2025.

    Comments: Fixed the indentation issue in the abstract that caused rendering errors on arXiv

  11. arXiv:2506.13796  [pdf, ps, other]

    cs.CL cs.AI

    ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries

    Authors: Zhou Chen, Xiao Wang, Yuanhong Liao, Ming Lin, Yuqi Bai

    Abstract: As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance o…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: ICLR 2025 camera ready, 13 pages, 4 figures, 4 tables

  12. arXiv:2506.12473  [pdf]

    cs.CL

    TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks

    Authors: Zhou Chen, Zhiqiang Wei, Yuqi Bai, Xue Xiong, Jianmin Wu

    Abstract: Model routing allocates queries to the suitable model, improving system performance while reducing costs. However, existing routing methods face practical limitations that hinder scalability in large-scale applications and struggle to keep up with the rapid growth of the large language model (LLM) ecosystem. To tackle these challenges, we propose TagRouter, a training-free model routing method des…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: ACL 2025, 26 pages, 13 figures, 14 tables

  13. arXiv:2506.11418  [pdf, ps, other]

    cs.CL

    Efficient Long-Context LLM Inference via KV Cache Clustering

    Authors: Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li, Kun Yuan

    Abstract: Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational ov…

    Submitted 12 June, 2025; originally announced June 2025.

  14. arXiv:2506.11302  [pdf, ps, other]

    cs.CV cs.AI

    TARDIS STRIDE: A Spatio-Temporal Road Image Dataset and World Model for Autonomy

    Authors: Héctor Carrión, Yutong Bai, Víctor A. Hernández Castro, Kishan Panaganti, Ayush Zenith, Matthew Trang, Tony Zhang, Pietro Perona, Jitendra Malik

    Abstract: World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360-degree panoramic imagery into rich interconnected observatio…

    Submitted 19 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: Computer Vision, Pattern Recognition, Early-Fusion, Dataset, Data Augmentation

  15. arXiv:2506.11163  [pdf, ps, other]

    eess.IV cs.CV cs.GR

    Vector Representations of Vessel Trees

    Authors: James Batten, Michiel Schaap, Matthew Sinclair, Ying Bai, Ben Glocker

    Abstract: We introduce a novel framework for learning vector representations of tree-structured geometric data focusing on 3D vascular networks. Our approach employs two sequentially trained Transformer-based autoencoders. In the first stage, the Vessel Autoencoder captures continuous geometric details of individual vessel segments by learning embeddings from sampled points along each curve. In the second s…

    Submitted 11 June, 2025; originally announced June 2025.

  16. arXiv:2506.08632  [pdf, other]

    cs.CV

    RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

    Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok

    Abstract: Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods t…

    Submitted 10 June, 2025; originally announced June 2025.

  17. arXiv:2506.07900  [pdf, ps, other]

    cs.CL cs.AI

    MiniCPM4: Ultra-Efficient LLMs on End Devices

    Authors: MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, et al. (50 additional authors not shown)

    Abstract: This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelera…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: MiniCPM4 Technical Report

  18. arXiv:2506.07463  [pdf, ps, other]

    cs.CL cs.AI

    CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

    Authors: Guang Liu, Liangdong Wang, Jijie Li, Yang Yu, Yao Xu, Jiabei Chen, Yu Bai, Feng Liao, Yonghua Lin

    Abstract: We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectory. CCI4.0 occupies roughly $35$ TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a $5.2$ TB carefully curated Chinese web corpus, a $22.5$ TB English subset from Nemotron-CC, and diverse sources fr…

    Submitted 9 June, 2025; originally announced June 2025.

  19. arXiv:2506.07099  [pdf, ps, other]

    cs.LG cs.AI

    Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion

    Authors: Wenying He, Jieling Huang, Junhua Gu, Ji Zhang, Yude Bai

    Abstract: Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interde…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 9 pages, 3 figures

  20. arXiv:2506.07037  [pdf, ps, other]

    cs.CL

    KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering

    Authors: Zhongze Luo, Weixuan Wan, Qizhi Zheng, Yanhong Bai, Jingyun Sun, Jian Wang, Dan Wang

    Abstract: There are many types of standards in the field of communication. The traditional consulting model has a long cycle and relies on the knowledge and experience of experts, making it difficult to meet the rapidly developing technological demands. This paper combines the fine-tuning of large language models with the construction of knowledge graphs to implement an intelligent consultation and question…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 23 pages

  21. arXiv:2506.05813  [pdf, ps, other]

    cs.CL

    MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

    Authors: Ye Bai, Minghan Wang, Thuy-Trang Vu

    Abstract: Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Ada…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: 26 pages, 10 figures

  22. arXiv:2506.05790  [pdf, ps, other]

    cs.CL

    Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

    Authors: Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

    Abstract: While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can m…

    Submitted 6 June, 2025; originally announced June 2025.

  23. arXiv:2506.05633  [pdf, ps, other]

    q-bio.NC cs.CV cs.NE

    Noninvasive precision modulation of high-level neural population activity via natural vision perturbations

    Authors: Guy Gaziv, Sarah Goulding, Ani Ayvazian-Hancock, Yoon Bai, James J. DiCarlo

    Abstract: Precise control of neural activity -- modulating target neurons deep in the brain while leaving nearby neurons unaffected -- is an outstanding challenge in neuroscience, generally approached using invasive techniques. This study investigates the possibility of precisely and noninvasively modulating neural activity in the high-level primate ventral visual stream via perturbations on one's natural v…

    Submitted 13 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  24. arXiv:2506.04272  [pdf, ps, other]

    cs.LG

    Understanding the Impact of Sampling Quality in Direct Preference Optimization

    Authors: Kyung Rok Kim, Yumo Bai, Chonghuan Wang, Guanting Chen

    Abstract: We study the role of the sampling distribution in Direct Preference Optimization (DPO) and aim to understand its impact on DPO's training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the generating distribution. We first analyze how distribution of responses influences policy updates during gradient descent, drawi…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Submitted to NeurIPS 2025

  25. arXiv:2506.04180  [pdf, ps, other]

    cs.CL

    SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

    Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee

    Abstract: Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent…

    Submitted 4 June, 2025; originally announced June 2025.

  26. arXiv:2506.03546  [pdf, ps, other]

    cs.RO cs.AI cs.MA

    From Virtual Agents to Robot Teams: A Multi-Robot Framework Evaluation in High-Stakes Healthcare Context

    Authors: Yuanchen Bai, Zijian Ding, Angelique Taylor

    Abstract: Advancements in generative models have enabled multi-agent systems (MAS) to perform complex virtual tasks such as writing and code generation, which do not generalize well to physical multi-agent robotic teams. Current frameworks often treat agents as conceptual task executors rather than physically embodied entities, and overlook critical real-world constraints such as spatial context, robotic ca…

    Submitted 4 June, 2025; originally announced June 2025.

  27. arXiv:2506.03524  [pdf, ps, other]

    cs.CL cs.SE

    Seed-Coder: Let the Code Model Curate Data for Itself

    Authors: ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, et al. (2 additional authors not shown)

    Abstract: Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality f…

    Submitted 4 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  28. arXiv:2506.03373  [pdf, ps, other]

    cs.CV cs.AI

    A Foundation Model for Spatial Proteomics

    Authors: Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, et al. (35 additional authors not shown)

    Abstract: Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-superv…

    Submitted 3 June, 2025; originally announced June 2025.

  29. arXiv:2506.02572  [pdf, ps, other]

    cs.LG cs.AI

    HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference

    Authors: Ping Gong, Jiawei Yi, Shengnan Wang, Juncheng Zhang, Zewen Jin, Ouxiang Zhou, Ruibo Liu, Guanbin Xu, Youhui Bai, Bowen Ye, Kun Yuan, Tong Yang, Gong Zhang, Renhai Chen, Feng Wu, Cheng Li

    Abstract: Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-$k$ attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggled to strike a balance betwee…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: ACL 2025 findings

  30. arXiv:2506.00478  [pdf, ps, other]

    cs.LG cs.CV

    Dynamic Domain Adaptation-Driven Physics-Informed Graph Representation Learning for AC-OPF

    Authors: Hongjie Zhu, Zezheng Zhang, Zeyu Zhang, Yu Bai, Shimin Wen, Huazhang Wang, Daji Ergu, Ying Cai, Yang Zhao

    Abstract: Alternating Current Optimal Power Flow (AC-OPF) aims to optimize generator power outputs by utilizing the non-linear relationships between voltage magnitudes and phase angles in a power system. However, current AC-OPF solvers struggle to effectively represent the complex relationship between variable distributions in the constraint space and their corresponding optimal solutions. This limitation i…

    Submitted 31 May, 2025; originally announced June 2025.

  31. arXiv:2505.24863  [pdf, ps, other]

    cs.CL

    AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

    Authors: Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang

    Abstract: This paper presents AlphaOne ($α$1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. $α$1 first introduces $α$ moment, which represents the scaled thinking phase with a universal parameter $α$. Within this scaled pre-$α$ moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as…

    Submitted 30 May, 2025; originally announced May 2025.

  32. arXiv:2505.24413  [pdf, ps, other]

    cs.LG stat.CO

    Multi-task Learning for Heterogeneous Multi-source Block-Wise Missing Data

    Authors: Yang Sui, Qi Xu, Yang Bai, Annie Qu

    Abstract: Multi-task learning (MTL) has emerged as an imperative machine learning tool to solve multiple learning tasks simultaneously and has been successfully applied to healthcare, marketing, and biomedical fields. However, in order to borrow information across different tasks effectively, it is essential to utilize both homogeneous and heterogeneous information. Among the extensive literature on MTL, va…

    Submitted 30 May, 2025; originally announced May 2025.

  33. arXiv:2505.24281  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Multi-task Learning for Heterogeneous Data via Integrating Shared and Task-Specific Encodings

    Authors: Yang Sui, Qi Xu, Yang Bai, Annie Qu

    Abstract: Multi-task learning (MTL) has become an essential machine learning tool for addressing multiple learning tasks simultaneously and has been effectively applied across fields such as healthcare, marketing, and biomedical research. However, to enable efficient information sharing across tasks, it is crucial to leverage both shared and heterogeneous information. Despite extensive research on MTL, vari…

    Submitted 30 May, 2025; originally announced May 2025.

  34. arXiv:2505.23751  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    REOrdering Patches Improves Vision Models

    Authors: Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta

    Abstract: Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch or…

    Submitted 29 May, 2025; originally announced May 2025.

  35. arXiv:2505.23653  [pdf, ps, other]

    cs.LG

    How does Transformer Learn Implicit Reasoning?

    Authors: Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Liu Weichuan, Xiaoyin Che, Lei Hou, Juanzi Li

    Abstract: Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly -- producing correct answers without explicitly verbalizing intermediate steps -- but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three…

    Submitted 29 May, 2025; originally announced May 2025.

  36. arXiv:2505.22684  [pdf, ps, other]

    cs.SI cs.LG

    Recovering Fairness Directly from Modularity: a New Way for Fair Community Partitioning

    Authors: Yufeng Wang, Yiguang Bai, Tianqing Zhu, Ismail Ben Ayed, Jing Yuan

    Abstract: Community partitioning is crucial in network analysis, with modularity optimization being the prevailing technique. However, traditional modularity-based methods often overlook fairness, a critical aspect in real-world applications. To address this, we introduce protected group networks and propose a novel fairness-modularity metric. This metric extends traditional modularity by explicitly incorpo…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 17 pages, 5 figures

  37. arXiv:2505.20152  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

    Authors: Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li

    Abstract: Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of ge…

    Submitted 26 May, 2025; originally announced May 2025.

  38. arXiv:2505.19799  [pdf, ps, other]

    cs.CV

    A Regularization-Guided Equivariant Approach for Image Restoration

    Authors: Yulu Bai, Jiahong Fu, Qi Xie, Deyu Meng

    Abstract: Equivariant and invariant deep learning models have been developed to exploit intrinsic symmetries in data, demonstrating significant effectiveness in certain scenarios. However, these methods often suffer from limited representation accuracy and rely on strict symmetry assumptions that may not hold in practice. These limitations pose a significant drawback for image restoration tasks, which deman…

    Submitted 26 May, 2025; originally announced May 2025.

  39. arXiv:2505.17685  [pdf, other]

    cs.CV

    FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

    Authors: Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei

    Abstract: Visual language models (VLMs) have attracted increasing interest in autonomous driving due to their powerful reasoning capabilities. However, existing VLMs typically utilize discrete text Chain-of-Thought (CoT) tailored to the current scenario, which essentially represents highly abstract and symbolic compression of visual information, potentially leading to spatio-temporal relationship ambiguity…

    Submitted 23 May, 2025; originally announced May 2025.

  40. arXiv:2505.16483  [pdf, other]

    cs.CL cs.AI

    Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

    Authors: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

    Abstract: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tas…

    Submitted 22 May, 2025; originally announced May 2025.

  41. arXiv:2505.16160  [pdf, ps, other]

    cs.CL

    EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

    Authors: Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Yang Gao, Heyan Huang

    Abstract: As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a se…

    Submitted 27 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  42. arXiv:2505.12693  [pdf, other]

    cs.CV

    TACOcc: Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy

    Authors: Luyao Lei, Shuo Xu, Yifan Bai, Xing Wei

    Abstract: The performance of multi-modal 3D occupancy prediction is limited by ineffective fusion, mainly due to geometry-semantics mismatch from fixed fusion strategies and surface detail loss caused by sparse, noisy annotations. The mismatch stems from the heterogeneous scale and distribution of point cloud and image features, leading to biased matching under fixed neighborhood fusion. To address this, we…

    Submitted 19 May, 2025; originally announced May 2025.

  43. arXiv:2505.11166  [pdf, ps, other]

    cs.CL cs.AI

    SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

    Authors: Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang

    Abstract: Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named…

    Submitted 16 May, 2025; originally announced May 2025.

  44. arXiv:2505.10940  [pdf, ps, other]

    cs.IR cs.AI

    Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation

    Authors: Qing Yu, Xiaobei Wang, Shuchang Liu, Yandong Bai, Xiaoyu Yang, Xueliang Wang, Chang Meng, Shanshan Wu, Hailan Yang, Huihui Xiao, Xiang Li, Fan Yang, Xiaoqiang Feng, Lantao Hu, Han Li, Kun Gai, Lixin Zou

    Abstract: Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of use…

    Submitted 20 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  45. arXiv:2505.08414  [pdf]

    eess.IV cs.CV

    An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care

    Authors: Zhi Da Soh, Yang Bai, Kai Yu, Yang Zhou, Xiaofeng Lei, Sahil Thakur, Zann Lee, Lee Ching Linette Phang, Qingsheng Peng, Can Can Xue, Rachel Shujuan Chong, Quan V. Hoang, Lavanya Raghavan, Yih Chung Tham, Charumathi Sabanayagam, Wei-Chi Wu, Ming-Chih Ho, Jiangnan He, Preeti Gupta, Ecosse Lamoureux, Seang Mei Saw, Vinay Nangia, Songhomitra Panda-Jonas, Jie Xu, Ya Xing Wang, et al. (6 additional authors not shown)

    Abstract: Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati…

    Submitted 13 May, 2025; originally announced May 2025.

  46. arXiv:2505.05795  [pdf, other]

    eess.SY cs.RO

    Formation Maneuver Control Based on the Augmented Laplacian Method

    Authors: Xinzhe Zhou, Xuyang Wang, Xiaoming Duan, Yuzhu Bai, Jianping He

    Abstract: This paper proposes a novel formation maneuver control method for both 2-D and 3-D space, which enables the formation to translate, scale, and rotate with arbitrary orientation. The core innovation is the novel design of weights in the proposed augmented Laplacian matrix. Instead of using scalars, we represent weights as matrices, which are designed based on a specified rotation axis and allow the…

    Submitted 9 May, 2025; originally announced May 2025.

  47. arXiv:2505.04173  [pdf, other]

    cs.LG cs.CV

    DiffPattern-Flex: Efficient Layout Pattern Generation via Discrete Diffusion

    Authors: Zixiao Wang, Wenqian Zhao, Yunheng Shen, Yang Bai, Guojin Chen, Farzan Farnia, Bei Yu

    Abstract: Recent advancements in layout pattern generation have been dominated by deep generative models. However, relying solely on neural networks for legality guarantees raises concerns in many practical applications. In this paper, we present DiffPattern-Flex, a novel approach designed to generate reliable layout patterns efficiently. DiffPattern-Flex incorporates a new method for generati…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 13 pages, 13 figures. Accepted by TCAD

  48. arXiv:2505.03329  [pdf, other]

    cs.CV

    FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

    Authors: Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Lei Sun, Xiangxiang Chu

    Abstract: The task of scene text editing is to modify or add texts on images while maintaining the fidelity of newly generated text and visual coherence with the background. Recent works based on latent diffusion models (LDM) show improved text editing results, yet still face challenges and often generate inaccurate or unrecognizable characters, especially for non-Latin ones (e.g., Chinese), which have compl…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: 9 pages, 4 figures

  49. arXiv:2504.20050  [pdf, ps, other]

    cs.CR

    Multi-Party Private Set Operations from Predicative Zero-Sharing

    Authors: Minglang Dong, Yu Chen, Cong Zhang, Yujie Bai, Yang Cao

    Abstract: Typical protocols in the multi-party private set operations (MPSO) setting enable m > 2 parties to perform certain secure computation on the intersection or union of their private sets, realizing a very limited range of MPSO functionalities. Most works in this field focus on just one or two specific functionalities, resulting in a large variety of isolated schemes and a lack of a unified framework…

    Submitted 10 April, 2025; originally announced April 2025.

  50. arXiv:2504.19738  [pdf, other]

    cs.AI cs.LG

    Learning Efficiency Meets Symmetry Breaking

    Authors: Yingbin Bai, Sylvie Thiebaux, Felipe Trevizan

    Abstract: Learning-based planners leveraging Graph Neural Networks can learn search guidance applicable to large search spaces, yet their potential to address symmetries remains largely unexplored. In this paper, we introduce a graph representation of planning problems allying learning efficiency with the ability to detect symmetries, along with two pruning methods, action pruning and state pruning, designe…

    Submitted 28 April, 2025; originally announced April 2025.