-
Unconditional Diffusion for Generative Sequential Recommendation
Authors:
Yimeng Bai,
Yang Zhang,
Sihao Ding,
Shaohui Ruan,
Han Yao,
Danhui Guan,
Fuli Feng,
Tat-Seng Chua
Abstract:
Diffusion models, known for their generative ability to simulate data creation through noise-adding and denoising processes, have emerged as a promising approach for building generative recommenders. To incorporate user history for personalization, existing methods typically adopt a conditional diffusion framework, where the reverse denoising process of reconstructing items from noise is modified to be conditioned on the user history. However, this design may fail to fully utilize historical information, as it gets distracted by the need to model the "item $\leftrightarrow$ noise" translation. This motivates us to reformulate the diffusion process for sequential recommendation in an unconditional manner, treating user history (instead of noise) as the endpoint of the forward diffusion process (i.e., the starting point of the reverse process), rather than as a conditional input. This formulation allows for exclusive focus on modeling the "item $\leftrightarrow$ history" translation. To this end, we introduce Brownian Bridge Diffusion Recommendation (BBDRec). By leveraging a Brownian bridge process, BBDRec enforces a structured noise addition and denoising mechanism, ensuring that the trajectories are constrained towards a specific endpoint -- user history, rather than noise. Extensive experiments demonstrate BBDRec's effectiveness in enhancing sequential recommendation performance. The source code is available at https://github.com/baiyimeng/BBDRec.
Submitted 8 July, 2025;
originally announced July 2025.
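As a rough illustration of the Brownian bridge idea in the BBDRec abstract above, the sketch below samples an intermediate state that is pinned to an item embedding at t = 0 and to a user-history embedding at t = T. It uses the textbook Brownian bridge marginal (linear interpolation of the endpoints plus Gaussian noise with variance proportional to (t/T)(1 - t/T)); the function name `bridge_sample`, the `scale` parameter, and the toy vectors are assumptions for illustration, not BBDRec's actual parameterization.

```python
import math
import random


def bridge_sample(x0, y, t, T, rng, scale=1.0):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and y (t=T).

    x0 and y are equal-length lists of floats (hypothetical item and
    history embeddings). The noise vanishes at both endpoints.
    """
    m = t / T                                 # interpolation weight in [0, 1]
    std = math.sqrt(scale * m * (1.0 - m))    # zero at t=0 and t=T
    return [(1.0 - m) * a + m * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, y)]


rng = random.Random(0)
item = [1.0, -2.0, 0.5]       # toy "item" embedding
history = [0.0, 3.0, -1.0]    # toy "user history" embedding
start = bridge_sample(item, history, 0, 10, rng)
middle = bridge_sample(item, history, 5, 10, rng)
end = bridge_sample(item, history, 10, 10, rng)
```

Here `start` coincides with the item and `end` with the history, while `middle` is a noisy point between them, which is the sense in which the trajectory is constrained towards a specific endpoint rather than pure noise.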
-
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
Authors:
Chen Wang,
Tianyu Peng,
Wen Yang,
Yinan Bai,
Guangfu Wang,
Jun Lin,
Lanpeng Jia,
Lingxiang Wu,
Jinqiao Wang,
Chengqing Zong,
Jiajun Zhang
Abstract:
Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S
Submitted 8 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Cautious Next Token Prediction
Authors:
Yizhou Wang,
Lingzhi Zhang,
Yue Bai,
Mang Tik Chiu,
Zhengmian Hu,
Mingyuan Zhang,
Qihua Dong,
Yu Yin,
Sohrab Amirghodsi,
Yun Fu
Abstract:
The next-token prediction paradigm has prevailed for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is not certain about the test questions. To this end, we propose a brand new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from that step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score, viewed as the most probable and reliable trial path given the model's capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach consistently outperforms existing standard decoding strategies by a clear margin. Moreover, the integration of CNTP with self-consistency can further improve over vanilla self-consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
Submitted 3 July, 2025;
originally announced July 2025.
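The decoding rule described in the CNTP abstract above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `next_probs` is a hypothetical callable standing in for a language model (it maps a token prefix to a dict of next-token probabilities), the entropy `threshold` and the linear trial schedule are invented for the example, and the punctuation set is arbitrary.

```python
import math
import random

PUNCT = {".", ",", "!", "?", ";"}  # assumed stopping tokens


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


def rollout(next_probs, prefix, rng, max_len=20):
    """Sample tokens until punctuation; return (tokens, perplexity)."""
    tokens, log_prob = [], 0.0
    while len(tokens) < max_len:
        probs = next_probs(prefix + tokens)
        tok = sample(probs, rng)
        tokens.append(tok)
        log_prob += math.log(probs[tok])
        if tok in PUNCT:
            break
    return tokens, math.exp(-log_prob / len(tokens))


def cautious_step(next_probs, prefix, rng, threshold=0.5, max_trials=5):
    """Emit one token when confident, else the lowest-perplexity trial."""
    probs = next_probs(prefix)
    h = entropy(probs)
    if h <= threshold:
        return [sample(probs, rng)]
    # Hypothetical schedule: more trials as entropy approaches its maximum.
    h_max = math.log(len(probs))
    n_trials = max(2, round(max_trials * h / h_max))
    trials = [rollout(next_probs, prefix, rng) for _ in range(n_trials)]
    return min(trials, key=lambda trial: trial[1])[0]
```

With a real model, `next_probs` would wrap a forward pass and the returned tokens would be appended to the prefix before the next call to `cautious_step`.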
-
Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)
Authors:
Yang Zhou,
Chrystie Wan Ning Quek,
Jun Zhou,
Yan Wang,
Yang Bai,
Yuhe Ke,
Jie Yao,
Laura Gutierrez,
Zhen Ling Teo,
Darren Shu Jeng Ting,
Brian T. Soetikno,
Christopher S. Nielsen,
Tobias Elze,
Zengxiang Li,
Linh Le Dinh,
Lionel Tim-Ee Cheng,
Tran Nguyen Tuan Anh,
Chee Leong Cheng,
Tien Yin Wong,
Nan Liu,
Iain Beehuat Tan,
Tony Kiat Hon Lim,
Rick Siow Mong Goh,
Yong Liu,
Daniel Shu Wei Ting
Abstract:
Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.
Submitted 30 June, 2025;
originally announced July 2025.
-
RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation
Authors:
Liudi Yang,
Yang Bai,
George Eskandar,
Fengyi Shen,
Mohammad Altillawi,
Dong Chen,
Soumajit Majumder,
Ziyuan Liu,
Gitta Kutyniok,
Abhinav Valada
Abstract:
We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulations in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each of the two generated frames, achieving the long-horizon video. 2) We propose a semantics preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.
Submitted 27 June, 2025;
originally announced June 2025.
-
SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model
Authors:
Shuhan Tan,
John Lambert,
Hong Jeon,
Sakshum Kulshrestha,
Yijing Bai,
Jing Luo,
Dragomir Anguelov,
Mingxing Tan,
Chiyu Max Jiang
Abstract:
The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the requirements above. We demonstrate the city-scale traffic simulation capability of SceneDiffuser++ and study its superior realism under long simulation conditions. We evaluate the simulation quality on an augmented version of the Waymo Open Motion Dataset (WOMD) with larger map regions to support trip-level simulation.
Submitted 27 June, 2025;
originally announced June 2025.
-
Whole-Body Conditioned Egocentric Video Prediction
Authors:
Yutong Bai,
Danny Tran,
Amir Bar,
Yann LeCun,
Trevor Darrell,
Jitendra Malik
Abstract:
We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
Submitted 26 June, 2025;
originally announced June 2025.
-
Analyzing the Impact of Strategic Bidding on the Reserve Capacity via a Bi-Level Model
Authors:
Yun Xu,
Yunxiao Bai,
Yunyong Zhang,
Peng Wang,
Xuelin Wang,
Jiqun Guo,
Kaijun Xie,
Rusheng Zhao
Abstract:
The growing integration of renewable energy sources necessitates adequate reserve capacity to maintain power balance. However, in market clearing, power companies with flexible resources may submit strategic bids to maximize profits, potentially compromising system reserves. This paper examines the effects of such strategic behavior by modeling the market as a bi-level problem. The upper level represents a strategic company aiming to maximize profit, while the lower level simulates the system operator clearing the market based on submitted offers. To enable duality-based solution methods, we approximate unit commitments with a continuous reserve capacity calculation. Case studies indicate that, in an imperfectly competitive market, more units are incentivized to operate, enhancing system reserves. However, some units go online mainly for profit, ultimately raising electricity costs for consumers. These findings highlight the importance of market design in managing the trade-off between reserve adequacy and economic efficiency in the presence of strategic bidding behavior.
Submitted 25 June, 2025;
originally announced June 2025.
-
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Authors:
Yuhao Wu,
Yushi Bai,
Zhiqiang Hu,
Roy Ka-Wei Lee,
Juanzi Li
Abstract:
Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on "teaching", which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B
Submitted 23 June, 2025;
originally announced June 2025.
-
DiscRec: Disentangled Semantic-Collaborative Modeling for Generative Recommendation
Authors:
Chang Liu,
Yimeng Bai,
Xiaoyan Zhao,
Yang Zhang,
Fuli Feng,
Wenge Rong
Abstract:
Generative recommendation is emerging as a powerful paradigm that directly generates item predictions, moving beyond traditional matching-based approaches. However, current methods face two key challenges: token-item misalignment, where uniform token-level modeling ignores item-level granularity that is critical for collaborative signal learning, and semantic-collaborative signal entanglement, where collaborative and semantic signals exhibit distinct distributions yet are fused in a unified embedding space, leading to conflicting optimization objectives that limit the recommendation performance. To address these issues, we propose DiscRec, a novel framework that enables Disentangled Semantic-Collaborative signal modeling with flexible fusion for generative Recommendation. First, DiscRec introduces item-level position embeddings, assigned based on indices within each semantic ID, enabling explicit modeling of item structure in input token sequences. Second, DiscRec employs a dual-branch module to disentangle the two signals at the embedding layer: a semantic branch encodes semantic signals using original token embeddings, while a collaborative branch applies localized attention restricted to tokens within the same item to effectively capture collaborative signals. A gating mechanism subsequently fuses both branches while preserving the model's ability to model sequential dependencies. Extensive experiments on four real-world datasets demonstrate that DiscRec effectively decouples these signals and consistently outperforms state-of-the-art baselines. Our codes are available on https://github.com/Ten-Mao/DiscRec.
Submitted 22 June, 2025; v1 submitted 18 June, 2025;
originally announced June 2025.
-
ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries
Authors:
Zhou Chen,
Xiao Wang,
Yuanhong Liao,
Ming Lin,
Yuqi Bai
Abstract:
As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance on relevant tasks by constructing climate change-related instruction data and instruction-tuning LLMs. However, current research remains inadequate in efficiently producing large volumes of high-precision instruction data for climate change, which limits further development of climate change LLMs. This study introduces an automated method for constructing instruction data. The method generates instructions using facts and background knowledge from documents and enhances the diversity of the instruction data through web scraping and the collection of seed instructions. Using this method, we constructed a climate change instruction dataset, named ClimateChat-Corpus, which was used to fine-tune open-source LLMs, resulting in an LLM named ClimateChat. Evaluation results show that ClimateChat significantly improves performance on climate change question-and-answer tasks. Additionally, we evaluated the impact of different base models and instruction data on LLM performance and demonstrated its capability to adapt to a wide range of climate change scientific discovery tasks, emphasizing the importance of selecting an appropriate base model for instruction tuning. This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs.
Submitted 12 June, 2025;
originally announced June 2025.
-
TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks
Authors:
Zhou Chen,
Zhiqiang Wei,
Yuqi Bai,
Xue Xiong,
Jianmin Wu
Abstract:
Model routing allocates queries to a suitable model, improving system performance while reducing costs. However, existing routing methods face practical limitations that hinder scalability in large-scale applications and struggle to keep up with the rapid growth of the large language model (LLM) ecosystem. To tackle these challenges, we propose TagRouter, a training-free model routing method designed to optimize the synergy among multiple LLMs for open-domain text generation tasks. Experimental results demonstrate that TagRouter outperforms 13 baseline methods, increasing the accept rate of the system by 6.15% and reducing costs by 17.20%, achieving optimal cost-efficiency. Our findings provide the LLM community with an efficient and scalable solution for model ensembling, offering users an evolvable "super model."
Submitted 14 June, 2025;
originally announced June 2025.
-
Efficient Long-Context LLM Inference via KV Cache Clustering
Authors:
Jie Hu,
Shengnan Wang,
Yutong He,
Ping Gong,
Jiawei Yi,
Juncheng Zhang,
Youhui Bai,
Renhai Chen,
Gong Zhang,
Cheng Li,
Kun Yuan
Abstract:
Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19$\times$ and reduces end-to-end latency by up to 2.72$\times$.
Submitted 12 June, 2025;
originally announced June 2025.
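One plausible reading of the chunked clustering step in the Chelsea abstract above is sketched below on toy data. It is an assumption-laden illustration, not the paper's algorithm: the alternating partition is taken to mean "even positions versus odd positions within a chunk", similarity is cosine similarity between key vectors, and the names `cluster_chunk`, `cluster_kv`, `chunk_size`, and `threshold` are hypothetical. Each cluster is replaced by the mean of its keys and the mean of its values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_chunk(keys, values, threshold=0.9):
    """Merge similar KV pairs inside one chunk into centroids."""
    # Alternating partition (assumed): even positions act as anchors,
    # odd positions are matched against them.
    anchors = list(range(0, len(keys), 2))
    clusters = {a: [a] for a in anchors}
    for i in range(1, len(keys), 2):
        best = max(anchors, key=lambda a: cosine(keys[i], keys[a]))
        if cosine(keys[i], keys[best]) >= threshold:
            clusters[best].append(i)   # similar enough: join the anchor
        else:
            clusters[i] = [i]          # otherwise keep as its own cluster
    order = sorted(clusters)           # preserve sequence order
    merged_k = [mean([keys[j] for j in clusters[c]]) for c in order]
    merged_v = [mean([values[j] for j in clusters[c]]) for c in order]
    return merged_k, merged_v


def cluster_kv(keys, values, chunk_size=4, threshold=0.9):
    """Apply cluster_chunk to consecutive chunks of the sequence."""
    out_k, out_v = [], []
    for s in range(0, len(keys), chunk_size):
        k, v = cluster_chunk(keys[s:s + chunk_size],
                             values[s:s + chunk_size], threshold)
        out_k += k
        out_v += v
    return out_k, out_v
```

Restricting the matching to one chunk at a time keeps the similarity computation quadratic only in the chunk size rather than in the full sequence length, which is the kind of overhead reduction the abstract alludes to.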
-
TARDIS STRIDE: A Spatio-Temporal Road Image Dataset and World Model for Autonomy
Authors:
Héctor Carrión,
Yutong Bai,
Víctor A. Hernández Castro,
Kishan Panaganti,
Ayush Zenith,
Matthew Trang,
Tony Zhang,
Pietro Perona,
Jitendra Malik
Abstract:
World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360-degree panoramic imagery into rich interconnected observation, state and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing. These results suggest a promising direction towards sophisticated generalist agents--capable of understanding and manipulating the spatial and temporal aspects of their material environments--with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at https://huggingface.co/datasets/Tera-AI/STRIDE.
Submitted 19 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
Vector Representations of Vessel Trees
Authors:
James Batten,
Michiel Schaap,
Matthew Sinclair,
Ying Bai,
Ben Glocker
Abstract:
We introduce a novel framework for learning vector representations of tree-structured geometric data focusing on 3D vascular networks. Our approach employs two sequentially trained Transformer-based autoencoders. In the first stage, the Vessel Autoencoder captures continuous geometric details of individual vessel segments by learning embeddings from sampled points along each curve. In the second stage, the Vessel Tree Autoencoder encodes the topology of the vascular network as a single vector representation, leveraging the segment-level embeddings from the first model. A recursive decoding process ensures that the reconstructed topology is a valid tree structure. Compared to 3D convolutional models, this proposed approach substantially lowers GPU memory requirements, facilitating large-scale training. Experimental results on a 2D synthetic tree dataset and a 3D coronary artery dataset demonstrate superior reconstruction fidelity, accurate topology preservation, and realistic interpolations in latent space. Our scalable framework, named VeTTA, offers precise, flexible, and topologically consistent modeling of anatomical tree structures in medical imaging.
Submitted 11 June, 2025;
originally announced June 2025.
-
RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
Authors:
Yang Bai,
Liudi Yang,
George Eskandar,
Fengyi Shen,
Dong Chen,
Mohammad Altillawi,
Ziyuan Liu,
Gitta Kutyniok
Abstract:
Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.
Submitted 10 June, 2025;
originally announced June 2025.
-
MiniCPM4: Ultra-Efficient LLMs on End Devices
Authors:
MiniCPM Team,
Chaojun Xiao,
Yuxuan Li,
Xu Han,
Yuzhuo Bai,
Jie Cai,
Haotian Chen,
Wentong Chen,
Xin Cong,
Ganqu Cui,
Ning Ding,
Shengdan Fan,
Yewei Fang,
Zixuan Fu,
Wenyu Guan,
Yitong Guan,
Junshao Guo,
Yufeng Han,
Bingxiang He,
Yuxiang Huang,
Cunliang Kong,
Qiuzuo Li,
Siyuan Li,
Wenhao Li,
Yanghao Li
, et al. (50 additional authors not shown)
Abstract:
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.
Submitted 9 June, 2025;
originally announced June 2025.
-
CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models
Authors:
Guang Liu,
Liangdong Wang,
Jijie Li,
Yang Yu,
Yao Xu,
Jiabei Chen,
Yu Bai,
Feng Liao,
Yonghua Lin
Abstract:
We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectory. CCI4.0 occupies roughly $35$ TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a $5.2$ TB carefully curated Chinese web corpus, a $22.5$ TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic and require extensive expert experience and labor to process. So, we propose a novel pipeline justifying data quality mainly based on models through two-stage deduplication, multiclassifier quality scoring, and domain-aware fluency filtering. We extract $4.5$ billion pieces of CoT(Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing from the distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained in CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on automatically processing pretraining corpora.
Submitted 9 June, 2025;
originally announced June 2025.
-
Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion
Authors:
Wenying He,
Jieling Huang,
Junhua Gu,
Ji Zhang,
Yude Bai
Abstract:
Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during the data imputation process, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, which enables more robust imputation. The extensive experiments reveal that CoFILL's noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in imputation accuracy. The source code is publicly available at https://github.com/joyHJL/CoFILL.
Submitted 8 June, 2025;
originally announced June 2025.
-
KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering
Authors:
Zhongze Luo,
Weixuan Wan,
Qizhi Zheng,
Yanhong Bai,
Jingyun Sun,
Jian Wang,
Dan Wang
Abstract:
There are many types of standards in the field of communication. The traditional consulting model has a long cycle and relies on the knowledge and experience of experts, making it difficult to meet the rapidly developing technological demands. This paper combines the fine-tuning of large language models with the construction of knowledge graphs to implement an intelligent consultation and question-answering system for communication standards. The experimental results show that after LoRA tuning on the constructed dataset of 6,587 questions and answers in the field of communication standards, Qwen2.5-7B-Instruct demonstrates outstanding professional capabilities in the field of communication standards on the test set. BLEU-4 rose from 18.8564 to 66.8993, and evaluation indicators such as ROUGE also increased significantly, outperforming the fine-tuning effect of the comparison model Llama-3-8B-Instruct. Based on the ontology framework containing 6 entity attributes and 10 relation attributes, a knowledge graph of the communication standard domain containing 13,906 entities and 13,524 relations was constructed, showing a relatively good query accuracy rate. The intelligent consultation and question-answering system enables the fine-tuned model on the server side to access the locally constructed knowledge graph and conduct graphical retrieval of key information first, which is conducive to improving the question-answering effect. The evaluation using DeepSeek as the Judge on the test set shows that our RAG framework enables the fine-tuned model to improve the scores at all five angles, with an average score increase of 2.26%. And combined with web services and API interfaces, it has achieved very good results in terms of interaction experience and back-end access, and has very good practical application value.
Submitted 8 June, 2025;
originally announced June 2025.
-
MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning
Authors:
Ye Bai,
Minghan Wang,
Thuy-Trang Vu
Abstract:
Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Adaptive Planning with Long-term mEmory), a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop. MAPLE integrates 4 key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long-term memory for experience reuse and evolution. Experiments on WiKiTQ and TabFact demonstrate significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.
Submitted 6 June, 2025;
originally announced June 2025.
-
Discrete Minds in a Continuous World: Do Language Models Know Time Passes?
Authors:
Minghan Wang,
Ye Bai,
Thuy-Trang Vu,
Ehsan Shareghi,
Gholamreza Haffari
Abstract:
While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can map discrete token counts to continuous wall-clock time, and validate this through a dialogue duration judgment task. Second, we demonstrate that LLMs could use this awareness to adapt their response length while maintaining accuracy when users express urgency in question answering tasks. Finally, we develop BombRush, an interactive navigation challenge that examines how LLMs modify behavior under progressive time pressure in dynamic environments. Our findings indicate that LLMs possess certain awareness of time passage, enabling them to bridge discrete linguistic tokens and continuous physical time, though this capability varies with model size and reasoning abilities. This work establishes a theoretical foundation for enhancing temporal awareness in LLMs for time-sensitive applications.
Submitted 6 June, 2025;
originally announced June 2025.
-
Noninvasive precision modulation of high-level neural population activity via natural vision perturbations
Authors:
Guy Gaziv,
Sarah Goulding,
Ani Ayvazian-Hancock,
Yoon Bai,
James J. DiCarlo
Abstract:
Precise control of neural activity -- modulating target neurons deep in the brain while leaving nearby neurons unaffected -- is an outstanding challenge in neuroscience, generally approached using invasive techniques. This study investigates the possibility of precisely and noninvasively modulating neural activity in the high-level primate ventral visual stream via perturbations on one's natural visual feed. When tested on macaque inferior temporal (IT) neural populations, we found quantitative agreement between the model-predicted and biologically realized effect: strong modulation concentrated on targeted neural sites. We extended this to demonstrate accurate injection of experimenter-chosen neural population patterns via subtle perturbations applied on the background of typical natural visual feeds. These results highlight that current machine-executable models of the ventral stream can now design noninvasive, visually-delivered, possibly imperceptible neural interventions at the resolution of individual neurons.
Submitted 13 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
Understanding the Impact of Sampling Quality in Direct Preference Optimization
Authors:
Kyung Rok Kim,
Yumo Bai,
Chonghuan Wang,
Guanting Chen
Abstract:
We study the role of the sampling distribution in Direct Preference Optimization (DPO) and aim to understand its impact on DPO's training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the generating distribution. We first analyze how distribution of responses influences policy updates during gradient descent, drawing connections to common phenomena found in practice. We then design a simplified yet well-structured alignment model as a proxy, and develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by empirical experiments and provide a principled justification for the online DPO framework in practice.
Submitted 3 June, 2025;
originally announced June 2025.
-
SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
Authors:
Yuhao Wu,
Yushi Bai,
Zhiqiang Hu,
Juanzi Li,
Roy Ka-Wei Lee
Abstract:
Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking-through planning and refinement stages into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.
Submitted 4 June, 2025;
originally announced June 2025.
-
From Virtual Agents to Robot Teams: A Multi-Robot Framework Evaluation in High-Stakes Healthcare Context
Authors:
Yuanchen Bai,
Zijian Ding,
Angelique Taylor
Abstract:
Advancements in generative models have enabled multi-agent systems (MAS) to perform complex virtual tasks such as writing and code generation, which do not generalize well to physical multi-agent robotic teams. Current frameworks often treat agents as conceptual task executors rather than physically embodied entities, and overlook critical real-world constraints such as spatial context and robotic capabilities (e.g., sensing and navigation). To probe this gap, we reconfigure and stress-test a hierarchical multi-agent robotic team built on the CrewAI framework in a simulated emergency department onboarding scenario. We identify five persistent failure modes: role misalignment; tool access violations; lack of in-time handling of failure reports; noncompliance with prescribed workflows; and bypassing or false reporting of task completion. Based on this analysis, we propose three design guidelines emphasizing process transparency, proactive failure recovery, and contextual grounding. Our work informs the development of more resilient and robust multi-agent robotic systems (MARS), including opportunities to extend virtual multi-agent frameworks to the real world.
Submitted 4 June, 2025;
originally announced June 2025.
-
Seed-Coder: Let the Code Model Curate Data for Itself
Authors:
ByteDance Seed,
Yuyu Zhang,
Jing Su,
Yifan Sun,
Chenguang Xi,
Xia Xiao,
Shen Zheng,
Anxiang Zhang,
Kaibo Liu,
Daoguang Zan,
Tao Sun,
Jinhua Zhu,
Shulin Xin,
Dong Huang,
Yetao Bai,
Lixin Dong,
Chao Li,
Jianchong Chen,
Hanzhi Zhou,
Yifan Huang,
Guanghan Ning,
Xierui Song,
Jiaze Chen,
Siyao Liu,
Kai Shen
, et al. (2 additional authors not shown)
Abstract:
Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
Submitted 4 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
A Foundation Model for Spatial Proteomics
Authors:
Muhammad Shaban,
Yuzhou Chang,
Huaying Qiu,
Yao Yu Yeo,
Andrew H. Song,
Guillaume Jaume,
Yuchen Wang,
Luca L. Weishaupt,
Tong Ding,
Anurag Vaidya,
Abdallah Lamane,
Daniel Shao,
Mohammed Zidane,
Yunhao Bai,
Paige McCallum,
Shuli Luo,
Wenrui Wu,
Yang Wang,
Precious Cramer,
Chi Ngai Chan,
Pierre Stephan,
Johanna Schaffenrath,
Jia Le Lee,
Hendrik A. Michel,
Caiwei Tian
, et al. (35 additional authors not shown)
Abstract:
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.
Submitted 3 June, 2025;
originally announced June 2025.
-
HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
Authors:
Ping Gong,
Jiawei Yi,
Shengnan Wang,
Juncheng Zhang,
Zewen Jin,
Ouxiang Zhou,
Ruibo Liu,
Guanbin Xu,
Youhui Bai,
Bowen Ye,
Kun Yuan,
Tong Yang,
Gong Zhang,
Renhai Chen,
Feng Wu,
Cheng Li
Abstract:
Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-$k$ attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggle to strike a balance between efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-$k$ Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the top-$k$ attention process. Unlike existing top-$k$ attention methods, which seek an absolute estimate of the qk score, typically at great cost, HATA maps queries and keys into binary hash codes and obtains the relative order of qk scores at low cost, which is sufficient for realizing top-$k$ attention. Extensive experiments demonstrate that HATA achieves up to 7.2$\times$ speedup compared to vanilla full attention while maintaining model accuracy. In addition, HATA outperforms the state-of-the-art top-$k$ attention methods in both accuracy and efficiency across multiple mainstream LLM models and diverse tasks. HATA is open source at https://github.com/gpzlx1/HATA.
Submitted 3 June, 2025;
originally announced June 2025.
-
Dynamic Domain Adaptation-Driven Physics-Informed Graph Representation Learning for AC-OPF
Authors:
Hongjie Zhu,
Zezheng Zhang,
Zeyu Zhang,
Yu Bai,
Shimin Wen,
Huazhang Wang,
Daji Ergu,
Ying Cai,
Yang Zhao
Abstract:
Alternating Current Optimal Power Flow (AC-OPF) aims to optimize generator power outputs by utilizing the non-linear relationships between voltage magnitudes and phase angles in a power system. However, current AC-OPF solvers struggle to effectively represent the complex relationship between variable distributions in the constraint space and their corresponding optimal solutions. This limitation in constraint modeling restricts the system's ability to develop diverse knowledge representations. Additionally, modeling the power grid solely based on spatial topology further limits the integration of additional prior knowledge, such as temporal information. To overcome these challenges, we propose DDA-PIGCN (Dynamic Domain Adaptation-Driven Physics-Informed Graph Convolutional Network), a new method designed to address constraint-related issues and build a graph-based learning framework that incorporates spatiotemporal features. DDA-PIGCN improves consistency optimization for features with varying long-range dependencies by applying multi-layer, hard physics-informed constraints. It also uses a dynamic domain adaptation learning mechanism that iteratively updates and refines key state variables under predefined constraints, enabling precise constraint verification. Moreover, it captures spatiotemporal dependencies between generators and loads by leveraging the physical structure of the power grid, allowing for deep integration of topological information across time and space. Extensive comparative and ablation studies show that DDA-PIGCN delivers strong performance across several IEEE standard test cases (such as case9, case30, and case300), achieving mean absolute errors (MAE) from 0.0011 to 0.0624 and constraint satisfaction rates between 99.6% and 100%, establishing it as a reliable and efficient AC-OPF solver.
Submitted 31 May, 2025;
originally announced June 2025.
-
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Authors:
Junyu Zhang,
Runpei Dong,
Han Wang,
Xuying Ning,
Haoran Geng,
Peihao Li,
Xialin He,
Yutong Bai,
Jitendra Malik,
Saurabh Gupta,
Huan Zhang
Abstract:
This paper presents AlphaOne ($α$1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. $α$1 first introduces $α$ moment, which represents the scaled thinking phase with a universal parameter $α$. Within this scaled pre-$α$ moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the $α$ moment, $α$1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate $α$1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/
Submitted 30 May, 2025;
originally announced May 2025.
-
Multi-task Learning for Heterogeneous Multi-source Block-Wise Missing Data
Authors:
Yang Sui,
Qi Xu,
Yang Bai,
Annie Qu
Abstract:
Multi-task learning (MTL) has emerged as an imperative machine learning tool to solve multiple learning tasks simultaneously and has been successfully applied to healthcare, marketing, and biomedical fields. However, in order to borrow information across different tasks effectively, it is essential to utilize both homogeneous and heterogeneous information. Among the extensive literature on MTL, various forms of heterogeneity are presented in MTL problems, such as block-wise, distribution, and posterior heterogeneity. Existing methods, however, struggle to tackle these forms of heterogeneity simultaneously in a unified framework. In this paper, we propose a two-step learning strategy for MTL which addresses the aforementioned heterogeneity. First, we impute the missing blocks using shared representations extracted from homogeneous source across different tasks. Next, we disentangle the mappings between input features and responses into a shared component and a task-specific component, respectively, thereby enabling information borrowing through the shared component. Our numerical experiments and real-data analysis from the ADNI database demonstrate the superior MTL performance of the proposed method compared to other competing methods.
Submitted 30 May, 2025;
originally announced May 2025.
-
Multi-task Learning for Heterogeneous Data via Integrating Shared and Task-Specific Encodings
Authors:
Yang Sui,
Qi Xu,
Yang Bai,
Annie Qu
Abstract:
Multi-task learning (MTL) has become an essential machine learning tool for addressing multiple learning tasks simultaneously and has been effectively applied across fields such as healthcare, marketing, and biomedical research. However, to enable efficient information sharing across tasks, it is crucial to leverage both shared and heterogeneous information. Despite extensive research on MTL, various forms of heterogeneity, including distribution and posterior heterogeneity, present significant challenges. Existing methods often fail to address these forms of heterogeneity within a unified framework. In this paper, we propose a dual-encoder framework to construct a heterogeneous latent factor space for each task, incorporating a task-shared encoder to capture common information across tasks and a task-specific encoder to preserve unique task characteristics. Additionally, we explore the intrinsic similarity structure of the coefficients corresponding to learned latent factors, allowing for adaptive integration across tasks to manage posterior heterogeneity. We introduce a unified algorithm that alternately learns the task-specific and task-shared encoders and coefficients. In theory, we investigate the excess risk bound for the proposed MTL method using local Rademacher complexity and apply it to a new but related task. Through simulation studies, we demonstrate that the proposed method outperforms existing data integration methods across various settings. Furthermore, the proposed method achieves superior predictive performance for time to tumor doubling across five distinct cancer types in PDX data.
Submitted 30 May, 2025;
originally announced May 2025.
-
REOrdering Patches Improves Vision Models
Authors:
Declan Kutscher,
David M. Chan,
Yutong Bai,
Trevor Darrell,
Ritwik Gupta
Abstract:
Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
Submitted 29 May, 2025;
originally announced May 2025.
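To make the patch-ordering idea in the abstract above concrete, here is a minimal sketch. The helper names (`column_major_order`, `reorder`, `sample_plackett_luce`) are hypothetical and do not come from the REOrder code. The first two helpers show how a column-major ordering is a fixed permutation of the row-major patch sequence. The third draws a permutation from a Plackett-Luce distribution by repeatedly picking one remaining item with probability proportional to the exponential of its score, which is the standard sampling procedure for that model. The REINFORCE training loop is not shown.

```python
import math
import random


def column_major_order(rows, cols):
    """Indices into a row-major patch list, visited column by column."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


def reorder(patches, order):
    """Apply a permutation (list of source indices) to a patch sequence."""
    return [patches[i] for i in order]


def sample_plackett_luce(scores, rng):
    """Sample a permutation: pick items one at a time without replacement,
    each with probability proportional to exp(score) among those remaining."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        order.append(pick)
    return order


if __name__ == "__main__":
    patches = ["p0", "p1", "p2", "p3", "p4", "p5"]  # 2 x 3 grid, row-major
    print(reorder(patches, column_major_order(2, 3)))
    print(sample_plackett_luce([0.0] * 6, random.Random(0)))
```

With all scores equal, the Plackett-Luce sampler reduces to a uniform random permutation; larger scores push the corresponding patches toward the front of the sequence.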
-
How does Transformer Learn Implicit Reasoning?
Authors:
Jiaran Ye,
Zijun Yao,
Zhidian Huang,
Liangming Pan,
Jinxin Liu,
Yushi Bai,
Amy Xin,
Liu Weichuan,
Xiaoyin Che,
Lei Hou,
Juanzi Li
Abstract:
Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly -- producing correct answers without explicitly verbalizing intermediate steps -- but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with cosine-based clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.
Submitted 29 May, 2025;
originally announced May 2025.
-
Recovering Fairness Directly from Modularity: a New Way for Fair Community Partitioning
Authors:
Yufeng Wang,
Yiguang Bai,
Tianqing Zhu,
Ismail Ben Ayed,
Jing Yuan
Abstract:
Community partitioning is crucial in network analysis, with modularity optimization being the prevailing technique. However, traditional modularity-based methods often overlook fairness, a critical aspect in real-world applications. To address this, we introduce protected group networks and propose a novel fairness-modularity metric. This metric extends traditional modularity by explicitly incorporating fairness, and we prove that minimizing it yields naturally fair partitions for protected groups while maintaining theoretical soundness. We develop a general optimization framework for fairness partitioning and design the efficient Fair Fast Newman (FairFN) algorithm, enhancing the Fast Newman (FN) method to optimize both modularity and fairness. Experiments show FairFN achieves significantly improved fairness and high-quality partitions compared to state-of-the-art methods, especially on unbalanced datasets.
Submitted 27 May, 2025;
originally announced May 2025.
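For readers unfamiliar with the quantity that the abstract above extends, the following sketch computes the standard Newman modularity of a partition, $Q = \sum_c \left[ L_c/m - (d_c/2m)^2 \right]$, where $m$ is the number of edges, $L_c$ the number of edges inside community $c$, and $d_c$ the sum of degrees in $c$. This is the classical metric only, not the paper's fairness-modularity metric, whose definition the abstract does not give. The function name and the edge-list input format are assumptions of this sketch, which also assumes an undirected graph without self-loops or duplicate edges.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected simple graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    internal = {}  # number of edges with both endpoints in the community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    degree_sum = {}  # total degree per community
    for node, d in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + d
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    print(modularity(edges, split))
```

Placing all six nodes in one community gives a modularity of zero, which is why the two-triangle split scores higher on this example.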
-
Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
Authors:
Kai Sun,
Yushi Bai,
Zhen Yang,
Jiajie Zhang,
Ji Qi,
Lei Hou,
Juanzi Li
Abstract:
Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train CLIP using our strong negative learning method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further study the impact of different negative sample construction methods and the number of negative samples on the geometric reasoning performance of LMM, yielding fruitful conclusions. The code and dataset are available at https://github.com/THU-KEG/MMGeoLM.
Submitted 26 May, 2025;
originally announced May 2025.
-
A Regularization-Guided Equivariant Approach for Image Restoration
Authors:
Yulu Bai,
Jiahong Fu,
Qi Xie,
Deyu Meng
Abstract:
Equivariant and invariant deep learning models have been developed to exploit intrinsic symmetries in data, demonstrating significant effectiveness in certain scenarios. However, these methods often suffer from limited representation accuracy and rely on strict symmetry assumptions that may not hold in practice. These limitations pose a significant drawback for image restoration tasks, which demand high accuracy and precise symmetry representation. To address these challenges, we propose a rotation-equivariant regularization strategy that adaptively enforces the appropriate symmetry constraints on the data while preserving the network's representational accuracy. Specifically, we introduce EQ-Reg, a regularizer designed to enhance rotation equivariance, which innovatively extends the insights of data-augmentation-based and equivariant-based methodologies. This is achieved through self-supervised learning and the spatial rotation and cyclic channel shift of feature maps deduced in the equivariant framework. Our approach enables, for the first time, a non-strictly equivariant network suitable for image restoration, providing a simple and adaptive mechanism for adjusting equivariance based on the task. Extensive experiments across three low-level tasks demonstrate the superior accuracy and generalization capability of our method, outperforming state-of-the-art approaches.
Submitted 26 May, 2025;
originally announced May 2025.
-
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
Authors:
Shuang Zeng,
Xinyuan Chang,
Mengwei Xie,
Xinran Liu,
Yifan Bai,
Zheng Pan,
Mu Xu,
Xing Wei
Abstract:
Visual language models (VLMs) have attracted increasing interest in autonomous driving due to their powerful reasoning capabilities. However, existing VLMs typically utilize discrete text Chain-of-Thought (CoT) tailored to the current scenario, which essentially represents a highly abstract and symbolic compression of visual information, potentially leading to spatio-temporal relationship ambiguity and fine-grained information loss. Is autonomous driving better modeled on real-world simulation and imagination than on pure symbolic logic? In this paper, we propose a spatio-temporal CoT reasoning method that enables models to think visually. First, the VLM serves as a world model that generates a unified image frame for predicting future world states, where perception results (e.g., lane dividers and 3D detections) represent the future spatial relationships and ordinary future frames represent the temporal evolution relationships. This spatio-temporal CoT then serves as intermediate reasoning steps, enabling the VLM to function as an inverse dynamics model for trajectory planning based on current observations and future predictions. To implement visual generation in VLMs, we propose a unified pretraining paradigm integrating visual generation and understanding, along with a progressive visual CoT enhancing autoregressive image generation. Extensive experimental results demonstrate the effectiveness of the proposed method, advancing autonomous driving towards visual reasoning.
Submitted 23 May, 2025;
originally announced May 2025.
-
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Authors:
Shuzheng Si,
Haozhe Zhao,
Cheng Gao,
Yuzhuo Bai,
Zhitong Wang,
Bofei Gao,
Kangyang Luo,
Wenhao Li,
Yufei Huang,
Gang Chen,
Fanchao Qi,
Minjia Zhang,
Baobao Chang,
Maosong Sun
Abstract:
Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
Submitted 22 May, 2025;
originally announced May 2025.
-
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
Authors:
Bin Xu,
Yu Bai,
Huashan Sun,
Yiguan Lin,
Siming Liu,
Xinyue Liang,
Yaolin Li,
Yang Gao,
Heyan Huang
Abstract:
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.
Submitted 27 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
TACOcc: Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy
Authors:
Luyao Lei,
Shuo Xu,
Yifan Bai,
Xing Wei
Abstract:
The performance of multi-modal 3D occupancy prediction is limited by ineffective fusion, mainly due to geometry-semantics mismatch from fixed fusion strategies and surface detail loss caused by sparse, noisy annotations. The mismatch stems from the heterogeneous scale and distribution of point cloud and image features, leading to biased matching under fixed neighborhood fusion. To address this, we propose a target-scale adaptive, bidirectional symmetric retrieval mechanism. It expands the neighborhood for large targets to enhance context awareness and shrinks it for small ones to improve efficiency and suppress noise, enabling accurate cross-modal feature alignment. This mechanism explicitly establishes spatial correspondences and improves fusion accuracy. For surface detail loss, sparse labels provide limited supervision, resulting in poor predictions for small objects. We introduce an improved volume rendering pipeline based on 3D Gaussian Splatting, which takes fused features as input to render images, applies photometric consistency supervision, and jointly optimizes 2D-3D consistency. This enhances surface detail reconstruction while suppressing noise propagation. In summary, we propose TACOcc, an adaptive multi-modal fusion framework for 3D semantic occupancy prediction, enhanced by volume rendering supervision. Experiments on the nuScenes and SemanticKITTI benchmarks validate its effectiveness.
Submitted 19 May, 2025;
originally announced May 2025.
-
SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
Authors:
Huashan Sun,
Shengyi Liao,
Yansen Han,
Yu Bai,
Yang Gao,
Cheng Fu,
Weizhou Shen,
Fanqi Wan,
Ming Yan,
Ji Zhang,
Fei Huang
Abstract:
Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named $\textbf{S}$h$\textbf{o}$rt-to-$\textbf{Lo}$ng $\textbf{P}$reference $\textbf{O}$ptimization ($\textbf{SoLoPO}$), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.
Submitted 16 May, 2025;
originally announced May 2025.
-
Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation
Authors:
Qing Yu,
Xiaobei Wang,
Shuchang Liu,
Yandong Bai,
Xiaoyu Yang,
Xueliang Wang,
Chang Meng,
Shanshan Wu,
Hailan Yang,
Huihui Xiao,
Xiang Li,
Fan Yang,
Xiaoqiang Feng,
Lantao Hu,
Han Li,
Kun Gai,
Lixin Zou
Abstract:
Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focuses on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, TagCF exploits the (Multi-modal) LLM's world knowledge and logic inference ability to extract realistic tag-based virtual logic graphs that reveal dynamic and expressive knowledge of users, refining our understanding of user behaviors. On the other hand, TagCF presents empirically effective integration modules that take advantage of the extracted tag-logic information, augmenting the recommendation performance. We conduct both online experiments and offline experiments with industrial and public datasets as verification of TagCF's effectiveness, and we empirically show that the user role modeling strategy is potentially a better choice than the modeling of item topics. Additionally, we provide evidence that the extracted logic graphs are empirically a general and transferable knowledge that can benefit a wide range of recommendation tasks.
Submitted 20 May, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
-
An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care
Authors:
Zhi Da Soh,
Yang Bai,
Kai Yu,
Yang Zhou,
Xiaofeng Lei,
Sahil Thakur,
Zann Lee,
Lee Ching Linette Phang,
Qingsheng Peng,
Can Can Xue,
Rachel Shujuan Chong,
Quan V. Hoang,
Lavanya Raghavan,
Yih Chung Tham,
Charumathi Sabanayagam,
Wei-Chi Wu,
Ming-Chih Ho,
Jiangnan He,
Preeti Gupta,
Ecosse Lamoureux,
Seang Mei Saw,
Vinay Nangia,
Songhomitra Panda-Jonas,
Jie Xu,
Ya Xing Wang
, et al. (6 additional authors not shown)
Abstract:
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved $\ge$ 82.2% accuracy in disease detection, $\ge$ 89% in severity differentiation, $\ge$ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation.
Submitted 13 May, 2025;
originally announced May 2025.
-
Formation Maneuver Control Based on the Augmented Laplacian Method
Authors:
Xinzhe Zhou,
Xuyang Wang,
Xiaoming Duan,
Yuzhu Bai,
Jianping He
Abstract:
This paper proposes a novel formation maneuver control method for both 2-D and 3-D space, which enables the formation to translate, scale, and rotate with arbitrary orientation. The core innovation is the novel design of weights in the proposed augmented Laplacian matrix. Instead of using scalars, we represent weights as matrices, which are designed based on a specified rotation axis and allow the formation to perform rotation in 3-D space. To further improve the flexibility and scalability of the formation, the rotational axis adjustment approach and dynamic agent reconfiguration method are developed, allowing formations to rotate around arbitrary axes in 3-D space and new agents to join the formation. Theoretical analysis is provided to show that the proposed approach preserves the original configuration of the formation. The proposed method maintains the advantages of the complex Laplacian-based method, including reduced neighbor requirements and no reliance on generic or convex nominal configurations, while achieving arbitrary orientation rotations via a more simplified implementation. Simulations in both 2-D and 3-D space validate the effectiveness of the proposed method.
Submitted 9 May, 2025;
originally announced May 2025.
-
DiffPattern-Flex: Efficient Layout Pattern Generation via Discrete Diffusion
Authors:
Zixiao Wang,
Wenqian Zhao,
Yunheng Shen,
Yang Bai,
Guojin Chen,
Farzan Farnia,
Bei Yu
Abstract:
Recent advancements in layout pattern generation have been dominated by deep generative models. However, relying solely on neural networks for legality guarantees raises concerns in many practical applications. In this paper, we present DiffPattern-Flex, a novel approach designed to generate reliable layout patterns efficiently. DiffPattern-Flex incorporates a new method for generating diverse topologies using a discrete diffusion model while maintaining a lossless and compute-efficient layout representation. To ensure legal pattern generation, we employ an optimization-based, white-box pattern assessment process based on specific design rules. Furthermore, fast sampling and efficient legalization technologies are employed to accelerate the generation process. Experimental results across various benchmarks demonstrate that DiffPattern-Flex significantly outperforms existing methods and excels at producing reliable layout patterns.
Submitted 7 May, 2025;
originally announced May 2025.
-
FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
Authors:
Rui Lan,
Yancheng Bai,
Xu Duan,
Mingxing Li,
Lei Sun,
Xiangxiang Chu
Abstract:
The task of scene text editing is to modify or add texts on images while maintaining the fidelity of newly generated text and visual coherence with the background. Recent works based on latent diffusion models (LDM) show improved text editing results, yet still face challenges and often generate inaccurate or unrecognizable characters, especially for non-Latin ones (e.g., Chinese), which have complex glyph structures. To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing framework based on FLUX-Fill. Specifically, we carefully investigate glyph conditioning, considering both visual and textual modalities. To retain the original generative capabilities of FLUX-Fill while enhancing its understanding and generation of glyphs, we propose lightweight glyph and text embedding modules. Owing to the lightweight design, FLUX-Text is trained with only 100K training examples, compared to current popular methods trained with 2.9M. With no bells and whistles, our method achieves state-of-the-art performance on text editing tasks. Qualitative and quantitative experiments on the public datasets demonstrate that our method surpasses previous works in text fidelity.
Submitted 6 May, 2025;
originally announced May 2025.
-
Multi-Party Private Set Operations from Predicative Zero-Sharing
Authors:
Minglang Dong,
Yu Chen,
Cong Zhang,
Yujie Bai,
Yang Cao
Abstract:
Typical protocols in the multi-party private set operations (MPSO) setting enable m > 2 parties to perform certain secure computations on the intersection or union of their private sets, realizing a very limited range of MPSO functionalities. Most works in this field focus on just one or two specific functionalities, resulting in a large variety of isolated schemes and a lack of a unified framework in MPSO research. In this work, we present an MPSO framework, which allows m parties, each holding a set, to securely compute any set formula (an arbitrary composition of a finite number of binary set operations, including intersection, union and difference) on their private sets. Our framework is highly versatile and can be instantiated to accommodate a broad spectrum of MPSO functionalities. To the best of our knowledge, this is the first framework to achieve such a level of flexibility and generality in MPSO, without relying on generic secure multi-party computation (MPC) techniques.
Our framework exhibits favorable theoretical and practical performance. The computation and communication complexity scale linearly with the set size n, and it achieves optimal complexity that is on par with the naive solution for widely used functionalities, such as multi-party private set intersection (MPSI), MPSI with cardinality output (MPSI-card), and MPSI with cardinality and sum (MPSI-card-sum), in the standard semi-honest model. Furthermore, instantiations of our framework built mainly from symmetric-key techniques yield efficient protocols for MPSI, MPSI-card, MPSI-card-sum, and multi-party private set union (MPSU), with online performance surpassing or matching the state of the art.
Submitted 10 April, 2025;
originally announced April 2025.
-
Learning Efficiency Meets Symmetry Breaking
Authors:
Yingbin Bai,
Sylvie Thiebaux,
Felipe Trevizan
Abstract:
Learning-based planners leveraging Graph Neural Networks can learn search guidance applicable to large search spaces, yet their potential to address symmetries remains largely unexplored. In this paper, we introduce a graph representation of planning problems that combines learning efficiency with the ability to detect symmetries, along with two pruning methods, action pruning and state pruning, designed to manage symmetries during search. The integration of these techniques into Fast Downward achieves a first-time success over LAMA on the latest IPC learning track dataset. Code is released at: https://github.com/bybeye/Distincter.
Submitted 28 April, 2025;
originally announced April 2025.