-
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Authors:
Bingda Tang,
Boyang Zheng,
Xichen Pan,
Sayak Paul,
Saining Xie
Abstract:
This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed com…
▽ More
This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Qwen3 Technical Report
Authors:
An Yang,
Anfeng Li,
Baosong Yang,
Beichen Zhang,
Binyuan Hui,
Bo Zheng,
Bowen Yu,
Chang Gao,
Chengen Huang,
Chenxu Lv,
Chujie Zheng,
Dayiheng Liu,
Fan Zhou,
Fei Huang,
Feng Hu,
Hao Ge,
Haoran Wei,
Huan Lin,
Jialong Tang,
Jian Yang,
Jianhong Tu,
Jianwei Zhang,
Jianxin Yang,
Jiaxi Yang,
Jing Zhou
, et al. (35 additional authors not shown)
Abstract:
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration…
▽ More
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
HuB: Learning Extreme Humanoid Balance
Authors:
Tong Zhang,
Boyuan Zheng,
Ruiqian Nai,
Yingdong Hu,
Yen-Jen Wang,
Geng Chen,
Fanqi Lin,
Jiongye Li,
Chuye Hong,
Koushil Sreenath,
Yang Gao
Abstract:
The human body demonstrates exceptional motor capabilities-such as standing steadily on one foot or performing a high kick with the leg raised over 1.5 meters-both requiring precise balance control. While recent research on humanoid control has leveraged reinforcement learning to track human motions for skill acquisition, applying this paradigm to balance-intensive tasks remains challenging. In th…
▽ More
The human body demonstrates exceptional motor capabilities-such as standing steadily on one foot or performing a high kick with the leg raised over 1.5 meters-both requiring precise balance control. While recent research on humanoid control has leveraged reinforcement learning to track human motions for skill acquisition, applying this paradigm to balance-intensive tasks remains challenging. In this work, we identify three key obstacles: instability from reference motion errors, learning difficulties due to morphological mismatch, and the sim-to-real gap caused by sensor noise and unmodeled dynamics. To address these challenges, we propose HuB (Humanoid Balance), a unified framework that integrates reference motion refinement, balance-aware policy learning, and sim-to-real robustness training, with each component targeting a specific challenge. We validate our approach on the Unitree G1 humanoid robot across challenging quasi-static balance tasks, including extreme single-legged poses such as Swallow Balance and Bruce Lee's Kick. Our policy remains stable even under strong physical disturbances-such as a forceful soccer strike-while baseline methods consistently fail to complete these tasks. Project website: https://hub-robot.github.io
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
A Generative Re-ranking Model for List-level Multi-objective Optimization at Taobao
Authors:
Yue Meng,
Cheng Guo,
Yi Cao,
Tong Liu,
Bo Zheng
Abstract:
E-commerce recommendation systems aim to generate ordered lists of items for customers, optimizing multiple business objectives, such as clicks, conversions and Gross Merchandise Volume (GMV). Traditional multi-objective optimization methods like formulas or Learning-to-rank (LTR) models take effect at item-level, neglecting dynamic user intent and contextual item interactions. List-level multi-ob…
▽ More
E-commerce recommendation systems aim to generate ordered lists of items for customers, optimizing multiple business objectives, such as clicks, conversions and Gross Merchandise Volume (GMV). Traditional multi-objective optimization methods like formulas or Learning-to-rank (LTR) models take effect at item-level, neglecting dynamic user intent and contextual item interactions. List-level multi-objective optimization in the re-ranking stage can overcome this limitation, but most current re-ranking models focus more on accuracy improvement with context. In addition, re-ranking is faced with the challenges of time complexity and diversity. In light of this, we propose a novel end-to-end generative re-ranking model named Sequential Ordered Regression Transformer-Generator (SORT-Gen) for the less-studied list-level multi-objective optimization problem. Specifically, SORT-Gen is divided into two parts: 1)Sequential Ordered Regression Transformer innovatively uses Transformer and ordered regression to accurately estimate multi-objective values for variable-length sub-lists. 2)Mask-Driven Fast Generation Algorithm combines multi-objective candidate queues, efficient item selection and diversity mechanism into model inference, providing a fast online list generation method. Comprehensive online experiments demonstrate that SORT-Gen brings +4.13% CLCK and +8.10% GMV for Baiyibutie, a notable Mini-app of Taobao. Currently, SORT-Gen has been successfully deployed in multiple scenarios of Taobao App, serving for a vast number of users.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Authors:
Zihan Qiu,
Zekun Wang,
Bo Zheng,
Zeyu Huang,
Kaiyue Wen,
Songlin Yang,
Rui Men,
Le Yu,
Fei Huang,
Suozhi Huang,
Dayiheng Liu,
Jingren Zhou,
Junyang Lin
Abstract:
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a com…
▽ More
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release related $\href{https://github.com/qiuzh20/gated_attention}{codes}$ and $\href{https://huggingface.co/QwQZh/gated_attention}{models}$ to facilitate future research.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors
Authors:
Zunjie Zhu,
Yan Zhao,
Yihan Hu,
Guoxiang Wang,
Hai Qiu,
Bolun Zheng,
Chenggang Yan,
Feng Xu
Abstract:
The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external vi…
▽ More
The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
X-Driver: Explainable Autonomous Driving with Vision-Language Models
Authors:
Wei Liu,
Jiyuan Zhang,
Binxiong Zheng,
Yufeng Hu,
Yingzhan Lin,
Zengfeng Zeng
Abstract:
End-to-end autonomous driving has advanced significantly, offering benefits such as system simplicity and stronger driving performance in both open-loop and closed-loop settings than conventional pipelines. However, existing frameworks still suffer from low success rates in closed-loop evaluations, highlighting their limitations in real-world deployment. In this paper, we introduce X-Driver, a uni…
▽ More
End-to-end autonomous driving has advanced significantly, offering benefits such as system simplicity and stronger driving performance in both open-loop and closed-loop settings than conventional pipelines. However, existing frameworks still suffer from low success rates in closed-loop evaluations, highlighting their limitations in real-world deployment. In this paper, we introduce X-Driver, a unified multi-modal large language models(MLLMs) framework designed for closed-loop autonomous driving, leveraging Chain-of-Thought(CoT) and autoregressive modeling to enhance perception and decision-making. We validate X-Driver across multiple autonomous driving tasks using public benchmarks in CARLA simulation environment, including Bench2Drive[6]. Our experimental results demonstrate superior closed-loop performance, surpassing the current state-of-the-art(SOTA) while improving the interpretability of driving decisions. These findings underscore the importance of structured reasoning in end-to-end driving and establish X-Driver as a strong baseline for future research in closed-loop autonomous driving.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index
Authors:
Yuxiang Guo,
Zhonghao Hu,
Yuren Mao,
Baihua Zheng,
Yunjun Gao,
Mingwei Zhou
Abstract:
Natural language (NL)-driven table discovery identifies relevant tables from large table repositories based on NL queries. While current deep-learning-based methods using the traditional dense vector search pipeline, i.e., representation-index-search, achieve remarkable accuracy, they face several limitations that impede further performance improvements: (i) the errors accumulated during the table…
▽ More
Natural language (NL)-driven table discovery identifies relevant tables from large table repositories based on NL queries. While current deep-learning-based methods using the traditional dense vector search pipeline, i.e., representation-index-search, achieve remarkable accuracy, they face several limitations that impede further performance improvements: (i) the errors accumulated during the table representation and indexing phases affect the subsequent search accuracy; and (ii) insufficient query-table interaction hinders effective semantic alignment, impeding accuracy improvements. In this paper, we propose a novel framework Birdie, using a differentiable search index. It unifies the indexing and search into a single encoder-decoder language model, thus getting rid of error accumulations. Birdie first assigns each table a prefix-aware identifier and leverages a large language model-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates the issue of catastrophic forgetting. Extensive experiments demonstrate that Birdie outperforms state-of-the-art dense methods by 16.8% in accuracy, and reduces forgetting by over 90% compared to other continual learning approaches.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
Authors:
Yanan Guo,
Wenhui Dong,
Jun Song,
Shiding Zhu,
Xuan Zhang,
Hanqing Yang,
Yingbo Wang,
Yang Du,
Xianing Chen,
Bo Zheng
Abstract:
Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to…
▽ More
Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive modules.To address these issues, we propose FiLA(Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
Dual Attention Driven Lumbar Magnetic Resonance Image Feature Enhancement and Automatic Diagnosis of Herniation
Authors:
Lingrui Zhang,
Liang Guo,
Xiao An,
Feng Lin,
Binlong Zheng,
Jiankun Wang,
Zhirui Li
Abstract:
Lumbar disc herniation (LDH) is a common musculoskeletal disease that requires magnetic resonance imaging (MRI) for effective clinical management. However, the interpretation of MRI images heavily relies on the expertise of radiologists, leading to delayed diagnosis and high costs for training physicians. Therefore, this paper proposes an innovative automated LDH classification framework. To addre…
▽ More
Lumbar disc herniation (LDH) is a common musculoskeletal disease that requires magnetic resonance imaging (MRI) for effective clinical management. However, the interpretation of MRI images heavily relies on the expertise of radiologists, leading to delayed diagnosis and high costs for training physicians. Therefore, this paper proposes an innovative automated LDH classification framework. To address these key issues, the framework utilizes T1-weighted and T2-weighted MRI images from 205 people. The framework extracts clinically actionable LDH features and generates standardized diagnostic outputs by leveraging data augmentation and channel and spatial attention mechanisms. These outputs can help physicians make confident and time-effective care decisions when needed. The proposed framework achieves an area under the receiver operating characteristic curve (AUC-ROC) of 0.969 and an accuracy of 0.9486 for LDH detection. The experimental results demonstrate the performance of the proposed framework. Our framework only requires a small number of datasets for training to demonstrate high diagnostic accuracy. This is expected to be a solution to enhance the LDH detection capabilities of primary hospitals.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning
Authors:
Liangyu Xu,
Yingxiu Zhao,
Jingyun Wang,
Yingyao Wang,
Bu Pi,
Chen Wang,
Mingliang Zhang,
Jihao Gu,
Xiang Li,
Xiaoyong Zhu,
Jun Song,
Bo Zheng
Abstract:
Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to j…
▽ More
Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, remaining a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of $65.3$. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.
△ Less
Submitted 23 April, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging
Authors:
Tianhui Song,
Weixin Feng,
Shuai Wang,
Xubin Li,
Tiezheng Ge,
Bo Zheng,
Limin Wang
Abstract:
The success of text-to-image (T2I) generation models has spurred a proliferation of numerous model checkpoints fine-tuned from the same base model on various specialized datasets. This overwhelming specialized model production introduces new challenges for high parameter redundancy and huge storage cost, thereby necessitating the development of effective methods to consolidate and unify the capabi…
▽ More
The success of text-to-image (T2I) generation models has spurred a proliferation of numerous model checkpoints fine-tuned from the same base model on various specialized datasets. This overwhelming specialized model production introduces new challenges for high parameter redundancy and huge storage cost, thereby necessitating the development of effective methods to consolidate and unify the capabilities of diverse powerful models into a single one. A common practice in model merging adopts static linear interpolation in the parameter space to achieve the goal of style mixing. However, it neglects the features of T2I generation task that numerous distinct models cover sundry styles which may lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline which can accurately generate arbitrary-style images under the control of style vectors. Based on this design, we propose the score distillation based model merging paradigm (DMM), compressing multiple models into a single versatile T2I model. Moreover, we rethink and reformulate the model merging task in the context of T2I generation, by presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM can compactly reorganize the knowledge from multiple teacher models and achieve controllable arbitrary-style generation.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework
Authors:
Zihan Ling,
Zhiyao Guo,
Yixuan Huang,
Yi An,
Shuai Xiao,
Jinsong Lan,
Xiaoyong Zhu,
Bo Zheng
Abstract:
Recent advancements in large language models (LLMs) and multi-modal LLMs have been remarkable. However, these models still rely solely on their parametric knowledge, which limits their ability to generate up-to-date information and increases the risk of producing erroneous content. Retrieval-Augmented Generation (RAG) partially mitigates these challenges by incorporating external data sources, yet…
▽ More
Recent advancements in large language models (LLMs) and multi-modal LLMs have been remarkable. However, these models still rely solely on their parametric knowledge, which limits their ability to generate up-to-date information and increases the risk of producing erroneous content. Retrieval-Augmented Generation (RAG) partially mitigates these challenges by incorporating external data sources, yet the reliance on databases and retrieval systems can introduce irrelevant or inaccurate documents, ultimately undermining both performance and reasoning quality. In this paper, we propose Multi-Modal Knowledge-Based Retrieval-Augmented Generation (MMKB-RAG), a novel multi-modal RAG framework that leverages the inherent knowledge boundaries of models to dynamically generate semantic tags for the retrieval process. This strategy enables the joint filtering of retrieved documents, retaining only the most relevant and accurate references. Extensive experiments on knowledge-based visual question-answering tasks demonstrate the efficacy of our approach: on the E-VQA dataset, our method improves performance by +4.2% on the Single-Hop subset and +0.4% on the full dataset, while on the InfoSeek dataset, it achieves gains of +7.8% on the Unseen-Q subset, +8.2% on the Unseen-E subset, and +8.1% on the full dataset. These results highlight significant enhancements in both accuracy and robustness over the current state-of-the-art MLLM and RAG frameworks.
△ Less
Submitted 20 April, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
Authors:
Boyuan Zheng,
Michael Y. Fatemi,
Xiaolong Jin,
Zora Zhiruo Wang,
Apurva Gandhi,
Yueqi Song,
Yu Gu,
Jayanth Srinivasa,
Gaowen Liu,
Graham Neubig,
Yu Su
Abstract:
To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reuseable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural…
▽ More
To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reuseable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, refining skills, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering
Authors:
Yifan Gao,
Zihang Lin,
Chuanbin Liu,
Min Zhou,
Tiezheng Ge,
Bo Zheng,
Hongtao Xie
Abstract:
Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, while the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify the key to precise text…
▽ More
Product posters, which integrate subject, scene, and text, are crucial promotional tools for attracting customers. Creating such posters using modern image generation methods is valuable, while the main challenge lies in accurately rendering text, especially for complex writing systems like Chinese, which contains over 10,000 individual characters. In this work, we identify the key to precise text rendering as constructing a character-discriminative visual feature as a control signal. Based on this insight, we propose a robust character-wise representation as control and we develop TextRenderNet, which achieves a high text rendering accuracy of over 90%. Another challenge in poster generation is maintaining the fidelity of user-specific products. We address this by introducing SceneGenNet, an inpainting-based model, and propose subject fidelity feedback learning to further enhance fidelity. Based on TextRenderNet and SceneGenNet, we present PosterMaker, an end-to-end generation framework. To optimize PosterMaker efficiently, we implement a two-stage training strategy that decouples text rendering and background generation learning. Experimental results show that PosterMaker outperforms existing baselines by a remarkable margin, which demonstrates its effectiveness.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
VALUE: Value-Aware Large Language Model for Query Rewriting via Weighted Trie in Sponsored Search
Authors:
Boyang Zuo,
Xiao Zhang,
Feng Li,
Pengjie Wang,
Jian Xu,
Bo Zheng
Abstract:
In the realm of sponsored search advertising, matching advertisements with the search intent of a user's query is crucial. Query-to-bidwords(i.e. bidding keywords) rewriting is a vital technique that has garnered significant attention. Recently, with the prevalence of LLMs, generative retrieval methods have proven effective in producing high-relevance rewrites. However, we have identified a signif…
▽ More
In the realm of sponsored search advertising, matching advertisements with the search intent of a user's query is crucial. Query-to-bidwords(i.e. bidding keywords) rewriting is a vital technique that has garnered significant attention. Recently, with the prevalence of LLMs, generative retrieval methods have proven effective in producing high-relevance rewrites. However, we have identified a significant limitation in existing approaches: While fine-tuning LLMs for specific domains enhances semantic relevance, these models have no perception of the intrinsic value of their generated outputs, such as commercial value. Therefore, after SFT, a RLHF phase is often employed to address this issue. Nevertheless, traditional preference alignment methods often face challenges in aligning fine-grained values and are susceptible to overfitting, which diminishes the effectiveness and quality of the generated results. To address these challenges, we propose VALUE(Value-Aware Large language model for qUery rewriting via wEighted trie), the first framework that ensures the generation of high-value and highly relevant bidwords. Our approach utilizes weighted trie, an innovative modification of the traditional trie data structure. By modulating the LLM's output probability distribution with value information from the trie during decoding process, we constrain the generation space and guide the trajectory of text production. Offline experiments demonstrate the effectiveness of our method in semantic matching and preference alignment, showing a remarkable improvement in the value attribute by more than fivefold. Online A/B tests further revealed that our Revenue Per Mille (RPM) metric increased by 1.64%. VALUE has been deployed on our advertising system since October 2024 and served the Double Eleven promotions, the biggest shopping carnival in China.
△ Less
Submitted 25 February, 2025;
originally announced April 2025.
-
Universal Item Tokenization for Transferable Generative Recommendation
Authors:
Bowen Zheng,
Hongyu Lu,
Yu Chen,
Wayne Xin Zhao,
Ji-Rong Wen
Abstract:
Recently, generative recommendation has emerged as a promising paradigm, attracting significant research attention. The basic framework involves an item tokenizer, which represents each item as a sequence of codes serving as its identifier, and a generative recommender that predicts the next item by autoregressively generating the target item identifier. However, in existing methods, both the toke…
▽ More
Recently, generative recommendation has emerged as a promising paradigm, attracting significant research attention. The basic framework involves an item tokenizer, which represents each item as a sequence of codes serving as its identifier, and a generative recommender that predicts the next item by autoregressively generating the target item identifier. However, in existing methods, both the tokenizer and the recommender are typically domain-specific, limiting their ability for effective transfer or adaptation to new domains. To this end, we propose UTGRec, a Universal item Tokenization approach for transferable Generative Recommendation. Specifically, we design a universal item tokenizer for encoding rich item semantics by adapting a multimodal large language model (MLLM). By devising tree-structured codebooks, we discretize content representations into corresponding codes for item tokenization. To effectively learn the universal item tokenizer on multiple domains, we introduce two key techniques in our approach. For raw content reconstruction, we employ dual lightweight decoders to reconstruct item text and images from discrete representations to capture general knowledge embedded in the content. For collaborative knowledge integration, we assume that co-occurring items are similar and integrate collaborative signals through co-occurrence alignment and reconstruction. Finally, we present a joint learning framework to pre-train and adapt the transferable generative recommender across multiple domains. Extensive experiments on four public datasets demonstrate the superiority of UTGRec compared to both traditional and generative recommendation baselines.
△ Less
Submitted 13 April, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
Pre-training Generative Recommender with Multi-Identifier Item Tokenization
Authors:
Bowen Zheng,
Enze Liu,
Zhongfu Chen,
Zhongrui Ma,
Yue Wang,
Wayne Xin Zhao,
Ji-Rong Wen
Abstract:
Generative recommendation autoregressively generates item identifiers to recommend potential items. Existing methods typically adopt a one-to-one mapping strategy, where each item is represented by a single identifier. However, this scheme poses issues, such as suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. To overcome these limitations, we propo…
▽ More
Generative recommendation autoregressively generates item identifiers to recommend potential items. Existing methods typically adopt a one-to-one mapping strategy, where each item is represented by a single identifier. However, this scheme poses issues, such as suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. To overcome these limitations, we propose MTGRec, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Our approach involves two key innovations: multi-identifier item tokenization and curriculum recommender pre-training. For multi-identifier item tokenization, we leverage the RQ-VAE as the tokenizer backbone and treat model checkpoints from adjacent training epochs as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers, enabling a single user interaction sequence to be converted into several token sequences as different data groups. For curriculum recommender pre-training, we introduce a curriculum learning scheme guided by data influence estimation, dynamically adjusting the sampling probability of each data group during recommender pre-training. After pre-training, we fine-tune the model using a single tokenizer to ensure accurate item identification for recommendation. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and generative recommendation baselines in terms of effectiveness and scalability.
△ Less
Submitted 13 April, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
Authors:
Zhanda Zhu,
Christina Giannoula,
Muralidhar Andoorveedu,
Qidong Su,
Karttikeya Mangalam,
Bojian Zheng,
Gennady Pekhimenko
Abstract:
Various parallelism, such as data, tensor, and pipeline parallelism, along with memory optimizations like activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training for Large Language Models. To find the best combination of these techniques, automatic distributed training systems are proposed. However, existing systems only tune a subset…
▽ More
Various parallelism, such as data, tensor, and pipeline parallelism, along with memory optimizations like activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training for Large Language Models. To find the best combination of these techniques, automatic distributed training systems are proposed. However, existing systems only tune a subset of optimizations, due to the lack of overlap awareness, inability to navigate the vast search space, and ignoring the inter-microbatch imbalance, leading to sub-optimal performance. To address these shortcomings, we propose Mist, a memory, overlap, and imbalance-aware automatic distributed training system that comprehensively co-optimizes all memory footprint reduction techniques alongside parallelism. Mist is based on three key ideas: (1) fine-grained overlap-centric scheduling, orchestrating optimizations in an overlapped manner, (2) symbolic-based performance analysis that predicts runtime and memory usage using symbolic expressions for fast tuning, and (3) imbalance-aware hierarchical tuning, decoupling the process into an inter-stage imbalance and overlap aware Mixed Integer Linear Programming problem and an intra-stage Dual-Objective Constrained Optimization problem, and connecting them through Pareto frontier sampling. Our evaluation results show that Mist achieves an average of 1.28$\times$ (up to 1.73$\times$) and 1.27$\times$ (up to 2.04$\times$) speedup compared to state-of-the-art manual system Megatron-LM and state-of-the-art automatic system Aceso, respectively.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
A Tutorial on Six-Dimensional Movable Antenna for 6G Networks: Synergizing Positionable and Rotatable Antennas
Authors:
Xiaodan Shao,
Weidong Mei,
Changsheng You,
Qingqing Wu,
Beixiong Zheng,
Cheng-Xiang Wang,
Junling Li,
Rui Zhang,
Robert Schober,
Lipeng Zhu,
Weihua Zhuang,
Xuemin Shen
Abstract:
Six-dimensional movable antenna (6DMA) is a new
and revolutionary technique that fully exploits the wireless
channel spatial variations at the transmitter/receiver by flexibly
adjusting the three-dimensional (3D) positions and/or 3D rotations
of antennas/antenna surfaces (sub-arrays), thereby improving the performance of wireless
networks cost-effectively without the need to deploy addit…
▽ More
Six-dimensional movable antenna (6DMA) is a new
and revolutionary technique that fully exploits the wireless
channel spatial variations at the transmitter/receiver by flexibly
adjusting the three-dimensional (3D) positions and/or 3D rotations
of antennas/antenna surfaces (sub-arrays), thereby improving the performance of wireless
networks cost-effectively without the need to deploy additional
antennas. It is thus expected that
the integration of new 6DMAs into future sixth-generation (6G) wireless networks will fundamentally enhance
antenna agility and adaptability, and introduce new degrees
of freedom (DoFs) for system design. Despite its great potential,
6DMA faces new challenges to be efficiently implemented in wireless
networks, including corresponding architectures, antenna position and rotation optimization, channel estimation,
and system design from both communication and sensing perspectives. In
this paper, we provide a tutorial on 6DMA-enhanced wireless
networks to address the above issues by unveiling associated new channel models, hardware implementations and
practical position/rotation constraints, as well as various appealing applications in
wireless networks. Moreover, we discuss two special cases of 6DMA, namely, rotatable 6DMA with fixed antenna position and positionable 6DMA with fixed antenna rotation, and highlight their respective design challenges and applications.
We further present prototypes developed for 6DMA-enhanced communication along with experimental results obtained with these prototypes. Finally, we outline promising directions for further investigation.
△ Less
Submitted 7 May, 2025; v1 submitted 23 March, 2025;
originally announced March 2025.
-
A Comprehensive Survey on Long Context Language Modeling
Authors:
Jiaheng Liu,
Dawei Zhu,
Zhiqi Bai,
Yancheng He,
Huanxuan Liao,
Haoran Que,
Zekun Wang,
Chenchen Zhang,
Ge Zhang,
Jiebin Zhang,
Yuanxing Zhang,
Zhuo Chen,
Hangyu Guo,
Shilong Li,
Ziqiang Liu,
Yong Shan,
Yifan Song,
Jiayi Tian,
Wenhao Wu,
Zhejian Zhou,
Ruijie Zhu,
Junlan Feng,
Yang Gao,
Shizhu He,
Zhoujun Li
, et al. (12 additional authors not shown)
Abstract:
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-c…
▽ More
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: \href{https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling}{\color[RGB]{175,36,67}{LCLM-Horizon}}.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
R2LDM: An Efficient 4D Radar Super-Resolution Framework Leveraging Diffusion Model
Authors:
Boyuan Zheng,
Shouyi Lu,
Renbo Huang,
Minqing Huang,
Fan Lu,
Wei Tian,
Guirong Zhuo,
Lu Xiong
Abstract:
We introduce R2LDM, an innovative approach for generating dense and accurate 4D radar point clouds, guided by corresponding LiDAR point clouds. Instead of utilizing range images or bird's eye view (BEV) images, we represent both LiDAR and 4D radar point clouds using voxel features, which more effectively capture 3D shape information. Subsequently, we propose the Latent Voxel Diffusion Model (LVDM)…
▽ More
We introduce R2LDM, an innovative approach for generating dense and accurate 4D radar point clouds, guided by corresponding LiDAR point clouds. Instead of utilizing range images or bird's eye view (BEV) images, we represent both LiDAR and 4D radar point clouds using voxel features, which more effectively capture 3D shape information. Subsequently, we propose the Latent Voxel Diffusion Model (LVDM), which performs the diffusion process in the latent space. Additionally, a novel Latent Point Cloud Reconstruction (LPCR) module is utilized to reconstruct point clouds from high-dimensional latent voxel features. As a result, R2LDM effectively generates LiDAR-like point clouds from paired raw radar data. We evaluate our approach on two different datasets, and the experimental results demonstrate that our model achieves 6- to 10-fold densification of radar point clouds, outperforming state-of-the-art baselines in 4D radar point cloud super-resolution. Furthermore, the enhanced radar point clouds generated by our method significantly improve downstream tasks, achieving up to 31.7% improvement in point cloud registration recall rate and 24.9% improvement in object detection accuracy.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation
Authors:
Yijia Luo,
Yulin Song,
Xingyao Zhang,
Jiaheng Liu,
Weixun Wang,
GengRu Chen,
Wenbo Su,
Bo Zheng
Abstract:
Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities through long chain-of-thought (CoT) reasoning. The R1 distillation scheme has emerged as a promising approach for training cost-effective models with enhanced reasoning abilities. However, the underlying mechanisms driving its effectiveness remain unclear. This study examines the universality of…
▽ More
Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities through long chain-of-thought (CoT) reasoning. The R1 distillation scheme has emerged as a promising approach for training cost-effective models with enhanced reasoning abilities. However, the underlying mechanisms driving its effectiveness remain unclear. This study examines the universality of distillation data and identifies key components that enable the efficient transfer of long-chain reasoning capabilities in LLM distillation. Our findings reveal that the effectiveness of long CoT reasoning distillation from teacher models like Qwen-QwQ degrades significantly on nonhomologous models, challenging the assumed universality of current distillation methods. To gain deeper insights into the structure and patterns of long CoT reasoning, we propose DLCoT (Deconstructing Long Chain-of-Thought), a distillation data enhancement framework. DLCoT consists of three key steps: (1) data segmentation to decompose complex long CoT structures, (2) simplification by eliminating unsolvable and redundant solutions, and (3) optimization of intermediate error states. Our approach significantly improves model performance and token efficiency, facilitating the development of high-performance LLMs.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph
Authors:
Langming Liu,
Haibin Chen,
Yuhao Wang,
Yujin Yuan,
Shilei Liu,
Wenbo Su,
Xiangyu Zhao,
Bo Zheng
Abstract:
Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern associated with LLMs is their factuality (e.g., hallucination), which is urgent in e-commerce due to its significant impa…
▽ More
Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern associated with LLMs is their factuality (e.g., hallucination), which is urgent in e-commerce due to its significant impact on user experience and revenue. Despite some methods proposed to evaluate LLMs' factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap between effective assessment in e-commerce. To bridge the evaluation gap, we propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge. Specifically, we adopt a standardized workflow to automatically generate questions based on a large-scale knowledge graph, guaranteeing sufficient reliability. We employ the simple question-answering paradigm, substantially improving the evaluation efficiency by the least input and output tokens. Furthermore, we inject abundant e-commerce expertise in each evaluation stage, including human annotation, prompt design, negative sampling, and verification. Besides, we explore the LLMs' knowledge boundaries in e-commerce from a novel perspective. Through comprehensive evaluations of several advanced LLMs on ECKGBench, we provide meticulous analysis and insights into leveraging LLMs for e-commerce.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
KDSelector: A Knowledge-Enhanced and Data-Efficient Model Selector Learning Framework for Time Series Anomaly Detection
Authors:
Zhiyu Liang,
Dongrui Cai,
Chenyuan Zhang,
Zheng Liang,
Chen Liang,
Bo Zheng,
Shi Qiu,
Jin Wang,
Hongzhi Wang
Abstract:
Model selection has been raised as an essential problem in the area of time series anomaly detection (TSAD), because there is no single best TSAD model for the highly heterogeneous time series in real-world applications. However, despite the success of existing model selection solutions that train a classification model (especially neural network, NN) using historical data as a selector to predict…
▽ More
Model selection has been raised as an essential problem in the area of time series anomaly detection (TSAD), because there is no single best TSAD model for the highly heterogeneous time series in real-world applications. However, despite the success of existing model selection solutions that train a classification model (especially neural network, NN) using historical data as a selector to predict the correct TSAD model for each series, the NN-based selector learning methods used by existing solutions do not make full use of the knowledge in the historical data and require iterating over all training samples, which limits the accuracy and training speed of the selector. To address these limitations, we propose KDSelector, a novel knowledge-enhanced and data-efficient framework for learning the NN-based TSAD model selector, of which three key components are specifically designed to integrate available knowledge into the selector and dynamically prune less important and redundant samples during the learning. We develop a TSAD model selection system with KDSelector as the internal, to demonstrate how users improve the accuracy and training speed of their selectors by using KDSelector as a plug-and-play module. Our demonstration video is hosted at https://youtu.be/2uqupDWvTF0.
△ Less
Submitted 19 March, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
Bridging Textual-Collaborative Gap through Semantic Codes for Sequential Recommendation
Authors:
Enze Liu,
Bowen Zheng,
Wayne Xin Zhao,
Ji-Rong Wen
Abstract:
In recent years, substantial research efforts have been devoted to enhancing sequential recommender systems by integrating abundant side information with ID-based collaborative information. This study specifically focuses on leveraging the textual metadata (e.g., titles and brands) associated with items. While existing methods have achieved notable success by combining text and ID representations,…
▽ More
In recent years, substantial research efforts have been devoted to enhancing sequential recommender systems by integrating abundant side information with ID-based collaborative information. This study specifically focuses on leveraging the textual metadata (e.g., titles and brands) associated with items. While existing methods have achieved notable success by combining text and ID representations, they often struggle to strike a balance between textual information embedded in text representations and collaborative information from sequential patterns of user behavior. In light of this, we propose CoCoRec, a novel Code-based textual and Collaborative semantic fusion method for sequential Recommendation. The key idea behind our approach is to bridge the gap between textual and collaborative information using semantic codes. Specifically, we generate fine-grained semantic codes from multi-view text embeddings through vector quantization techniques. Subsequently, we develop a code-guided semantic-fusion module based on the cross-attention mechanism to flexibly extract and integrate relevant information from text representations. In order to further enhance the fusion of textual and collaborative semantics, we introduce an optimization strategy that employs code masking with two specific objectives: masked code modeling and masked sequence alignment. The merit of these objectives lies in leveraging mask prediction tasks and augmented item representations to capture code correlations within individual items and enhance the sequence modeling of the recommendation backbone. Extensive experiments conducted on four public datasets demonstrate the superiority of CoCoRec, showing significant improvements over various sequential recommendation models. Our code is available at https://anonymous.4open.science/r/CoCoRec-6E41.
△ Less
Submitted 15 March, 2025;
originally announced March 2025.
-
D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning
Authors:
Jia Zhang,
Chen-Xi Zhang,
Yao Liu,
Yi-Xuan Jin,
Xiao-Wen Yang,
Bo Zheng,
Yi Liu,
Lan-Zhe Guo
Abstract:
Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and effi…
▽ More
Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
Rotatable Antennas for Integrated Sensing and Communications
Authors:
Chao Zhou,
Changsheng You,
Beixiong Zheng,
Xiaodan Shao,
Rui Zhang
Abstract:
In this letter, we propose to deploy rotatable antennas (RAs) at the base station (BS) to enhance both communication and sensing (C&S) performances, by exploiting a new spatial degree-of-freedom (DoF) offered by array rotation. Specifically, we formulate a multi-objective optimization problem to simultaneously maximize the sum-rate of multiple communication users and minimize the Cramér-Rao bound…
▽ More
In this letter, we propose to deploy rotatable antennas (RAs) at the base station (BS) to enhance both communication and sensing (C&S) performances, by exploiting a new spatial degree-of-freedom (DoF) offered by array rotation. Specifically, we formulate a multi-objective optimization problem to simultaneously maximize the sum-rate of multiple communication users and minimize the Cramér-Rao bound (CRB) for target angle estimation, by jointly optimizing the transmit beamforming vectors and the array rotation angle at the BS. To solve this problem, we first equivalently decompose it into two subproblems, corresponding to an inner problem for beamforming optimization and an outer problem for array rotation optimization. Although these two subproblems are non-convex, we obtain their high-quality solutions by applying the block coordinate descent (BCD) technique and one-dimensional exhaustive search, respectively. Moreover, we show that for the communication-only case, RAs provide an additional rotation gain to improve communication performance; while for the sensing-only case, the equivalent spatial aperture can be enlarged by RAs for achieving higher sensing accuracy. Finally, numerical results are presented to showcase the performance gains of RAs over fixed-rotation antennas in integrated sensing and communications (ISAC).
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
Nash Equilibrium Constrained Auto-bidding With Bi-level Reinforcement Learning
Authors:
Zhiyu Mou,
Miao Xu,
Rongquan Bai,
Zhuoran Yang,
Chuan Yu,
Jian Xu,
Bo Zheng
Abstract:
Many online advertising platforms provide advertisers with auto-bidding services to enhance their advertising performance. However, most existing auto-bidding algorithms fail to accurately capture the auto-bidding problem formulation that the platform truly faces, let alone solve it. Actually, we argue that the platform should try to help optimize each advertiser's performance to the greatest exte…
▽ More
Many online advertising platforms provide advertisers with auto-bidding services to enhance their advertising performance. However, most existing auto-bidding algorithms fail to accurately capture the auto-bidding problem formulation that the platform truly faces, let alone solve it. Actually, we argue that the platform should try to help optimize each advertiser's performance to the greatest extent -- which makes $ε$-Nash Equilibrium ($ε$-NE) a necessary solution concept -- while maximizing the social welfare of all the advertisers for the platform's long-term value. Based on this, we introduce the \emph{Nash-Equilibrium Constrained Bidding} (NCB), a new formulation of the auto-bidding problem from the platform's perspective. Specifically, it aims to maximize the social welfare of all advertisers under the $ε$-NE constraint. However, the NCB problem presents significant challenges due to its constrained bi-level structure and the typically large number of advertisers involved. To address these challenges, we propose a \emph{Bi-level Policy Gradient} (BPG) framework with theoretical guarantees. Notably, its computational complexity is independent of the number of advertisers, and the associated gradients are straightforward to compute. Extensive simulated and real-world experiments validate the effectiveness of the BPG framework.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
Authors:
Peng Chen,
Pi Bu,
Yingyao Wang,
Xinyi Wang,
Ziming Wang,
Jie Guo,
Yingxiu Zhao,
Qi Zhu,
Jun Song,
Siran Yang,
Jiamang Wang,
Bo Zheng
Abstract:
Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimize…
▽ More
Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games(ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at https://combatvla.github.io/.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity
Authors:
Yijie Xu,
Bolun Zheng,
Wei Zhu,
Hangjia Pan,
Yuchen Yao,
Ning Xu,
Anan Liu,
Quan Zhang,
Chenggang Yan
Abstract:
Social media popularity prediction task aims to predict the popularity of posts on social media platforms, which has a positive driving effect on application scenarios such as content optimization, digital marketing and online advertising. Though many studies have made significant progress, few of them pay much attention to the integration between popularity prediction with temporal alignment. In…
▽ More
Social media popularity prediction task aims to predict the popularity of posts on social media platforms, which has a positive driving effect on application scenarios such as content optimization, digital marketing and online advertising. Though many studies have made significant progress, few of them pay much attention to the integration between popularity prediction with temporal alignment. In this paper, with exploring YouTube's multilingual and multi-modal content, we construct a new social media temporal popularity prediction benchmark, namely SMTPD, and suggest a baseline framework for temporal popularity prediction. Through data analysis and experiments, we verify that temporal alignment and early popularity play crucial roles in social media popularity prediction for not only deepening the understanding of temporal dynamics of popularity in social media but also offering a suggestion about developing more effective prediction models in this field. Code is available at https://github.com/zhuwei321/SMTPD.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Gradient Deconfliction via Orthogonal Projections onto Subspaces For Multi-task Learning
Authors:
Shijie Zhu,
Hui Zhao,
Tianshu Wu,
Pengjie Wang,
Hongbo Deng,
Jian Xu,
Bo Zheng
Abstract:
Although multi-task learning (MTL) has been a preferred approach and successfully applied in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks mainly due to the negative effects of conflicting gradients among the tasks. In this paper, we fully examine the influence of conflicting gradients and further emphasize the importance and advantages of a…
▽ More
Although multi-task learning (MTL) has been a preferred approach and successfully applied in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks mainly due to the negative effects of conflicting gradients among the tasks. In this paper, we fully examine the influence of conflicting gradients and further emphasize the importance and advantages of achieving non-conflicting gradients which allows simple but effective trade-off strategies among the tasks with stable performance. Based on our findings, we propose the Gradient Deconfliction via Orthogonal Projections onto Subspaces (GradOPS) spanned by other task-specific gradients. Our method not only solves all conflicts among the tasks, but can also effectively search for diverse solutions towards different trade-off preferences among the tasks. Theoretical analysis on convergence is provided, and performance of our algorithm is fully testified on multiple benchmarks in various domains. Results demonstrate that our method can effectively find multiple state-of-the-art solutions with different trade-off strategies among the tasks on multiple datasets.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
Straight-Line Diffusion Model for Efficient 3D Molecular Generation
Authors:
Yuyan Ni,
Shikun Feng,
Haohan Chi,
Bowen Zheng,
Huan-ang Gao,
Wei-Ying Ma,
Zhi-Ming Ma,
Yanyan Lan
Abstract:
Diffusion-based models have shown great promise in molecular generation but often require a large number of sampling steps to generate valid samples. In this paper, we introduce a novel Straight-Line Diffusion Model (SLDM) to tackle this problem, by formulating the diffusion process to follow a linear trajectory. The proposed process aligns well with the noise sensitivity characteristic of molecul…
▽ More
Diffusion-based models have shown great promise in molecular generation but often require a large number of sampling steps to generate valid samples. In this paper, we introduce a novel Straight-Line Diffusion Model (SLDM) to tackle this problem, by formulating the diffusion process to follow a linear trajectory. The proposed process aligns well with the noise sensitivity characteristic of molecular structures and uniformly distributes reconstruction effort across the generative process, thus enhancing learning efficiency and efficacy. Consequently, SLDM achieves state-of-the-art performance on 3D molecule generation benchmarks, delivering a 100-fold improvement in sampling efficiency. Furthermore, experiments on toy data and image generation tasks validate the generality and robustness of SLDM, showcasing its potential across diverse generative modeling domains.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Task-Agnostic Guided Feature Expansion for Class-Incremental Learning
Authors:
Bowen Zheng,
Da-Wei Zhou,
Han-Jia Ye,
De-Chuan Zhan
Abstract:
The ability to learn new concepts while preserve the learned knowledge is desirable for learning systems in Class-Incremental Learning (CIL). Recently, feature expansion of the model become a prevalent solution for CIL, where the old features are fixed during the training of the new task while new features are expanded for the new tasks. However, such task-specific features learned from the new ta…
▽ More
The ability to learn new concepts while preserve the learned knowledge is desirable for learning systems in Class-Incremental Learning (CIL). Recently, feature expansion of the model become a prevalent solution for CIL, where the old features are fixed during the training of the new task while new features are expanded for the new tasks. However, such task-specific features learned from the new task may collide with the old features, leading to misclassification between tasks. Therefore, the expanded model is often encouraged to capture diverse features from the new task, aiming to avoid such collision. However, the existing solution is largely restricted to the samples from the current task, because of the poor accessibility to previous samples. To promote the learning and transferring of diverse features across tasks, we propose a framework called Task-Agnostic Guided Feature Expansion (TagFex). Firstly, it captures task-agnostic features continually with a separate model, providing extra task-agnostic features for subsequent tasks. Secondly, to obtain useful features from the task-agnostic model for the current task, it aggregates the task-agnostic features with the task-specific feature using a merge attention. Then the aggregated feature is transferred back into the task-specific feature for inference, helping the task-specific model capture diverse features. Extensive experiments show the effectiveness and superiority of TagFex on various CIL settings. Code is available at https://github.com/bwnzheng/TagFex_CVPR2025.
△ Less
Submitted 2 March, 2025;
originally announced March 2025.
-
Unifying Light Field Perception with Field of Parallax
Authors:
Fei Teng,
Buyin Deng,
Boyuan Zheng,
Kai Luo,
Kunyu Peng,
Jiaming Zhang,
Kailun Yang
Abstract:
Field of Parallax (FoP)}, a spatial field that distills the common features from different LF representations to provide flexible and consistent support for multi-task learning. FoP is built upon three core features--projection difference, adjacency divergence, and contextual consistency--which are essential for cross-task adaptability. To implement FoP, we design a two-step angular adapter: the f…
▽ More
Field of Parallax (FoP)}, a spatial field that distills the common features from different LF representations to provide flexible and consistent support for multi-task learning. FoP is built upon three core features--projection difference, adjacency divergence, and contextual consistency--which are essential for cross-task adaptability. To implement FoP, we design a two-step angular adapter: the first step captures angular-specific differences, while the second step consolidates contextual consistency to ensure robust representation. Leveraging the FoP-based representation, we introduce the LFX framework, the first to handle arbitrary LF representations seamlessly, unifying LF multi-task vision. We evaluated LFX across three different tasks, achieving new state-of-the-art results, compared with previous task-specific architectures: 84.74% in mIoU for semantic segmentation on UrbanLF, 0.84% in AP for object detection on PKU, and 0.030 in MAE and 0.026 in MAE for salient object detection on Duftv2 and PKU, respectively. The source code will be made publicly available at https://github.com/warriordby/LFX.
△ Less
Submitted 2 March, 2025;
originally announced March 2025.
-
ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models
Authors:
Haibin Chen,
Kangtao Lv,
Chengwei Hu,
Yanshi Li,
Yujin Yuan,
Yancheng He,
Xingyao Zhang,
Langming Liu,
Shilei Liu,
Wenbo Su,
Bo Zheng
Abstract:
With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information within the complex e-commerce applications. Therefore, it is necessary to build an e-commerce concept benchmark. Existing benchmarks encounter two primary c…
▽ More
With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information within the complex e-commerce applications. Therefore, it is necessary to build an e-commerce concept benchmark. Existing benchmarks encounter two primary challenges: (1) handle the heterogeneous and diverse nature of tasks, (2) distinguish between generality and specificity within the e-commerce field. To address these problems, we propose \textbf{ChineseEcomQA}, a scalable question-answering benchmark focused on fundamental e-commerce concepts. ChineseEcomQA is built on three core characteristics: \textbf{Focus on Fundamental Concept}, \textbf{E-commerce Generality} and \textbf{E-commerce Expertise}. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks, thus addressing the challenge of heterogeneity and diversity. Additionally, by carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts, allowing for precise validation of domain capabilities. We achieve this through a scalable benchmark construction process that combines LLM validation, Retrieval-Augmented Generation (RAG) validation, and rigorous manual annotation. Based on ChineseEcomQA, we conduct extensive evaluations on mainstream LLMs and provide some valuable insights. We hope that ChineseEcomQA could guide future domain-specific evaluations, and facilitate broader LLM adoption in e-commerce applications.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Authors:
Yancheng He,
Shilong Li,
Jiaheng Liu,
Weixun Wang,
Xingyuan Bu,
Ge Zhang,
Zhongyuan Peng,
Zhaoxiang Zhang,
Zhicheng Zheng,
Wenbo Su,
Bo Zheng
Abstract:
Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated lo…
▽ More
Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated long CoTs from different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.
△ Less
Submitted 30 March, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering
Authors:
Langming Liu,
Shilei Liu,
Yujin Yuan,
Yizhen Zhang,
Bencheng Yan,
Zhiyuan Zeng,
Zihao Wang,
Jiaqi Liu,
Di Wang,
Wenbo Su,
Pengjie Wang,
Jian Xu,
Bo Zheng
Abstract:
Large language models (LLMs) achieve remarkable success in natural language processing (NLP). In practical scenarios like recommendations, as users increasingly seek personalized experiences, it becomes crucial to incorporate user interaction history into the context of LLMs to enhance personalization. However, from a practical utility perspective, user interactions' extensive length and noise pre…
▽ More
Large language models (LLMs) achieve remarkable success in natural language processing (NLP). In practical scenarios like recommendations, as users increasingly seek personalized experiences, it becomes crucial to incorporate user interaction history into the context of LLMs to enhance personalization. However, from a practical utility perspective, user interactions' extensive length and noise present challenges when used directly as text prompts. A promising solution is to compress and distill interactions into compact embeddings, serving as soft prompts to assist LLMs in generating personalized responses. Although this approach brings efficiency, a critical concern emerges: Can user embeddings adequately capture valuable information and prompt LLMs? To address this concern, we propose \name, a benchmark designed to evaluate the effectiveness of user embeddings in prompting LLMs for personalization. We establish a fair and standardized evaluation process, encompassing pre-training, fine-tuning, and evaluation stages. To thoroughly evaluate user embeddings, we design three dimensions of tasks: sequence understanding, action prediction, and interest perception. These evaluation tasks cover the industry's demands in traditional recommendation tasks, such as improving prediction accuracy, and its aspirations for LLM-based methods, such as accurately understanding user interests and enhancing the user experience. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings. Additionally, we reveal the scaling laws of leveraging user embeddings to prompt LLMs. The benchmark is available online.
△ Less
Submitted 1 April, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
AIR: Complex Instruction Generation via Automatic Iterative Refinement
Authors:
Wei Liu,
Yancheng He,
Hui Huang,
Chengwei Hu,
Jiaheng Liu,
Shilong Li,
Wenbo Su,
Bo Zheng
Abstract:
With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, whi…
▽ More
With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, while effective for simple instruction generation, fail to leverage the rich contents and structures in large web corpora. In this paper, we propose a novel automatic iterative refinement framework to generate complex instructions with constraints, which not only better reflects the requirements of real scenarios but also significantly enhances LLMs' ability to follow complex instructions. The AIR framework consists of two stages: (1)Generate an initial instruction from a document; (2)Iteratively refine instructions with LLM-as-judge guidance by comparing the model's output with the document to incorporate valuable constraints. Finally, we construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model's ability to follow complex instructions, outperforming existing methods for instruction generation.
△ Less
Submitted 27 February, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
-
HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
Authors:
Yilei Jiang,
Xinyan Gao,
Tianshuo Peng,
Yingshui Tan,
Xiaoyong Zhu,
Bo Zheng,
Xiangyu Yue
Abstract:
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work , we investigate whether LVLMs inheren…
▽ More
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work , we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that {HiddenDetect} surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
△ Less
Submitted 21 April, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
ChineseSimpleVQA -- "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models
Authors:
Jihao Gu,
Yingyao Wang,
Pi Bu,
Chen Wang,
Ziming Wang,
Tengtao Song,
Donglai Wei,
Jiale Yuan,
Yingxiu Zhao,
Yancheng He,
Shilong Li,
Jiaheng Liu,
Meng Cao,
Jun Song,
Yingshui Tan,
Xiang Li,
Wenbo Su,
Zhicheng Zheng,
Xiaoyong Zhu,
Bo Zheng
Abstract:
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major t…
▽ More
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, a multi-hop question construction, high-quality data, static consistency, and easy-to-evaluate through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple the visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field.
△ Less
Submitted 26 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Authors:
Yingshui Tan,
Yilei Jiang,
Yanshi Li,
Jiaheng Liu,
Xingyuan Bu,
Wenbo Su,
Xiangyu Yue,
Xiaoyong Zhu,
Bo Zheng
Abstract:
Fine-tuning large language models (LLMs) based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout the fine-tuning process remains a significant challenge, as resolving conflicts between safety and helpfulness can be non-trivial. Typically, the safety alignme…
▽ More
Fine-tuning large language models (LLMs) based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout the fine-tuning process remains a significant challenge, as resolving conflicts between safety and helpfulness can be non-trivial. Typically, the safety alignment of LLM is trained on data with safety-related categories. However, our experiments find that naively increasing the scale of safety training data usually leads the LLMs to an ``overly safe'' state rather than a ``truly safe'' state, boosting the refusal rate through extensive safety-aligned data without genuinely understanding the requirements for safe responses. Such an approach can inadvertently diminish the models' helpfulness. To understand the phenomenon, we first investigate the role of safety data by categorizing them into three different groups, and observe that each group behaves differently as training data scales up. To boost the balance between safety and helpfulness, we propose an Equilibrate RLHF framework including a Fine-grained Data-centric (FDC) approach that achieves better safety alignment even with fewer training data, and an Adaptive Message-wise Alignment (AMA) approach, which selectively highlight the key segments through a gradient masking strategy. Extensive experimental results demonstrate that our approach significantly enhances the safety alignment of LLMs while balancing safety and helpfulness.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Unlocking Scaling Law in Industrial Recommendation Systems with a Three-step Paradigm based Large User Model
Authors:
Bencheng Yan,
Shilei Liu,
Zhiyuan Zeng,
Zihao Wang,
Yizhen Zhang,
Yujin Yuan,
Langming Liu,
Jiaqi Liu,
Di Wang,
Wenbo Su,
Wang Pengjie,
Jian Xu,
Bo Zheng
Abstract:
Recent advancements in autoregressive Large Language Models (LLMs) have achieved significant milestones, largely attributed to their scalability, often referred to as the "scaling law". Inspired by these achievements, there has been a growing interest in adapting LLMs for Recommendation Systems (RecSys) by reformulating RecSys tasks into generative problems. However, these End-to-End Generative Re…
▽ More
Recent advancements in autoregressive Large Language Models (LLMs) have achieved significant milestones, largely attributed to their scalability, often referred to as the "scaling law". Inspired by these achievements, there has been a growing interest in adapting LLMs for Recommendation Systems (RecSys) by reformulating RecSys tasks into generative problems. However, these End-to-End Generative Recommendation (E2E-GR) methods tend to prioritize idealized goals, often at the expense of the practical advantages offered by traditional Deep Learning based Recommendation Models (DLRMs) in terms of in features, architecture, and practices. This disparity between idealized goals and practical needs introduces several challenges and limitations, locking the scaling law in industrial RecSys. In this paper, we introduce a large user model (LUM) that addresses these limitations through a three-step paradigm, designed to meet the stringent requirements of industrial settings while unlocking the potential for scalable recommendations. Our extensive experimental evaluations demonstrate that LUM outperforms both state-of-the-art DLRMs and E2E-GR approaches. Notably, LUM exhibits excellent scalability, with performance improvements observed as the model scales up to 7 billion parameters. Additionally, we have successfully deployed LUM in an industrial application, where it achieved significant gains in an A/B test, further validating its effectiveness and practicality.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following
Authors:
Vivek Myers,
Bill Chunyuan Zheng,
Anca Dragan,
Kuan Fang,
Sergey Levine
Abstract:
Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps together. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositi…
▽ More
Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps together. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.
△ Less
Submitted 13 February, 2025; v1 submitted 8 February, 2025;
originally announced February 2025.
-
An Adaptable Budget Planner for Enhancing Budget-Constrained Auto-Bidding in Online Advertising
Authors:
Zhijian Duan,
Yusen Huo,
Tianyu Wang,
Zhilin Zhang,
Yeshu Li,
Chuan Yu,
Jian Xu,
Bo Zheng,
Xiaotie Deng
Abstract:
In online advertising, advertisers commonly utilize auto-bidding services to bid for impression opportunities. A typical objective of the auto-bidder is to optimize the advertiser's cumulative value of winning impressions within specified budget constraints. However, such a problem is challenging due to the complex bidding environment faced by diverse advertisers. To address this challenge, we int…
▽ More
In online advertising, advertisers commonly utilize auto-bidding services to bid for impression opportunities. A typical objective of the auto-bidder is to optimize the advertiser's cumulative value of winning impressions within specified budget constraints. However, such a problem is challenging due to the complex bidding environment faced by diverse advertisers. To address this challenge, we introduce ABPlanner, a few-shot adaptable budget planner designed to improve budget-constrained auto-bidding. ABPlanner is based on a hierarchical bidding framework that decomposes the bidding process into shorter, manageable stages. Within this framework, ABPlanner allocates the budget across all stages, allowing a low-level auto-bidder to bids based on the budget allocation plan. The adaptability of ABPlanner is achieved through a sequential decision-making approach, inspired by in-context reinforcement learning. For each advertiser, ABPlanner adjusts the budget allocation plan episode by episode, using data from previous episodes as prompt for current decisions. This enables ABPlanner to quickly adapt to different advertisers with few-shot data, providing a sample-efficient solution. Extensive simulation experiments and real-world A/B testing validate the effectiveness of ABPlanner, demonstrating its capability to enhance the cumulative value achieved by auto-bidders.
△ Less
Submitted 26 January, 2025;
originally announced February 2025.
-
Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning
Authors:
Bokeng Zheng,
Bo Rao,
Tianxiang Zhu,
Chee Wei Tan,
Jingpu Duan,
Zhi Zhou,
Xu Chen,
Xiaoxi Zhang
Abstract:
Advances in artificial intelligence (AI) including foundation models (FMs), are increasingly transforming human society, with smart city driving the evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities. In particular, ride-hailing vehicles can effectively facilitate flexible data collection and…
▽ More
Advances in artificial intelligence (AI) including foundation models (FMs), are increasingly transforming human society, with smart city driving the evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities. In particular, ride-hailing vehicles can effectively facilitate flexible data collection and contribute towards urban intelligence, despite resource limitations. Therefore, this work explores a promising scenario, where edge-assisted vehicles perform joint tasks of order serving and the emerging foundation model fine-tuning using various urban data. However, integrating the VCS AI task with the conventional order serving task is challenging, due to their inconsistent spatio-temporal characteristics: (i) The distributions of ride orders and data point-of-interests (PoIs) may not coincide in geography, both following a priori unknown patterns; (ii) they have distinct forms of temporal effects, i.e., prolonged waiting makes orders become instantly invalid while data with increased staleness gradually reduces its utility for model fine-tuning.To overcome these obstacles, we propose an online framework based on multi-agent reinforcement learning (MARL) with careful augmentation. A new quality-of-service (QoS) metric is designed to characterize and balance the utility of the two joint tasks, under the effects of varying data volumes and staleness. We also integrate graph neural networks (GNNs) with MARL to enhance state representations, capturing graph-structured, time-varying dependencies among vehicles and across locations. Extensive experiments on our testbed simulator, utilizing various real-world foundation model fine-tuning tasks and the New York City Taxi ride order dataset, demonstrate the advantage of our proposed method.
△ Less
Submitted 6 February, 2025;
originally announced February 2025.
-
Large Language Models Are Universal Recommendation Learners
Authors:
Junguang Jiang,
Yanwen Huang,
Bin Liu,
Xiaoyu Kong,
Ziru Xu,
Han Zhu,
Jian Xu,
Bo Zheng
Abstract:
In real-world recommender systems, different tasks are typically addressed using supervised learning on task-specific datasets with carefully designed model architectures. We demonstrate that large language models (LLMs) can function as universal recommendation learners, capable of handling multiple tasks within a unified input-output framework, eliminating the need for specialized model designs.…
▽ More
In real-world recommender systems, different tasks are typically addressed using supervised learning on task-specific datasets with carefully designed model architectures. We demonstrate that large language models (LLMs) can function as universal recommendation learners, capable of handling multiple tasks within a unified input-output framework, eliminating the need for specialized model designs. To improve the recommendation performance of LLMs, we introduce a multimodal fusion module for item representation and a sequence-in-set-out approach for efficient candidate generation. When applied to industrial-scale data, our LLM achieves competitive results with expert models elaborately designed for different recommendation tasks. Furthermore, our analysis reveals that recommendation outcomes are highly sensitive to text input, highlighting the potential of prompt engineering in optimizing industrial-scale recommender systems.
△ Less
Submitted 5 February, 2025;
originally announced February 2025.
-
Harmful Terms and Where to Find Them: Measuring and Modeling Unfavorable Financial Terms and Conditions in Shopping Websites at Scale
Authors:
Elisa Tsai,
Neal Mangaokar,
Boyuan Zheng,
Haizhong Zheng,
Atul Prakash
Abstract:
Terms and conditions for online shopping websites often contain terms that can have significant financial consequences for customers. Despite their impact, there is currently no comprehensive understanding of the types and potential risks associated with unfavorable financial terms. Furthermore, there are no publicly available detection systems or datasets to systematically identify or mitigate th…
▽ More
Terms and conditions for online shopping websites often contain terms that can have significant financial consequences for customers. Despite their impact, there is currently no comprehensive understanding of the types and potential risks associated with unfavorable financial terms. Furthermore, there are no publicly available detection systems or datasets to systematically identify or mitigate these terms. In this paper, we take the first steps toward solving this problem with three key contributions.
\textit{First}, we introduce \textit{TermMiner}, an automated data collection and topic modeling pipeline to understand the landscape of unfavorable financial terms. \textit{Second}, we create \textit{ShopTC-100K}, a dataset of terms and conditions from shopping websites in the Tranco top 100K list, comprising 1.8 million terms from 8,251 websites. Consequently, we develop a taxonomy of 22 types from 4 categories of unfavorable financial terms -- spanning purchase, post-purchase, account termination, and legal aspects. \textit{Third}, we build \textit{TermLens}, an automated detector that uses Large Language Models (LLMs) to identify unfavorable financial terms.
Fine-tuned on an annotated dataset, \textit{TermLens} achieves an F1 score of 94.6\% and a false positive rate of 2.3\% using GPT-4o. When applied to shopping websites from the Tranco top 100K, we find that 42.06\% of these sites contain at least one unfavorable financial term, with such terms being more prevalent on less popular websites. Case studies further highlight the financial risks and customer dissatisfaction associated with unfavorable financial terms, as well as the limitations of existing ecosystem defenses.
△ Less
Submitted 3 February, 2025;
originally announced February 2025.
-
MIM: Multi-modal Content Interest Modeling Paradigm for User Behavior Modeling
Authors:
Bencheng Yan,
Si Chen,
Shichang Jia,
Jianyu Liu,
Yueran Liu,
Chenghan Fu,
Wanxian Guan,
Hui Zhao,
Xiang Zhang,
Kai Zhang,
Wenbo Su,
Pengjie Wang,
Jian Xu,
Bo Zheng,
Baolin Liu
Abstract:
Click-Through Rate (CTR) prediction is a crucial task in recommendation systems, online searches, and advertising platforms, where accurately capturing users' real interests in content is essential for performance. However, existing methods heavily rely on ID embeddings, which fail to reflect users' true preferences for content such as images and titles. This limitation becomes particularly eviden…
▽ More
Click-Through Rate (CTR) prediction is a crucial task in recommendation systems, online searches, and advertising platforms, where accurately capturing users' real interests in content is essential for performance. However, existing methods heavily rely on ID embeddings, which fail to reflect users' true preferences for content such as images and titles. This limitation becomes particularly evident in cold-start and long-tail scenarios, where traditional approaches struggle to deliver effective results. To address these challenges, we propose a novel Multi-modal Content Interest Modeling paradigm (MIM), which consists of three key stages: Pre-training, Content-Interest-Aware Supervised Fine-Tuning (C-SFT), and Content-Interest-Aware UBM (CiUBM). The pre-training stage adapts foundational models to domain-specific data, enabling the extraction of high-quality multi-modal embeddings. The C-SFT stage bridges the semantic gap between content and user interests by leveraging user behavior signals to guide the alignment of embeddings with user preferences. Finally, the CiUBM stage integrates multi-modal embeddings and ID-based collaborative filtering signals into a unified framework. Comprehensive offline experiments and online A/B tests conducted on the Taobao, one of the world's largest e-commerce platforms, demonstrated the effectiveness and efficiency of MIM method. The method has been successfully deployed online, achieving a significant increase of +14.14% in CTR and +4.12% in RPM, showcasing its industrial applicability and substantial impact on platform performance. To promote further research, we have publicly released the code and dataset at https://pan.quark.cn/s/8fc8ec3e74f3.
△ Less
Submitted 23 February, 2025; v1 submitted 1 February, 2025;
originally announced February 2025.
-
DAGPrompT: Pushing the Limits of Graph Prompting with a Distribution-aware Graph Prompt Tuning Approach
Authors:
Qin Chen,
Liang Wang,
Bo Zheng,
Guojie Song
Abstract:
The pre-train then fine-tune approach has advanced GNNs by enabling general knowledge capture without task-specific labels. However, an objective gap between pre-training and downstream tasks limits its effectiveness. Recent graph prompting methods aim to close this gap through task reformulations and learnable prompts. Despite this, they struggle with complex graphs like heterophily graphs. Freez…
▽ More
The pre-train then fine-tune approach has advanced GNNs by enabling general knowledge capture without task-specific labels. However, an objective gap between pre-training and downstream tasks limits its effectiveness. Recent graph prompting methods aim to close this gap through task reformulations and learnable prompts. Despite this, they struggle with complex graphs like heterophily graphs. Freezing the GNN encoder can reduce the impact of prompting, while simple prompts fail to handle diverse hop-level distributions. This paper identifies two key challenges in adapting graph prompting methods for complex graphs: (1) adapting the model to new distributions in downstream tasks to mitigate pre-training and fine-tuning discrepancies from heterophily and (2) customizing prompts for hop-specific node requirements. To overcome these challenges, we propose Distribution-aware Graph Prompt Tuning (DAGPrompT), which integrates a GLoRA module for optimizing the GNN encoder's projection matrix and message-passing schema through low-rank adaptation. DAGPrompT also incorporates hop-specific prompts accounting for varying graph structures and distributions among hops. Evaluations on 10 datasets and 14 baselines demonstrate that DAGPrompT improves accuracy by up to 4.79 in node and graph classification tasks, setting a new state-of-the-art while preserving efficiency. Codes are available at GitHub.
△ Less
Submitted 25 January, 2025;
originally announced January 2025.