-
Neural Thermodynamic Laws for Large Language Model Training
Authors:
Ziming Liu,
Yizhou Liu,
Jeff Gore,
Max Tegmark
Abstract:
Beyond neural scaling laws, little is known about the laws underlying large language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) -- a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g…
▽ More
Beyond neural scaling laws, little is known about the laws underlying large language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) -- a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge under river-valley loss landscape assumptions. On the practical side, this scientific perspective yields intuitive guidelines for designing learning rate schedules.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data
Authors:
Yiwen Liu,
Jessica Bader,
Jae Myung Kim
Abstract:
With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could…
▽ More
With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could realistically exist in the real-world domain; synthetic images containing attributes that violate this criterion are considered infeasible. Intuitively, infeasible images are typically considered out-of-distribution; thus, training on such images is expected to hinder a model's ability to generalize to real-world data, and they should therefore be excluded from the training set whenever possible. However, does feasibility really matter? In this paper, we investigate whether enforcing feasibility is necessary when generating synthetic training data for CLIP-based classifiers, focusing on three target attributes: background, color, and texture. We introduce VariReal, a pipeline that minimally edits a given source image to include feasible or infeasible attributes given by the textual prompt generated by a large language model. Our experiments show that feasibility minimally affects LoRA-fine-tuned CLIP performance, with mostly less than 0.3% difference in top-1 accuracy across three fine-grained datasets. Also, the attribute matters on whether the feasible/infeasible images adversarially influence the classification performance. Finally, mixing feasible and infeasible images in training datasets does not significantly impact performance compared to using purely feasible or infeasible datasets.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Superposition Yields Robust Neural Scaling
Authors:
Yizhou liu,
Ziming Liu,
Jeff Gore
Abstract:
The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superpo…
▽ More
The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Internal State Estimation in Groups via Active Information Gathering
Authors:
Xuebo Ji,
Zherong Pan,
Xifeng Gao,
Lei Yang,
Xinxin Du,
Kaiyun Li,
Yongjin Liu,
Wenping Wang,
Changhe Tu,
Jia Pan
Abstract:
Accurately estimating human internal states, such as personality traits or behavioral patterns, is critical for enhancing the effectiveness of human-robot interaction, particularly in group settings. These insights are key in applications ranging from social navigation to autism diagnosis. However, prior methods are limited by scalability and passive observation, making real-time estimation in com…
▽ More
Accurately estimating human internal states, such as personality traits or behavioral patterns, is critical for enhancing the effectiveness of human-robot interaction, particularly in group settings. These insights are key in applications ranging from social navigation to autism diagnosis. However, prior methods are limited by scalability and passive observation, making real-time estimation in complex, multi-human settings difficult. In this work, we propose a practical method for active human personality estimation in groups, with a focus on applications related to Autism Spectrum Disorder (ASD). Our method combines a personality-conditioned behavior model, based on the Eysenck 3-Factor theory, with an active robot information gathering policy that triggers human behaviors through a receding-horizon planner. The robot's belief about human personality is then updated via Bayesian inference. We demonstrate the effectiveness of our approach through simulations, user studies with typical adults, and preliminary experiments involving participants with ASD. Our results show that our method can scale to tens of humans and reduce personality prediction error by 29.2% and uncertainty by 79.9% in simulation. User studies with typical adults confirm the method's ability to generalize across complex personality distributions. Additionally, we explore its application in autism-related scenarios, demonstrating that the method can identify the difference between neurotypical and autistic behavior, highlighting its potential for diagnosing ASD. The results suggest that our framework could serve as a foundation for future ASD-specific interventions.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Rethinking Repetition Problems of LLMs in Code Generation
Authors:
Yihong Dong,
Yuchen Liu,
Xue Jiang,
Zhi Jin,
Ge Li
Abstract:
With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetiti…
▽ More
With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset CodeRepetEval to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on CodeRepetEval dataset as well as HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
An Integrated UVM-TLM Co-Simulation Framework for RISC-V Functional Verification and Performance Evaluation
Authors:
Ruizhi Qiu,
Yang Liu
Abstract:
The burgeoning RISC-V ecosystem necessitates efficient verification methodologies for complex processors. Traditional approaches often struggle to concurrently evaluate functional correctness and performance, or balance simulation speed with modeling accuracy. This paper introduces an integrated co-simulation framework leveraging Universal Verification Methodology (UVM) and Transaction-Level Model…
▽ More
The burgeoning RISC-V ecosystem necessitates efficient verification methodologies for complex processors. Traditional approaches often struggle to concurrently evaluate functional correctness and performance, or balance simulation speed with modeling accuracy. This paper introduces an integrated co-simulation framework leveraging Universal Verification Methodology (UVM) and Transaction-Level Modeling (TLM) for RISC-V processor validation. We present a configurable UVM-TLM model (vmodel) of a superscalar, out-of-order RISC-V core, featuring key microarchitectural modeling techniques such as credit-based pipeline flow control. This environment facilitates unified functional verification via co-simulation against the Spike ISA simulator and enables early-stage performance assessment using benchmarks like CoreMark, orchestrated within UVM. The methodology prioritizes integration, simulation efficiency, and acceptable fidelity for architectural exploration over cycle-level precision. Experimental results validate functional correctness and significant simulation speedup over RTL approaches, accelerating design iterations and enhancing verification coverage.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Authors:
Bin-Bin Gao,
Yue Zhu,
Jiangtao Yan,
Yuezhi Cai,
Weixi Zhang,
Meng Wang,
Jun Liu,
Yong Liu,
Lei Wang,
Chengjie Wang
Abstract:
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token…
▽ More
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and possesses a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Trustless Autonomy: Understanding Motivations, Benefits and Governance Dilemma in Self-Sovereign Decentralized AI Agents
Authors:
Botao Amber Hu,
Yuhan Liu,
Helena Rong
Abstract:
The recent trend of self-sovereign Decentralized AI Agents (DeAgents) combines Large Language Model (LLM)-based AI agents with decentralization technologies such as blockchain smart contracts and trusted execution environments (TEEs). These tamper-resistant trustless substrates allow agents to achieve self-sovereignty through ownership of cryptowallet private keys and control of digital assets and…
▽ More
The recent trend of self-sovereign Decentralized AI Agents (DeAgents) combines Large Language Model (LLM)-based AI agents with decentralization technologies such as blockchain smart contracts and trusted execution environments (TEEs). These tamper-resistant trustless substrates allow agents to achieve self-sovereignty through ownership of cryptowallet private keys and control of digital assets and social media accounts. DeAgent eliminates centralized control and reduces human intervention, addressing key trust concerns inherent in centralized AI systems. However, given ongoing challenges in LLM reliability such as hallucinations, this creates paradoxical tension between trustlessness and unreliable autonomy. This study addresses this empirical research gap through interviews with DeAgents stakeholders-experts, founders, and developers-to examine their motivations, benefits, and governance dilemmas. The findings will guide future DeAgents system and protocol design and inform discussions about governance in sociotechnical AI systems in the future agentic web.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Enabling Group Fairness in Graph Unlearning via Bi-level Debiasing
Authors:
Yezi Liu,
Prathyush Poduval,
Wenjun Huang,
Yang Ni,
Hanning Chen,
Mohsen Imani
Abstract:
Graph unlearning is a crucial approach for protecting user privacy by erasing the influence of user data on trained graph models. Recent developments in graph unlearning methods have primarily focused on maintaining model prediction performance while removing user information. However, we have observed that when user information is deleted from the model, the prediction distribution across differe…
▽ More
Graph unlearning is a crucial approach for protecting user privacy by erasing the influence of user data on trained graph models. Recent developments in graph unlearning methods have primarily focused on maintaining model prediction performance while removing user information. However, we have observed that when user information is deleted from the model, the prediction distribution across different sensitive groups often changes. Furthermore, graph models are shown to be prone to amplifying biases, making the study of fairness in graph unlearning particularly important. This raises the question: Does graph unlearning actually introduce bias? Our findings indicate that the predictions of post-unlearning models become highly correlated with sensitive attributes, confirming the introduction of bias in the graph unlearning process. To address this issue, we propose a fair graph unlearning method, FGU. To guarantee privacy, FGU trains shard models on partitioned subgraphs, unlearns the requested data from the corresponding subgraphs, and retrains the shard models on the modified subgraphs. To ensure fairness, FGU employs a bi-level debiasing process: it first enables shard-level fairness by incorporating a fairness regularizer in the shard model retraining, and then achieves global-level fairness by aligning all shard models to minimize global disparity. Our experiments demonstrate that FGU achieves superior fairness while maintaining privacy and accuracy. Additionally, FGU is robust to diverse unlearning requests, ensuring fairness and utility performance across various data distributions.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
MIGRATION-BENCH: Repository-Level Code Migration Benchmark from Java 8
Authors:
Linbo Liu,
Xinle Liu,
Qiang Zhou,
Lin Chen,
Yihan Liu,
Hoan Nguyen,
Behrooz Omidvar-Tehrani,
Xi Shen,
Jun Huan,
Omer Tripp,
Anoop Deoras
Abstract:
With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, while they primarily focus on problem-solving and issue-resolution tasks. In contras…
▽ More
With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark datasets have been developed to evaluate the coding capabilities of these models, while they primarily focus on problem-solving and issue-resolution tasks. In contrast, we introduce a new coding benchmark MIGRATION-BENCH with a distinct focus: code migration. MIGRATION-BENCH aims to serve as a comprehensive benchmark for migration from Java 8 to the latest long-term support (LTS) versions (Java 17, 21), MIGRATION-BENCH includes a full dataset and its subset selected with $5,102$ and $300$ repositories respectively. Selected is a representative subset curated for complexity and difficulty, offering a versatile resource to support research in the field of code migration. Additionally, we provide a comprehensive evaluation framework to facilitate rigorous and standardized assessment of LLMs on this challenging task. We further propose SD-Feedback and demonstrate that LLMs can effectively tackle repository-level code migration to Java 17. For the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves 62.33% and 27.00% success rate (pass@1) for minimal and maximal migration respectively. The benchmark dataset and source code are available at: https://huggingface.co/collections/AmazonScience and https://github.com/amazon-science/self_debug respectively.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Learning Long-Context Diffusion Policies via Past-Token Prediction
Authors:
Marcel Torne,
Andy Tang,
Yuejiang Liu,
Chelsea Finn
Abstract:
Reasoning over long sequences of observations and actions is essential for many robotic tasks. Yet, learning effective long-context policies from demonstrations remains challenging. As context length increases, training becomes increasingly expensive due to rising memory demands, and policy performance often degrades as a result of spurious correlations. Recent methods typically sidestep these iss…
▽ More
Reasoning over long sequences of observations and actions is essential for many robotic tasks. Yet, learning effective long-context policies from demonstrations remains challenging. As context length increases, training becomes increasingly expensive due to rising memory demands, and policy performance often degrades as a result of spurious correlations. Recent methods typically sidestep these issues by truncating context length, discarding historical information that may be critical for subsequent decisions. In this paper, we propose an alternative approach that explicitly regularizes the retention of past information. We first revisit the copycat problem in imitation learning and identify an opposite challenge in recent diffusion policies: rather than over-relying on prior actions, they often fail to capture essential dependencies between past and future actions. To address this, we introduce Past-Token Prediction (PTP), an auxiliary task in which the policy learns to predict past action tokens alongside future ones. This regularization significantly improves temporal modeling in the policy head, with minimal reliance on visual representations. Building on this observation, we further introduce a multistage training strategy: pre-train the visual encoder with short contexts, and fine-tune the policy head using cached long-context embeddings. This strategy preserves the benefits of PTP while greatly reducing memory and computational overhead. Finally, we extend PTP into a self-verification mechanism at test time, enabling the policy to score and select candidates consistent with past actions during inference. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3x and accelerates policy training by more than 10x.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
FreeDriveRF: Monocular RGB Dynamic NeRF without Poses for Autonomous Driving via Point-Level Dynamic-Static Decoupling
Authors:
Yue Wen,
Liang Song,
Yijia Liu,
Siting Zhu,
Yanzi Miao,
Lijun Han,
Hesheng Wang
Abstract:
Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate poses inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriv…
▽ More
Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate poses inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriveRF, which reconstructs dynamic driving scenes using only sequential RGB images without requiring poses inputs. We innovatively decouple dynamic and static parts at the early sampling level using semantic supervision, mitigating image blurring and artifacts. To overcome the challenges posed by object motion and occlusion in monocular camera, we introduce a warped ray-guided dynamic object rendering consistency loss, utilizing optical flow to better constrain the dynamic modeling process. Additionally, we incorporate estimated dynamic flow to constrain the pose optimization process, improving the stability and accuracy of unbounded scene reconstruction. Extensive experiments conducted on the KITTI and Waymo datasets demonstrate the superior performance of our method in dynamic scene modeling for autonomous driving.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Qwen3 Technical Report
Authors:
An Yang,
Anfeng Li,
Baosong Yang,
Beichen Zhang,
Binyuan Hui,
Bo Zheng,
Bowen Yu,
Chang Gao,
Chengen Huang,
Chenxu Lv,
Chujie Zheng,
Dayiheng Liu,
Fan Zhou,
Fei Huang,
Feng Hu,
Hao Ge,
Haoran Wei,
Huan Lin,
Jialong Tang,
Jian Yang,
Jianhong Tu,
Jianwei Zhang,
Jianxin Yang,
Jiaxi Yang,
Jing Zhou
, et al. (35 additional authors not shown)
Abstract:
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration…
▽ More
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Authors:
Chenggang Zhao,
Chengqi Deng,
Chong Ruan,
Damai Dai,
Huazuo Gao,
Jiashi Li,
Liyue Zhang,
Panpan Huang,
Shangyan Zhou,
Shirong Ma,
Wenfeng Liang,
Ying He,
Yuqing Wang,
Yuxuan Liu,
Y. X. Wei
Abstract:
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inferen…
▽ More
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Generating Full-field Evolution of Physical Dynamics from Irregular Sparse Observations
Authors:
Panqi Chen,
Yifan Sun,
Lei Cheng,
Yang Yang,
Weichang Li,
Yang Liu,
Weiqing Liu,
Jiang Bian,
Shikai Fang
Abstract:
Modeling and reconstructing multidimensional physical dynamics from sparse and off-grid observations presents a fundamental challenge in scientific research. Recently, diffusion-based generative modeling shows promising potential for physical simulation. However, current approaches typically operate on on-grid data with preset spatiotemporal resolution, but struggle with the sparsely observed and…
▽ More
Modeling and reconstructing multidimensional physical dynamics from sparse and off-grid observations presents a fundamental challenge in scientific research. Recently, diffusion-based generative modeling shows promising potential for physical simulation. However, current approaches typically operate on on-grid data with preset spatiotemporal resolution, but struggle with the sparsely observed and continuous nature of real-world physical dynamics. To fill the gaps, we present SDIFT, Sequential DIffusion in Functional Tucker space, a novel framework that generates full-field evolution of physical dynamics from irregular sparse observations. SDIFT leverages the functional Tucker model as the latent space representer with proven universal approximation property, and represents observations as latent functions and Tucker core sequences. We then construct a sequential diffusion model with temporally augmented UNet in the functional Tucker space, denoising noise drawn from a Gaussian process to generate the sequence of core tensors.
At the posterior sampling stage, we propose a Message-Passing Posterior Sampling mechanism, enabling conditional generation of the entire sequence guided by observations at limited time steps. We validate SDIFT on three physical systems spanning astronomical (supernova explosions, light-year scale), environmental (ocean sound speed fields, kilometer scale), and molecular (organic liquid, millimeter scale) domains, demonstrating significant improvements in both reconstruction accuracy and computational efficiency compared to state-of-the-art approaches.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Argus: Federated Non-convex Bilevel Learning over 6G Space-Air-Ground Integrated Network
Authors:
Ya Liu,
Kai Yang,
Yu Zhu,
Keying Yang,
Haibo Zhao
Abstract:
The space-air-ground integrated network (SAGIN) has recently emerged as a core element in the 6G networks. However, traditional centralized and synchronous optimization algorithms are unsuitable for SAGIN due to infrastructureless and time-varying environments. This paper aims to develop a novel Asynchronous algorithm a.k.a. Argus for tackling non-convex and non-smooth decentralized federated bile…
▽ More
The space-air-ground integrated network (SAGIN) has recently emerged as a core element in the 6G networks. However, traditional centralized and synchronous optimization algorithms are unsuitable for SAGIN due to infrastructureless and time-varying environments. This paper aims to develop a novel Asynchronous algorithm a.k.a. Argus for tackling non-convex and non-smooth decentralized federated bilevel learning over SAGIN. The proposed algorithm allows networked agents (e.g. autonomous aerial vehicles) to tackle bilevel learning problems in time-varying networks asynchronously, thereby averting stragglers from impeding the overall training speed. We provide a theoretical analysis of the iteration complexity, communication complexity, and computational complexity of Argus. Its effectiveness is further demonstrated through numerical experiments.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Adaptive Schema-aware Event Extraction with Retrieval-Augmented Generation
Authors:
Sheng Liang,
Hang Lv,
Zhihao Wen,
Yaxiong Wu,
Yongyue Zhang,
Hao Wang,
Yong Liu
Abstract:
Event extraction (EE) is a fundamental task in natural language processing (NLP) that involves identifying and extracting event information from unstructured text. Effective EE in real-world scenarios requires two key steps: selecting appropriate schemas from hundreds of candidates and executing the extraction process. Existing research exhibits two critical gaps: (1) the rigid schema fixation in…
▽ More
Event extraction (EE) is a fundamental task in natural language processing (NLP) that involves identifying and extracting event information from unstructured text. Effective EE in real-world scenarios requires two key steps: selecting appropriate schemas from hundreds of candidates and executing the extraction process. Existing research exhibits two critical gaps: (1) the rigid schema fixation in existing pipeline systems, and (2) the absence of benchmarks for evaluating joint schema matching and extraction. Although large language models (LLMs) offer potential solutions, their schema hallucination tendencies and context window limitations pose challenges for practical deployment. In response, we propose Adaptive Schema-aware Event Extraction (ASEE), a novel paradigm combining schema paraphrasing with schema retrieval-augmented generation. ASEE adeptly retrieves paraphrased schemas and accurately generates targeted structures. To facilitate rigorous evaluation, we construct the Multi-Dimensional Schema-aware Event Extraction (MD-SEE) benchmark, which systematically consolidates 12 datasets across diverse domains, complexity levels, and language settings. Extensive evaluations on MD-SEE show that our proposed ASEE demonstrates strong adaptability across various scenarios, significantly improving the accuracy of event extraction.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain
Authors:
Wenzhen Yue,
Yong Liu,
Haoxuan Li,
Hao Wang,
Xianghua Ying,
Ruohao Guo,
Bowei Xing,
Ji Shi
Abstract:
This paper presents $\mathbf{OLinear}$, a $\mathbf{linear}$-based multivariate time series forecasting model that operates in an $\mathbf{o}$rthogonally transformed domain. Recent forecasting models typically adopt the temporal forecast (TF) paradigm, which directly encode and decode time series in the time domain. However, the entangled step-wise dependencies in series data can hinder the perform…
▽ More
This paper presents $\mathbf{OLinear}$, a $\mathbf{linear}$-based multivariate time series forecasting model that operates in an $\mathbf{o}$rthogonally transformed domain. Recent forecasting models typically adopt the temporal forecast (TF) paradigm, which directly encode and decode time series in the time domain. However, the entangled step-wise dependencies in series data can hinder the performance of TF. To address this, some forecasters conduct encoding and decoding in the transformed domain using fixed, dataset-independent bases (e.g., sine and cosine signals in the Fourier transform). In contrast, we utilize $\mathbf{OrthoTrans}$, a data-adaptive transformation based on an orthogonal matrix that diagonalizes the series' temporal Pearson correlation matrix. This approach enables more effective encoding and decoding in the decorrelated feature domain and can serve as a plug-in module to enhance existing forecasters. To enhance the representation learning for multivariate time series, we introduce a customized linear layer, $\mathbf{NormLin}$, which employs a normalized weight matrix to capture multivariate dependencies. Empirically, the NormLin module shows a surprising performance advantage over multi-head self-attention, while requiring nearly half the FLOPs. Extensive experiments on 24 benchmarks and 140 forecasting tasks demonstrate that OLinear consistently achieves state-of-the-art performance with high efficiency. Notably, as a plug-in replacement for self-attention, the NormLin module consistently enhances Transformer-based forecasters. The code and datasets are available at https://anonymous.4open.science/r/OLinear
△ Less
Submitted 14 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News
Authors:
Yuhan Liu,
Yuxuan Liu,
Xiaoqing Zhang,
Xiuying Chen,
Rui Yan
Abstract:
In today's digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to…
▽ More
In today's digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to leverage LLMs' reasoning abilities fully. Inspired by the saying that "truth becomes clearer through debate," our study introduces a novel multi-agent system with LLMs named TruEDebate (TED) to enhance the interpretability and effectiveness of fake news detection. TED employs a rigorous debate process inspired by formal debate settings. Central to our approach are two innovative components: the DebateFlow Agents and the InsightFlow Agents. The DebateFlow Agents organize agents into two teams, where one supports and the other challenges the truth of the news. These agents engage in opening statements, cross-examination, rebuttal, and closing statements, simulating a rigorous debate process akin to human discourse analysis, allowing for a thorough evaluation of news content. Concurrently, the InsightFlow Agents consist of two specialized sub-agents: the Synthesis Agent and the Analysis Agent. The Synthesis Agent summarizes the debates and provides an overarching viewpoint, ensuring a coherent and comprehensive evaluation. The Analysis Agent, which includes a role-aware encoder and a debate graph, integrates role embeddings and models the interactions between debate roles and arguments using an attention mechanism, providing the final judgment.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Dual-UAV-Enabled Secure Communication and Sensing for A2G-ISAC Systems with Maneuverable Jamming
Authors:
Libiao Lou,
Yuan Liu,
Fotis Foukalas,
Hongjiang Lei,
Gaofeng Pan,
Theodoros A. Tsiftsis,
Hongwu Liu
Abstract:
In this paper, we propose a dual-unmanned aerial vehicle (UAV)-enabled secure communication and sensing (SCS) scheme for an air-to-ground integrated sensing and communication (ISAC) system, in which a dual-functional source UAV and jamming UAV collaborate to enhance both the secure communication and target sensing performance. From a perspective of hybrid monostatitc-bistatic radar, the jamming UA…
▽ More
In this paper, we propose a dual-unmanned aerial vehicle (UAV)-enabled secure communication and sensing (SCS) scheme for an air-to-ground integrated sensing and communication (ISAC) system, in which a dual-functional source UAV and jamming UAV collaborate to enhance both the secure communication and target sensing performance. From a perspective of hybrid monostatitc-bistatic radar, the jamming UAV maneuvers to aid the source UAV to detect multiple ground targets by emitting artificial noise, meanwhile interfering with the ground eavesdropper. Residual interference is considered to reflect the effects of imperfect successive interference cancellation (SIC) on the receive signal-plus-interference-to-noise ratios, which results in a degraded system performance. To maximize the average secrecy rate (ASR), the dual-UAV trajectory and dual-UAV beamforming are jointly optimized subject to the transmit power budget, UAV maneuvering constraint, and sensing requirements. To tackle the highly complicated non-convex ASR maximization problem, the dual-UAV trajectory and dual-UAV beamforming are optimized for the secure communication (SC) purpose and the SCS purpose, sequentially. In the SC phase, a block coordinate descent algorithm is proposed to optimize the dual-UAV trajectory and dual-UAV beamforming iteratively, using the trust-region successive convex approximation (SCA) and semidefinite relaxation (SDR) techniques. Then, a weighted distance minimization problem is formulated to determine the dual-UAV maneuvering positions suitable for the SCS purpose, which is solved by a heuristic greedy algorithm, followed by the joint optimization of source beamforming and jamming beamforming.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care
Authors:
Zhi Da Soh,
Yang Bai,
Kai Yu,
Yang Zhou,
Xiaofeng Lei,
Sahil Thakur,
Zann Lee,
Lee Ching Linette Phang,
Qingsheng Peng,
Can Can Xue,
Rachel Shujuan Chong,
Quan V. Hoang,
Lavanya Raghavan,
Yih Chung Tham,
Charumathi Sabanayagam,
Wei-Chi Wu,
Ming-Chih Ho,
Jiangnan He,
Preeti Gupta,
Ecosse Lamoureux,
Seang Mei Saw,
Vinay Nangia,
Songhomitra Panda-Jonas,
Jie Xu,
Ya Xing Wang
, et al. (6 additional authors not shown)
Abstract:
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati…
▽ More
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved $\ge$ 82.2% accuracy in disease detection, $\ge$ 89% in severity differentiation, $\ge$ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
An Identifiable Cost-Aware Causal Decision-Making Framework Using Counterfactual Reasoning
Authors:
Ruichu Cai,
Xi Chen,
Jie Qiao,
Zijian Li,
Yuequn Liu,
Wei Chen,
Keli Zhang,
Jiale Zheng
Abstract:
Decision making under abnormal conditions is a critical process that involves evaluating the current state and determining the optimal action to restore the system to a normal state at an acceptable cost. However, in such scenarios, existing decision-making frameworks highly rely on reinforcement learning or root cause analysis, resulting in them frequently neglecting the cost of the actions or fa…
▽ More
Decision making under abnormal conditions is a critical process that involves evaluating the current state and determining the optimal action to restore the system to a normal state at an acceptable cost. However, in such scenarios, existing decision-making frameworks highly rely on reinforcement learning or root cause analysis, resulting in them frequently neglecting the cost of the actions or failing to incorporate causal mechanisms adequately. By relaxing the existing causal decision framework to solve the necessary cause, we propose a minimum-cost causal decision (MiCCD) framework via counterfactual reasoning to address the above challenges. Emphasis is placed on making counterfactual reasoning processes identifiable in the presence of a large amount of mixed anomaly data, as well as finding the optimal intervention state in a continuous decision space. Specifically, it formulates a surrogate model based on causal graphs, using abnormal pattern clustering labels as supervisory signals. This enables the approximation of the structural causal model among the variables and lays a foundation for identifiable counterfactual reasoning. With the causal structure approximated, we then established an optimization model based on counterfactual estimation. The Sequential Least Squares Programming (SLSQP) algorithm is further employed to optimize intervention strategies while taking costs into account. Experimental evaluations on both synthetic and real-world datasets reveal that MiCCD outperforms conventional methods across multiple metrics, including F1-score, cost efficiency, and ranking quality(nDCG@k values), thus validating its efficacy and broad applicability.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands
Authors:
Junda Huang,
Jianshu Zhou,
Honghao Guo,
Yunhui Liu
Abstract:
As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, a novel visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods. HandCept addresses t…
▽ More
As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, a novel visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods. HandCept addresses the difficulty of achieving accurate and robust joint angle estimation in dynamic environments where both visual and inertial measurements are prone to noise and drift. It leverages a zero-shot learning approach using a wrist-mounted RGB-D camera and 9-axis IMUs, fused in real time via a latency-free Extended Kalman Filter (EKF). Our results show that HandCept achieves joint angle estimation errors between $2^{\circ}$ and $4^{\circ}$ without observable drift, outperforming visual-only and inertial-only methods. Furthermore, we validate the stability and uniformity of the IMU system, demonstrating that a common base frame across IMUs simplifies system calibration. To support sim-to-real transfer, we also open-sourced our high-fidelity rendering pipeline, which is essential for training without real-world ground truth. This work offers a robust, generalizable solution for proprioception in dexterous hands, with significant implications for robotic manipulation and human-robot interaction.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation
Authors:
Yutong Liu,
Feng Xiao,
Ziyue Zhang,
Yongbin Yu,
Cheng Huang,
Fan Gao,
Xiangxiang Wang,
Ma-bao Ban,
Manping Fan,
Thupten Tsering,
Cheng Huang,
Gadeng Luosang,
Renzeng Duojie,
Nyima Tashi
Abstract:
Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabele…
▽ More
Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabeled text to generate multi-level corruptions, and introduce TiSpell, a semi-masked model capable of correcting both character- and syllable-level errors. Although syllable-level correction is more challenging due to its reliance on global context, our semi-masked strategy simplifies this process. We synthesize nine types of corruptions on clean sentences to create a robust training set. Experiments on both simulated and real-world data demonstrate that TiSpell, trained on our dataset, outperforms baseline models and matches the performance of state-of-the-art approaches, confirming its effectiveness.
△ Less
Submitted 14 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Self-cross Feature based Spiking Neural Networks for Efficient Few-shot Learning
Authors:
Qi Xu,
Junyang Zhu,
Dongdong Zhou,
Hao Chen,
Yang Liu,
Jiangrong Shen,
Qiang Zhang
Abstract:
Deep neural networks (DNNs) excel in computer vision tasks, especially, few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive with scalability issues in real world. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynam…
▽ More
Deep neural networks (DNNs) excel in computer vision tasks, especially, few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive with scalability issues in real world. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representation and reduce power consumption. We apply the combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance the discriminative power. Experimental results show that the proposed FSL-SNN significantly improves the classification performance on the neuromorphic dataset N-Omniglot, and also achieves competitive performance to ANNs on static datasets such as CUB and miniImageNet with low power consumption.
△ Less
Submitted 14 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Channel Fingerprint Construction for Massive MIMO: A Deep Conditional Generative Approach
Authors:
Zhenzhou Jin,
Li You,
Xudong Li,
Zhen Gao,
Yuanwei Liu,
Xiang-Gen Xia,
Xiqi Gao
Abstract:
Accurate channel state information (CSI) acquisition for massive multiple-input multiple-output (MIMO) systems is essential for future mobile communication networks. Channel fingerprint (CF), also referred to as channel knowledge map, is a key enabler for intelligent environment-aware communication and can facilitate CSI acquisition. However, due to the cost limitations of practical sensing nodes…
▽ More
Accurate channel state information (CSI) acquisition for massive multiple-input multiple-output (MIMO) systems is essential for future mobile communication networks. Channel fingerprint (CF), also referred to as channel knowledge map, is a key enabler for intelligent environment-aware communication and can facilitate CSI acquisition. However, due to the cost limitations of practical sensing nodes and test vehicles, the resulting CF is typically coarse-grained, making it insufficient for wireless transceiver design. In this work, we introduce the concept of CF twins and design a conditional generative diffusion model (CGDM) with strong implicit prior learning capabilities as the computational core of the CF twin to establish the connection between coarse- and fine-grained CFs. Specifically, we employ a variational inference technique to derive the evidence lower bound (ELBO) for the log-marginal distribution of the observed fine-grained CF conditioned on the coarse-grained CF, enabling the CGDM to learn the complicated distribution of the target data. During the denoising neural network optimization, the coarse-grained CF is introduced as side information to accurately guide the conditioned generation of the CGDM. To make the proposed CGDM lightweight, we further leverage the additivity of network layers and introduce a one-shot pruning approach along with a multi-objective knowledge distillation technique. Experimental results show that the proposed approach exhibits significant improvement in reconstruction performance compared to the baselines. Additionally, zero-shot testing on reconstruction tasks with different magnification factors further demonstrates the scalability and generalization ability of the proposed approach.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
Authors:
Yuyang Liu,
Liuzhenghao Lv,
Xiancheng Zhang,
Li Yuan,
Yonghong Tian
Abstract:
Biological protocols are fundamental to reproducible and safe life science research. While LLMs excel on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning. While lim…
▽ More
Biological protocols are fundamental to reproducible and safe life science research. While LLMs excel on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning. While limited benchmarks have touched upon specific aspects like protocol QA, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs on BioProBench. Experimental results reveal that while top models preform well on surface understanding tasks, struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons reveal diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, our findings underscore that procedural reasoning within biological protocols represents a significant challenge for current LLMs. BioProBench serves as a standardized framework to diagnose these specific limitations and guide the development of AI systems better equipped for safely automating complex scientific procedures. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/GreatCaptainNemo/BioProBench.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Enhancing Trust Management System for Connected Autonomous Vehicles Using Machine Learning Methods: A Survey
Authors:
Qian Xu,
Lei Zhang,
Yixiao Liu
Abstract:
Connected Autonomous Vehicles (CAVs) operate in dynamic, open, and multi-domain networks, rendering them vulnerable to various threats. Trust Management Systems (TMS) systematically organize essential steps in the trust mechanism, identifying malicious nodes against internal threats and external threats, as well as ensuring reliable decision-making for more cooperative tasks. Recent advances in ma…
▽ More
Connected Autonomous Vehicles (CAVs) operate in dynamic, open, and multi-domain networks, rendering them vulnerable to various threats. Trust Management Systems (TMS) systematically organize essential steps in the trust mechanism, identifying malicious nodes against internal threats and external threats, as well as ensuring reliable decision-making for more cooperative tasks. Recent advances in machine learning (ML) offer significant potential to enhance TMS, especially for the strict requirements of CAVs, such as CAV nodes moving at varying speeds, and opportunistic and intermittent network behavior. Those features distinguish ML-based TMS from social networks, static IoT, and Social IoT. This survey proposes a novel three-layer ML-based TMS framework for CAVs in the vehicle-road-cloud integration system, i.e., trust data layer, trust calculation layer and trust incentive layer. A six-dimensional taxonomy of objectives is proposed. Furthermore, the principles of ML methods for each module in each layer are analyzed. Then, recent studies are categorized based on traffic scenarios that are against the proposed objectives. Finally, future directions are suggested, addressing the open issues and meeting the research trend. We maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/octoberzzzzz/ML-based-TMS-CAV-Survey.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
SweRank: Software Issue Localization with Code Ranking
Authors:
Revanth Gangi Reddy,
Tarun Suresh,
JaeHyeok Doo,
Ye Liu,
Xuan Phi Nguyen,
Yingbo Zhou,
Semih Yavuz,
Caiming Xiong,
Heng Ji,
Shafiq Joty
Abstract:
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step rea…
▽ More
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and relying on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
ai.txt: A Domain-Specific Language for Guiding AI Interactions with the Internet
Authors:
Yuekang Li,
Wei Song,
Bangshuo Zhu,
Dong Gong,
Yi Liu,
Gelei Deng,
Chunyang Chen,
Lei Ma,
Jun Sun,
Toby Walsh,
Jingling Xue
Abstract:
We introduce ai.txt, a novel domain-specific language (DSL) designed to explicitly regulate interactions between AI models, agents, and web content, addressing critical limitations of the widely adopted robots.txt standard. As AI increasingly engages with online materials for tasks such as training, summarization, and content modification, existing regulatory methods lack the necessary granularity…
▽ More
We introduce ai.txt, a novel domain-specific language (DSL) designed to explicitly regulate interactions between AI models, agents, and web content, addressing critical limitations of the widely adopted robots.txt standard. As AI increasingly engages with online materials for tasks such as training, summarization, and content modification, existing regulatory methods lack the necessary granularity and semantic expressiveness to ensure ethical and legal compliance. ai.txt extends traditional URL-based access controls by enabling precise element-level regulations and incorporating natural language instructions interpretable by AI systems. To facilitate practical deployment, we provide an integrated development environment with code autocompletion and automatic XML generation. Furthermore, we propose two compliance mechanisms: XML-based programmatic enforcement and natural language prompt integration, and demonstrate their effectiveness through preliminary experiments and case studies. Our approach aims to aid the governance of AI-Internet interactions, promoting responsible AI use in digital ecosystems.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
Relative Overfitting and Accept-Reject Framework
Authors:
Yanxin Liu,
Yunqi Zhang
Abstract:
Currently, the scaling law of Large Language Models (LLMs) faces challenges and bottlenecks. This paper posits that noise effects, stemming from changes in the signal-to-noise ratio under diminishing marginal returns, are the root cause of these issues. To control this noise, we investigated the differences between models with performance advantages and disadvantages, introducing the concept of "r…
▽ More
Currently, the scaling law of Large Language Models (LLMs) faces challenges and bottlenecks. This paper posits that noise effects, stemming from changes in the signal-to-noise ratio under diminishing marginal returns, are the root cause of these issues. To control this noise, we investigated the differences between models with performance advantages and disadvantages, introducing the concept of "relative overfitting." Based on their complementary strengths, we have proposed an application framework, Accept-Reject (AR). In Natural Language Processing (NLP), we use LLMs and Small Language Models (SLMs) as the medium for discussion. This framework enables SLMs to exert a universal positive influence on LLM decision outputs, rather than the intuitively expected negative influence. We validated our approach using self-built models based on mainstream architectures and pre-trained mainstream models across multiple datasets, including basic language modeling, long-context tasks, subject examination, and question-answering (QA) benchmarks. The results demonstrate that through our structure, compared to increasing the LLM's parameters, we can achieve better performance improvements with significantly lower parameter and computational costs in many scenarios. These improvements are universal, stable, and effective. Furthermore, we explore the potential of "relative overfitting" and the AR framework in other machine learning domains, such as computer vision (CV) and AI for science. We hope the proposed approach can help scale laws overcome existing bottlenecks.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
ABase: the Multi-Tenant NoSQL Serverless Database for Diverse and Dynamic Workloads in Large-scale Cloud Environments
Authors:
Rong Kang,
Yanbin Chen,
Ye Liu,
Fuxin Jiang,
Qingshuo Li,
Miao Ma,
Jian Liu,
Guangliang Zhao,
Tieying Zhang,
Jianjun Chen,
Lei Zhang
Abstract:
Multi-tenant architectures enhance the elasticity and resource utilization of NoSQL databases by allowing multiple tenants to co-locate and share resources. However, in large-scale cloud environments, the diverse and dynamic nature of workloads poses significant challenges for multi-tenant NoSQL databases. Based on our practical observations, we have identified three crucial challenges: (1) the im…
▽ More
Multi-tenant architectures enhance the elasticity and resource utilization of NoSQL databases by allowing multiple tenants to co-locate and share resources. However, in large-scale cloud environments, the diverse and dynamic nature of workloads poses significant challenges for multi-tenant NoSQL databases. Based on our practical observations, we have identified three crucial challenges: (1) the impact of caching on performance isolation, as cache hits alter request execution and resource consumption, leading to inaccurate traffic control; (2) the dynamic changes in traffic, with changes in tenant traffic trends causing throttling or resource wastage, and changes in access distribution causing hot key pressure or cache hit ratio drops; and (3) the imbalanced layout of data nodes due to tenants' diverse resource requirements, leading to low resource utilization. To address these challenges, we introduce ABase, a multi-tenant NoSQL serverless database developed at ByteDance. ABase introduces a two-layer caching mechanism with a cache-aware isolation mechanism to ensure accurate resource consumption estimates. Furthermore, ABase employs a predictive autoscaling policy to dynamically adjust resources in response to tenant traffic changes and a multi-resource rescheduling algorithm to balance resource utilization across data nodes. With these innovations, ABase has successfully served ByteDance's large-scale cloud environment, supporting a total workload that has achieved a peak QPS of over 13 billion and total storage exceeding 1 EB.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
TUM2TWIN: Introducing the Large-Scale Multimodal Urban Digital Twin Benchmark Dataset
Authors:
Olaf Wysocki,
Benedikt Schwab,
Manoj Kumar Biswanath,
Michael Greza,
Qilin Zhang,
Jingwei Zhu,
Thomas Froech,
Medhini Heeramaglore,
Ihab Hijazi,
Khaoula Kanna,
Mathias Pechinger,
Zhaiyu Chen,
Yao Sun,
Alejandro Rueda Segura,
Ziyang Xu,
Omar AbdelGafar,
Mansour Mehranfar,
Chandan Yeshwanth,
Yueh-Cheng Liu,
Hadi Yazdi,
Jiapan Wang,
Stefan Auer,
Katharina Anders,
Klaus Bogenberger,
Andre Borrmann
, et al. (9 additional authors not shown)
Abstract:
Urban Digital Twins (UDTs) have become essential for managing cities and integrating complex, heterogeneous data from diverse sources. Creating UDTs involves challenges at multiple process stages, including acquiring accurate 3D source data, reconstructing high-fidelity 3D models, maintaining models' updates, and ensuring seamless interoperability to downstream tasks. Current datasets are usually…
▽ More
Urban Digital Twins (UDTs) have become essential for managing cities and integrating complex, heterogeneous data from diverse sources. Creating UDTs involves challenges at multiple process stages, including acquiring accurate 3D source data, reconstructing high-fidelity 3D models, maintaining models' updates, and ensuring seamless interoperability to downstream tasks. Current datasets are usually limited to one part of the processing chain, hampering comprehensive UDTs validation. To address these challenges, we introduce the first comprehensive multimodal Urban Digital Twin benchmark dataset: TUM2TWIN. This dataset includes georeferenced, semantically aligned 3D models and networks along with various terrestrial, mobile, aerial, and satellite observations boasting 32 data subsets over roughly 100,000 $m^2$ and currently 767 GB of data. By ensuring georeferenced indoor-outdoor acquisition, high accuracy, and multimodal data integration, the benchmark supports robust analysis of sensors and the development of advanced reconstruction methods. Additionally, we explore downstream tasks demonstrating the potential of TUM2TWIN, including novel view synthesis of NeRF and Gaussian Splatting, solar potential analysis, point cloud semantic segmentation, and LoD3 building reconstruction. We are convinced this contribution lays a foundation for overcoming current limitations in UDT creation, fostering new research directions and practical solutions for smarter, data-driven urban environments. The project is available under: https://tum2t.win
△ Less
Submitted 13 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
Authors:
Xiaokun Wang,
Chris,
Jiangbo Pei,
Wei Shen,
Yi Peng,
Yunzhuo Hao,
Weijie Qiu,
Ai Jian,
Tianyidan Xie,
Xuchen Song,
Yang Liu,
Yahui Zhou
Abstract:
We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM rea…
▽ More
We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning using pairwise ranking loss on pairwise preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on multimodal VL-RewardBench and exhibits competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed based on our Skywork-VL Reward proves highly effective for training Mixed Preference Optimization (MPO), leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng
, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…
▽ More
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation
Authors:
Peng Li,
Suizhi Ma,
Jialiang Chen,
Yuan Liu,
Chongyi Zhang,
Wei Xue,
Wenhan Luo,
Alla Sheffer,
Wenping Wang,
Yike Guo
Abstract:
Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each component of the generated 3D model. Any modifications of the input image lead to an entire regeneration of the 3D models. In this paper, we introduce a new method ca…
▽ More
Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each component of the generated 3D model. Any modifications of the input image lead to an entire regeneration of the 3D models. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI
Authors:
Chao Ding,
Mouxiao Bian,
Pengcheng Chen,
Hongliang Zhang,
Tianbin Li,
Lihao Liu,
Jiayuan Chen,
Zhuoran Li,
Yabei Zhong,
Yongqi Liu,
Haiqing Huang,
Dongming Shan,
Junjun He,
Jie Xu
Abstract:
Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque 'black-box' reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clini…
▽ More
Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque 'black-box' reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clinical relevance essential for advancing their specialized medical capabilities. To address these critical gaps, we introduce a highly clinically relevant dataset with 31,247 medical question-answer pairs, each accompanied by expert-validated chain-of-thought (CoT) explanations. This resource, spanning multiple clinical domains, was curated via a scalable human-LLM hybrid pipeline: LLM-generated rationales were iteratively reviewed, scored, and refined by medical experts against a structured rubric, with substandard outputs revised through human effort or guided LLM regeneration until expert consensus. This publicly available dataset provides a vital source for the development of medical LLMs that capable of transparent and verifiable reasoning, thereby advancing safer and more interpretable AI in medicine.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Pinching-Antenna Assisted Simultaneous Wireless Information and Power Transfer
Authors:
Yixuan Li,
Ji Wang,
Yuanwei Liu,
Zhiguo Ding
Abstract:
This letter introduces a novel pinching-antenna-system (PASS) assisted simultaneous wireless information and power transfer (SWIPT), where multiple pinching antennas (PAs) are strategically activiated on a waveguide to facilitate information transmission to multiple information receivers (IRs) and power transfer to multiple energy receivers (ERs) simultaneously. Leveraging the single-waveguide arc…
▽ More
This letter introduces a novel pinching-antenna-system (PASS) assisted simultaneous wireless information and power transfer (SWIPT), where multiple pinching antennas (PAs) are strategically activiated on a waveguide to facilitate information transmission to multiple information receivers (IRs) and power transfer to multiple energy receivers (ERs) simultaneously. Leveraging the single-waveguide architecture, non-orthogonal multiple access is employed to enable the superposed transmission of information signals, eliminating the need for dedicated energy carriers by concurrently serving both the IRs and the ERs. In this letter, an optimization problem is first formulated with an objective is to maximize the sum power received at the ERs via joint optimizing the IRs power allocation and the PAs positioning. Given the challenging non-convex nature of the formulated optimization problem, it is decoupled into two sub-problems which are solved alternatively. Specifically, the power allocation subproblem is recast a convex optimization problem, and hence can be solved efficiently. Furthermore, we propose a high-precision element-wise algorithm and a low-complexity linearly decreasing weight particle swarm optimization algorithm to solve the position optimization sub-problem. The numerical results demonstrate that PAs assisted SWIPT can achieve a remarkable performance gain compared to conventional system.
△ Less
Submitted 26 April, 2025;
originally announced May 2025.
-
Multi-Modal Molecular Representation Learning via Structure Awareness
Authors:
Rong Yin,
Ruyue Liu,
Xiaoshuai Hao,
Xingrui Zhou,
Yong Liu,
Can Ma,
Weiping Wang
Abstract:
Accurate extraction of molecular representations is a critical step in the drug discovery process. In recent years, significant progress has been made in molecular representation learning methods, among which multi-modal molecular representation methods based on images, and 2D/3D topologies have become increasingly mainstream. However, existing these multi-modal approaches often directly fuse info…
▽ More
Accurate extraction of molecular representations is a critical step in the drug discovery process. In recent years, significant progress has been made in molecular representation learning methods, among which multi-modal molecular representation methods based on images, and 2D/3D topologies have become increasingly mainstream. However, existing these multi-modal approaches often directly fuse information from different modalities, overlooking the potential of intermodal interactions and failing to adequately capture the complex higher-order relationships and invariant features between molecules. To overcome these challenges, we propose a structure-awareness-based multi-modal self-supervised molecular representation pre-training framework (MMSA) designed to enhance molecular graph representations by leveraging invariant knowledge between molecules. The framework consists of two main modules: the multi-modal molecular representation learning module and the structure-awareness module. The multi-modal molecular representation learning module collaboratively processes information from different modalities of the same molecule to overcome intermodal differences and generate a unified molecular embedding. Subsequently, the structure-awareness module enhances the molecular representation by constructing a hypergraph structure to model higher-order correlations between molecules. This module also introduces a memory mechanism for storing typical molecular representations, aligning them with memory anchors in the memory bank to integrate invariant knowledge, thereby improving the model generalization ability. Extensive experiments have demonstrated the effectiveness of MMSA, which achieves state-of-the-art performance on the MoleculeNet benchmark, with average ROC-AUC improvements ranging from 1.8% to 9.6% over baseline methods.
△ Less
Submitted 11 May, 2025; v1 submitted 9 May, 2025;
originally announced May 2025.
-
QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
Authors:
Yueh-Cheng Liu,
Lukas Höllein,
Matthias Nießner,
Angela Dai
Abstract:
Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-dri…
▽ More
Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the needs of heuristics for densification. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by up to 48% in comparison to state of the art methods.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
StereoINR: Cross-View Geometry Consistent Stereo Super Resolution with Implicit Neural Representation
Authors:
Yi Liu,
Xinyi Liu,
Panwang Xia,
Qiong Wu,
Yi Wan,
Yongjun Zhang
Abstract:
Stereo image super-resolution (SSR) aims to enhance high-resolution details by leveraging information from stereo image pairs. However, existing stereo super-resolution (SSR) upsampling methods (e.g., pixel shuffle) often overlook cross-view geometric consistency and are limited to fixed-scale upsampling. The key issue is that previous upsampling methods use convolution to independently process de…
▽ More
Stereo image super-resolution (SSR) aims to enhance high-resolution details by leveraging information from stereo image pairs. However, existing stereo super-resolution (SSR) upsampling methods (e.g., pixel shuffle) often overlook cross-view geometric consistency and are limited to fixed-scale upsampling. The key issue is that previous upsampling methods use convolution to independently process deep features of different views, lacking cross-view and non-local information perception, making it difficult to select beneficial information from multi-view scenes adaptively. In this work, we propose Stereo Implicit Neural Representation (StereoINR), which innovatively models stereo image pairs as continuous implicit representations. This continuous representation breaks through the scale limitations, providing a unified solution for arbitrary-scale stereo super-resolution reconstruction of left-right views. Furthermore, by incorporating spatial warping and cross-attention mechanisms, StereoINR enables effective cross-view information fusion and achieves significant improvements in pixel-level geometric consistency. Extensive experiments across multiple datasets show that StereoINR outperforms out-of-training-distribution scale upsampling and matches state-of-the-art SSR methods within training-distribution scales.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation
Authors:
Yiming Qin,
Zhu Xu,
Yang Liu
Abstract:
Recent text-to-3D models can render high-quality assets, yet they still stumble on objects with complex attributes. The key obstacles are: (1) existing text-to-3D approaches typically lift text-to-image models to extract semantics via text encoders, while the text encoder exhibits limited comprehension ability for long descriptions, leading to deviated cross-attention focus, subsequently wrong att…
▽ More
Recent text-to-3D models can render high-quality assets, yet they still stumble on objects with complex attributes. The key obstacles are: (1) existing text-to-3D approaches typically lift text-to-image models to extract semantics via text encoders, while the text encoder exhibits limited comprehension ability for long descriptions, leading to deviated cross-attention focus, subsequently wrong attribute binding in generated results. (2) Occluded object parts demand a disciplined generation order and explicit part disentanglement. Though some works introduce manual efforts to alleviate the above issues, their quality is unstable and highly reliant on manual information. To tackle above problems, we propose a automated method Hierarchical-Chain-of-Generation (HCoG). It leverages a large language model to decompose the long description into blocks representing different object parts, and orders them from inside out according to occlusions, forming a hierarchical chain. Within each block we first coarsely create components, then precisely bind attributes via target-region localization and corresponding 3D Gaussian kernel optimization. Between blocks, we introduce Gaussian Extension and Label Elimination to seamlessly generate new parts by extending new Gaussian kernels, re-assigning semantic labels, and eliminating unnecessary kernels, ensuring that only relevant parts are added without disrupting previously optimized parts. Experiments confirm that HCoG yields structurally coherent, attribute-faithful 3D objects with complex attributes. The code is available at https://github.com/Wakals/GASCOL .
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Weighted Envy-Freeness Revisited: Indivisible Resource and House Allocations
Authors:
Yuxi Liu,
Mingyu Xiao
Abstract:
Envy-Freeness is one of the most fundamental and important concepts in fair allocation. Some recent studies have focused on the concept of weighted envy-freeness. Under this concept, each agent is assigned a weight, and their valuations are divided by their weights when assessing fairness. This concept can promote more fairness in some scenarios. But on the other hand, experimental research has sh…
▽ More
Envy-Freeness is one of the most fundamental and important concepts in fair allocation. Some recent studies have focused on the concept of weighted envy-freeness. Under this concept, each agent is assigned a weight, and their valuations are divided by their weights when assessing fairness. This concept can promote more fairness in some scenarios. But on the other hand, experimental research has shown that this weighted envy-freeness significantly reduces the likelihood of fair allocations. When we must allocate the resources, we may propose fairness concepts with lower requirements that are potentially more feasible to implement. In this paper, we revisit weighted envy-freeness and propose a new concept called SumAvg-envy-freeness, which substantially increases the existence of fair allocations. This new concept can be seen as a complement of the normal weighted envy-fairness. Furthermore, we systematically study the computational complexity of finding fair allocations under the old and new weighted fairness concepts in two types of classic problems: Indivisible Resource Allocation and House Allocation. Our study provides a comprehensive characterization of various properties of weighted envy-freeness.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents
Authors:
Kaixin Wang,
Tianlin Li,
Xiaoyu Zhang,
Chong Wang,
Weisong Sun,
Yang Liu,
Bin Shi
Abstract:
Code large language models (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks.Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities, and can flexibly process inputs and outputs in both natural and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their develo…
▽ More
Code large language models (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks.Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities, and can flexibly process inputs and outputs in both natural and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their development and deployment. However, despite their growing significance, there remains a lack of comprehensive reviews of benchmarks for CodeLLMs and agents. To bridge this gap, this paper provides a comprehensive review of existing benchmarks for CodeLLMs and agents, studying and analyzing 181 benchmarks from 461 relevant papers, covering the different phases of the software development life cycle (SDLC). Our findings reveal a notable imbalance in the coverage of current benchmarks, with approximately 60% focused on the software development phase in SDLC, while requirements engineering and software design phases receive minimal attention at only 5% and 3%, respectively. Additionally, Python emerges as the dominant programming language across the reviewed benchmarks. Finally, this paper highlights the challenges of current research and proposes future directions, aiming to narrow the gap between the theoretical capabilities of CodeLLMs and agents and their application in real-world scenarios.
△ Less
Submitted 8 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction
Authors:
Kun Peng,
Chaodong Tong,
Cong Cao,
Hao Peng,
Qian Li,
Guanlin Wu,
Lei Jiang,
Yanbing Liu,
Philip S. Yu
Abstract:
Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstr…
▽ More
Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability to capture relations can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
ICNN-enhanced 2SP: Leveraging input convex neural networks for solving two-stage stochastic programming
Authors:
Yu Liu,
Fabricio Oliveira
Abstract:
Two-stage stochastic programming (2SP) offers a basic framework for modelling decision-making under uncertainty, yet scalability remains a challenge due to the computational complexity of recourse function evaluation. Existing learning-based methods like Neural Two-Stage Stochastic Programming (Neur2SP) employ neural networks (NNs) as recourse function surrogates but rely on computationally intens…
▽ More
Two-stage stochastic programming (2SP) offers a basic framework for modelling decision-making under uncertainty, yet scalability remains a challenge due to the computational complexity of recourse function evaluation. Existing learning-based methods like Neural Two-Stage Stochastic Programming (Neur2SP) employ neural networks (NNs) as recourse function surrogates but rely on computationally intensive mixed-integer programming (MIP) formulations. We propose ICNN-enhanced 2SP, a method that leverages Input Convex Neural Networks (ICNNs) to exploit linear programming (LP) representability in convex 2SP problems. By architecturally enforcing convexity and enabling exact inference through LP, our approach eliminates the need for integer variables inherent to the conventional MIP-based formulation while retaining an exact embedding of the ICNN surrogate within the 2SP framework. This results in a more computationally efficient alternative that maintains solution quality. Comprehensive experiments reveal that ICNNs incur only marginally longer training times while achieving validation accuracy on par with their MIP-based counterparts. Across benchmark problems, ICNN-enhanced 2SP often exhibits considerably faster solution times than the MIP-based formulations while preserving solution quality, with these advantages becoming significantly more pronounced as problem scale increases. For the most challenging instances, the method achieves speedups of up to 100$\times$ and solution quality superior to MIP-based formulations.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
PADriver: Towards Personalized Autonomous Driving
Authors:
Genghua Kou,
Fan Jia,
Weixin Mao,
Yingfei Liu,
Yucheng Zhao,
Ziheng Zhang,
Osamu Yoshie,
Tiancai Wang,
Ying Li,
Xiangyu Zhang
Abstract:
In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoaggressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action…
▽ More
In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoaggressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours videos with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
A Weighted Byzantine Fault Tolerance Consensus Driven Trusted Multiple Large Language Models Network
Authors:
Haoxiang Luo,
Gang Sun,
Yinqiu Liu,
Dongcheng Zhao,
Dusit Niyato,
Hongfang Yu,
Schahram Dustdar
Abstract:
Large Language Models (LLMs) have achieved remarkable success across a wide range of applications. However, individual LLMs often produce inconsistent, biased, or hallucinated outputs due to limitations in their training corpora and model architectures. Recently, collaborative frameworks such as the Multi-LLM Network (MultiLLMN) have been introduced, enabling multiple LLMs to interact and jointly…
▽ More
Large Language Models (LLMs) have achieved remarkable success across a wide range of applications. However, individual LLMs often produce inconsistent, biased, or hallucinated outputs due to limitations in their training corpora and model architectures. Recently, collaborative frameworks such as the Multi-LLM Network (MultiLLMN) have been introduced, enabling multiple LLMs to interact and jointly respond to user queries. Nevertheless, MultiLLMN architectures raise critical concerns regarding the reliability and security of the generated content, particularly in open environments where malicious or compromised LLMs may be present. Moreover, reliance on centralized coordination undermines system efficiency and introduces single points of failure. In this paper, we propose a novel Trusted MultiLLMN framework, driven by a Weighted Byzantine Fault Tolerance (WBFT) blockchain consensus mechanism, to ensure the reliability, security, and efficiency of multi-LLM collaboration. In WBFT, voting weights are adaptively assigned to each LLM based on its response quality and trustworthiness, incentivizing reliable behavior, and reducing the impact of malicious nodes. Extensive simulations demonstrate that WBFT significantly improves both consensus security and efficiency compared to classical and modern consensus mechanisms, particularly under wireless network conditions. Furthermore, our evaluations reveal that Trusted MultiLLMN supported by WBFT can deliver higher-quality and more credible responses than both single LLMs and conventional MultiLLMNs, thereby providing a promising path toward building robust, decentralized AI collaboration networks.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
LSRP: A Leader-Subordinate Retrieval Framework for Privacy-Preserving Cloud-Device Collaboration
Authors:
Yingyi Zhang,
Pengyue Jia,
Xianneng Li,
Derong Xu,
Maolin Wang,
Yichao Wang,
Zhaocheng Du,
Huifeng Guo,
Yong Liu,
Ruiming Tang,
Xiangyu Zhao
Abstract:
Cloud-device collaboration leverages on-cloud Large Language Models (LLMs) for handling public user queries and on-device Small Language Models (SLMs) for processing private user data, collectively forming a powerful and privacy-preserving solution. However, existing approaches often fail to fully leverage the scalable problem-solving capabilities of on-cloud LLMs while underutilizing the advantag…
▽ More
Cloud-device collaboration leverages on-cloud Large Language Models (LLMs) for handling public user queries and on-device Small Language Models (SLMs) for processing private user data, collectively forming a powerful and privacy-preserving solution. However, existing approaches often fail to fully leverage the scalable problem-solving capabilities of on-cloud LLMs while underutilizing the advantage of on-device SLMs in accessing and processing personalized data. This leads to two interconnected issues: 1) Limited utilization of the problem-solving capabilities of on-cloud LLMs, which fail to align with personalized user-task needs, and 2) Inadequate integration of user data into on-device SLM responses, resulting in mismatches in contextual user information.
In this paper, we propose a Leader-Subordinate Retrieval framework for Privacy-preserving cloud-device collaboration (LSRP), a novel solution that bridges these gaps by: 1) enhancing on-cloud LLM guidance to on-device SLM through a dynamic selection of task-specific leader strategies named as user-to-user retrieval-augmented generation (U-U-RAG), and 2) integrating the data advantages of on-device SLMs through small model feedback Direct Preference Optimization (SMFB-DPO) for aligning the on-cloud LLM with the on-device SLM. Experiments on two datasets demonstrate that LSRP consistently outperforms state-of-the-art baselines, significantly improving question-answer relevance and personalization, while preserving user privacy through efficient on-device retrieval. Our code is available at: https://github.com/Zhang-Yingyi/LSRP.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
FedRE: Robust and Effective Federated Learning with Privacy Preference
Authors:
Tianzhe Xiao,
Yichen Li,
Yu Zhou,
Yining Qi,
Yi Liu,
Wei Wang,
Haozhao Wang,
Yi Wang,
Ruixuan Li
Abstract:
Despite Federated Learning (FL) employing gradient aggregation at the server for distributed training to prevent the privacy leakage of raw data, private information can still be divulged through the analysis of uploaded gradients from clients. Substantial efforts have been made to integrate local differential privacy (LDP) into the system to achieve a strict privacy guarantee. However, existing m…
▽ More
Despite Federated Learning (FL) employing gradient aggregation at the server for distributed training to prevent the privacy leakage of raw data, private information can still be divulged through the analysis of uploaded gradients from clients. Substantial efforts have been made to integrate local differential privacy (LDP) into the system to achieve a strict privacy guarantee. However, existing methods fail to take practical issues into account by merely perturbing each sample with the same mechanism while each client may have their own privacy preferences on privacy-sensitive information (PSI), which is not uniformly distributed across the raw data. In such a case, excessive privacy protection from private-insensitive information can additionally introduce unnecessary noise, which may degrade the model performance. In this work, we study the PSI within data and develop FedRE, that can simultaneously achieve robustness and effectiveness benefits with LDP protection. More specifically, we first define PSI with regard to the privacy preferences of each client. Then, we optimize the LDP by allocating less privacy budget to gradients with higher PSI in a layer-wise manner, thus providing a stricter privacy guarantee for PSI. Furthermore, to mitigate the performance degradation caused by LDP, we design a parameter aggregation mechanism based on the distribution of the perturbed information. We conducted experiments with text tamper detection on T-SROIE and DocTamper datasets, and FedRE achieves competitive performance compared to state-of-the-art methods.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.