-
CineMyoPS: Segmenting Myocardial Pathologies from Cine Cardiac MR
Authors:
Wangbin Ding,
Lei Li,
Junyi Qiu,
Bogen Lin,
Mingjing Yang,
Liqin Huang,
Lianming Wu,
Sihan Wang,
Xiahai Zhuang
Abstract:
Myocardial infarction (MI) is a leading cause of death worldwide. Late gadolinium enhancement (LGE) and T2-weighted cardiac magnetic resonance (CMR) imaging can respectively identify scarring and edema areas, both of which are essential for MI risk stratification and prognosis assessment. Although combining complementary information from multi-sequence CMR is useful, acquiring these sequences can…
▽ More
Myocardial infarction (MI) is a leading cause of death worldwide. Late gadolinium enhancement (LGE) and T2-weighted cardiac magnetic resonance (CMR) imaging can respectively identify scarring and edema areas, both of which are essential for MI risk stratification and prognosis assessment. Although combining complementary information from multi-sequence CMR is useful, acquiring these sequences can be time-consuming and prohibitive, e.g., due to the administration of contrast agents. Cine CMR is a rapid and contrast-free imaging technique that can visualize both motion and structural abnormalities of the myocardium induced by acute MI. Therefore, we present a new end-to-end deep neural network, referred to as CineMyoPS, to segment myocardial pathologies, \ie scars and edema, solely from cine CMR images. Specifically, CineMyoPS extracts both motion and anatomy features associated with MI. Given the interdependence between these features, we design a consistency loss (resembling the co-training strategy) to facilitate their joint learning. Furthermore, we propose a time-series aggregation strategy to integrate MI-related features across the cardiac cycle, thereby enhancing segmentation accuracy for myocardial pathologies. Experimental results on a multi-center dataset demonstrate that CineMyoPS achieves promising performance in myocardial pathology segmentation, motion estimation, and anatomy segmentation.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
Crystal Growth of Chalcogenides and Oxy-Chalcogenides Using Chloride Exchange Reaction
Authors:
Shantanu Singh,
Boyang Zhao,
Christopher E. Stevens,
Mythili Surendran,
Tzu-Chi Huang,
Bi-Hsuan Lin,
Joshua R. Hendrickson,
Jayakanth Ravichandran
Abstract:
Chalcogenides and oxy-chalcogenides, including complex chalcogenides and transition metal dichalcogenides, are emerging semiconductors with direct or indirect band gaps within the visible spectrum. These materials are being explored for various photonic and electronic applications, such as photodetectors, photovoltaics, and phase-change electronics. Understanding the fundamental properties of thes…
▽ More
Chalcogenides and oxy-chalcogenides, including complex chalcogenides and transition metal dichalcogenides, are emerging semiconductors with direct or indirect band gaps within the visible spectrum. These materials are being explored for various photonic and electronic applications, such as photodetectors, photovoltaics, and phase-change electronics. Understanding the fundamental properties of these materials is crucial for optimizing their functionalities. Therefore, the availability of large, high-quality single crystals of chalcogenides and oxy-chalcogenides is essential for a better comprehension of their structure and properties. In this study, we present a novel crystal growth method that utilizes the exchange reaction between BaS and ZrCl$_4$/ HfCl$_4$. By carefully controlling the stoichiometric ratio of the binary sulfide to the chloride, we can grow single crystals of several materials, such as ZrS$_2$, HfS$_2$, BaZrS$_3$, and ZrOS. This method results in large single crystals with a short reaction time of 24 to 48 hours. High-resolution thin film diffraction and single-crystal X-ray diffraction confirm the quality of the crystals produced through this exchange reaction. We also report the optical properties of these materials investigated using photoluminescence and Raman measurements. The chloride exchange reaction method paves the way for the synthesis of single crystals of chalcogenides and oxy-chalcogenide systems with a short reaction time but with low mosaicity and can be an alternative growth technique for single crystals of materials that are difficult to synthesize using conventional growth techniques.
△ Less
Submitted 19 June, 2025;
originally announced June 2025.
-
GeoCAD: Local Geometry-Controllable CAD Generation
Authors:
Zhanwei Zhang,
Kaiyuan Liu,
Junjie Liu,
Wenxiao Wang,
Binbin Lin,
Liang Xie,
Chen Shen,
Deng Cai
Abstract:
Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this…
▽ More
Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption $\sim$221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at https://github.com/Zhanwei-Z/GeoCAD.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Authors:
Liang Ma,
Jiajun Wen,
Min Lin,
Rongtao Xu,
Xiwen Liang,
Bingqian Lin,
Jun Ma,
Yongxin Wang,
Ziming Wei,
Haokun Lin,
Mingfei Han,
Meng Cao,
Bokui Chen,
Ivan Laptev,
Xiaodan Liang
Abstract:
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block…
▽ More
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that the performance of VLMs exhibits pronounced limitations in high-level planning and reasoning capabilities, leading to a notable decline in performance for the growing complexity of the tasks. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting spatial tasks heavily rely on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
MiniCPM4: Ultra-Efficient LLMs on End Devices
Authors:
MiniCPM Team,
Chaojun Xiao,
Yuxuan Li,
Xu Han,
Yuzhuo Bai,
Jie Cai,
Haotian Chen,
Wentong Chen,
Xin Cong,
Ganqu Cui,
Ning Ding,
Shengdan Fan,
Yewei Fang,
Zixuan Fu,
Wenyu Guan,
Yitong Guan,
Junshao Guo,
Yufeng Han,
Bingxiang He,
Yuxiang Huang,
Cunliang Kong,
Qiuzuo Li,
Siyuan Li,
Wenhao Li,
Yanghao Li
, et al. (50 additional authors not shown)
Abstract:
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelera…
▽ More
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and data-efficient tenary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning
Authors:
Mengsong Wu,
YaFei Wang,
Yidong Ming,
Yuqi An,
Yuwei Wan,
Wenliang Chen,
Binbin Lin,
Yuqiang Li,
Tong Xie,
Dongzhan Zhou
Abstract:
Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools created ranging from basic information retrieval to…
▽ More
Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools created ranging from basic information retrieval to complex reaction predictions, and a dataset curation pipeline to generate the dataset ChemToolBench that facilitates both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent .
△ Less
Submitted 12 June, 2025; v1 submitted 9 June, 2025;
originally announced June 2025.
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Authors:
Bin Lin,
Zongjian Li,
Xinhua Cheng,
Yuwei Niu,
Yang Ye,
Xianyi He,
Shenghai Yuan,
Wangbo Yu,
Shaodong Wang,
Yunyang Ge,
Yatian Pang,
Li Yuan
Abstract:
Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipul…
▽ More
Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.
△ Less
Submitted 18 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
Adjusting Tissue Puncture Omnidirectionally In Situ with Pneumatic Rotatable Biopsy Mechanism and Hierarchical Airflow Management in Tortuous Luminal Pathways
Authors:
Botao Lin,
Tinghua Zhang,
Sishen Yuan,
Tiantian Wang,
Jiaole Wang,
Wu Yuan,
Hongliang Ren
Abstract:
In situ tissue biopsy with an endoluminal catheter is an efficient approach for disease diagnosis, featuring low invasiveness and few complications. However, the endoluminal catheter struggles to adjust the biopsy direction by distal endoscope bending or proximal twisting for tissue sampling within the tortuous luminal organs, due to friction-induced hysteresis and narrow spaces. Here, we propose…
▽ More
In situ tissue biopsy with an endoluminal catheter is an efficient approach for disease diagnosis, featuring low invasiveness and few complications. However, the endoluminal catheter struggles to adjust the biopsy direction by distal endoscope bending or proximal twisting for tissue sampling within the tortuous luminal organs, due to friction-induced hysteresis and narrow spaces. Here, we propose a pneumatically-driven robotic catheter enabling the adjustment of the sampling direction without twisting the catheter for an accurate in situ omnidirectional biopsy. The distal end of the robotic catheter consists of a pneumatic bending actuator for the catheter's deployment in torturous luminal organs and a pneumatic rotatable biopsy mechanism (PRBM). By hierarchical airflow control, the PRBM can adjust the biopsy direction under low airflow and deploy the biopsy needle with higher airflow, allowing for rapid omnidirectional sampling of tissue in situ. This paper describes the design, modeling, and characterization of the proposed robotic catheter, including repeated deployment assessments of the biopsy needle, puncture force measurement, and validation via phantom tests. The PRBM prototype has six sampling directions evenly distributed across 360 degrees when actuated by a positive pressure of 0.3 MPa. The pneumatically-driven robotic catheter provides a novel biopsy strategy, potentially facilitating in situ multidirectional biopsies in tortuous luminal organs with minimum invasiveness.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation
Authors:
Bingqian Lin,
Yunshuang Nie,
Khun Loun Zai,
Ziming Wei,
Mingfei Han,
Rongtao Xu,
Minzhe Niu,
Jianhua Han,
Liang Lin,
Cewu Lu,
Xiaodan Liang
Abstract:
Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corp…
▽ More
Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, causing the mapping learning difficult and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes the perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
Authors:
Yichen Feng,
Zhangchen Xu,
Fengqing Jiang,
Yuetai Li,
Bhaskar Ramasubramanian,
Luyao Niu,
Bill Yuchen Lin,
Radha Poovendran
Abstract:
Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical…
▽ More
Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training data. To tackle the challenge of image synthesis with grounding answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates the code of grounding synthesis image synthesis for puzzle sample assembly. Experiments demonstrate that VLM trained using GRPO on VisualSphinx benefit from logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Bayesian Non-Parametric Inference for Lévy Measures in State-Space Models
Authors:
Bill Z. Lin,
Simon Godsill
Abstract:
Lévy processes, known for their ability to model complex dynamics with skewness, heavy tails and discontinuities, play a critical role in stochastic modeling across various domains. However, inference for most Lévy processes, whether in parametric or non-parametric settings, remains a significant challenge. In this work, we present a novel Bayesian non-parametric inference framework for inferring…
▽ More
Lévy processes, known for their ability to model complex dynamics with skewness, heavy tails and discontinuities, play a critical role in stochastic modeling across various domains. However, inference for most Lévy processes, whether in parametric or non-parametric settings, remains a significant challenge. In this work, we present a novel Bayesian non-parametric inference framework for inferring the Lévy measures of subordinators and normal variance-mean (NVM) processes within a linear Lévy state space model, a setup that significantly extends existing methodologies. We employ the Dirichlet process which further results in a Student-t mixture representation to enable inference for the Lévy measures. An efficient augmented Markov Chain Monte Carlo algorithm is developed for this problem that ensures both accuracy and computational feasibility. The effectiveness of the method is demonstrated on synthetic and tick-level (high-frequency) financial datasets, and we show the practical utility of the inference results using the forecasting performance.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration
Authors:
Yong Wu,
Weihang Pan,
Ke Li,
Chen Binhui,
Ping Li,
Binbin Lin
Abstract:
Large language models (LLMs) have shown remarkable reasoning capabilities, yet aligning such abilities to small language models (SLMs) remains a challenge due to distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often lead to degraded performance when directly applied to weaker models. In this work, we introduce Dynamic Adaptat…
▽ More
Large language models (LLMs) have shown remarkable reasoning capabilities, yet aligning such abilities to small language models (SLMs) remains a challenge due to distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often lead to degraded performance when directly applied to weaker models. In this work, we introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework that bridges the capability gap between expert reasoning trajectories and diverse SLMs. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation via solution simulation. When expert steps surpass the student's capacity -- signaled by an Imitation Gap -- the student autonomously explores alternative reasoning paths, constrained by outcome consistency. We validate DART across multiple reasoning benchmarks and model scales, demonstrating that it significantly improves generalization and data efficiency over static fine-tuning. Our method enhances supervision quality by aligning training signals with the student's reasoning capabilities, offering a scalable solution for reasoning alignment in resource-constrained models.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
Authors:
Shenghai Yuan,
Xianyi He,
Yufan Deng,
Yang Ye,
Jinfa Huang,
Bin Lin,
Jiebo Luo,
Li Yuan
Abstract:
Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from V…
▽ More
Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 18 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
△ Less
Submitted 3 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
Authors:
Yang Ye,
Xianyi He,
Zongjian Li,
Bin Lin,
Shenghai Yuan,
Zhiyuan Yan,
Bohan Hou,
Li Yuan
Abstract:
Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated…
▽ More
Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model using Vision Language Model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic testsuite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on https://github.com/PKU-YuanGroup/ImgEdit.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
Temporal Sampling for Forgotten Reasoning in LLMs
Authors:
Yuetai Li,
Zhangchen Xu,
Fengqing Jiang,
Bhaskar Ramasubramanian,
Luyao Niu,
Bill Yuchen Lin,
Xiang Yue,
Radha Poovendran
Abstract:
Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning…
▽ More
Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, gains from 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents
Authors:
Ziming Wei,
Bingqian Lin,
Zijian Jiao,
Yunshuang Nie,
Liang Ma,
Yuecheng Liu,
Yuzheng Zhuang,
Xiaodan Liang
Abstract:
Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluati…
▽ More
Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning based on typical Visual Question-Answering (VQA) forms, which suffers from the gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open-world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi-modal human instructions. It involves 4,000 curated spatial planning tasks and also provides a paradigm for infinitely expandable data collection by utilizing rich player-generated content. MineAnyBuild evaluates spatial planning through four core supporting dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Based on MineAnyBuild, we perform a comprehensive evaluation for existing MLLM-based agents, revealing the severe limitations but enormous potential in their spatial planning abilities. We believe our MineAnyBuild will open new avenues for the evaluation of spatial intelligence and help promote further development for open-world AI agents capable of spatial planning.
△ Less
Submitted 27 May, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
Authors:
Zhangchen Xu,
Yuetai Li,
Fengqing Jiang,
Bhaskar Ramasubramanian,
Luyao Niu,
Bill Yuchen Lin,
Radha Poovendran
Abstract:
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs. Our in…
▽ More
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods, which dynamically identifies potential false negatives and recovers valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
△ Less
Submitted 22 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
RGB-to-Polarization Estimation: A New Task and Benchmark Study
Authors:
Beibei Lin,
Zifeng Yuan,
Tingting Chen
Abstract:
Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridg…
▽ More
Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families -- such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Decoding Neighborhood Environments with Large Language Models
Authors:
Andrew Cart,
Shaohu Zhang,
Melanie Escue,
Xugui Zhou,
Haitao Zhao,
Prashanth BusiReddyGari,
Beiyu Lin,
Shuang Li
Abstract:
Neighborhood environments include physical and environmental conditions such as housing quality, roads, and sidewalks, which significantly influence human health and well-being. Traditional methods for assessing these environments, including field surveys and geographic information systems (GIS), are resource-intensive and challenging to evaluate neighborhood environments at scale. Although machin…
▽ More
Neighborhood environments include physical and environmental conditions such as housing quality, roads, and sidewalks, which significantly influence human health and well-being. Traditional methods for assessing these environments, including field surveys and geographic information systems (GIS), are resource-intensive and challenging to evaluate neighborhood environments at scale. Although machine learning offers potential for automated analysis, the laborious process of labeling training data and the lack of accessible models hinder scalability. This study explores the feasibility of large language models (LLMs) such as ChatGPT and Gemini as tools for decoding neighborhood environments (e.g., sidewalk and powerline) at scale. We train a robust YOLOv11-based model, which achieves an average accuracy of 99.13% in detecting six environmental indicators, including streetlight, sidewalk, powerline, apartment, single-lane road, and multilane road. We then evaluate four LLMs, including ChatGPT, Gemini, Claude, and Grok, to assess their feasibility, robustness, and limitations in identifying these indicators, with a focus on the impact of prompting strategies and fine-tuning. We apply majority voting with the top three LLMs to achieve over 88% accuracy, which demonstrates LLMs could be a useful tool to decode the neighborhood environment without any training effort.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model
Authors:
Baijiong Lin,
Weisen Jiang,
Yuancheng Xu,
Hao Chen,
Ying-Cong Chen
Abstract:
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors…
▽ More
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for \textit{multiple} ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at https://github.com/Baijiong-Lin/PARM.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks
Authors:
Kang Yang,
Xinjun Mao,
Shangwen Wang,
Yanlin Wang,
Tanghaoran Zhang,
Bo Lin,
Yihao Qin,
Zhang Zhang,
Yao Lu,
Kamal Al-Sabahi
Abstract:
Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-gener…
▽ More
Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-generated ones improves pre-training datasets. Since standard metrics cannot assess reference comment quality, we propose two novel reference-free evaluation tasks: code-comment inconsistency detection and semantic code search. Results show that LLM-generated comments are more semantically consistent with code than human-written ones, as confirmed by manual evaluation. Leveraging this finding, we rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5. Evaluations demonstrate that models trained on LLM-enhanced data outperform those using original human comments in code summarization, generation, and translation tasks. This work validates rebuilding pre-training datasets with LLMs to advance code intelligence, challenging the traditional reliance on human reference comments.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
Give LLMs a Security Course: Securing Retrieval-Augmented Code Generation via Knowledge Injection
Authors:
Bo Lin,
Shangwen Wang,
Yihao Qin,
Liqian Chen,
Xiaoguang Mao
Abstract:
Retrieval-Augmented Code Generation (RACG) leverages external knowledge to enhance Large Language Models (LLMs) in code synthesis, improving the functional correctness of the generated code. However, existing RACG systems largely overlook security, leading to substantial risks. Especially, the poisoning of malicious code into knowledge bases can mislead LLMs, resulting in the generation of insecur…
▽ More
Retrieval-Augmented Code Generation (RACG) leverages external knowledge to enhance Large Language Models (LLMs) in code synthesis, improving the functional correctness of the generated code. However, existing RACG systems largely overlook security, leading to substantial risks. Especially, the poisoning of malicious code into knowledge bases can mislead LLMs, resulting in the generation of insecure outputs, which poses a critical threat in modern software development. To address this, we propose a security-hardening framework for RACG systems, CodeGuarder, that shifts the paradigm from retrieving only functional code examples to incorporating both functional code and security knowledge. Our framework constructs a security knowledge base from real-world vulnerability databases, including secure code samples and root cause annotations. For each code generation query, a retriever decomposes the query into fine-grained sub-tasks and fetches relevant security knowledge. To prioritize critical security guidance, we introduce a re-ranking and filtering mechanism by leveraging the LLMs' susceptibility to different vulnerability types. This filtered security knowledge is seamlessly integrated into the generation prompt. Our evaluation shows CodeGuarder significantly improves code security rates across various LLMs, achieving average improvements of 20.12\% in standard RACG, and 31.53\% and 21.91\% under two distinct poisoning scenarios without compromising functional correctness. Furthermore, CodeGuarder demonstrates strong generalization, enhancing security even when the targeted language's security knowledge is lacking. This work presents CodeGuarder as a pivotal advancement towards building secure and trustworthy RACG systems.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
An energy optimization method based on mixed-integer model and variational quantum computing algorithm for faster IMPT
Authors:
Ya-Nan Zhu,
Nimita Shinde,
Bowen Lin,
Hao Gao
Abstract:
Intensity-modulated proton therapy (IMPT) offers superior dose conformity with reduced exposure to surrounding healthy tissues compared to conventional photon therapy. Improving IMPT delivery efficiency reduces motion-related uncertainties, enhances plan robustness, and benefits breath-hold techniques by shortening treatment time. Among various factors, energy switching time plays a critical role,…
▽ More
Intensity-modulated proton therapy (IMPT) offers superior dose conformity with reduced exposure to surrounding healthy tissues compared to conventional photon therapy. Improving IMPT delivery efficiency reduces motion-related uncertainties, enhances plan robustness, and benefits breath-hold techniques by shortening treatment time. Among various factors, energy switching time plays a critical role, making energy layer optimization (ELO) essential. This work develops an energy layer optimization method based on mixed integer model and variational quantum computing algorithm to enhance the efficiency of IMPT. The energy layer optimization problem is modeled as a mixed-integer program, where continuous variables optimize the dose distribution and binary variables indicate energy layer selection. To solve it, iterative convex relaxation decouples the dose-volume constraints, followed by the alternating direction method of multipliers (ADMM) to separate mixed-variable optimization and the minimum monitor unit (MMU) constraint. The resulting beam intensity subproblem, subject to MMU, either admits a closed-form solution or is efficiently solvable via conjugate gradient. The binary subproblem is cast as a quadratic unconstrained binary optimization (QUBO) problem, solvable using variational quantum computing algorithms. With nearly the same plan quality, the proposed method noticeable reduces the number of the used energies. For example, compared to conventional IMPT, QC can reduce the number of energy layers from 61 to 35 in HN case, from 56 to 35 in lung case, and from 59 to 32 to abdomen case. The reduced number of energies also results in fewer delivery time, e.g., the delivery time is reduced from 100.6, 232.0, 185.3 seconds to 90.7, 215.4, 154.0 seconds, respectively.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models
Authors:
Yifan Yang,
Lei Zou,
Bing Zhou,
Daoyang Li,
Binbin Lin,
Joynal Abedin,
Mingzheng Yang
Abstract:
Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Despite several investigations attempting to analyze street-view images for damage estimation, they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images p…
▽ More
Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Despite several investigations attempting to analyze street-view images for damage estimation, they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images provide valuable benchmarks for accurate damage estimations at building and street levels. These images could aid annotators in objectively labeling post-disaster impacts, improving the reliability of labeled data sets for model training, and potentially enhancing the model performance in damage evaluation. The goal of this study is to estimate hyperlocal, on-the-ground disaster damages using bi-temporal street-view images and advanced pre-trained vision models. Street-view images before and after 2024 Hurricane Milton in Horseshoe Beach, Florida, were collected for experiments. The objectives are: (1) to assess the performance gains of incorporating pre-disaster street-view images as a no-damage category in fine-tuning pre-trained models, including Swin Transformer and ConvNeXt, for damage level classification; (2) to design and evaluate a dual-channel algorithm that reads pair-wise pre- and post-disaster street-view images for hyperlocal damage assessment. The results indicate that incorporating pre-disaster street-view images and employing a dual-channel processing framework can significantly enhance damage assessment accuracy. The accuracy improves from 66.14% with the Swin Transformer baseline to 77.11% with the dual-channel Feature-Fusion ConvNeXt model. This research enables rapid, operational damage assessments at hyperlocal spatial resolutions, providing valuable insights to support effective decision-making in disaster management and resilience planning.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
LSR-MCTS: Alleviating Long Range Dependency in Code Generation
Authors:
Tingwei Lu,
Yangning Li,
Liyuan Wang,
Binghuai Lin,
Jiwei Tang,
Qingsong Lv,
Wanshi Xu,
Hai-Tao Zheng,
Yinghui Li,
Xin Su,
Zifei Shan
Abstract:
The emergence of large language models (LLMs) has significantly promoted the development of code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limit…
▽ More
The emergence of large language models (LLMs) has significantly promoted the development of code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limited focus on choosing the appropriate processing length for generations. By analyzing the attention between tokens during the generation process of LLMs, it can be observed that the high spikes of the attention scores typically appear at the end of lines. This insight suggests that it is reasonable to treat each line of code as a fundamental processing unit and generate them sequentially. Inspired by this, we propose the \textbf{LSR-MCTS} algorithm, which leverages MCTS to determine the code line-by-line and select the optimal path. Further, we integrate a self-refine mechanism at each node to enhance diversity and generate higher-quality programs through error correction. Extensive experiments and comprehensive analyses on three public coding benchmarks demonstrate that our method outperforms the state-of-the-art performance approaches.
△ Less
Submitted 17 May, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Characteristics of Ge-doped Multi-Mode Fibers in Total Ionizing Dose
Authors:
Datao Gong,
Suen Hou,
Bo-Jing Juang,
Bin Lin,
Chonghan Liu,
Tiankuan Liu,
Ming Qi,
Yi Yang,
Jingbo Ye,
Lei Zhang,
Li Zhang,
HuiPing Zhu
Abstract:
Purpose: The fiber optical links in 850 nm band with Ge-doped multi-mode (MM) fibers are well developed for data transmission at 10 Gbps and higher. The applications in nuclear environments require radiation resistance. The characteristics of Ge-doped MM fibers are investigated for Radiation Induced Attenuation (RIA) in Total Ionizing Dose (TID).
Methods: Commercial samples of Ge-doped MM fibers…
▽ More
Purpose: The fiber optical links in 850 nm band with Ge-doped multi-mode (MM) fibers are well developed for data transmission at 10 Gbps and higher. The applications in nuclear environments require radiation resistance. The characteristics of Ge-doped MM fibers are investigated for Radiation Induced Attenuation (RIA) in Total Ionizing Dose (TID).
Methods: Commercial samples of Ge-doped MM fibers were irradiated in Go-60 gamma rays at dose rates of 5 to 1.4k Gy(SiO2)/hr. The fiber samples were packaged in water tanks maintained at constant temperatures in the range of -15 to 45 degC. The optical power transmitted through the fibers were recorded in irradiation, and in annealing when the source was shielded. The measurements of RIA in time are analyzed for dose rate and temperature dependences.
Results: Ge-doped fiber samples of OM2 to OM4 grades were investigated for attenuation of optical power in radiation ionizing dose. Depending on the fabrication technology, two of the fiber types show radiation resistance with the RIAs of 0.2 dB/m and 0.05 dB/m, respectively, for the TID of 300 kGy(SiO2). At low dose rate of 5 Gy/hr, the RIA increases steadily and the annealing of low density ionizing defects does not cause notable deviation. At 1.4 kGy/hr the accumulated defects result to twice higher RIA during irradiation, and is worsen to a factor three in cold temperature. However, once the source is shielded the recovery is effective in a few hours.
Conclusion: The telecom products of 850 nm Ge-doped MM fibers provide high speed communication in distances of a few hundred meters. The industrial fabrication methods provide fibers that can endure radiation ionizing dose for applications in nuclear instrumentation.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
OnRL-RAG: Real-Time Personalized Mental Health Dialogue System
Authors:
Ahsan Bilal,
Beiyu Lin
Abstract:
Large language models (LLMs) have been widely used for various tasks and applications. However, LLMs and fine-tuning are limited to the pre-trained data. For example, ChatGPT's world knowledge until 2021 can be outdated or inaccurate. To enhance the capabilities of LLMs, Retrieval-Augmented Generation (RAG), is proposed to augment LLMs with additional, new, latest details and information to LLMs.…
▽ More
Large language models (LLMs) have been widely used for various tasks and applications. However, LLMs and fine-tuning are limited to the pre-trained data. For example, ChatGPT's world knowledge until 2021 can be outdated or inaccurate. To enhance the capabilities of LLMs, Retrieval-Augmented Generation (RAG), is proposed to augment LLMs with additional, new, latest details and information to LLMs. While RAG offers the correct information, it may not best present it, especially to different population groups with personalizations. Reinforcement Learning from Human Feedback (RLHF) adapts to user needs by aligning model responses with human preference through feedback loops. In real-life applications, such as mental health problems, a dynamic and feedback-based model would continuously adapt to new information and offer personalized assistance due to complex factors fluctuating in a daily environment. Thus, we propose an Online Reinforcement Learning-based Retrieval-Augmented Generation (OnRL-RAG) system to detect and personalize the responding systems to mental health problems, such as stress, anxiety, and depression. We use an open-source dataset collected from 2028 College Students with 28 survey questions for each student to demonstrate the performance of our proposed system with the existing systems. Our system achieves superior performance compared to standard RAG and simple LLM via GPT-4o, GPT-4o-mini, Gemini-1.5, and GPT-3.5. This work would open up the possibilities of real-life applications of LLMs for personalized services in the everyday environment. The results will also help researchers in the fields of sociology, psychology, and neuroscience to align their theories more closely with the actual human daily environment.
△ Less
Submitted 22 April, 2025; v1 submitted 2 April, 2025;
originally announced April 2025.
-
Exploring the Impact of an LLM-Powered Teachable Agent on Learning Gains and Cognitive Load in Music Education
Authors:
Lingxi Jin,
Baicheng Lin,
Mengze Hong,
Kun Zhang,
Hyo-Jeong So
Abstract:
This study examines the impact of an LLM-powered teachable agent, grounded in the Learning by Teaching (LBT) pedagogy, on students' music theory learning and cognitive load. The participants were 28 Chinese university students with prior music instrumental experiences. In an online experiment, they were assigned to either an experimental group, which engaged in music analysis with the teachable ag…
▽ More
This study examines the impact of an LLM-powered teachable agent, grounded in the Learning by Teaching (LBT) pedagogy, on students' music theory learning and cognitive load. The participants were 28 Chinese university students with prior music instrumental experiences. In an online experiment, they were assigned to either an experimental group, which engaged in music analysis with the teachable agent, or a control group, which conducted self-directed analysis using instructional materials. Findings indicate that students in the experimental group achieved significantly higher post-test scores than those in the control group. Additionally, they reported lower cognitive load, suggesting that the teachable agent effectively reduced the cognitive demands of music analysis tasks. These results highlight the potential of AI-driven scaffolding based on LBT principles to enhance music theory education, supporting teachers in delivering theory-oriented instruction while fostering students' self-directed learning skills.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
LLMs for Explainable AI: A Comprehensive Survey
Authors:
Ahsan Bilal,
David Ebert,
Beiyu Lin
Abstract:
Large Language Models (LLMs) offer a promising approach to enhancing Explainable AI (XAI) by transforming complex machine learning outputs into easy-to-understand narratives, making model predictions more accessible to users, and helping bridge the gap between sophisticated model behavior and human interpretability. AI models, such as state-of-the-art neural networks and deep learning models, are…
▽ More
Large Language Models (LLMs) offer a promising approach to enhancing Explainable AI (XAI) by transforming complex machine learning outputs into easy-to-understand narratives, making model predictions more accessible to users, and helping bridge the gap between sophisticated model behavior and human interpretability. AI models, such as state-of-the-art neural networks and deep learning models, are often seen as "black boxes" due to a lack of transparency. As users cannot fully understand how the models reach conclusions, users have difficulty trusting decisions from AI models, which leads to less effective decision-making processes, reduced accountabilities, and unclear potential biases. A challenge arises in developing explainable AI (XAI) models to gain users' trust and provide insights into how models generate their outputs. With the development of Large Language Models, we want to explore the possibilities of using human language-based models, LLMs, for model explainabilities. This survey provides a comprehensive overview of existing approaches regarding LLMs for XAI, and evaluation techniques for LLM-generated explanation, discusses the corresponding challenges and limitations, and examines real-world applications. Finally, we discuss future directions by emphasizing the need for more interpretable, automated, user-centric, and multidisciplinary approaches for XAI via LLMs.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
Authors:
Jixuan Leng,
Chengsong Huang,
Langlin Huang,
Bill Yuchen Lin,
William W. Cohen,
Haohan Wang,
Jiaxin Huang
Abstract:
Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities o…
▽ More
Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles-a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
△ Less
Submitted 30 March, 2025;
originally announced April 2025.
-
Toward building next-generation Geocoding systems: a systematic review
Authors:
Zhengcong Yin,
Daniel W. Goldberg,
Binbin Lin,
Bing Zhou,
Diya Li,
Andong Ma,
Ziqian Ming,
Heng Cai,
Zhe Zhang,
Shaohua Wang,
Shanzhen Gao,
Joey Ying Lee,
Xiao Li,
Da Huo
Abstract:
Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across vari…
▽ More
Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
3DSwapping: Texture Swapping For 3D Object From Single Reference Image
Authors:
Xiao Cao,
Beibei Lin,
Bo Wang,
Zhiyong Huang,
Robby T. Tan
Abstract:
3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to pres…
▽ More
3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DSwapping, a 3D texture swapping method that integrates: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, our progressive generation process starts by editing a single reference image and gradually propagates the edits to adjacent views. Our view-consistency gradient guidance further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, we introduce prompt-tuning-based gradient guidance, which learns a token that precisely captures the difference between the reference image and the 3D object. This token then guides the editing process, ensuring more consistent texture preservation across views. Overall, 3DSwapping integrates these novel strategies to achieve higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that our three novel components enable convincing and effective 2D texture swapping for 3D objects. Code will be available upon acceptance.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Authors:
Ziming Wei,
Bingqian Lin,
Yunshuang Nie,
Jiaqi Chen,
Shikui Ma,
Hang Xu,
Xiaodan Liang
Abstract:
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires exte…
▽ More
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts
Authors:
Sheng Ouyang,
Yihao Qin,
Bo Lin,
Liqian Chen,
Xiaoguang Mao,
Shangwen Wang
Abstract:
The proliferation of Large Language Models (LLMs) has revolutionized natural language processing and significantly impacted code generation tasks, enhancing software development efficiency and productivity. Notably, LLMs like GPT-4 have demonstrated remarkable proficiency in text-to-code generation tasks. However, the growing reliance on LLMs for code generation necessitates a critical examination…
▽ More
The proliferation of Large Language Models (LLMs) has revolutionized natural language processing and significantly impacted code generation tasks, enhancing software development efficiency and productivity. Notably, LLMs like GPT-4 have demonstrated remarkable proficiency in text-to-code generation tasks. However, the growing reliance on LLMs for code generation necessitates a critical examination of the safety implications associated with their outputs. Existing research efforts have primarily focused on verifying the functional correctness of LLMs, overlooking their safety in code generation. This paper introduces a jailbreaking approach, CodeJailbreaker, designed to uncover safety concerns in LLM-based code generation. The basic observation is that existing safety mechanisms for LLMs are built through the instruction-following paradigm, where malicious intent is explicitly articulated within the instruction of the prompt. Consequently, CodeJailbreaker explores to construct a prompt whose instruction is benign and the malicious intent is implicitly encoded in a covert channel, i.e., the commit message, to bypass the safety mechanism. Experiments on the recently-released RMCBench benchmark demonstrate that CodeJailbreaker markedly surpasses the conventional jailbreaking strategy, which explicitly conveys malicious intents in the instructions, in terms of the attack effectiveness across three code generation tasks. This study challenges the traditional safety paradigms in LLM-based code generation, emphasizing the need for enhanced safety measures in safeguarding against implicit malicious cues.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
Authors:
Zhengxian Yang,
Shi Pan,
Shengqi Wang,
Haoxiang Wang,
Li Lin,
Guanjun Li,
Zhengqi Wen,
Borong Lin,
Jianhua Tao,
Tao Yu
Abstract:
User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. To stimulate the reconstruction of immersive volumetric videos,…
▽ More
User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture.
The captured multi-view videos (with synchronized audios) are in 5K resolution at 60FPS, lasting from 1-5 minutes, and include rich foreground-background elements, and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video
Authors:
Chengshu Zhao,
Yunyang Ge,
Xinhua Cheng,
Bin Zhu,
Yatian Pang,
Bin Lin,
Fan Yang,
Feng Gao,
Li Yuan
Abstract:
Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks instead of an independent task and typically rely on various models to achieve video body-swapping sequentially. However, these methods fail to achieve end-to-end opti…
▽ More
Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks instead of an independent task and typically rely on various models to achieve video body-swapping sequentially. However, these methods fail to achieve end-to-end optimization for the video body-swapping which causes issues such as variations in luminance among frames, disorganized occlusion relationships, and the noticeable separation between bodies and background. In this work, we define video body-swapping as an independent task and propose three critical consistencies: identity consistency, motion consistency, and environment consistency. We introduce an end-to-end model named SwapAnyone, treating video body-swapping as a video inpainting task with reference fidelity and motion control. To improve the ability to maintain environmental harmony, particularly luminance harmony in the resulting video, we introduce a novel EnvHarmony strategy for training our model progressively. Additionally, we provide a dataset named HumanAction-32K covering various videos about human actions. Extensive experiments demonstrate that our method achieves State-Of-The-Art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions. All code, model weights, and the HumanAction-32K dataset will be open-sourced at https://github.com/PKU-YuanGroup/SwapAnyone.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
Large Language Models-Aided Program Debloating
Authors:
Bo Lin,
Shangwen Wang,
Yihao Qin,
Liqian Chen,
Xiaoguang Mao
Abstract:
As software grows in complexity to accommodate diverse features and platforms, software bloating has emerged as a significant challenge, adversely affecting performance and security. However, existing approaches inadequately address the dual objectives of debloating: maintaining functionality by preserving essential features and enhancing security by reducing security issues. Specifically, current…
▽ More
As software grows in complexity to accommodate diverse features and platforms, software bloating has emerged as a significant challenge, adversely affecting performance and security. However, existing approaches inadequately address the dual objectives of debloating: maintaining functionality by preserving essential features and enhancing security by reducing security issues. Specifically, current software debloating techniques often rely on input-based analysis, using user inputs as proxies for the specifications of desired features. However, these approaches frequently overfit provided inputs, leading to functionality loss and potential security vulnerabilities. To address these limitations, we propose LEADER, a program debloating framework enhanced by Large Language Models (LLMs), which leverages their semantic understanding, generative capabilities, and decision-making strengths. LEADER mainly consists of two modules: (1) a documentation-guided test augmentation module designed to preserve functionality, which leverages LLMs to comprehend program documentation and generates sufficient tests to cover the desired features comprehensively, and (2) a multi-advisor-aided program debloating module that employs a neuro-symbolic pipeline to ensure that the security of the software can be perceived during debloating. This module combines debloating and security advisors for analysis and employs an LLM as a decision-maker to eliminate undesired code securely. Extensive evaluations on widely used benchmarks demonstrate the efficacy of LEADER. These results demonstrate that LEADER surpasses the state-of-the-art tool CovA in functionality and security. These results underscore the potential of LEADER to set a new standard in program debloating by effectively balancing functionality and security.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
Seeing Beyond Haze: Generative Nighttime Image Dehazing
Authors:
Beibei Lin,
Stephen Lin,
Robby Tan
Abstract:
Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or completely obscure background information. Existing methods often encounter difficulties due to insufficient background priors and limited generative ability, both essential for handling such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not on…
▽ More
Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or completely obscure background information. Existing methods often encounter difficulties due to insufficient background priors and limited generative ability, both essential for handling such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only significantly reduces haze and glow effects but also infers background information in regions where it may be absent. Our approach is developed on two main ideas: gaining strong background priors by adapting image diffusion models to the nighttime dehazing problem, and enhancing generative ability for haze- and glow-obscured scene areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model in a manner that preserves its capacity to generate clean images. The diffusion model is additionally trained on image pairs designed to improve its ability to generate background details and content that are missing in the input image due to haze effects. Since generative models are susceptible to hallucinations, we develop our framework to allow user control over the generative level, balancing visual realism and factual accuracy. Experiments on real-world images demonstrate that BeyondHaze effectively restores visibility in dense nighttime haze.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Authors:
Yuwei Niu,
Munan Ning,
Mengren Zheng,
Weiyang Jin,
Bin Lin,
Peng Jin,
Jiaqi Liao,
Chaoran Feng,
Kunpeng Ning,
Bin Zhu,
Li Yuan
Abstract:
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text to image generation. To address this challenge, we propose…
▽ More
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text to image generation. To address this challenge, we propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce $\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
△ Less
Submitted 27 May, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
HFedCKD: Toward Robust Heterogeneous Federated Learning via Data-free Knowledge Distillation and Two-way Contrast
Authors:
Yiting Zheng,
Bohan Lin,
Jinqian Chen,
Jihua Zhu
Abstract:
Most current federated learning frameworks are modeled as static processes, ignoring the dynamic characteristics of the learning system. Under the limited communication budget of the central server, the flexible model architecture of a large number of clients participating in knowledge transfer requires a lower participation rate, active clients have uneven contributions, and the client scale seri…
▽ More
Most current federated learning frameworks are modeled as static processes, ignoring the dynamic characteristics of the learning system. Under the limited communication budget of the central server, the flexible model architecture of a large number of clients participating in knowledge transfer requires a lower participation rate, active clients have uneven contributions, and the client scale seriously hinders the performance of FL. We consider a more general and practical federation scenario and propose a system heterogeneous federation method based on data-free knowledge distillation and two-way contrast (HFedCKD). We apply the Inverse Probability Weighted Distillation (IPWD) strategy to the data-free knowledge transfer framework. The generator completes the data features of the nonparticipating clients. IPWD implements a dynamic evaluation of the prediction contribution of each client under different data distributions. Based on the antibiased weighting of its prediction loss, the weight distribution of each client is effectively adjusted to fairly integrate the knowledge of participating clients. At the same time, the local model is split into a feature extractor and a classifier. Through differential contrast learning, the feature extractor is aligned with the global model in the feature space, while the classifier maintains personalized decision-making capabilities. HFedCKD effectively alleviates the knowledge offset caused by a low participation rate under data-free knowledge distillation and improves the performance and stability of the model. We conduct extensive experiments on image and IoT datasets to comprehensively evaluate and verify the generalization and robustness of the proposed HFedCKD framework.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications
Authors:
Nam Huynh,
Beiyu Lin
Abstract:
Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code. We begin with understanding LLMs' limitations and challenges in automated code generation. Subsequently, we review various fine-tuning techniques de…
▽ More
Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code. We begin with understanding LLMs' limitations and challenges in automated code generation. Subsequently, we review various fine-tuning techniques designed to enhance both the performance and adaptability of LLMs in code generation tasks. We then review the existing metrics and benchmarks for evaluations to assess model performance based on fine-tuning techniques. Finally, we explore the applications of LLMs (e.g. CodeLlama, GitHub Copilot, ToolGen) in code generation tasks to illustrate their roles and functionalities. This survey provides a comprehensive overview of LLMs for code generation, helps researchers in diverse fields better understand the current state-of-the-art technologies, and offers the potential of effectively leveraging LLMs for code generation tasks.
△ Less
Submitted 2 April, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
Simulation of the Background from $^{13}$C$(α, n)^{16}$O Reaction in the JUNO Scintillator
Authors:
JUNO Collaboration,
Thomas Adam,
Kai Adamowicz,
Shakeel Ahmad,
Rizwan Ahmed,
Sebastiano Aiello,
Fengpeng An,
Costas Andreopoulos,
Giuseppe Andronico,
Nikolay Anfimov,
Vito Antonelli,
Tatiana Antoshkina,
João Pedro Athayde Marcondes de André,
Didier Auguste,
Weidong Bai,
Nikita Balashov,
Andrea Barresi,
Davide Basilico,
Eric Baussan,
Marco Beretta,
Antonio Bergnoli,
Nikita Bessonov,
Daniel Bick,
Lukas Bieger,
Svetlana Biktemerova
, et al. (608 additional authors not shown)
Abstract:
Large-scale organic liquid scintillator detectors are highly efficient in the detection of MeV-scale electron antineutrinos. These signal events can be detected through inverse beta decay on protons, which produce a positron accompanied by a neutron. A noteworthy background for antineutrinos coming from nuclear power reactors and from the depths of the Earth (geoneutrinos) is generated by ($α, n$)…
▽ More
Large-scale organic liquid scintillator detectors are highly efficient in the detection of MeV-scale electron antineutrinos. These signal events can be detected through inverse beta decay on protons, which produce a positron accompanied by a neutron. A noteworthy background for antineutrinos coming from nuclear power reactors and from the depths of the Earth (geoneutrinos) is generated by ($α, n$) reactions. In organic liquid scintillator detectors, $α$ particles emitted from intrinsic contaminants such as $^{238}$U, $^{232}$Th, and $^{210}$Pb/$^{210}$Po, can be captured on $^{13}$C nuclei, followed by the emission of a MeV-scale neutron. Three distinct interaction mechanisms can produce prompt energy depositions preceding the delayed neutron capture, leading to a pair of events correlated in space and time within the detector. Thus, ($α, n$) reactions represent an indistinguishable background in liquid scintillator-based antineutrino detectors, where their expected rate and energy spectrum are typically evaluated via Monte Carlo simulations. This work presents results from the open-source SaG4n software, used to calculate the expected energy depositions from the neutron and any associated de-excitation products. Also simulated is a detailed detector response to these interactions, using a dedicated Geant4-based simulation software from the JUNO experiment. An expected measurable $^{13}$C$(α, n)^{16}$O event rate and reconstructed prompt energy spectrum with associated uncertainties, are presented in the context of JUNO, however, the methods and results are applicable and relevant to other organic liquid scintillator neutrino detectors.
△ Less
Submitted 2 May, 2025; v1 submitted 2 March, 2025;
originally announced March 2025.
-
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
Authors:
Xiwen Liang,
Min Lin,
Weiqi Ruan,
Rongtao Xu,
Yuecheng Liu,
Jiaqi Chen,
Bingqian Lin,
Yuzheng Zhuang,
Xiaodan Liang
Abstract:
Existing methods for vision-language task planning excel in short-horizon tasks but often fall short in complex, long-horizon planning within dynamic environments. These challenges primarily arise from the difficulty of effectively training models to produce high-quality reasoning processes for long-horizon tasks. To address this, we propose Structured Preference Optimization (SPO), which aims to…
▽ More
Existing methods for vision-language task planning excel in short-horizon tasks but often fall short in complex, long-horizon planning within dynamic environments. These challenges primarily arise from the difficulty of effectively training models to produce high-quality reasoning processes for long-horizon tasks. To address this, we propose Structured Preference Optimization (SPO), which aims to enhance reasoning and action selection in long-horizon task planning through structured preference evaluation and optimized training strategies. Specifically, SPO introduces: 1) Preference-Based Scoring and Optimization, which systematically evaluates reasoning chains based on task relevance, visual grounding, and historical consistency; and 2) Curriculum-Guided Training, where the model progressively adapts from simple to complex tasks, improving its generalization ability in long-horizon scenarios and enhancing reasoning robustness. To advance research in vision-language long-horizon task planning, we introduce ExtendaBench, a comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat 2.0, categorized into ultra-short, short, medium, and long tasks. Experimental results demonstrate that SPO significantly improves reasoning quality and final decision accuracy, outperforming prior methods on long-horizon tasks and underscoring the effectiveness of preference-driven optimization in vision-language task planning. Specifically, SPO achieves a +5.98% GCR and +4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement in Habitat over the best-performing baselines.
△ Less
Submitted 15 May, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs
Authors:
Tingting Chen,
Srinivas Anumasa,
Beibei Lin,
Vedant Shah,
Anirudh Goyal,
Dianbo Liu
Abstract:
Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human-like scientific research and discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions;…
▽ More
Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human-like scientific research and discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. In response to these limitations, we introduce a novel benchmark, \textit{Auto-Bench}, that encompasses necessary aspects to evaluate LLMs for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of underlying interactions, the chemistry and social interactions, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases, which suggests an important gap between machine and human intelligence that future development of LLMs need to take into consideration.
△ Less
Submitted 21 February, 2025;
originally announced February 2025.
-
Small Models Struggle to Learn from Strong Reasoners
Authors:
Yuetai Li,
Xiang Yue,
Zhangchen Xu,
Fengqing Jiang,
Luyao Niu,
Bill Yuchen Lin,
Bhaskar Ramasubramanian,
Radha Poovendran
Abstract:
Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they per…
▽ More
Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
△ Less
Submitted 22 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Authors:
Fengqing Jiang,
Zhangchen Xu,
Yuetai Li,
Luyao Niu,
Zhen Xiang,
Bo Li,
Bill Yuchen Lin,
Radha Poovendran
Abstract:
Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Cu…
▽ More
Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Tropical Fréchet Means
Authors:
Bo Lin,
Kamillo Ferry,
Carlos Améndola,
Anthea Monod,
Ruriko Yoshida
Abstract:
The Fréchet mean is a key measure of central tendency as a barycenter for a given set of points in a general metric space. It is computed by solving an optimization problem and is a fundamental quantity in statistics. In this paper, we study Fréchet means in tropical geometry -- a piecewise linear, combinatorial, and polyhedral variant of algebraic geometry that has gained prominence in applicatio…
▽ More
The Fréchet mean is a key measure of central tendency as a barycenter for a given set of points in a general metric space. It is computed by solving an optimization problem and is a fundamental quantity in statistics. In this paper, we study Fréchet means in tropical geometry -- a piecewise linear, combinatorial, and polyhedral variant of algebraic geometry that has gained prominence in applications. A key property of Fréchet means is that uniqueness is generally not guaranteed, which is true in tropical settings. In solving the tropical Fréchet mean optimization problem, we obtain a geometric characterization of the collection of all Fréchet means in a general tropical space as a tropically and classically convex polytope. Furthermore, we prove that a certificate of positivity for finitely many quadratic polynomials in $\mathbb{R}[x_1,\ldots,x_n]$ always exists, given that their quadratic homogeneous components are sums of squares. We propose an algorithm to symbolically compute the Fréchet mean polytope based on our exact quadratic optimization result and study its complexity.
△ Less
Submitted 7 February, 2025;
originally announced February 2025.
-
Exploring the Security Threats of Knowledge Base Poisoning in Retrieval-Augmented Code Generation
Authors:
Bo Lin,
Shangwen Wang,
Liqian Chen,
Xiaoguang Mao
Abstract:
The integration of Large Language Models (LLMs) into software development has revolutionized the field, particularly through the use of Retrieval-Augmented Code Generation (RACG) systems that enhance code generation with information from external knowledge bases. However, the security implications of RACG systems, particularly the risks posed by vulnerable code examples in the knowledge base, rema…
▽ More
The integration of Large Language Models (LLMs) into software development has revolutionized the field, particularly through the use of Retrieval-Augmented Code Generation (RACG) systems that enhance code generation with information from external knowledge bases. However, the security implications of RACG systems, particularly the risks posed by vulnerable code examples in the knowledge base, remain largely unexplored. This risk is particularly concerning given that public code repositories, which often serve as the sources for knowledge base collection in RACG systems, are usually accessible to anyone in the community. Malicious attackers can exploit this accessibility to inject vulnerable code into the knowledge base, making it toxic. Once these poisoned samples are retrieved and incorporated into the generated code, they can propagate security vulnerabilities into the final product. This paper presents the first comprehensive study on the security risks associated with RACG systems, focusing on how vulnerable code in the knowledge base compromises the security of generated code. We investigate the LLM-generated code security across different settings through extensive experiments using four major LLMs, two retrievers, and two poisoning scenarios. Our findings highlight the significant threat of knowledge base poisoning, where even a single poisoned code example can compromise up to 48% of generated code. Our findings provide crucial insights into vulnerability introduction in RACG systems and offer practical mitigation recommendations, thereby helping improve the security of LLM-generated code in future works.
△ Less
Submitted 5 February, 2025;
originally announced February 2025.
-
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Authors:
Bill Yuchen Lin,
Ronan Le Bras,
Kyle Richardson,
Ashish Sabharwal,
Radha Poovendran,
Peter Clark,
Yejin Choi
Abstract:
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and qu…
▽ More
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty.
Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
△ Less
Submitted 3 February, 2025;
originally announced February 2025.
-
EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling
Authors:
Jia-Hua Lee,
Bor-Jiun Lin,
Wei-Fang Sun,
Chun-Yi Lee
Abstract:
World models represent a promising approach for training reinforcement learning agents with significantly improved sample efficiency. While most world model methods primarily rely on sequences of discrete latent variables to model environment dynamics, this compression often neglects critical visual details essential for reinforcement learning. Recent diffusion-based world models condition generat…
▽ More
World models represent a promising approach for training reinforcement learning agents with significantly improved sample efficiency. While most world model methods primarily rely on sequences of discrete latent variables to model environment dynamics, this compression often neglects critical visual details essential for reinforcement learning. Recent diffusion-based world models condition generation on a fixed context length of frames to predict the next observation, using separate recurrent neural networks to model rewards and termination signals. Although this architecture effectively enhances visual fidelity, the fixed context length approach inherently limits memory capacity. In this paper, we introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models. Our approach outperforms existing baselines across visually challenging Atari 100k tasks, memory-demanding Crafter benchmark, and 3D first-person ViZDoom environments, demonstrating superior performance in all these diverse challenges.
△ Less
Submitted 15 June, 2025; v1 submitted 1 February, 2025;
originally announced February 2025.