-
VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts
Authors:
Xin Liu,
Lechen Zhang,
Sheza Munir,
Yiyang Gu,
Lu Wang
Abstract:
Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality…
▽ More
Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench , a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and close-weight LLMs on FactRBench indicate that larger models within same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset
Authors:
Yicheng Gu,
Chaoren Wang,
Junan Zhang,
Xueyao Zhang,
Zihao Fang,
Haorui He,
Zhizheng Wu
Abstract:
The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training dat…
▽ More
The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training data from sample packs and songs on the internet, forming 3000 hours of singing voices in various languages and styles. Furthermore, to facilitate the use and demonstrate the effectiveness of SingNet, we pre-train and open-source various state-of-the-art (SOTA) models on Wav2vec2, BigVGAN, and NSF-HiFiGAN based on our collected singing voice data. We also conduct benchmark experiments on Automatic Lyric Transcription (ALT), Neural Vocoder, and Singing Voice Conversion (SVC). Audio demos are available at: https://singnet-dataset.github.io/.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
DFEN: Dual Feature Equalization Network for Medical Image Segmentation
Authors:
Jianjian Yin,
Yi Chen,
Chengyu Li,
Zhichao Zheng,
Yanhui Gu,
Junsheng Zhou
Abstract:
Current methods for medical image segmentation primarily focus on extracting contextual feature information from the perspective of the whole image. While these methods have shown effective performance, none of them take into account the fact that pixels at the boundary and regions with a low number of class pixels capture more contextual feature information from other classes, leading to misclass…
▽ More
Current methods for medical image segmentation primarily focus on extracting contextual feature information from the perspective of the whole image. While these methods have shown effective performance, none of them take into account the fact that pixels at the boundary and regions with a low number of class pixels capture more contextual feature information from other classes, leading to misclassification of pixels by unequal contextual feature information. In this paper, we propose a dual feature equalization network based on the hybrid architecture of Swin Transformer and Convolutional Neural Network, aiming to augment the pixel feature representations by image-level equalization feature information and class-level equalization feature information. Firstly, the image-level feature equalization module is designed to equalize the contextual information of pixels within the image. Secondly, we aggregate regions of the same class to equalize the pixel feature representations of the corresponding class by class-level feature equalization module. Finally, the pixel feature representations are enhanced by learning weights for image-level equalization feature information and class-level equalization feature information. In addition, Swin Transformer is utilized as both the encoder and decoder, thereby bolstering the ability of the model to capture long-range dependencies and spatial correlations. We conducted extensive experiments on Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC2017), Automated Cardiac Diagnosis Challenge (ACDC) and PH$^2$ datasets. The experimental results demonstrate that our method have achieved state-of-the-art performance. Our code is publicly available at https://github.com/JianJianYin/DFEN.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Authors:
Qianchu Liu,
Sheng Zhang,
Guanghui Qin,
Timothy Ossowski,
Yu Gu,
Ying Jin,
Sid Kiblawi,
Sam Preston,
Mu Wei,
Paul Vozila,
Tristan Naumann,
Hoifung Poon
Abstract:
Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper…
▽ More
Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: General-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves new state of the art on numerous text-only and multimodal medical benchmarks.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Deep Learning Optimization Using Self-Adaptive Weighted Auxiliary Variables
Authors:
Yaru Liu,
Yiqi Gu,
Michael K. Ng
Abstract:
In this paper, we develop a new optimization framework for the least squares learning problem via fully connected neural networks or physics-informed neural networks. The gradient descent sometimes behaves inefficiently in deep learning because of the high non-convexity of loss functions and the vanishing gradient issue. Our idea is to introduce auxiliary variables to separate the layers of the de…
▽ More
In this paper, we develop a new optimization framework for the least squares learning problem via fully connected neural networks or physics-informed neural networks. The gradient descent sometimes behaves inefficiently in deep learning because of the high non-convexity of loss functions and the vanishing gradient issue. Our idea is to introduce auxiliary variables to separate the layers of the deep neural networks and reformulate the loss functions for ease of optimization. We design the self-adaptive weights to preserve the consistency between the reformulated loss and the original mean squared loss, which guarantees that optimizing the new loss helps optimize the original problem. Numerical experiments are presented to verify the consistency and show the effectiveness and robustness of our models over gradient descent.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels
Authors:
Chenyi Cai,
Kosuke Kuriyama,
Youlong Gu,
Filip Biljecki,
Pieter Herthogs
Abstract:
Urban street environments are vital to supporting human activity in public spaces. The emergence of big data, such as street view images (SVIs) combined with multimodal large language models (MLLMs), is transforming how researchers and practitioners investigate, measure, and evaluate semantic and visual elements of urban environments. Considering the low threshold for creating automated evaluative…
▽ More
Urban street environments are vital to supporting human activity in public spaces. The emergence of big data, such as street view images (SVIs) combined with multimodal large language models (MLLMs), is transforming how researchers and practitioners investigate, measure, and evaluate semantic and visual elements of urban environments. Considering the low threshold for creating automated evaluative workflows using MLLMs, it is crucial to explore both the risks and opportunities associated with these probabilistic models. In particular, the extent to which the integration of expert knowledge can influence the performance of MLLMs in evaluating the quality of urban design has not been fully explored. This study sets out an initial exploration of how integrating more formal and structured representations of expert urban design knowledge into the input prompts of an MLLM (ChatGPT-4) can enhance the model's capability and reliability in evaluating the walkability of built environments using SVIs. We collect walkability metrics from the existing literature and categorize them using relevant ontologies. We then select a subset of these metrics, focusing on the subthemes of pedestrian safety and attractiveness, and develop prompts for the MLLM accordingly. We analyze the MLLM's ability to evaluate SVI walkability subthemes through prompts with varying levels of clarity and specificity regarding evaluation criteria. Our experiments demonstrate that MLLMs are capable of providing assessments and interpretations based on general knowledge and can support the automation of multimodal image-text evaluations. However, they generally provide more optimistic scores and can make mistakes when interpreting the provided metrics, resulting in incorrect evaluations. By integrating expert knowledge, the MLLM's evaluative performance exhibits higher consistency and concentration.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation
Authors:
Zhe Dong,
Yuzhe Sun,
Tianzhu Liu,
Wangmeng Zuo,
Yanfeng Gu
Abstract:
Satellite imagery and maps, as two fundamental data modalities in remote sensing, offer direct observations of the Earth's surface and human-interpretable geographic abstractions, respectively. The task of bidirectional translation between satellite images and maps (BSMT) holds significant potential for applications in urban planning and disaster response. However, this task presents two major cha…
▽ More
Satellite imagery and maps, as two fundamental data modalities in remote sensing, offer direct observations of the Earth's surface and human-interpretable geographic abstractions, respectively. The task of bidirectional translation between satellite images and maps (BSMT) holds significant potential for applications in urban planning and disaster response. However, this task presents two major challenges: first, the absence of precise pixel-wise alignment between the two modalities substantially complicates the translation process; second, it requires achieving both high-level abstraction of geographic features and high-quality visual synthesis, which further elevates the technical complexity. To address these limitations, we introduce EarthMapper, a novel autoregressive framework for controllable bidirectional satellite-map translation. EarthMapper employs geographic coordinate embeddings to anchor generation, ensuring region-specific adaptability, and leverages multi-scale feature alignment within a geo-conditioned joint scale autoregression (GJSA) process to unify bidirectional translation in a single training cycle. A semantic infusion (SI) mechanism is introduced to enhance feature-level consistency, while a key point adaptive guidance (KPAG) mechanism is proposed to dynamically balance diversity and precision during inference. We further contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities, enabling robust benchmarking. Extensive experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance, achieving significant improvements in visual realism, semantic consistency, and structural fidelity over state-of-the-art methods. Additionally, EarthMapper excels in zero-shot tasks like in-painting, out-painting and coordinate-conditional generation, underscoring its versatility.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Authors:
Peilin Zhou,
Bruce Leon,
Xiang Ying,
Can Zhang,
Yifan Shao,
Qichen Ye,
Dading Chong,
Zhiling Jin,
Chenxuan Xie,
Meng Cao,
Yuxin Gu,
Sixin Hong,
Jing Ren,
Jian Chen,
Chao Liu,
Yining Hua
Abstract:
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese.…
▽ More
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
△ Less
Submitted 1 May, 2025; v1 submitted 27 April, 2025;
originally announced April 2025.
-
TSUE: A Two-Stage Data Update Method for an Erasure Coded Cluster File System
Authors:
Zheng Wei,
Jing Xing,
Yida Gu,
Wenjing Huang,
Dong Dai,
Guangming Tan,
Dingwen Tao
Abstract:
Compared to replication-based storage systems, erasure-coded storage incurs significantly higher overhead during data updates. To address this issue, various parity logging methods have been pro- posed. Nevertheless, due to the long update path and substantial amount of random I/O involved in erasure code update processes, the resulting long latency and low throughput often fail to meet the requir…
▽ More
Compared to replication-based storage systems, erasure-coded storage incurs significantly higher overhead during data updates. To address this issue, various parity logging methods have been pro- posed. Nevertheless, due to the long update path and substantial amount of random I/O involved in erasure code update processes, the resulting long latency and low throughput often fail to meet the requirements of high performance applications. To this end, we propose a two-stage data update method called TSUE. TSUE divides the update process into a synchronous stage that records updates in a data log, and an asynchronous stage that recycles the log in real-time. TSUE effectively reduces update latency by transforming random I/O into sequential I/O, and it significantly reduces recycle overhead by utilizing a three-layer log and the spatio-temporal locality of access patterns. In SSDs cluster, TSUE significantly im- proves update performance, achieving improvements of 7.6X under Ali-Cloud trace, 5X under Ten-Cloud trace, while it also extends the SSD's lifespan by up to 13X through reducing the frequencies of reads/writes and of erase operations.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Enhancing the Patent Matching Capability of Large Language Models via the Memory Graph
Authors:
Qiushi Xiong,
Zhipeng Xu,
Zhenghao Liu,
Mengjia Wang,
Zulong Chen,
Yue Sun,
Yu Gu,
Xiaohua Li,
Ge Yu
Abstract:
Intellectual Property (IP) management involves strategically protecting and utilizing intellectual assets to enhance organizational innovation, competitiveness, and value creation. Patent matching is a crucial task in intellectual property management, which facilitates the organization and utilization of patents. Existing models often rely on the emergent capabilities of Large Language Models (LLM…
▽ More
Intellectual Property (IP) management involves strategically protecting and utilizing intellectual assets to enhance organizational innovation, competitiveness, and value creation. Patent matching is a crucial task in intellectual property management, which facilitates the organization and utilization of patents. Existing models often rely on the emergent capabilities of Large Language Models (LLMs) and leverage them to identify related patents directly. However, these methods usually depend on matching keywords and overlook the hierarchical classification and categorical relationships of patents. In this paper, we propose MemGraph, a method that augments the patent matching capabilities of LLMs by incorporating a memory graph derived from their parametric memory. Specifically, MemGraph prompts LLMs to traverse their memory to identify relevant entities within patents, followed by attributing these entities to corresponding ontologies. After traversing the memory graph, we utilize extracted entities and ontologies to improve the capability of LLM in comprehending the semantics of patents. Experimental results on the PatentMatch dataset demonstrate the effectiveness of MemGraph, achieving a 17.68% performance improvement over baseline LLMs. The further analysis highlights the generalization ability of MemGraph across various LLMs, both in-domain and out-of-domain, and its capacity to enhance the internal reasoning processes of LLMs during patent matching. All data and codes are available at https://github.com/NEUIR/MemGraph.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
CSI2Dig: Recovering Digit Content from Smartphone Loudspeakers Using Channel State Information
Authors:
Yangyang Gu,
Xianglong Li,
Haolin Wu,
Jing Chen,
Kun He,
Ruiying Du,
Cong Wu
Abstract:
Eavesdropping on sounds emitted by mobile device loudspeakers can capture sensitive digital information, such as SMS verification codes, credit card numbers, and withdrawal passwords, which poses significant security risks. Existing schemes either require expensive specialized equipment, rely on spyware, or are limited to close-range signal acquisition. In this paper, we propose a scheme, CSI2Dig,…
▽ More
Eavesdropping on sounds emitted by mobile device loudspeakers can capture sensitive digital information, such as SMS verification codes, credit card numbers, and withdrawal passwords, which poses significant security risks. Existing schemes either require expensive specialized equipment, rely on spyware, or are limited to close-range signal acquisition. In this paper, we propose a scheme, CSI2Dig, for recovering digit content from Channel State Information (CSI) when digits are played through a smartphone loudspeaker. We observe that the electromagnetic interference caused by the audio signals from the loudspeaker affects the WiFi signals emitted by the phone's WiFi antenna. Building upon contrastive learning and denoising autoencoders, we develop a two-branch autoencoder network designed to amplify the impact of this electromagnetic interference on CSI. For feature extraction, we introduce the TS-Net, a model that captures relevant features from both the temporal and spatial dimensions of the CSI data. We evaluate our scheme across various devices, distances, volumes, and other settings. Experimental results demonstrate that our scheme can achieve an accuracy of 72.97%.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Learning Critically: Selective Self Distillation in Federated Learning on Non-IID Data
Authors:
Yuting He,
Yiqiang Chen,
XiaoDong Yang,
Hanchao Yu,
Yi-Hua Huang,
Yang Gu
Abstract:
Federated learning (FL) enables multiple clients to collaboratively train a global model while keeping local data decentralized. Data heterogeneity (non-IID) across clients has imposed significant challenges to FL, which makes local models re-optimize towards their own local optima and forget the global knowledge, resulting in performance degradation and convergence slowdown. Many existing works h…
▽ More
Federated learning (FL) enables multiple clients to collaboratively train a global model while keeping local data decentralized. Data heterogeneity (non-IID) across clients has imposed significant challenges to FL, which makes local models re-optimize towards their own local optima and forget the global knowledge, resulting in performance degradation and convergence slowdown. Many existing works have attempted to address the non-IID issue by adding an extra global-model-based regularizing item to the local training but without an adaption scheme, which is not efficient enough to achieve high performance with deep learning models. In this paper, we propose a Selective Self-Distillation method for Federated learning (FedSSD), which imposes adaptive constraints on the local updates by self-distilling the global model's knowledge and selectively weighting it by evaluating the credibility at both the class and sample level. The convergence guarantee of FedSSD is theoretically analyzed and extensive experiments are conducted on three public benchmark datasets, which demonstrates that FedSSD achieves better generalization and robustness in fewer communication rounds, compared with other state-of-the-art FL methods.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Streamlining Biomedical Research with Specialized LLMs
Authors:
Linqing Chen,
Weilei Wang,
Yubin Xia,
Wentao Wu,
Peng Xu,
Zilong Bai,
Jie Fang,
Chaobo Xu,
Ran Hu,
Licong Xu,
Haoran Hua,
Jing Sun,
Hanmeng Zhong,
Jin Liu,
Tian Qiu,
Haowen Liu,
Meng Hu,
Xiuwen Li,
Fei Gao,
Yong Gu,
Tao Shi,
Chaochao Wang,
Jianping Lu,
Cheng Sun,
Yixin Wang
, et al. (8 additional authors not shown)
Abstract:
In this paper, we propose a novel system that integrates state-of-the-art, domain-specific large language models with advanced information retrieval techniques to deliver comprehensive and context-aware responses. Our approach facilitates seamless interaction among diverse components, enabling cross-validation of outputs to produce accurate, high-quality responses enriched with relevant data, imag…
▽ More
In this paper, we propose a novel system that integrates state-of-the-art, domain-specific large language models with advanced information retrieval techniques to deliver comprehensive and context-aware responses. Our approach facilitates seamless interaction among diverse components, enabling cross-validation of outputs to produce accurate, high-quality responses enriched with relevant data, images, tables, and other modalities. We demonstrate the system's capability to enhance response precision by leveraging a robust question-answering model, significantly improving the quality of dialogue generation. The system provides an accessible platform for real-time, high-fidelity interactions, allowing users to benefit from efficient human-computer interaction, precise retrieval, and simultaneous access to a wide range of literature and data. This dramatically improves the research efficiency of professionals in the biomedical and pharmaceutical domains and facilitates faster, more informed decision-making throughout the R\&D process. Furthermore, the system proposed in this paper is available at https://synapse-chat.patsnap.com.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
An Efficient and Mixed Heterogeneous Model for Image Restoration
Authors:
Yubin Gu,
Yuan Meng,
Kaihang Zheng,
Xiaoshuai Sun,
Jiayi Ji,
Weijian Ruan,
Liujuan Cao,
Rongrong Ji
Abstract:
Image restoration~(IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms:…
▽ More
Image restoration~(IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. CNNs excel in efficient inference, whereas Transformers and Mamba excel at capturing long-range dependencies and modeling global contexts. While each architecture has demonstrated success in specialized, single-task settings, limited efforts have been made to effectively integrate heterogeneous architectures to jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of the input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at https://github.com/ClimBin/RestorMixer.
△ Less
Submitted 19 April, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
Pillar-Voxel Fusion Network for 3D Object Detection in Airborne Hyperspectral Point Clouds
Authors:
Yanze Jiang,
Yanfeng Gu,
Xian Li
Abstract:
Hyperspectral point clouds (HPCs) can simultaneously characterize 3D spatial and spectral information of ground objects, offering excellent 3D perception and target recognition capabilities. Current approaches for generating HPCs often involve fusion techniques with hyperspectral images and LiDAR point clouds, which inevitably lead to geometric-spectral distortions due to fusion errors and obstacl…
▽ More
Hyperspectral point clouds (HPCs) can simultaneously characterize 3D spatial and spectral information of ground objects, offering excellent 3D perception and target recognition capabilities. Current approaches for generating HPCs often involve fusion techniques with hyperspectral images and LiDAR point clouds, which inevitably lead to geometric-spectral distortions due to fusion errors and obstacle occlusions. These adverse effects limit their performance in downstream fine-grained tasks across multiple scenarios, particularly in airborne applications. To address these issues, we propose PiV-AHPC, a 3D object detection network for airborne HPCs. To the best of our knowledge, this is the first attempt at this HPCs task. Specifically, we first develop a pillar-voxel dual-branch encoder, where the former captures spectral and vertical structural features from HPCs to overcome spectral distortion, while the latter emphasizes extracting accurate 3D spatial features from point clouds. A multi-level feature fusion mechanism is devised to enhance information interaction between the two branches, achieving neighborhood feature alignment and channel-adaptive selection, thereby organically integrating heterogeneous features and mitigating geometric distortion. Extensive experiments on two airborne HPCs datasets demonstrate that PiV-AHPC possesses state-of-the-art detection performance and high generalization capability.
△ Less
Submitted 13 April, 2025;
originally announced April 2025.
-
MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction
Authors:
Ian Noronha,
Advait Prasad Jawaji,
Juan Camilo Soto,
Jiajun An,
Yan Gu,
Upinder Kaur
Abstract:
Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidire…
▽ More
Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this critical area.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion
Authors:
Na Li,
Chuke Wang,
Yu Gu,
Zhifeng Li
Abstract:
Voice conversion (VC) transforms source speech into a target voice by preserving the content. However, timbre information from the source speaker is inherently embedded in the content representations, causing significant timbre leakage and reducing similarity to the target speaker. To address this, we introduce a residual block to a content extractor. The residual block consists of two weighted br…
▽ More
Voice conversion (VC) transforms source speech into a target voice by preserving the content. However, timbre information from the source speaker is inherently embedded in the content representations, causing significant timbre leakage and reducing similarity to the target speaker. To address this, we introduce a residual block to a content extractor. The residual block consists of two weighted branches: 1) universal semantic dictionary based Content Feature Re-expression (CFR) module, supplying timbre-free content representation. 2) skip connection to the original content layer, providing complementary fine-grained information. In the CFR module, each dictionary entry in the universal semantic dictionary represents a phoneme class, computed statistically using speech from multiple speakers, creating a stable, speaker-independent semantic set. We introduce a CFR method to obtain timbre-free content representations by expressing each content frame as a weighted linear combination of dictionary entries using corresponding phoneme posteriors as weights. Extensive experiments across various VC frameworks demonstrate that our approach effectively mitigates timbre leakage and significantly improves similarity to the target speaker.
△ Less
Submitted 29 April, 2025; v1 submitted 11 April, 2025;
originally announced April 2025.
-
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
Authors:
Boyuan Zheng,
Michael Y. Fatemi,
Xiaolong Jin,
Zora Zhiruo Wang,
Apurva Gandhi,
Yueqi Song,
Yu Gu,
Jayanth Srinivasa,
Gaowen Liu,
Graham Neubig,
Yu Su
Abstract:
To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reuseable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural…
▽ More
To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reuseable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, refining skills, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
AVP-AP: Self-supervised Automatic View Positioning in 3D cardiac CT via Atlas Prompting
Authors:
Xiaolin Fan,
Yan Wang,
Yingying Zhang,
Mingkun Bao,
Bosen Jia,
Dong Lu,
Yifan Gu,
Jian Cheng,
Haogang Zhu
Abstract:
Automatic view positioning is crucial for cardiac computed tomography (CT) examinations, including disease diagnosis and surgical planning. However, it is highly challenging due to individual variability and large 3D search space. Existing work needs labor-intensive and time-consuming manual annotations to train view-specific models, which are limited to predicting only a fixed set of planes. Howe…
▽ More
Automatic view positioning is crucial for cardiac computed tomography (CT) examinations, including disease diagnosis and surgical planning. However, it is highly challenging due to individual variability and large 3D search space. Existing work needs labor-intensive and time-consuming manual annotations to train view-specific models, which are limited to predicting only a fixed set of planes. However, in real clinical scenarios, the challenge of positioning semantic 2D slices with any orientation into varying coordinate space in arbitrary 3D volume remains unsolved. We thus introduce a novel framework, AVP-AP, the first to use Atlas Prompting for self-supervised Automatic View Positioning in the 3D CT volume. Specifically, this paper first proposes an atlas prompting method, which generates a 3D canonical atlas and trains a network to map slices into their corresponding positions in the atlas space via a self-supervised manner. Then, guided by atlas prompts corresponding to the given query images in a reference CT, we identify the coarse positions of slices in the target CT volume using rigid transformation between the 3D atlas and target CT volume, effectively reducing the search space. Finally, we refine the coarse positions by maximizing the similarity between the predicted slices and the query images in the feature space of a given foundation model. Our framework is flexible and efficient compared to other methods, outperforming other methods by 19.8% average structural similarity (SSIM) in arbitrary view positioning and achieving 9% SSIM in two-chamber view compared to four radiologists. Meanwhile, experiments on a public dataset validate our framework's generalizability.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
Authors:
Qi Mao,
Lan Chen,
Yuchao Gu,
Mike Zheng Shou,
Ming-Hsuan Yang
Abstract:
Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly…
▽ More
Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
New Algorithms for Incremental Minimum Spanning Trees and Temporal Graph Applications
Authors:
Xiangyun Ding,
Yan Gu,
Yihan Sun
Abstract:
Processing graphs with temporal information (the temporal graphs) has become increasingly important in the real world. In this paper, we study efficient solutions to temporal graph applications using new algorithms for Incremental Minimum Spanning Trees (MST). The first contribution of this work is to formally discuss how a broad set of setting-problem combinations of temporal graph processing can…
▽ More
Processing graphs with temporal information (the temporal graphs) has become increasingly important in the real world. In this paper, we study efficient solutions to temporal graph applications using new algorithms for Incremental Minimum Spanning Trees (MST). The first contribution of this work is to formally discuss how a broad set of setting-problem combinations of temporal graph processing can be solved using incremental MST, along with their theoretical guarantees. Despite the importance of the problem, we observe a gap between theory and practice for efficient incremental MST algorithms. While many classic data structures, such as the link-cut tree, provide strong bounds for incremental MST, their performance is limited in practice. Meanwhile, existing practical solutions used in applications do not have any non-trivial theoretical guarantees. Our second and main contribution includes new algorithms for incremental MST that are efficient both in theory and in practice. Our new data structure, the AM-tree, achieves the same theoretical bound as the link-cut tree for temporal graph processing and shows strong performance in practice. In our experiments, the AM-tree has competitive or better performance than existing practical solutions due to theoretical guarantees, and can be significantly faster than the link-cut tree (7.8-11x in updates and 7.7-13.7x in queries).
△ Less
Submitted 9 May, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
Diff-SSL-G-Comp: Towards a Large-Scale and Diverse Dataset for Virtual Analog Modeling
Authors:
Yicheng Gu,
Runsong Zhang,
Lauri Juvela,
Zhizheng Wu
Abstract:
Virtual Analog (VA) modeling aims to simulate the behavior of hardware circuits via algorithms to replicate their tone digitally. Dynamic Range Compressor (DRC) is an audio processing module that controls the dynamics of a track by reducing and amplifying the volumes of loud and quiet sounds, which is essential in music production. In recent years, neural-network-based VA modeling has shown great…
▽ More
Virtual Analog (VA) modeling aims to simulate the behavior of hardware circuits via algorithms to replicate their tone digitally. Dynamic Range Compressor (DRC) is an audio processing module that controls the dynamics of a track by reducing and amplifying the volumes of loud and quiet sounds, which is essential in music production. In recent years, neural-network-based VA modeling has shown great potential in producing high-fidelity models. However, due to the lack of data quantity and diversity, their generalization ability in different parameter settings and input sounds is still limited. To tackle this problem, we present Diff-SSL-G-Comp, the first large-scale and diverse dataset for modeling the SSL 500 G-Bus Compressor. Specifically, we manually collected 175 unmastered songs from the Cambridge Multitrack Library. We recorded the compressed audio in 220 parameter combinations, resulting in an extensive 2528-hour dataset with diverse genres, instruments, tempos, and keys. Moreover, to facilitate the use of our proposed dataset, we conducted benchmark experiments in various open-sourced black-box and grey-box models, as well as white-box plugins. We also conducted ablation studies in different data subsets to illustrate the effectiveness of improved data diversity and quantity. The dataset and demos are on our project page: http://www.yichenggu.com/DiffSSLGComp/.
△ Less
Submitted 6 April, 2025;
originally announced April 2025.
-
Enhanced ECG Arrhythmia Detection Accuracy by Optimizing Divergence-Based Data Fusion
Authors:
Baozhuo Su,
Qingli Dou,
Kang Liu,
Zhengxian Qu,
Jerry Deng,
Ting Tan,
Yanan Gu
Abstract:
AI computation in healthcare faces significant challenges when clinical datasets are limited and heterogeneous. Integrating datasets from multiple sources and different equipments is critical for effective AI computation but is complicated by their diversity, complexity, and lack of representativeness, so we often need to join multiple datasets for analysis. The currently used method is fusion aft…
▽ More
AI computation in healthcare faces significant challenges when clinical datasets are limited and heterogeneous. Integrating datasets from multiple sources and different equipments is critical for effective AI computation but is complicated by their diversity, complexity, and lack of representativeness, so we often need to join multiple datasets for analysis. The currently used method is fusion after normalization. But when using this method, it can introduce redundant information, decreasing the signal-to-noise ratio and reducing classification accuracy. To tackle this issue, we propose a feature-based fusion algorithm utilizing Kernel Density Estimation (KDE) and Kullback-Leibler (KL) divergence. Our approach involves initially preprocessing and continuous estimation on the extracted features, followed by employing the gradient descent method to identify the optimal linear parameters that minimize the KL divergence between the feature distributions. Using our in-house datasets consisting of ECG signals collected from 2000 healthy and 2000 diseased individuals by different equipments and verifying our method by using the publicly available PTB-XL dataset which contains 21,837 ECG recordings from 18,885 patients. We employ a Light Gradient Boosting Machine (LGBM) model to do the binary classification. The results demonstrate that the proposed fusion method significantly enhances feature-based classification accuracy for abnormal ECG cases in the merged datasets, compared to the normalization method. This data fusion strategy provides a new approach to process heterogeneous datasets for the optimal AI computation results.
△ Less
Submitted 19 March, 2025;
originally announced April 2025.
-
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Authors:
Bang Liu,
Xinfeng Li,
Jiayi Zhang,
Jinlin Wang,
Tanjin He,
Sirui Hong,
Hongzhang Liu,
Shaokun Zhang,
Kaitao Song,
Kunlun Zhu,
Yuheng Cheng,
Suyuchen Wang,
Xiaoqiang Wang,
Yuyu Luo,
Haibo Jin,
Peiyan Zhang,
Ollie Liu,
Jiaqi Chen,
Huan Zhang,
Zhaoyang Yu,
Haochen Shi,
Boyan Li,
Dekun Wu,
Fengwei Teng,
Xiaojun Jia
, et al. (22 additional authors not shown)
Abstract:
The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate…
▽ More
The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired architecture that integrates principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we delve into the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities, and elucidating core components such as memory, world modeling, reward processing, and emotion-like systems. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms, including emerging AutoML and LLM-driven optimization strategies. Third, we examine collaborative and evolutionary multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures, highlighting parallels to human social dynamics. Finally, we address the critical imperative of building safe, secure, and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing
Authors:
Yunqi Gu,
Ian Huang,
Jihyeon Je,
Guandao Yang,
Leonidas Guibas
Abstract:
3D graphics editing is crucial in applications like movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating this process is challenging because graphical editing requires performing a variety of tasks, each requiring distinct skill sets. Recently, vision-language models (VLMs) have emerged as a powerful framework for au…
▽ More
3D graphics editing is crucial in applications like movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating this process is challenging because graphical editing requires performing a variety of tasks, each requiring distinct skill sets. Recently, vision-language models (VLMs) have emerged as a powerful framework for automating the editing process, but their development and evaluation are bottlenecked by the lack of a comprehensive benchmark that requires human-level perception and presents real-world editing complexity. In this work, we present BlenderGym, the first comprehensive VLM system benchmark for 3D graphics editing. BlenderGym evaluates VLM systems through code-based 3D reconstruction tasks. We evaluate closed- and open-source VLM systems and observe that even the state-of-the-art VLM system struggles with tasks relatively easy for human Blender users. Enabled by BlenderGym, we study how inference scaling techniques impact VLM's performance on graphics editing tasks. Notably, our findings reveal that the verifier used to guide the scaling of generation can itself be improved through inference scaling, complementing recent insights on inference scaling of LLM generation in coding and math tasks. We further show that inference compute is not uniformly effective and can be optimized by strategically distributing it between generation and verification.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Chinese Grammatical Error Correction: A Survey
Authors:
Mengyang Qiu,
Qingyu Gao,
Linxuan Yang,
Yang Gu,
Tran Minh Nguyen,
Zihao Huang,
Jungyeul Park
Abstract:
Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is…
▽ More
Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
A Survey on Unlearnable Data
Authors:
Jiahao Li,
Yiqiang Chen,
Yunbing Xing,
Yang Gu,
Xiangyuan Lan
Abstract:
Unlearnable data (ULD) has emerged as an innovative defense technique to prevent machine learning models from learning meaningful patterns from specific data, thus protecting data privacy and security. By introducing perturbations to the training data, ULD degrades model performance, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of…
▽ More
Unlearnable data (ULD) has emerged as an innovative defense technique to prevent machine learning models from learning meaningful patterns from specific data, thus protecting data privacy and security. By introducing perturbations to the training data, ULD degrades model performance, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of ULD, existing surveys predominantly focus on related fields, such as adversarial attacks and machine unlearning, with little attention given to ULD as an independent area of study. This survey fills that gap by offering a comprehensive review of ULD, examining unlearnable data generation methods, public benchmarks, evaluation metrics, theoretical foundations and practical applications. We compare and contrast different ULD approaches, analyzing their strengths, limitations, and trade-offs related to unlearnability, imperceptibility, efficiency and robustness. Moreover, we discuss key challenges, such as balancing perturbation imperceptibility with model degradation and the computational complexity of ULD generation. Finally, we highlight promising future research directions to advance the effectiveness and applicability of ULD, underscoring its potential to become a crucial tool in the evolving landscape of data protection in machine learning.
△ Less
Submitted 1 April, 2025; v1 submitted 30 March, 2025;
originally announced March 2025.
-
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Authors:
Junyu Luo,
Weizhi Zhang,
Ye Yuan,
Yusheng Zhao,
Junwei Yang,
Yiyang Gu,
Bohan Wu,
Binqi Chen,
Ziyue Qiao,
Qingqing Long,
Rongcheng Tu,
Xiao Luo,
Wei Ju,
Zhiping Xiao,
Yifan Wang,
Meng Xiao,
Chenwu Liu,
Jingyang Yuan,
Shichang Zhang,
Yiqiao Jin,
Fan Zhang,
Xian Wu,
Hanqing Zhao,
Dacheng Tao,
Philip S. Yu
, et al. (1 additional authors not shown)
Abstract:
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architec…
▽ More
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Robust Flower Cluster Matching Using The Unscented Transform
Authors:
Andy Chu,
Rashik Shrestha,
Yu Gu,
Jason N. Gross
Abstract:
Monitoring flowers over time is essential for precision robotic pollination in agriculture. To accomplish this, a continuous spatial-temporal observation of plant growth can be done using stationary RGB-D cameras. However, image registration becomes a serious challenge due to changes in the visual appearance of the plant caused by the pollination process and occlusions from growth and camera angle…
▽ More
Monitoring flowers over time is essential for precision robotic pollination in agriculture. To accomplish this, a continuous spatial-temporal observation of plant growth can be done using stationary RGB-D cameras. However, image registration becomes a serious challenge due to changes in the visual appearance of the plant caused by the pollination process and occlusions from growth and camera angles. Plants flower in a manner that produces distinct clusters on branches. This paper presents a method for matching flower clusters using descriptors generated from RGB-D data and considers allowing for spatial uncertainty within the cluster. The proposed approach leverages the Unscented Transform to efficiently estimate plant descriptor uncertainty tolerances, enabling a robust image-registration process despite temporal changes. The Unscented Transform is used to handle the nonlinear transformations by propagating the uncertainty of flower positions to determine the variations in the descriptor domain. A Monte Carlo simulation is used to validate the Unscented Transform results, confirming our method's effectiveness for flower cluster matching. Therefore, it can facilitate improved robotics pollination in dynamic environments.
△ Less
Submitted 26 March, 2025;
originally announced March 2025.
-
M$^2$CD: A Unified MultiModal Framework for Optical-SAR Change Detection with Mixture of Experts and Self-Distillation
Authors:
Ziyuan Liu,
Jiawei Zhang,
Wenyu Wang,
Yuantao Gu
Abstract:
Most existing change detection (CD) methods focus on optical images captured at different times, and deep learning (DL) has achieved remarkable success in this domain. However, in extreme scenarios such as disaster response, synthetic aperture radar (SAR), with its active imaging capability, is more suitable for providing post-event data. This introduces new challenges for CD methods, as existing…
▽ More
Most existing change detection (CD) methods focus on optical images captured at different times, and deep learning (DL) has achieved remarkable success in this domain. However, in extreme scenarios such as disaster response, synthetic aperture radar (SAR), with its active imaging capability, is more suitable for providing post-event data. This introduces new challenges for CD methods, as existing weight-sharing Siamese networks struggle to effectively learn the cross-modal data distribution between optical and SAR images. To address this challenge, we propose a unified MultiModal CD framework, M$^2$CD. We integrate Mixture of Experts (MoE) modules into the backbone to explicitly handle diverse modalities, thereby enhancing the model's ability to learn multimodal data distributions. Additionally, we innovatively propose an Optical-to-SAR guided path (O2SP) and implement self-distillation during training to reduce the feature space discrepancy between different modalities, further alleviating the model's learning burden. We design multiple variants of M$^2$CD based on both CNN and Transformer backbones. Extensive experiments validate the effectiveness of the proposed framework, with the MiT-b1 version of M$^2$CD outperforming all state-of-the-art (SOTA) methods in optical-SAR CD tasks.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Authors:
Yuchao Gu,
Weijia Mao,
Mike Zheng Shou
Abstract:
Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models t…
▽ More
Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context video modeling faces challenges due to visual redundancy. Training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle this issue, we propose balancing locality and long-range dependency through long short-term context modeling. A high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length, thereby significantly reducing training time and memory usage. Furthermore, we propose a multi-level KV cache designed to support the long short-term context modeling, which accelerating inference on long video sequences. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling. The code is released at https://github.com/showlab/FAR.
△ Less
Submitted 17 April, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection
Authors:
Wenxi Chen,
Raymond A. Yeh,
Shaoshuai Mou,
Yan Gu
Abstract:
Out-of-distribution (OOD) detection is the task of identifying inputs that deviate from the training data distribution. This capability is essential for safely deploying deep computer vision models in open-world environments. In this work, we propose a post-hoc method, Perturbation-Rectified OOD detection (PRO), based on the insight that prediction confidence for OOD inputs is more susceptible to…
▽ More
Out-of-distribution (OOD) detection is the task of identifying inputs that deviate from the training data distribution. This capability is essential for safely deploying deep computer vision models in open-world environments. In this work, we propose a post-hoc method, Perturbation-Rectified OOD detection (PRO), based on the insight that prediction confidence for OOD inputs is more susceptible to reduction under perturbation than in-distribution (IND) inputs. Based on the observation, we propose an adversarial score function that searches for the local minimum scores near the original inputs by applying gradient descent. This procedure enhances the separability between IND and OOD samples. Importantly, the approach improves OOD detection performance without complex modifications to the underlying model architectures. We conduct extensive experiments using the OpenOOD benchmark~\cite{yang2022openood}. Our approach further pushes the limit of softmax-based OOD detection and is the leading post-hoc method for small-scale models. On a CIFAR-10 model with adversarial training, PRO effectively detects near-OOD inputs, achieving a reduction of more than 10\% on FPR@95 compared to state-of-the-art methods.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Histomorphology-driven multi-instance learning for breast cancer WSI classification
Authors:
Baizhi Wang,
Rui Yan,
Wenxin Ma,
Xu Zhang,
Yuhao Wang,
Xiaolong Li,
Yunjie Gu,
Zihang Jiang,
S. Kevin Zhou
Abstract:
Histomorphology is crucial in breast cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key and fine-grained pathological features. To address this limitation, we propose a novel framework that explicitly incorporates histomorphology (tumor cellularity, cellular morpho…
▽ More
Histomorphology is crucial in breast cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key and fine-grained pathological features. To address this limitation, we propose a novel framework that explicitly incorporates histomorphology (tumor cellularity, cellular morphology, and tissue architecture) into WSI classification. Specifically, our approach consists of three key components: (1) estimating the importance of tumor-related histomorphology information at the patch level based on medical prior knowledge; (2) generating representative cluster-level features through histomorphology-driven cluster pooling; and (3) enabling WSI-level classification through histomorphology-driven multi-instance aggregation. With the incorporation of histomorphological information, our framework strengthens the model's ability to capture key and fine-grained pathological patterns, thereby enhancing WSI classification performance. Experimental results demonstrate its effectiveness, achieving high diagnostic accuracy for molecular subtyping and cancer subtyping. The code will be made available at https://github.com/Badgewho/HMDMIL.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
Authors:
Yuming Gu,
Phong Tran,
Yujian Zheng,
Hongyi Xu,
Heyuan Li,
Adilbek Karmanov,
Hao Li
Abstract:
Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views…
▽ More
Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Authors:
Lan Chen,
Qi Mao,
Yuchao Gu,
Mike Zheng Shou
Abstract:
We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style…
▽ More
We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
Consistent-Point: Consistent Pseudo-Points for Semi-Supervised Crowd Counting and Localization
Authors:
Yuda Zou,
Zelong Liu,
Yuliang Gu,
Bo Du,
Yongchao Xu
Abstract:
Crowd counting and localization are important in applications such as public security and traffic management. Existing methods have achieved impressive results thanks to extensive laborious annotations. This paper propose a novel point-localization-based semi-supervised crowd counting and localization method termed Consistent-Point. We identify and address two inconsistencies of pseudo-points, whi…
▽ More
Crowd counting and localization are important in applications such as public security and traffic management. Existing methods have achieved impressive results thanks to extensive laborious annotations. This paper propose a novel point-localization-based semi-supervised crowd counting and localization method termed Consistent-Point. We identify and address two inconsistencies of pseudo-points, which have not been adequately explored. To enhance their position consistency, we aggregate the positions of neighboring auxiliary proposal-points. Additionally, an instance-wise uncertainty calibration is proposed to improve the class consistency of pseudo-points. By generating more consistent pseudo-points, Consistent-Point provides more stable supervision to the training process, yielding improved results. Extensive experiments across five widely used datasets and three different labeled ratio settings demonstrate that our method achieves state-of-the-art performance in crowd localization while also attaining impressive crowd counting results. The code will be available.
△ Less
Submitted 16 March, 2025;
originally announced March 2025.
-
FloPE: Flower Pose Estimation for Precision Pollination
Authors:
Rashik Shrestha,
Madhav Rijal,
Trevor Smith,
Yu Gu
Abstract:
This study presents Flower Pose Estimation (FloPE), a real-time flower pose estimation framework for computationally constrained robotic pollination systems. Robotic pollination has been proposed to supplement natural pollination to ensure global food security due to the decreased population of natural pollinators. However, flower pose estimation for pollination is challenging due to natural varia…
▽ More
This study presents Flower Pose Estimation (FloPE), a real-time flower pose estimation framework for computationally constrained robotic pollination systems. Robotic pollination has been proposed to supplement natural pollination to ensure global food security due to the decreased population of natural pollinators. However, flower pose estimation for pollination is challenging due to natural variability, flower clusters, and high accuracy demands due to the flowers' fragility when pollinating. This method leverages 3D Gaussian Splatting to generate photorealistic synthetic datasets with precise pose annotations, enabling effective knowledge distillation from a high-capacity teacher model to a lightweight student model for efficient inference. The approach was evaluated on both single and multi-arm robotic platforms, achieving a mean pose estimation error of 0.6 cm and 19.14 degrees within a low computational cost. Our experiments validate the effectiveness of FloPE, achieving up to 78.75% pollination success rate and outperforming prior robotic pollination techniques.
△ Less
Submitted 8 March, 2025;
originally announced March 2025.
-
Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset
Authors:
Yibing Weng,
Yu Gu,
Fuji Ren
Abstract:
Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then enga…
▽ More
Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before drivers' anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks like antecedent-focused road rage regulation.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation
Authors:
Jiawei Zhang,
Ziyuan Liu,
Leon Yan,
Gen Li,
Yuantao Gu
Abstract:
Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, with performance degradation occurring under fewer steps which limits their practical applicability. While hi…
▽ More
Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, with performance degradation occurring under fewer steps which limits their practical applicability. While high-order diffusion ODE solvers have been extensively explored for efficient diffusion sampling without observations, their application to inverse problems remains underexplored due to the diverse forms of inverse algorithms and their need for repeated trajectory correction based on observations. To address this gap, we first introduce a canonical form that decomposes existing diffusion-based inverse algorithms into three modules to unify their analysis. Inspired by the linear subspace search strategy in the design of high-order diffusion ODE solvers, we propose the Learnable Linear Extrapolation (LLE) method, a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm that fits the proposed canonical form. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
△ Less
Submitted 16 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Sequential Multi-Object Grasping with One Dexterous Hand
Authors:
Sicheng He,
Zeyu Shangguan,
Kuanning Wang,
Yongchong Gu,
Yuqian Fu,
Yanwei Fu,
Daniel Seita
Abstract:
Sequentially grasping multiple objects with multi-fingered hands is common in daily life, where humans can fully leverage the dexterity of their hands to enclose multiple objects. However, the diversity of object geometries and the complex contact interactions required for high-DOF hands to grasp one object while enclosing another make sequential multi-object grasping challenging for robots. In th…
▽ More
Sequentially grasping multiple objects with multi-fingered hands is common in daily life, where humans can fully leverage the dexterity of their hands to enclose multiple objects. However, the diversity of object geometries and the complex contact interactions required for high-DOF hands to grasp one object while enclosing another make sequential multi-object grasping challenging for robots. In this paper, we propose SeqMultiGrasp, a system for sequentially grasping objects with a four-fingered Allegro Hand. We focus on sequentially grasping two objects, ensuring that the hand fully encloses one object before lifting it and then grasps the second object without dropping the first. Our system first synthesizes single-object grasp candidates, where each grasp is constrained to use only a subset of the hand's links. These grasps are then validated in a physics simulator to ensure stability and feasibility. Next, we merge the validated single-object grasp poses to construct multi-object grasp configurations. For real-world deployment, we train a diffusion model conditioned on point clouds to propose grasp poses, followed by a heuristic-based execution strategy. We test our system using $8 \times 8$ object combinations in simulation and $6 \times 3$ object combinations in real. Our diffusion-based grasp model obtains an average success rate of 65.8% over 1600 simulation trials and 56.7% over 90 real-world trials, suggesting that it is a promising approach for sequential multi-object grasping with multi-fingered hands. Supplementary material is available on our project website: https://hesic73.github.io/SeqMultiGrasp.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
Force Aware Branch Manipulation To Assist Agricultural Tasks
Authors:
Madhav Rijal,
Rashik Shrestha,
Trevor Smith,
Yu Gu
Abstract:
This study presents a methodology to safely manipulate branches to aid various agricultural tasks. Humans in a real agricultural environment often manipulate branches to perform agricultural tasks effectively, but current agricultural robots lack this capability. This proposed strategy to manipulate branches can aid in different precision agriculture tasks, such as fruit picking in dense foliage,…
▽ More
This study presents a methodology to safely manipulate branches to aid various agricultural tasks. Humans in a real agricultural environment often manipulate branches to perform agricultural tasks effectively, but current agricultural robots lack this capability. This proposed strategy to manipulate branches can aid in different precision agriculture tasks, such as fruit picking in dense foliage, pollinating flowers under occlusion, and moving overhanging vines and branches for navigation. The proposed method modifies RRT* to plan a path that satisfies the branch geometric constraints and obeys branch deformable characteristics. Re-planning is done to obtain a path that helps the robot exert force within a desired range so that branches are not damaged during manipulation. Experimentally, this method achieved a success rate of 78% across 50 trials, successfully moving a branch from different starting points to a target region.
△ Less
Submitted 11 March, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
Frequency-Based Alignment of EEG and Audio Signals Using Contrastive Learning and SincNet for Auditory Attention Detection
Authors:
Yuan Liao,
Yuhong Zhang,
Qiushi Han,
Yuhang Yang,
Weiwei Ding,
Yuzhe Gu,
Hengxin Yang,
Liya Huang
Abstract:
Humans exhibit a remarkable ability to focus auditory attention in complex acoustic environments, such as cocktail parties. Auditory attention detection (AAD) aims to identify the attended speaker by analyzing brain signals, such as electroencephalography (EEG) data. Existing AAD algorithms often leverage deep learning's powerful nonlinear modeling capabilities, few consider the neural mechanisms…
▽ More
Humans exhibit a remarkable ability to focus auditory attention in complex acoustic environments, such as cocktail parties. Auditory attention detection (AAD) aims to identify the attended speaker by analyzing brain signals, such as electroencephalography (EEG) data. Existing AAD algorithms often leverage deep learning's powerful nonlinear modeling capabilities, few consider the neural mechanisms underlying auditory processing in the brain. In this paper, we propose SincAlignNet, a novel network based on an improved SincNet and contrastive learning, designed to align audio and EEG features for auditory attention detection. The SincNet component simulates the brain's processing of audio during auditory attention, while contrastive learning guides the model to learn the relationship between EEG signals and attended speech. During inference, we calculate the cosine similarity between EEG and audio features and also explore direct inference of the attended speaker using EEG data. Cross-trial evaluations results demonstrate that SincAlignNet outperforms state-of-the-art AAD methods on two publicly available datasets, KUL and DTU, achieving average accuracies of 78.3% and 92.2%, respectively, with a 1-second decision window. The model exhibits strong interpretability, revealing that the left and right temporal lobes are more active during both male and female speaker scenarios. Furthermore, we found that using data from only six electrodes near the temporal lobes maintains similar or even better performance compared to using 64 electrodes. These findings indicate that efficient low-density EEG online decoding is achievable, marking an important step toward the practical implementation of neuro-guided hearing aids in real-world applications. Code is available at: https://github.com/LiaoEuan/SincAlignNet.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
Authors:
Wenqiao Li,
Yao Gu,
Xintao Chen,
Xiaohao Xu,
Ming Hu,
Xiaonan Huang,
Yingna Wu
Abstract:
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where…
▽ More
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our project is available at https://guyao2023.github.io/Phys-AD/.
△ Less
Submitted 25 March, 2025; v1 submitted 5 March, 2025;
originally announced March 2025.
-
Interactive Segmentation and Report Generation for CT Images
Authors:
Yannian Gu,
Wenhui Lei,
Hanyu Chen,
Xiaofan Zhang,
Shaoting Zhang
Abstract:
Automated CT report generation plays a crucial role in improving diagnostic accuracy and clinical workflow efficiency. However, existing methods lack interpretability and impede patient-clinician understanding, while their static nature restricts radiologists from dynamically adjusting assessments during image review. Inspired by interactive segmentation techniques, we propose a novel interactive…
▽ More
Automated CT report generation plays a crucial role in improving diagnostic accuracy and clinical workflow efficiency. However, existing methods lack interpretability and impede patient-clinician understanding, while their static nature restricts radiologists from dynamically adjusting assessments during image review. Inspired by interactive segmentation techniques, we propose a novel interactive framework for 3D lesion morphology reporting that seamlessly generates segmentation masks with comprehensive attribute descriptions, enabling clinicians to generate detailed lesion profiles for enhanced diagnostic assessment. To our best knowledge, we are the first to integrate the interactive segmentation and structured reports in 3D CT medical images. Experimental results across 15 lesion types demonstrate the effectiveness of our approach in providing a more comprehensive and reliable reporting system for lesion segmentation and capturing. The source code will be made publicly available following paper acceptance.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
Authors:
Yuzhe Gu,
Wenwei Zhang,
Chengqi Lyu,
Dahua Lin,
Kai Chen
Abstract:
Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always come with truthful content in the LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduced noises during training. Therefore, this paper proposes a fine-grain…
▽ More
Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always come with truthful content in the LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduced noises during training. Therefore, this paper proposes a fine-grained factuality alignment method based on Direct Preference Optimization (DPO), called Mask-DPO. Incorporating sentence-level factuality as mask signals, Mask-DPO only learns from factually correct sentences in the preferred samples and prevents the penalty on factual contents in the not preferred samples, which resolves the ambiguity in the preference learning. Extensive experimental results demonstrate that Mask-DPO can significantly improve the factuality of LLMs responses to questions from both in-domain and out-of-domain datasets, although these questions and their corresponding topics are unseen during training. Only trained on the ANAH train set, the score of Llama3.1-8B-Instruct on the ANAH test set is improved from 49.19% to 77.53%, even surpassing the score of Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset is also improved from 30.29% to 39.39%. We further study the generalization property of Mask-DPO using different training sample scaling strategies and find that scaling the number of topics in the dataset is more effective than the number of questions. We provide a hypothesis of what factual alignment is doing with LLMs, on the implication of this phenomenon, and conduct proof-of-concept experiments to verify it. We hope the method and the findings pave the way for future research on scaling factuality alignment.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Quantifying Point Contributions: A Lightweight Framework for Efficient and Effective Query-Driven Trajectory Simplification
Authors:
Yumeng Song,
Yu Gu,
Tianyi Li,
Yushuai Li,
Christian S. Jensen,
Ge Yu
Abstract:
As large volumes of trajectory data accumulate, simplifying trajectories to reduce storage and querying costs is increasingly studied. Existing proposals face three main problems. First, they require numerous iterations to decide which GPS points to delete. Second, they focus only on the relationships between neighboring points (local information) while neglecting the overall structure (global inf…
▽ More
As large volumes of trajectory data accumulate, simplifying trajectories to reduce storage and querying costs is increasingly studied. Existing proposals face three main problems. First, they require numerous iterations to decide which GPS points to delete. Second, they focus only on the relationships between neighboring points (local information) while neglecting the overall structure (global information), reducing the global similarity between the simplified and original trajectories and making it difficult to maintain consistency in query results, especially for similarity-based queries. Finally, they fail to differentiate the importance of points with similar features, leading to suboptimal selection of points to retain the original trajectory information.
We propose MLSimp, a novel Mutual Learning query-driven trajectory simplification framework that integrates two distinct models: GNN-TS, based on graph neural networks, and Diff-TS, based on diffusion models. GNN-TS evaluates the importance of a point according to its globality, capturing its correlation with the entire trajectory, and its uniqueness, capturing its differences from neighboring points. It also incorporates attention mechanisms in the GNN layers, enabling simultaneous data integration from all points within the same trajectory and refining representations, thus avoiding iterative processes. Diff-TS generates amplified signals to enable the retention of the most important points at low compression rates. Experiments involving eight baselines on three databases show that MLSimp reduces the simplification time by 42%--70% and improves query accuracy over simplified trajectories by up to 34.6%.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
Authors:
Zekun Zhou,
Xiaocheng Feng,
Lei Huang,
Xiachong Feng,
Ziyun Song,
Ruihan Chen,
Liang Zhao,
Weitao Ma,
Yuxuan Gu,
Baoxin Wang,
Dayong Wu,
Guoping Hu,
Ting Liu,
Bing Qin
Abstract:
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in t…
▽ More
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Scalable and Accurate Application-Level Crash-Consistency Testing via Representative Testing
Authors:
Yile Gu,
Ian Neal,
Jiexiao Xu,
Shaun Christopher Lee,
Ayman Said,
Musa Haydar,
Jacob Van Geffen,
Rohan Kadekodi,
Andrew Quinn,
Baris Kasikci
Abstract:
Crash consistency is essential for applications that must persist data. Crash-consistency testing has been commonly applied to find crash-consistency bugs in applications. The crash-state space grows exponentially as the number of operations in the program increases, necessitating techniques for pruning the search space. However, state-of-the-art crash-state space pruning is far from ideal. Some t…
▽ More
Crash consistency is essential for applications that must persist data. Crash-consistency testing has been commonly applied to find crash-consistency bugs in applications. The crash-state space grows exponentially as the number of operations in the program increases, necessitating techniques for pruning the search space. However, state-of-the-art crash-state space pruning is far from ideal. Some techniques look for known buggy patterns or bound the exploration for efficiency, but they sacrifice coverage and may miss bugs lodged deep within applications. Other techniques eliminate redundancy in the search space by skipping identical crash states, but they still fail to scale to larger applications.
In this work, we propose representative testing: a new crash-state space reduction strategy that achieves high scalability and high coverage. Our key observation is that the consistency of crash states is often correlated, even if those crash states are not identical. We build Pathfinder, a crash-consistency testing tool that implements an update behaviors-based heuristic to approximate a small set of representative crash states.
We evaluate Pathfinder on POSIX-based and MMIO-based applications, where it finds 18 (7 new) bugs across 8 production-ready systems. Pathfinder scales more effectively to large applications than prior works and finds 4x more bugs in POSIX-based applications and 8x more bugs in MMIO-based applications compared to state-of-the-art systems.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Evolution of Information in Interactive Decision Making: A Case Study for Multi-Armed Bandits
Authors:
Yuzhou Gu,
Yanjun Han,
Jian Qian
Abstract:
We study the evolution of information in interactive decision making through the lens of a stochastic multi-armed bandit problem. Focusing on a fundamental example where a unique optimal arm outperforms the rest by a fixed margin, we characterize the optimal success probability and mutual information over time. Our findings reveal distinct growth phases in mutual information -- initially linear, t…
▽ More
We study the evolution of information in interactive decision making through the lens of a stochastic multi-armed bandit problem. Focusing on a fundamental example where a unique optimal arm outperforms the rest by a fixed margin, we characterize the optimal success probability and mutual information over time. Our findings reveal distinct growth phases in mutual information -- initially linear, transitioning to quadratic, and finally returning to linear -- highlighting curious behavioral differences between interactive and non-interactive environments. In particular, we show that optimal success probability and mutual information can be decoupled, where achieving optimal learning does not necessarily require maximizing information gain. These findings shed new light on the intricate interplay between information and learning in interactive decision making.
△ Less
Submitted 28 February, 2025;
originally announced March 2025.
-
A novel boundary integrated neural networks for in plane fracture mechanics analysis of elastic and piezoelectric materials
Authors:
Peijun Zhang,
Yan Gu,
Okyay Altay,
Chuanzeng Zhang
Abstract:
In this study, we propose a novel approach, termed boundary integrated neural networks (BINNs), for analyzing in-plane crack problems within the framework of linear elastic fracture mechanics. The proposed approach integrates artificial neural networks (ANNs) with classical boundary integral equations (BIEs), enabling an efficient and accurate evaluation of partial differential equations (PDEs) as…
▽ More
In this study, we propose a novel approach, termed boundary integrated neural networks (BINNs), for analyzing in-plane crack problems within the framework of linear elastic fracture mechanics. The proposed approach integrates artificial neural networks (ANNs) with classical boundary integral equations (BIEs), enabling an efficient and accurate evaluation of partial differential equations (PDEs) associated with fracture mechanics. Additionally, novel special ANN-based crack-tip elements, the Special crack-tip Neural Networks(SPNNs) are developed to improve the modeling of displacement and stress fields in regions near crack tips. These specialized elements integrate the asymptotic characteristics of fracture mechanics into the neural network framework, ensuring enhanced accuracy in capturing the intricate singularities and steep gradients near the crack tips. Compared to conventional simulation tools in fracture mechanics, the present method offers several distinct advantages. First, by embedding higher-order fracture mechanics principles into the neural networks, the method achieves a more accurate and reliable representation of the near-tip fields, even when using relatively large crack-tip elements. Second, the SPNNs, which incorporates information about varying near-tip singularity orders, improve the method's versatility in solving problems involving complex and diverse crack-tips geometries. Moreover, the method demonstrates excellent computational efficiency due to the dimensionality reduction achieved by employing BIEs. Numerical experiments confirm that the proposed framework serves as a reliable, robust, and accurate tool for addressing fracture mechanics problems, offering substantial advantages over conventional numerical approaches.
△ Less
Submitted 28 February, 2025;
originally announced February 2025.