Search | arXiv e-print repository

ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo

Authors: Yuxi Hu, Jun Zhang, Zhe Zhang, Rafael Weilharter, Yuchen Rao, Kuangyi Chen, Runze Yuan, Friedrich Fraundorfer

Abstract: Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.

Submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.21057 [pdf, other]

Validation and Calibration of Energy Models with Real Vehicle Data from Chassis Dynamometer Experiments

Authors: Joy Carpio, Sulaiman Almatrudi, Nour Khoudari, Zhe Fu, Kenneth Butts, Jonathan Lee, Benjamin Seibold, Alexandre Bayen

Abstract: Accurate estimation of vehicle fuel consumption typically requires detailed modeling of complex internal powertrain dynamics, often resulting in computationally intensive simulations. However, many transportation applications, such as traffic flow modeling, optimization, and control, require simplified models that are fast, interpretable, and easy to implement, while still maintaining fidelity to physical energy behavior. This work builds upon a recently developed model reduction pipeline that derives physics-like energy models from high-fidelity Autonomie vehicle simulations. These reduced models preserve essential vehicle dynamics, enabling realistic fuel consumption estimation with minimal computational overhead. While the reduced models have demonstrated strong agreement with their Autonomie counterparts, previous validation efforts have been confined to simulation environments. This study extends the validation by comparing the reduced energy model's outputs against real-world vehicle data. Focusing on the MidSUV category, we tune the baseline Autonomie model to closely replicate the characteristics of a Toyota RAV4. We then assess the accuracy of the resulting reduced model in estimating fuel consumption under actual drive conditions. Our findings suggest that, when the reference Autonomie model is properly calibrated, the simplified model produced by the reduction pipeline can provide reliable, semi-principled fuel rate estimates suitable for large-scale transportation applications.

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.20377 [pdf, other]

UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture

Authors: Heng Liao, Bingyang Liu, Xianping Chen, Zhigang Guo, Chuanning Cheng, Jianbing Wang, Xiangyu Chen, Peng Dong, Rui Meng, Wenjie Liu, Zhe Zhou, Ziyang Zhang, Yuhang Gai, Cunle Qian, Yi Xiong, Zhongwu Cheng, Jing Xia, Yuli Ma, Xi Chen, Wenhua Du, Shizhong Xiao, Chungang Li, Yong Qin, Liudong Xiong, Zhou Yu, et al. (9 additional authors not shown)

Abstract: As Large-scale Language Models (LLMs) continue to scale, the requisite computational power and bandwidth escalate. To address this, we introduce UB-Mesh, a novel AI datacenter network architecture designed to enhance scalability, performance, cost-efficiency and availability. Unlike traditional datacenters that provide symmetrical node-to-node bandwidth, UB-Mesh employs a hierarchically localized nD-FullMesh network topology. This design fully leverages the data locality of LLM training, prioritizing short-range, direct interconnects to minimize data movement distance and reduce switch usage. Although UB-Mesh's nD-FullMesh topology offers several theoretical advantages, its concrete architecture design, physical implementation and networking system optimization present new challenges. For the actual construction of UB-Mesh, we first design the UB-Mesh-Pod architecture, which is based on a 4D-FullMesh topology. UB-Mesh-Pod is implemented via a suite of hardware components that serve as the foundational building blocks, including specifically-designed NPU, CPU, Low-Radix-Switch (LRS), High-Radix-Switch (HRS), NICs and others. These components are interconnected via a novel Unified Bus (UB) technique, which enables flexible IO bandwidth allocation and hardware resource pooling. For networking system optimization, we propose an advanced routing mechanism named All-Path-Routing (APR) to efficiently manage data traffic. These optimizations, combined with topology-aware performance enhancements and robust reliability measures like 64+1 backup design, result in 2.04x higher cost-efficiency and 7.2% higher network availability compared to traditional Clos architecture, and 95%+ linearity in various LLM training tasks.

Submitted 17 May, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.20265 [pdf, other]

Fixseeker: An Empirical Driven Graph-based Approach for Detecting Silent Vulnerability Fixes in Open Source Software

Authors: Yiran Cheng, Ting Zhang, Lwin Khin Shar, Zhe Lang, David Lo, Shichao Lv, Dongliang Fang, Zhiqiang Shi, Limin Sun

Abstract: Open source software vulnerabilities pose significant security risks to downstream applications. While vulnerability databases provide valuable information for mitigation, many security patches are released silently in new commits of OSS repositories without explicit indications of their security impact. This makes it challenging for software maintainers and users to detect and address these vulnerability fixes. There are a few approaches for detecting vulnerability-fixing commits (VFCs) but most of these approaches leverage commit messages, which would miss silent VFCs. On the other hand, there are some approaches for detecting silent VFCs based on code change patterns but they often fail to adequately characterize vulnerability fix patterns, thereby lacking effectiveness. For example, some approaches analyze each hunk in known VFCs, in isolation, to learn vulnerability fix patterns; but vulnerability fixes are often associated with multiple hunks, in which case correlations of code changes across those hunks are essential for characterizing the vulnerability fixes. To address these problems, we first conduct a large-scale empirical study on 11,900 VFCs across six programming languages, in which we found that over 70% of VFCs involve multiple hunks with various types of correlations. Based on our findings, we propose Fixseeker, a graph-based approach that extracts the various correlations between code changes at the hunk level to detect silent vulnerability fixes. Our evaluation demonstrates that Fixseeker outperforms state-of-the-art approaches across multiple programming languages, achieving a high F1 score of 0.8404 on average in balanced datasets and consistently improving F1 score, AUC-ROC and AUC-PR scores by 32.40%, 1.55% and 8.24%, respectively, on imbalanced datasets. Our evaluation also indicates the generality of Fixseeker across different repository sizes and commit complexities.

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.20117 [pdf, ps, other]

Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable

Authors: Bicheng Ying, Zhe Li, Haibo Yang

Abstract: This work tackles the fundamental challenges in Federated Learning (FL) posed by arbitrary client participation and data heterogeneity, prevalent characteristics in practical FL settings. It is well-established that popular FedAvg-style algorithms struggle with exact convergence and can suffer from slow convergence rates since a decaying learning rate is required to mitigate these scenarios. To address these issues, we introduce the concept of a stochastic matrix and the corresponding time-varying graphs as a novel modeling tool to accurately capture the dynamics of arbitrary client participation and the local update procedure. Leveraging this approach, we offer a fresh perspective on designing FL algorithms, provide a rigorous quantitative analysis of the limitations inherent in the FedAvg algorithm, and present FOCUS, Federated Optimization with Exact Convergence via Push-pull Strategy, a provably convergent algorithm designed to effectively overcome the two previously mentioned challenges. More specifically, we provide a rigorous proof demonstrating that FOCUS achieves exact convergence with a linear rate regardless of the arbitrary client participation, establishing it as the first work to demonstrate this significant result.

Submitted 3 June, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

Comments: Under review
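
The stochastic-matrix view mentioned in the abstract above can be illustrated with a small sketch. This is not the paper's construction and it does not implement FOCUS; it only assumes, for illustration, that one server-side averaging round over the participating clients is written as a row-stochastic matrix acting on the stacked models. The names `aggregation_matrix` and `apply_round`, the scalar "models", and the choice of index 0 for the server are all hypothetical.

```python
from fractions import Fraction


def aggregation_matrix(num_clients, participants):
    """Build a row-stochastic matrix for one averaging round (illustrative).

    Index 0 is the server; indices 1..num_clients are the clients.
    `participants` is the set of client indices taking part in this round.
    """
    if not participants:
        raise ValueError("at least one client must participate")
    if any(p < 1 or p > num_clients for p in participants):
        raise ValueError("participant index out of range")
    size = num_clients + 1
    # Start from the identity: every node keeps its own model by default.
    W = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    # The server row becomes a uniform average over the participating clients.
    weight = Fraction(1, len(participants))
    W[0] = [weight if j in participants else Fraction(0) for j in range(size)]
    return W


def apply_round(W, models):
    """Multiply the matrix by the stacked (scalar) models."""
    return [sum(w * m for w, m in zip(row, models)) for row in W]


# Four clients, of which only clients 1 and 3 participate in this round.
W = aggregation_matrix(4, {1, 3})
models = [Fraction(0), Fraction(2), Fraction(10), Fraction(6), Fraction(-4)]
updated = apply_round(W, models)
```

Changing the participant set from round to round yields a different matrix each time, which is one simple way to picture a time-varying graph; whether this matches the paper's exact formulation is an assumption.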

arXiv:2503.19941 [pdf, other]

Body Discovery of Embodied AI

Authors: Zhe Sun, Pengfei Tian, Xiaozhu Hu, Xiaoyu Zhao, Huiying Li, Zhenliang Zhang

Abstract: In the pursuit of realizing artificial general intelligence (AGI), the importance of embodied artificial intelligence (AI) becomes increasingly apparent. Following this trend, research integrating robots with AGI has become prominent. As various kinds of embodiments have been designed, adaptability to diverse embodiments will become important to AGI. We introduce a new challenge, termed "Body Discovery of Embodied AI", focusing on tasks of recognizing embodiments and summarizing neural signal functionality. The challenge encompasses the precise definition of an AI body and the intricate task of identifying embodiments in dynamic environments, where conventional approaches often prove inadequate. To address these challenges, we apply a causal inference method and evaluate it by developing a simulator tailored for testing algorithms with virtual environments. Finally, we validate the efficacy of our algorithms through empirical testing, demonstrating their robust performance in various scenarios based on virtual environments.

Submitted 25 March, 2025; originally announced March 2025.

arXiv:2503.18943 [pdf, other]

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Authors: Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan

Abstract: We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.

Submitted 27 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

Comments: Technical report

arXiv:2503.18888 [pdf, other]

Toward building next-generation Geocoding systems: a systematic review

Authors: Zhengcong Yin, Daniel W. Goldberg, Binbin Lin, Bing Zhou, Diya Li, Andong Ma, Ziqian Ming, Heng Cai, Zhe Zhang, Shaohua Wang, Shanzhen Gao, Joey Ying Lee, Xiao Li, Da Huo

Abstract: Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.

Submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.18091 [pdf, ps, other]

doi 10.1103/PhysRevB.111.214203

Fermi energy sensitive universal conductance fluctuations in anisotropic materials

Authors: Qiang Yang, Yayun Hu, Zhe Hou, Peiqing Tong

Abstract: Universal conductance fluctuations (UCF) are a hallmark of quantum interference in mesoscopic devices. According to the Altshuler-Lee-Stone theory, the amplitude of UCF remains independent of system parameters such as Fermi energy and disorder strength. However, recent experiments have demonstrated a significant variation in UCF with respect to Fermi energy in the anisotropic Dirac semimetal $\mathrm{Cd_3As_2}$, suggesting a dependence on band anisotropy. In this work, we reconcile the discrepancy between theoretical predictions and experimental observations through a detailed study of UCF versus Fermi energy using a tight-binding model with tunable anisotropy parameters. Near the band edge, the Hamiltonian is simplified to an anisotropic free electron gas model, recovering the generalized Altshuler-Lee-Stone theory. However, as the Fermi energy shifts toward the band center, where rotational symmetry breaks into $C_4$ (four-fold rotational) symmetry, the UCF amplitude deviates from the standard theory. Our findings reveal that UCF becomes increasingly sensitive to Fermi energy as the anisotropy grows stronger. Furthermore, using realistic parameters for $\mathrm{Cd_3As_2}$, our calculations demonstrate an increase in UCF away from the Dirac point, in qualitative agreement with experimental results. The enhancement of UCF occurs in two perpendicular transport directions that we have calculated, albeit with quantitative differences in magnitude, which can be tested in future experiments. Given the prevalence of anisotropic materials and technical advances in engineering anisotropy through strain or twist, our results offer a valuable reference for characterizing intrinsic electronic properties via UCF.

Submitted 10 June, 2025; v1 submitted 23 March, 2025; originally announced March 2025.

Comments: 19 pages, 11 figures

Journal ref: Phys. Rev. B 111, 214203 (2025)

arXiv:2503.17788 [pdf, other]

Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction

Authors: Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu

Abstract: Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a novel framework that attempts to precisely align hand poses and interactions by synergistically integrating foundation model-driven 2D priors with diffusion-based interaction refinement for occlusion-resistant two-hand reconstruction. First, we introduce a Fusion Alignment Encoder that learns to align fused multimodal priors (keypoints, segmentation maps, and depth cues) from foundation models during training. This provides robust structured guidance, further enabling efficient inference without foundation models at test time while maintaining high reconstruction accuracy. Second, we employ a two-hand diffusion model explicitly trained to transform interpenetrated poses into plausible, non-penetrated interactions, leveraging gradient-guided denoising to correct artifacts and ensure realistic spatial relations. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, FreiHAND, and HIC datasets, significantly advancing occlusion handling and interaction robustness.

Submitted 22 March, 2025; originally announced March 2025.

arXiv:2503.16965 [pdf, other]

Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

Authors: Zhe Hu, Jing Li, Zhongzhu Pu, Hou Pong Chan, Yu Yin

Abstract: Vision Language Models have exhibited immense potential for embodied AI, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are represented merely as text-only descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.

Submitted 22 May, 2025; v1 submitted 21 March, 2025; originally announced March 2025.

arXiv:2503.16886 [pdf, other]

Insight-HXMT observations of the 2023 outburst in Aql X-1

Authors: Zhe Yan, Guobao Zhang, Yu-Peng Chen, Mariano Méndez, Jirong Mao, Ming Lyu, Shu Zhang, Pei Jin

Abstract: We conducted an analysis of the continuum during the onset and initial decline phases of the 2023 outburst in the transient neutron star low-mass X-ray binary Aql X-1 using broadband observations from the Insight-Hard X-ray Modulation Telescope (Insight-HXMT) instrument. To determine the most appropriate model for the continuum of this outburst, we employed three models to explore the evolution of the spectral component. These observations revealed that the source transitions from the hard state to the soft state. The disk-corona and sphere-corona models both adequately described the spectra of the hard state, while the double blackbody model became preferable after the hard X-ray emission (>25 keV) disappeared during the state transition. In the soft state, the total emission is dominated by changes in the disk and other blackbody components. The combination of the sphere-corona model and the double blackbody model is the most suitable model for this outburst. The results suggest that as the source transitioned into the soft state, the emission from the boundary layer was enhanced, and a hot spot occurred. Notably, we identified two type-I X-ray bursts, one of which exhibited a significant hard X-ray deficit (significance ~4.82σ), which indicates that Insight-HXMT has the capability to capture the evolution of the corona in a single burst.

Submitted 21 March, 2025; originally announced March 2025.

Comments: 6 figures

arXiv:2503.16820 [pdf, ps, other]

doi 10.1038/s41467-025-57713-w

Giant Self Spin-Valve Effect in the Kagome Helimagnet

Authors: Xitong Xu, Yonglai Liu, Kesen Zhao, Che-Min Lin, Miao He, Haitian Zhao, Qingqi Zeng, Yubin Hou, Qingyou Lu, Ding-Fu Shao, Shuang Jia, Haifeng Du, Wenjie Meng, Tay-Rong Chang, Zhe Qu

Abstract: Kagome magnets can combine non-trivial band topology and electron correlations, offering a versatile playground for various quantum phenomena. In this work we propose that kagome magnets with frustrated interlayer interactions can intrinsically support a self spin-valve effect, and experimentally confirm this in the kagome helimagnet TmMn$_6$Sn$_6$. Under a magnetic field perpendicular to the helical axis, using magnetic force microscopy we observed stripe domains that stack strictly along the helical axis, which we attribute to the stability loss of the kagome helimagnetic state. Such a domain pattern spontaneously mimics the artificial multilayered structure in traditional spin valves, which, combined with the high spin polarization, leads to a giant magnetoresistance (GMR) ratio over 160%. This discovery opens an avenue to realize inherent spin valves in a variety of quantum magnets, and can hold promise in future spintronics.

Submitted 20 March, 2025; originally announced March 2025.

Comments: Accepted version

Journal ref: Nat. Commun. 16, 2630 (2025)

arXiv:2503.15837 [pdf, other]

Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

Authors: Shangqing Zhao, Yuhao Zhou, Yupei Ren, Zhe Chen, Chenghao Jia, Fang Zhe, Zhaogaung Long, Shu Liu, Man Lan

Abstract: Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

Submitted 20 March, 2025; originally announced March 2025.

Comments: work in progress

arXiv:2503.15558 [pdf, other]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Authors: NVIDIA: Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, et al. (29 additional authors not shown)

Abstract: Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.

Submitted 19 May, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.15404 [pdf, other]

Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement

Authors: Yuchen Ren, Zhengyu Zhao, Chenhao Lin, Bo Yang, Lu Zhou, Zhe Liu, Chao Shen

Abstract: Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.

Submitted 19 March, 2025; originally announced March 2025.

Comments: CVPR2025

arXiv:2503.15144 [pdf, other]

PointSFDA: Source-free Domain Adaptation for Point Cloud Completion

Authors: Xing He, Zhe Zhu, Liangliang Nan, Honghua Chen, Jing Qin, Mingqiang Wei

Abstract: Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed PointSFDA. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at https://github.com/Starak-x/PointSFDA.

Submitted 19 March, 2025; originally announced March 2025.

arXiv:2503.15091 [pdf, other]

Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs

Authors: Yao Cheng, Zhe Han, Fengyang Jiang, Huaizhen Wang, Fengyu Zhou, Qingshan Yin, Lei Wei

Abstract: This paper addresses the high demand in advanced intelligent robot navigation for a more holistic understanding of spatial environments, by introducing a novel system that harnesses the capabilities of Large Language Models (LLMs) to construct hierarchical 3D Scene Graphs (3DSGs) for indoor scenarios. The proposed framework constructs 3DSGs consisting of a fundamental layer with rich metric-semantic information, an object layer featuring precise point-cloud representation of object nodes as well as visual descriptors, and higher layers of room, floor, and building nodes. Thanks to the innovative application of LLMs, not only object nodes but also nodes of higher layers, e.g., room nodes, are annotated in an intelligent and accurate manner. A polling mechanism for room classification using LLMs is proposed to enhance the accuracy and reliability of the room node annotation. Thorough numerical experiments demonstrate the system's ability to integrate semantic descriptions with geometric data, creating an accurate and comprehensive representation of the environment instrumental for context-aware navigation and task planning.

Submitted 19 March, 2025; originally announced March 2025.

Comments: accepted by WRC SARA 2024

arXiv:2503.14000 [pdf, other]

LLM-based Unit Test Generation for Dynamically-Typed Programs

Authors: Runlin Liu, Zhe Zhang, Yunge Hu, Yuhang Lin, Xiang Gao, Hailong Sun

Abstract: Automated unit test generation has been widely studied, but generating effective tests for dynamically typed programs remains a significant challenge. Existing approaches, including search-based software testing (SBST) and recent LLM-based methods, often suffer from type errors, leading to invalid inputs and assertion failures, ultimately reducing testing effectiveness. To address this, we propose TypeTest, a novel framework that enhances type correctness in test generation through a vector-based Retrieval-Augmented Generation (RAG) system. TypeTest employs call instance retrieval and feature-based retrieval to infer parameter types accurately and construct valid test inputs. Furthermore, it utilizes the call graph to extract richer contextual information, enabling more accurate assertion generation. In addition, TypeTest incorporates a repair mechanism and iterative test generation, progressively refining test cases to improve coverage. In an evaluation on 125 real-world Python modules, TypeTest achieved an average statement coverage of 86.6% and branch coverage of 76.8%, outperforming state-of-the-art tools by 5.4% and 9.3%, respectively.

Submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.13880 [pdf]

The development of vibration modes propagation method to perform wave-optics simulation of beamline vibration

Authors: Han Xu, Xiao Li, Ming Li, Zhe Ren, Yi Zhang, Peng Liu, Yuhui Dong, Liang Zhou

Abstract: The evolution from 3rd to 4th generation synchrotron radiation (SR) sources provides promising potential improvements in X-ray techniques, particularly in spatial resolution for imaging, temporal resolution for dynamic studies, and beam size control for nanoprobes. Achieving these enhancements demands effective vibration suppression in beamline systems. This challenge drives the need for optical designs that ensure efficient photon transport while maintaining vibration within acceptable thresholds. To address the advanced coherence requirements of fourth-generation SR sources, wave-optics simulations must be incorporated into optical design processes. We therefore propose a vibration mode propagation method using wave-optics techniques for beamline vibration simulation. Our approach achieves an almost 40-fold computational acceleration in actual beamline models compared to conventional methods, enabling direct analysis of propagating wavefront vibrations. This framework allows systematic evaluation of intensity distribution variations, coherence changes, and beam positioning errors caused by mechanical vibrations.

Submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.13848 [pdf, other]

FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems

Authors: Tinglue Wang, Yiming Li, Wei Tang, Jiapeng Guan, Zhenghui Guo, Renshuang Jiang, Ran Wei, Jing Li, Zhe Jiang

Abstract: Reliability and real-time responsiveness in safety-critical systems have traditionally been achieved using error detection mechanisms, such as LockStep, which require pre-configured checker cores, strict synchronisation between main and checker cores, static error detection regions, or limited preemption capabilities. However, these core-bound hardware mechanisms often lead to significant resource over-provisioning and diminished real-time responsiveness, particularly in modern systems where tasks with varying reliability requirements are consolidated on shared processors to improve efficiency, reduce costs, and save power. To address these challenges, this work presents FlexStep, a systematic solution that integrates hardware and software across the SoC, ISA, and OS scheduling layers. FlexStep features a novel microarchitecture that supports dynamic core configuration and asynchronous, preemptive error detection. The FlexStep architecture naturally allows for flexible task scheduling and error detection, enabling new scheduling algorithms that enhance both resource efficiency and real-time schedulability. We publicly release FlexStep's source code at https://anonymous.4open.science/r/FlexStep-DAC25-7B0C.

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.13555 [pdf, other]

Feasibility study for reconstruction of knee MRI from one corresponding X-ray via CNN

Authors: Zhe Wang, Aladine Chetouani, Rachid Jennane

Abstract: X-ray, as an inexpensive and popular medical imaging technique, is widely chosen by medical practitioners. With the development of medical technology, Magnetic Resonance Imaging (MRI), an advanced medical imaging technique, has already become a supplementary diagnostic option for the diagnosis of knee osteoarthritis (KOA). In this paper, we propose a deep-learning-based approach for generating MRI from one corresponding X-ray. Our method uses the hidden variables of a Convolutional Auto-Encoder (CAE) model, trained for reconstructing X-ray images, as inputs of a generator model to provide 3D MRI.

Submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.13012 [pdf, other]

Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation

Authors: Xingguo Lv, Xingbo Dong, Liwen Wang, Jiewen Yang, Lei Zhao, Bin Pu, Zhe Jin, Xuejun Li

Abstract: Although domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at https://github.com/Yore0/TTDG-MGM.

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.12652 [pdf, other]

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Authors: Tsu-Jui Fu, Yusu Qian, Chen Chen, Wenze Hu, Zhe Gan, Yinfei Yang

Abstract: Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.

Submitted 22 April, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.12482 [pdf, other]

Fuzzy Clustering for Low-Complexity Time Domain Chromatic Dispersion Compensation Scheme in Coherent Optical Fiber Communication Systems

Authors: Wenkai Wan, Aiying Yang, Peng Guo, Zhe Zhao, Tianjia Xu, Jinxuan Wu, Zhiheng Liu

Abstract: Chromatic dispersion compensation (CDC), implemented in either the time-domain or frequency-domain, is crucial for enhancing power efficiency in the digital signal processing of modern optical fiber communication systems. Developing low-complexity CDC schemes is essential for hardware implementation, particularly for high-speed and long-haul optical fiber communication systems. In this work, we propose a novel two-stage fuzzy clustered time-domain chromatic dispersion compensation scheme. Unlike hard decisions of CDC filter coefficients after determining the cluster centroids, our approach applies a soft fuzzy decision, allowing the coefficients to belong to multiple clusters. Experiments on a single-channel, single-polarization 20 Gbaud 16-QAM 1800 km standard single-mode fiber communication system demonstrate that our approach has a complexity reduction of 53.8% and 40% compared with clustered TD-CDC and FD-CDC at a target Q-factor of 20% HD-FEC, respectively. Furthermore, the proposed method achieves the same optimal Q-factor as FD-CDC with a 27% complexity reduction.

Submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.12339 [pdf, other]

Augmented Adversarial Trigger Learning

Authors: Zhe Wang, Yanjun Qi

Abstract: Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA reformulates the negative log-likelihood loss used by previous studies as a weighted loss that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair, and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically, we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA-learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs.