-
ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo
Authors:
Yuxi Hu,
Jun Zhang,
Zhe Zhang,
Rafael Weilharter,
Yuchen Rao,
Kuangyi Chen,
Runze Yuan,
Friedrich Fraundorfer
Abstract:
Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.
Submitted 27 March, 2025;
originally announced March 2025.
-
Validation and Calibration of Energy Models with Real Vehicle Data from Chassis Dynamometer Experiments
Authors:
Joy Carpio,
Sulaiman Almatrudi,
Nour Khoudari,
Zhe Fu,
Kenneth Butts,
Jonathan Lee,
Benjamin Seibold,
Alexandre Bayen
Abstract:
Accurate estimation of vehicle fuel consumption typically requires detailed modeling of complex internal powertrain dynamics, often resulting in computationally intensive simulations. However, many transportation applications, such as traffic flow modeling, optimization, and control, require simplified models that are fast, interpretable, and easy to implement, while still maintaining fidelity to physical energy behavior. This work builds upon a recently developed model reduction pipeline that derives physics-like energy models from high-fidelity Autonomie vehicle simulations. These reduced models preserve essential vehicle dynamics, enabling realistic fuel consumption estimation with minimal computational overhead. While the reduced models have demonstrated strong agreement with their Autonomie counterparts, previous validation efforts have been confined to simulation environments. This study extends the validation by comparing the reduced energy model's outputs against real-world vehicle data. Focusing on the MidSUV category, we tune the baseline Autonomie model to closely replicate the characteristics of a Toyota RAV4. We then assess the accuracy of the resulting reduced model in estimating fuel consumption under actual drive conditions. Our findings suggest that, when the reference Autonomie model is properly calibrated, the simplified model produced by the reduction pipeline can provide reliable, semi-principled fuel rate estimates suitable for large-scale transportation applications.
Submitted 26 March, 2025;
originally announced March 2025.
-
UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture
Authors:
Heng Liao,
Bingyang Liu,
Xianping Chen,
Zhigang Guo,
Chuanning Cheng,
Jianbing Wang,
Xiangyu Chen,
Peng Dong,
Rui Meng,
Wenjie Liu,
Zhe Zhou,
Ziyang Zhang,
Yuhang Gai,
Cunle Qian,
Yi Xiong,
Zhongwu Cheng,
Jing Xia,
Yuli Ma,
Xi Chen,
Wenhua Du,
Shizhong Xiao,
Chungang Li,
Yong Qin,
Liudong Xiong,
Zhou Yu
, et al. (9 additional authors not shown)
Abstract:
As Large-scale Language Models (LLMs) continue to scale, the requisite computational power and bandwidth escalate. To address this, we introduce UB-Mesh, a novel AI datacenter network architecture designed to enhance scalability, performance, cost-efficiency and availability. Unlike traditional datacenters that provide symmetrical node-to-node bandwidth, UB-Mesh employs a hierarchically localized nD-FullMesh network topology. This design fully leverages the data locality of LLM training, prioritizing short-range, direct interconnects to minimize data movement distance and reduce switch usage.
Although UB-Mesh's nD-FullMesh topology offers several theoretical advantages, its concrete architecture design, physical implementation and networking system optimization present new challenges. For the actual construction of UB-Mesh, we first design the UB-Mesh-Pod architecture, which is based on a 4D-FullMesh topology. UB-Mesh-Pod is implemented via a suite of hardware components that serve as the foundational building blocks, including specifically-designed NPU, CPU, Low-Radix-Switch (LRS), High-Radix-Switch (HRS), NICs and others. These components are interconnected via a novel Unified Bus (UB) technique, which enables flexible IO bandwidth allocation and hardware resource pooling. For networking system optimization, we propose an advanced routing mechanism named All-Path-Routing (APR) to efficiently manage data traffic. These optimizations, combined with topology-aware performance enhancements and robust reliability measures like a 64+1 backup design, result in 2.04x higher cost-efficiency and 7.2% higher network availability compared to a traditional Clos architecture, and 95%+ linearity in various LLM training tasks.
Submitted 17 May, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
Fixseeker: An Empirical Driven Graph-based Approach for Detecting Silent Vulnerability Fixes in Open Source Software
Authors:
Yiran Cheng,
Ting Zhang,
Lwin Khin Shar,
Zhe Lang,
David Lo,
Shichao Lv,
Dongliang Fang,
Zhiqiang Shi,
Limin Sun
Abstract:
Open source software vulnerabilities pose significant security risks to downstream applications. While vulnerability databases provide valuable information for mitigation, many security patches are released silently in new commits of OSS repositories without explicit indications of their security impact. This makes it challenging for software maintainers and users to detect and address these vulnerability fixes. There are a few approaches for detecting vulnerability-fixing commits (VFCs), but most of these approaches leverage commit messages, which would miss silent VFCs. On the other hand, there are some approaches for detecting silent VFCs based on code change patterns, but they often fail to adequately characterize vulnerability fix patterns, thereby lacking effectiveness. For example, some approaches analyze each hunk in known VFCs in isolation to learn vulnerability fix patterns; but vulnerability fixes are often associated with multiple hunks, in which case correlations of code changes across those hunks are essential for characterizing the vulnerability fixes. To address these problems, we first conduct a large-scale empirical study on 11,900 VFCs across six programming languages, in which we found that over 70% of VFCs involve multiple hunks with various types of correlations. Based on our findings, we propose Fixseeker, a graph-based approach that extracts the various correlations between code changes at the hunk level to detect silent vulnerability fixes. Our evaluation demonstrates that Fixseeker outperforms state-of-the-art approaches across multiple programming languages, achieving a high F1 score of 0.8404 on average in balanced datasets and consistently improving F1 score, AUC-ROC and AUC-PR scores by 32.40%, 1.55% and 8.24% on imbalanced datasets. Our evaluation also indicates the generality of Fixseeker across different repository sizes and commit complexities.
Submitted 26 March, 2025;
originally announced March 2025.
-
Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable
Authors:
Bicheng Ying,
Zhe Li,
Haibo Yang
Abstract:
This work tackles the fundamental challenges in Federated Learning (FL) posed by arbitrary client participation and data heterogeneity, prevalent characteristics in practical FL settings. It is well-established that popular FedAvg-style algorithms struggle with exact convergence and can suffer from slow convergence rates since a decaying learning rate is required to mitigate these scenarios. To address these issues, we introduce the concept of a stochastic matrix and the corresponding time-varying graphs as a novel modeling tool to accurately capture the dynamics of arbitrary client participation and the local update procedure. Leveraging this approach, we offer a fresh perspective on designing FL algorithms, provide a rigorous quantitative analysis of the limitations inherent in the FedAvg algorithm, and present FOCUS, Federated Optimization with Exact Convergence via Push-pull Strategy, a provably convergent algorithm designed to effectively overcome the two previously mentioned challenges. More specifically, we provide a rigorous proof demonstrating that FOCUS achieves exact convergence with a linear rate regardless of the arbitrary client participation, establishing it as the first work to demonstrate this significant result.
Submitted 3 June, 2025; v1 submitted 25 March, 2025;
originally announced March 2025.
-
Body Discovery of Embodied AI
Authors:
Zhe Sun,
Pengfei Tian,
Xiaozhu Hu,
Xiaoyu Zhao,
Huiying Li,
Zhenliang Zhang
Abstract:
In the pursuit of realizing artificial general intelligence (AGI), the importance of embodied artificial intelligence (AI) becomes increasingly apparent. Following this trend, research integrating robots with AGI has become prominent. As various kinds of embodiments have been designed, adaptability to diverse embodiments will become important to AGI. We introduce a new challenge, termed "Body Discovery of Embodied AI", focusing on tasks of recognizing embodiments and summarizing neural signal functionality. The challenge encompasses the precise definition of an AI body and the intricate task of identifying embodiments in dynamic environments, where conventional approaches often prove inadequate. To address these challenges, we apply a causal inference method and evaluate it by developing a simulator tailored for testing algorithms with virtual environments. Finally, we validate the efficacy of our algorithms through empirical testing, demonstrating their robust performance in various scenarios based on virtual environments.
Submitted 25 March, 2025;
originally announced March 2025.
-
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Authors:
Mingze Xu,
Mingfei Gao,
Shiyu Li,
Jiasen Lu,
Zhe Gan,
Zhengfeng Lai,
Meng Cao,
Kai Kang,
Yinfei Yang,
Afshin Dehghan
Abstract:
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.
Submitted 27 March, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
Toward building next-generation Geocoding systems: a systematic review
Authors:
Zhengcong Yin,
Daniel W. Goldberg,
Binbin Lin,
Bing Zhou,
Diya Li,
Andong Ma,
Ziqian Ming,
Heng Cai,
Zhe Zhang,
Shaohua Wang,
Shanzhen Gao,
Joey Ying Lee,
Xiao Li,
Da Huo
Abstract:
Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.
Submitted 24 March, 2025;
originally announced March 2025.
-
Fermi energy sensitive universal conductance fluctuations in anisotropic materials
Authors:
Qiang Yang,
Yayun Hu,
Zhe Hou,
Peiqing Tong
Abstract:
Universal conductance fluctuations (UCF) are a hallmark of quantum interference in mesoscopic devices. According to the Altshuler-Lee-Stone theory, the amplitude of UCF remains independent of system parameters such as Fermi energy and disorder strength. However, recent experiments have demonstrated a significant variation in UCF with respect to Fermi energy in the anisotropic Dirac semimetal $\mathrm{Cd_3As_2}$, suggesting a dependence on band anisotropy. In this work, we reconcile the discrepancy between theoretical predictions and experimental observations through a detailed study of UCF versus Fermi energy using a tight-binding model with tunable anisotropy parameters. Near the band edge, the Hamiltonian is simplified to an anisotropic free electron gas model, recovering the generalized Altshuler-Lee-Stone theory. However, as the Fermi energy shifts toward the band center, where rotational symmetry breaks into $C_4$ (four-fold rotational) symmetry, the UCF amplitude deviates from the standard theory. Our findings reveal that UCF becomes increasingly sensitive to Fermi energy as the anisotropy grows stronger. Furthermore, using realistic parameters for $\mathrm{Cd_3As_2}$, our calculations demonstrate an increase in UCF away from the Dirac point, in qualitative agreement with experimental results. The enhancement of UCF occurs in two perpendicular transport directions that we have calculated, albeit with quantitative differences in magnitude, which can be tested in future experiments. Given the prevalence of anisotropic materials and technical advances in engineering anisotropy through strain or twist, our results offer a valuable reference for characterizing intrinsic electronic properties via UCF.
Submitted 10 June, 2025; v1 submitted 23 March, 2025;
originally announced March 2025.
-
Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction
Authors:
Gaoge Han,
Yongkang Cheng,
Zhe Chen,
Shaoli Huang,
Tongliang Liu
Abstract:
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a novel framework that attempts to precisely align hand poses and interactions by synergistically integrating foundation model-driven 2D priors with diffusion-based interaction refinement for occlusion-resistant two-hand reconstruction. First, we introduce a Fusion Alignment Encoder that learns to align fused multimodal priors (keypoints, segmentation maps, and depth cues) from foundation models during training. This provides robust structured guidance, further enabling efficient inference without foundation models at test time while maintaining high reconstruction accuracy. Second, we employ a two-hand diffusion model explicitly trained to transform interpenetrated poses into plausible, non-penetrated interactions, leveraging gradient-guided denoising to correct artifacts and ensure realistic spatial relations. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, FreiHAND, and HIC datasets, significantly advancing occlusion handling and interaction robustness.
Submitted 22 March, 2025;
originally announced March 2025.
-
Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning
Authors:
Zhe Hu,
Jing Li,
Zhongzhu Pu,
Hou Pong Chan,
Yu Yin
Abstract:
Vision Language Models have exhibited immense potential for embodied AI, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are represented merely as text-only descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.
Submitted 22 May, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
Insight-HXMT observations of the 2023 outburst in Aql X-1
Authors:
Zhe Yan,
Guobao Zhang,
Yu-Peng Chen,
Mariano Méndez,
Jirong Mao,
Ming Lyu,
Shu Zhang,
Pei Jin
Abstract:
We conducted an analysis of the continuum during the onset and initial decline phases of the 2023 outburst in the transient neutron star low-mass X-ray binary Aql X-1 using broadband observations from the Insight-Hard X-ray Modulation Telescope (Insight-HXMT) instrument. To determine the most appropriate model for the continuum of this outburst, we employed three models to explore the evolution of the spectral component. These observations revealed that the source transitions from the hard state to the soft state. The disk-corona and sphere-corona models both adequately described the spectra of the hard state, while the double blackbody model became preferable after the hard X-ray emission ($>$25 keV) disappeared during the state transition. In the soft state, the total emission is dominated by changes in the disk and other blackbody components. The combination of the sphere-corona model and the double blackbody model is the most suitable model for this outburst. The results suggest that as the source transitioned into the soft state, the emission from the boundary layer was enhanced, and a hot spot occurred. Notably, we identified two type-I X-ray bursts, one of which exhibited a significant hard X-ray deficit (significance $\sim 4.82\sigma$), which indicates that Insight-HXMT has the capability to capture the evolution of the corona in a single burst.
Submitted 21 March, 2025;
originally announced March 2025.
-
Giant Self Spin-Valve Effect in the Kagome Helimagnet
Authors:
Xitong Xu,
Yonglai Liu,
Kesen Zhao,
Che-Min Lin,
Miao He,
Haitian Zhao,
Qingqi Zeng,
Yubin Hou,
Qingyou Lu,
Ding-Fu Shao,
Shuang Jia,
Haifeng Du,
Wenjie Meng,
Tay-Rong Chang,
Zhe Qu
Abstract:
Kagome magnets can combine non-trivial band topology and electron correlations, offering a versatile playground for various quantum phenomena. In this work we propose that kagome magnets with frustrated interlayer interactions can intrinsically support a self spin-valve effect, and experimentally confirm this in the kagome helimagnet TmMn$_6$Sn$_6$. Under a magnetic field perpendicular to the helical axis, using magnetic force microscopy we observed stripe domains that stack strictly along the helical axis, which we attribute to the stability loss of the kagome helimagnetic state. Such a domain pattern spontaneously mimics the artificial multilayered structure in traditional spin valves, which, combined with the high spin polarization, leads to a giant magnetoresistance (GMR) ratio over 160%. This discovery opens an avenue to realize inherent spin valves in a variety of quantum magnets, and can hold promise in future spintronics.
Submitted 20 March, 2025;
originally announced March 2025.
-
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation
Authors:
Shangqing Zhao,
Yuhao Zhou,
Yupei Ren,
Zhe Chen,
Chenghao Jia,
Fang Zhe,
Zhaogaung Long,
Shu Liu,
Man Lan
Abstract:
Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.
Submitted 20 March, 2025;
originally announced March 2025.
-
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Authors:
NVIDIA,
Alisson Azzolini,
Junjie Bai,
Hannah Brandon,
Jiaxin Cao,
Prithvijit Chattopadhyay,
Huayu Chen,
Jinju Chu,
Yin Cui,
Jenna Diamond,
Yifan Ding,
Liang Feng,
Francesco Ferroni,
Rama Govindaraju,
Jinwei Gu,
Siddharth Gururani,
Imad El Hanafi,
Zekun Hao,
Jacob Huffman,
Jingyi Jin,
Brendan Johnson,
Rizwan Khan,
George Kurian,
Elena Lantz
, et al. (29 additional authors not shown)
Abstract:
Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
Submitted 19 May, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
Authors:
Yuchen Ren,
Zhengyu Zhao,
Chenhao Lin,
Bo Yang,
Lu Zhou,
Zhe Liu,
Chao Shen
Abstract:
Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
PointSFDA: Source-free Domain Adaptation for Point Cloud Completion
Authors:
Xing He,
Zhe Zhu,
Liangliang Nan,
Honghua Chen,
Jing Qin,
Mingqiang Wei
Abstract:
Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed PointSFDA. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at https://github.com/Starak-x/PointSFDA.
Submitted 19 March, 2025;
originally announced March 2025.
-
Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs
Authors:
Yao Cheng,
Zhe Han,
Fengyang Jiang,
Huaizhen Wang,
Fengyu Zhou,
Qingshan Yin,
Lei Wei
Abstract:
This paper addresses the high demand in advanced intelligent robot navigation for a more holistic understanding of spatial environments, by introducing a novel system that harnesses the capabilities of Large Language Models (LLMs) to construct hierarchical 3D Scene Graphs (3DSGs) for indoor scenarios. The proposed framework constructs 3DSGs consisting of a fundamental layer with rich metric-semantic information, an object layer featuring precise point-cloud representation of object nodes as well as visual descriptors, and higher layers of room, floor, and building nodes. Thanks to the innovative application of LLMs, not only object nodes but also nodes of higher layers, e.g., room nodes, are annotated in an intelligent and accurate manner. A polling mechanism for room classification using LLMs is proposed to enhance the accuracy and reliability of the room node annotation. Thorough numerical experiments demonstrate the system's ability to integrate semantic descriptions with geometric data, creating an accurate and comprehensive representation of the environment instrumental for context-aware navigation and task planning.
Submitted 19 March, 2025;
originally announced March 2025.
-
LLM-based Unit Test Generation for Dynamically-Typed Programs
Authors:
Runlin Liu,
Zhe Zhang,
Yunge Hu,
Yuhang Lin,
Xiang Gao,
Hailong Sun
Abstract:
Automated unit test generation has been widely studied, but generating effective tests for dynamically typed programs remains a significant challenge. Existing approaches, including search-based software testing (SBST) and recent LLM-based methods, often suffer from type errors, leading to invalid inputs and assertion failures, ultimately reducing testing effectiveness. To address this, we propose TypeTest, a novel framework that enhances type correctness in test generation through a vector-based Retrieval-Augmented Generation (RAG) system. TypeTest employs call instance retrieval and feature-based retrieval to infer parameter types accurately and construct valid test inputs. Furthermore, it utilizes the call graph to extract richer contextual information, enabling more accurate assertion generation. In addition, TypeTest incorporates a repair mechanism and iterative test generation, progressively refining test cases to improve coverage. In an evaluation on 125 real-world Python modules, TypeTest achieved an average statement coverage of 86.6% and branch coverage of 76.8%, outperforming state-of-the-art tools by 5.4% and 9.3%, respectively.
Submitted 18 March, 2025;
originally announced March 2025.
-
The development of vibration modes propagation method to perform wave-optics simulation of beamline vibration
Authors:
Han Xu,
Xiao Li,
Ming Li,
Zhe Ren,
Yi Zhang,
Peng Liu,
Yuhui Dong,
Liang Zhou
Abstract:
The evolution from 3rd to 4th generation synchrotron radiation (SR) sources provides promising potential improvements in X-ray techniques, particularly in spatial resolution for imaging, temporal resolution for dynamic studies, and beam size control for nanoprobes. Achieving these enhancements demands effective vibration suppression in beamline systems. This challenge drives the need for optical designs that ensure efficient photon transport while maintaining vibration within acceptable thresholds. To address the advanced coherence requirements of 4th generation SR sources, wave-optics simulations must be incorporated into optical design processes. We therefore propose a vibration mode propagation method using wave-optics techniques for beamline vibration simulation. Our approach achieves an almost 40-fold computational acceleration in actual beamline models compared to conventional methods, enabling direct analysis of propagating wavefront vibrations. This framework allows systematic evaluation of intensity distribution variations, coherence changes, and beam positioning errors caused by mechanical vibrations.
Submitted 18 March, 2025;
originally announced March 2025.
-
FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems
Authors:
Tinglue Wang,
Yiming Li,
Wei Tang,
Jiapeng Guan,
Zhenghui Guo,
Renshuang Jiang,
Ran Wei,
Jing Li,
Zhe Jiang
Abstract:
Reliability and real-time responsiveness in safety-critical systems have traditionally been achieved using error detection mechanisms, such as LockStep, which require pre-configured checker cores, strict synchronisation between main and checker cores, static error detection regions, or limited preemption capabilities. However, these core-bound hardware mechanisms often lead to significant resource over-provisioning and diminished real-time responsiveness, particularly in modern systems where tasks with varying reliability requirements are consolidated on shared processors to improve efficiency, reduce costs, and save power. To address these challenges, this work presents FlexStep, a systematic solution that integrates hardware and software across the SoC, ISA, and OS scheduling layers. FlexStep features a novel microarchitecture that supports dynamic core configuration and asynchronous, preemptive error detection. The FlexStep architecture naturally allows for flexible task scheduling and error detection, enabling new scheduling algorithms that enhance both resource efficiency and real-time schedulability. We publicly release FlexStep's source code at https://anonymous.4open.science/r/FlexStep-DAC25-7B0C.
Submitted 17 March, 2025;
originally announced March 2025.
-
Feasibility study for reconstruction of knee MRI from one corresponding X-ray via CNN
Authors:
Zhe Wang,
Aladine Chetouani,
Rachid Jennane
Abstract:
X-ray imaging, an inexpensive and popular medical imaging technique, is widely used by medical practitioners. With the development of medical technology, Magnetic Resonance Imaging (MRI), an advanced medical imaging technique, has become a supplementary diagnostic option for knee osteoarthritis (KOA). In this paper, we propose a deep-learning-based approach for generating MRI from one corresponding X-ray. Our method uses the hidden variables of a Convolutional Auto-Encoder (CAE) model, trained to reconstruct X-ray images, as inputs to a generator model that produces 3D MRI.
Submitted 16 March, 2025;
originally announced March 2025.
-
Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
Authors:
Xingguo Lv,
Xingbo Dong,
Liwen Wang,
Jiewen Yang,
Lei Zhao,
Bin Pu,
Zhe Jin,
Xuejun Li
Abstract:
Although domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at https://github.com/Yore0/TTDG-MGM.
Submitted 17 March, 2025;
originally announced March 2025.
-
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Authors:
Tsu-Jui Fu,
Yusu Qian,
Chen Chen,
Wenze Hu,
Zhe Gan,
Yinfei Yang
Abstract:
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.
Submitted 22 April, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
Fuzzy Clustering for Low-Complexity Time Domain Chromatic Dispersion Compensation Scheme in Coherent Optical Fiber Communication Systems
Authors:
Wenkai Wan,
Aiying Yang,
Peng Guo,
Zhe Zhao,
Tianjia Xu,
Jinxuan Wu,
Zhiheng Liu
Abstract:
Chromatic dispersion compensation (CDC), implemented in either the time domain or the frequency domain, is crucial for enhancing power efficiency in the digital signal processing of modern optical fiber communication systems. Developing low-complexity CDC schemes is essential for hardware implementation, particularly for high-speed and long-haul optical fiber communication systems. In this work, we propose a novel two-stage fuzzy clustered time-domain chromatic dispersion compensation scheme. Unlike hard decisions of CDC filter coefficients after determining the cluster centroids, our approach applies a soft fuzzy decision, allowing the coefficients to belong to multiple clusters. Experiments on a single-channel, single-polarization 20 Gbaud 16-QAM 1800 km standard single-mode fiber communication system demonstrate that our approach has a complexity reduction of 53.8% and 40% compared with clustered TD-CDC and FD-CDC at a target Q-factor of 20% HD-FEC, respectively. Furthermore, the proposed method achieves the same optimal Q-factor as FD-CDC with a 27% complexity reduction.
Submitted 16 March, 2025;
originally announced March 2025.
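To illustrate the difference between the hard and soft decisions mentioned in the abstract above, here is a minimal sketch using the standard fuzzy c-means membership formula. It is not the authors' two-stage scheme: the function names, the fuzzifier `m`, and the example centroids are all assumptions made for illustration.

```python
def hard_assignment(value, centroids):
    """Hard decision: index of the single nearest centroid."""
    dists = [abs(value - c) for c in centroids]
    return dists.index(min(dists))


def fuzzy_memberships(value, centroids, m=2.0):
    """Soft decision: standard fuzzy c-means membership of `value` in every cluster.

    `m` > 1 is the fuzzifier (hypothetical default of 2.0). abs() also works
    for complex values, so complex filter taps could be passed unchanged.
    """
    dists = [abs(value - c) for c in centroids]
    # A value lying exactly on one or more centroids belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); memberships sum to 1.
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


# Hypothetical coefficient value and cluster centroids.
centroids = [0.0, 1.0]
print(hard_assignment(0.25, centroids))
print(fuzzy_memberships(0.25, centroids))
```

With a hard decision the coefficient is tied to one centroid only, whereas the membership vector lets it contribute to several clusters in proportion to its closeness.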
-
Augmented Adversarial Trigger Learning
Authors:
Zhe Wang,
Yanjun Qi
Abstract:
Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair, and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically, we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA-learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs.
Submitted 15 March, 2025;
originally announced March 2025.
-
Power Swing Trajectory Influenced by Virtual Impedance-Based Current-Limiting Strategy
Authors:
Yanshu Niu,
Zhe Yang,
Bikash C. Pal
Abstract:
Grid-forming (GFM) inverter-based resources (IBRs) can emulate the external characteristics of synchronous generators (SGs) through appropriate control loop design. However, in systems with GFM IBRs, the apparent impedance trajectory under current limitation differs significantly from that of SG-based systems due to the limited overcurrent capability of power electronic devices. This difference challenges the power swing detection functions of distance relays designed for SG-based systems. This paper presents a theoretical analysis of the apparent impedance trajectory over a full power swing cycle under two typical current-limiting strategies: variable virtual impedance (VI) and adaptive VI. The analysis reveals that the trajectory under VI current-limiting strategies differs significantly from that of a conventional SG. The results also indicate that the control parameters affect the characteristics of the trajectory. In addition, the new trajectories challenge conventional power swing detection functions, increasing the risk of malfunction. Furthermore, the implementation of VI leads to a deterioration in system stability. The theoretical analysis is further validated through simulations on the MATLAB/Simulink platform.
Submitted 15 March, 2025;
originally announced March 2025.
-
Adaptive Label Correction for Robust Medical Image Segmentation with Noisy Labels
Authors:
Chengxuan Qian,
Kai Han,
Siqi Ma,
Chongwen Lyu,
Zhenlong Yuan,
Jun Chen,
Zhe Liu
Abstract:
Deep learning has shown remarkable success in medical image analysis, but its reliance on large volumes of high-quality labeled data limits its applicability. While noisy labeled data are easier to obtain, directly incorporating them into training can degrade model performance. To address this challenge, we propose a Mean Teacher-based Adaptive Label Correction (ALC) self-ensemble framework for robust medical image segmentation with noisy labels. The framework leverages the Mean Teacher architecture to ensure consistent learning under noise perturbations. It includes an adaptive label refinement mechanism that dynamically captures and weights differences across multiple disturbance versions to enhance the quality of noisy labels. Additionally, a sample-level uncertainty-based label selection algorithm is introduced to prioritize high-confidence samples for network updates, mitigating the impact of noisy annotations. Consistency learning is integrated to align the predictions of the student and teacher networks, further enhancing model robustness. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed framework, showing significant improvements in segmentation performance. By fully exploiting the strengths of the Mean Teacher structure, the ALC framework effectively processes noisy labels, adapts to challenging scenarios, and achieves competitive results compared to state-of-the-art methods.
Submitted 15 March, 2025;
originally announced March 2025.
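For readers unfamiliar with the Mean Teacher architecture named in the abstract above: its defining step is that the teacher's weights are an exponential moving average (EMA) of the student's weights. The sketch below shows only that generic update on plain Python dictionaries; the function name, the `decay` value, and the toy weights are assumptions, and the ALC label-correction and sample-selection steps are not reproduced.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    `teacher` and `student` map parameter names to floats (a stand-in for
    real tensors). `decay` close to 1 makes the teacher change slowly.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy weights (hypothetical): the teacher drifts slowly towards the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

Because the teacher averages over many student states, its predictions tend to be smoother than the student's, which is what makes it a useful consistency target under noisy labels.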
-
Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
Authors:
Zhe Jin,
Tat-Seng Chua
Abstract:
Text-to-Image (T2I) diffusion models (DM) have garnered widespread adoption due to their capability in generating high-fidelity outputs and accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random images on the internet they were trained on. Existing approaches to address this are founded on the implicit premise that visual aesthetics is universal, which is limiting. Aesthetics in the T2I context should be about personalization and we propose the novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset building on top of WikiArt with PoA analysis annotated by a capable Multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight and transferrable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.
Submitted 15 March, 2025;
originally announced March 2025.
-
ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object
Authors:
Zhe Shan,
Yang Liu,
Lei Zhou,
Cheng Yan,
Heng Wang,
Xia Xie
Abstract:
The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. The ROS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM's generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we design the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while ROS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, ROS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm ROS-SAM as a powerful tool for fine-grained segmentation in remote sensing applications. Code is available at https://github.com/ShanZard/ROS-SAM.
Submitted 15 March, 2025;
originally announced March 2025.
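The IoU figures quoted in the abstract above refer to intersection over union between a predicted mask and a reference mask. The following is a minimal sketch of that metric for binary masks stored as nested lists; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the paper or its repository.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks.

    Masks are lists of rows containing 0/1 (or False/True) values.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Tiny hypothetical masks: one shared pixel out of three marked pixels.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
print(mask_iou(pred, target))
```

Small objects, such as those in remote sensing video, make this metric sensitive: a boundary error of a few pixels removes a large share of the intersection.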
-
Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction
Authors:
Da Long,
Shandian Zhe,
Samuel Williams,
Leonid Oliker,
Zhe Bai
Abstract:
Simulating the long-term dynamics of multi-scale and multi-physics systems poses a significant challenge in understanding complex phenomena across science and engineering. The complexity arises from the intricate interactions between scales and the interplay of diverse physical processes. Neural operators have emerged as promising models for predicting such dynamics due to their flexibility and computational efficiency. However, they often fail to effectively capture multi-scale interactions or quantify the uncertainties inherent in the predictions. These limitations lead to rapid error accumulation, particularly in long-term forecasting of systems characterized by complex and coupled dynamics. To address these challenges, we propose a spatio-temporal Fourier transformer (StFT), in which each transformer block is designed to learn dynamics at a specific scale. By leveraging a structured hierarchy of StFT blocks, the model explicitly captures dynamics across both macro- and micro- spatial scales. Furthermore, a generative residual correction mechanism is integrated to estimate and mitigate predictive uncertainties, enhancing both the accuracy and reliability of long-term forecasts. Evaluations conducted on three benchmark datasets (plasma, fluid, and atmospheric dynamics) demonstrate the advantages of our approach over state-of-the-art ML methods.
Submitted 14 March, 2025;
originally announced March 2025.
-
Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification
Authors:
Yingjie Zhang,
Tong Liu,
Zhe Zhao,
Guozhu Meng,
Kai Chen
Abstract:
Large Language Models (LLMs) are vulnerable to jailbreak attacks, which use crafted prompts to elicit toxic responses. These attacks exploit LLMs' difficulty in dynamically detecting harmful intents during the generation process. Traditional safety alignment methods, often relying on the initial few generation steps, are ineffective due to limited computational budget. This paper proposes DEEPALIGN, a robust defense framework that fine-tunes LLMs to progressively detoxify generated content, significantly improving both the computational budget and effectiveness of mitigating harmful generation. Our approach uses a hybrid loss function operating on hidden states to directly improve LLMs' inherent awareness of toxicity during generation. Furthermore, we redefine safe responses by generating semantically relevant answers to harmful queries, thereby increasing robustness against representation-mutation attacks. Evaluations across multiple LLMs demonstrate state-of-the-art defense performance against six different attack types, reducing Attack Success Rates by up to two orders of magnitude compared to previous state-of-the-art defense while preserving utility. This work advances LLM safety by addressing limitations of conventional alignment through dynamic, context-aware mitigation.
Submitted 14 March, 2025;
originally announced March 2025.
-
Palette of Language Models: A Solver for Controlled Text Generation
Authors:
Zhe Yang,
Yi Huang,
Yaqin Chen,
Xiaoting Wu,
Junlan Feng,
Chao Deng
Abstract:
Recent advancements in large language models have revolutionized text generation with their remarkable capabilities. These models can produce controlled texts that closely adhere to specific requirements when prompted appropriately. However, designing an optimal prompt to control multiple attributes simultaneously can be challenging. A common approach is to linearly combine single-attribute models, but this strategy often overlooks attribute overlaps and can lead to conflicts. Therefore, we propose a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization on generative language models. This method has been adapted for the single-attribute control scenario and is termed the Palette of Language Models due to its theoretical linkage between attribute strength and generation style, akin to blending colors on an artist's palette. Moreover, positive correlation and attribute enhancement are advanced as theoretical properties to guide a rational combination strategy design. We conduct experiments on both single-control and multiple-control settings and achieve superior results.
Submitted 14 March, 2025;
originally announced March 2025.
-
From Understanding to Excelling: Template-Free Algorithm Design through Structural-Functional Co-Evolution
Authors:
Zhe Zhao,
Haibin Wen,
Pengkun Wang,
Ye Wei,
Zaixi Zhang,
Xi Lin,
Fei Liu,
Bo An,
Hui Xiong,
Yang Wang,
Qingfu Zhang
Abstract:
Large language models (LLMs) have greatly accelerated the automation of algorithm generation and optimization. However, current methods such as EoH and FunSearch mainly rely on predefined templates and expert-specified functions that focus solely on the local evolution of key functionalities. Consequently, they fail to fully leverage the synergistic benefits of the overall architecture and the potential of global optimization. In this paper, we introduce an end-to-end algorithm generation and optimization framework based on LLMs. Our approach utilizes the deep semantic understanding of LLMs to convert natural language requirements or human-authored papers into code solutions, and employs a two-dimensional co-evolution strategy to optimize both functional and structural aspects. This closed-loop process spans problem analysis, code generation, and global optimization, automatically identifying key algorithm modules for multi-level joint optimization and continually enhancing performance and design innovation. Extensive experiments demonstrate that our method outperforms traditional local optimization approaches in both performance and innovation, while also exhibiting strong adaptability to unknown environments and breakthrough potential in structural design. By building on human research, our framework generates and optimizes novel algorithms that surpass those designed by human experts, broadening the applicability of LLMs for algorithm design and providing a novel solution pathway for automated algorithm development.
Submitted 13 March, 2025;
originally announced March 2025.
-
$D^{(*)}\bar{B}^{(*)}$ Dynamics in Chiral Effective Field Theory
Authors:
Zhe Liu,
Hao Xu,
Xiang Liu
Abstract:
In this work, we systematically study the interactions of the $S$-wave $D^{(*)}\bar{B}^{(*)}$ systems within the framework of chiral effective field theory in heavy hadron formalism. We calculate the $D^{(*)}\bar{B}^{(*)}$ effective potentials up to next-to-leading order, explore the bound state formations, and investigate the $D^{(*)}\bar{B}^{(*)}$ scattering properties such as scattering rate, scattering length, and effective range. Our results show that all $I=1$ $D^{(*)}\bar{B}^{(*)}$ potentials are repulsive, preventing the formation of bound states, while the $I=0$ potentials are generally attractive. Specifically, we get two important observations: first, the shallow bound state is more likely to exist in the $D\bar{B}[I(J^{P})=0(0^{+})]$ system than in the $D\bar{B}^{*}[I(J^{P})=0(1^{+})]$ system; second, $D^{*}\bar{B}^{*}[I(J^{P})=0(0^{+})]$ and $D^{*}\bar{B}^{*}[I(J^{P})=0(1^{+})]$ systems possess relatively large binding energies and positive scattering lengths, which suggests strong bound state formations in these channels. So the attractions in the $D^{*}\bar{B}^{*}[I=0]$ systems are deeper than those in the $D\bar{B}^{(*)}[I=0]$ systems, thus we strongly recommend the future experiment to search for the $D^{*}\bar{B}^{*}[I=0]$ tetraquark systems. In addition, we also investigate the dependencies of the $D\bar{B}^{(*)}$ binding energies on the contact low-energy coupling constants (LECs).
Submitted 13 March, 2025;
originally announced March 2025.
-
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Authors:
Weiyun Wang,
Zhangwei Gao,
Lianjie Chen,
Zhe Chen,
Jinguo Zhu,
Xiangyu Zhao,
Yangzhou Liu,
Yue Cao,
Shenglong Ye,
Xizhou Zhu,
Lewei Lu,
Haodong Duan,
Yu Qiao,
Jifeng Dai,
Wenhai Wang
Abstract:
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.
Submitted 13 March, 2025;
originally announced March 2025.
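The Best-of-N (BoN) evaluation mentioned in the VisualPRM abstract can be illustrated with a minimal sketch: sample N candidate reasoning chains, score every step with a process reward model, aggregate the step scores, and keep the highest-scoring chain. Everything below is an assumption made for illustration: `toy_prm` is a hypothetical stand-in scorer rather than VisualPRM, and averaging the step scores is one possible aggregation rule, not necessarily the one used in the paper.

```python
from typing import Callable, List, Sequence


def aggregate(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one score (assumed rule: the mean)."""
    if not step_scores:
        return float("-inf")  # an empty chain can never win
    return sum(step_scores) / len(step_scores)


def best_of_n(
    candidates: Sequence[List[str]],
    score_steps: Callable[[List[str]], List[float]],
) -> int:
    """Return the index of the candidate chain with the highest aggregate score."""
    best_idx, best_score = -1, float("-inf")
    for idx, steps in enumerate(candidates):
        score = aggregate(score_steps(steps))
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


def toy_prm(steps: List[str]) -> List[float]:
    """Hypothetical scorer: a step containing 'wrong' gets 0.0, any other step 1.0."""
    return [0.0 if "wrong" in step else 1.0 for step in steps]


candidates = [
    ["a", "wrong"],  # mean step score 0.5
    ["a", "b"],      # mean step score 1.0
    ["wrong"],       # mean step score 0.0
]
print(best_of_n(candidates, toy_prm))  # prints 1
```

A real process reward model would replace `toy_prm` with a learned scorer that also conditions on the image and the question; the selection loop itself stays the same.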
-
I Can Tell Your Secrets: Inferring Privacy Attributes from Mini-app Interaction History in Super-apps
Authors:
Yifeng Cai,
Ziqi Zhang,
Mengyu Yao,
Junlin Liu,
Xiaoke Zhao,
Xinyi Fu,
Ruoyu Li,
Zhe Li,
Xiangqun Chen,
Yao Guo,
Ding Li
Abstract:
Super-apps have emerged as comprehensive platforms integrating various mini-apps to provide diverse services. While super-apps offer convenience and enriched functionality, they can introduce new privacy risks. This paper reveals a new privacy leakage source in super-apps: mini-app interaction history, including mini-app usage history (Mini-H) and operation history (Op-H). Mini-H refers to the history of mini-apps accessed by users, such as their frequency and categories. Op-H captures user interactions within mini-apps, including button clicks, bar drags, and image views. Super-apps can naturally collect these data without instrumentation due to the web-based feature of mini-apps. We identify these data types as novel and unexplored privacy risks through a literature review of 30 papers and an empirical analysis of 31 super-apps. We design a mini-app interaction history-oriented inference attack (THEFT), to exploit this new vulnerability. Using THEFT, the insider threats within the low-privilege business department of the super-app vendor acting as the adversary can achieve more than 95.5% accuracy in inferring privacy attributes of over 16.1% of users. THEFT only requires a small training dataset of 200 users from public breached databases on the Internet. We also engage with super-app vendors and a standards association to increase industry awareness and commitment to protect this data. Our contributions are significant in identifying overlooked privacy risks, demonstrating the effectiveness of a new attack, and influencing industry practices toward better privacy protection in the super-app ecosystem.
Submitted 13 March, 2025;
originally announced March 2025.
-
MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?
Authors:
Zhe Xu,
Daoyuan Chen,
Zhenqing Ling,
Yaliang Li,
Ying Shen
Abstract:
Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.
Submitted 22 May, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models
Authors:
Xiaozhen Qiao,
Peng Huang,
Jiakang Yuan,
Xianda Guo,
Bowen Ye,
Zhe Sun,
Xuelong Li
Abstract:
Test-time adaptation (TTA) is crucial in maintaining Vision-Language Models (VLMs) performance when facing real-world distribution shifts, particularly when the source data or target labels are inaccessible. Existing TTA methods rely on CLIP's output probability distribution for feature evaluation, which can introduce biases under domain shifts. This misalignment may cause features to be misclassified due to text priors or incorrect textual associations. To address these limitations, we propose Bidirectional Prototype-Reward co-Evolution (BPRE), a novel TTA framework for VLMs that integrates feature quality assessment with prototype evolution through a synergistic feedback loop. BPRE first employs a Multi-Dimensional Quality-Aware Reward Module to evaluate feature quality and guide prototype refinement precisely. The continuous refinement of prototype quality through Prototype-Reward Interactive Evolution will subsequently enhance the computation of more robust Multi-Dimensional Quality-Aware Reward Scores. Through the bidirectional interaction, the precision of rewards and the evolution of prototypes mutually reinforce each other, forming a self-evolving cycle. Extensive experiments are conducted across 15 diverse recognition datasets encompassing natural distribution shifts and cross-dataset generalization scenarios. Results demonstrate that BPRE consistently achieves superior average performance compared to state-of-the-art methods across different model architectures, such as ResNet-50 and ViT-B/16. By emphasizing comprehensive feature evaluation and bidirectional knowledge refinement, BPRE advances VLM generalization capabilities, offering a new perspective on TTA.
Submitted 12 March, 2025;
originally announced March 2025.
-
Large model enhanced computational ghost imaging
Authors:
Yifan Chen,
Hongjun An,
Zhe Sun,
Tong Tian,
Mingliang Chen,
Christian Spielmann,
Xuelong Li
Abstract:
Ghost imaging (GI) achieves 2D image reconstruction through high-order correlation of 1D bucket signals and 2D light field information, particularly demonstrating enhanced detection sensitivity and high-quality image reconstruction via efficient photon collection in scattering media. Recent investigations have established that deep learning (DL) can substantially enhance the ghost imaging reconstruction quality. Furthermore, with the emergence of large models like SDXL, GPT-4, etc., the constraints of conventional DL in parameters and architecture have been transcended, enabling models to comprehensively explore relationships among all distinct positions within feature sequences. This paradigm shift has significantly advanced the capability of DL in restoring severely degraded and low-resolution imagery, making it particularly advantageous for noise-robust image reconstruction in GI applications. In this paper, we propose the first large imaging model with 1.4 billion parameters that incorporates the physical principles of GI (GILM). The proposed GILM implements a skip connection mechanism to mitigate gradient explosion challenges inherent in deep architectures, ensuring sufficient parametric capacity to capture intricate correlations among object single-pixel measurements. Moreover, GILM leverages a multi-head attention mechanism to learn spatial dependencies across pixel points during image reconstruction, facilitating the extraction of comprehensive object information for subsequent reconstruction. We validated the effectiveness of GILM through a series of experiments, including simulated object imaging, imaging objects in free space, and imaging an object located 52 meters away in an underwater environment. The experimental results show that GILM effectively analyzes the fluctuation trends of the collected signals, thereby optimizing the recovery of the object's image from the acquired data.
Submitted 9 March, 2025;
originally announced March 2025.
-
Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens
Authors:
Qingsong Xie,
Zhao Zhang,
Zhe Huang,
Yanhao Zhang,
Haonan Lu,
Zhenyu Yang
Abstract:
Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (Layton) that bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens, a 16 times compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of an LDM as the decoder results in color and brightness discrepancies. Thus, we convert it to a latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with a reconstruction Fréchet Inception Distance of 10.8 on the MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. We also extend Layton to a text-to-image generation model, LaytonGen, which works autoregressively. It achieves a score of 0.73 on the GenEval benchmark, surpassing current state-of-the-art methods. Project homepage: https://github.com/OPPO-Mente-Lab/Layton
Submitted 13 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis
Authors:
Kai Qiu,
Xiang Li,
Jason Kuen,
Hao Chen,
Xiaohao Xu,
Jiuxiang Gu,
Yinyi Luo,
Bhiksha Raj,
Zhe Lin,
Marios Savvides
Abstract:
Recent image generation schemes typically capture image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g. rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g. gFID). In this paper, we comprehensively analyze the reason for the discrepancy between reconstruction and generation qualities in a discrete latent space, from which we propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., the unexpected tokens sampled, from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a $\sim$400M generator. Code: https://github.com/lxa9867/ImageFolder.
Submitted 17 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Trinity: A Modular Humanoid Robot AI System
Authors:
Jingkai Sun,
Qiang Zhang,
Gang Han,
Wen Zhao,
Zhe Yong,
Yan He,
Jiaxu Wang,
Jiahang Cao,
Yijie Guo,
Renjing Xu
Abstract:
In recent years, research on humanoid robots has garnered increasing attention. With breakthroughs in various types of artificial intelligence algorithms, embodied intelligence, exemplified by humanoid robots, has been highly anticipated. The advancements in reinforcement learning (RL) algorithms have significantly improved the motion control and generalization capabilities of humanoid robots. Simultaneously, the groundbreaking progress in large language models (LLM) and visual language models (VLM) has brought more possibilities and imagination to humanoid robots. LLM enables humanoid robots to understand complex tasks from language instructions and perform long-term task planning, while VLM greatly enhances the robots' understanding and interaction with their environment. This paper introduces Trinity, a novel AI system for humanoid robots that integrates RL, LLM, and VLM. By combining these technologies, Trinity enables efficient control of humanoid robots in complex environments. This innovative approach not only enhances the capabilities but also opens new avenues for future research and applications of humanoid robotics.
Submitted 11 March, 2025;
originally announced March 2025.
-
CL-MVSNet: Unsupervised Multi-view Stereo with Dual-level Contrastive Learning
Authors:
Kaiqiang Xiong,
Rui Peng,
Zhe Zhang,
Tianxing Feng,
Jianbo Jiao,
Feng Gao,
Ronggang Wang
Abstract:
Unsupervised Multi-View Stereo (MVS) methods have achieved promising progress recently. However, previous methods primarily depend on the photometric consistency assumption, which may suffer from two limitations: indistinguishable regions and view-dependent effects, e.g., low-textured areas and reflections. To address these issues, in this paper, we propose a new dual-level contrastive learning approach, named CL-MVSNet. Specifically, our model integrates two contrastive branches into an unsupervised MVS framework to construct additional supervisory signals. On the one hand, we present an image-level contrastive branch to guide the model to acquire more context awareness, thus leading to more complete depth estimation in indistinguishable regions. On the other hand, we exploit a scene-level contrastive branch to boost the representation ability, improving robustness to view-dependent effects. Moreover, to recover more accurate 3D geometry, we introduce an L0.5 photometric consistency loss, which encourages the model to focus more on accurate points while mitigating the gradient penalty of undesirable ones. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that our approach achieves state-of-the-art performance among all end-to-end unsupervised MVS frameworks and outperforms its supervised counterpart by a considerable margin without fine-tuning.
Submitted 11 March, 2025;
originally announced March 2025.
-
ObjectMover: Generative Object Movement with Video Prior
Authors:
Xin Yu,
Tianyu Wang,
Soo Ye Kim,
Paul Guerrero,
Xi Chen,
Qing Liu,
Zhe Lin,
Xiaojuan Qi
Abstract:
Simple as it seems, moving an object to another location within an image is, in fact, a challenging image-editing task that requires re-harmonizing the lighting, adjusting the pose based on perspective, accurately filling occluded regions, and ensuring coherent synchronization of shadows and reflections while maintaining the object identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is that we model this task as a sequence-to-sequence problem and fine-tune a video generation model to leverage its knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.
Submitted 11 March, 2025;
originally announced March 2025.
-
Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
Authors:
Bozhi Luan,
Wengang Zhou,
Hao Feng,
Zhe Wang,
Xiaosong Li,
Houqiang Li
Abstract:
As the computational needs of Large Vision-Language Models (LVLMs) increase, visual token pruning has proven effective in improving inference speed and memory efficiency. Traditional pruning methods in LVLMs predominantly focus on attention scores to determine token relevance, overlooking critical aspects such as spatial position and token similarity. To this end, we introduce AdaptPrune, a novel plug-and-play training-free pruning method that builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our method is based on several observed phenomena in large models: the positional bias in the model's image attention and the redundancy of token information ignored by previous approaches. By integrating attention, spatial, and similarity information, our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions. Our method has been extensively tested across various LVLMs and benchmarks, confirming its robustness and adaptability. The results demonstrate that AdaptPrune consistently outperforms existing methods across various pruning ratios. Code is available at https://github.com/bzluan/AdaptPrune.
Submitted 10 March, 2025;
originally announced March 2025.
-
Gemini Embedding: Generalizable Embeddings from Gemini
Authors:
Jinhyuk Lee,
Feiyang Chen,
Sahil Dua,
Daniel Cer,
Madhuri Shanbhogue,
Iftekhar Naim,
Gustavo Hernández Ábrego,
Zhe Li,
Kaifeng Chen,
Henrique Schechter Vera,
Xiaoqi Ren,
Shanfeng Zhang,
Daniel Salz,
Michael Boratko,
Jay Han,
Blair Chen,
Shuo Huang,
Vikram Rao,
Paul Suganthan,
Feng Han,
Andreas Doumanoglou,
Nithi Gupta,
Fedor Moiseev,
Cathy Yip,
Aashi Jain, et al. (22 additional authors not shown)
Abstract:
In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.
Submitted 10 March, 2025;
originally announced March 2025.
-
Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
Authors:
Wei Junhao,
Yu Zhe,
Sakuma Jun
Abstract:
Model merging is a technique that combines multiple finetuned models into a single model without additional training, allowing a free-rider to cheaply inherit specialized capabilities. This study investigates methodologies to suppress unwanted model merging by free-riders. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. In contrast, we propose a first proactive defense against model merging. Specifically, our defense method modifies the model parameters so that the model is disrupted if the model is merged with any other model, while its functionality is kept unchanged if not merged with others. Our approach consists of two modules, rearranging MLP parameters and scaling attention heads, which push the model out of the shared basin in parameter space, causing the merging performance with other models to degrade significantly. We conduct extensive experiments on image classification, image generation, and text classification to demonstrate that our defense severely disrupts merging while retaining the functionality of the post-protect model. Moreover, we analyze potential adaptive attacks and further propose a dropout-based pruning to improve our proposal's robustness.
Submitted 8 March, 2025;
originally announced March 2025.
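The idea that rearranging MLP parameters can leave a model's function unchanged while disrupting merging rests on a well-known symmetry: permuting the hidden units of an MLP, together with the matching columns of the next layer, produces the same outputs, yet naive weight averaging of the original and the permuted copy does not. The sketch below illustrates only that symmetry on a tiny two-layer ReLU network. The weights, the permutation, and the use of plain averaging as the merging rule are all assumptions for illustration; this is not the defense proposed in the paper.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(W1, b1, W2, b2, x):
    """Two-layer MLP: W2 @ relu(W1 @ x + b1) + b2."""
    h = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    return [a + b for a, b in zip(matvec(W2, h), b2)]


def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1 and b1, and the matching columns of W2."""
    W1p = [W1[j] for j in perm]
    b1p = [b1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, b1p, W2p


def average(A, B):
    """Element-wise mean of two matrices (assumed stand-in for merging)."""
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Toy weights chosen for illustration only.
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0]], [0.0]
x = [2.0, 1.0]

W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm=[1, 0])

original = mlp(W1, b1, W2, b2, x)
permuted = mlp(W1p, b1p, W2p, b2, x)
merged = mlp(average(W1, W1p), b1, average(W2, W2p), b2, x)

print(original, permuted, merged)  # prints [1.0] [1.0] [0.0]
```

The permuted network matches the original exactly, while the averaged weights collapse the output to zero, which shows how a function-preserving reparameterization can still break weight-space merging.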
-
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Authors:
Jiacheng Ruan,
Wenzhen Yuan,
Xian Gao,
Ye Guo,
Daoxin Zhang,
Zhe Xu,
Yao Hu,
Ting Liu,
Yuzhuo Fu
Abstract:
Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), which limits all-round evaluation and restricts the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. VLRMBench is constructed from three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in terms of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, on 'Forecasting Future', a binary classification task, the advanced GPT-4o achieves only 76.0% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs. Code and datasets will be available at https://github.com/JCruan519/VLRMBench.
Submitted 10 March, 2025;
originally announced March 2025.
-
Quantum spin dynamics of the honeycomb magnet K$_2$Co$_2$TeO$_6$ in high magnetic fields
Authors:
Patrick Pilch,
Laur Peedu,
Urmas Nagel,
Toomas Rõõm,
Changqing Zhu,
Yurii Skourski,
Xianghan Xu,
Robert J. Cava,
Zhe Wang
Abstract:
We present terahertz spectroscopic measurements of quantum spin dynamics in the honeycomb magnet K$_2$Co$_2$TeO$_6$ as a function of temperature, polarization, and external magnetic field applied in the honeycomb plane. Magnetic excitations are resolved below the magnetic ordering temperature of $T_\text{N}$ = 12 K. In the applied magnetic field, we reveal characteristic field dependence not only for the magnetic excitations observed at zero field, but also for a rich set of modes emerging in finite fields. The observed magnetic excitations exhibit clear dependence on the terahertz polarization, and characteristic features at field-induced phase transitions consistent with our high-field magnetization data. We cannot clearly resolve a continuum-like feature, even when the long-range magnetic order is presumably suppressed in the strong magnetic field, indicating that a Kitaev-type interaction, if present, is subleading in this compound.
Submitted 10 March, 2025;
originally announced March 2025.