-
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Authors:
Ke Wang,
Junting Pan,
Linda Wei,
Aojun Zhou,
Weikang Shi,
Zimu Lu,
Han Xiao,
Yunqiao Yang,
Houxing Ren,
Mingjie Zhan,
Hongsheng Li
Abstract:
Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inhe…
▽ More
Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with model-in-the-loop approach, resulting in an image-to-code model, FigCodifier and ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
REMEDI: Relative Feature Enhanced Meta-Learning with Distillation for Imbalanced Prediction
Authors:
Fei Liu,
Huanhuan Ren,
Yu Guan,
Xiuxu Wang,
Wang Lv,
Zhiqiang Hu,
Yaxi Chen
Abstract:
Predicting future vehicle purchases among existing owners presents a critical challenge due to extreme class imbalance (<0.5% positive rate) and complex behavioral patterns. We propose REMEDI (Relative feature Enhanced Meta-learning with Distillation for Imbalanced prediction), a novel multi-stage framework addressing these challenges. REMEDI first trains diverse base models to capture complementa…
▽ More
Predicting future vehicle purchases among existing owners presents a critical challenge due to extreme class imbalance (<0.5% positive rate) and complex behavioral patterns. We propose REMEDI (Relative feature Enhanced Meta-learning with Distillation for Imbalanced prediction), a novel multi-stage framework addressing these challenges. REMEDI first trains diverse base models to capture complementary aspects of user behavior. Second, inspired by comparative op-timization techniques, we introduce relative performance meta-features (deviation from ensemble mean, rank among peers) for effective model fusion through a hybrid-expert architecture. Third, we distill the ensemble's knowledge into a single efficient model via supervised fine-tuning with MSE loss, enabling practical deployment. Evaluated on approximately 800,000 vehicle owners, REMEDI significantly outperforms baseline approaches, achieving the business target of identifying ~50% of actual buyers within the top 60,000 recommendations at ~10% precision. The distilled model preserves the ensemble's predictive power while maintaining deployment efficiency, demonstrating REMEDI's effectiveness for imbalanced prediction in industry settings.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
AugMixCloak: A Defense against Membership Inference Attacks via Image Transformation
Authors:
Heqing Ren,
Chao Feng,
Alberto Huertas,
Burkhard Stiller
Abstract:
Traditional machine learning (ML) raises serious privacy concerns, while federated learning (FL) mitigates the risk of data leakage by keeping data on local devices. However, the training process of FL can still leak sensitive information, which adversaries may exploit to infer private data. One of the most prominent threats is the membership inference attack (MIA), where the adversary aims to det…
▽ More
Traditional machine learning (ML) raises serious privacy concerns, while federated learning (FL) mitigates the risk of data leakage by keeping data on local devices. However, the training process of FL can still leak sensitive information, which adversaries may exploit to infer private data. One of the most prominent threats is the membership inference attack (MIA), where the adversary aims to determine whether a particular data record was part of the training set.
This paper addresses this problem through a two-stage defense called AugMixCloak. The core idea is to apply data augmentation and principal component analysis (PCA)-based information fusion to query images, which are detected by perceptual hashing (pHash) as either identical to or highly similar to images in the training set. Experimental results show that AugMixCloak successfully defends against both binary classifier-based MIA and metric-based MIA across five datasets and various decentralized FL (DFL) topologies. Compared with regularization-based defenses, AugMixCloak demonstrates stronger protection. Compared with confidence score masking, AugMixCloak exhibits better generalization.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
ElectricSight: 3D Hazard Monitoring for Power Lines Using Low-Cost Sensors
Authors:
Xingchen Li,
LiDian Wang,
Yu Sheng,
ZhiPeng Tang,
Haojie Ren,
Guoliang You,
YiFan Duan,
Jianmin Ji,
Yanyong Zhang
Abstract:
Protecting power transmission lines from potential hazards involves critical tasks, one of which is the accurate measurement of distances between power lines and potential threats, such as large cranes. The challenge with this task is that the current sensor-based methods face challenges in balancing accuracy and cost in distance measurement. A common practice is to install cameras on transmission…
▽ More
Protecting power transmission lines from potential hazards involves critical tasks, one of which is the accurate measurement of distances between power lines and potential threats, such as large cranes. The challenge with this task is that the current sensor-based methods face challenges in balancing accuracy and cost in distance measurement. A common practice is to install cameras on transmission towers, which, however, struggle to measure true 3D distances due to the lack of depth information. Although 3D lasers can provide accurate depth data, their high cost makes large-scale deployment impractical.
To address this challenge, we present ElectricSight, a system designed for 3D distance measurement and monitoring of potential hazards to power transmission lines. This work's key innovations lie in both the overall system framework and a monocular depth estimation method. Specifically, the system framework combines real-time images with environmental point cloud priors, enabling cost-effective and precise 3D distance measurements. As a core component of the system, the monocular depth estimation method enhances the performance by integrating 3D point cloud data into image-based estimates, improving both the accuracy and reliability of the system.
To assess ElectricSight's performance, we conducted tests with data from a real-world power transmission scenario. The experimental results demonstrate that ElectricSight achieves an average accuracy of 1.08 m for distance measurements and an early warning accuracy of 92%.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach
Authors:
Qian Peng,
Yajie Bao,
Haojie Ren,
Zhaojun Wang,
Changliang Zou
Abstract:
Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of cellwise outliers. To address this issue, this paper introduces a novel framework called detect-then-impute…
▽ More
Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of cellwise outliers. To address this issue, this paper introduces a novel framework called detect-then-impute conformal prediction. This framework first employs an outlier detection procedure on the test feature and then utilizes an imputation method to fill in those cells identified as outliers. To quantify the uncertainty in the processed test feature, we adaptively apply the detection and imputation procedures to the calibration set, thereby constructing exchangeable features for the conformal prediction interval of the test label. We develop two practical algorithms, PDI-CP and JDI-CP, and provide a distribution-free coverage analysis under some commonly used detection and imputation procedures. Notably, JDI-CP achieves a finite sample $1-2α$ coverage guarantee. Numerical experiments on both synthetic and real datasets demonstrate that our proposed algorithms exhibit robust coverage properties and comparable efficiency to the oracle baseline.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Authors:
Zimu Lu,
Yunqiao Yang,
Houxing Ren,
Haotian Hou,
Han Xiao,
Ke Wang,
Weikang Shi,
Aojun Zhou,
Mingjie Zhan,
Hongsheng Li
Abstract:
LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. The…
▽ More
LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8\% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2\%, surpassing the performance of the best proprietary model.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Multimodal Graph Representation Learning for Robust Surgical Workflow Recognition with Adversarial Feature Disentanglement
Authors:
Long Bai,
Boyi Ma,
Ruohan Wang,
Guankun Wang,
Beilei Cui,
Zhongliang Jiang,
Mobarakol Islam,
Zhe Min,
Jiewen Lai,
Nassir Navab,
Hongliang Ren
Abstract:
Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust…
▽ More
Surgical workflow recognition is vital for automating tasks, supporting decision-making, and training novice surgeons, ultimately improving patient safety and standardizing procedures. However, data corruption can lead to performance degradation due to issues like occlusion from bleeding or smoke in surgical scenes and problems with data storage and transmission. In this case, we explore a robust graph-based multimodal approach to integrating vision and kinematic data to enhance accuracy and reliability. Vision data captures dynamic surgical scenes, while kinematic data provides precise movement information, overcoming limitations of visual recognition under adverse conditions. We propose a multimodal Graph Representation network with Adversarial feature Disentanglement (GRAD) for robust surgical workflow recognition in challenging scenarios with domain shifts or corrupted data. Specifically, we introduce a Multimodal Disentanglement Graph Network that captures fine-grained visual information while explicitly modeling the complex relationships between vision and kinematic embeddings through graph-based message modeling. To align feature spaces across modalities, we propose a Vision-Kinematic Adversarial framework that leverages adversarial training to reduce modality gaps and improve feature consistency. Furthermore, we design a Contextual Calibrated Decoder, incorporating temporal and contextual priors to enhance robustness against domain shifts and corrupted data. Extensive comparative and ablation experiments demonstrate the effectiveness of our model and proposed modules. Moreover, our robustness experiments show that our method effectively handles data corruption during storage and transmission, exhibiting excellent stability and robustness. Our approach aims to advance automated surgical workflow recognition, addressing the complexities and dynamism inherent in surgical procedures.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Deep Physics Prior for First Order Inverse Optimization
Authors:
Haoyu Yang,
Kamyar Azizzadenesheli,
Haoxing Ren
Abstract:
Inverse design optimization aims to infer system parameters from observed solutions, posing critical challenges across domains such as semiconductor manufacturing, structural engineering, materials science, and fluid dynamics. The lack of explicit mathematical representations in many systems complicates this process and makes the first order optimization impossible. Mainstream approaches, includin…
▽ More
Inverse design optimization aims to infer system parameters from observed solutions, posing critical challenges across domains such as semiconductor manufacturing, structural engineering, materials science, and fluid dynamics. The lack of explicit mathematical representations in many systems complicates this process and makes the first order optimization impossible. Mainstream approaches, including generative AI and Bayesian optimization, address these challenges but have limitations. Generative AI is computationally expensive, while Bayesian optimization, relying on surrogate models, suffers from scalability, sensitivity to priors, and noise issues, often leading to suboptimal solutions. This paper introduces Deep Physics Prior (DPP), a novel method enabling first-order gradient-based inverse optimization with surrogate machine learning models. By leveraging pretrained auxiliary Neural Operators, DPP enforces prior distribution constraints to ensure robust and meaningful solutions. This approach is particularly effective when prior data and observation distributions are unknown.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
Automated Routing-Informed Placement for Large-Scale Photonic Integrated Circuits
Authors:
Hongjian Zhou,
Haoyu Yang,
Gangi Nicholas,
Haoxing Ren,
Huang Rena,
Jiaqi Gu
Abstract:
As technology advances, photonic integrated circuits (PICs) are rapidly scaling in size and complexity, with modern designs integrating thousands of components. However, the analog custom layout nature of photonics, the curvy waveguide structures, and single-layer routing resources impose stringent physical constraints, such as minimum bend radii and waveguide crossing penalties, which make manual…
▽ More
As technology advances, photonic integrated circuits (PICs) are rapidly scaling in size and complexity, with modern designs integrating thousands of components. However, the analog custom layout nature of photonics, the curvy waveguide structures, and single-layer routing resources impose stringent physical constraints, such as minimum bend radii and waveguide crossing penalties, which make manual layout the de facto standard. This manual process takes weeks to complete and is error-prone, which is fundamentally unscalable for large-scale PIC systems. Existing automation solutions have adopted force-directed placement on small benchmarks with tens of components, with limited routability and scalability. To fill this fundamental gap in the electronic-photonic design automation (EPDA) toolchain, we present the first GPU-accelerated, routing-informed placement framework. It features an asymmetric bending-aware wirelength function with explicit modeling of waveguide routing congestion and crossings for routability maximization. Meanwhile, conditional projection is employed to gradually enforce a variety of user-defined layout constraints, including alignment, spacing, etc. This constrained optimization is accelerated and stabilized by a custom blockwise adaptive Nesterov-accelerated optimizer, ensuring stable and high-quality convergence. Compared to existing methods, our method can generate high-quality layouts for large-scale PICs with an average routing success rate of 94.79% across all benchmarks within minutes. By tightly coupling placement with physical-aware routing, our method establishes a new paradigm for automated PIC design, bringing intelligent, scalable layout synthesis to the forefront of next-generation EPDA. We will open-source our code.
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
Event-Based Eye Tracking. 2025 Event-based Vision Workshop
Authors:
Qinyu Chen,
Chang Gao,
Min Liu,
Daniele Perrone,
Yan Ru Pei,
Zuowen Wang,
Zhuo Zou,
Shihang Tan,
Tao Han,
Guorui Lu,
Zhen Xu,
Junyuan Ding,
Ziteng Wang,
Zongwei Wu,
Han Han,
Yuliang Wu,
Jinze Chen,
Wei Zhai,
Yang Cao,
Zheng-jun Zha,
Nuwan Bandara,
Thivya Kandappu,
Archan Misra,
Xiaopeng Lin,
Hongxiang Huang
, et al. (7 additional authors not shown)
Abstract:
This survey serves as a review for the 2025 Event-Based Eye Tracking Challenge organized as part of the 2025 CVPR event-based vision workshop. This challenge focuses on the task of predicting the pupil center by processing event camera recorded eye movement. We review and summarize the innovative methods from teams rank the top in the challenge to advance future event-based eye tracking research.…
▽ More
This survey serves as a review for the 2025 Event-Based Eye Tracking Challenge organized as part of the 2025 CVPR event-based vision workshop. This challenge focuses on the task of predicting the pupil center by processing event camera recorded eye movement. We review and summarize the innovative methods from teams rank the top in the challenge to advance future event-based eye tracking research. In each method, accuracy, model size, and number of operations are reported. In this survey, we also discuss event-based eye tracking from the perspective of hardware design.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots
Authors:
Zhang Zhang,
Qiang Zhang,
Wei Cui,
Shuai Shi,
Yijie Guo,
Gang Han,
Wen Zhao,
Hengle Ren,
Renjing Xu,
Jian Tang
Abstract:
3D occupancy prediction enables the robots to obtain spatial fine-grained geometry and semantics of the surrounding scene, and has become an essential task for embodied perception. Existing methods based on 3D Gaussians instead of dense voxels do not effectively exploit the geometry and opacity properties of Gaussians, which limits the network's estimation of complex environments and also limits t…
▽ More
3D occupancy prediction enables the robots to obtain spatial fine-grained geometry and semantics of the surrounding scene, and has become an essential task for embodied perception. Existing methods based on 3D Gaussians instead of dense voxels do not effectively exploit the geometry and opacity properties of Gaussians, which limits the network's estimation of complex environments and also limits the description of the scene by 3D Gaussians. In this paper, we propose a 3D occupancy prediction method which enhances the geometric and semantic scene understanding for robots, dubbed RoboOcc. It utilizes the Opacity-guided Self-Encoder (OSE) to alleviate the semantic ambiguity of overlapping Gaussians and the Geometry-aware Cross-Encoder (GCE) to accomplish the fine-grained geometric modeling of the surrounding scene. We conduct extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet datasets, and our RoboOcc achieves state-of the-art performance in both local and global camera settings. Further, in ablation studies of Gaussian parameters, the proposed RoboOcc outperforms the state-of-the-art methods by a large margin of (8.47, 6.27) in IoU and mIoU metric, respectively. The codes will be released soon.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control
Authors:
Lifeng Lin,
Rongfeng Lu,
Quan Chen,
Haofan Ren,
Ming Lu,
Yaoqi Sun,
Chenggang Yan,
Anke Xue
Abstract:
Sparse-view 3D reconstruction is a fundamental yet challenging task in practical 3D reconstruction applications. Recently, many methods based on the 3D Gaussian Splatting (3DGS) framework have been proposed to address sparse-view 3D reconstruction. Although these methods have made considerable advancements, they still show significant issues with overfitting. To reduce the overfitting, we introduc…
▽ More
Sparse-view 3D reconstruction is a fundamental yet challenging task in practical 3D reconstruction applications. Recently, many methods based on the 3D Gaussian Splatting (3DGS) framework have been proposed to address sparse-view 3D reconstruction. Although these methods have made considerable advancements, they still show significant issues with overfitting. To reduce the overfitting, we introduce VGNC, a novel Validation-guided Gaussian Number Control (VGNC) approach based on generative novel view synthesis (NVS) models. To the best of our knowledge, this is the first attempt to alleviate the overfitting issue of sparse-view 3DGS with generative validation images. Specifically, we first introduce a validation image generation method based on a generative NVS model. We then propose a Gaussian number control strategy that utilizes generated validation images to determine the optimal Gaussian numbers, thereby reducing the issue of overfitting. We conducted detailed experiments on various sparse-view 3DGS baselines and datasets to evaluate the effectiveness of VGNC. Extensive experiments show that our approach not only reduces overfitting but also improves rendering quality on the test set while decreasing the number of Gaussian points. This reduction lowers storage demands and accelerates both training and rendering. The code will be released.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Authors:
ByteDance Seed,
:,
Jiaze Chen,
Tiantian Fan,
Xin Liu,
Lingjun Liu,
Zhiqi Lin,
Mingxuan Wang,
Chengyi Wang,
Xiangpeng Wei,
Wenyuan Xu,
Yufeng Yuan,
Yu Yue,
Lin Yan,
Qiying Yu,
Xiaochen Zuo,
Chi Zhang,
Ruofei Zhu,
Zhecheng An,
Zhihao Bai,
Yu Bao,
Xingyan Bin,
Jiangjie Chen,
Feng Chen,
Hongmin Chen
, et al. (249 additional authors not shown)
Abstract:
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…
▽ More
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
△ Less
Submitted 29 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
DG-MVP: 3D Domain Generalization via Multiple Views of Point Clouds for Classification
Authors:
Huantao Ren,
Minmin Yang,
Senem Velipasalar
Abstract:
Deep neural networks have achieved significant success in 3D point cloud classification while relying on large-scale, annotated point cloud datasets, which are labor-intensive to build. Compared to capturing data with LiDAR sensors and then performing annotation, it is relatively easier to sample point clouds from CAD models. Yet, data sampled from CAD models is regular, and does not suffer from o…
▽ More
Deep neural networks have achieved significant success in 3D point cloud classification while relying on large-scale, annotated point cloud datasets, which are labor-intensive to build. Compared to capturing data with LiDAR sensors and then performing annotation, it is relatively easier to sample point clouds from CAD models. Yet, data sampled from CAD models is regular, and does not suffer from occlusion and missing points, which are very common for LiDAR data, creating a large domain shift. Therefore, it is critical to develop methods that can generalize well across different point cloud domains. %In this paper, we focus on the 3D point cloud domain generalization problem. Existing 3D domain generalization methods employ point-based backbones to extract point cloud features. Yet, by analyzing point utilization of point-based methods and observing the geometry of point clouds from different domains, we have found that a large number of point features are discarded by point-based methods through the max-pooling operation. This is a significant waste especially considering the fact that domain generalization is more challenging than supervised learning, and point clouds are already affected by missing points and occlusion to begin with. To address these issues, we propose a novel method for 3D point cloud domain generalization, which can generalize to unseen domains of point clouds. Our proposed method employs multiple 2D projections of a 3D point cloud to alleviate the issue of missing points and involves a simple yet effective convolution-based model to extract features. The experiments, performed on the PointDA-10 and Sim-to-Real benchmarks, demonstrate the effectiveness of our proposed method, which outperforms different baselines, and can transfer well from synthetic domain to real-world domain.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap
Authors:
Minmin Yang,
Huantao Ren,
Senem Velipasalar
Abstract:
Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a…
▽ More
Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available at \href{https://github.com/LexieYang/3D-PointZshotS}{Github}.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Authors:
Hairui Ren,
Fan Tang,
He Zhao,
Zixuan Wang,
Dandan Guo,
Yi Chang
Abstract:
Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this p…
▽ More
Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called \textbf{A}ugmenting D\textbf{i}scriminative \textbf{R}ichness via Diffusions (AiR), toward learning a richer discriminating way to represent the class comprehensively and thus facilitate classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms-UL, SSL, and TRZSL-demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Timing Analysis Agent: Autonomous Multi-Corner Multi-Mode (MCMM) Timing Debugging with Timing Debug Relation Graph
Authors:
Jatin Nainani,
Chia-Tung Ho,
Anirudh Dhurka,
Haoxing Ren
Abstract:
Timing analysis is an essential and demanding verification method for Very Large Scale Integrated (VLSI) circuit design and optimization. In addition, it also serves as the cornerstone of the final sign-off, determining whether the chip is ready to be sent to the semiconductor foundry for fabrication. Recently, as the technology advance relentlessly, smaller metal pitches and the increasing number…
▽ More
Timing analysis is an essential and demanding verification method for Very Large Scale Integrated (VLSI) circuit design and optimization. In addition, it also serves as the cornerstone of the final sign-off, determining whether the chip is ready to be sent to the semiconductor foundry for fabrication. Recently, as the technology advance relentlessly, smaller metal pitches and the increasing number of devices have led to greater challenges and longer turn-around-time for experienced human designers to debug timing issues from the Multi-Corner Multi-Mode (MCMM) timing reports. As a result, an efficient and intelligent methodology is highly necessary and essential for debugging timing issues and reduce the turnaround times. Recently, Large Language Models (LLMs) have shown great promise across various tasks in language understanding and interactive decision-making, incorporating reasoning and actions. In this work, we propose a timing analysis agent, that is empowered by multi-LLMs task solving, and incorporates a novel hierarchical planning and solving flow to automate the analysis of timing reports from commercial tool. In addition, we build a Timing Debug Relation Graph (TDRG) that connects the reports with the relationships of debug traces from experienced timing engineers. The timing analysis agent employs the novel Agentic Retrieval Augmented Generation (RAG) approach, that includes agent and coding to retrieve data accurately, on the developed TDRG. In our studies, the proposed timing analysis agent achieves an average 98% pass-rate on a single-report benchmark and a 90% pass-rate for multi-report benchmark from industrial designs, demonstrating its effectiveness and adaptability.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models
Authors:
Hao Ren,
Yiming Zeng,
Zetong Bi,
Zhaoliang Wan,
Junlong Huang,
Hui Cheng
Abstract:
Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by initiating from denoising Gaussian noise. However, the target action distribution oft…
▽ More
Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by initiating from denoising Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose a novel, unified visual navigation framework leveraging the denoising diffusion bridge models named NaviBridger. This approach enables action generation by initiating from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Code is available at https://github.com/hren20/NaiviBridger.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
NaviDiffusor: Cost-Guided Diffusion Model for Visual Navigation
Authors:
Yiming Zeng,
Hao Ren,
Shuhang Wang,
Junlong Huang,
Hui Cheng
Abstract:
Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success r…
▽ More
Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Marco: Configurable Graph-Based Task Solving and Multi-AI Agents Framework for Hardware Design
Authors:
Chia-Tung Ho,
Jing Gong,
Yunsheng Bai,
Chenhui Deng,
Haoxing Ren,
Brucek Khailany
Abstract:
Hardware design presents numerous challenges stemming from its complexity and advancing technologies. These challenges result in longer turn-around-time (TAT) for optimizing performance, power, area, and cost (PPAC) during synthesis, verification, physical design, and reliability loops. Large Language Models (LLMs) have shown remarkable capacity to comprehend and generate natural language at a mas…
▽ More
Hardware design presents numerous challenges stemming from its complexity and advancing technologies. These challenges result in longer turn-around-time (TAT) for optimizing performance, power, area, and cost (PPAC) during synthesis, verification, physical design, and reliability loops. Large Language Models (LLMs) have shown remarkable capacity to comprehend and generate natural language at a massive scale, leading to many potential applications and benefits across various domains. Successful LLM-based agents for hardware design can drastically reduce TAT, leading to faster product cycles, lower costs, improved design reliability and reduced risk of costly errors. In this work, we propose a unified framework, Marco, that integrates configurable graph-based task solving with multi-modality and multi-AI agents for chip design by leveraging the natural language and reasoning abilities with collaborative toolkits. Lastly, we demonstrate promising performance, productivity, and efficiency of LLM agents by leveraging the Marco framework on layout optimization, Verilog/design rule checker (DRC) coding, and timing analysis tasks.
△ Less
Submitted 25 February, 2025;
originally announced April 2025.
-
MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
Authors:
Juncheng Wu,
Wenlong Deng,
Xingxuan Li,
Sheng Liu,
Taomian Mi,
Yifan Peng,
Ziyang Xu,
Yi Liu,
Hyunjin Cho,
Chang-In Choi,
Yihan Cao,
Hui Ren,
Xiang Li,
Xiaoxiao Li,
Yuyin Zhou
Abstract:
Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical rea…
▽ More
Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or ``thinking paths'', which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Ditill-8B. Our top-performing model, MedReason-8B, outperforms the Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code is available at https://github.com/UCSC-VLAA/MedReason.
△ Less
Submitted 4 April, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge
Authors:
Adam Schmidt,
Mert Asim Karaoglu,
Soham Sinha,
Mingang Jang,
Ho-Gun Ha,
Kyungmin Jung,
Kyeongmo Gu,
Ihsan Ullah,
Hyunki Lee,
Jonáš Šerých,
Michal Neoral,
Jiří Matas,
Rulin Zhou,
Wenlong He,
An Wang,
Hongliang Ren,
Bruno Silva,
Sandro Queirós,
Estêvão Lima,
João L. Vilaça,
Shunsuke Kikuchi,
Atsushi Kouno,
Hiroki Matsuzaki,
Tongtong Li,
Yulu Chen
, et al. (15 additional authors not shown)
Abstract:
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to a…
▽ More
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: https://zenodo.org/records/14803158 , and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
Exploring Temporal Dynamics in Event-based Eye Tracker
Authors:
Hongwei Ren,
Xiaopeng Lin,
Hongxiang Huang,
Yue Zhou,
Bojun Cheng
Abstract:
Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vis…
▽ More
Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vision systems, are capable of perceiving eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for achieving high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker utilizes 3D convolutional neural networks to capture implicit short-term temporal dynamics and employs a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics. Ultimately, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset and secured Third place in the CVPR event-based eye-tracking challenge 2025. Our code is available at https://github.com/rhwxmx/TDTracker.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
Authors:
Boyi Ma,
Yanguang Zhao,
Jie Wang,
Guankun Wang,
Kun Yuan,
Tong Chen,
Long Bai,
Hongliang Ren
Abstract:
The DeepSeek models have shown exceptional performance in general scene understanding, question-answering (QA), and text generation tasks, owing to their efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of the DeepSeek model in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Descrip…
▽ More
The DeepSeek models have shown exceptional performance in general scene understanding, question-answering (QA), and text generation tasks, owing to their efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of the DeepSeek model in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our empirical study shows that, compared to existing general-purpose multimodal large language models, DeepSeek-VL2 performs better on complex understanding tasks in surgical scenes. Additionally, although DeepSeek-V3 is purely a language model, we find that when image tokens are directly inputted, the model demonstrates better performance on single-sentence QA tasks. However, overall, the DeepSeek models still fall short of meeting the clinical requirements for understanding surgical scenes. Under general prompts, DeepSeek models lack the ability to effectively analyze global surgical concepts and fail to provide detailed insights into surgical scenarios. Based on our observations, we argue that the DeepSeek models are not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.
△ Less
Submitted 3 April, 2025; v1 submitted 29 March, 2025;
originally announced March 2025.
-
Learning Library Cell Representations in Vector Space
Authors:
Rongjian Liang,
Yi-Chen Lu,
Wen-Hao Liu,
Haoxing Ren
Abstract:
We propose Lib2Vec, a novel self-supervised framework to efficiently learn meaningful vector representations of library cells, enabling ML models to capture essential cell semantics. The framework comprises three key components: (1) an automated method for generating regularity tests to quantitatively evaluate how well cell representations reflect inter-cell relationships; (2) a self-supervised le…
▽ More
We propose Lib2Vec, a novel self-supervised framework to efficiently learn meaningful vector representations of library cells, enabling ML models to capture essential cell semantics. The framework comprises three key components: (1) an automated method for generating regularity tests to quantitatively evaluate how well cell representations reflect inter-cell relationships; (2) a self-supervised learning scheme that systematically extracts training data from Liberty files, removing the need for costly labeling; and (3) an attention-based model architecture that accommodates various pin counts and enables the creation of property-specific cell and arc embeddings. Experimental results demonstrate that Lib2Vec effectively captures functional and electrical similarities. Moreover, linear algebraic operations on cell vectors reveal meaningful relationships, such as vector(BUF) - vector(INV) + vector(NAND) ~ vector(AND), showcasing the framework's nuanced representation capabilities. Lib2Vec also enhances downstream circuit learning applications, especially when labeled data is scarce.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision
Authors:
Rulin Zhou,
Wenlong He,
An Wang,
Qiqi Yao,
Haijun Hu,
Jiankun Wang,
Xi Zhang an Hongliang Ren
Abstract:
Accurate tissue point tracking in endoscopic videos is critical for robotic-assisted surgical navigation and scene understanding, but remains challenging due to complex deformations, instrument occlusion, and the scarcity of dense trajectory annotations. Existing methods struggle with long-term tracking under these conditions due to limited feature utilization and annotation dependence. We present…
▽ More
Accurate tissue point tracking in endoscopic videos is critical for robotic-assisted surgical navigation and scene understanding, but remains challenging due to complex deformations, instrument occlusion, and the scarcity of dense trajectory annotations. Existing methods struggle with long-term tracking under these conditions due to limited feature utilization and annotation dependence. We present Endo-TTAP, a novel framework addressing these challenges through: (1) A Multi-Facet Guided Attention (MFGA) module that synergizes multi-scale flow dynamics, DINOv2 semantic embeddings, and explicit motion patterns to jointly predict point positions with uncertainty and occlusion awareness; (2) A two-stage curriculum learning strategy employing an Auxiliary Curriculum Adapter (ACA) for progressive initialization and hybrid supervision. Stage I utilizes synthetic data with optical flow ground truth for uncertainty-occlusion regularization, while Stage II combines unsupervised flow consistency and semi-supervised learning with refined pseudo-labels from off-the-shelf trackers. Extensive validation on two MICCAI Challenge datasets and our collected dataset demonstrates that Endo-TTAP achieves state-of-the-art performance in tissue point tracking, particularly in scenarios characterized by complex endoscopic conditions. The source code and dataset will be available at https://anonymous.4open.science/r/Endo-TTAP-36E5.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
Authors:
Shijie Zhou,
Hui Ren,
Yijia Weng,
Shuwang Zhang,
Zhen Wang,
Dejia Xu,
Zhiwen Fan,
Suya You,
Zhangyang Wang,
Leonidas Guibas,
Achuta Kadambi
Abstract:
Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets,…
▽ More
Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation model into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.
△ Less
Submitted 28 March, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
AssertionForge: Enhancing Formal Verification Assertion Generation with Structured Representation of Specifications and RTL
Authors:
Yunsheng Bai,
Ghaith Bany Hamad,
Syed Suhaib,
Haoxing Ren
Abstract:
Generating SystemVerilog Assertions (SVAs) from natural language specifications remains a major challenge in formal verification (FV) due to the inherent ambiguity and incompleteness of specifications. Existing LLM-based approaches, such as AssertLLM, focus on extracting information solely from specification documents, often failing to capture essential internal signal interactions and design deta…
▽ More
Generating SystemVerilog Assertions (SVAs) from natural language specifications remains a major challenge in formal verification (FV) due to the inherent ambiguity and incompleteness of specifications. Existing LLM-based approaches, such as AssertLLM, focus on extracting information solely from specification documents, often failing to capture essential internal signal interactions and design details present in the RTL code, leading to incomplete or incorrect assertions. We propose a novel approach that constructs a Knowledge Graph (KG) from both specifications and RTL, using a hardware-specific schema with domain-specific entity and relation types. We create an initial KG from the specification and then systematically fuse it with information extracted from the RTL code, resulting in a unified, comprehensive KG. This combined representation enables a more thorough understanding of the design and allows for a multi-resolution context synthesis process which is designed to extract diverse verification contexts from the KG. Experiments on four designs demonstrate that our method significantly enhances SVA quality over prior methods. This structured representation not only improves FV but also paves the way for future research in tasks like code generation and design understanding.
△ Less
Submitted 14 May, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
Structure-Aware Correspondence Learning for Relative Pose Estimation
Authors:
Yihan Chen,
Wenfei Yang,
Huan Ren,
Shifeng Zhang,
Tianzhu Zhang,
Feng Wu
Abstract:
Relative pose estimation provides a promising way for achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have small or no overlapping r…
▽ More
Relative pose estimation provides a promising way for achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of kepoints that can represent the structure of objects with different shapes and appearance, under the guidance of a keypoint based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, i.e., with 5.7°reduction in mean angular error on the CO3D dataset.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras
Authors:
Beilei Cui,
Long Bai,
Mobarakol Islam,
An Wang,
Zhiqi Ma,
Yiming Huang,
Feng Li,
Zhen Chen,
Zhongliang Jiang,
Nassir Navab,
Hongliang Ren
Abstract:
Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to s…
▽ More
Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to suboptimal results. However, the visual features from these models can still enhance endoscopic tasks, emphasizing the need for efficient adaptation strategies, which still lack exploration currently. In this paper, we introduce Endo3DAC, a unified framework for endoscopic scene reconstruction that efficiently adapts foundation models. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. By freezing the backbone foundation model and training only the specially designed Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) with separate decoder heads, Endo3DAC achieves superior depth and pose estimation while maintaining training efficiency. Additionally, we propose a 3D scene reconstruction pipeline that optimizes depth maps' scales, shifts, and a few parameters based on our integrated network. Extensive experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. To our knowledge, we are the first to utilize a single network that only requires surgical videos to perform both SSL depth estimation and scene reconstruction tasks. The code will be released upon acceptance.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation
Authors:
Huan Ren,
Wenfei Yang,
Xiang Liu,
Shifeng Zhang,
Tianzhu Zhang
Abstract:
Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from…
▽ More
Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformation via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. Firstly, We endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformation. Secondly, the spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, thus mitigating the interference of noise and incomplete point cloud. Lastly, a hyperbolic correspondence loss function is designed to distinguish subtle distinctions, which can promote the precision of correspondence prediction. Experimental results on CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.
△ Less
Submitted 19 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Rubikon: Intelligent Tutoring for Rubik's Cube Learning Through AR-enabled Physical Task Reconfiguration
Authors:
Haocheng Ren,
Muzhe Wu,
Gregory Croisdale,
Anhong Guo,
Xu Wang
Abstract:
Learning to solve a Rubik's Cube requires the learners to repeatedly practice a skill component, e.g., identifying a misplaced square and putting it back. However, for 3D physical tasks such as this, generating sufficient repeated practice opportunities for learners can be challenging, in part because it is difficult for novices to reconfigure the physical object to specific states. We propose Rub…
▽ More
Learning to solve a Rubik's Cube requires the learners to repeatedly practice a skill component, e.g., identifying a misplaced square and putting it back. However, for 3D physical tasks such as this, generating sufficient repeated practice opportunities for learners can be challenging, in part because it is difficult for novices to reconfigure the physical object to specific states. We propose Rubikon, an intelligent tutoring system for learning to solve the Rubik's Cube. Rubikon reduces the necessity for repeated manual configurations of the Rubik's Cube without compromising the tactile experience of handling a physical cube. The foundational design of Rubikon is an AR setup, where learners manipulate a physical cube while seeing an AR-rendered cube on a display. Rubikon automatically generates configurations of the Rubik's Cube to target learners' weaknesses and help them exercise diverse knowledge components. In a between-subjects experiment, we showed that Rubikon learners scored 25% higher on a post-test compared to baselines.
△ Less
Submitted 14 May, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
A Framework for Uplink ISAC Receiver Designs: Performance Analysis and Algorithm Development
Authors:
Zhiyuan Yu,
Hong Ren,
Cunhua Pan,
Gui Zhou,
Dongming Wang,
Chau Yuen,
Jiangzhou Wang
Abstract:
Uplink integrated sensing and communication (ISAC) systems have recently emerged as a promising research direction, enabling simultaneous uplink signal detection and target sensing. In this paper, we propose the flexible projection (FP)-type receiver that unify the projection-type receiver and the successive interference cancellation (SIC)-type receiver by using a flexible tradeoff factor to adapt…
▽ More
Uplink integrated sensing and communication (ISAC) systems have recently emerged as a promising research direction, enabling simultaneous uplink signal detection and target sensing. In this paper, we propose the flexible projection (FP)-type receiver that unify the projection-type receiver and the successive interference cancellation (SIC)-type receiver by using a flexible tradeoff factor to adapt to dynamically changing uplink ISAC scenarios. The FP-type receiver addresses the joint signal detection and target response estimation problem through two coordinated phases: 1) Communication signal detection using a reconstructed signal whose composition is controlled by the tradeoff factor, followed by 2) Target response estimation performed through subtraction of the detected communication signal from the received signal. With adjustable tradeoff factors, the FP-type receiver can balance the enhancement of the signal-to-interference-plus-noise ratio (SINR) with the reduction of correlation in the reconstructed signal for communication signal detection. The pairwise error probabilities (PEPs) are analyzed for both the maximum likelihood (ML) and the zero-forcing (ZF) detectors, revealing that the optimal tradeoff factor should be determined based on the adopted detection algorithm and the relative power of the sensing and communication (S\&C) signal. A homotopy optimization framework is first applied for the FP-type receiver with a fixed trade-off factor. This framework is then extended to develop the dynamic FP (DFP)-type receiver, which iteratively adjust the trade-off factor for improved algorithm performance and environmental adaptability. Subsequently, two extensions are explored to further enhance the receiver's performance: parallel DFP (PDFP)-type receiver and a block-structured receiver design. Finally, the effectiveness of the proposed receiver designs is verified via simulations.
△ Less
Submitted 3 April, 2025; v1 submitted 4 March, 2025;
originally announced March 2025.
-
Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
Authors:
Lian Liu,
Shixin Zhao,
Bing Li,
Haimeng Ren,
Zhaohui Xu,
Mengdi Wang,
Xiaowei Li,
Yinhe Han,
Ying Wang
Abstract:
The billion-scale Large Language Models (LLMs) need deployment on expensive server-grade GPUs with large-storage HBMs and abundant computation capability. As LLM-assisted services become popular, achieving cost-effective LLM inference on budget-friendly hardware becomes the trend. Extensive researches relocate LLM parameters from expensive GPUs to host memory. However, the restricted bandwidth bet…
▽ More
The billion-scale Large Language Models (LLMs) need deployment on expensive server-grade GPUs with large-storage HBMs and abundant computation capability. As LLM-assisted services become popular, achieving cost-effective LLM inference on budget-friendly hardware becomes the trend. Extensive researches relocate LLM parameters from expensive GPUs to host memory. However, the restricted bandwidth between the host and GPU memory limits the inference performance.
This work introduces Hermes, a budget-friendly system that leverages the near-data processing (NDP) within commodity DRAM DIMMs to enhance the performance of a single consumer-grade GPU, achieving efficient LLM inference. The inherent activation sparsity in LLMs naturally divides weight parameters into two categories, termed ``hot" and ``cold" neurons, respectively. Hot neurons, which consist of only approximately 20\% of all weight parameters, account for 80\% of the total computational load, while cold neurons make up the other 80\% of parameters but are responsible for just 20\% of the computational load. Therefore, we propose a heterogeneous computing strategy: mapping hot neurons to a single computation-efficient GPU, while offloading cold neurons to NDP-DIMMs, which offer large memory size but limited computation capabilities. Meanwhile, the dynamic nature of activation sparsity needs a real-time partition of hot/cold neurons and adaptive remapping of cold neurons across multiple NDP-DIMM modules. Therefore, we introduce a lightweight predictor optimizing real-time neuron partition and adjustment between GPU and NDP-DIMMs. We also utilize a window-based online scheduling mechanism to maintain load balance among NDP-DIMM modules. Hermes facilitates the deployment of LLaMA2-70B on consumer-grade hardware at 13.75 tokens/s and realizes an average 75.24$\times$ speedup over the state-of-the-art offloading-based inference system.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
InsBank: Evolving Instruction Subset for Ongoing Alignment
Authors:
Jiayi Shi,
Yiwei Li,
Shaoxiong Feng,
Peiwen Yuan,
Xinglin Wang,
Yueqi Zhang,
Chuyi Tan,
Boyuan Pan,
Huan Ren,
Yao Hu,
Kan Li
Abstract:
Large language models (LLMs) typically undergo instruction tuning to enhance alignment. Recent studies emphasize that quality and diversity of instruction data are more crucial than quantity, highlighting the need to select diverse, high-quality subsets to reduce training costs. However, how to evolve these selected subsets alongside the development of new instruction data remains insufficiently e…
▽ More
Large language models (LLMs) typically undergo instruction tuning to enhance alignment. Recent studies emphasize that quality and diversity of instruction data are more crucial than quantity, highlighting the need to select diverse, high-quality subsets to reduce training costs. However, how to evolve these selected subsets alongside the development of new instruction data remains insufficiently explored. To achieve LLMs' ongoing alignment, we introduce Instruction Bank (InsBank), a continuously updated repository that integrates the latest valuable instruction data. We further propose Progressive Instruction Bank Evolution (PIBE), a novel framework designed to evolve InsBank effectively and efficiently over time. PIBE employs a gradual data selection strategy to maintain long-term efficiency, leveraging a representation-based diversity score to capture relationships between data points and retain historical information for comprehensive diversity evaluation. This also allows for flexible combination of diversity and quality scores during data selection and ranking. Extensive experiments demonstrate that PIBE significantly outperforms baselines in InsBank evolution and is able to extract budget-specific subsets, demonstrating its effectiveness and adaptability.
△ Less
Submitted 16 February, 2025;
originally announced February 2025.
-
Advancing Dense Endoscopic Reconstruction with Gaussian Splatting-driven Surface Normal-aware Tracking and Mapping
Authors:
Yiming Huang,
Beilei Cui,
Long Bai,
Zhen Chen,
Jinlin Wu,
Zhen Li,
Hongbin Liu,
Hongliang Ren
Abstract:
Simultaneous Localization and Mapping (SLAM) is essential for precise surgical interventions and robotic tasks in minimally invasive procedures. While recent advancements in 3D Gaussian Splatting (3DGS) have improved SLAM with high-quality novel view synthesis and fast rendering, these systems struggle with accurate depth and surface reconstruction due to multi-view inconsistencies. Simply incorpo…
▽ More
Simultaneous Localization and Mapping (SLAM) is essential for precise surgical interventions and robotic tasks in minimally invasive procedures. While recent advancements in 3D Gaussian Splatting (3DGS) have improved SLAM with high-quality novel view synthesis and fast rendering, these systems struggle with accurate depth and surface reconstruction due to multi-view inconsistencies. Simply incorporating SLAM and 3DGS leads to mismatches between the reconstructed frames. In this work, we present Endo-2DTAM, a real-time endoscopic SLAM system with 2D Gaussian Splatting (2DGS) to address these challenges. Endo-2DTAM incorporates a surface normal-aware pipeline, which consists of tracking, mapping, and bundle adjustment modules for geometrically accurate reconstruction. Our robust tracking module combines point-to-point and point-to-plane distance metrics, while the mapping module utilizes normal consistency and depth distortion to enhance surface reconstruction quality. We also introduce a pose-consistent strategy for efficient and geometrically coherent keyframe sampling. Extensive experiments on public endoscopic datasets demonstrate that Endo-2DTAM achieves an RMSE of $1.87\pm 0.63$ mm for depth reconstruction of surgical scenes while maintaining computationally efficient tracking, high-quality visual appearance, and real-time rendering. Our code will be released at github.com/lastbasket/Endo-2DTAM.
△ Less
Submitted 31 January, 2025;
originally announced January 2025.
-
ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring
Authors:
Xiaopeng Lin,
Yulong Huang,
Hongwei Ren,
Zunchang Liu,
Yue Zhou,
Haotian Fu,
Bojun Cheng
Abstract:
Motion deblurring addresses the challenge of image blur caused by camera or scene movement. Event cameras provide motion information that is encoded in the asynchronous event streams. To efficiently leverage the temporal information of event streams, we employ Spiking Neural Networks (SNNs) for motion feature extraction and Artificial Neural Networks (ANNs) for color information processing. Due to…
▽ More
Motion deblurring addresses the challenge of image blur caused by camera or scene movement. Event cameras provide motion information that is encoded in the asynchronous event streams. To efficiently leverage the temporal information of event streams, we employ Spiking Neural Networks (SNNs) for motion feature extraction and Artificial Neural Networks (ANNs) for color information processing. Due to the non-uniform distribution and inherent redundancy of event data, existing cross-modal feature fusion methods exhibit certain limitations. Inspired by the visual attention mechanism in the human visual system, this study introduces a bioinspired dual-drive hybrid network (BDHNet). Specifically, the Neuron Configurator Module (NCM) is designed to dynamically adjusts neuron configurations based on cross-modal features, thereby focusing the spikes in blurry regions and adapting to varying blurry scenarios dynamically. Additionally, the Region of Blurry Attention Module (RBAM) is introduced to generate a blurry mask in an unsupervised manner, effectively extracting motion clues from the event features and guiding more accurate cross-modal feature fusion. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art methods on both synthetic and real-world datasets.
△ Less
Submitted 27 January, 2025;
originally announced January 2025.
-
Rethinking Membership Inference Attacks Against Transfer Learning
Authors:
Cong Wu,
Jing Chen,
Qianru Fang,
Kun He,
Ziming Zhao,
Hao Ren,
Guowen Xu,
Yang Liu,
Yang Xiang
Abstract:
Transfer learning, successful in knowledge translation across related tasks, faces a substantial privacy threat from membership inference attacks (MIAs). These attacks, despite posing significant risk to ML model's training data, remain limited-explored in transfer learning. The interaction between teacher and student models in transfer learning has not been thoroughly explored in MIAs, potentiall…
▽ More
Transfer learning, successful in knowledge translation across related tasks, faces a substantial privacy threat from membership inference attacks (MIAs). These attacks, despite posing significant risk to ML model's training data, remain limited-explored in transfer learning. The interaction between teacher and student models in transfer learning has not been thoroughly explored in MIAs, potentially resulting in an under-examined aspect of privacy vulnerabilities within transfer learning. In this paper, we propose a new MIA vector against transfer learning, to determine whether a specific data point was used to train the teacher model while only accessing the student model in a white-box setting. Our method delves into the intricate relationship between teacher and student models, analyzing the discrepancies in hidden layer representations between the student model and its shadow counterpart. These identified differences are then adeptly utilized to refine the shadow model's training process and to inform membership inference decisions effectively. Our method, evaluated across four datasets in diverse transfer learning tasks, reveals that even when an attacker only has access to the student model, the teacher model's training data remains susceptible to MIAs. We believe our work unveils the unexplored risk of membership inference in transfer learning.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
Federated Learning with Sample-level Client Drift Mitigation
Authors:
Haoran Xu,
Jiaze Li,
Wanyi Wu,
Hao Ren
Abstract:
Federated Learning (FL) suffers from severe performance degradation due to the data heterogeneity among clients. Existing works reveal that the fundamental reason is that data heterogeneity can cause client drift where the local model update deviates from the global one, and thus they usually tackle this problem from the perspective of calibrating the obtained local update. Despite effectiveness,…
▽ More
Federated Learning (FL) suffers from severe performance degradation due to the data heterogeneity among clients. Existing works reveal that the fundamental reason is that data heterogeneity can cause client drift where the local model update deviates from the global one, and thus they usually tackle this problem from the perspective of calibrating the obtained local update. Despite effectiveness, existing methods substantially lack a deep understanding of how heterogeneous data samples contribute to the formation of client drift. In this paper, we bridge this gap by identifying that the drift can be viewed as a cumulative manifestation of biases present in all local samples and the bias between samples is different. Besides, the bias dynamically changes as the FL training progresses. Motivated by this, we propose FedBSS that first mitigates the heterogeneity issue in a sample-level manner, orthogonal to existing methods. Specifically, the core idea of our method is to adopt a bias-aware sample selection scheme that dynamically selects the samples from small biases to large epoch by epoch to train progressively the local model in each round. In order to ensure the stability of training, we set the diversified knowledge acquisition stage as the warm-up stage to avoid the local optimality caused by knowledge deviation in the early stage of the model. Evaluation results show that FedBSS outperforms state-of-the-art baselines. In addition, we also achieved effective results on feature distribution skew and noise label dataset setting, which proves that FedBSS can not only reduce heterogeneity, but also has scalability and robustness.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery
Authors:
Guankun Wang,
Long Bai,
Junyi Wang,
Kun Yuan,
Zhen Li,
Tianxu Jiang,
Xiting He,
Jinlin Wu,
Zhen Chen,
Zhen Lei,
Hongbin Liu,
Jiazheng Wang,
Fan Zhang,
Nicolas Padoy,
Nassir Navab,
Hongliang Ren
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoC…
▽ More
Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.
△ Less
Submitted 14 March, 2025; v1 submitted 20 January, 2025;
originally announced January 2025.
-
ByteFS: System Support for (CXL-based) Memory-Semantic Solid-State Drives
Authors:
Shaobo Li,
Yirui Eric Zhou,
Hao Ren,
Jian Huang
Abstract:
Unlike non-volatile memory that resides on the processor memory bus, memory-semantic solid-state drives (SSDs) support both byte and block access granularity via PCIe or CXL interconnects. They provide scalable memory capacity using NAND flash at a much lower cost. In addition, they have different performance characteristics for their dual byte/block interface respectively, while offering essentia…
▽ More
Unlike non-volatile memory that resides on the processor memory bus, memory-semantic solid-state drives (SSDs) support both byte and block access granularity via PCIe or CXL interconnects. They provide scalable memory capacity using NAND flash at a much lower cost. In addition, they have different performance characteristics for their dual byte/block interface respectively, while offering essential memory semantics for upper-level software. Such a byte-accessible storage device provides new implications on the software system design.
In this paper, we develop a new file system, named ByteFS, by rethinking the design primitives of file systems and SSD firmware to exploit the advantages of both byte and block-granular data accesses. ByteFS supports byte-granular data persistence to retain the persistence nature of SSDs. It extends the core data structure of file systems by enabling dual byte/block-granular data accesses. To facilitate the support for byte-granular writes, \pname{} manages the internal DRAM of SSD firmware in a log-structured manner and enables data coalescing to reduce the unnecessary I/O traffic to flash chips. ByteFS also enables coordinated data caching between the host page cache and SSD cache for best utilizing the precious memory resource. We implement ByteFS on both a real programmable SSD and an emulated memory-semantic SSD for sensitivity study. Compared to state-of-the-art file systems for non-volatile memory and conventional SSDs, ByteFS outperforms them by up to 2.7$\times$, while preserving the essential properties of a file system. ByteFS also reduces the write traffic to SSDs by up to 5.1$\times$ by alleviating unnecessary writes caused by both metadata and data updates in file systems.
△ Less
Submitted 9 January, 2025;
originally announced January 2025.
-
RIS-Aided Integrated Sensing and Communication Systems under Dual-polarized Channels
Authors:
Dongnan Xia,
Cunhua Pan,
Hong Ren,
Zhiyuan Yu,
Yasheng Jin,
Jiangzhou Wang
Abstract:
This paper considers reconfigurable intelligent surface (RIS)-aided integrated sensing and communication (ISAC) systems under dual-polarized (DP) channels.
Unlike the existing ISAC systems, which ignored polarization of electromagnetic waves, this study adopts DP base station (BS) and DP RIS to serve users with a pair of DP antennas.
The achievable sum rate is maximized through jointly optimiz…
▽ More
This paper considers reconfigurable intelligent surface (RIS)-aided integrated sensing and communication (ISAC) systems under dual-polarized (DP) channels.
Unlike the existing ISAC systems, which ignored polarization of electromagnetic waves, this study adopts DP base station (BS) and DP RIS to serve users with a pair of DP antennas.
The achievable sum rate is maximized through jointly optimizing the beamforming matrix at the DP BS, and the reflecting coefficients at the DP RIS.
To address this problem, we first utilize the weighted minimum mean-square error (WMMSE) method to transform the objective function into a more tractable form, and then an alternating optimization (AO) method is employed to decouple the original problem into two subproblems.
Due to the constant modulus constraint, the DP RIS reflection matrix optimization problem is addressed by the majorization-minimization (MM) method.
For the DP beamforming matrix, we propose a penalty-based algorithm that can obtain a low-complexity closed-form solution.
Simulation results validate the advantage of deploying DP transmit array and DP RIS in the considered ISAC systems.
△ Less
Submitted 1 January, 2025;
originally announced January 2025.
-
Frequency-aware Event Cloud Network
Authors:
Hongwei Ren,
Fei Ma,
Xiaopeng Lin,
Yuetong Fang,
Hongxiang Huang,
Yulong Huang,
Yue Zhou,
Haotian Fu,
Ziyi Yang,
Fei Richard Yu,
Bojun Cheng
Abstract:
Event cameras are biologically inspired sensors that emit events asynchronously with remarkable temporal resolution, garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach a satisfactory performance while introducing time-consuming transformation, bulky models, and sacrificing fine-grained temporal information. Alterna…
▽ More
Event cameras are biologically inspired sensors that emit events asynchronously with remarkable temporal resolution, garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach a satisfactory performance while introducing time-consuming transformation, bulky models, and sacrificing fine-grained temporal information. Alternatively, Point Cloud representation demonstrates promise in addressing the mentioned weaknesses, but it ignores the polarity information, and its models have limited proficiency in abstracting long-term events' features. In this paper, we propose a frequency-aware network named FECNet that leverages Event Cloud representations. FECNet fully utilizes 2S-1T-1P Event Cloud by innovating the event-based Group and Sampling module. To accommodate the long sequence events from Event Cloud, FECNet embraces feature extraction in the frequency domain via the Fourier transform. This approach substantially extinguishes the explosion of Multiply Accumulate Operations (MACs) while effectively abstracting spatial-temporal features. We conducted extensive experiments on event-based object classification, action recognition, and human pose estimation tasks, and the results substantiate the effectiveness and efficiency of FECNet.
△ Less
Submitted 30 December, 2024;
originally announced December 2024.
-
ChipAlign: Instruction Alignment in Large Language Models for Chip Design via Geodesic Interpolation
Authors:
Chenhui Deng,
Yunsheng Bai,
Haoxing Ren
Abstract:
Recent advancements in large language models (LLMs) have expanded their application across various domains, including chip design, where domain-adapted chip models like ChipNeMo have emerged. However, these models often struggle with instruction alignment, a crucial capability for LLMs that involves following explicit human directives. This limitation impedes the practical application of chip LLMs…
▽ More
Recent advancements in large language models (LLMs) have expanded their application across various domains, including chip design, where domain-adapted chip models like ChipNeMo have emerged. However, these models often struggle with instruction alignment, a crucial capability for LLMs that involves following explicit human directives. This limitation impedes the practical application of chip LLMs, including serving as assistant chatbots for hardware design engineers. In this work, we introduce ChipAlign, a novel approach that utilizes a training-free model merging strategy, combining the strengths of a general instruction-aligned LLM with a chip-specific LLM. By considering the underlying manifold in the weight space, ChipAlign employs geodesic interpolation to effectively fuse the weights of input LLMs, producing a merged model that inherits strong instruction alignment and chip expertise from the respective instruction and chip LLMs. Our results demonstrate that ChipAlign significantly enhances instruction-following capabilities of existing chip LLMs, achieving up to a 26.6% improvement on the IFEval benchmark, while maintaining comparable expertise in the chip domain. This improvement in instruction alignment also translates to notable gains in instruction-involved QA tasks, delivering performance enhancements of 3.9% on the OpenROAD QA benchmark and 8.25% on production-level chip QA benchmarks, surpassing state-of-the-art baselines.
△ Less
Submitted 14 December, 2024;
originally announced December 2024.
-
Improving Generalization for AI-Synthesized Voice Detection
Authors:
Hainan Ren,
Li Lin,
Chun-Hao Liu,
Xin Wang,
Shu Hu
Abstract:
AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing AI-synthesized voice detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use div…
▽ More
AI-synthesized voice technology has the potential to create realistic human voices for beneficial applications, but it can also be misused for malicious purposes. While existing AI-synthesized voice detection models excel in intra-domain evaluation, they face challenges in generalizing across different domains, potentially becoming obsolete as new voice generators emerge. Current solutions use diverse data and advanced machine learning techniques (e.g., domain-invariant representation, self-supervised learning), but are limited by predefined vocoders and sensitivity to factors like background noise and speaker identity. In this work, we introduce an innovative disentanglement framework aimed at extracting domain-agnostic artifact features related to vocoders. Utilizing these features, we enhance model learning in a flat loss landscape, enabling escape from suboptimal solutions and improving generalization. Extensive experiments on benchmarks show our approach outperforms state-of-the-art methods, achieving up to 5.12% improvement in the equal error rate metric in intra-domain and 7.59% in cross-domain evaluations.
△ Less
Submitted 30 December, 2024; v1 submitted 26 December, 2024;
originally announced December 2024.
-
V$^2$-SfMLearner: Learning Monocular Depth and Ego-motion for Multimodal Wireless Capsule Endoscopy
Authors:
Long Bai,
Beilei Cui,
Liangyu Wang,
Yanheng Li,
Shilong Yao,
Sishen Yuan,
Yanan Wu,
Yang Zhang,
Max Q. -H. Meng,
Zhen Li,
Weiping Ding,
Hongliang Ren
Abstract:
Deep learning can predict depth maps and capsule ego-motion from capsule endoscopy videos, aiding in 3D scene reconstruction and lesion localization. However, the collisions of the capsule endoscopies within the gastrointestinal tract cause vibration perturbations in the training data. Existing solutions focus solely on vision-based processing, neglecting other auxiliary signals like vibrations th…
▽ More
Deep learning can predict depth maps and capsule ego-motion from capsule endoscopy videos, aiding in 3D scene reconstruction and lesion localization. However, the collisions of the capsule endoscopies within the gastrointestinal tract cause vibration perturbations in the training data. Existing solutions focus solely on vision-based processing, neglecting other auxiliary signals like vibrations that could reduce noise and improve performance. Therefore, we propose V$^2$-SfMLearner, a multimodal approach integrating vibration signals into vision-based depth and capsule motion estimation for monocular capsule endoscopy. We construct a multimodal capsule endoscopy dataset containing vibration and visual signals, and our artificial intelligence solution develops an unsupervised method using vision-vibration signals, effectively eliminating vibration perturbations through multimodal learning. Specifically, we carefully design a vibration network branch and a Fourier fusion module, to detect and mitigate vibration noises. The fusion framework is compatible with popular vision-only algorithms. Extensive validation on the multimodal dataset demonstrates superior performance and robustness against vision-only algorithms. Without the need for large external equipment, our V$^2$-SfMLearner has the potential for integration into clinical capsule robots, providing real-time and dependable digestive examination tools. The findings show promise for practical implementation in clinical settings, enhancing the diagnostic capabilities of doctors.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.
-
OpenAI o1 System Card
Authors:
OpenAI,
:,
Aaron Jaech,
Adam Kalai,
Adam Lerer,
Adam Richardson,
Ahmed El-Kishky,
Aiden Low,
Alec Helyar,
Aleksander Madry,
Alex Beutel,
Alex Carney,
Alex Iftimie,
Alex Karpenko,
Alex Tachard Passos,
Alexander Neitz,
Alexander Prokofiev,
Alexander Wei,
Allison Tam,
Ally Bennett,
Ananya Kumar,
Andre Saraiva,
Andrea Vallone,
Andrew Duberstein,
Andrew Kondrich
, et al. (238 additional authors not shown)
Abstract:
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-ar…
▽ More
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
△ Less
Submitted 21 December, 2024;
originally announced December 2024.
-
Deliberative Alignment: Reasoning Enables Safer Language Models
Authors:
Melody Y. Guan,
Manas Joglekar,
Eric Wallace,
Saachi Jain,
Boaz Barak,
Alec Helyar,
Rachel Dias,
Andrea Vallone,
Hongyu Ren,
Jason Wei,
Hyung Won Chung,
Sam Toyer,
Johannes Heidecke,
Alex Beutel,
Amelia Glaese
Abstract:
As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to…
▽ More
As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
△ Less
Submitted 8 January, 2025; v1 submitted 20 December, 2024;
originally announced December 2024.
-
Adaptive Calibration: A Unified Conversion Framework of Spiking Neural Network
Authors:
Ziqing Wang,
Yuetong Fang,
Jiahang Cao,
Hongwei Ren,
Renjing Xu
Abstract:
Spiking Neural Networks (SNNs) are seen as an energy-efficient alternative to traditional Artificial Neural Networks (ANNs), but the performance gap remains a challenge. While this gap is narrowing through ANN-to-SNN conversion, substantial computational resources are still needed, and the energy efficiency of converted SNNs cannot be ensured. To address this, we present a unified training-free co…
▽ More
Spiking Neural Networks (SNNs) are seen as an energy-efficient alternative to traditional Artificial Neural Networks (ANNs), but the performance gap remains a challenge. While this gap is narrowing through ANN-to-SNN conversion, substantial computational resources are still needed, and the energy efficiency of converted SNNs cannot be ensured. To address this, we present a unified training-free conversion framework that significantly enhances both the performance and efficiency of converted SNNs. Inspired by the biological nervous system, we propose a novel Adaptive-Firing Neuron Model (AdaFire), which dynamically adjusts firing patterns across different layers to substantially reduce the Unevenness Error - the primary source of error of converted SNNs within limited inference timesteps. We further introduce two efficiency-enhancing techniques: the Sensitivity Spike Compression (SSC) technique for reducing spike operations, and the Input-aware Adaptive Timesteps (IAT) technique for decreasing latency. These methods collectively enable our approach to achieve state-of-the-art performance while delivering significant energy savings of up to 70.1%, 60.3%, and 43.1% on CIFAR-10, CIFAR-100, and ImageNet datasets, respectively. Extensive experiments across 2D, 3D, event-driven classification tasks, object detection, and segmentation tasks, demonstrate the effectiveness of our method in various domains. The code is available at: https://github.com/bic-L/burst-ann2snn.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
SurgSora: Decoupled RGBD-Flow Diffusion Model for Controllable Surgical Video Generation
Authors:
Tong Chen,
Shuya Yang,
Junyi Wang,
Long Bai,
Hongliang Ren,
Luping Zhou
Abstract:
Medical video generation has transformative potential for enhancing surgical understanding and pathology insights through precise and controllable visual representations. However, current models face limitations in controllability and authenticity. To bridge this gap, we propose SurgSora, a motion-controllable surgical video generation framework that uses a single input frame and user-controllable…
▽ More
Medical video generation has transformative potential for enhancing surgical understanding and pathology insights through precise and controllable visual representations. However, current models face limitations in controllability and authenticity. To bridge this gap, we propose SurgSora, a motion-controllable surgical video generation framework that uses a single input frame and user-controllable motion cues. SurgSora consists of three key modules: the Dual Semantic Injector (DSI), which extracts object-relevant RGB and depth features from the input frame and integrates them with segmentation cues to capture detailed spatial features of complex anatomical structures; the Decoupled Flow Mapper (DFM), which fuses optical flow with semantic-RGB-D features at multiple scales to enhance temporal understanding and object spatial dynamics; and the Trajectory Controller (TC), which allows users to specify motion directions and estimates sparse optical flow, guiding the video generation process. The fused features are used as conditions for a frozen Stable Diffusion model to produce realistic, temporally coherent surgical videos. Extensive evaluations demonstrate that SurgSora outperforms state-of-the-art methods in controllability and authenticity, showing its potential to advance surgical video generation for medical education, training, and research.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.