-
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
Authors:
Zhiyuan Hu,
Yibo Wang,
Hanze Dong,
Yuhui Xu,
Amrita Saha,
Caiming Xiong,
Bryan Hooi,
Junnan Li
Abstract:
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors r…
▽ More
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three stage-pipeline individual alignment, parameter-space merging, and domain-specific reinforcement learning, boosting performance by over 10\% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2\% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations
Authors:
Yile Wang,
Zhanyu Shen,
Hui Huang
Abstract:
Semantic text representation is a fundamental task in the field of natural language processing. Existing text embedding (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as classic sparse interpretable embeddings, suffers from poor performance. Recently, Benara et al. (2024) propose interpretable…
▽ More
Semantic text representation is a fundamental task in the field of natural language processing. Existing text embedding (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as classic sparse interpretable embeddings, suffers from poor performance. Recently, Benara et al. (2024) propose interpretable text embeddings using large language models, which forms "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts through farthest point sampling, offering both semantic representation as well as a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embeddings baselines with much fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Private Transformer Inference in MLaaS: A Survey
Authors:
Yang Li,
Xinyu Zhou,
Yitong Wang,
Liangxin Qian,
Jun Zhao
Abstract:
Transformer models have revolutionized AI, powering applications like content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily due to the centralized processing of sensitive user data. Private Transformer Inference (PTI) offers a solution by utilizing cryptographic techniques such as secure multi-pa…
▽ More
Transformer models have revolutionized AI, powering applications like content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily due to the centralized processing of sensitive user data. Private Transformer Inference (PTI) offers a solution by utilizing cryptographic techniques such as secure multi-party computation and homomorphic encryption, enabling inference while preserving both user data and model privacy. This paper reviews recent PTI advancements, highlighting state-of-the-art solutions and challenges. We also introduce a structured taxonomy and evaluation framework for PTI, focusing on balancing resource efficiency with privacy and bridging the gap between high-performance inference and data privacy.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning
Authors:
Yue Wang,
Shuai Xu,
Xuelin Zhu,
Yicong Li
Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-mo…
▽ More
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. These key information are progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot
Authors:
Hao Lu,
Jiaqi Tang,
Jiyao Wang,
Yunfan LU,
Xu Cao,
Qingyong Hu,
Yin Wang,
Yuting Zhang,
Tianxin Xie,
Yunpeng Zhang,
Yong Chen,
Jiayu. Gao,
Bin Huang,
Dengbo He,
Shuiguang Deng,
Hao Chen,
Ying-Cong Chen
Abstract:
The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can u…
▽ More
The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason the user's physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer's perceptual decision-making ability and the super alignment's accuracy.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Solar-CSK: Decoding Color Coded Visible Light Communications using Solar Cells
Authors:
Yanxiang Wang,
Yihe Yan,
Jiawei Hu,
Cheng Jiang,
Brano Kusy,
Ashraf Uddin,
Mahbub Hassan,
Wen Hu
Abstract:
Visible Light Communication (VLC) provides an energy-efficient wireless solution by using existing LED-based illumination for high-speed data transmissions. Although solar cells offer the advantage of simultaneous energy harvesting and data reception, their broadband nature hinders accurate decoding of color-coded signals like Color Shift Keying (CSK). In this paper, we propose a novel approach ex…
▽ More
Visible Light Communication (VLC) provides an energy-efficient wireless solution by using existing LED-based illumination for high-speed data transmissions. Although solar cells offer the advantage of simultaneous energy harvesting and data reception, their broadband nature hinders accurate decoding of color-coded signals like Color Shift Keying (CSK). In this paper, we propose a novel approach exploiting the concept of tandem solar cells, multi-layer devices with partial wavelength selectivity, to capture coarse color information without resorting to energy-limiting color filters. To address the residual spectral overlap, we develop a bidirectional LSTM-based machine learning framework that infers channel characteristics by comparing solar cells' photovoltaic signals with pilot-based anchor data. Our commercial off-the-shelf (COTS) solar prototype achieves robust performance across varying distances and ambient lighting levels, significantly reducing bit error rates compared to conventional channel estimation methods. These findings mark a step toward sustainable, high-performance VLC systems powered by the multi-layer solar technologies.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation
Authors:
Jun Guo,
Xiaojian Ma,
Yikai Wang,
Min Yang,
Huaping Liu,
Qing Li
Abstract:
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rende…
▽ More
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model will predict the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts
Authors:
Jing-Cheng Pang,
Kaiyuan Li,
Yidi Wang,
Si-Hang Yang,
Shengyi Jiang,
Yang Yu
Abstract:
A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can mitigate this limitation by generating synthetic experience (noted as imaginary rollouts) for mastering novel tasks, progress in this emerging field is hindered due to the lack of a standa…
▽ More
A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can mitigate this limitation by generating synthetic experience (noted as imaginary rollouts) for mastering novel tasks, progress in this emerging field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ImagineBench, the first comprehensive benchmark for evaluating offline RL algorithms that leverage both real rollouts and LLM-imaginary rollouts. The key features of ImagineBench include: (1) datasets comprising environment-collected and LLM-imaginary rollouts; (2) diverse domains of environments covering locomotion, robotic manipulation, and navigation tasks; and (3) natural language task instructions with varying complexity levels to facilitate language-conditioned policy learning. Through systematic evaluation of state-of-the-art offline RL algorithms, we observe that simply applying existing offline RL algorithms leads to suboptimal performance on unseen tasks, achieving 35.44% success rate in hard tasks in contrast to 64.37% of method training on real rollouts for hard tasks. This result highlights the need for algorithm advancements to better leverage LLM-imaginary rollouts. Additionally, we identify key opportunities for future research: including better utilization of imaginary rollouts, fast online adaptation and continual learning, and extension to multi-modal tasks. Our code is publicly available at https://github.com/LAMDA-RL/ImagineBench.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
Authors:
Long Cheng,
Jiafei Duan,
Yi Ru Wang,
Haoquan Fang,
Boyang Li,
Yushan Huang,
Elvis Wang,
Ainaz Eftekhar,
Jason Lee,
Wentao Yuan,
Rose Hendrix,
Noah A. Smith,
Fei Xia,
Dieter Fox,
Ranjay Krishna
Abstract:
Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platf…
▽ More
Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: https://pointarena.github.io/
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation
Authors:
Yimin Zhou,
Yichong Xia,
Sicheng Pan,
Bin Chen,
Baoyi An,
Haoqian Wang,
Zhi Wang,
Yaowei Wang,
Zikun Zhou
Abstract:
With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terres…
▽ More
With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models
Authors:
Yidan Wang,
Yubing Ren,
Yanan Cao,
Binxing Fang
Abstract:
The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based sc…
▽ More
The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at \href{https://github.com/redwyd/SymMark}{https://github.com/redwyd/SymMark}.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Authors:
Yidan Wang,
Yanan Cao,
Yubing Ren,
Fang Fang,
Zheng Lin,
Binxing Fang
Abstract:
Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-alignment models can easily block. Meanwhile, Jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains undere…
▽ More
Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-alignment models can easily block. Meanwhile, Jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is availble at \href{https://github.com/redwyd/PrivacyJailbreak}{https://github.com/redwyd/PrivacyJailbreak}.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation
Authors:
Enyu Zhao,
Vedant Raval,
Hejia Zhang,
Jiageng Mao,
Zeyu Shangguan,
Stefanos Nikolaidis,
Yue Wang,
Daniel Seita
Abstract:
Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common…
▽ More
Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 33 representative VLMs across 10 model families on our benchmark, including variants to test different model sizes. Our evaluation shows that the performance of VLMs significantly varies across tasks, and there is a strong correlation between this performance and trends in our real-world manipulation tasks. It also shows that there remains a significant gap between these models and human-level understanding. See our website at: https://manipbench.github.io.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models
Authors:
Xiwen Chen,
Wenhui Zhu,
Peijie Qiu,
Xuanzhao Dong,
Hao Wang,
Haiyu Wu,
Huayu Li,
Aristeidis Sotiras,
Yalin Wang,
Abolfazl Razi
Abstract:
Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinc…
▽ More
Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose $\textit{Diversity-aware Reward Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
SpecWav-Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech
Authors:
Yuqi Li,
Yuanzhong Zheng,
Zhongtian Guo,
Yaoxuan Wang,
Jianjun Yin,
Haojun Fei
Abstract:
This paper presents SpecWav-Attack, an adversarial model for detecting speakers in anonymized speech. It leverages Wav2Vec2 for feature extraction and incorporates spectrogram resizing and incremental training for improved performance. Evaluated on librispeech-dev and librispeech-test, SpecWav-Attack outperforms conventional attacks, revealing vulnerabilities in anonymized speech systems and empha…
▽ More
This paper presents SpecWav-Attack, an adversarial model for detecting speakers in anonymized speech. It leverages Wav2Vec2 for feature extraction and incorporates spectrogram resizing and incremental training for improved performance. Evaluated on librispeech-dev and librispeech-test, SpecWav-Attack outperforms conventional attacks, revealing vulnerabilities in anonymized speech systems and emphasizing the need for stronger defenses, benchmarked against the ICASSP 2025 Attacker Challenge.
△ Less
Submitted 10 January, 2025;
originally announced May 2025.
-
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
Authors:
Yung-Hsuan Lai,
Janek Ebbers,
Yu-Chiang Frank Wang,
François Germain,
Michael Jeffrey Jones,
Moitreya Chatterjee
Abstract:
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end…
▽ More
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
SEGA-DCIM: Design Space Exploration-Guided Automatic Digital CIM Compiler with Multiple Precision Support
Authors:
Haikang Diao,
Haoyi Zhang,
Jiahao Song,
Haoyang Luo,
Yibo Lin,
Runsheng Wang,
Yuan Wang,
Xiyuan Tang
Abstract:
Digital computing-in-memory (DCIM) has been a popular solution for addressing the memory wall problem in recent years. However, the DCIM design still heavily relies on manual efforts, and the optimization of DCIM is often based on human experience. These disadvantages limit the time to market while increasing the design difficulty of DCIMs. This work proposes a design space exploration-guided auto…
▽ More
Digital computing-in-memory (DCIM) has been a popular solution for addressing the memory wall problem in recent years. However, the DCIM design still heavily relies on manual efforts, and the optimization of DCIM is often based on human experience. These disadvantages limit the time to market while increasing the design difficulty of DCIMs. This work proposes a design space exploration-guided automatic DCIM compiler (SEGA-DCIM) with multiple precision support, including integer and floating-point data precision operations. SEGA-DCIM can automatically generate netlists and layouts of DCIM designs by leveraging a template-based method. With a multi-objective genetic algorithm (MOGA)-based design space explorer, SEGA-DCIM can easily select appropriate DCIM designs for a specific application considering the trade-offs among area, power, and delay. As demonstrated by the experimental results, SEGA-DCIM offers solutions with wide design space, including integer and floating-point precision designs, while maintaining competitive performance compared to state-of-the-art (SOTA) DCIMs.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion
Authors:
Han Sun,
Yizhao Wang,
Zhenning Zhou,
Shuai Wang,
Haibo Yang,
Jingyuan Sun,
Qixin Cao
Abstract:
Recent studies have proved that imitation learning shows strong potential in the field of robotic manipulation. However, existing methods still struggle with precision manipulation task and rely on inefficient image/point cloud observations. In this paper, we explore to introduce SE(3) object pose into imitation learning and propose the pose-guided efficient imitation learning methods for robotic…
▽ More
Recent studies have proved that imitation learning shows strong potential in the field of robotic manipulation. However, existing methods still struggle with precision manipulation task and rely on inefficient image/point cloud observations. In this paper, we explore to introduce SE(3) object pose into imitation learning and propose the pose-guided efficient imitation learning methods for robotic precise insertion task. First, we propose a precise insertion diffusion policy which utilizes the relative SE(3) pose as the observation-action pair. The policy models the source object SE(3) pose trajectory relative to the target object. Second, we explore to introduce the RGBD data to the pose-guided diffusion policy. Specifically, we design a goal-conditioned RGBD encoder to capture the discrepancy between the current state and the goal state. In addition, a pose-guided residual gated fusion method is proposed, which takes pose features as the backbone, and the RGBD features selectively compensate for pose feature deficiencies through an adaptive gating mechanism. Our methods are evaluated on 6 robotic precise insertion tasks, demonstrating competitive performance with only 7-10 demonstrations. Experiments demonstrate that the proposed methods can successfully complete precision insertion tasks with a clearance of about 0.01 mm. Experimental results highlight its superior efficiency and generalization capability compared to existing baselines. Code will be available at https://github.com/sunhan1997/PoseInsert.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
MoRAL: Motion-aware Multi-Frame 4D Radar and LiDAR Fusion for Robust 3D Object Detection
Authors:
Xiangyuan Peng,
Yu Wang,
Miao Tang,
Bierzynski Kay,
Lorenzo Servadei,
Robert Wille
Abstract:
Reliable autonomous driving systems require accurate detection of traffic participants. To this end, multi-modal fusion has emerged as an effective strategy. In particular, 4D radar and LiDAR fusion methods based on multi-frame radar point clouds have demonstrated the effectiveness in bridging the point density gap. However, they often neglect radar point clouds' inter-frame misalignment caused by…
▽ More
Reliable autonomous driving systems require accurate detection of traffic participants. To this end, multi-modal fusion has emerged as an effective strategy. In particular, 4D radar and LiDAR fusion methods based on multi-frame radar point clouds have demonstrated the effectiveness in bridging the point density gap. However, they often neglect radar point clouds' inter-frame misalignment caused by object movement during accumulation and do not fully exploit the object dynamic information from 4D radar. In this paper, we propose MoRAL, a motion-aware multi-frame 4D radar and LiDAR fusion framework for robust 3D object detection. First, a Motion-aware Radar Encoder (MRE) is designed to compensate for inter-frame radar misalignment from moving objects. Later, a Motion Attention Gated Fusion (MAGF) module integrate radar motion features to guide LiDAR features to focus on dynamic foreground objects. Extensive evaluations on the View-of-Delft (VoD) dataset demonstrate that MoRAL outperforms existing methods, achieving the highest mAP of 73.30% in the entire area and 88.68% in the driving corridor. Notably, our method also achieves the best AP of 69.67% for pedestrians in the entire area and 96.25% for cyclists in the driving corridor.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Authors:
Chenggang Zhao,
Chengqi Deng,
Chong Ruan,
Damai Dai,
Huazuo Gao,
Jiashi Li,
Liyue Zhang,
Panpan Huang,
Shangyan Zhou,
Shirong Ma,
Wenfeng Liang,
Ying He,
Yuqing Wang,
Yuxuan Liu,
Y. X. Wei
Abstract:
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inferen…
▽ More
The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Zero-Shot Multi-modal Large Language Model v.s. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping
Authors:
Yinuo Wang,
Yue Zeng,
Kai Chen,
Cai Meng,
Chao Pan,
Zhouping Tang
Abstract:
Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography is critical for prognosis prediction and therapeutic decision-making, yet remains challenging due to low contrast and blurring boundaries. This study evaluates the performance of zero-shot multi-modal large language models (MLLMs) compared to traditional deep learning methods in ICH bi…
▽ More
Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography is critical for prognosis prediction and therapeutic decision-making, yet remains challenging due to low contrast and blurring boundaries. This study evaluates the performance of zero-shot multi-modal large language models (MLLMs) compared to traditional deep learning methods in ICH binary classification and subtyping. Methods: We utilized a dataset provided by RSNA, comprising 192 NCCT volumes. The study compares various MLLMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet V2, with conventional deep learning models, including ResNet50 and Vision Transformer. Carefully crafted prompts were used to guide MLLMs in tasks such as ICH presence, subtype classification, localization, and volume estimation. Results: The results indicate that in the ICH binary classification task, traditional deep learning models outperform MLLMs comprehensively. For subtype classification, MLLMs also exhibit inferior performance compared to traditional deep learning models, with Gemini 2.0 Flash achieving an macro-averaged precision of 0.41 and a macro-averaged F1 score of 0.31. Conclusion: While MLLMs excel in interactive capabilities, their overall accuracy in ICH subtyping is inferior to deep networks. However, MLLMs enhance interpretability through language interactions, indicating potential in medical imaging analysis. Future efforts will focus on model refinement and developing more precise MLLMs to improve performance in three-dimensional medical image processing.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Bridging Theory and Experiment in Materials Discovery: Machine-Learning-Assisted Prediction of Synthesizable Structures
Authors:
Yu Xin,
Peng Liu,
Zhuohang Xie,
Wenhui Mi,
Pengyue Gao,
Hong Jian Zhao,
Jian Lv,
Yanchao Wang,
Yanming Ma
Abstract:
Even though thermodynamic energy-based crystal structure prediction (CSP) has revolutionized materials discovery, the energy-driven CSP approaches often struggle to identify experimentally realizable metastable materials synthesized through kinetically controlled pathways, creating a critical gap between theoretical predictions and experimental synthesis. Here, we propose a synthesizability-driven…
▽ More
Even though thermodynamic energy-based crystal structure prediction (CSP) has revolutionized materials discovery, the energy-driven CSP approaches often struggle to identify experimentally realizable metastable materials synthesized through kinetically controlled pathways, creating a critical gap between theoretical predictions and experimental synthesis. Here, we propose a synthesizability-driven CSP framework that integrates symmetry-guided structure derivation with a Wyckoff encode-based machine-learning model, allowing for the efficient localization of subspaces likely to yield highly synthesizable structures. Within the identified promising subspaces, a structure-based synthesizability evaluation model, fine-tuned using recently synthesized structures to enhance predictive accuracy, is employed in conjunction with ab initio calculations to systematically identify synthesizable candidates. The framework successfully reproduces 13 experimentally known XSe (X = Sc, Ti, Mn, Fe, Ni, Cu, Zn) structures, demonstrating its effectiveness in predicting synthesizable structures. Notably, 92,310 structures are filtered from the 554,054 candidates predicted by GNoME, exhibiting great potential for promising synthesizability. Additionally, eight thermodynamically favorable Hf-X-O (X = Ti, V, and Mn) structures have been identified, among which three HfV$_2$O$_7$ candidates exhibit high synthesizability, presenting viable candidates for experimental realization and potentially associated with experimentally observed temperature-induced phase transitions. This work establishes a data-driven paradigm for machine-learning-assisted inorganic materials synthesis, highlighting its potential to bridge the gap between computational predictions and experimental realization while unlocking new opportunities for the targeted discovery of novel functional materials.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Sequential Treatment Effect Estimation with Unmeasured Confounders
Authors:
Yingrong Wang,
Anpeng Wu,
Baohong Li,
Ziyang Xiao,
Ruoxuan Xiong,
Qing Han,
Kun Kuang
Abstract:
This paper studies the cumulative causal effects of sequential treatments in the presence of unmeasured confounders. It is a critical issue in sequential decision-making scenarios where treatment decisions and outcomes dynamically evolve over time. Advanced causal methods apply transformer as a backbone to model such time sequences, which shows superiority in capturing long time dependence and per…
▽ More
This paper studies the cumulative causal effects of sequential treatments in the presence of unmeasured confounders. It is a critical issue in sequential decision-making scenarios where treatment decisions and outcomes dynamically evolve over time. Advanced causal methods apply transformer as a backbone to model such time sequences, which shows superiority in capturing long time dependence and periodic patterns via attention mechanism. However, even they control the observed confounding, these estimators still suffer from unmeasured confounders, which influence both treatment assignments and outcomes. How to adjust the latent confounding bias in sequential treatment effect estimation remains an open challenge. Therefore, we propose a novel Decomposing Sequential Instrumental Variable framework for CounterFactual Regression (DSIV-CFR), relying on a common negative control assumption. Specifically, an instrumental variable (IV) is a special negative control exposure, while the previous outcome serves as a negative control outcome. This allows us to recover the IVs latent in observation variables and estimate sequential treatment effects via a generalized moment condition. We conducted experiments on 4 datasets and achieved significant performance in one- and multi-step prediction, supported by which we can identify optimal treatments for dynamic systems.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions
Authors:
Yuhang Wang,
Abdulaziz Alhuraish,
Shengming Yuan,
Hao Zhou
Abstract:
Lane Keeping Assist (LKA) is widely adopted in modern vehicles, yet its real-world performance remains underexplored due to proprietary systems and limited data access. This paper presents OpenLKA, the first open, large-scale dataset for LKA evaluation and improvement. It includes 400 hours of driving data from 50+ production vehicle models, collected through extensive road testing in Tampa, Flori…
▽ More
Lane Keeping Assist (LKA) is widely adopted in modern vehicles, yet its real-world performance remains underexplored due to proprietary systems and limited data access. This paper presents OpenLKA, the first open, large-scale dataset for LKA evaluation and improvement. It includes 400 hours of driving data from 50+ production vehicle models, collected through extensive road testing in Tampa, Florida and global contributions from the Comma.ai driving community. The dataset spans a wide range of challenging scenarios, including complex road geometries, degraded lane markings, adverse weather, lighting conditions and surrounding traffic. The dataset is multimodal, comprising: i) full CAN bus streams, decoded using custom reverse-engineered DBC files to extract key LKA events (e.g., system disengagements, lane detection failures); ii) synchronized high-resolution dash-cam video; iii) real-time outputs from Openpilot, providing accurate estimates of road curvature and lane positioning; iv) enhanced scene annotations generated by Vision Language Models, describing lane visibility, pavement quality, weather, lighting, and traffic conditions. By integrating vehicle-internal signals with high-fidelity perception and rich semantic context, OpenLKA provides a comprehensive platform for benchmarking the real-world performance of production LKA systems, identifying safety-critical operational scenarios, and assessing the readiness of current road infrastructure for autonomous driving. The dataset is publicly available at: https://github.com/OpenLKA/OpenLKA.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Human-like Cognitive Generalization for Large Models via Brain-in-the-loop Supervision
Authors:
Jiaxuan Chen,
Yu Qi,
Yueming Wang,
Gang Pan
Abstract:
Recent advancements in deep neural networks (DNNs), particularly large-scale language models, have demonstrated remarkable capabilities in image and natural language understanding. Although scaling up model parameters with increasing volume of training data has progressively improved DNN capabilities, achieving complex cognitive abilities - such as understanding abstract concepts, reasoning, and a…
▽ More
Recent advancements in deep neural networks (DNNs), particularly large-scale language models, have demonstrated remarkable capabilities in image and natural language understanding. Although scaling up model parameters with increasing volume of training data has progressively improved DNN capabilities, achieving complex cognitive abilities - such as understanding abstract concepts, reasoning, and adapting to novel scenarios, which are intrinsic to human cognition - remains a major challenge. In this study, we show that brain-in-the-loop supervised learning, utilizing a small set of brain signals, can effectively transfer human conceptual structures to DNNs, significantly enhancing their comprehension of abstract and even unseen concepts. Experimental results further indicate that the enhanced cognitive capabilities lead to substantial performance gains in challenging tasks, including few-shot/zero-shot learning and out-of-distribution recognition, while also yielding highly interpretable concept representations. These findings highlight that human-in-the-loop supervision can effectively augment the complex cognitive abilities of large models, offering a promising pathway toward developing more human-like cognitive abilities in artificial systems.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Differentiable Channel Selection in Self-Attention For Person Re-Identification
Authors:
Yancheng Wang,
Nebojsa Jojic,
Yingzhen Yang
Abstract:
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless in…
▽ More
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at https://github.com/Statistical-Deep-Learning/DCS-Attention.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Generative AI for Autonomous Driving: Frontiers and Opportunities
Authors:
Yuping Wang,
Shuo Xing,
Cui Can,
Renjie Li,
Hongyuan Hua,
Kexin Tian,
Zhaobin Mo,
Xiangbo Gao,
Keshu Wu,
Sulong Zhou,
Hengxu You,
Juntong Peng,
Junge Zhang,
Zehao Wang,
Rui Song,
Mingxuan Yan,
Walter Zimmer,
Xingcheng Zhou,
Peiran Li,
Zhaohan Lu,
Chia-Ju Chen,
Yue Huang,
Ryan A. Rossi,
Lichao Sun,
Hongkai Yu
, et al. (22 additional authors not shown)
Abstract:
Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, partic…
▽ More
Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
The Geography of Transportation Cybersecurity: Visitor Flows, Industry Clusters, and Spatial Dynamics
Authors:
Yuhao Wang,
Kailai Wang,
Songhua Hu,
Yunpeng,
Zhang,
Gino Lim,
Pengyu Zhu
Abstract:
The rapid evolution of the transportation cybersecurity ecosystem, encompassing cybersecurity, automotive, and transportation and logistics sectors, will lead to the formation of distinct spatial clusters and visitor flow patterns across the US. This study examines the spatiotemporal dynamics of visitor flows, analyzing how socioeconomic factors shape industry clustering and workforce distribution…
▽ More
The rapid evolution of the transportation cybersecurity ecosystem, encompassing cybersecurity, automotive, and transportation and logistics sectors, will lead to the formation of distinct spatial clusters and visitor flow patterns across the US. This study examines the spatiotemporal dynamics of visitor flows, analyzing how socioeconomic factors shape industry clustering and workforce distribution within these evolving sectors. To model and predict visitor flow patterns, we develop a BiTransGCN framework, integrating an attention-based Transformer architecture with a Graph Convolutional Network backbone. By integrating AI-enabled forecasting techniques with spatial analysis, this study improves our ability to track, interpret, and anticipate changes in industry clustering and mobility trends, thereby supporting strategic planning for a secure and resilient transportation network. It offers a data-driven foundation for economic planning, workforce development, and targeted investments in the transportation cybersecurity ecosystem.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Towards Understanding Deep Learning Model in Image Recognition via Coverage Test
Authors:
Wenkai Li,
Xiaoqi Li,
Yingjie Mao,
Yishun Wang
Abstract:
Deep neural networks (DNNs) play a crucial role in the field of artificial intelligence, and their security-related testing has been a prominent research focus. By inputting test cases, the behavior of models is examined for anomalies, and coverage metrics are utilized to determine the extent of neurons covered by these test cases. With the widespread application and advancement of DNNs, different…
▽ More
Deep neural networks (DNNs) play a crucial role in the field of artificial intelligence, and their security-related testing has been a prominent research focus. By inputting test cases, the behavior of models is examined for anomalies, and coverage metrics are utilized to determine the extent of neurons covered by these test cases. With the widespread application and advancement of DNNs, different types of neural behaviors have garnered attention, leading to the emergence of various coverage metrics for neural networks. However, there is currently a lack of empirical research on these coverage metrics, specifically in analyzing the relationships and patterns between model depth, configuration information, and neural network coverage. This paper aims to investigate the relationships and patterns of four coverage metrics: primary functionality, boundary, hierarchy, and structural coverage. A series of empirical experiments were conducted, selecting LeNet, VGG, and ResNet as different DNN architectures, along with 10 models of varying depths ranging from 5 to 54 layers, to compare and study the relationships between different depths, configuration information, and various neural network coverage metrics. Additionally, an investigation was carried out on the relationships between modified decision/condition coverage and dataset size. Finally, three potential future directions are proposed to further contribute to the security testing of DNN Models.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
SparseMeXT Unlocking the Potential of Sparse Representations for HD Map Construction
Authors:
Anqing Jiang,
Jinhao Chai,
Yu Gao,
Yiru Wang,
Yuwen Heng,
Zhigang Sun,
Hao Sun,
Zezhong Zhao,
Li Sun,
Jian Zhou,
Lijuan Zhu,
Shugong Xu,
Hao Zhao
Abstract:
Recent advancements in high-definition \emph{HD} map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird's-eye view \emph{BEV} features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations…
▽ More
Recent advancements in high-definition \emph{HD} map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird's-eye view \emph{BEV} features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations have hindered the competitiveness of sparse representations in online HD map construction. In this work, we systematically revisit and enhance sparse representation techniques, identifying key architectural and algorithmic improvements that bridge the gap with--and ultimately surpass--dense approaches. We introduce a dedicated network architecture optimized for sparse map feature extraction, a sparse-dense segmentation auxiliary task to better leverage geometric and semantic cues, and a denoising module guided by physical priors to refine predictions. Through these enhancements, our method achieves state-of-the-art performance on the nuScenes dataset, significantly advancing HD map construction and centerline detection. Specifically, SparseMeXt-Tiny reaches a mean average precision \emph{mAP} of 55.5% at 32 frames per second \emph{fps}, while SparseMeXt-Base attains 65.2% mAP. Scaling the backbone and decoder further, SparseMeXt-Large achieves an mAP of 68.9% at over 20 fps, establishing a new benchmark for sparse representations in HD map construction. These results underscore the untapped potential of sparse methods, challenging the conventional reliance on dense representations and redefining efficiency-performance trade-offs in the field.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Security of Internet of Agents: Attacks and Countermeasures
Authors:
Yuntao Wang,
Yanghe Pan,
Shaolong Guo,
Zhou Su
Abstract:
With the rise of large language and vision-language models, AI agents have evolved into autonomous, interactive systems capable of perception, reasoning, and decision-making. As they proliferate across virtual and physical domains, the Internet of Agents (IoA) has emerged as a key infrastructure for enabling scalable and secure coordination among heterogeneous agents. This survey offers a comprehe…
▽ More
With the rise of large language and vision-language models, AI agents have evolved into autonomous, interactive systems capable of perception, reasoning, and decision-making. As they proliferate across virtual and physical domains, the Internet of Agents (IoA) has emerged as a key infrastructure for enabling scalable and secure coordination among heterogeneous agents. This survey offers a comprehensive examination of the security and privacy landscape in IoA systems. We begin by outlining the IoA architecture and its distinct vulnerabilities compared to traditional networks, focusing on four critical aspects: identity authentication threats, cross-agent trust issues, embodied security, and privacy risks. We then review existing and emerging defense mechanisms and highlight persistent challenges. Finally, we identify open research directions to advance the development of resilient and privacy-preserving IoA ecosystems.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Controllable Image Colorization with Instance-aware Texts and Masks
Authors:
Yanru An,
Ling Gui,
Qiang Hu,
Chunlei Cai,
Tianxiao Ye,
Xiaoyun Zhang,
Yanfeng Wang
Abstract:
Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a…
▽ More
Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a diffusion-based colorization method MT-Color to achieve precise instance-aware colorization with use-provided guidance. To tackle color bleeding issue, we design a pixel-level mask attention mechanism that integrates latent features and conditional gray image features through cross-attention. We use segmentation masks to construct cross-attention masks, preventing pixel information from exchanging between different instances. We also introduce an instance mask and text guidance module that extracts instance masks and text representations of each instance, which are then fused with latent features through self-attention, utilizing instance masks to form self-attention masks to prevent instance texts from guiding the colorization of other areas, thus mitigating color binding errors. Furthermore, we apply a multi-instance sampling strategy, which involves sampling each instance region separately and then fusing the results. Additionally, we have created a specialized dataset for instance-level colorization tasks, GPT-color, by leveraging large visual language models on existing image datasets. Qualitative and quantitative experiments show that our model and dataset outperform previous methods and datasets.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
SPAST: Arbitrary Style Transfer with Style Priors via Pre-trained Large-scale Model
Authors:
Zhanjie Zhang,
Quanwei Zhang,
Junsheng Luan,
Mengyuan Yang,
Yun Wang,
Lei Zhao
Abstract:
Given an arbitrary content and style image, arbitrary style transfer aims to render a new stylized
image which preserves the content image's structure and possesses the style image's style. Existing
arbitrary style transfer methods are based on either small models or pre-trained large-scale models.
The small model-based methods fail to generate high-quality stylized images, bringing artifact…
▽ More
Given an arbitrary content and style image, arbitrary style transfer aims to render a new stylized
image which preserves the content image's structure and possesses the style image's style. Existing
arbitrary style transfer methods are based on either small models or pre-trained large-scale models.
The small model-based methods fail to generate high-quality stylized images, bringing artifacts and
disharmonious patterns. The pre-trained large-scale model-based methods can generate high-quality
stylized images but struggle to preserve the content structure and cost long inference time. To this
end, we propose a new framework, called SPAST, to generate high-quality stylized images with
less inference time. Specifically, we design a novel Local-global Window Size Stylization Module
(LGWSSM)tofuse style features into content features. Besides, we introduce a novel style prior loss,
which can dig out the style priors from a pre-trained large-scale model into the SPAST and motivate
the SPAST to generate high-quality stylized images with short inference time.We conduct abundant
experiments to verify that our proposed method can generate high-quality stylized images and less
inference time compared with the SOTA arbitrary style transfer methods.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
AC-PKAN: Attention-Enhanced and Chebyshev Polynomial-Based Physics-Informed Kolmogorov-Arnold Networks
Authors:
Hangwei Zhang,
Zhimu Huang,
Yan Wang
Abstract:
Kolmogorov-Arnold Networks (KANs) have recently shown promise for solving partial differential equations (PDEs). Yet their original formulation is computationally and memory intensive, motivating the introduction of Chebyshev Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed the vanilla KANs architecture, our rigorous theoretical analysis reveals that they still suffer…
▽ More
Kolmogorov-Arnold Networks (KANs) have recently shown promise for solving partial differential equations (PDEs). Yet their original formulation is computationally and memory intensive, motivating the introduction of Chebyshev Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed the vanilla KANs architecture, our rigorous theoretical analysis reveals that they still suffer from rank collapse, ultimately limiting their expressive capacity. To overcome these limitations, we enhance Chebyshev1KANs by integrating wavelet-activated MLPs with learnable parameters and an internal attention mechanism. We prove that this design preserves a full-rank Jacobian and is capable of approximating solutions to PDEs of arbitrary order. Furthermore, to alleviate the loss instability and imbalance introduced by the Chebyshev polynomial basis, we externally incorporate a Residual Gradient Attention (RGA) mechanism that dynamically re-weights individual loss terms according to their gradient norms and residual magnitudes. By jointly leveraging internal and external attention, we present AC-PKAN, a novel architecture that constitutes an enhancement to weakly supervised Physics-Informed Neural Networks (PINNs) and extends the expressive power of KANs. Experimental results from nine benchmark tasks across three domains show that AC-PKAN consistently outperforms or matches state-of-the-art models such as PINNsFormer, establishing it as a highly effective tool for solving complex real-world engineering problems in zero-data or data-sparse regimes. The code will be made publicly available upon acceptance.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Integrating Natural Language Processing and Exercise Monitoring for Early Diagnosis of Metabolic Syndrome: A Deep Learning Approach
Authors:
Yichen Zhao,
Yuhua Wang,
Xi Cheng,
Junhao Fang,
Yang Yang
Abstract:
Metabolic syndrome (MetS) is a medication condition characterized by abdominal obesity, insulin resistance, hypertension and hyperlipidemia. It increases the risk of majority of chronic diseases, including type 2 diabetes mellitus, and affects about one quarter of the global population. Therefore, early detection and timely intervention for MetS are crucial. Standard diagnosis for MetS components…
▽ More
Metabolic syndrome (MetS) is a medication condition characterized by abdominal obesity, insulin resistance, hypertension and hyperlipidemia. It increases the risk of majority of chronic diseases, including type 2 diabetes mellitus, and affects about one quarter of the global population. Therefore, early detection and timely intervention for MetS are crucial. Standard diagnosis for MetS components requires blood tests conducted within medical institutions. However, it is frequently underestimated, leading to unmet need for care for MetS population. This study aims to use the least physiological data and free texts about exercises related activities, which are obtained easily in daily life, to diagnosis MetS. We collected the data from 40 volunteers in a nursing home and used data augmentation to reduce the imbalance. We propose a deep learning framework for classifying MetS that integrates natural language processing (NLP) and exercise monitoring. The results showed that the best model reported a high positive result (AUROC=0.806 and REC=76.3%) through 3-fold cross-validation. Feature importance analysis revealed that text and minimum heart rate on a daily basis contribute the most in the classification of MetS. This study demonstrates the potential application of data that are easily measurable in daily life for the early diagnosis of MetS, which could contribute to reducing the cost of screening and management for MetS population.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Boosting Zero-shot Stereo Matching using Large-scale Mixed Images Sources in the Real World
Authors:
Yuran Wang,
Yingping Liang,
Ying Fu
Abstract:
Stereo matching methods rely on dense pixel-wise ground truth labels, which are laborious to obtain, especially for real-world datasets. The scarcity of labeled data and domain gaps between synthetic and real-world images also pose notable challenges. In this paper, we propose a novel framework, \textbf{BooSTer}, that leverages both vision foundation models and large-scale mixed image sources, inc…
▽ More
Stereo matching methods rely on dense pixel-wise ground truth labels, which are laborious to obtain, especially for real-world datasets. The scarcity of labeled data and domain gaps between synthetic and real-world images also pose notable challenges. In this paper, we propose a novel framework, \textbf{BooSTer}, that leverages both vision foundation models and large-scale mixed image sources, including synthetic, real, and single-view images. First, to fully unleash the potential of large-scale single-view images, we design a data generation strategy combining monocular depth estimation and diffusion models to generate dense stereo matching data from single-view images. Second, to tackle sparse labels in real-world datasets, we transfer knowledge from monocular depth estimation models, using pseudo-mono depth labels and a dynamic scale- and shift-invariant loss for additional supervision. Furthermore, we incorporate vision foundation model as an encoder to extract robust and transferable features, boosting accuracy and generalization. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving significant improvements in accuracy over existing methods, particularly in scenarios with limited labeled data and domain shifts.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Guiding LLM-based Smart Contract Generation with Finite State Machine
Authors:
Hao Luo,
Yuhao Lin,
Xiao Yan,
Xintong Hu,
Yuxiang Wang,
Qiming Zeng,
Hao Wang,
Jiawei Jiang
Abstract:
Smart contract is a kind of self-executing code based on blockchain technology with a wide range of application scenarios, but the traditional generation method relies on manual coding and expert auditing, which has a high threshold and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. eff…
▽ More
Smart contract is a kind of self-executing code based on blockchain technology with a wide range of application scenarios, but the traditional generation method relies on manual coding and expert auditing, which has a high threshold and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machine (FSM) and LLMs, which significantly improves the quality of the generated code by abstracting user requirements to generate FSM, guiding LLMs to generate smart contracts, and iteratively optimizing the code with the feedback of compilation and security checks. The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by at most 48%, and reduces the average vulnerability risk score by approximately 68%.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Strategy-Augmented Planning for Large Language Models via Opponent Exploitation
Authors:
Shuai Xu,
Sijia Cui,
Yanna Wang,
Bo Xu,
Qi Wang
Abstract:
Efficiently modeling and exploiting opponents is a long-standing challenge in adversarial domains. Large Language Models (LLMs) trained on extensive textual data have recently demonstrated outstanding performance in general tasks, introducing new research directions for opponent modeling. Some studies primarily focus on directly using LLMs to generate decisions based on the elaborate prompt contex…
▽ More
Efficiently modeling and exploiting opponents is a long-standing challenge in adversarial domains. Large Language Models (LLMs) trained on extensive textual data have recently demonstrated outstanding performance in general tasks, introducing new research directions for opponent modeling. Some studies primarily focus on directly using LLMs to generate decisions based on the elaborate prompt context that incorporates opponent descriptions, while these approaches are limited to scenarios where LLMs possess adequate domain expertise. To address that, we introduce a two-stage Strategy-Augmented Planning (SAP) framework that significantly enhances the opponent exploitation capabilities of LLM-based agents by utilizing a critical component, the Strategy Evaluation Network (SEN). Specifically, in the offline stage, we construct an explicit strategy space and subsequently collect strategy-outcome pair data for training the SEN network. During the online phase, SAP dynamically recognizes the opponent's strategies and greedily exploits them by searching best response strategy on the well-trained SEN, finally translating strategy to a course of actions by carefully designed prompts. Experimental results show that SAP exhibits robust generalization capabilities, allowing it to perform effectively not only against previously encountered opponent strategies but also against novel, unseen strategies. In the MicroRTS environment, SAP achieves a 85.35\% performance improvement over baseline methods and matches the competitiveness of reinforcement learning approaches against state-of-the-art (SOTA) rule-based AI.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care
Authors:
Zhi Da Soh,
Yang Bai,
Kai Yu,
Yang Zhou,
Xiaofeng Lei,
Sahil Thakur,
Zann Lee,
Lee Ching Linette Phang,
Qingsheng Peng,
Can Can Xue,
Rachel Shujuan Chong,
Quan V. Hoang,
Lavanya Raghavan,
Yih Chung Tham,
Charumathi Sabanayagam,
Wei-Chi Wu,
Ming-Chih Ho,
Jiangnan He,
Preeti Gupta,
Ecosse Lamoureux,
Seang Mei Saw,
Vinay Nangia,
Songhomitra Panda-Jonas,
Jie Xu,
Ya Xing Wang
, et al. (6 additional authors not shown)
Abstract:
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati…
▽ More
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved $\ge$ 82.2% accuracy in disease detection, $\ge$ 89% in severity differentiation, $\ge$ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers
Authors:
Aiyao He,
Sijia Cui,
Shuai Xu,
Yanna Wang,
Bo Xu
Abstract:
Recently, large language models(LLMs) have played an increasingly important role in solving a wide range of NLP tasks, leveraging their capabilities of natural language understanding and generating. Integration with external tools further enhances LLMs' effectiveness, providing more precise, timely, and specialized responses. However, LLMs still encounter difficulties with non-executable actions a…
▽ More
Recently, large language models(LLMs) have played an increasingly important role in solving a wide range of NLP tasks, leveraging their capabilities of natural language understanding and generating. Integration with external tools further enhances LLMs' effectiveness, providing more precise, timely, and specialized responses. However, LLMs still encounter difficulties with non-executable actions and improper actions, which are primarily attributed to incorrect parameters. The process of generating parameters by LLMs is confined to the tool level, employing the coarse-grained strategy without considering the different difficulties of various tools. To address this issue, we propose TUMS, a novel framework designed to enhance the tool-use capabilities of LLMs by transforming tool-level processing into parameter-level processing. Specifically, our framework consists of four key components: (1) an intent recognizer that identifies the user's intent to help LLMs better understand the task; (2) a task decomposer that breaks down complex tasks into simpler subtasks, each involving a tool call; (3) a subtask processor equipped with multi-structure handlers to generate accurate parameters; and (4) an executor. Our empirical studies have evidenced the effectiveness and efficiency of the TUMS framework with an average of 19.6\% and 50.6\% improvement separately on easy and hard benchmarks of ToolQA, meanwhile, we demonstrated the key contribution of each part with ablation experiments, offering more insights and stimulating future research on Tool-augmented LLMs.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Improving Unsupervised Task-driven Models of Ventral Visual Stream via Relative Position Predictivity
Authors:
Dazhong Rong,
Hao Dong,
Xing Gao,
Jiyu Wei,
Di Hong,
Yaoyao Hao,
Qinming He,
Yueming Wang
Abstract:
Based on the concept that ventral visual stream (VVS) mainly functions for object recognition, current unsupervised task-driven methods model VVS by contrastive learning, and have achieved good brain similarity. However, we believe functions of VVS extend beyond just object recognition. In this paper, we introduce an additional function involving VVS, named relative position (RP) prediction. We fi…
▽ More
Based on the concept that ventral visual stream (VVS) mainly functions for object recognition, current unsupervised task-driven methods model VVS by contrastive learning, and have achieved good brain similarity. However, we believe functions of VVS extend beyond just object recognition. In this paper, we introduce an additional function involving VVS, named relative position (RP) prediction. We first theoretically explain contrastive learning may be unable to yield the model capability of RP prediction. Motivated by this, we subsequently integrate RP learning with contrastive learning, and propose a new unsupervised task-driven method to model VVS, which is more inline with biological reality. We conduct extensive experiments, demonstrating that: (i) our method significantly improves downstream performance of object recognition while enhancing RP predictivity; (ii) RP predictivity generally improves the model brain similarity. Our results provide strong evidence for the involvement of VVS in location perception (especially RP prediction) from a computational perspective.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
ACT-R: Adaptive Camera Trajectories for 3D Reconstruction from Single Image
Authors:
Yizhi Wang,
Mingrui Zhao,
Ali Mahdavi-Amiri,
Hao Zhang
Abstract:
We introduce adaptive view planning to multi-view synthesis, aiming to improve both occlusion revelation and 3D consistency for single-view 3D reconstruction. Instead of generating an unordered set of views independently or simultaneously, we generate a sequence of views, leveraging temporal consistency to enhance 3D coherence. Most importantly, our view sequence is not determined by a pre-determi…
▽ More
We introduce adaptive view planning to multi-view synthesis, aiming to improve both occlusion revelation and 3D consistency for single-view 3D reconstruction. Instead of generating an unordered set of views independently or simultaneously, we generate a sequence of views, leveraging temporal consistency to enhance 3D coherence. Most importantly, our view sequence is not determined by a pre-determined camera setup. Instead, we compute an adaptive camera trajectory (ACT), specifically, an orbit of camera views, which maximizes the visibility of occluded regions of the 3D object to be reconstructed. Once the best orbit is found, we feed it to a video diffusion model to generate novel views around the orbit, which in turn, are passed to a multi-view 3D reconstruction model to obtain the final reconstruction. Our multi-view synthesis pipeline is quite efficient since it involves no run-time training/optimization, only forward inferences by applying the pre-trained models for occlusion analysis and multi-view synthesis. Our method predicts camera trajectories that reveal occlusions effectively and produce consistent novel views, significantly improving 3D reconstruction over SOTA on the unseen GSO dataset, both quantitatively and qualitatively.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Deep Probabilistic Modeling of User Behavior for Anomaly Detection via Mixture Density Networks
Authors:
Lu Dai,
Wenxuan Zhu,
Xuehui Quan,
Renzi Meng,
Sheng Cai,
Yichen Wang
Abstract:
To improve the identification of potential anomaly patterns in complex user behavior, this paper proposes an anomaly detection method based on a deep mixture density network. The method constructs a Gaussian mixture model parameterized by a neural network, enabling conditional probability modeling of user behavior. It effectively captures the multimodal distribution characteristics commonly presen…
▽ More
To improve the identification of potential anomaly patterns in complex user behavior, this paper proposes an anomaly detection method based on a deep mixture density network. The method constructs a Gaussian mixture model parameterized by a neural network, enabling conditional probability modeling of user behavior. It effectively captures the multimodal distribution characteristics commonly present in behavioral data. Unlike traditional classifiers that rely on fixed thresholds or a single decision boundary, this approach defines an anomaly scoring function based on probability density using negative log-likelihood. This significantly enhances the model's ability to detect rare and unstructured behaviors. Experiments are conducted on the real-world network user dataset UNSW-NB15. A series of performance comparisons and stability validation experiments are designed. These cover multiple evaluation aspects, including Accuracy, F1- score, AUC, and loss fluctuation. The results show that the proposed method outperforms several advanced neural network architectures in both performance and training stability. This study provides a more expressive and discriminative solution for user behavior modeling and anomaly detection. It strongly promotes the application of deep probabilistic modeling techniques in the fields of network security and intelligent risk control.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Exploiting Text Semantics for Few and Zero Shot Node Classification on Text-attributed Graph
Authors:
Yuxiang Wang,
Xiao Yan,
Shiyu Jin,
Quanqing Xu,
Chuang Hu,
Yuanyuan Zhu,
Bo Du,
Jia Wu,
Jiawei Jiang
Abstract:
Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics…
▽ More
Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics Augmentation (TSA) to improve accuracy by introducing more text semantic supervision signals. Specifically, we design two augmentation techniques, i.e., positive semantics matching and negative semantics contrast, to provide more reference texts for each graph node or text description. Positive semantic matching retrieves texts with similar embeddings to match with a graph node. Negative semantic contrast adds a negative prompt to construct a text description with the opposite semantics, which is contrasted with the original node and text. We evaluate TSA on 5 datasets and compare with 13 state-of-the-art baselines. The results show that TSA consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
GDNTT: an Area-Efficient Parallel NTT Accelerator Using Glitch-Driven Near-Memory Computing and Reconfigurable 10T SRAM
Authors:
Hengyu Ding,
Houran Ji,
Jia Li,
Jinhang Chen,
Chin-Wing Sham,
Yao Wang
Abstract:
With the rapid advancement of quantum computing technology, post-quantum cryptography (PQC) has emerged as a pivotal direction for next-generation encryption standards. Among these, lattice-based cryptographic schemes rely heavily on the fast Number Theoretic Transform (NTT) over polynomial rings, whose performance directly determines encryption/decryption throughput and energy efficiency. However…
▽ More
With the rapid advancement of quantum computing technology, post-quantum cryptography (PQC) has emerged as a pivotal direction for next-generation encryption standards. Among these, lattice-based cryptographic schemes rely heavily on the fast Number Theoretic Transform (NTT) over polynomial rings, whose performance directly determines encryption/decryption throughput and energy efficiency. However, existing software-based NTT implementations struggle to meet the real-time performance and low-power requirements of IoT and edge devices. To address this challenge, this paper proposes an area-efficient highly parallel NTT accelerator with glitch-driven near-memory computing (GDNTT). The design integrates a 10T SRAM for data storage, enabling flexible row/column data access and streamlining circuit mapping strategies. Furthermore, a glitch generator is incorporated into the near-memory computing unit, significantly reducing the latency of butterfly operations. Evaluation results show that the proposed NTT accelerator achieves a 1.5~28* improvement in throughput-per-area compared to the state-of-the-art.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Will Your Next Pair Programming Partner Be Human? An Empirical Evaluation of Generative AI as a Collaborative Teammate in a Semester-Long Classroom Setting
Authors:
Wenhan Lyu,
Yimeng Wang,
Yifan Sun,
Yixuan Zhang
Abstract:
Generative AI (GenAI), especially Large Language Models (LLMs), is rapidly reshaping both programming workflows and computer science education. Many programmers now incorporate GenAI tools into their workflows, including for collaborative coding tasks such as pair programming. While prior research has demonstrated the benefits of traditional pair programming and begun to explore GenAI-assisted cod…
▽ More
Generative AI (GenAI), especially Large Language Models (LLMs), is rapidly reshaping both programming workflows and computer science education. Many programmers now incorporate GenAI tools into their workflows, including for collaborative coding tasks such as pair programming. While prior research has demonstrated the benefits of traditional pair programming and begun to explore GenAI-assisted coding, the role of LLM-based tools as collaborators in pair programming remains underexamined. In this work, we conducted a mixed-methods study with 39 undergraduate students to examine how GenAI influences collaboration, learning, and performance in pair programming. Specifically, students completed six in-class assignments under three conditions: Traditional Pair Programming (PP), Pair Programming with GenAI (PAI), and Solo Programming with GenAI (SAI). They used both LLM-based inline completion tools (e.g., GitHub Copilot) and LLM-based conversational tools (e.g., ChatGPT). Our results show that students in PAI achieved the highest assignment scores, whereas those in SAI attained the lowest. Additionally, students' attitudes toward LLMs' programming capabilities improved significantly after collaborating with LLM-based tools, and preferences were largely shaped by the perceived usefulness for completing assignments and learning programming skills, as well as the quality of collaboration. Our qualitative findings further reveal that while students appreciated LLM-based tools as valuable pair programming partners, they also identified limitations and had different expectations compared to human teammates. Our study provides one of the first empirical evaluations of GenAI as a pair programming collaborator through a comparison of three conditions (PP, PAI, and SAI). We also discuss the design implications and pedagogical considerations for future GenAI-assisted pair programming approaches.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks
Authors:
Jiafan Li,
Jiaqi Zhu,
Liang Chang,
Yilin Li,
Miaomiao Li,
Yang Wang,
Hongan Wang
Abstract:
Nowadays, numerous online platforms can be described as multi-modal heterogeneous networks (MMHNs), such as Douban's movie networks and Amazon's product review networks. Accurately categorizing nodes within these networks is crucial for analyzing the corresponding entities, which requires effective representation learning on nodes. However, existing multi-modal fusion methods often adopt either ea…
▽ More
Nowadays, numerous online platforms can be described as multi-modal heterogeneous networks (MMHNs), such as Douban's movie networks and Amazon's product review networks. Accurately categorizing nodes within these networks is crucial for analyzing the corresponding entities, which requires effective representation learning on nodes. However, existing multi-modal fusion methods often adopt either early fusion strategies which may lose the unique characteristics of individual modalities, or late fusion approaches overlooking the cross-modal guidance in GNN-based information propagation. In this paper, we propose a novel model for node classification in MMHNs, named Heterogeneous Graph Neural Network with Inter-Modal Attention (HGNN-IMA). It learns node representations by capturing the mutual influence of multiple modalities during the information propagation process, within the framework of heterogeneous graph transformer. Specifically, a nested inter-modal attention mechanism is integrated into the inter-node attention to achieve adaptive multi-modal fusion, and modality alignment is also taken into account to encourage the propagation among nodes with consistent similarities across all modalities. Moreover, an attention loss is augmented to mitigate the impact of missing modalities. Extensive experiments validate the superiority of the model in the node classification task, providing an innovative view to handle multi-modal data, especially when accompanied with network structures.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
The Sound of Populism: Distinct Linguistic Features Across Populist Variants
Authors:
Yu Wang,
Runxi Yu,
Zhongyuan Wang,
Jing He
Abstract:
This study explores the sound of populism by integrating the classic Linguistic Inquiry and Word Count (LIWC) features, which capture the emotional and stylistic tones of language, with a fine-tuned RoBERTa model, a state-of-the-art context-aware language model trained to detect nuanced expressions of populism. This approach allows us to uncover the auditory dimensions of political rhetoric in U.S…
▽ More
This study explores the sound of populism by integrating the classic Linguistic Inquiry and Word Count (LIWC) features, which capture the emotional and stylistic tones of language, with a fine-tuned RoBERTa model, a state-of-the-art context-aware language model trained to detect nuanced expressions of populism. This approach allows us to uncover the auditory dimensions of political rhetoric in U.S. presidential inaugural and State of the Union addresses. We examine how four key populist dimensions (i.e., left-wing, right-wing, anti-elitism, and people-centrism) manifest in the linguistic markers of speech, drawing attention to both commonalities and distinct tonal shifts across these variants. Our findings reveal that populist rhetoric consistently features a direct, assertive ``sound" that forges a connection with ``the people'' and constructs a charismatic leadership persona. However, this sound is not simply informal but strategically calibrated. Notably, right-wing populism and people-centrism exhibit a more emotionally charged discourse, resonating with themes of identity, grievance, and crisis, in contrast to the relatively restrained emotional tones of left-wing and anti-elitist expressions.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Skeletonization of neuronal processes using Discrete Morse techniques from computational topology
Authors:
Samik Banerjee,
Caleb Stam,
Daniel J. Tward,
Steven Savoia,
Yusu Wang,
Partha P. Mitra
Abstract:
To understand biological intelligence we need to map neuronal networks in vertebrate brains. Mapping mesoscale neural circuitry is done using injections of tracers that label groups of neurons whose axons project to different brain regions. Since many neurons are labeled, it is difficult to follow individual axons. Previous approaches have instead quantified the regional projections using the tota…
▽ More
To understand biological intelligence we need to map neuronal networks in vertebrate brains. Mapping mesoscale neural circuitry is done using injections of tracers that label groups of neurons whose axons project to different brain regions. Since many neurons are labeled, it is difficult to follow individual axons. Previous approaches have instead quantified the regional projections using the total label intensity within a region. However, such a quantification is not biologically meaningful. We propose a new approach better connected to the underlying neurons by skeletonizing labeled axon fragments and then estimating a volumetric length density. Our approach uses a combination of deep nets and the Discrete Morse (DM) technique from computational topology. This technique takes into account nonlocal connectivity information and therefore provides noise-robustness. We demonstrate the utility and scalability of the approach on whole-brain tracer injected data. We also define and illustrate an information theoretic measure that quantifies the additional information obtained, compared to the skeletonized tracer injection fragments, when individual axon morphologies are available. Our approach is the first application of the DM technique to computational neuroanatomy. It can help bridge between single-axon skeletons and tracer injections, two important data types in mapping neural networks in vertebrates.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Authors:
Xiaomi LLM-Core Team,
:,
Bingquan Xia,
Bowen Shen,
Cici,
Dawei Zhu,
Di Zhang,
Gang Wang,
Hailin Zhang,
Huaqiu Liu,
Jiebao Xiao,
Jinhao Dong,
Liang Zhao,
Peidian Li,
Peng Wang,
Shihua Yu,
Shimao Chen,
Weikun Wang,
Wenhan Ma,
Xiangwei Deng,
Yi Huang,
Yifan Song,
Zihan Jiang,
Bowen Ye,
Can Cai
, et al. (40 additional authors not shown)
Abstract:
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…
▽ More
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.