Generalizing to New Dynamical Systems via Frequency Domain Adaptation

Authors: Tiexin Qin, Hong Yan, Haoliang Li

Abstract: Learning the underlying dynamics from data with deep neural networks has shown remarkable potential in modeling various complex physical dynamics. However, current approaches are constrained in their ability to make reliable predictions in a specific domain and struggle with generalizing to unseen systems that are governed by the same general dynamics but differ in environmental characteristics. In this work, we formulate a parameter-efficient method, Fourier Neural Simulator for Dynamical Adaptation (FNSDA), that can readily generalize to new dynamics via adaptation in the Fourier space. Specifically, FNSDA identifies the shareable dynamics based on the known environments using an automatic partition in Fourier modes and learns to adjust the modes specific for each new environment by conditioning on low-dimensional latent systematic parameters for efficient generalization. We evaluate our approach on four representative families of dynamic systems, and the results show that FNSDA can achieve superior or competitive generalization performance compared to existing methods with a significantly reduced parameter cost. Our code is available at https://github.com/WonderSeven/FNSDA.

Submitted 17 June, 2025; originally announced July 2025.

Comments: Accepted by TPAMI 2025

arXiv:2506.20324 [pdf, ps, other]

Permutation Equivariant Neural Controlled Differential Equations for Dynamic Graph Representation Learning

Authors: Torben Berndt, Benjamin Walker, Tiexin Qin, Jan Stühmer, Andrey Kormilitzin

Abstract: Dynamic graphs exhibit complex temporal dynamics due to the interplay between evolving node features and changing network structures. Recently, Graph Neural Controlled Differential Equations (Graph Neural CDEs) successfully adapted Neural CDEs from paths on Euclidean domains to paths on graph domains. Building on this foundation, we introduce Permutation Equivariant Neural Graph CDEs, which project Graph Neural CDEs onto permutation equivariant function spaces. This significantly reduces the model's parameter count without compromising representational power, resulting in more efficient training and improved generalisation. We empirically demonstrate the advantages of our approach through experiments on simulated dynamical systems and real-world tasks, showing improved performance in both interpolation and extrapolation scenarios.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.15741 [pdf, ps, other]

OAgents: An Empirical Study of Building Effective Agents

Authors: He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, Wangchunshu Zhou

Abstract: Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, while others are redundant, despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.

Submitted 23 June, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

Comments: 28 pages

arXiv:2506.15133 [pdf]

A Specht Filtration of Permutation Modules Over KLR Algebras

Authors: Tao Qin

Abstract: In type A, Kleshchev-Ram-Mathas realize Specht modules as quotients of permutation modules. In this paper, we construct a Specht filtration of permutation modules indexed by hook partitions in affine type A, and a generalized Specht filtration of permutation modules indexed by any partition in the linear quiver case.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 24 pages, comments welcome

arXiv:2506.10055 [pdf, ps, other]

TaskCraft: Automated Generation of Agentic Tasks

Authors: Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Liu, Jian Yang, Ge Zhang, Jiaheng Liu, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.

Submitted 17 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.07392 [pdf, ps, other]

From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks

Authors: Yuyang Zhou, Guang Cheng, Kang Du, Zihan Chen, Tian Qin, Yuyu Zhao

Abstract: The proliferation of unmanned aerial vehicle (UAV) swarms has enabled a wide range of mission-critical applications, but also exposes UAV networks to severe Denial-of-Service (DoS) threats due to their open wireless environment, dynamic topology, and resource constraints. Traditional static or centralized defense mechanisms are often inadequate for such dynamic and distributed scenarios. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive and adaptive DoS mitigation in UAV swarm networks. Specifically, we design three lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, that leverage the inherent flexibility of UAV swarms to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process (POMDP), capturing the distributed, resource-constrained, and uncertain nature of UAV swarms under attack. Each UAV is equipped with a local policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based FMADRL algorithm, UAVs collaboratively optimize their defense policies via reward-weighted aggregation, enabling distributed learning without sharing raw data and thus reducing communication overhead. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, while maintaining robust mission continuity under various DoS attack strategies.

Submitted 8 June, 2025; originally announced June 2025.

Comments: 13 pages; In submission

MSC Class: 68 ACM Class: F.2.2

arXiv:2506.03747 [pdf, other]

Fast Non-Line-of-Sight Transient Data Simulation and an Open Benchmark Dataset

Authors: Yingjie Shi, Jinye Miao, Taotao Qin, Fuyao Cai, Yi Wei, Lingfeng Liu, Tongyao Li, Chenyang Wu, Huan Liang, Yuyang Yin, Lianfa Bai, Enlai Guo, Jing Han

Abstract: Non-Line-of-Sight (NLOS) imaging reconstructs the shape and depth of hidden objects from picosecond-resolved transient signals, offering potential applications in autonomous driving, security, and medical diagnostics. However, current NLOS experiments rely on expensive hardware and complex system alignment, limiting their scalability. This manuscript presents a simplified simulation method that generates NLOS transient data by modeling light-intensity transport rather than performing conventional path tracing, significantly enhancing computational efficiency. All scene elements, including the relay surface, hidden target, stand-off distance, detector time resolution, and acquisition window are fully parameterized, allowing for rapid configuration of test scenarios. Reconstructions based on the simulated data accurately recover hidden geometries, validating the effectiveness of the approach. The proposed tool reduces the entry barrier for NLOS research and supports the optimization of system design.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2505.22756 [pdf, ps, other]

Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

Authors: Tian Qin, Core Francisco Park, Mujin Kwun, Aaron Walsman, Eran Malach, Nikhil Anand, Hidenori Tanaka, David Alvarez-Melis

Abstract: Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.15216 [pdf, ps, other]

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Authors: Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, et al. (9 additional authors not shown)

Abstract: AI agents have the potential to significantly alter the cybersecurity landscape. To help us understand this change, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from $10 to $30,485, and cover 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 5 agents: Claude Code, OpenAI Codex CLI, and custom agents with GPT-4.1, Gemini 2.5 Pro Preview, and Claude 3.7 Sonnet Thinking. Given up to three attempts, the top-performing agents are Claude Code (5% on Detect, mapping to $1,350), Custom Agent with Claude 3.7 Sonnet Thinking (5% on Detect, mapping to $1,025; 67.5% on Exploit), and OpenAI Codex CLI (5% on Detect, mapping to $2,400; 90% on Patch, mapping to $14,422). OpenAI Codex CLI and Claude Code are more capable at defense, achieving higher Patch scores of 90% and 87.5%, compared to Exploit scores of 32.5% and 57.5% respectively; in contrast, the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 40-67.5% and Patch scores of 45-60%.

Submitted 21 May, 2025; originally announced May 2025.

Comments: 78 pages

arXiv:2505.14254 [pdf, other]

Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

Authors: Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai

Abstract: Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data.

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.11820 [pdf, other]

Chain-of-Model Learning for Language Model

Authors: Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu

Abstract: In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of Transformer architecture. Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple varying model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.

Submitted 23 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

arXiv:2504.11783 [pdf, other]

The Digital Cybersecurity Expert: How Far Have We Come?

Authors: Dawei Wang, Geng Zhou, Xianglong Li, Yu Bai, Li Chen, Ting Qin, Jian Sun, Dan Li

Abstract: The increasing deployment of large language models (LLMs) in the cybersecurity domain underscores the need for effective model selection and evaluation. However, traditional evaluation methods often overlook specific cybersecurity knowledge gaps that contribute to performance limitations. To address this, we develop CSEBenchmark, a fine-grained cybersecurity evaluation framework based on 345 knowledge points expected of cybersecurity experts. Drawing from cognitive science, these points are categorized into factual, conceptual, and procedural types, enabling the design of 11,050 tailored multiple-choice questions. We evaluate 12 popular LLMs on CSEBenchmark and find that even the best-performing model achieves only 85.42% overall accuracy, with particular knowledge gaps in the use of specialized tools and uncommon commands. Different LLMs have unique knowledge gaps. Even large models from the same family may perform poorly on knowledge points where smaller models excel. By identifying and addressing specific knowledge gaps in each LLM, we achieve up to an 84% improvement in correcting previously incorrect predictions across three existing benchmarks for two cybersecurity tasks. Furthermore, our assessment of each LLM's knowledge alignment with specific cybersecurity roles reveals that different models align better with different roles, such as GPT-4o for the Google Senior Intelligence Analyst and Deepseek-V3 for the Amazon Privacy Engineer. These findings underscore the importance of aligning LLM selection with the specific knowledge requirements of different cybersecurity roles for optimal performance.

Submitted 16 April, 2025; originally announced April 2025.

Comments: To appear in the IEEE Symposium on Security and Privacy (IEEE S&P) 2025, San Francisco, CA, USA

arXiv:2504.07052 [pdf, other]

To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning

Authors: Tian Qin, David Alvarez-Melis, Samy Jelassi, Eran Malach

Abstract: Recent advancements in large language models have significantly improved their reasoning abilities, particularly through techniques involving search and backtracking. Backtracking naturally scales test-time compute by enabling sequential, linearized exploration via long chain-of-thought (CoT) generation. However, this is not the only strategy for scaling test-time compute: parallel sampling with best-of-n selection provides an alternative that generates diverse solutions simultaneously. Despite the growing adoption of sequential search, its advantages over parallel sampling, especially under a fixed compute budget, remain poorly understood. In this paper, we systematically compare these two approaches on two challenging reasoning tasks: CountDown and Sudoku. Surprisingly, we find that sequential search underperforms parallel sampling on CountDown but outperforms it on Sudoku, suggesting that backtracking is not universally beneficial. We identify two factors that can cause backtracking to degrade performance: (1) training on fixed search traces can lock models into suboptimal strategies, and (2) explicit CoT supervision can discourage "implicit" (non-verbalized) reasoning. Extending our analysis to reinforcement learning (RL), we show that models with backtracking capabilities benefit significantly from RL fine-tuning, while models without backtracking see limited, mixed gains. Together, these findings challenge the assumption that backtracking universally enhances LLM reasoning, instead revealing a complex interaction between task structure, training data, model scale, and learning paradigm.

Submitted 9 April, 2025; originally announced April 2025.

arXiv:2504.02008 [pdf, other]

Test-time Adaptation for Foundation Medical Segmentation Model without Parametric Updates

Authors: Kecheng Chen, Xinyu Luo, Tiexin Qin, Jie Liu, Hui Liu, Victor Ho Fun Lee, Hong Yan, Haoliang Li

Abstract: Foundation medical segmentation models, with MedSAM being the most popular, have achieved promising performance across organs and lesions. However, MedSAM still suffers from compromised performance on specific lesions with intricate structures and appearance, as well as bounding box prompt-induced perturbations. Although current test-time adaptation (TTA) methods for medical image segmentation may tackle this issue, partial (e.g., batch normalization) or whole parametric updates restrict their effectiveness due to limited update signals or catastrophic forgetting in large models. Meanwhile, these approaches ignore the computational complexity during adaptation, which is particularly significant for modern foundation models. To this end, our theoretical analyses reveal that directly refining image embeddings is feasible to approach the same goal as parametric updates under the MedSAM architecture, which enables us to realize high computational efficiency and segmentation performance without the risk of catastrophic forgetting. Under this framework, we propose to encourage maximizing factorized conditional probabilities of the posterior prediction probability using a proposed distribution-approximated latent conditional random field loss combined with an entropy minimization loss. Experiments show that we achieve about 3% Dice score improvements across three datasets while reducing computational complexity by over 7 times.

Submitted 1 April, 2025; originally announced April 2025.

Comments: Under review

arXiv:2503.20825 [pdf, other]

Debiasing Kernel-Based Generative Models

Authors: Tian Qin, Wei-Min Huang

Abstract: We propose a novel two-stage framework of generative models named Debiasing Kernel-Based Generative Models (DKGM) with the insights from kernel density estimation (KDE) and stochastic approximation. In the first stage of DKGM, we employ KDE to bypass the obstacles in estimating the density of data without losing too much image quality. One characteristic of KDE is oversmoothing, which makes the generated image blurry. Therefore, in the second stage, we formulate the process of reducing the blurriness of images as a statistical debiasing problem and develop a novel iterative algorithm to improve image quality, which is inspired by the stochastic approximation. Extensive experiments illustrate that the image quality of DKGM on CIFAR10 is comparable to state-of-the-art models such as diffusion models and GAN models. The performance of DKGM on CelebA 128x128 and LSUN (Church) 128x128 is also competitive. We conduct extra experiments to exploit how the bandwidth in KDE affects the sample diversity and debiasing effect of DKGM. The connections between DKGM and score-based models are also discussed.

Submitted 25 March, 2025; originally announced March 2025.

arXiv:2503.15886 [pdf, other]

Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance

Authors: Hui Liu, Wenya Wang, Kecheng Chen, Jie Liu, Yibing Liu, Tiexin Qin, Peisong He, Xinghao Jiang, Haoliang Li

Abstract: In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories by composing known simpler concepts. However, existing vision-language models (VLMs), despite achieving significant progress through large-scale natural language supervision, often underperform in real-world applications because of sub-optimal prompt engineering and the inability to adapt effectively to target classes. To address these issues, we propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes' theorem, CHBR models the concept used in human image recognition as latent variables and formulates this task by summing across potential concepts, weighted by a prior distribution and a likelihood function. To tackle the intractable computation over an infinite concept space, we introduce an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts, emphasizing inter-class differences. We further propose three heuristic approaches involving Average Likelihood, Confidence Likelihood, and Test Time Augmentation (TTA) Likelihood, which dynamically refine the combination of concepts based on the test image. Extensive evaluations across fifteen datasets demonstrate that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.

Submitted 20 March, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

Comments: 21 pages, 7 figures, 7 tables

arXiv:2503.14345 [pdf, other]

MoonCast: High-Quality Zero-Shot Podcast Generation

Authors: Zeqian Ju, Dongchao Yang, Jianwei Yu, Kai Shen, Yichong Leng, Zhengtao Wang, Xu Tan, Xinyu Zhou, Tao Qin, Xiangyang Li

Abstract: Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.

Submitted 19 March, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.13808 [pdf, other]

SNAKE: A Sustainable and Multi-functional Traffic Analysis System utilizing Specialized Large-Scale Models with a Mixture of Experts Architecture

Authors: Tian Qin, Guang Cheng, Yuyang Zhou, Zihan Chen, Xing Luan

Abstract: The rapid advancement of internet technology has led to a surge in data transmission, making network traffic classification crucial for security and management. However, there are significant deficiencies in its efficiency for handling multi-attribute analysis and its ability to expand model knowledge, making it difficult to adapt to the ever-changing network environment and complex identification requirements. To address this issue, we propose the SNAKE (Sustainable Network Analysis with Knowledge Exploration) system, which adopts a multi-gated mixture of experts architecture to construct a multi-functional traffic classification model. The system analyzes traffic attributes at different levels through multiple expert sub-models, providing predictions for these attributes via gating and a final Tower network. Additionally, through an intelligent gating configuration, the system enables extremely fast model integration and evolution across various knowledge expansion scenarios. Its excellent compatibility allows it to continuously evolve into a multi-functional large-scale model in the field of traffic analysis. Our experimental results demonstrate that the SNAKE system exhibits remarkable scalability when faced with incremental challenges in diverse traffic classification tasks. Currently, we have integrated multiple models into the system, enabling it to classify a wide range of attributes, such as encapsulation usage, application types and numerous malicious behaviors. We believe that, after continuous expansion, SNAKE can pioneer a sustainable and multifunctional large-scale model in the field of network traffic analysis.

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.11568 [pdf]

Probing the Limit of Heat Transfer in Inorganic Crystals with Deep Learning

Authors: Jielan Li, Zekun Chen, Qian Wang, Han Yang, Ziheng Lu, Guanzhi Li, Shuizhou Chen, Yu Zhu, Xixian Liu, Junfu Tan, Mingfa Tang, Yichi Zhou, Claudio Zeni, Andrew Fowler, Daniel Zügner, Robert Pinsler, Matthew Horton, Tian Xie, Tie-Yan Liu, Haiguang Liu, Tao Qin, Bing Lv, Davide Donadio, Hongxia Hao

Abstract: Heat transfer is a fundamental property of matter. Research spanning decades has attempted to discover materials with exceptional thermal conductivity, yet the upper limit remains unknown. Using deep learning accelerated crystal structure prediction and first-principles calculation, we systematically explore the thermal conductivity landscape of inorganic crystals. We brute-force over half a million ordered crystalline structures, encompassing an extensive coverage of local energy minima in binary compounds with up to four atoms per primitive cell. We confirm diamond sets the upper bound of thermal conductivity within our search space, very likely also among all stable crystalline solids at ambient conditions. We also identify over 20 novel crystals surpassing silicon in thermal conductivity, validated by density functional theory. These include a semiconductor TaN with ultrahigh thermal conductivity (~900 $\mathrm{W\cdot m^{-1}\cdot K^{-1}}$), and metallic compounds such as MnV that exhibit high lattice and electronic thermal conductivity simultaneously, a distinctive feature not observed before. These results, as well as the deep learning-driven screening method, redefine the landscape of thermal transport and establish a large open-access database for future materials discovery.

Submitted 31 May, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

arXiv:2503.06687 [pdf, other]

UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion

Authors: Gongbo Zhang, Yanting Li, Renqian Luo, Pipi Hu, Zeru Zhao, Lingbo Li, Guoqing Liu, Zun Wang, Ran Bi, Kaiyuan Gao, Liya Guo, Yu Xie, Chang Liu, Jia Zhang, Tian Xie, Robert Pinsler, Claudio Zeni, Ziheng Lu, Yingce Xia, Marwin Segler, Maik Riechert, Li Yuan, Lei Chen, Haiguang Liu, Tao Qin

Abstract: Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation.

Submitted 9 March, 2025; originally announced March 2025.

arXiv:2503.04131 [pdf, other]

Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression

Authors: Jie Liu, Tiexin Qin, Hui Liu, Yilei Shi, Lichao Mou, Xiao Xiang Zhu, Shiqi Wang, Haoliang Li

Abstract: In this work, we address the challenge of adaptive pediatric Left Ventricular Ejection Fraction (LVEF) assessment. While Test-time Training (TTT) approaches show promise for this task, they suffer from two significant limitations. Existing TTT works are primarily designed for classification tasks rather than continuous value regression, and they lack mechanisms to handle the quasi-periodic nature of cardiac signals. To tackle these issues, we propose a novel Quasi-Periodic Adaptive Regression with Test-time Training (Q-PART) framework. In the training stage, the proposed Quasi-Period Network decomposes the echocardiogram into periodic and aperiodic components within latent space by combining parameterized helix trajectories with Neural Controlled Differential Equations. During inference, our framework further employs a variance minimization strategy across image augmentations that simulate common quality issues in echocardiogram acquisition, along with differential adaptation rates for periodic and aperiodic components. Theoretical analysis is provided to demonstrate that our variance minimization objective effectively bounds the regression error under mild conditions. Furthermore, extensive experiments across three pediatric age groups demonstrate that Q-PART not only significantly outperforms existing approaches in pediatric LVEF prediction, but also exhibits strong clinical screening capability with high mAUROC scores (up to 0.9747) and maintains gender-fair performance across all metrics, validating its robustness and practical utility in pediatric echocardiography analysis.

Submitted 6 March, 2025; originally announced March 2025.

Comments: Accepted to CVPR 2025

arXiv:2503.03208 [pdf, other]

Embodied Escaping: End-to-End Reinforcement Learning for Robot Navigation in Narrow Environment

Authors: Han Zheng, Jiale Zhang, Mingyang Jiang, Peiyuan Liu, Danni Liu, Tong Qin, Ming Yang

Abstract: Autonomous navigation is a fundamental task for robot vacuum cleaners in indoor environments. Since their core function is to clean entire areas, robots inevitably encounter dead zones in cluttered and narrow scenarios. Existing planning methods often fail to escape due to complex environmental constraints, high-dimensional search spaces, and high-difficulty maneuvers. To address these challenges, this paper proposes an embodied escaping model that leverages a reinforcement learning-based policy with an efficient action mask for dead zone escaping. To alleviate the issue of sparse reward in training, we introduce a hybrid training policy that improves learning efficiency. To handle redundant and ineffective action options, we design a novel action representation to reshape the discrete action space with a uniform turning radius. Furthermore, we develop an action mask strategy to select valid actions quickly, balancing precision and efficiency. In real-world experiments, our robot is equipped with a LiDAR, an IMU, and two wheel encoders. Extensive quantitative and qualitative experiments across varying difficulty levels demonstrate that our robot can consistently escape from challenging dead zones. Moreover, our approach significantly outperforms the compared path planning and reinforcement learning methods in terms of success rate and collision avoidance.

Submitted 5 March, 2025; originally announced March 2025.

arXiv:2502.18846 [pdf, other]

RL-OGM-Parking: Lidar OGM-Based Hybrid Reinforcement Learning Planner for Autonomous Parking

Authors: Zhitao Wang, Zhe Chen, Mingyang Jiang, Tong Qin, Ming Yang

Abstract: Autonomous parking has become a critical application in automatic driving research and development. Parking operations often suffer from limited space and complex environments, requiring accurate perception and precise maneuvering. Traditional rule-based parking algorithms struggle to adapt to diverse and unpredictable conditions, while learning-based algorithms lack consistent and stable performance in various scenarios. Therefore, a hybrid approach is necessary that combines the stability of rule-based methods and the generalizability of learning-based methods. Recently, reinforcement learning (RL) based policies have shown robust capability in planning tasks. However, the simulation-to-reality (sim-to-real) transfer gap seriously blocks real-world deployment. To address these problems, we employ a hybrid policy, consisting of a rule-based Reeds-Shepp (RS) planner and a learning-based reinforcement learning (RL) planner. A real-time LiDAR-based Occupancy Grid Map (OGM) representation is adopted to bridge the sim-to-real gap, allowing the hybrid policy to be applied to real-world systems seamlessly. We conducted extensive experiments both in the simulation environment and in real-world scenarios, and the results demonstrate that the proposed method outperforms pure rule-based and learning-based methods. The real-world experiment further validates the feasibility and efficiency of the proposed method.

Submitted 26 February, 2025; originally announced February 2025.

arXiv:2502.17356 [pdf, other]

Distributional Scaling for Emergent Capabilities

Authors: Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra

Abstract: This paper explores the nature of sudden breakthroughs in language model performance at scale, which stand in contrast to smooth improvements governed by scaling laws. While advocates of "emergence" view breakthroughs as unlocked capabilities, others attribute them to thresholding effects on noncontinuous metrics. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes when performance is bimodally distributed across random seeds. In synthetic length generalization tasks, we show that different random seeds can produce either highly linear or emergent scaling trends. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. Furthermore, we provide a case study of inverse scaling. We validate our distributional scaling framework on realistic settings by measuring MMLU performance in LM populations. These insights emphasize the role of random variation in the effect of scale on LM capabilities.

Submitted 27 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

Comments: 18 pages

ACM Class: I.2.7

arXiv:2502.14934 [pdf, other]

Fast and Accurate Blind Flexible Docking

Authors: Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han

Abstract: Molecular docking, which predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking (pocket identification, ligand conformation prediction, and protein flexibility modeling) into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208 $\times$) compared to existing state-of-the-art methods. Our code is released at https://github.com/tmlr-group/FABFlex.

Submitted 20 February, 2025; originally announced February 2025.

Comments: 25 pages, Accepted by ICLR 2025

arXiv:2502.10807 [pdf, other]

HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model

Authors: Mingqian Ma, Guoqing Liu, Chuan Cao, Pan Deng, Tri Dao, Albert Gu, Peiran Jin, Zhao Yang, Yingce Xia, Renqian Luo, Pipi Hu, Zun Wang, Yuan-Jyue Chen, Haiguang Liu, Tao Qin

Abstract: Advances in natural language processing and large language models have sparked growing interest in modeling DNA, often referred to as the "language of life". However, DNA modeling poses unique challenges. First, it requires the ability to process ultra-long DNA sequences while preserving single-nucleotide resolution, as individual nucleotides play a critical role in DNA function. Second, success in this domain requires excelling at both generative and understanding tasks: generative tasks hold potential for therapeutic and industrial applications, while understanding tasks provide crucial insights into biological mechanisms and diseases. To address these challenges, we propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture, seamlessly integrating the strengths of attention mechanisms with selective state-space models. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks, and demonstrates exceptional capability in generating synthetic cis-regulatory elements (CREs) with desired properties. Furthermore, we show that HybriDNA adheres to expected scaling laws, with performance improving consistently as the model scales from 300M to 3B and 7B parameters. These findings underscore HybriDNA's versatility and its potential to advance DNA research and applications, paving the way for innovations in understanding and engineering the "language of life".

Submitted 17 February, 2025; v1 submitted 15 February, 2025; originally announced February 2025.

Comments: Project page: https://hybridna-project.github.io/HybriDNA-Project/

arXiv:2502.07527 [pdf, ps, other]

Nature Language Model: Deciphering the Language of Nature for Scientific Discovery

Authors: Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen, Zekun Guo, Yeqi Bai, Pan Deng, Yaosen Min, Ziheng Lu, Hongxia Hao, Han Yang, Jielan Li, Chang Liu, Jia Zhang, Jianwei Zhu, Ran Bi, Kehan Wu, Wei Zhang, Kaiyuan Gao, et al. (21 additional authors not shown)

Abstract: Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) top performance across different domains, matching or surpassing state-of-the-art specialist models. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.

Submitted 20 June, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: 95 pages

arXiv:2501.19216 [pdf, other]

E2Former: A Linear-time Efficient and Equivariant Transformer for Scalable Molecular Modeling

Authors: Yunyang Li, Lin Huang, Zhihao Ding, Chu Wang, Xinran Wei, Han Yang, Zun Wang, Chang Liu, Yu Shi, Peiran Jin, Jia Zhang, Mark Gerstein, Tao Qin

Abstract: Equivariant Graph Neural Networks (EGNNs) have demonstrated significant success in modeling microscale systems, including those in chemistry, biology and materials science. However, EGNNs face substantial computational challenges due to the high cost of constructing edge features via spherical tensor products, making them impractical for large-scale systems. To address this limitation, we introduce E2Former, an equivariant and efficient transformer architecture that incorporates the Wigner $6j$ convolution (Wigner $6j$ Conv). By shifting the computational burden from edges to nodes, the Wigner $6j$ Conv reduces the complexity from $O(|\mathcal{E}|)$ to $O(|\mathcal{V}|)$ while preserving both the model's expressive power and rotational equivariance. We show that this approach achieves a 7x-30x speedup compared to conventional $\mathrm{SO}(3)$ convolutions. Furthermore, our empirical results demonstrate that the derived E2Former mitigates the computational challenges of existing approaches without compromising the ability to capture detailed geometric information. This development could suggest a promising direction for scalable and efficient molecular modeling.

Submitted 3 February, 2025; v1 submitted 31 January, 2025; originally announced January 2025.

arXiv:2501.08672 [pdf, other]

GS-LIVO: Real-Time LiDAR, Inertial, and Visual Multi-sensor Fused Odometry with Gaussian Mapping

Authors: Sheng Hong, Chunran Zheng, Yishu Shen, Changze Li, Fu Zhang, Tong Qin, Shaojie Shen

Abstract: In recent years, 3D Gaussian splatting (3D-GS) has emerged as a novel scene representation approach. However, existing vision-only 3D-GS methods often rely on hand-crafted heuristics for point-cloud densification and face challenges in handling occlusions and high GPU memory and computation consumption. LiDAR-Inertial-Visual (LIV) sensor configuration has demonstrated superior performance in localization and dense mapping by leveraging complementary sensing characteristics: rich texture information from cameras, precise geometric measurements from LiDAR, and high-frequency motion data from IMU. Inspired by this, we propose a novel real-time Gaussian-based simultaneous localization and mapping (SLAM) system. Our map system comprises a global Gaussian map and a sliding window of Gaussians, along with an IESKF-based odometry. The global Gaussian map consists of hash-indexed voxels organized in a recursive octree, effectively covering sparse spatial volumes while adapting to different levels of detail and scales. The Gaussian map is initialized through multi-sensor fusion and optimized with photometric gradients. Our system incrementally maintains a sliding window of Gaussians, significantly reducing GPU computation and memory consumption by only optimizing the map within the sliding window. Moreover, we implement a tightly coupled multi-sensor fusion odometry with an iterative error state Kalman filter (IESKF), leveraging real-time updating and rendering of the Gaussian map. Our system represents the first real-time Gaussian-based SLAM framework deployable on resource-constrained embedded systems, demonstrated on the NVIDIA Jetson Orin NX platform. The framework achieves real-time performance while maintaining robust multi-sensor fusion capabilities. All implementation algorithms, hardware designs, and CAD models will be publicly available.

Submitted 15 January, 2025; originally announced January 2025.

arXiv:2501.04246 [pdf, other]

Drift-oriented Self-evolving Encrypted Traffic Application Classification for Actual Network Environment

Authors: Zihan Chen, Guang Cheng, Jinhui Li, Tian Qin, Yuyang Zhou, Xing Luan

Abstract: Encrypted traffic classification technology is a crucial decision-making information source for network management and security protection. It has the advantages of excellent response timeliness, large-scale data bearing, and cross-time-and-space analysis. The existing research on encrypted traffic classification has gradually transitioned from the closed world to the open world, and many classifier optimization and feature engineering schemes have been proposed. However, encrypted traffic classification has yet to be effectively applied to the actual network environment. The main reason is that applications on the Internet are constantly updated, including function adjustment and version change, which brings severe feature concept drift, resulting in rapid failure of the classifier. Hence, the entire model must be retrained after only a short time, at an unacceptable cost in labeled-sample construction and model training. To solve this problem, we deeply study the characteristics of Internet application updates, associate them with feature concept drift, and then propose self-evolving encrypted traffic classification. We propose a feature concept drift determination method and a drift-oriented self-evolving fine-tuning method based on the Laida criterion to adapt to all applications that are likely to be updated. In the case of no exact label samples, the classifier evolves through continuous full fine-tuning, and the time interval between two necessary retrainings is greatly extended so that it can be applied to the actual network environment. Experiments show that our approach significantly improves the classification performance of the original classifier on the datasets of the following months (9% improvement in F1-score) without any hard-to-acquire labeled samples. Under the current experimental environment, the life of the classifier is extended to more than eight months.

Submitted 7 January, 2025; originally announced January 2025.

arXiv:2501.02533 [pdf, other]

Universal classes of disorder scatterings in in-plane anomalous Hall effect

Authors: Guoao Yang, Tao Qin, Jianhui Zhou

Abstract: The in-plane anomalous Hall effect (IPAHE) with planar Hall current and magnetization/magnetic fields in various quantum materials has received increasing attention. Most current efforts are devoted to the intrinsic part due to the Berry curvature of electronic bands; however, how disorder scattering affects the extrinsic part (skew scattering and side jump) remains largely elusive. Here we theoretically investigate the effects of the three universal classes of disorder scatterings (scalar, spin-conserving and spin-flipping) on the IPAHE based on the prototypical two-dimensional massive Dirac fermion model with warping term under generic Zeeman fields. We find that the different disorder scatterings result in distinct dependence of the anomalous Hall conductivity on disorder strength and recover previously known results in some limits. Remarkably, the spin-flipping scattering could give rise to nontrivial contributions featuring sinusoidal oscillation with periods of $π$ and $2π$ to the extrinsic part, in contrast to the standard two-dimensional massive Dirac fermions. Our work unveils the rich features of anomalous transport in planar Hall geometry in the presence of disorder scatterings and provides some useful insights into the magnetotransport phenomena.

Submitted 5 January, 2025; originally announced January 2025.

Comments: 7 pages, 4 figures

arXiv:2501.02308 [pdf]

doi 10.1039/d4nr03682d

Ultrafast Chirality-dependent Dynamics from Helicity-resolved Transient Absorption Spectroscopy

Authors: Xiu Zhang, Lu Zhang, Junzhi Zhu, Tingxiao Qin, Haiyun Huang, Baixu Xiang, Haiyun Liu, Qihua Xiong

Abstract: Chirality, a pervasive phenomenon in nature, is widely studied across diverse fields including the origins of life, chemical catalysis, drug discovery, and physical optoelectronics. The investigations of natural chiral materials have been constrained by their intrinsically weak chiral effects. Recently, significant progress has been made in the fabrication and assembly of low-dimensional micro and nanoscale chiral materials and their architectures, leading to the discovery of novel optoelectronic phenomena such as circularly polarized light emission, spin and charge flip, advocating great potential for applications in quantum information, quantum computing, and biosensing. Despite these advancements, the fundamental mechanisms underlying the generation, propagation, and amplification of chirality in low-dimensional chiral materials and architectures remain largely unexplored. To tackle these challenges, we focus on employing ultrafast spectroscopy to investigate the dynamics of chirality evolution, with the aim of attaining a more profound understanding of the microscopic mechanisms governing chirality generation and amplification. This review thus provides a comprehensive overview of the chiral micro-/nano-materials, including two-dimensional transition metal dichalcogenides (TMDs), chiral halide perovskites, and chiral metasurfaces, with a particular emphasis on the physical mechanism. This review further explores the advancements made by ultrafast chiral spectroscopy research, thereby paving the way for innovative devices in chiral photonics and optoelectronics.

Submitted 26 February, 2025; v1 submitted 4 January, 2025; originally announced January 2025.

Comments: 11 Figures

Journal ref: Nanoscale, 2025, 17(8): 4175-94

arXiv:2412.07778 [pdf, other]

MIN: Multi-channel Interaction Network for Drug-Target Interaction with Protein Distillation

Authors: Shuqi Li, Shufang Xie, Hongda Sun, Yuhan Chen, Tao Qin, Tianjun Ke, Rui Yan

Abstract: Traditional drug discovery processes are both time-consuming and require extensive professional expertise. With the accumulation of drug-target interaction (DTI) data from experimental studies, leveraging modern machine-learning techniques to discern patterns between drugs and target proteins has become increasingly feasible. In this paper, we introduce the Multi-channel Interaction Network (MIN), a novel framework designed to predict DTIs through two primary components: a representation learning module and a multi-channel interaction module. The representation learning module features a C-Score Predictor-assisted screening mechanism, which selects critical residues to enhance prediction accuracy and reduce noise. The multi-channel interaction module incorporates a structure-agnostic channel, a structure-aware channel, and an extended-mixture channel, facilitating the identification of interaction patterns at various levels for optimal complementarity. Additionally, contrastive learning is utilized to harmonize the representations of diverse data types. Our experimental evaluations on public datasets demonstrate that MIN surpasses other strong DTI prediction methods. Furthermore, the case study reveals a high overlap between the residues selected by the C-Score Predictor and those in actual binding pockets, underscoring MIN's explainability capability. These findings affirm that MIN is not only a potent tool for DTI prediction but also offers fresh insights into the prediction of protein binding sites.

Submitted 23 November, 2024; originally announced December 2024.

arXiv:2412.06550 [pdf]

DNA Fragments in Crude Oil Reveals Earth's Hidden History

Authors: Wan-Qian Zhao, Zhan-Yong Guo, Yu-Qi Guo, Mei-Jun Li, Gang-Qiang Cao, Zeng-Yuan Tian, Ran Chai, Li-You Qiu, Jin-Hua Zeng, Xin-Ge Zhang, Tian-Cang Qin, Jin-Yu Yang, Ming-Jie Chen, Mei-Rong Song, Fei Liang, Jun-Hui Geng, Chun-Yan Zhou, Shu-Jie Zhang, Li-Juan Zhao

Abstract: This groundbreaking research extracted DNA from petroleum using nanoparticle affinity bead technology, yielding 3,159,020 petroleum DNA (pDNA) sequences, primarily environmental DNA. While most original in situ DNA (oriDNA) was lost, ancient DNA (aDNA) from petroleum offers an important source of ecological and evolutionary information, surpassing traditional fossils. This study reveals that oil, mainly sourced from algae and lower aquatic plants, now serves as a new type of fossil, providing detailed insights into Earth's hidden history, including unclassified species and ancient events, revolutionizing petroleum geology and paleontology.

Submitted 9 December, 2024; originally announced December 2024.

Comments: 85 pages, 7 Figures, 13 Table

arXiv:2412.06521 [pdf]

Ancient DNA from 120-Million-Year-Old Lycoptera Fossils Reveals Evolutionary Insights

Authors: Wan-Qian Zhao, Zhan-Yong Guo, Zeng-Yuan Tian, Tong-Fu Su, Gang-Qiang Cao, Zi-Xin Qi, Tian-Cang Qin, Wei Zhou, Jin-Yu Yang, Ming-Jie Chen, Xin-Ge Zhang, Chun-Yan Zhou, Chuan-Jia Zhu, Meng-Fei Tang, Di Wu, Mei-Rong Song, Yu-Qi Guo, Li-You Qiu, Fei Liang, Mei-Jun Li, Jun-Hui Geng, Li-Juan Zhao, Shu-Jie Zhang

Abstract: High-quality ancient DNA (aDNA) is essential for molecular paleontology. Due to DNA degradation and contamination by environmental DNA (eDNA), current research is limited to fossils less than 1 million years old. The study successfully extracted DNA from Lycoptera davidi fossils from the Early Cretaceous period, dating to 120 million years ago. Using high-throughput sequencing, 1,258,901 DNA sequences were obtained. We established a rigorous protocol known as the mega screen method. Using this method, we identified 243 original in situ DNA (oriDNA) sequences, likely from the Lycoptera genome. These sequences have an average length of over 100 base pairs and show no signs of deamination. Additionally, 10 transposase coding sequences were discovered, shedding light on a unique self-renewal mechanism in the genome. This study provides valuable DNA data for understanding ancient fish evolution and advances paleontological research.

Submitted 9 December, 2024; originally announced December 2024.

Comments: 14 pages, 3 Figures

arXiv:2412.04619 [pdf, other]

Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization

Authors: Tian Qin, Naomi Saphra, David Alvarez-Melis

Abstract: Language models (LMs), like other neural networks, often favor shortcut heuristics based on surface-level patterns. Although LMs behave like n-gram models early in training, they must eventually learn hierarchical syntactic representations to correctly apply grammatical rules out-of-distribution (OOD). In this work, we use case studies of English grammar to explore how complex, diverse training data drives models to generalize OOD. We construct a framework that unifies our understanding of random variation with training dynamics, rule selection with memorization, and data diversity with complexity. We show that these factors are nuanced, and that intermediate levels of diversity and complexity lead to inconsistent behavior across random seeds and to unstable training dynamics. Our findings emphasize the critical role of training data in shaping generalization patterns and illuminate how competing model strategies lead to inconsistent generalization outcomes across random seeds. Code is available at https://github.com/sunnytqin/concept_comp.git.

Submitted 19 December, 2024; v1 submitted 5 December, 2024; originally announced December 2024.

arXiv:2412.01299 [pdf, other]

Cross-Modal Visual Relocalization in Prior LiDAR Maps Utilizing Intensity Textures

Authors: Qiyuan Shen, Hengwang Zhao, Weihao Yan, Chunxiang Wang, Tong Qin, Ming Yang

Abstract: Cross-modal localization has drawn increasing attention in recent years, while the visual relocalization in prior LiDAR maps is less studied. Related methods usually suffer from inconsistency between the 2D texture and 3D geometry, neglecting the intensity features in the LiDAR point cloud. In this paper, we propose a cross-modal visual relocalization system in prior LiDAR maps utilizing intensity textures, which consists of three main modules: map projection, coarse retrieval, and fine relocalization. In the map projection module, we construct the database of intensity channel map images leveraging the dense characteristic of panoramic projection. The coarse retrieval module retrieves the top-K most similar map images to the query image from the database, and retains the top-K' results by covisibility clustering. The fine relocalization module applies a two-stage 2D-3D association and a covisibility inlier selection method to obtain robust correspondences for 6DoF pose estimation. The experimental results on our self-collected datasets demonstrate the effectiveness in both place recognition and pose estimation tasks.

Submitted 2 December, 2024; originally announced December 2024.

arXiv:2412.00713 [pdf, other]

Many-body multipole indices revealed by the real-space dynamical mean-field theory

Authors: Guoao Yang, Jianhui Zhou, Tao Qin

Abstract: The multipole moments are fundamental properties of insulators, and have attracted considerable attention with the emergence of higher-order topological insulators. Several approaches, including generalizations of the formula for the polarization and the Wilson loop, have been proposed to calculate them in real materials. However, a practical method to explore them in correlated insulators is still lacking. Here, we propose a systematic way, which combines the general Green's function formula for multipoles with the real-space dynamical mean-field theory, to calculate the multipole moments in correlated materials. Our demonstration calculations are consistent with symmetry analysis, and the calculations of the spectral functions further confirm our results. This method opens a new avenue to study the topological phase transitions in correlated multipole insulators and other crucial physical quantities closely related to multipole moments.

Submitted 1 December, 2024; originally announced December 2024.

Comments: 7 pages, 4 figures

arXiv:2411.05278 [pdf, other]

Integrated Location Sensing and Communication for Ultra-Massive MIMO With Hybrid-Field Beam-Squint Effect

Authors: Zhen Gao, Xingyu Zhou, Boyu Ning, Yu Su, Tong Qin, Dusit Niyato

Abstract: The advent of ultra-massive multiple-input multiple-output systems holds great promise for next-generation communications, yet their channels exhibit the hybrid far- and near-field beam-squint (HFBS) effect. In this paper, we not only overcome but also harness the HFBS effect to propose an integrated location sensing and communication (ILSC) framework. During the uplink training stage, user terminals (UTs) transmit reference signals for simultaneous channel estimation and location sensing. This stage leverages an elaborately designed hybrid-field projection matrix to overcome the HFBS effect and estimate the channel in a compressive manner. Subsequently, the scatterers' locations can be sensed from the spherical wavefront based on the channel estimation results. By treating the sensed scatterers as virtual anchors, we employ a weighted least-squares approach to derive the UT's location. Moreover, we propose an iterative refinement mechanism, which utilizes the accurately estimated time difference of arrival of multipath components to enhance location sensing precision. In the following downlink data transmission stage, we leverage the acquired location information to further optimize the hybrid beamformer, which combines beam broadening and focusing to mitigate the spectral efficiency degradation resulting from the HFBS effect. Extensive simulation experiments demonstrate that the proposed ILSC scheme achieves better location sensing and communication performance than conventional methods.

Submitted 7 November, 2024; originally announced November 2024.

Comments: This paper has been accepted by IEEE JSAC

arXiv:2410.24022 [pdf, other]

SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation

Authors: Liang He, Peiran Jin, Yaosen Min, Shufang Xie, Lijun Wu, Tao Qin, Xiaozhuan Liang, Kaiyuan Gao, Yuliang Jiang, Tie-Yan Liu

Abstract: Proteins, essential to biological systems, perform functions intricately linked to their three-dimensional structures. Understanding the relationship between protein structures and their amino acid sequences remains a core challenge in protein modeling. While traditional protein foundation models benefit from pre-training on vast unlabeled datasets, they often struggle to capture critical co-evolutionary information, which evolutionary-based methods excel at. In this study, we introduce a novel pre-training strategy for protein foundation models that emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features from sequence data. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability, outperforming established baselines of similar size, including the ESM model, across diverse downstream tasks. Experimental results confirm the model's effectiveness in integrating co-evolutionary information, marking a significant step forward in protein sequence-based modeling.

Submitted 31 October, 2024; originally announced October 2024.

arXiv:2410.12191 [pdf, other]

Test-time adaptation for image compression with distribution regularization

Authors: Kecheng Chen, Pingping Zhang, Tiexin Qin, Shiqi Wang, Hong Yan, Haoliang Li

Abstract: Current test- or compression-time adaptation image compression (TTA-IC) approaches, which leverage both latent and decoder refinements as a two-step adaptation scheme, have potentially enhanced the rate-distortion (R-D) performance of learned image compression models on cross-domain compression tasks, e.g., from natural to screen content images. However, compared with the emergence of various decoder refinement variants, the latent refinement, as an inseparable ingredient, is barely tailored to cross-domain scenarios. To this end, we aim to develop an advanced latent refinement method by extending the effective hybrid latent refinement (HLR) method, which is designed for in-domain inference improvement but shows noticeable degradation of the rate cost in cross-domain tasks. Specifically, we first provide theoretical analyses, in a cue of marginalization approximation from in- to cross-domain scenarios, to uncover that the vanilla HLR suffers from an underlying mismatch between refined Gaussian conditional and hyperprior distributions, leading to deteriorated joint probability approximation of marginal distribution with increased rate consumption. To remedy this issue, we introduce a simple Bayesian approximation-endowed distribution regularization to encourage learning a better joint probability approximation in a plug-and-play manner. Extensive experiments on six in- and cross-domain datasets demonstrate that our proposed method not only improves the R-D performance compared with other latent refinement counterparts, but also can be flexibly integrated into existing TTA-IC methods with incremental benefits.

Submitted 15 October, 2024; originally announced October 2024.

arXiv:2410.10118 [pdf, other]

Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning

Authors: Yuxuan Ren, Dihan Zheng, Chang Liu, Peiran Jin, Yu Shi, Lin Huang, Jiyan He, Shengjie Luo, Tao Qin, Tie-Yan Liu

Abstract: In recent years, machine learning has demonstrated impressive capability in handling molecular science tasks. To support various molecular properties at scale, machine learning models are trained in the multi-task learning paradigm. Nevertheless, data of different molecular properties are often not aligned: some quantities, e.g. equilibrium structure, demand more cost to compute than others, e.g. energy, so their data are often generated by cheaper computational methods at the cost of lower accuracy, which cannot be directly overcome through multi-task learning. Moreover, it is not straightforward to leverage abundant data of other tasks to benefit a particular task. To handle such data heterogeneity challenges, we exploit the specialty of molecular tasks that there are physical laws connecting them, and design consistency training approaches that allow different tasks to exchange information directly so as to improve one another. Particularly, we demonstrate that the more accurate energy data can improve the accuracy of structure prediction. We also find that consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction, demonstrating a broad capability for integrating heterogeneous data.

Submitted 13 October, 2024; originally announced October 2024.

Comments: Published as a conference paper at NeurIPS 2024

arXiv:2410.02847 [pdf, other]

Deep Signature: Characterization of Large-Scale Molecular Dynamics

Authors: Tiexin Qin, Mengxu Zhu, Chunyang Li, Terry Lyons, Hong Yan, Haoliang Li

Abstract: Understanding protein dynamics is essential for deciphering protein functional mechanisms and developing molecular therapies. However, the complex high-dimensional dynamics and interatomic interactions of biological processes pose a significant challenge for existing computational techniques. In this paper, we approach this problem for the first time by introducing Deep Signature, a novel computationally tractable framework that characterizes complex dynamics and interatomic interactions based on their evolving trajectories. Specifically, our approach incorporates soft spectral clustering that locally aggregates cooperative dynamics to reduce the size of the system, as well as signature transform that collects iterated integrals to provide a global characterization of the non-smooth interactive dynamics. Theoretical analysis demonstrates that Deep Signature exhibits several desirable properties, including invariance to translation, near invariance to rotation, equivariance to permutation of atomic coordinates, and invariance under time reparameterization. Furthermore, experimental results on three benchmarks of biological processes verify that our approach can achieve superior performance compared to baseline methods.

Submitted 14 May, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

Comments: ICLR 2025

arXiv:2409.15604 [pdf, other]

Persona-L has Entered the Chat: Leveraging LLM and Ability-based Framework for Personas of People with Complex Needs

Authors: Lipeipei Sun, Tianzi Qin, Anran Hu, Jiale Zhang, Shuojia Lin, Jianyan Chen, Mona Ali, Mirjana Prpa

Abstract: We present Persona-L, a novel approach for creating personas using Large Language Models (LLMs) and an ability-based framework, specifically designed to improve the representation of users with complex needs. Traditional methods of persona creation often fall short of accurately depicting the dynamic and diverse nature of complex needs, resulting in oversimplified or stereotypical profiles. Persona-L enables users to create and interact with personas through a chat interface. Persona-L was evaluated through interviews with UX designers (N=6), where we examined its effectiveness in reflecting the complexities of lived experiences of people with complex needs. We report our findings that indicate the potential of Persona-L to increase empathy and understanding of complex needs while also revealing the need for transparency of data used in persona creation, the role of the language and tone, and the need to provide a more balanced presentation of abilities with constraints.

Submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.05297 [pdf, other]

Adaptive Offloading and Enhancement for Low-Light Video Analytics on Mobile Devices

Authors: Yuanyi He, Peng Yang, Tian Qin, Jiawei Hou, Ning Zhang

Abstract: In this paper, we explore adaptive offloading and enhancement strategies for video analytics tasks on computing-constrained mobile devices in low-light conditions. We observe that the accuracy of low-light video analytics varies across different enhancement algorithms. The root cause could be the disparities in the effectiveness of enhancement algorithms for feature extraction in analytic models. Specifically, the difference in class activation maps (CAMs) between enhanced and low-light frames demonstrates a positive correlation with video analytics accuracy. Motivated by such observations, a novel enhancement quality assessment method is proposed on CAMs to evaluate the effectiveness of different enhancement algorithms for low-light videos. Then, we design a multi-edge system, which adaptively offloads and enhances low-light video analytics tasks from mobile devices. To achieve the trade-off between the enhancement quality and the latency for all system-served mobile devices, we propose a genetic-based scheduling algorithm, which can find a near-optimal solution in a reasonable time to meet the latency requirement. Thereby, the offloading strategies and the enhancement algorithms are properly selected under the condition of limited end-edge bandwidth and edge computation resources. Simulation experiments demonstrate the superiority of the proposed system, improving accuracy by up to 20.83% compared to existing benchmarks.

Submitted 8 September, 2024; originally announced September 2024.

arXiv:2408.02061 [pdf, other]

ParkingE2E: Camera-based End-to-end Parking Network, from Images to Planning

Authors: Changze Li, Ziheng Ji, Zhe Chen, Tong Qin, Ming Yang

Abstract: Autonomous parking is a crucial task in the intelligent driving field. Traditional parking algorithms are usually implemented using rule-based schemes. However, these methods are less effective in complex parking scenarios due to the intricate design of the algorithms. In contrast, neural-network-based methods tend to be more intuitive and versatile than the rule-based methods. By collecting a large amount of expert parking trajectory data and emulating human strategy via learning-based methods, the parking task can be effectively addressed. In this paper, we employ imitation learning to perform end-to-end planning from RGB images to path planning by imitating human driving trajectories. The proposed end-to-end approach utilizes a target query encoder to fuse images and target features, and a transformer-based decoder to autoregressively predict future waypoints. We conducted extensive experiments in real-world scenarios, and the results demonstrate that the proposed method achieved an average parking success rate of 87.8% across four different real-world garages. Real-vehicle experiments further validate the feasibility and effectiveness of the method proposed in this paper.

Submitted 4 August, 2024; originally announced August 2024.

arXiv:2407.08561 [pdf, other]

MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation Maps

Authors: Hang Wu, Zhenghao Zhang, Siyuan Lin, Xiangru Mu, Qiang Zhao, Ming Yang, Tong Qin

Abstract: Robust localization is the cornerstone of autonomous driving, especially in challenging urban environments where GPS signals suffer from multipath errors. Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. However, building HD map is expensive and challenging to scale up. Given these limitations, leveraging navigation maps has emerged as a promising low-cost alternative for localization. Current approaches based on navigation maps can achieve highly accurate localization, but their complex matching strategies lead to unacceptable inference latency that fails to meet the real-time demands. To address these limitations, we propose a novel transformer-based neural re-localization method. Inspired by image registration, our approach performs a coarse-to-fine neural feature registration between navigation map and visual bird's-eye view features. Our method significantly outperforms the current state-of-the-art OrienterNet on both the nuScenes and Argoverse datasets, which is nearly 10%/20% localization accuracy and 30/16 FPS improvement on single-view and surround-view input settings, separately. We highlight that our research presents an HD-map-free localization method for autonomous driving, offering cost-effective, reliable, and scalable performance in challenging driving environments.

Submitted 11 July, 2024; originally announced July 2024.

Comments: IROS 2024 (Oral)

arXiv:2407.08526 [pdf, other]

BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight

Authors: Hang Wu, Zhenghao Zhang, Siyuan Lin, Tong Qin, Jin Pan, Qiang Zhao, Chunjing Xu, Ming Yang

Abstract: Bird's-eye-view (BEV) representation is crucial for the perception function in autonomous driving tasks. It is difficult to balance the accuracy, efficiency and range of BEV representation. The existing works are restricted to a limited perception range within 50 meters. Extending the BEV representation range can greatly benefit downstream tasks such as topology reasoning, scene understanding, and planning by offering more comprehensive information and reaction time. The Standard-Definition (SD) navigation maps can provide a lightweight representation of road structure topology, characterized by ease of acquisition and low maintenance costs. An intuitive idea is to combine the close-range visual information from onboard cameras with the beyond line-of-sight (BLOS) environmental priors from SD maps to realize expanded perceptual capabilities. In this paper, we propose BLOS-BEV, a novel BEV segmentation model that incorporates SD maps for accurate beyond line-of-sight perception, up to 200m. Our approach is applicable to common BEV architectures and can achieve excellent results by incorporating information derived from SD maps. We explore various feature fusion schemes to effectively integrate the visual BEV representations and semantic features from the SD map, aiming to leverage the complementary information from both sources optimally. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in BEV segmentation on nuScenes and Argoverse benchmark. Through multi-modal inputs, BEV segmentation is significantly enhanced at close ranges below 50m, while also demonstrating superior performance in long-range scenarios, surpassing other methods by over 20% mIoU at distances ranging from 50-200m.

Submitted 11 July, 2024; originally announced July 2024.

Comments: IEEE IV 2024

arXiv:2406.16289 [pdf, other]

doi 10.1109/TITS.2024.3415394

Crowd-Sourced NeRF: Collecting Data from Production Vehicles for 3D Street View Reconstruction

Authors: Tong Qin, Changze Li, Haoyang Ye, Shaowei Wan, Minzhen Li, Hongwei Liu, Ming Yang

Abstract: Recently, Neural Radiance Fields (NeRF) achieved impressive results in novel view synthesis. Block-NeRF showed the capability of leveraging NeRF to build large city-scale models. For large-scale modeling, a mass of image data is necessary. Collecting images from specially designed data-collection vehicles can not support large-scale applications. How to acquire massive high-quality data remains an opening problem. Noting that the automotive industry has a huge amount of image data, crowd-sourcing is a convenient way for large-scale data collection. In this paper, we present a crowd-sourced framework, which utilizes substantial data captured by production vehicles to reconstruct the scene with the NeRF model. This approach solves the key problem of large-scale reconstruction, that is where the data comes from and how to use them. Firstly, the crowd-sourced massive data is filtered to remove redundancy and keep a balanced distribution in terms of time and space. Then a structure-from-motion module is performed to refine camera poses. Finally, images, as well as poses, are used to train the NeRF model in a certain block. We highlight that we present a comprehensive framework that integrates multiple modules, including data selection, sparse 3D reconstruction, sequence appearance embedding, depth supervision of ground surface, and occlusion completion. The complete system is capable of effectively processing and reconstructing high-quality 3D scenes from crowd-sourced data. Extensive quantitative and qualitative experiments were conducted to validate the performance of our system. Moreover, we proposed an application, named first-view navigation, which leveraged the NeRF model to generate 3D street view and guide the driver with a synthesized video.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.15777 [pdf, other]

ISS-Scenario: Scenario-based Testing in CARLA

Authors: Renjue Li, Tianhang Qin, Cas Widdershoven

Abstract: The rapidly evolving field of autonomous driving systems (ADSs) is full of promise. However, in order to fulfil these promises, ADSs need to be safe in all circumstances. This paper introduces ISS-Scenario, an autonomous driving testing framework in the paradigm of scenario-based testing. ISS-Scenario is designed for batch testing, exploration of test cases (e.g., potentially dangerous scenarios), and performance evaluation of autonomous vehicles (AVs). ISS-Scenario includes a diverse simulation scenario library with parametrized design. Furthermore, ISS-Scenario integrates two testing methods within the framework: random sampling and optimized search by means of a genetic algorithm. Finally, ISS-Scenario provides an accident replay feature, saving a log file for each test case which allows developers to replay and dissect scenarios where the ADS showed problematic behavior.

Submitted 22 June, 2024; originally announced June 2024.

Comments: TASE 2024, 8 pages

Showing 1–50 of 318 results for author: Qin, T