Search | arXiv e-print repository

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Authors: Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan

Abstract: Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Project page: https://vlm-mirage.github.io/

arXiv:2506.15662 [pdf, ps, other]

CC-LEARN: Cohort-based Consistency Learning

Authors: Xiao Ye, Shaswat Shrivastava, Zhaonan Li, Jacob Dineen, Shijie Lu, Avneet Ahuja, Ming Shen, Zhikun Xu, Ben Zhou

Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.14888 [pdf]

Time-domain decoding of unconventional charge order mechanisms in nonmagnetic and magnetic kagome metals

Authors: Seongyong Lee, Byungjune Lee, Hoyoung Jang, Xueliang Wu, Jimin Kim, Gyeongbo Kang, Choongjae Won, Hyeongi Choi, Sang-Youn Park, Kyle M. Shen, Federico Cilento, Aifeng Wang, Jae-Hoon Park, Mingu Kang

Abstract: In kagome lattice materials, quantum interplay between charge, spin, orbital, and lattice degrees of freedom gives rise to a remarkably rich set of emergent phenomena, ranging from unconventional charge order and superconductivity to topological magnetism. While the exact nature of these exotic orders is often challenging to comprehend in static experiments, time-resolved techniques can offer critical insights by disentangling coupled degrees of freedom on the time-axis. In this work, we demonstrate that the nature of charge orders in two representative kagome metals - nonmagnetic ScV6Sn6 and magnetic FeGe - which has been highly controversial in static studies, can be directly deciphered in the time-domain through their fundamentally distinct order parameter dynamics measured via time-resolved X-ray scattering at an X-ray free electron laser. In nonmagnetic ScV6Sn6, the dynamics are characterized by ultrafast melting and coherent amplitudon oscillations, typical of a phonon-coupled charge order. In stark contrast, magnetic FeGe exhibits resilient metastable charge order dynamics, hitherto unobserved in any other charge-ordered system - this unique time-domain behavior directly signifies an unconventional magnetism-interlocked charge order state realized in this kagome magnet. Our results not only provide a model case where unconventional nature of electronic order, hidden in equilibrium, is directly unraveled in the time-domain, but also pave the way for future out-of-equilibrium engineering of novel quantum orders in kagome lattice platforms.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 4 figures

arXiv:2506.14808 [pdf, ps, other]

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models

Authors: Jenny Schmalfuss, Nadine Chang, Vibashan VS, Maying Shen, Andres Bruhn, Jose M. Alvarez

Abstract: Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this instability to varying prompts. We therefore investigate which prompt variations VLMs are most sensitive to and which VLMs are most agnostic to prompt variations. To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. Regarding prompt variations, PARC's evaluation shows that VLMs mirror LLM language prompt sensitivity in the vision domain, and most destructive variations change the expected answer. Regarding models, outstandingly robust VLMs among 22 evaluated models come from the InternVL2 family. We further find indications that prompt sensitivity is linked to training data. The code will be at https://github.com/NVlabs/PARC.

Submitted 3 June, 2025; originally announced June 2025.

Comments: Accepted to CVPR 2025

arXiv:2506.13502 [pdf, ps, other]

BOW: Bottlenecked Next Word Exploration

Authors: Ming Shen, Zhikun Xu, Xiao Ye, Jacob Dineen, Ben Zhou

Abstract: Large language models (LLMs) are typically trained via next-word prediction (NWP), which provides strong surface-level fluency but often lacks support for robust reasoning. We propose BOttlenecked next Word exploration (BOW), a novel RL framework that rethinks NWP by introducing a reasoning bottleneck where a policy model first generates a reasoning path rather than predicting the next token directly, after which a frozen judge model predicts the next token distribution based solely on this reasoning path. We train the policy model using GRPO with rewards that quantify how effectively the reasoning path facilitates next-word recovery. Compared with other continual pretraining baselines, we show that BOW improves both the general and next-word reasoning capabilities of the base model, evaluated on various benchmarks. Our findings show that BOW can serve as an effective and scalable alternative to vanilla NWP.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.12198 [pdf, ps, other]

ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Authors: Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal

Abstract: Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, ViSTA. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ a Visual Question Answering-based metric TIFA to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model is not only consistent across different frames, but also well-aligned with the narrative text descriptions.

Submitted 13 June, 2025; originally announced June 2025.

arXiv:2506.08123 [pdf, ps, other]

QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Authors: Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

Abstract: Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.06664 [pdf, ps, other]

Generalized Trajectory Scoring for End-to-end Multimodal Planning

Authors: Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, Jose M. Alvarez

Abstract: End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approaches face significant limitations in generalization. Static vocabularies provide effective coarse discretization but struggle to make fine-grained adaptation, while dynamic proposals offer detailed precision but fail to capture broader trajectory distributions. To overcome these challenges, we propose GTRS (Generalized Trajectory Scoring), a unified framework for end-to-end multi-modal planning that combines coarse and fine-grained trajectory evaluation. GTRS consists of three complementary innovations: (1) a diffusion-based trajectory generator that produces diverse fine-grained proposals; (2) a vocabulary generalization technique that trains a scorer on super-dense trajectory sets with dropout regularization, enabling its robust inference on smaller subsets; and (3) a sensor augmentation strategy that enhances out-of-domain generalization while incorporating refinement training for critical trajectory discrimination. As the winning solution of the Navsim v2 Challenge, GTRS demonstrates superior performance even with sub-optimal sensor inputs, approaching privileged methods that rely on ground-truth perception. Code will be available at https://github.com/NVlabs/GTRS.

Submitted 7 June, 2025; originally announced June 2025.

Comments: The 1st place solution of the End-to-end Driving Track at the CVPR 2025 Autonomous Grand Challenge

arXiv:2506.03065 [pdf, ps, other]

Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen

Abstract: While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. In addition, 3-6% of attention heads can be skipped. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
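
As a rough illustration of the three sparsity patterns named in this abstract, the sketch below builds boolean attention masks for a diagonal band, repeated diagonal bands, and vertical stripes, and reports what fraction of the attention matrix each one keeps. The abstract does not define the patterns precisely, so the function names, the `width`, `stride`, and `columns` parameters, and the exact mask shapes are all assumptions made for illustration, not the paper's kernels.

```python
def diagonal_mask(n, width):
    """Keep entries within `width` of the main diagonal (assumed shape)."""
    return [[abs(i - j) <= width for j in range(n)] for i in range(n)]


def multi_diagonal_mask(n, stride, width):
    """Keep bands repeated every `stride` positions off the diagonal (assumed shape)."""
    def near_multiple(offset):
        r = offset % stride  # Python's % is non-negative for a positive stride
        return min(r, stride - r) <= width
    return [[near_multiple(i - j) for j in range(n)] for i in range(n)]


def vertical_stripe_mask(n, columns):
    """Keep a fixed set of key columns for every query row (assumed shape)."""
    cols = set(columns)
    return [[j in cols for j in range(n)] for _ in range(n)]


def density(mask):
    """Fraction of attention entries that the mask keeps."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


if __name__ == "__main__":
    n = 8
    print("diagonal:", density(diagonal_mask(n, width=1)))
    print("multi-diagonal:", density(multi_diagonal_mask(n, stride=4, width=0)))
    print("vertical-stripe:", density(vertical_stripe_mask(n, columns=[0, 1])))
```

A lower density means fewer query-key products to compute, which is the source of the theoretical FLOP reduction; the realized speedup also depends on the kernel implementation.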

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2505.23604 [pdf, ps, other]

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Authors: Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan

Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeatedly sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.
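
The contrast this abstract draws between best-of-N sampling and iterative selection-and-mutation can be sketched on a toy problem. In the sketch below, candidates are plain numbers, `score` is a hypothetical stand-in for a verifier, and mutation is Gaussian noise; none of these come from the paper, and the sketch does not model the RL-trained self-evolution that the abstract says replaces the external verifier at inference time.

```python
import random


def score(x):
    """Hypothetical verifier: higher is better, with the optimum at x = 3.0."""
    return -(x - 3.0) ** 2


def best_of_n(rng, n):
    """Baseline test-time scaling: draw n independent candidates, keep the best."""
    candidates = [rng.uniform(-10.0, 10.0) for _ in range(n)]
    return max(candidates, key=score)


def evolve(rng, population_size, iterations, keep, sigma=0.5):
    """Toy evolutionary loop: keep the top `keep` candidates, refill by mutating them.

    Returns the best candidate and the best score after each iteration
    (index 0 is the initial population).
    """
    population = [rng.uniform(-10.0, 10.0) for _ in range(population_size)]
    history = [max(score(x) for x in population)]
    for _ in range(iterations):
        # Selection: survivors are carried over unchanged (elitism).
        parents = sorted(population, key=score, reverse=True)[:keep]
        # Mutation: each child is a perturbed copy of a randomly chosen parent.
        children = [
            rng.choice(parents) + rng.gauss(0.0, sigma)
            for _ in range(population_size - keep)
        ]
        population = parents + children
        history.append(max(score(x) for x in population))
    return max(population, key=score), history


if __name__ == "__main__":
    rng = random.Random(0)
    best, history = evolve(rng, population_size=20, iterations=10, keep=5)
    print("evolved best:", best, "score:", score(best))
    print("best-of-20:", best_of_n(random.Random(0), 20))
```

Because the survivors are carried over unchanged, the best score in the population can never decrease from one iteration to the next, which is how the loop concentrates samples in higher-scoring regions.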

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.22477 [pdf]

Human-Centered Human-AI Collaboration (HCHAC)

Authors: Qi Gao, Wei Xu, Hanxi Pan, Mowei Shen, Zaifeng Gao

Abstract: In the intelligent era, the interaction between humans and intelligent systems fundamentally involves collaboration with autonomous intelligent agents. Human-AI Collaboration (HAC) represents a novel type of human-machine relationship facilitated by autonomous intelligent machines equipped with AI technologies. In this paradigm, AI agents serve not only as auxiliary tools but also as active teammates, partnering with humans to accomplish tasks collaboratively. Human-centered AI (HCAI) emphasizes that humans play critical leadership roles in the collaboration. This human-led collaboration imparts new dimensions to the human-machine relationship, necessitating innovative research perspectives, paradigms, and agenda to address the unique challenges posed by HAC. This chapter delves into the essence of HAC from the human-centered perspective, outlining its core concepts and distinguishing features. It reviews the current research methodologies and research agenda within the HAC field from the HCAI perspective, highlighting advancements and ongoing studies. Furthermore, a framework for human-centered HAC (HCHAC) is proposed by integrating these reviews and analyses. A case study of HAC in the context of autonomous vehicles is provided, illustrating practical applications and the synergistic interactions between humans and AI agents. Finally, it identifies potential future research directions aimed at enhancing the effectiveness, reliability, and ethical integration of human-centered HAC systems in diverse domains.

Submitted 28 May, 2025; originally announced May 2025.

Comments: This article is a chapter from the upcoming book Handbook of Human-Centered Artificial Intelligence

arXiv:2505.19482 [pdf, ps, other]

Language of Network: A Generative Pre-trained Model for Encrypted Traffic Comprehension

Authors: Di Zhao, Bo Jiang, Song Liu, Susu Cui, Meng Shen, Dongqi Han, Xingmao Guan, Zhigang Lu

Abstract: The increasing demand for privacy protection and security considerations leads to a significant rise in the proportion of encrypted network traffic. Since traffic content becomes unrecognizable after encryption, accurate analysis is challenging, making it difficult to classify applications and detect attacks. Deep learning is currently the predominant approach for encrypted traffic classification through feature analysis. However, these methods face limitations due to their high dependence on labeled data and difficulties in detecting attack variants. First, their performance is highly sensitive to data quality, where the high-cost manual labeling process and dataset imbalance significantly degrade results. Second, the rapid evolution of attack patterns makes it challenging for models to identify new types of attacks. To tackle these challenges, we present GBC, a generative model based on pre-training for encrypted traffic comprehension. Since traditional tokenization methods are primarily designed for natural language, we propose a protocol-aware tokenization approach for encrypted traffic that improves model comprehension of fields specific to network traffic. In addition, GBC employs pretraining to learn general representations from extensive unlabeled traffic data. Through prompt learning, it effectively adapts to various downstream tasks, enabling both high-quality traffic generation and effective detection. Evaluations across multiple datasets demonstrate that GBC achieves superior results in both traffic classification and generation tasks, resulting in a 5% improvement in F1 score compared to state-of-the-art methods for classification tasks.

Submitted 26 May, 2025; originally announced May 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2505.16086 [pdf, other]

Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Authors: Ming Shen, Raphael Shu, Anurag Pratik, James Gung, Yubin Ge, Monica Sunkara, Yi Zhang

Abstract: We have seen remarkable progress in large language models (LLMs) empowered multi-agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompts optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.15989 [pdf, other]

AI-Assisted NLOS Sensing for RIS-Based Indoor Localization in Smart Factories

Authors: Taofeek A. O. Yusuf, Sigurd S. Petersen, Puchu Li, Jian Ren, Placido Mursia, Vincenzo Sciancalepore, Xavier Costa Pérez, Gilberto Berardinelli, Ming Shen

Abstract: In the era of Industry 4.0, precise indoor localization is vital for automation and efficiency in smart factories. Reconfigurable Intelligent Surfaces (RIS) are emerging as key enablers in 6G networks for joint sensing and communication. However, RIS faces significant challenges in Non-Line-of-Sight (NLOS) and multipath propagation, particularly in localization scenarios, where detecting NLOS conditions is crucial for ensuring not only reliable results and increased connectivity but also the safety of smart factory personnel. This study introduces an AI-assisted framework employing a Convolutional Neural Network (CNN) customized for accurate Line-of-Sight (LOS) and Non-Line-of-Sight (NLOS) classification to enhance RIS-based localization using measured, synthetic, mixed-measured, and mixed-synthetic experimental data, that is, original, augmented, slightly noisy, and highly noisy data, respectively. Validated through such data from three different environments, the proposed customized-CNN (cCNN) model achieves 95.0%-99.0% accuracy, outperforming standard pre-trained models like Visual Geometry Group 16 (VGG-16) with an accuracy of 85.5%-88.0%. By addressing RIS limitations in NLOS scenarios, this framework offers scalable and high-precision localization solutions for 6G-enabled smart factories.

Submitted 21 May, 2025; originally announced May 2025.

Comments: Accepted 7 pages Paper for VTCSpring2025 Conference

arXiv:2505.15403 [pdf, ps, other]

RIS Beam Calibration for ISAC Systems: Modeling and Performance Analysis

Authors: Mengting Li, Hui Chen, Sigurd Sandor Petersen, Huiping Huang, Alireza Pourafzal, Yu Ge, Ming Shen, Henk Wymeersch

Abstract: High-accuracy localization is a key enabler for integrated sensing and communication (ISAC), playing an essential role in various applications such as autonomous driving. Antenna arrays and reconfigurable intelligent surfaces (RIS) are incorporated into these systems to achieve high angular resolution, assisting in the localization process. However, array and RIS beam patterns in practice often deviate from the idealized models used for algorithm design, leading to significant degradation in positioning accuracy. This mismatch highlights the need for beam calibration to bridge the gap between theoretical models and real-world hardware behavior. In this paper, we present and analyze three beam models considering several key non-idealities such as mutual coupling, non-ideal codebook, and measurement uncertainties. Based on the models, we then develop calibration algorithms to estimate the model parameters that can be used for future localization tasks. This work evaluates the effectiveness of the beam models and the calibration algorithms using both theoretical bounds and real-world beam pattern data from an RIS prototype. The simulation results show that the model incorporating combined impacts can accurately reconstruct measured beam patterns. This highlights the necessity of realistic beam modeling and calibration to achieve high-accuracy localization.

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.15034 [pdf, ps, other]

RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi

Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.

Submitted 20 May, 2025; originally announced May 2025.

Comments: Tech report. The first two authors contributed equally

arXiv:2505.09892 [pdf, other]

Correlating Account on Ethereum Mixing Service via Domain-Invariant feature learning

Authors: Zheng Che, Taoyu Li, Meng Shen, Hanbiao Du, Liehuang Zhu

Abstract: The untraceability of transactions facilitated by Ethereum mixing services like Tornado Cash poses significant challenges to blockchain security and financial regulation. Existing methods for correlating mixing accounts suffer from limited labeled data and vulnerability to noisy annotations, which restrict their practical applicability. In this paper, we propose StealthLink, a novel framework that addresses these limitations through cross-task domain-invariant feature learning. Our key innovation lies in transferring knowledge from the well-studied domain of blockchain anomaly detection to the data-scarce task of mixing transaction tracing. Specifically, we design a MixFusion module that constructs and encodes mixing subgraphs to capture local transactional patterns, while introducing a knowledge transfer mechanism that aligns discriminative features across domains through adversarial discrepancy minimization. This dual approach enables robust feature learning under label scarcity and distribution shifts. Extensive experiments on real-world mixing transaction datasets demonstrate that StealthLink achieves state-of-the-art performance, with a 96.98% F1-score in 10-shot learning scenarios. Notably, our framework generalizes better under imbalanced data conditions than conventional supervised methods. This work establishes the first systematic approach for cross-domain knowledge transfer in blockchain forensics, providing a practical solution for combating privacy-enhanced financial crimes in decentralized ecosystems.

Submitted 14 May, 2025; originally announced May 2025.

Comments: Cryptocurrency, Ethereum, mixing services, GNN

arXiv:2505.07569 [pdf, ps, other]

Melting of Charge Density Waves in Low Dimensions

Authors: Jeremy M. Shen, Alex Stangel, Suk Hyun Sung, Ismail El Baggari, Kai Sun, Robert Hovden

Abstract: Charge density waves (CDWs) are collective electronic states that can reshape and melt, even while confined within a rigid atomic crystal. In two dimensions, melting is predicted to be distinct, proceeding through partially ordered nematic and hexatic states that are neither liquid nor crystal. Here we measure and explain how continuous, hexatic melting of incommensurate CDWs occurs in low-dimensional materials. As a CDW is thermally excited, disorder emerges progressively–initially through smooth elastic deformations that modulate the local wavelength, and subsequently via the nucleation of topological defects. Experimentally, we track three hallmark signatures of CDW melting–azimuthal superlattice peak broadening, wavevector contraction, and integrated intensity decay.

Submitted 12 May, 2025; originally announced May 2025.

Comments: 18 pages, 7 figures (includes supplemental)

arXiv:2505.03531 [pdf, ps, other]

Faster MoE LLM Inference for Extremely Large Models

Authors: Haoqi Yang, Luohe Shi, Qiwei Li, Zuchao Li, Ping Wang, Bo Du, Mengjia Shen, Hai Zhao

Abstract: Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of DeepSeek Models, fine-grained MoE models are gaining popularity, yet research on them remains limited. Therefore, we want to discuss the efficiency dynamic under different service loads. Additionally, fine-grained models allow deployers to reduce the number of routed experts, both activated counts and total counts, raising the question of how this reduction affects the trade-off between MoE efficiency and performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can lead to substantial efficiency improvements in certain scenarios, with only minor performance degradation. Reducing the total number of experts provides limited efficiency gains but results in severe performance degradation. Our method can increase throughput by at least 10% without any performance degradation. Overall, we conclude that MoE inference optimization remains an area with substantial potential for exploration and improvement.

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.02927 [pdf, ps, other]

The Physics of Local Optimization in Complex Disordered Systems

Authors: Mutian Shen, Gerardo Ortiz, Zhiqiao Dong, Martin Weigel, Zohar Nussinov

Abstract: Limited resources motivate decomposing large-scale problems into smaller, "local" subsystems and stitching together the so-found solutions. We explore the physics underlying this approach and discuss the concept of "local hardness", i.e., complexity from the local solver perspective, in determining the ground states of both P- and NP-hard spin-glasses and related systems. Depending on the model considered, we observe varying scaling behaviors in how errors associated with local predictions decay as a function of the size of the solved subsystem. These errors stem from global critical threshold instabilities, characterized by gapless, avalanche-like excitations that follow scale-invariant size distributions. Away from criticality, local solvers quickly achieve high accuracy, aligning closely with the results of the more computationally intensive global minimization. These findings shed light on how Nature may operate solely through local actions at her disposal.

Submitted 2 June, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

Comments: 8+14 pages, 8+16 figures. Add two figures (S4, S5)

arXiv:2505.02024 [pdf, other]

From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent

Authors: Minjie Shen, Qikai Yang

Abstract: Manus AI is a general-purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, Manus is designed to bridge the gap between "mind" and "hand" - combining the reasoning and planning capabilities of large language models with the ability to execute complex, end-to-end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high-level intentions into real-world actions, heralding a new era of human-AI collaboration.

Submitted 4 May, 2025; originally announced May 2025.

arXiv:2504.17457 [pdf, other]

Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks

Authors: Zhiying Li, Yeying Jin, Fan Shen, Zhi Liu, Weibin Chen, Pengju Zhang, Xiaomei Zhang, Boyu Chen, Michael Shen, Kejian Wu, Zhaoxin Fan, Jin Dong

Abstract: Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the Tangible Attack (TBA), a novel framework designed to generate adversarial examples capable of effectively compromising any digital human generation model. Our approach introduces a Dual Heterogeneous Noise Generator (DHNG), which leverages Variational Autoencoders (VAE) and ControlNet to produce diverse, targeted noise tailored to the original image features. Additionally, we design a custom adversarial loss function to optimize the noise, ensuring both high controllability and potent disruption. By iteratively refining the adversarial sample through multi-gradient signals from both the noise and the state-of-the-art EHPS model, TBA substantially improves the effectiveness of adversarial attacks. Extensive experiments demonstrate TBA's superiority, achieving a remarkable 41.0% increase in estimation error, with an average improvement of approximately 17.0%. These findings expose significant security vulnerabilities in current EHPS models and highlight the need for stronger defenses in digital human generation systems.

Submitted 24 April, 2025; originally announced April 2025.

Comments: 14 pages, 7 figures

arXiv:2504.04471 [pdf, other]

VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

Authors: Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, Kaiwen Zhou

Abstract: Long video understanding has emerged as an increasingly important yet challenging task in computer vision. Agent-based approaches are gaining popularity for processing long videos, as they can handle extended sequences and integrate various tools to capture fine-grained information. However, existing methods still face several challenges: (1) they often rely solely on the reasoning ability of large language models (LLMs) without dedicated mechanisms to enhance reasoning in long video scenarios; and (2) they remain vulnerable to errors or noise from external tools. To address these issues, we propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our proposed CoT with plan-adjust mode enables the LLM to incrementally plan and adapt its information-gathering strategy. We further incorporate heuristic uncertainty estimation of both the LLM and external tools to guide the CoT process. This allows the LLM to assess the reliability of newly collected information, refine its collection strategy, and make more robust decisions when synthesizing final answers. Empirical experiments show that our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design. Evaluation on three dedicated long video benchmarks (and their subsets) demonstrates that VideoAgent2 outperforms the previous state-of-the-art agent-based method, VideoAgent, by an average of 13.1% and achieves leading performance among all zero-shot approaches.

Submitted 6 April, 2025; originally announced April 2025.

arXiv:2504.04066 [pdf, other]

Performance Analysis of Deep Learning Models for Femur Segmentation in MRI Scan

Authors: Mengyuan Liu, Yixiao Chen, Anning Tian, Xinmeng Wu, Mozhi Shen, Tianchou Gong, Jeongkyu Lee

Abstract: Convolutional neural networks like U-Net excel in medical image segmentation, while attention mechanisms and KAN enhance feature extraction. Meta's SAM 2 uses Vision Transformers for prompt-based segmentation without fine-tuning. However, biases in these models impact generalization with limited data. In this study, we systematically evaluate and compare the performance of three CNN-based models, i.e., U-Net, Attention U-Net, and U-KAN, and one transformer-based model, i.e., SAM 2, for segmenting femur bone structures in MRI scans. The dataset comprises 11,164 MRI scans with detailed annotations of femoral regions. Performance is assessed using the Dice Similarity Coefficient, which ranges from 0.932 to 0.954. Attention U-Net achieves the highest overall scores, while U-KAN demonstrates superior performance in anatomical regions with a smaller region of interest, leveraging its enhanced learning capacity to improve segmentation accuracy.

Submitted 5 April, 2025; originally announced April 2025.
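
The Dice Similarity Coefficient named in the abstract above is the standard overlap metric 2|A∩B| / (|A| + |B|) between a predicted mask A and a reference mask B. The sketch below illustrates that definition only; it is not the authors' evaluation code. The function name `dice_coefficient`, the flat 0/1 mask representation, and the convention of returning 1.0 for two empty masks are all assumptions made for the example.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two flat binary masks (sequences of 0/1).

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    pred = [bool(p) for p in pred]
    target = [bool(t) for t in target]
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # |A ∩ B|: pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # |A| + |B|: total foreground pixels across the two masks.
    total = sum(pred) + sum(target)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_coefficient(predicted, reference))
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixels, so the reported range of 0.932 to 0.954 indicates close overlap for every model compared.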

arXiv:2504.03559 [pdf, other]

Constraints on dark matter boosted by supernova shock within the effective field theory framework from the CDEX-10 experiment

Authors: J. Z. Wang, L. T. Yang, Q. Yue, K. J. Kang, Y. J. Li, H. P. An, Greeshma C., J. P. Chang, H. Chen, Y. H. Chen, J. P. Cheng, W. H. Dai, Z. Deng, C. H. Fang, X. P. Geng, H. Gong, Q. J. Guo, T. Guo, X. Y. Guo, L. He, J. R. He, H. X. Huang, T. C. Huang, S. Karmakar, H. B. Li, et al. (62 additional authors not shown)

Abstract: Supernova shocks can boost dark matter (DM) particles to high, yet nonrelativistic, velocities, providing a suitable mechanism for analysis within the framework of the nonrelativistic effective field theory (NREFT). These accelerated DM sources extend the experimental ability to scan the parameter space of light DM into the sub-GeV region. In this study, we specifically analyze DM accelerated by the Monogem Ring supernova remnant, whose age ($\sim 68000$ yr) and distance to Earth ($\sim 300$ parsecs) are strategically matched to enable detection with current terrestrial detectors. Utilizing the 205.4 kg$\cdot$day data obtained from the CDEX-10 experiment at the China Jinping Underground Laboratory (CJPL), we derive new constraints on boosted DM within the NREFT framework. The NREFT coupling constant exclusion regions now penetrate the sub-GeV mass range, with optimal sensitivity achieved for operators $\mathcal{O}_{3}$, $\mathcal{O}_{6}$, $\mathcal{O}_{15}$ in the 0.4--0.6 GeV mass range.

Submitted 4 April, 2025; originally announced April 2025.

Comments: 9 pages, 5 figures

arXiv:2504.02168 [pdf, other]

MDP: Multidimensional Vision Model Pruning with Latency Constraint

Authors: Xinglong Sun, Barath Lakshmanan, Maying Shen, Shiyi Lan, Jingde Chen, Jose M. Alvarez

Abstract: Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.

Submitted 2 April, 2025; originally announced April 2025.

Comments: Accepted at CVPR 2025

arXiv:2503.23778 [pdf, other]

doi 10.1021/acsami.5c00619

Efficient defect healing of single-walled carbon nanotubes through $\mathrm{C}_{2}\mathrm{H}_{2}$-assisted multiple-cycle treatment with air exposure

Authors: Man Shen, Taiki Inoue, Mengyue Wang, Yuanjia Liu, Yoshihiro Kobayashi

Abstract: Defects in single-walled carbon nanotubes (SWCNTs) degrade their mechanical, electrical, and thermal properties, limiting their potential applications. To realize the diverse applications of SWCNTs, it is essential to enhance their crystallinity through effective defect healing. However, traditional thermal treatments typically require temperatures above 1800°C, which can alter the nanotube structure. Previously, defect healing of SWCNTs was achieved at a relatively low temperature of 1100°C, using C$_{2}$H$_{2}$ assistance, but the efficiency was limited. In this study, we developed a C$_{2}$H$_{2}$-assisted multiple-cycle process at an even lower temperature of 1000°C combined with air exposure, achieving highly efficient defect healing while preserving the nanotube structure. The combination of multiple-cycle treatment and air exposure between cycles was found to promote defect activation, suppress the formation of amorphous carbon, and enhance the effectiveness of defect healing. Additionally, we successfully healed commercially available bulk-scale SWCNTs (super-growth SWCNTs), noting that their healing behavior differed from lab-grown SWCNTs with smaller diameters synthesized from nanodiamond. The efficient and structure-preserving healing process developed in this study broadens the potential applications of high-quality SWCNTs, including flexible electronics, high-performance composites, and energy storage devices.

Submitted 31 March, 2025; originally announced March 2025.

Comments: submitted version

Journal ref: ACS Appl. Mater. Interfaces 2025

arXiv:2503.17793 [pdf, other]

Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Authors: Codefuse, Ling Team, Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, et al. (8 additional authors not shown)

Abstract: Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at https://huggingface.co/inclusionAI/Ling-Coder-lite.

Submitted 22 March, 2025; originally announced March 2025.

Comments: 20 pages, 6 figures

ACM Class: I.2.7

arXiv:2503.17174 [pdf, other]

How to Promote Autonomous Driving with Evolving Technology: Business Strategy and Pricing Decision

Authors: Mingliang Li, Yanrong Li, Lai Wei, Wei Jiang, Zuo-Jun Max Shen

Abstract: Recently, autonomous driving system (ADS) has been widely adopted due to its potential to enhance travel convenience and alleviate traffic congestion, thereby improving the driving experience for consumers and creating lucrative opportunities for manufacturers. With the advancement of data sensing and control technologies, the reliability of ADS and the purchase intentions of consumers are continually evolving, presenting challenges for manufacturers in promotion and pricing decisions. To address this issue, we develop a two-stage game-theoretical model to characterize the decision-making processes of manufacturers and consumers before and after a technology upgrade. Considering the unique structural characteristics of ADS, which consists of driving software and its supporting hardware (SSH), we propose different business strategies for SSH (bundle or unbundle with the vehicle) and driving software (perpetual licensing or subscription) from the manufacturer's perspective. We find that, first, SSH strategies influence the optimal software strategies by changing the consumers' entry barriers to the ADS market. Specifically, for manufacturers with mature ADS technology, the bundle strategy provides consumers with a lower entry barrier by integrating SSH, making the flexible subscription model a dominant strategy; while perpetual licensing outperforms under the unbundle strategy. Second, the software strategies influence the optimal SSH strategy by altering consumers' exit barriers. Perpetual licensing imposes higher exit barriers; when combined with a bundle strategy that lowers entry barriers, it becomes a more advantageous choice for manufacturers with mature ADS technology. In contrast, the subscription strategy allows consumers to easily exit the market, making the bundle strategy advantageous only when a substantial proportion of consumers are compatible with ADS.

Submitted 21 March, 2025; originally announced March 2025.

arXiv:2503.15918 [pdf, other]

Denoising-based Contractive Imitation Learning

Authors: Macheng Shen, Jishen Peng, Zefang Huang

Abstract: A fundamental challenge in imitation learning is the covariate shift problem. Existing methods to mitigate covariate shift often require additional expert interactions, access to environment dynamics, or complex adversarial training, which may not be practical in real-world applications. In this paper, we propose a simple yet effective method (DeCIL) to mitigate covariate shift by incorporating a denoising mechanism that enhances the contraction properties of the state transition mapping. Our approach involves training two neural networks: a dynamics model (f) that predicts the next state from the current state, and a joint state-action denoising policy network (d) that refines this state prediction via denoising and outputs the corresponding action. We provide theoretical analysis showing that the denoising network acts as a local contraction mapping, reducing the error propagation of the state transition and improving stability. Our method is straightforward to implement and can be easily integrated with existing imitation learning frameworks without requiring additional expert data or complex modifications to the training procedure. Empirical results demonstrate that our approach effectively improves the success rate of various imitation learning tasks under noise perturbation.

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2503.06730 [pdf, other]

Adaptive Test-Time Intervention for Concept Bottleneck Models

Authors: Matthew Shen, Aliyah Hsu, Abhineet Agarwal, Bin Yu

Abstract: Concept bottleneck models (CBM) aim to improve model interpretability by predicting human level "concepts" in a bottleneck within a deep learning model architecture. However, how the predicted concepts are used in predicting the target still either remains black-box or is simplified to maintain interpretability at the cost of prediction performance. We propose to use Fast Interpretable Greedy Sum-Trees (FIGS) to obtain Binary Distillation (BD). This new method, called FIGS-BD, distills a binary-augmented concept-to-target portion of the CBM into an interpretable tree-based model, while maintaining the competitive prediction performance of the CBM teacher. FIGS-BD can be used in downstream tasks to explain and decompose CBM predictions into interpretable binary-concept-interaction attributions and guide adaptive test-time intervention. Across 4 datasets, we demonstrate that our adaptive test-time intervention identifies key concepts that significantly improve performance for realistic human-in-the-loop settings that only allow for limited concept interventions.

Submitted 14 April, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

arXiv:2502.17806 [pdf, other]

doi 10.3847/1538-4357/adadf7

Radial dependence of ion fluences in the 2023 July 17 SEP event from Parker Solar Probe to STEREO and ACE

Authors: G. D. Muro, C. M. S. Cohen, Z. Xu, R. A. Leske, E. R. Christian, A. C. Cummings, G. De Nolfo, M. I. Desai, F. Fraschetti, J. Giacalone, A. Labrador, D. J. McComas, J. G. Mitchell, D. G. Mitchell, J. Rankin, N. A. Schwadron, M. Shen, M. E. Wiedenbeck, S. D. Bale, O. Romeo, A. Vourlidas

Abstract: In the latter moments of 17 July 2023, the solar active region 13363, near the southwestern face of the Sun, was undergoing considerable evolution, which resulted in a significant solar energetic particle (SEP) event measured by Parker Solar Probe's Integrated Science Investigation of the Sun (ISOIS) and near-Earth spacecraft. Remote observations from GOES and CHASE captured two M5.0+ solar flares that peaked at 23:34 and 00:06 UT from the source region. In tandem, STEREO COR2 first recorded a small, narrow coronal mass ejection (CME) emerging at 22:54 UT and then saw a major halo CME emerge at 23:43 UT with a bright, rapidly expanding core and CME-driven magnetic shock with an estimated speed of $\sim$1400 km s$^{-1}$. Parker Solar Probe was positioned at 0.65 au, near-perfectly on the nominal Parker spiral magnetic field line which connected Earth and the active region for a 537 km s$^{-1}$ ambient solar wind speed at L1. This fortuitous alignment provided the opportunity to examine how the SEP velocity dispersion, energy spectra, elemental composition, and fluence varied from 0.65 to 1 au along a shared magnetic connection to the Sun. We find a strong radial gradient, which is best characterized for H and He as $r^{-4.0}$ and most surprisingly is stronger for O and Fe which is better described by $r^{-5.7}$.

Submitted 24 February, 2025; originally announced February 2025.

Comments: The Astrophysical Journal: 10 pages, 13 figures

arXiv:2502.16084 [pdf, other]

Single Inclusive $π^\pm$ and $K^\pm$ Production in $e^+e^-$ Annihilation at center-of-mass Energies from 2.000 to 3.671GeV

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (707 additional authors not shown)

Abstract: Using data samples with a total integrated luminosity of 253 $\rm pb^{-1}$ collected by the BESIII detector operating at the BEPCII collider, the differential cross-sections of inclusive $π^\pm$ and $K^\pm$ production, as a function of momentum and normalized by the total hadronic cross-section, are measured at center-of-mass energies from 2.000 to 3.671 GeV. The measured $π^{\pm}$ cross sections are consistent with the previously reported $π^{0}$ cross-sections by BESIII, while the $K^{\pm}$ cross sections are systematically higher than the $K^0_S$ cross sections by a factor of approximately 1.4. These new results are in agreement with state-of-the-art QCD analyses at next-to-next-to-leading order accuracy, particularly in the large hadron momentum region at energy scales down to 3 GeV. These findings support the validity of isospin symmetry in parton fragmentation processes.

Submitted 22 February, 2025; originally announced February 2025.

arXiv:2502.15664 [pdf, ps, other]

The Eggbox Ising Model

Authors: Mutian Shen, Yichen Xu, Zohar Nussinov

Abstract: We introduce a simple and versatile model that enables controlled design of rugged energy landscapes that realize different types of Parisi overlap distributions. This model captures quintessential aspects of Replica Symmetry Breaking (RSB) theory and may afford additional insights into complex systems and numerical methods for their analysis.

Submitted 21 February, 2025; originally announced February 2025.

arXiv:2502.09030 [pdf, ps, other]

$L^p\to L^q$ estimates for Stein's spherical maximal operators

Authors: Naijia Liu, Minxing Shen, Liang Song, Lixin Yan

Abstract: In this article we consider a modification of Stein's spherical maximal operator of complex order $α$ on ${\mathbb R^n}$: $$ {\mathfrak M}^α_{[1,2]} f(x) =\sup\limits_{t\in [1,2]} \big| {1\over Γ(α) } \int_{|y|\leq 1} \left(1-|y|^2 \right)^{α-1} f(x-ty) dy\big|. $$ We show that for $n\geq 2$, if $\|{\mathfrak M}^α_{[1,2]} f \|_{L^q({\mathbb R^n})} \leq C\|f \|_{L^p({\mathbb R^n})}$ holds for some $α\in \mathbb{C}$, $p,q\geq1$, then we must have $q\geq p$ and $${\rm Re}\,α\geq σ_n(p,q):=\max\left\{\frac{1}{p}-\frac{n}{q},\ \frac{n+1}{2p}-\frac{n-1}{2}\left(\frac{1}{q}+1\right),\frac{n}{p}-n+1\right\}.$$ Conversely, we show that ${\mathfrak M}^α_{[1,2]}$ is bounded from $L^p({\mathbb R^n})$ to $L^q({\mathbb R^n})$ provided that $q\geq p$ and ${\rm Re}\,α>σ_2(p,q)$ for $n=2$; and ${\rm Re}\,α>\max\left\{σ_n(p,q), 1/(2p)- (n-2)/(2q) -(n-1)/4\right\}$ for $n>2$. The range of $α,p$ and $q$ is almost optimal in the case either $n=2$, or $α=0$, or $(p,q)$ lies in some regions for $n>2$.

Submitted 13 February, 2025; originally announced February 2025.

Comments: 14 pages, 1 figure

arXiv:2502.07317 [pdf, other]

doi 10.1016/j.nima.2025.170548

Position reconstruction and surface background model for the PandaX-4T detector

Authors: Zhicheng Qian, Linhui Gu, Chen Cheng, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingjie Fan, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou , et al. (78 additional authors not shown)

Abstract: We report the position reconstruction methods and surface background model for the PandaX-4T dark matter direct search experiment. This work develops two position reconstruction algorithms: template matching (TM) method and photon acceptance function (PAF) method. Both methods determine the horizontal position of events based on the light pattern of secondary scintillation collected by the light sensors. After a comprehensive evaluation of resolution, uniformity, and robustness, the PAF method was selected for position reconstruction, while the TM method was employed for verification. The PAF method achieves a bulk event resolution of 1.0 mm and a surface event resolution of 4.4 mm for a typical $S2$ signal with a bottom charge of 1500 PE (about 14 keV). The uniformity is around 20\%. Robustness studies reveal average deviations of 5.1 mm and 8.8 mm for the commissioning run (Run0) and the first science run (Run1), respectively, due to the deactivation of certain PMTs. A data-driven surface background model is developed based on the PAF method. The surface background is estimated to be $0.09 \pm 0.06$ events for Run0 (0.54 tonne$\cdot$year) and $0.17 \pm 0.11$ events for Run1 (1.00 tonne$\cdot$year).

Submitted 11 February, 2025; originally announced February 2025.

Comments: 22 pages, 15 figures, 2 tables

arXiv:2502.07165 [pdf, other]

Don't Just Demo, Teach Me the Principles: A Principle-Based Multi-Agent Prompting Strategy for Text Classification

Authors: Peipei Wei, Dimitris Dimitriadis, Yan Xu, Mingwei Shen

Abstract: We present PRINCIPLE-BASED PROMPTING, a simple but effective multi-agent prompting strategy for text classification. It first asks multiple LLM agents to independently generate candidate principles based on analysis of demonstration samples with or without labels, consolidates them into final principles via a finalizer agent, and then sends them to a classifier agent to perform downstream classification tasks. Extensive experiments on binary and multi-class classification datasets with different sizes of LLMs show that our approach not only achieves substantial performance gains (1.55% - 19.37%) over zero-shot prompting on macro-F1 score but also outperforms other strong baselines (CoT and stepback prompting). Principles generated by our approach help LLMs perform better on classification tasks than human crafted principles on two private datasets. Our multi-agent PRINCIPLE-BASED PROMPTING approach also shows on-par or better performance compared to demonstration-based few-shot prompting approaches, yet with substantially lower inference costs. Ablation studies show that label information and the multi-agent cooperative LLM framework play an important role in generating high-quality principles to facilitate downstream classification tasks.

Submitted 10 February, 2025; originally announced February 2025.

Comments: To be published in AAAI 2025 Workshop on Advancing LLM-Based Multi-Agent Collaboration

arXiv:2502.04923 [pdf, other]

Cached Multi-Lora Composition for Multi-Concept Image Generation

Authors: Xiandong Zou, Mingzhu Shen, Christos-Savvas Bouganis, Yiren Zhao

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely adopted technique in text-to-image models, enabling precise rendering of multiple distinct elements, such as characters and styles, in multi-concept image generation. However, current approaches face significant challenges when composing these LoRAs for multi-concept image generation, resulting in diminished generated image quality. In this paper, we initially investigate the role of LoRAs in the denoising process through the lens of the Fourier frequency domain. Based on the hypothesis that applying multiple LoRAs could lead to "semantic conflicts", we find that certain LoRAs amplify high-frequency features such as edges and textures, whereas others mainly focus on low-frequency elements, including the overall structure and smooth color gradients. Building on these insights, we devise a frequency domain based sequencing strategy to determine the optimal order in which LoRAs should be integrated during inference. This strategy offers a methodical and generalizable solution compared to the naive integration commonly found in existing LoRA fusion techniques. To fully leverage our proposed LoRA order sequence determination method in multi-LoRA composition tasks, we introduce a novel, training-free framework, Cached Multi-LoRA (CMLoRA), designed to efficiently integrate multiple LoRAs while maintaining cohesive image generation. With its flexible backbone for multi-LoRA fusion and a non-uniform caching strategy tailored to individual LoRAs, CMLoRA has the potential to reduce semantic conflicts in LoRA composition and improve computational efficiency. Our experimental evaluations demonstrate that CMLoRA outperforms state-of-the-art training-free LoRA fusion methods by a significant margin -- it achieves an average improvement of $2.19\%$ in CLIPScore, and $11.25\%$ in MLLM win rate compared to LoraHub, LoRA Composite, and LoRA Switch.

Submitted 7 February, 2025; originally announced February 2025.

Comments: The Thirteenth International Conference on Learning Representations (ICLR 2025)

arXiv:2502.03658 [pdf, other]

Advancing Weight and Channel Sparsification with Enhanced Saliency

Authors: Xinglong Sun, Maying Shen, Hongxu Yin, Lei Mao, Pavlo Molchanov, Jose M. Alvarez

Abstract: Pruning aims to accelerate and compress models by removing redundant parameters, identified by specifically designed importance scores which are usually imperfect. This removal is irreversible, often leading to subpar performance in pruned models. Dynamic sparse training, while attempting to adjust sparse structures during training for continual reassessment and refinement, has several limitations including criterion inconsistency between pruning and growth, unsuitability for structured sparsity, and short-sighted growth strategies. Our paper introduces an efficient, innovative paradigm to enhance a given importance criterion for either unstructured or structured sparsity. Our method separates the model into an active structure for exploitation and an exploration space for potential updates. During exploitation, we optimize the active structure, whereas in exploration, we reevaluate and reintegrate parameters from the exploration space through a pruning and growing step consistently guided by the same given importance criterion. To prepare for exploration, we briefly "reactivate" all parameters in the exploration space and train them for a few iterations while keeping the active part frozen, offering a preview of the potential performance gains from reintegrating these parameters. We show on various datasets and configurations that an existing importance criterion, even one as simple as magnitude, can be enhanced with ours to achieve state-of-the-art performance and training cost reductions. Notably, on ImageNet with ResNet50, ours achieves a +1.3 increase in Top-1 accuracy over prior art at 90% ERK sparsity. Compared with the SOTA latency pruning method HALP, we reduced its training cost by over 70% while attaining a faster and more accurate pruned model.

Submitted 5 February, 2025; originally announced February 2025.

Comments: Accepted at WACV 2025

arXiv:2502.03017 [pdf, other]

Search for Double Beta Decay of $^{136}$Xe to the $0^+_1$ Excited State of $^{136}$Ba with PandaX-4T

Authors: PandaX Collaboration, Lingyin Luo, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingji Fang, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou, Yu Hou , et al. (76 additional authors not shown)

Abstract: We perform a search of double beta decay of $^{136}$Xe to the excited state, $0^+_1$, of $^{136}$Ba (2$νββ$-0$_1^+$), using the dual-phase xenon detector of PandaX-4T with the first 94.9-day commissioning data. The multi-site events are reconstructed up to the MeV energy scale, which helps to improve the background model significantly. The background contribution from the stainless steel platform outside PandaX-4T cryostat is evaluated for the first time. No significant evidence for 2$νββ$-$0_1^+$ is observed, resulting in a lower limit on half-life of $7.5 \times 10^{22}$ yr at the 90% confidence level. This is the first experimental limit on such a rare decay in a natural xenon-based detector.

Submitted 7 March, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

arXiv:2502.02508 [pdf, ps, other]

Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Authors: Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan

Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models are fully open-sourced.

Submitted 15 June, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

arXiv:2501.19208 [pdf, other]

Learning While Repositioning in On-Demand Vehicle Sharing Networks

Authors: Hansheng Jiang, Chunlin Sun, Zuo-Jun Max Shen, Shunan Jiang

Abstract: We consider a network inventory problem motivated by one-way, on-demand vehicle sharing services. Due to uncertainties in both demand and returns, as well as a fixed number of rental units across an $n$-location network, the service provider must periodically reposition vehicles to match supply with demand spatially while minimizing costs. The optimal repositioning policy under a general $n$-location network is intractable without knowing the optimal value function. We introduce the best base-stock repositioning policy as a generalization of the classical inventory control policy to $n$ dimensions, and establish its asymptotic optimality in two distinct limiting regimes under general network structures. We present reformulations to efficiently compute this best base-stock policy in an offline setting with pre-collected data. In the online setting, we show that a natural Lipschitz-bandit approach achieves a regret guarantee of $\widetilde{O}(T^{\frac{n}{n+1}})$, which suffers from the exponential dependence on $n$. We illustrate the challenges of learning with censored data in networked systems through a regret lower bound analysis and by demonstrating the suboptimality of alternative algorithmic approaches. Motivated by these challenges, we propose an Online Gradient Repositioning algorithm that relies solely on censored demand. Under a mild cost-structure assumption, we prove that it attains an optimal regret of $O(n^{2.5} \sqrt{T})$, which matches the regret lower bound in $T$ and achieves only polynomial dependence on $n$. The key algorithmic innovation involves proposing surrogate costs to disentangle intertemporal dependencies and leveraging dual solutions to find the gradient of policy change. Numerical experiments demonstrate the effectiveness of our proposed methods.

Submitted 31 January, 2025; originally announced January 2025.

arXiv:2501.18871 [pdf, other]

Neural SDEs as a Unified Approach to Continuous-Domain Sequence Modeling

Authors: Macheng Shen, Chen Cheng

Abstract: Inspired by the ubiquitous use of differential equations to model continuous dynamics across diverse scientific and engineering domains, we propose a novel and intuitive approach to continuous sequence modeling. Our method interprets time-series data as \textit{discrete samples from an underlying continuous dynamical system}, and models its time evolution using Neural Stochastic Differential Equation (Neural SDE), where both the flow (drift) and diffusion terms are parameterized by neural networks. We derive a principled maximum likelihood objective and a \textit{simulation-free} scheme for efficient training of our Neural SDE model. We demonstrate the versatility of our approach through experiments on sequence modeling tasks across both embodied and generative AI. Notably, to the best of our knowledge, this is the first work to show that SDE-based continuous-time modeling also excels in such complex scenarios, and we hope that our work opens up new avenues for research of SDE models in high-dimensional and temporally intricate domains.

Submitted 30 January, 2025; originally announced January 2025.

arXiv:2501.15942 [pdf, other]

TimeHF: Billion-Scale Time Series Models Guided by Human Feedback

Authors: Yongzhi Qi, Hao Hu, Dazhou Lei, Jianshen Zhang, Zhengxin Shi, Yulin Huang, Zhengyu Chen, Xiaoming Lin, Zuo-Jun Max Shen

Abstract: Time series neural networks perform exceptionally well in real-world applications but encounter challenges such as limited scalability, poor generalization, and suboptimal zero-shot performance. Inspired by large language models, there is interest in developing large time series models (LTM) to address these issues. However, current methods struggle with training complexity, adapting human feedback, and achieving high predictive accuracy. We introduce TimeHF, a novel pipeline for creating LTMs with 6 billion parameters, incorporating human feedback. We use patch convolutional embedding to capture long time series information and design a human feedback mechanism called time-series policy optimization. Deployed in JD.com's supply chain, TimeHF handles automated replenishment for over 20,000 products, improving prediction accuracy by 33.21% over existing methods. This work advances LTM technology and shows significant industrial benefits.

Submitted 27 January, 2025; originally announced January 2025.

arXiv:2501.15381 [pdf, other]

Two-optical-cycle pulses from nanophotonic two-color soliton compression

Authors: Robert M. Gray, Ryoto Sekine, Maximilian Shen, Thomas Zacharias, James Williams, Selina Zhou, Rahul Chawlani, Luis Ledezma, Nicolas Englebert, Alireza Marandi

Abstract: Few- and single-cycle optical pulses and their associated ultra-broadband spectra have been crucial in the progress of ultrafast science and technology. Moreover, multi-color waveforms composed of independently manipulable ultrashort pulses in distinct spectral bands offer unique advantages in pulse synthesis and attosecond science. However, the generation and control of ultrashort pulses has required bulky and expensive optical systems at the tabletop scale and has so far been beyond the reach of integrated photonics. Here, we break these limitations and demonstrate two-optical-cycle pulse compression using quadratic two-color soliton dynamics in lithium niobate nanophotonics. By leveraging dispersion engineering and operation near phase matching, we achieve extreme compression, energy-efficient operation, and strong conversion of pump to the second harmonic. We experimentally demonstrate generation of $\sim$13-fs pulses at 2 $μ$m using only $\sim$3 pJ of input energy. We further illustrate how the demonstrated scheme can be readily extended to on-chip single-cycle pulse synthesis with sub-cycle control. Our results provide a path towards realization of single-cycle ultrafast systems in nanophotonic circuits.

Submitted 18 February, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

Comments: 24 pages, 5 figures

arXiv:2501.14923 [pdf, other]

Comparing Methods for Calculating Solar Energetic Particle Intensities: Re-binning versus Spectral Binning

Authors: M. E. Cuesta, L. Y. Khoo, G. Livadiotis, M. M. Shen, J. R. Szalay, D. J. McComas, J. S. Rankin, R. Bandyopadhyay, H. A. Farooki, J. T. Niehof, C. M. S. Cohen, R. A. Leske, Z. Xu, E. R. Christian, M. I. Desai, M. A. Dayeh

Abstract: Solar energetic particle (SEP) events have been observed for decades in the interplanetary medium by spacecraft measuring the intensity of energetic ions and electrons. These intensities provide valuable information about particle acceleration, the effects of bulk plasma dynamics on particle transport, and the anisotropy of particle distributions. Since measured intensities are typically reported in narrow energy bins, it is common to re-bin intensities over a wider energy range to improve counting statistics. We investigate two methods for calculating intensities across multiple energy bins: a) \textit{re-binned intensity} ($\overline{j}_{\rm linlin}$), which is calculated by integrating the intensity over energy space and corresponds to the intensity at an effective energy that depends on the time-varying spectral index, and b) \textit{spectral binned intensity} ($\overline{j}_{\rm loglog}$), calculated by integrating the log-intensity in log-energy space, yielding the intensity at the log-centered energy that is independent of the spectral index and remains constant over time. We compare these methods using Parker Solar Probe (PSP) IS$\odot$IS measurements of energetic protons, and we prescribe criteria for selecting the appropriate method for different scenarios. Our results show that the re-binned intensity is consistently larger (up to a factor of 5) than the spectral binned intensity for two SEP events observed by PSP, although the time series of the two methods are strongly correlated. Overall, both measures are important for SEP spectral analysis, and the selection of the appropriate measure depends on whether a physical (spectral binned intensity) or a statistical (re-binned intensity) representation is needed for a given analysis.

Submitted 24 January, 2025; originally announced January 2025.

Comments: 17 pages, 9 Figures, Accepted for Publication in ApJS

arXiv:2501.07329 [pdf, other]

Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Authors: Jiliang Hu, Zuchao Li, Mengjia Shen, Haojun Ai, Sheng Li, Jun Zhang

Abstract: Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works on SLU that treat it as a sequence-to-sequence task have achieved great success. However, this method is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on named entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on the two datasets.

Submitted 17 January, 2025; v1 submitted 13 January, 2025; originally announced January 2025.

Comments: 5 pages, 2 figures, accepted by ICASSP 2025

arXiv:2501.03880 [pdf, other]

SELMA3D challenge: Self-supervised learning for 3D light-sheet microscopy image segmentation

Authors: Ying Chen, Rami Al-Maskari, Izabela Horvath, Mayar Ali, Luciano Hoher, Kaiyuan Yang, Zengming Lin, Zhiwei Zhai, Mengzhe Shen, Dejin Xun, Yi Wang, Tony Xu, Maged Goubran, Yunheng Wu, Kensaku Mori, Johannes C. Paetzold, Ali Erturk

Abstract: Recent innovations in light sheet microscopy, paired with developments in tissue clearing techniques, enable the 3D imaging of large mammalian tissues with cellular resolution. Combined with the progress in large-scale data analysis, driven by deep learning, these innovations empower researchers to rapidly investigate the morphological and functional properties of diverse biological samples. Segmentation, a crucial preliminary step in the analysis process, can be automated using domain-specific deep learning models with expert-level performance. However, these models exhibit high sensitivity to domain shifts, leading to a significant drop in accuracy when applied to data outside their training distribution. To address this limitation, and inspired by the recent success of self-supervised learning in training generalizable models, we organized the SELMA3D Challenge during the MICCAI 2024 conference. SELMA3D provides a vast collection of light-sheet images from cleared mouse and human brains, comprising 35 large 3D images (each with over 1000^3 voxels) and 315 annotated small patches for finetuning, preliminary testing and final testing. The dataset encompasses diverse biological structures, including vessel-like and spot-like structures. Five teams participated in all phases of the challenge, and their proposed methods are reviewed in this paper. Quantitative and qualitative results from most participating teams demonstrate that self-supervised learning on large datasets improves segmentation model performance and generalization. We will continue to support and extend SELMA3D as an inaugural MICCAI challenge focused on self-supervised learning for 3D microscopy image segmentation.

Submitted 12 January, 2025; v1 submitted 7 January, 2025; originally announced January 2025.

Comments: 2nd version

arXiv:2412.19970 [pdf, other]

doi 10.1103/PhysRevLett.134.161003

Search for Solar Boosted Dark Matter Particles at the PandaX-4T Experiment

Authors: Guofang Shen, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingjie Fan, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou, Yu Hou, Xiangdong Ji, et al. (78 additional authors not shown)

Abstract: We present a novel constraint on light dark matter utilizing $1.54$ tonne$\cdot$year of data acquired from the PandaX-4T dual-phase xenon time projection chamber. This constraint is derived through detecting electronic recoil signals resulting from the interaction with solar-enhanced dark matter flux. Low-mass dark matter particles, lighter than a few MeV/$c^2$, can scatter with the thermal electrons in the Sun. Consequently, with higher kinetic energy, the boosted dark matter component becomes detectable via contact scattering with xenon electrons, resulting in a few keV energy deposition that exceeds the threshold of PandaX-4T. We calculate the expected recoil energy in PandaX-4T considering the Sun's acceleration and the detection capabilities of the xenon detector. The first experimental search results using the xenon detector yield the most stringent cross-section of $3.51 \times 10^{-39}~\mathrm{cm}^2$ at $0.08~\mathrm{MeV}$/$c^2$ for a solar boosted dark matter mass ranging from $0.02$ to $10~ \mathrm{MeV}$/$c^2$, achieving a 23-fold improvement compared with earlier experimental studies.

Submitted 12 May, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

arXiv:2412.18028 [pdf]

Diverse dust populations in the near-Sun environment characterized by PSP/IS$\odot$IS

Authors: M. M. Shen, J. R. Szalay, P. Pokorný, J. G. Mitchell, M. E. Hill, D. G. Mitchell, D. J. McComas, E. R. Christian, C. M. S. Cohen, N. A. Schwadron, S. D. Bale, D. M. Malaspina

Abstract: The Integrated Science Investigation of the Sun (IS$\odot$IS) energetic particle instrument suite on Parker Solar Probe is dedicated to measuring energetic ions and electrons in the near-Sun environment. It includes a half-sky-viewing time-of-flight mass spectrometer (EPI-Lo) and five high-energy silicon solid-state detector-telescopes (EPI-Hi). As of August 2024, eight of EPI-Lo's eighty separate telescope foils have experienced direct dust puncture events, most of which occurred inside 40 solar radii (0.19 au). These impacts represent the closest ever direct dust detections to the Sun. While there is limited information about the size/mass of each impact due to the lack of a dedicated dust instrument, we can determine the impact direction for six punctures, allowing us to partially constrain the inner zodiacal abundance. Remarkably, one of six unambiguous dust impactors was likely on a retrograde orbit, suggesting long-period cometary material may survive within 20 solar radii (0.09 au). We discuss observations in the context of improving our understanding of the inner zodiacal dust environment, highlighting multiple dust populations responsible for these events, and refining hazard assessment for near-Sun spacecraft.

Submitted 28 December, 2024; v1 submitted 23 December, 2024; originally announced December 2024.

Showing 1–50 of 468 results for author: shen, M