-
Bootstrapping form factor squared in ${\cal N}=4$ super-Yang-Mills
Authors:
Song He,
Xiang Li,
Jingwen Lin,
Jiahao Liu,
Kai Yan
Abstract:
We propose a bootstrap program for the {\it form factor squared} with operator ${\rm tr}(φ^2)$ in maximally supersymmetric Yang-Mills theory in the planar limit, which plays a central role for perturbative calculations of important physical observables such as energy correlators. The tree-level $N$-point form factor (FF) squared can be obtained by cutting $N$ propagators of a collection of two-point ``master diagrams" at $(N{-}1)$ loops: for $N=3,4,5,6$ there are merely $1, 2, 4, 13$ topologies of such diagrams respectively, and their numerators are strongly constrained by power-counting (including ``no triangle" property) and other constraints such as the ``rung rule". Moreover, these two-point diagrams provide a ``unification" of FF squared at different numbers of loops and legs, which is similar to extracting (planar) amplitude squared from vacuum master diagrams (dual to $f$-graphs): by cutting $2\leq n<N$ propagators, one can also extract the planar integrand of $n$-point FF squared at $(N-n)$ loops, thus our results automatically include integrands of 2-point (Sudakov) FF up to four loops (where the squaring is trivial), 3-point FF squared up to three loops, and so on. Our ansatz is completely fixed using soft limits of (tree and loop) FF squared and the multi-collinear limit which reduces it to the splitting function, without any other inputs such as unitarity cuts. This method opens up the exciting possibility of a {\it graphical bootstrap} for FF squared for higher $N$ (which contains {\it e.g.} planar Sudakov FF to $N{-}2$ loops) similar to that for the amplitude squared via $f$-graphs. We also comment on applications to the computation of leading order energy correlators where new structures are expected after performing phase-space integrations.
Submitted 9 June, 2025;
originally announced June 2025.
-
Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
Authors:
Pengfei Zhao,
Rongbo Luan,
Wei Zhang,
Peng Wu,
Sifeng He
Abstract:
Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLMs to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference data construction using an off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.
Submitted 7 June, 2025;
originally announced June 2025.
-
Identity Deepfake Threats to Biometric Authentication Systems: Public and Expert Perspectives
Authors:
Shijing He,
Yaxiong Lei,
Zihan Zhang,
Yuzhou Sun,
Shujun Li,
Chi Zhang,
Juan Ye
Abstract:
Generative AI (Gen-AI) deepfakes pose a rapidly evolving threat to biometric authentication, yet a significant gap exists between expert understanding of these risks and public perception. This disconnection creates critical vulnerabilities in systems trusted by millions. To bridge this gap, we conducted a comprehensive mixed-method study, surveying 408 professionals across key sectors and conducting in-depth interviews with 37 participants (25 experts, 12 general public [non-experts]). Our findings reveal a paradox: while the public increasingly relies on biometrics for convenience, experts express grave concerns about the spoofing of static modalities like face and voice recognition. We found significant demographic and sector-specific divides in awareness and trust, with finance professionals, for example, showing heightened skepticism. To systematically analyze these threats, we introduce a novel Deepfake Kill Chain model, adapted from Hutchins et al.'s cybersecurity frameworks to map the specific attack vectors used by malicious actors against biometric systems. Based on this model and our empirical findings, we propose a tri-layer mitigation framework that prioritizes dynamic biometric signals (e.g., eye movements), robust privacy-preserving data governance, and targeted educational initiatives. This work provides the first empirically grounded roadmap for defending against AI-generated identity threats by aligning technical safeguards with human-centered insights.
Submitted 7 June, 2025;
originally announced June 2025.
-
Privacy Perspectives and Practices of Chinese Smart Home Product Teams
Authors:
Shijing He,
Yaxiong Lei,
Xiao Zhan,
Chi Zhang,
Juan Ye,
Ruba Abu-Salma,
Jose Such
Abstract:
Previous research has explored the privacy needs and concerns of device owners, primary users, and different bystander groups with regard to smart home devices like security cameras, smart speakers, and hubs, but little is known about the privacy views and practices of smart home product teams, particularly those in non-Western contexts. This paper presents findings from 27 semi-structured interviews with Chinese smart home product team members, including product/project managers, software/hardware engineers, user experience (UX) designers, legal/privacy experts, and marketers/operation specialists. We examine their privacy perspectives, practices, and risk mitigation strategies. Our results show that participants emphasized compliance with Chinese data privacy laws, which typically prioritized national security over individual privacy rights. China-specific cultural, social, and legal factors also influenced participants' ethical considerations and attitudes toward balancing user privacy and security with convenience. Drawing on our findings, we propose a set of recommendations for smart home product teams, along with socio-technical and legal interventions to address smart home privacy issues, especially those belonging to at-risk groups, in Chinese multi-user smart homes.
Submitted 6 June, 2025;
originally announced June 2025.
-
The asymptotics of the $\mathrm{SL}_2(\mathbb{C})$-Hitchin metric on the singular locus: subintegrable systems
Authors:
Siqi He,
Johannes Horn,
Nianzi Li
Abstract:
We study the asymptotic hyperkähler geometry of the $\mathrm{SL}_2(\mathbb{C})$-Hitchin moduli space over the singular fibers of the Hitchin fibration. We extend the previously known exponential convergence results for solutions to the Hitchin equation to the class of locally fiducial Higgs bundles defined by a special local description at the singularities of the spectral curve. This condition is satisfied by the Higgs bundles contained in certain subintegrable systems introduced by Hitchin. We prove that the restriction of the hyperkähler metric to the subintegrable system converges exponentially fast to the corresponding semi-flat metric along a ray $(\mathcal{E},t\varphi)$. This answers a question posed by Hitchin in \cite{Hitchin2021subintegrable_special_Kaehler}. More generally, we prove that for each stratum of quadratic differentials there is a closed subset of the corresponding Hitchin fibers, such that the restricted hyperkähler metric converges to a generalized semi-flat metric.
Submitted 5 June, 2025;
originally announced June 2025.
-
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Authors:
Zhaolu Kang,
Junhao Gong,
Jiaxu Yan,
Wanke Xia,
Yian Wang,
Ziwen Wang,
Huaxuan Ding,
Zhuo Cheng,
Wenhao Cao,
Zhiyuan Feng,
Siqi He,
Shannan Yan,
Junzhe Chen,
Xiaomin He,
Chaoya Jiang,
Wei Ye,
Kaidong Yu,
Xuelong Li
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.
Submitted 4 June, 2025;
originally announced June 2025.
-
CogniPair: From LLM Chatbots to Conscious AI Agents -- GNWT-Based Multi-Agent Digital Twins for Social Pairing -- Dating & Hiring Applications
Authors:
Wanghao Ye,
Sihan Chen,
Yiting Wang,
Shwai He,
Bowei Tian,
Guoheng Sun,
Ziyi Wang,
Ziyao Wang,
Yexiao He,
Zheyu Shen,
Meng Liu,
Yuning Zhang,
Meng Feng,
Yang Wang,
Siyuan Peng,
Yilong Dai,
Zhenle Duan,
Hanzhang Qin,
Ang Li
Abstract:
Current large language model (LLM) agents lack authentic human psychological processes necessary for genuine digital twins and social AI applications. To address this limitation, we present a computational implementation of Global Workspace Theory (GNWT) that integrates human cognitive architecture principles into LLM agents, creating specialized sub-agents for emotion, memory, social norms, planning, and goal-tracking coordinated through a global workspace mechanism. However, authentic digital twins require accurate personality initialization. We therefore develop a novel adventure-based personality test that evaluates true personality through behavioral choices within interactive scenarios, bypassing self-presentation bias found in traditional assessments. Building on these innovations, our CogniPair platform enables digital twins to engage in realistic simulated dating interactions and job interviews before real encounters, providing bidirectional cultural fit assessment for both romantic compatibility and workplace matching. Validation using 551 GNWT-Agents and Columbia University Speed Dating dataset demonstrates 72% correlation with human attraction patterns, 77.8% match prediction accuracy, and 74% agreement in human validation studies. This work advances psychological authenticity in LLM agents and establishes a foundation for intelligent dating platforms and HR technology solutions.
Submitted 3 June, 2025;
originally announced June 2025.
-
How Far Are We from Predicting Missing Modalities with Foundation Models?
Authors:
Guanzhou Ke,
Yi Xie,
Xiaoli Wang,
Guoqing Chao,
Bo Wang,
Shengfeng He
Abstract:
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality prediction remains underexplored. To investigate this, we categorize existing approaches into three representative paradigms, encompassing a total of 42 model variants, and conduct a comprehensive evaluation in terms of prediction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned predictions. To address these challenges, we propose an agentic framework tailored for missing modality prediction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a \textit{self-refinement mechanism}, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10% compared to baselines.
Submitted 3 June, 2025;
originally announced June 2025.
-
Self-Dual Electrodynamics via the Characteristic Method: Relativistic and Carrollian Perspectives
Authors:
Bin Chen,
Song He,
Jue Hou
Abstract:
Electric-magnetic duality plays a pivotal role in understanding the structure of nonlinear electrodynamics (NED). The Gaillard-Zumino (GZ) criterion provides a powerful constraint for identifying self-dual theories. In this work, we systematically explore solutions to the GZ self-duality condition by applying the method of characteristics, a robust tool for solving nonlinear partial differential equations. Our approach enables the construction of new classes of Lagrangians that respect duality symmetry, both in the relativistic and Carrollian frameworks. In the relativistic setting, we not only recover well-known examples such as Born-Infeld and ModMax theories, but also identify novel models. We then generalize the GZ formalism to the Carrollian case and construct several classes of Carrollian self-dual non-linear electrodynamic models. Remarkably, we demonstrate that the characteristic flow exhibits an attractor behavior, in the sense that different seed theories that may not be self-dual can generate the same descendant self-dual Lagrangian. These findings broaden the landscape of self-dual theories and open new directions for exploring duality in ultra-relativistic regimes.
Submitted 3 June, 2025;
originally announced June 2025.
-
NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results
Authors:
Xiaohong Liu,
Xiongkuo Min,
Qiang Hu,
Xiaoyun Zhang,
Jie Guo,
Guangtao Zhai,
Shushi Wang,
Yingjie Zhou,
Lu Liu,
Jingxin Li,
Liu Yang,
Farong Wen,
Li Xu,
Yanwei Jiang,
Xilei Zhu,
Chunyi Li,
Zicheng Zhang,
Huiyu Duan,
Xiele Wu,
Yixuan Gao,
Yuqin Cao,
Jun Jia,
Wei Sun,
Jiezhang Cao,
Radu Timofte
, et al. (70 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The challenge addresses a major problem in the field of video and talking head processing. It is divided into three tracks: user-generated video, AI-generated video, and talking head. The user-generated video track uses FineVD-GC, which contains 6,284 user-generated videos. This track has a total of 125 registered participants. A total of 242 submissions were received in the development phase, and 136 in the test phase. Finally, 5 participating teams submitted their models and fact sheets. The AI-generated video track uses Q-Eval-Video, which contains 34,029 AI-Generated Videos (AIGVs) generated by 11 popular Text-to-Video (T2V) models. A total of 133 participants registered in this track. A total of 396 submissions were received in the development phase, and 226 in the test phase. Finally, 6 participating teams submitted their models and fact sheets. The talking head track uses THQA-NTIRE, which contains 12,247 2D and 3D talking heads. A total of 89 participants registered in this track. A total of 225 submissions were received in the development phase, and 118 in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Each participating team in every track proposed a method that outperforms the baseline, contributing to the development of the fields covered by the three tracks.
Submitted 3 June, 2025;
originally announced June 2025.
-
Large-scale Self-supervised Video Foundation Model for Intelligent Surgery
Authors:
Shu Yang,
Fengtao Zhou,
Leon Mayer,
Fuxiang Huang,
Yiliang Chen,
Yihui Wang,
Sunan He,
Yuxiang Nie,
Xi Wang,
Ömer Sümer,
Yueming Jin,
Huihui Sun,
Shuchang Xu,
Alex Qinyang Liu,
Zheng Li,
Jing Qin,
Jeremy YuenChun Teoh,
Lena Maier-Hein,
Hao Chen
Abstract:
Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3,650 videos and approximately 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that captures intricate spatial structures and temporal dynamics through joint spatiotemporal modeling. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments demonstrate that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, demonstrating strong potential to advance intelligent surgical systems in clinically meaningful scenarios.
Submitted 3 June, 2025;
originally announced June 2025.
-
Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
Authors:
Shenghua He,
Tian Xia,
Xuan Zhou,
Hui Wei
Abstract:
We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax, and RLOO inherently possess the capacity to model token-level reward signals, offering a theoretical justification for response-level reward approaches. Our findings pave the way for more practical, efficient LLM fine-tuning, allowing developers to treat training algorithms as black boxes and focus on improving the response-level reward model with auxiliary sub-models. We also offer a detailed analysis of popular RL and non-RL methods, comparing their theoretical foundations and practical advantages across common LLM tasks. Finally, we propose a new algorithm: Token-Reinforced Policy Optimization (TRePO), a theoretically grounded method that is simpler than PPO, matches GRPO in memory efficiency, and holds promise for broad applicability.
Submitted 3 June, 2025;
originally announced June 2025.
-
Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning
Authors:
Fangyu Lei,
Jinxiang Meng,
Yiming Huang,
Tinghong Chen,
Yun Zhang,
Shizhu He,
Jun Zhao,
Kang Liu
Abstract:
Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imitative learning. We introduce Reasoning-Table, the first application of reinforcement learning (RL) to table reasoning, achieving state-of-the-art performance. Through rigorous data preprocessing, reward design, and tailored training strategies, our method leverages simple rule-based outcome rewards to outperform SFT across multiple benchmarks. Unified training across diverse tasks enables Reasoning-Table to emerge as a robust table reasoning large language model, surpassing larger proprietary models like Claude-3.7-Sonnet by 4.0% on table reasoning benchmarks. The approach also achieves excellent performance on text-to-SQL tasks, reaching 68.3% performance on the BIRD dev dataset with a 7B model. Further experiments demonstrate that Reasoning-Table enhances the model's generalization capabilities and robustness.
Submitted 2 June, 2025;
originally announced June 2025.
-
Local Ambiguity Shaping for Doppler-Resilient Sequences Under Spectral and PAPR Constraints
Authors:
Shi He,
Lingsheng Meng,
Yao Ge,
Yong Liang Guan,
David González G.,
Zilong Liu
Abstract:
This paper focuses on designing Doppler-resilient sequences with low local Ambiguity Function (AF) sidelobes, subject to certain spectral and Peak-to-Average Power Ratio (PAPR) constraints. To achieve this, we propose two distinct optimization algorithms: (i) an Alternating Minimization (AM) algorithm for superior Weighted Peak Sidelobe Level (WPSL) minimization, and (ii) a low-complexity Augmented Lagrangian-assisted Majorization Minimization (ALaMM) algorithm with effective WPSL suppression. The proposed schemes hold great potential for sequence design in future 6G and integrated sensing and communication applications, supporting robust sensing under spectral coexistence constraints in high-mobility scenarios.
Submitted 2 June, 2025;
originally announced June 2025.
-
ProstaTD: A Large-scale Multi-source Dataset for Structured Surgical Triplet Detection
Authors:
Yiliang Chen,
Zhixi Li,
Cheng Xu,
Alex Qinyang Liu,
Xuemiao Xu,
Jeremy Yuen-Chun Teoh,
Shengfeng He,
Jing Qin
Abstract:
Surgical triplet detection has emerged as a pivotal task in surgical video analysis, with significant implications for performance assessment and the training of novice surgeons. However, existing datasets such as CholecT50 exhibit critical limitations: they lack precise spatial bounding box annotations, provide inconsistent and clinically ungrounded temporal labels, and rely on a single data source, which limits model generalizability. To address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy. ProstaTD offers clinically defined temporal boundaries and high-precision bounding box annotations for each structured triplet action. The dataset comprises 60,529 video frames and 165,567 annotated triplet instances, collected from 21 surgeries performed across multiple institutions, reflecting a broad range of surgical practices and intraoperative conditions. The annotation process was conducted under rigorous medical supervision and involved more than 50 contributors, including practicing surgeons and medically trained annotators, through multiple iterative phases of labeling and verification. ProstaTD is the largest and most diverse surgical triplet dataset to date, providing a robust foundation for fair benchmarking, the development of reliable surgical AI systems, and scalable tools for procedural training.
Submitted 1 June, 2025;
originally announced June 2025.
-
MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book
Authors:
Sau Lai Yip,
Sunan He,
Yuxiang Nie,
Shu Pui Chan,
Yilin Ye,
Sum Ying Lam,
Hao Chen
Abstract:
The accelerating development of general medical artificial intelligence (GMAI), powered by multimodal large language models (MLLMs), offers transformative potential for addressing persistent healthcare challenges, including workforce deficits and escalating costs. The parallel development of systematic evaluation benchmarks emerges as a critical imperative to enable performance assessment and provide technological guidance. Meanwhile, as an invaluable knowledge source, the potential of medical textbooks for benchmark development remains underexploited. Here, we present MedBookVQA, a systematic and comprehensive multimodal benchmark derived from open-access medical textbooks. To curate this benchmark, we propose a standardized pipeline for automated extraction of medical figures while contextually aligning them with corresponding medical narratives. Based on this curated data, we generate 5,000 clinically relevant questions spanning modality recognition, disease classification, anatomical identification, symptom diagnosis, and surgical procedures. A multi-tier annotation system categorizes queries through hierarchical taxonomies encompassing medical imaging modalities (42 categories), body anatomies (125 structures), and clinical specialties (31 departments), enabling nuanced analysis across medical subdomains. We evaluate a wide array of MLLMs, including proprietary, open-sourced, medical, and reasoning models, revealing significant performance disparities across task types and model categories. Our findings highlight critical capability gaps in current GMAI systems while establishing textbook-derived multimodal benchmarks as essential evaluation tools. MedBookVQA establishes textbook-derived benchmarking as a critical paradigm for advancing clinical AI, exposing limitations in GMAI systems while providing anatomically structured performance metrics across specialties.
Submitted 1 June, 2025;
originally announced June 2025.
-
BAGNet: A Boundary-Aware Graph Attention Network for 3D Point Cloud Semantic Segmentation
Authors:
Wei Tao,
Xiaoyang Qu,
Kai Lu,
Jiguang Wan,
Shenglin He,
Jianzong Wang
Abstract:
Since the point cloud data is inherently irregular and unstructured, point cloud semantic segmentation has always been a challenging task. The graph-based method attempts to model the irregular point cloud by representing it as a graph; however, this approach incurs substantial computational cost due to the necessity of constructing a graph for every point within a large-scale point cloud. In this paper, we observe that boundary points possess more intricate spatial structural information and develop a novel graph attention network known as the Boundary-Aware Graph attention Network (BAGNet). On one hand, BAGNet contains a boundary-aware graph attention layer (BAGLayer), which employs edge vertex fusion and attention coefficients to capture features of boundary points, reducing the computation time. On the other hand, BAGNet employs a lightweight attention pooling layer to extract the global feature of the point cloud to maintain model accuracy. Extensive experiments on standard datasets demonstrate that BAGNet outperforms state-of-the-art methods in point cloud semantic segmentation with higher accuracy and less inference time.
Submitted 31 May, 2025;
originally announced June 2025.
-
Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free
Authors:
Luigi Sigillo,
Shengfeng He,
Danilo Comminiello
Abstract:
High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.
Submitted 3 June, 2025; v1 submitted 31 May, 2025;
originally announced June 2025.
-
Co-designed Quantum Discrete Adiabatic Linear System Solver Via Dynamic Circuits
Authors:
Boxuan Ai,
Shuo He,
Xiang Zhao,
Lin Yang,
Guozhen Liu,
Pengfei Gao,
Hongbao Liu,
Tao Tang,
Jiecheng Yang,
Jie Wu
Abstract:
Existing quantum discrete adiabatic approaches are hindered by circuit depth that increases linearly with the number of evolution steps, a significant challenge for current quantum hardware with limited coherence times. To address this, we propose a co-designed framework that synergistically integrates dynamic circuit capabilities with real-time classical processing. This framework reformulates the quantum adiabatic evolution into discrete, dynamically adjustable segments. The unitary operator for each segment is optimized on-the-fly using classical computation, and circuit multiplexing techniques are leveraged to reduce the overall circuit depth scaling from $O(\text{steps}\times\text{depth}(U))$ to $O(\text{depth}(U))$. We implement and benchmark a quantum discrete adiabatic linear solver based on this framework for linear systems of $W \in \{2,4,8,16\}$ dimensions with condition numbers $κ\in \{10,20,30,40,50\}$. Our solver successfully overcomes previous depth limitations, maintaining over 80% solution fidelity even under realistic noise models. Key algorithmic optimizations contributing to this performance include a first-order approximation of the discrete evolution operator, a tailored dynamic circuit design exploiting real-imaginary component separation, and noise-resilient post-processing techniques.
Submitted 30 May, 2025;
originally announced May 2025.
-
A note on the Diversity Owen values
Authors:
Songtao He,
Erfang Shan,
Xinyu Sun
Abstract:
Béal et al. (Int J Game Theory 54, 2025) introduce the Diversity Owen value for TU-games with diversity constraints, and provide axiomatic characterizations using the axioms of fairness and balanced contributions. However, there exist logical flaws in the proofs of the uniqueness of these characterizations. In this note we provide the corrected proofs of the characterizations by introducing the null player for diverse games axiom. Also, we establish an alternative characterization of the Diversity Owen value by modifying the axioms of the above characterizations.
Submitted 5 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
Authors:
Xiang Li,
Haiyang Yu,
Xinghua Zhang,
Ziyang Huang,
Shizhu He,
Kang Liu,
Jun Zhao,
Fei Huang,
Yongbin Li
Abstract:
Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
Submitted 29 May, 2025;
originally announced May 2025.
-
SWE-bench Goes Live!
Authors:
Linghao Zhang,
Shilin He,
Chaoyun Zhang,
Yu Kang,
Bowen Li,
Chengxing Xie,
Junhao Wang,
Maoquan Wang,
Yufan Huang,
Shengyu Fu,
Elsie Nallipogu,
Qingwei Lin,
Yingnong Dang,
Saravan Rajmohan,
Dongmei Zhang
Abstract:
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
Submitted 1 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking
Authors:
Yaxiong Lei,
Mingyue Zhao,
Yuheng Wang,
Shijing He,
Yusuke Sugano,
Mohamed Khamis,
Juan Ye
Abstract:
Mobile gaze tracking faces a fundamental challenge: maintaining accuracy as users naturally change their postures and device orientations. Traditional calibration approaches, such as one-off calibration, fail to adapt to these dynamic conditions, leading to degraded performance over time. We present MAC-Gaze, a Motion-Aware continual Calibration approach that leverages smartphone inertial measurement unit (IMU) sensors and continual learning techniques to automatically detect changes in user motion states and update the gaze tracking model accordingly. Our system integrates a pre-trained visual gaze estimator and an IMU-based activity recognition model with a clustering-based hybrid decision-making mechanism that triggers recalibration when motion patterns deviate significantly from previously encountered states. To enable accumulative learning of new motion conditions while mitigating catastrophic forgetting, we employ replay-based continual learning, allowing the model to maintain performance across previously encountered motion conditions. We evaluate our system through extensive experiments on the publicly available RGBDGaze dataset and our own 10-hour multimodal MotionGaze dataset (481K+ images, 800K+ IMU readings), encompassing a wide range of postures under various motion conditions including sitting, standing, lying, and walking. Results demonstrate that our method reduces gaze estimation error by 19.9% on RGBDGaze (from 1.73 cm to 1.41 cm) and by 31.7% on MotionGaze (from 2.81 cm to 1.92 cm) compared to traditional calibration approaches. Our framework provides a robust solution for maintaining gaze estimation accuracy in mobile scenarios.
Submitted 5 June, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
UniFoil: A Universal Dataset of Airfoils in Transitional and Turbulent Regimes for Subsonic and Transonic Flows
Authors:
Rohit Sunil Kanchi,
Benjamin Melanson,
Nithin Somasekharan,
Shaowu Pan,
Sicheng He
Abstract:
We present UniFoil, a large publicly available universal airfoil dataset based on Reynolds-averaged Navier-Stokes (RANS) simulations. It contains over 500,000 samples spanning a wide range of Reynolds and Mach numbers, capturing both transitional and fully turbulent flows across incompressible to compressible regimes. UniFoil is designed to support machine learning research in fluid dynamics, particularly for modeling complex aerodynamic phenomena. Most existing datasets are limited to incompressible, fully turbulent flows with smooth field characteristics, overlooking the critical physics of laminar-turbulent transition and shock-wave interactions, features that exhibit strong nonlinearity and sharp gradients. UniFoil addresses this limitation by offering a broad spectrum of realistic flow conditions. Turbulent simulations utilize the Spalart-Allmaras (SA) model, while transitional flows are modeled using an $e^N$-based transition prediction method coupled with the SA model. The dataset includes a comprehensive geometry set comprising over 4,800 natural laminar flow (NLF) airfoils and 30,000 fully turbulent (FT) airfoils, covering a diverse range of airfoil designs relevant to aerospace, wind energy, and marine applications. This dataset is also valuable for scientific machine learning, enabling the development of data-driven models that more accurately capture the transport processes associated with laminar-turbulent transition. UniFoil is freely available under a permissive CC-BY-SA license.
Submitted 3 June, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models
Authors:
Jianxing Liao,
Junyan Xu,
Yatao Sun,
Maowen Tang,
Sicheng He,
Jingxian Liao,
Shui Yu,
Yun Li,
Hongguan Xiao
Abstract:
Designing complex computer-aided design (CAD) models is often time-consuming due to challenges such as computational inefficiency and the difficulty of generating precise models. We propose a novel language-guided framework for industrial design automation to address these issues, integrating large language models (LLMs) with computer-automated design (CAutoD). Through this framework, CAD models are automatically generated from parameters and appearance descriptions, supporting the automation of design tasks during the detailed CAD design phase. Our approach introduces three key innovations: (1) a semi-automated data annotation pipeline that leverages LLMs and vision-language large models (VLLMs) to generate high-quality parameters and appearance descriptions; (2) a Transformer-based CAD generator (TCADGen) that predicts modeling sequences via dual-channel feature aggregation; (3) an enhanced CAD modeling generation model, called CADLLM, that is designed to refine the generated sequences by incorporating the confidence scores from TCADGen. Experimental results demonstrate that the proposed approach outperforms traditional methods in both accuracy and efficiency, providing a powerful tool for automating industrial workflows and generating complex CAD models from textual prompts. The code is available at https://jianxliao.github.io/cadllm-page/
Submitted 26 May, 2025;
originally announced May 2025.
-
Room Impulse Response as a Prompt for Acoustic Echo Cancellation
Authors:
Fei Zhao,
Shulin He,
Xueliang Zhang
Abstract:
Data-driven acoustic echo cancellation (AEC) methods, predominantly trained on synthetic or constrained real-world datasets, encounter performance declines in unseen echo scenarios, especially in real environments where echo paths are not directly observable. Our proposed method counters this limitation by integrating room impulse response (RIR) as a pivotal training prompt, aiming to improve the generalization of AEC models in such unforeseen conditions. We also explore four RIR prompt fusion methods. Comprehensive evaluations, including both simulated RIR under unknown conditions and recorded RIR in real environments, demonstrate that the proposed approach significantly improves performance compared to baseline models. These results substantiate the effectiveness of our RIR-guided approach in strengthening the model's generalization capabilities.
Submitted 26 May, 2025;
originally announced May 2025.
-
EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
Authors:
GuangHao Meng,
Sunan He,
Jinpeng Wang,
Tao Dai,
Letian Zhang,
Jieming Zhu,
Qing Li,
Gang Wang,
Rui Zhang,
Yong Jiang
Abstract:
Vision-language retrieval (VLR), which involves using text (or images) as queries to retrieve corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, thus leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.
Submitted 24 May, 2025;
originally announced May 2025.
-
Pessimism Principle Can Be Effective: Towards a Framework for Zero-Shot Transfer Reinforcement Learning
Authors:
Chi Zhang,
Ziying Jia,
George K. Atia,
Sihong He,
Yue Wang
Abstract:
Transfer reinforcement learning aims to derive a near-optimal policy for a target environment with limited data by leveraging abundant data from related source domains. However, it faces two key challenges: the lack of performance guarantees for the transferred policy, which can lead to undesired actions, and the risk of negative transfer when multiple source domains are involved. We propose a novel framework based on the pessimism principle, which constructs and optimizes a conservative estimation of the target domain's performance. Our framework effectively addresses the two challenges by providing an optimized lower bound on target performance, ensuring safe and reliable decisions, and by exhibiting monotonic improvement with respect to the quality of the source domains, thereby avoiding negative transfer. We construct two types of conservative estimations, rigorously characterize their effectiveness, and develop efficient distributed algorithms with convergence guarantees. Our framework provides a theoretically sound and practically robust solution for transfer learning in reinforcement learning.
Submitted 29 May, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
Instruct2See: Learning to Remove Any Obstructions Across Distributions
Authors:
Junhang Li,
Yu Guo,
Chuhua Xian,
Shengfeng He
Abstract:
Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase. Code and dataset are available at https://jhscut.github.io/Instruct2See.
Submitted 23 May, 2025;
originally announced May 2025.
-
Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN
Authors:
Yao Xu,
Mingyu Xu,
Fangyu Lei,
Wangtao Sun,
Xiangrong Zeng,
Bingning Wang,
Guang Liu,
Shizhu He,
Jun Zhao,
Kang Liu
Abstract:
Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlate with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token's representation with the previous one before inputting it to FFN. This architecture dynamically amplifies the representation differences between adjacent tokens. Extensive experiments on multiple mathematical reasoning tasks demonstrate that LoRA combined with Shift-FFN achieves higher accuracy and a lower rate of Cyclical Reasoning across various data sizes compared to full fine-tuning and standard LoRA. Our data and code are available at https://anonymous.4open.science/r/Shift-FFN
Submitted 22 May, 2025;
originally announced May 2025.
-
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment
Authors:
Shuhao Han,
Haotian Fan,
Fangyuan Kong,
Wenjie Liao,
Chunle Guo,
Chongyi Li,
Radu Timofte,
Liang Li,
Tao Li,
Junhui Cui,
Yunqiu Wang,
Yang Tai,
Jingwei Sun,
Jianhui Sun,
Xinli Yue,
Tianyi Wang,
Huan Hou,
Junda Lu,
Xinyang Huang,
Zitang Zhou,
Zijian Zhang,
Xuhui Zheng,
Xuecheng Wu,
Chong Peng,
Xuezhi Cao
, et al. (90 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structure track. The alignment track uses EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions were received in the development phase, and 507 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants registered in the structure track. A total of 1,155 submissions were received in the development phase, and 487 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.
Submitted 22 May, 2025;
originally announced May 2025.
-
Pure nematic transition inside the superconducting dome of iron chalcogenide superconductor FeSe$_{1-x}$Te$_x$
Authors:
K. Y. Liang,
R. Z. Zhang,
Z. F. Lin,
Z. J. Li,
B. R. Chen,
P. H. Zhang,
K. Z. Yao,
Q. S. He,
Q. Z. Zhou,
H. X. Yao,
K. Jin,
Y. H. Wang
Abstract:
Nematicity and magnetism are prevalent orders in high transition temperature (Tc) superconductors, coexisting in the parent compound of most material families. Quantum fluctuations of nematicity or spin orders are both plausible candidates for mediating unconventional Cooper pairing. Identifying the sole effect of a nematic quantum critical point (QCP) on the emergence of the superconducting dome without interference from spin fluctuations is therefore highly desirable. The iron chalcogenide superconductor FeSe exhibits pure nematicity without any magnetic ordering. A nematic quantum phase transition can be induced by Te substitution, but experimental study of such a transition has so far been limited to its normal state. By performing local susceptometry on composition-spread FeSe$_{1-x}$Te$_x$ films ($0 < x < 1$) using scanning Superconducting Quantum Interference Device (sSQUID) microscopy, we investigate the superfluid density ($ρ_s$) across the pure nematic transition in extremely fine steps of $Δx$ = 0.0008. The temperature dependence of $ρ_s$ changes from the form of anisotropic pairing on the nematic side to an isotropic one across the critical doping $x_c$. The power-law dependence of gap anisotropy on $|x - x_c|$ provides evidence for nematic quantum criticality under the superconducting dome. The low-temperature $ρ_s$ scales linearly with Tc in the nematic phase $x < x_c$, whereas the gap amplitude, maximized at $x_c$, determines the Tc for $x>x_c$. Our results establish a pure nematic QCP in FeSe$_{1-x}$Te$_x$, separating two superconducting orders with distinct pairing boosted by nematic quantum fluctuations.
Submitted 21 May, 2025;
originally announced May 2025.
-
Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention
Authors:
Huanxuan Liao,
Wen Hu,
Yao Xu,
Shizhu He,
Jun Zhao,
Kang Liu
Abstract:
Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose $\textbf{Hy}$brid $\textbf{Co}$ntext $\textbf{Co}$mpression (HyCo$_2$) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo$_2$ method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1\% across seven knowledge-intensive QA benchmarks. Moreover, HyCo$_2$ matches the performance of uncompressed methods while reducing token consumption by 88.8\%.
Submitted 21 May, 2025;
originally announced May 2025.
-
Expanding Zero-Shot Object Counting with Rich Prompts
Authors:
Huilin Zhu,
Senyao Li,
Jingling Yuan,
Zhengwei Yang,
Yu Guo,
Wenxuan Liu,
Xian Zhong,
Shengfeng He
Abstract:
Expanding pre-trained zero-shot counting models to handle unseen categories requires more than simply adding new prompts, as this approach does not achieve the necessary alignment between text and visual features for accurate counting. We introduce RichCount, the first framework to address these limitations, employing a two-stage training strategy that enhances text encoding and strengthens the model's association with objects in images. RichCount improves zero-shot counting for unseen categories through two key objectives: (1) enriching text features with a feed-forward network and adapter trained on text-image similarity, thereby creating robust, aligned representations; and (2) applying this refined encoder to counting tasks, enabling effective generalization across diverse prompts and complex images. In this manner, RichCount goes beyond simple prompt expansion to establish meaningful feature alignment that supports accurate counting across novel categories. Extensive experiments on three benchmark datasets demonstrate the effectiveness of RichCount, achieving state-of-the-art performance in zero-shot counting and significantly enhancing generalization to unseen categories in open-world scenarios.
Submitted 26 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models
Authors:
Yuqiao Tan,
Shizhu He,
Kang Liu,
Jun Zhao
Abstract:
Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate $\textbf{Alignment}$ in parametric space is the fundamental prerequisite to achieve successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tune for alignment. Hence, to reduce cost for further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called $\textbf{LaTen}$ ($\textbf{L}$oc$\textbf{a}$te-$\textbf{T}$h$\textbf{e}$n-Alig$\textbf{n}$) that aligns the parametric spaces of LLMs across scales only using several training steps without following training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify $\textbf{Neural Incompatibility}$ as the ethological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT. Our code is available at https://github.com/Trae1ounG/Neural_Incompatibility.
Submitted 20 May, 2025;
originally announced May 2025.
-
CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs
Authors:
Guoheng Sun,
Ziyao Wang,
Bowei Tian,
Meng Liu,
Zheyu Shen,
Shwai He,
Yexiao He,
Wanghao Ye,
Yiting Wang,
Ang Li
Abstract:
As post-training techniques evolve, large language models (LLMs) are increasingly augmented with structured multi-step reasoning abilities, often optimized through reinforcement learning. These reasoning-enhanced models outperform standard LLMs on complex tasks and now underpin many commercial LLM APIs. However, to protect proprietary behavior and reduce verbosity, providers typically conceal the reasoning traces while returning only the final answer. This opacity introduces a critical transparency gap: users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no means to verify their authenticity. This opens the door to token count inflation, where providers may overreport token usage or inject synthetic, low-effort tokens to inflate charges. To address this issue, we propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens. CoIn constructs a verifiable hash tree from token embedding fingerprints to check token counts, and uses embedding-based relevance matching to detect fabricated reasoning content. Experiments demonstrate that CoIn, when deployed as a trusted third-party auditor, can effectively detect token count inflation with a success rate reaching up to 94.7%, showing the strong ability to restore billing transparency in opaque LLM services. The dataset and code are available at https://github.com/CASE-Lab-UMD/LLM-Auditing-CoIn.
Submitted 19 May, 2025;
originally announced May 2025.
-
Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking
Authors:
Zihan Su,
Xuerui Qiu,
Hongbin Xu,
Tangyu Jiang,
Junhao Zhuang,
Chun Yuan,
Ming Li,
Shengfeng He,
Fei Richard Yu
Abstract:
The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. We will release our code upon publication.
Submitted 18 May, 2025;
originally announced May 2025.
-
Convergence analysis of the Halpern iteration with adaptive anchoring parameters
Authors:
Songnian He,
Hong-Kun Xu,
Qiao-Li Dong,
Na Mei
Abstract:
We propose an adaptive way to choose the anchoring parameters for the Halpern iteration to find a fixed point of a nonexpansive mapping in a real Hilbert space. We prove strong convergence of this adaptive Halpern iteration and obtain the rate of asymptotic regularity at least O(1/k), where k is the number of iterations. Numerical experiments are also provided to show advantages and outperformance of our adaptive Halpern algorithm over the standard Halpern algorithm.
Submitted 15 May, 2025;
originally announced May 2025.
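For orientation, the classical (non-adaptive) Halpern iteration that the abstract above uses as its baseline can be sketched as follows. This is a minimal illustration and not the authors' adaptive scheme: the anchoring schedule `beta = 1 / (k + 2)`, the choice of the starting point as anchor, and the names `halpern` and `rotate90` are assumptions made for this example.

```python
import math


def halpern(T, x0, num_iters=2000):
    """Classical Halpern iteration x_{k+1} = beta_k * x0 + (1 - beta_k) * T(x_k).

    Assumption: the anchor is the starting point x0 and beta_k = 1 / (k + 2),
    a common textbook schedule; the adaptive rule of the paper is not shown.
    """
    x = list(x0)
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        Tx = T(x)
        # Pull the image T(x_k) back toward the anchor x0.
        x = [beta * a + (1.0 - beta) * b for a, b in zip(x0, Tx)]
    return x


def rotate90(v):
    """Rotation by 90 degrees: nonexpansive, with the origin as its only fixed point."""
    return [-v[1], v[0]]


x = halpern(rotate90, [1.0, 0.0])
Tx = rotate90(x)
# Fixed-point residual ||x - T(x)||, the quantity behind "asymptotic regularity".
residual = math.hypot(x[0] - Tx[0], x[1] - Tx[1])
```

A rotation is a useful test case because the plain iteration `x = T(x)` merely circles the fixed point, whereas the anchored iteration drives the residual toward zero.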
-
Contractive difference-of-convex algorithms
Authors:
Songnian He,
Qiao-Li Dong,
Michael Th. Rassias
Abstract:
The difference-of-convex algorithm (DCA) and its variants are the most popular methods to solve the difference-of-convex optimization problem. Each iteration reduces to a convex optimization problem, which generally needs to be solved by an iterative method such as the proximal gradient algorithm. However, these algorithms essentially belong to the class of iterative methods for fixed point problems of averaged mappings, and their convergence speed is generally slow. Furthermore, there is little research on the termination rule of these iterative algorithms for solving the subproblem of DCA. To overcome these defects, we firstly show that the subproblem of the linearized proximal method (LPM) in each iteration is equivalent to the fixed point problem of a contraction. Secondly, by using Picard iteration to approximately solve the subproblem of LPM in each iteration, we propose a contractive difference-of-convex algorithm (cDCA) where an adaptive termination rule is presented. Both global subsequential convergence and global convergence of the whole sequence of cDCA are established. Finally, preliminary results from numerical experiments are promising.
Submitted 15 May, 2025;
originally announced May 2025.
-
Leading singularities and chambers of Correlahedron
Authors:
Song He,
Yu-tin Huang,
Chia-Kai Kuo
Abstract:
In this paper, we explore the chamber dissection of the loop-geometry of the Correlahedron, which encodes the loop integrand of four-point stress-energy correlators in planar $\mathcal{N}=4$ super Yang-Mills. We demonstrate that at four loops, continuing the pattern of lower loops, the integrand of the four-point correlation function can be written as a sum over products of chamber-forms and local loop integrands. The chambers and their associated forms are identical to those at three loops, indicating that the dissection may be complete to all loop orders. Furthermore, this suggests that the leading singularities to all loops are simply linear combinations of these chamber forms. This is especially intriguing at four loops since it contains elliptic functions. Interestingly, each elliptic function appears in a subset of chambers. Our geometric approach allows us to ``diagonalize" the representation, where the local integrals only possess a single leading singularity or elliptic cut. In such a representation, all integrals must evaluate to pure functions, including a single pure elliptic integral. Inspired by this picture, we also present a simplified form of the three-loop correlator in terms of two independent pure functions (weight-$6$ single-valued multiple polylogarithms), which are directly computed from local integrals with unit leading singularities, multiplied by the leading singularities from chamber forms.
Submitted 14 May, 2025;
originally announced May 2025.
-
Listen to Extract: Onset-Prompted Target Speaker Extraction
Authors:
Pengjie Shen,
Kangrui Chen,
Shulin He,
Pengru Chen,
Shuqi Yuan,
He Kong,
Xueliang Zhang,
Zhong-Qiu Wang
Abstract:
We propose $\textit{listen to extract}$ (LExt), a highly-effective while extremely-simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims at extracting the target speaker from the speaker's mixed speech with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at the waveform level, and trains deep neural networks (DNN) to extract the target speech based on the concatenated mixture signal. The rationale is that, this way, an artificial speech onset is created for the target speaker and it could prompt the DNN (a) which speaker is the target to extract; and (b) spectral-temporal patterns of the target speaker that could help extraction. This simple approach produces strong TSE performance on multiple public TSE datasets including WSJ0-2mix, WHAM! and WHAMR!.
Submitted 8 May, 2025;
originally announced May 2025.
-
Randomized Routing to Remote Queues
Authors:
Shuangchi He,
Yunfang Yang,
Yao Yu
Abstract:
We study load balancing for a queueing system where parallel stations are distant from customers. In the presence of traveling delays, the join-the-shortest-queue (JSQ) policy induces queue length oscillations and prolongs the mean waiting time. A variant of the JSQ policy, dubbed the randomized join-the-shortest-queue (RJSQ) policy, is devised to mitigate the oscillation phenomenon. By the RJSQ policy, customers are sent to each station with a probability approximately proportional to its service capacity; only a small fraction of customers are purposely routed to the shortest queue. The additional probability of routing a customer to the shortest queue, referred to as the balancing fraction, dictates the policy's performance. When the balancing fraction is within a certain range, load imbalance between the stations is negligible in heavy traffic, so that complete resource pooling is achieved. We specify the optimal order of magnitude for the balancing fraction, by which heuristic formulas are proposed to fine-tune the RJSQ policy. A joint problem of capacity planning and load balancing is considered for geographically separated stations. With well planned service capacities, the RJSQ policy sends all but a small fraction of customers to the nearest stations, rendering the system asymptotically equivalent to an aggregated single-server system with all customers having minimum traveling delays. If each customer's service requirement does not depend on the station, the RJSQ policy is asymptotically optimal for reducing workload.
Submitted 8 May, 2025;
originally announced May 2025.
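The routing rule described in the abstract above can be illustrated with a short sketch. It is one plausible reading only, under stated assumptions: with probability `balancing_fraction` a customer goes to the currently shortest queue, and otherwise to a station drawn with probability proportional to its service capacity. The function name `rjsq_route`, its parameters, and the omission of traveling delays are all hypothetical simplifications, not the authors' formulation.

```python
import random


def rjsq_route(queue_lengths, capacities, balancing_fraction, rng=random):
    """Pick a station index for one arriving customer (illustrative sketch).

    Assumption: queue_lengths are whatever (possibly stale) observations the
    router has; traveling delays are not modeled here.
    """
    stations = range(len(capacities))
    if rng.random() < balancing_fraction:
        # Small extra probability mass: join the shortest observed queue.
        return min(stations, key=lambda i: queue_lengths[i])
    # Default: random routing proportional to service capacity.
    return rng.choices(stations, weights=capacities, k=1)[0]


rng = random.Random(0)
always_shortest = [rjsq_route([5, 2, 9], [1.0, 1.0, 2.0], 1.0, rng) for _ in range(20)]
never_balanced = [rjsq_route([5, 2, 9], [1.0, 0.0, 0.0], 0.0, rng) for _ in range(20)]
```

Setting the balancing fraction to 1 recovers plain JSQ, while setting it to 0 gives purely capacity-proportional random routing; the abstract concerns the regime in between.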
-
Order within disorder: spectral key generation and distribution in random lasers
Authors:
Zhijia Hu,
Shilong He,
Lianghao Qi,
Yalan Li,
Siqi Li,
Bin Chen,
Wenyu Du,
Yan Kuai,
Zhigang Cao,
Min Wang,
Kaiming Zhou,
Lin Zhang,
Qingchuan Guo,
Weimin Ding,
Chao Li,
Kang Xie,
Anderson S. L. Gomes,
Benli Yu
Abstract:
In secure communication, highly random entropy sources are essential for information security. Random lasers (RLs), which arise from multiple scattering in disordered structures, are potentially ideal entropy sources. Traditionally, RLs are viewed as disordered and unpredictable. However, in this work, we present novel evidence that orderly patterns exist beneath the seemingly disordered outputs of RLs. Utilizing deep learning techniques, a variety of advanced neural network models are used to analyze the spectral data in multiple dimensions. The results show that the time series of RLs spectra are unpredictable, but spectral wavelength component intensities can be recovered due to inter-modal correlations. This finding not only breaks through the traditional perception that RLs are unpredictable, but also reveals for the first time that RLs have the dual characteristics of both randomness and determinism. Based on this new characteristic, we further expand the application field of RLs and innovatively design a new type of key generation and distribution scheme. In this scheme, the disordered property of RLs is used for key generation to ensure high randomness, while their ordered property is used for key distribution to guarantee accuracy and reliability. The scheme provides a new strategy for secure communication.
Submitted 7 May, 2025;
originally announced May 2025.
-
Satellite-Assisted Low-Altitude Economy Networking: Concepts, Applications, and Opportunities
Authors:
Shizhao He,
Jiacheng Wang,
Ying-Chang Liang,
Geng Sun,
Dusit Niyato
Abstract:
The low-altitude economy (LAE) is a new economic paradigm that leverages low-altitude vehicles (LAVs) to perform diverse missions across diverse areas. To support the operations of LAE, it is essential to establish LAE networks that enable LAV management and communications. Existing studies mainly reuse terrestrial networks to construct LAE networks. However, the limited coverage of terrestrial networks poses challenges for serving LAVs in remote areas. Besides, efficient LAV operations also require support such as localization and navigation, which terrestrial networks designed for communications cannot fully provide. Due to ubiquitous coverage and diverse functions, satellites are a promising technology to support LAVs. Therefore, this article investigates satellite-assisted LAE networking. First, we introduce an overview of LAE and satellites, discussing their features, applications, and architectures. Next, we investigate opportunities for satellites to assist LAE from aspects of communication, control, and computation. As all assistance depends on reliable satellite-LAV communications, we propose a satellite-assisted LAE framework to tackle issues caused by the severe path loss and high dynamics in satellite-assisted LAE networks. The case study demonstrates that the distributed MIMO architecture efficiently reduces the required transmission power and extends service duration, while the two-timescale optimization scheme balances the performance and control signaling overheads. Specifically, the proposed framework comprises distributed satellite MIMO, distributed LAV MIMO, and a two-timescale optimization scheme.
Submitted 6 May, 2025;
originally announced May 2025.
-
Estimating the Diameter at Breast Height of Trees in a Forest With a Single 360 Camera
Authors:
Siming He,
Zachary Osman,
Fernando Cladera,
Dexter Ong,
Nitant Rai,
Patrick Corey Green,
Vijay Kumar,
Pratik Chaudhari
Abstract:
Forest inventories rely on accurate measurements of the diameter at breast height (DBH) for ecological monitoring, resource management, and carbon accounting. While LiDAR-based techniques can achieve centimeter-level precision, they are cost-prohibitive and operationally complex. We present a low-cost alternative that only needs a consumer-grade 360 video camera. Our semi-automated pipeline comprises (i) a dense point cloud reconstruction using Structure from Motion (SfM) photogrammetry software called Agisoft Metashape, (ii) semantic trunk segmentation by projecting Grounded Segment Anything (SAM) masks onto the 3D cloud, and (iii) a robust RANSAC-based technique to estimate cross section shape and DBH. We introduce an interactive visualization tool for inspecting segmented trees and their estimated DBH. On 61 acquisitions of 43 trees under a variety of conditions, our method attains median absolute relative errors of 5-9% with respect to "ground-truth" manual measurements. This is only 2-4% higher than LiDAR-based estimates, while employing a single 360 camera that costs orders of magnitude less, requires minimal setup, and is widely available.
Submitted 15 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Whleaper: A 10-DOF Flexible Bipedal Wheeled Robot
Authors:
Yinglei Zhu,
Sixiao He,
Zhenghao Qi,
Zhuoyuan Yong,
Yihua Qin,
Jianyu Chen
Abstract:
Wheel-legged robots combine the advantages of both wheeled robots and legged robots, offering versatile locomotion capabilities with excellent stability on challenging terrains and high efficiency on flat surfaces. However, existing wheel-legged robots typically have limited hip joint mobility compared to humans, while hip joint plays a crucial role in locomotion. In this paper, we introduce Whleaper, a novel 10-degree-of-freedom (DOF) bipedal wheeled robot, with 3 DOFs at the hip of each leg. Its humanoid joint design enables adaptable motion in complex scenarios, ensuring stability and flexibility. This paper introduces the details of Whleaper, with a focus on innovative mechanical design, control algorithms and system implementation. Firstly, stability stems from the increased DOFs at the hip, which expand the range of possible postures and improve the robot's foot-ground contact. Secondly, the extra DOFs also augment its mobility. During walking or sliding, more complex movements can be adopted to execute obstacle avoidance tasks. Thirdly, we utilize two control algorithms to implement multimodal motion for walking and sliding. By controlling specific DOFs of the robot, we conducted a series of simulations and practical experiments, demonstrating that a high-DOF hip joint design can effectively enhance the stability and flexibility of wheel-legged robots. Whleaper shows its capability to perform actions such as squatting, obstacle avoidance sliding, and rapid turning in real-world scenarios.
Submitted 30 April, 2025;
originally announced April 2025.
-
Superstring amplitudes meet surfaceology
Authors:
Qu Cao,
Jin Dong,
Song He,
Fan Zhu
Abstract:
We reformulate tree-level amplitudes in open superstring theory (type-I) in terms of stringy Tr$(φ^3)$ amplitudes with various kinematical shifts in the "curve-integral" formulation: while the bosonic-string amplitude with $n$ pairs of "scaffolding" scalars comes from a particularly simple shift of the Tr$(φ^3)$ one (corresponding to $n$ length-$2$ cycles), the analogous superstring amplitude requires "correction" terms given by bosonic-string amplitudes with longer, even-length "cycles", which are also Tr$(φ^3)$ ones at shifted kinematics dictated by the cycles; in total it is expressed as a sum of $(2n{-}3)!!$ shifted amplitudes originating from the expansion of a reduced Pfaffian. Upon taking $n$ scaffolding residues, this leads to a new formula of the $n$-gluon superstring amplitude, which is manifestly symmetric in $n{-}1$ legs, as a gauge-invariant combination of mixed bosonic string amplitudes with gluons and scalars, which come from length-$2$ cycles and longer ones respectively (the total sum is associated with the expansion of an $n\times n$ symmetrical determinant); the corresponding prefactors are nested commutators of $2n$-gon kinematical variables, which nicely become traces of field-strengths for those legs corresponding to scalars in the mixed amplitudes. These interesting linear combinations of bosonic string amplitudes must guarantee the cancellation of tachyon poles and $F^3$ vertices ${\it etc.}$, and they give new relations between the superstring amplitude and its bosonic-string building blocks to all orders in the $α'$ expansion (the first order gives a new formula for gluon amplitudes with a single $F^3$ insertion in terms of Yang-Mills-scalar amplitudes). We provide both the worldsheet and "curve-integral" derivations, and discuss applications to heterotic and type II cases.
Submitted 30 April, 2025;
originally announced April 2025.
-
UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
Authors:
Linshan Wu,
Yuxiang Nie,
Sunan He,
Jiaxin Zhuang,
Luyang Luo,
Neeraj Mahboobani,
Varut Vardhanabhuti,
Ronald Cheong Kin Chan,
Yifan Peng,
Pranav Rajpurkar,
Hao Chen
Abstract:
The integration of AI-assisted biomedical image analysis into clinical practice demands AI-generated findings that are not only accurate but also interpretable to clinicians. However, existing biomedical AI models generally lack the ability to simultaneously generate diagnostic findings and localize corresponding biomedical objects. This limitation makes it challenging for clinicians to correlate AI-generated findings with visual evidence (e.g., tiny lesions) in images and interpret the results of AI models. To address this challenge, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation, which is capable of generating accurate diagnostic findings and simultaneously segmenting the corresponding biomedical targets. UniBiomed is based on a novel integration of Multi-modal Large Language Model and Segment Anything Model, which can effectively unify diverse biomedical tasks in universal training for advancing grounded interpretation. To develop UniBiomed, we curate a large-scale dataset comprising over 27 million triplets of images, region annotations, and text descriptions across ten biomedical imaging modalities. Extensive validation on 70 internal and 14 external datasets demonstrated the state-of-the-art performance of UniBiomed in diverse biomedical tasks, including image segmentation, disease recognition, region-aware diagnosis, vision question answering, and report generation. In summary, UniBiomed is a powerful and versatile biomedical foundation model, unlocking the untapped grounded interpretation capability for optimizing AI-assisted biomedical image analysis.
Submitted 29 May, 2025; v1 submitted 30 April, 2025;
originally announced April 2025.
-
Rotation excursion algorithm with learning
Authors:
Sheng-Xue He
Abstract:
We introduce a novel heuristic algorithm named the Rotation Excursion Algorithm with Learning (REAL) designed for general-purpose optimization. REAL draws inspiration from the construction mechanism inherent in CEC optimization suites, integrating three fundamental operations with a natural growth rule to address optimization tasks. The first operation rotates the current feasible solutions within the search space to generate and evaluate new solutions. The excursion operation relocates current feasible solutions closer to historically superior solutions stored in a list known as the "list of visible spots." The third operation perturbs solutions generated by the preceding operations within their respective neighborhoods. The rotation operation is geared toward comprehensive and random exploration of the entire search space, while the excursion operation exploits known information to refine current solutions. The perturbation operation functions as a form of neighborhood search to further enhance solution quality. The natural growth rule dynamically adjusts REAL's balance between exploration and exploitation throughout the entire search process. To validate the efficacy of the proposed algorithm, we apply it to a diverse set of 67 problems, encompassing 29 benchmark optimization problems, 30 test problems from CEC 2014, one from CEC 2022, and seven engineering problems. Numerical experiments demonstrate the superior performance of REAL when compared to various other heuristics.
Submitted 27 April, 2025;
originally announced April 2025.
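The three operations named in the REAL abstract above (rotation, excursion, perturbation) can be illustrated with a minimal sketch. This is not the author's implementation: the abstract does not give the update formulas, so the plane rotation about the box centre, the random step toward a visible spot, the Gaussian perturbation, the greedy acceptance, and the linear `explore` schedule standing in for the "natural growth rule" are all assumptions, and every function name here is hypothetical.

```python
import math
import random


def sphere(x):
    """Toy objective (assumption): sum of squares, minimum at the origin."""
    return sum(v * v for v in x)


def clip(x, lo, hi):
    return [min(max(v, lo), hi) for v in x]


def rotate(x, lo, hi, rng):
    """Rotation (assumed form): turn x about the box centre in a random coordinate plane."""
    c = (lo + hi) / 2.0
    i, j = rng.sample(range(len(x)), 2)  # requires at least two dimensions
    theta = rng.uniform(0.0, 2.0 * math.pi)
    a, b = x[i] - c, x[j] - c
    y = list(x)
    y[i] = c + a * math.cos(theta) - b * math.sin(theta)
    y[j] = c + a * math.sin(theta) + b * math.cos(theta)
    return clip(y, lo, hi)


def excurse(x, spot, rng):
    """Excursion (assumed form): move x a random fraction of the way toward a visible spot."""
    step = rng.random()
    return [v + step * (s - v) for v, s in zip(x, spot)]


def perturb(x, lo, hi, scale, rng):
    """Perturbation (assumed form): Gaussian noise in the neighbourhood of x."""
    return clip([v + rng.gauss(0.0, scale) for v in x], lo, hi)


def real_sketch(f, dim, lo, hi, pop=20, iters=200, spots=5, seed=0):
    rng = random.Random(seed)
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop)]
    visible = sorted(xs, key=f)[:spots]  # "list of visible spots"
    for t in range(iters):
        # Stand-in for the growth rule (assumption): exploration share decays linearly.
        explore = 1.0 - t / iters
        for k, x in enumerate(xs):
            if rng.random() < explore:
                cand = rotate(x, lo, hi, rng)
            else:
                cand = excurse(x, rng.choice(visible), rng)
            cand = perturb(cand, lo, hi, 0.1 * (hi - lo) * explore + 1e-3, rng)
            if f(cand) < f(x):  # greedy acceptance (assumption)
                xs[k] = cand
        visible = sorted(visible + xs, key=f)[:spots]
    return visible[0]
```

Calling `real_sketch(sphere, 3, -5.0, 5.0)` returns the best point found. Because the visible-spot list always keeps its previous best entry, the returned objective value can never be worse than that of the best initial sample.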
-
Snake locomotion learning search
Authors:
Sheng-Xue He
Abstract:
This research introduces a novel heuristic algorithm known as the Snake Locomotion Learning Search algorithm (SLLS) designed to address optimization problems. The SLLS draws inspiration from the locomotion patterns observed in snakes, particularly serpentine and caterpillar locomotion. We leverage these two modes of snake locomotion to devise two distinct search mechanisms within the SLLS. To mimic a snake's natural adaptation to its surroundings, we incorporate a learning efficiency component generated from the sigmoid function. This helps strike a balance between exploration and exploitation capabilities throughout the SLLS computation process. The effectiveness of this algorithm is demonstrated through its application to 60 standard benchmark optimization problems and seven well-known engineering optimization problems. The performance analysis reveals that in most cases the SLLS outperforms other algorithms, and even in the remaining scenarios it exhibits robust performance. This is consistent with the No Free Lunch Theorem, affirming that the SLLS stands as a valuable heuristic algorithm with significant potential for effectively addressing specific optimization challenges.
Submitted 27 April, 2025;
originally announced April 2025.
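The SLLS abstract above mentions a learning efficiency component generated from the sigmoid function. As a rough illustration only, one way such a component could be scheduled over a run is sketched below. The abstract does not specify the argument of the sigmoid, so the centring at the midpoint of the run, the `steepness` parameter, and the function name are assumptions rather than the author's formula.

```python
import math


def learning_efficiency(t, total, steepness=10.0):
    """Sigmoid schedule (assumed form): near 0 early in the run, near 1 late.

    t         -- current iteration, 0 <= t <= total
    total     -- total number of iterations (must be positive)
    steepness -- hypothetical parameter controlling how sharp the transition is
    """
    z = steepness * (t / total - 0.5)
    return 1.0 / (1.0 + math.exp(-z))
```

Under these assumptions, a low value early in the run would favour exploration and a high value late in the run would favour exploitation. The mapping from this value to an actual search step is left open here because the abstract does not describe it.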