-
Echo: Efficient Co-Scheduling of Hybrid Online-Offline Tasks for Large Language Model Serving
Authors:
Zhibin Wang,
Shipeng Li,
Xue Li,
Yuhang Zhou,
Zhonghui Zhang,
Zibo Wang,
Rong Gu,
Chen Tian,
Kun Yang,
Sheng Zhong
Abstract:
Large language models have been widely deployed in various applications, encompassing both interactive online tasks and batched offline tasks. Given the burstiness and latency sensitivity of online tasks, over-provisioning resources is common practice. This allows for the integration of latency-insensitive offline tasks during periods of low online load, enhancing resource utilization. However, strategically serving online and offline tasks through a preemption mechanism fails to fully leverage the flexibility of offline tasks and suffers from KV cache recomputation and irregular workloads.
In this paper, we introduce Echo, a collaborative online-offline task serving system, including a scheduler, a KV cache manager, and estimation toolkits. The scheduler and KV cache manager work closely together to maximize the throughput of offline tasks, while the estimator predicts execution time to ensure online task SLOs. The scheduler leverages the batch information of the last iteration to reduce the search space for finding the optimal schedule. The KV cache manager sets the priority of the KV cache based on the type of tasks and the opportunity for prefix sharing to reduce recomputation. Finally, the estimation toolkits predict the execution time, future memory consumption, and the throughput of offline tasks to guide the scheduler, KV cache manager, and the system deployer. Evaluation based on real-world workloads demonstrates that Echo can increase offline task throughput by up to $3.3\times$, while satisfying online task SLOs.
Submitted 1 March, 2025;
originally announced April 2025.
-
Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions
Authors:
Ting-Hsuan Liao,
Yi Zhou,
Yu Shen,
Chun-Hao Paul Huang,
Saayan Mitra,
Jia-Bin Huang,
Uttaran Bhattacharya
Abstract:
We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
Submitted 4 April, 2025;
originally announced April 2025.
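The quantization step mentioned in the abstract above can be illustrated with a minimal sketch of finite scalar quantization (FSQ) in its commonly described form: each latent value is bounded with tanh and rounded to one of a small number of integer levels. The function name `fsq_quantize`, the `levels` parameter, and the restriction to odd level counts are illustrative assumptions; the paper's FSQ-VAE, its shape-conditioned de-quantizer, and its training procedure are not reproduced here.

```python
import math

def fsq_quantize(latent, levels=5):
    """Quantize each latent value to one of `levels` integer codes.

    Assumption for this sketch: `levels` is odd, so the codes are
    symmetric around zero, i.e. in {-(levels-1)/2, ..., (levels-1)/2}.
    """
    if levels < 3 or levels % 2 == 0:
        raise ValueError("this sketch only handles odd levels >= 3")
    half = (levels - 1) // 2
    # tanh bounds each value to (-1, 1); scaling and rounding picks a level.
    return [int(round(half * math.tanh(z))) for z in latent]

codes = fsq_quantize([0.0, 10.0, -10.0, 0.3], levels=5)
print(codes)
```

Because the codebook is an implicit grid of integers, there are no learned codebook vectors to look up; de-quantization back to continuous motion is left to the decoder.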
-
Constraints on dark matter boosted by supernova shock within the effective field theory framework from the CDEX-10 experiment
Authors:
J. Z. Wang,
L. T. Yang,
Q. Yue,
K. J. Kang,
Y. J. Li,
H. P. An,
Greeshma C.,
J. P. Chang,
H. Chen,
Y. H. Chen,
J. P. Cheng,
W. H. Dai,
Z. Deng,
C. H. Fang,
X. P. Geng,
H. Gong,
Q. J. Guo,
T. Guo,
X. Y. Guo,
L. He,
J. R. He,
H. X. Huang,
T. C. Huang,
S. Karmakar,
H. B. Li
, et al. (62 additional authors not shown)
Abstract:
Supernova shocks can boost dark matter (DM) particles to high, yet nonrelativistic, velocities, providing a suitable mechanism for analysis within the framework of the nonrelativistic effective field theory (NREFT). These accelerated DM sources extend the experimental ability to scan the parameter space of light DM into the sub-GeV region. In this study, we specifically analyze DM accelerated by the Monogem Ring supernova remnant, whose age ($\sim 68000$ yr) and distance to Earth ($\sim 300$ parsecs) are strategically matched to enable detection with current terrestrial detectors. Utilizing the 205.4 kg$\cdot$day data obtained from the CDEX-10 experiment at the China Jinping Underground Laboratory (CJPL), we derive new constraints on boosted DM within the NREFT framework. The NREFT coupling constant exclusion regions now penetrate the sub-GeV mass range, with optimal sensitivity achieved for operators $\mathcal{O}_{3}$, $\mathcal{O}_{6}$, $\mathcal{O}_{15}$ in the 0.4--0.6 GeV mass range.
Submitted 4 April, 2025;
originally announced April 2025.
-
Dexterous Manipulation through Imitation Learning: A Survey
Authors:
Shan An,
Ziyu Meng,
Chao Tang,
Yuning Zhou,
Tengyu Liu,
Fangqiang Ding,
Shufang Zhang,
Yao Mu,
Ran Song,
Wei Zhang,
Zeng-Guang Hou,
Hong Zhang
Abstract:
Dexterous manipulation, which refers to the ability of a robotic hand or multi-fingered end-effector to skillfully control, reorient, and manipulate objects through precise, coordinated finger movements and adaptive force modulation, enables complex interactions similar to human hand dexterity. With recent advances in robotics and machine learning, there is a growing demand for these systems to operate in complex and unstructured environments. Traditional model-based approaches struggle to generalize across tasks and object variations due to the high dimensionality and complex contact dynamics of dexterous manipulation. Although model-free methods such as reinforcement learning (RL) show promise, they require extensive training, large-scale interaction data, and carefully designed rewards for stability and effectiveness. Imitation learning (IL) offers an alternative by allowing robots to acquire dexterous manipulation skills directly from expert demonstrations, capturing fine-grained coordination and contact dynamics while bypassing the need for explicit modeling and large-scale trial-and-error. This survey provides an overview of dexterous manipulation methods based on imitation learning, details recent advances, and addresses key challenges in the field. Additionally, it explores potential research directions to enhance IL-driven dexterous manipulation. Our goal is to offer researchers and practitioners a comprehensive introduction to this rapidly evolving domain.
Submitted 24 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis
Authors:
Xi Wang,
Ziqi He,
Yang Zhou
Abstract:
Traditional diffusion models typically employ a U-Net architecture. Previous studies have unveiled the roles of attention blocks in the U-Net. However, they overlook the dynamic evolution of their importance during the inference process, which hinders their further exploitation to improve image applications. In this study, we first theoretically prove that re-weighting the outputs of the Transformer blocks within the U-Net is a "free lunch" for improving the signal-to-noise ratio during the sampling process. Next, we propose Importance Probe to uncover and quantify the dynamic shifts in importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that our approach significantly improves the efficiency of the inference process and enhances the aesthetic quality of the samples with identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: https://github.com/Hytidel/UNetReweighting
Submitted 5 May, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Relativistic dynamics of charmonia in strong magnetic fields
Authors:
Liuyuan Wen,
Meijian Li,
Yiyu Zhou,
Yang Li,
James P. Vary
Abstract:
We investigate the properties of charmonium systems in strong external magnetic fields using a relativistic light-front Hamiltonian approach within the Basis Light-Front Quantization (BLFQ) framework. By solving the eigenvalue problem for the invariant mass squared operator with confinement potentials and one-gluon-exchange interactions, we obtain the mass spectrum and wave functions under varying magnetic fields. Our results reveal significant spectral modifications via the Zeeman effect, including $η_c$-$J/ψ$ mixing and magnetic sublevel splitting. Momentum density analysis demonstrates wave function deformation, with transverse momentum broadening and longitudinal narrowing under strong fields, alongside structural shifts in parton distributions such as double-hump profiles in excited states. Relativistic corrections and center-of-mass coupling critically drive these dynamics, highlighting the necessity of a relativistic framework for QCD bound states in extreme magnetic environments.
Submitted 4 April, 2025;
originally announced April 2025.
-
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
Authors:
Kexin Tian,
Jingrui Mao,
Yunlong Zhang,
Jiwan Jiang,
Yang Zhou,
Zhengzhong Tu
Abstract:
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning, key capabilities for autonomous driving, still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM outperforms in qualitative QA but does not demonstrate competitiveness in quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.
Submitted 6 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
A unified algorithm for multi-particle correlations between azimuthal angle and transverse momentum in ultra-relativistic nuclear collisions
Authors:
Emil Gorm Dahlbæk Nielsen,
Nina Nathanson,
Kristjan Gulbrandsen,
You Zhou
Abstract:
Multi-particle correlations between azimuthal angle and mean transverse momentum are a powerful tool for probing size and shape correlations in the initial conditions of heavy-ion collisions. These correlations have also been employed to investigate nuclear structure, including potential nuclear shape phase transitions at the energy frontier. However, their implementation is highly nontrivial, and prior studies have been mostly limited to lower-order correlations, such as the modified Pearson correlation coefficient, $ρ(v_{\rm n}^{2}, [p_{\rm T}])$. This paper presents a unified framework that employs a recursive algorithm, enabling the efficient evaluation of arbitrary-order correlations while maintaining computational efficiency. This framework is demonstrated using widely adopted transport models, including AMPT and HIJING. The proposed unified algorithm for multi-particle correlations between azimuthal angle and transverse momentum provides a systematic and efficient approach for multi-particle correlation analyses. Its application in experiments at the Relativistic Heavy Ion Collider and the Large Hadron Collider facilitates the exploration of nuclear structure at ultra-relativistic energies.
Submitted 3 April, 2025;
originally announced April 2025.
-
SkyReels-A2: Compose Anything in Video Diffusion Transformers
Authors:
Zhengcong Fei,
Debang Li,
Di Qiu,
Jiahua Wang,
Yikun Dou,
Rui Wang,
Jingtao Xu,
Mingyuan Fan,
Guibin Chen,
Yang Li,
Yahui Zhou
Abstract:
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element. We term this task elements-to-video (E2V), whose primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., A2 Bench. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. SkyReels-A2 is the first open-source commercial-grade model for E2V generation, performing favorably against advanced closed-source commercial models. We anticipate SkyReels-A2 will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.
Submitted 3 April, 2025;
originally announced April 2025.
-
ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation
Authors:
Yuan Zhou,
Shilong Jin,
Litao Hua,
Wanjun Lv,
Haoran Duan,
Jungong Han
Abstract:
Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel framework that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise camera parameters; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer effectively mitigates the multi-face Janus problem in text-to-3D generation, outperforming existing methods in both visual quality and consistency.
Submitted 3 April, 2025;
originally announced April 2025.
-
Evidence of doubly OZI-suppressed decay $η_{c} \to ωφ$ in the radiative decay $J/ψ\to γη_{c}$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (680 additional authors not shown)
Abstract:
Using a sample of $(10087\pm44) \times 10^{6}$ $J/ψ$ events collected with the BESIII detector at the BEPCII collider, the first evidence for the doubly OZI-suppressed decay $η_{c} \to ωφ$ is reported with a significance of 4.0$σ$. The branching fraction of $η_{c} \to ωφ$ is measured to be $\mathcal{B}(η_{c} \to ωφ) = (3.86 \pm 0.92 \pm 0.62) \times 10^{-5}$, where the first uncertainty is statistical and the second is systematic. This result provides valuable insights into the underlying mechanisms of charmonium decays, particularly for processes such as $η_{c} \to VV$ (where $V$ represents a vector meson).
Submitted 2 April, 2025;
originally announced April 2025.
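As a reading aid for the branching fraction quoted in the abstract above, the sketch below combines the statistical and systematic uncertainties in quadrature. This is a common convention that assumes the two are independent; the abstract does not state a combined uncertainty, and the reported 4.0$σ$ significance is not derived this way. The helper name `combine_in_quadrature` is hypothetical.

```python
import math

# Central value and uncertainties quoted in the abstract, in units of 1e-5.
central, stat, syst = 3.86, 0.92, 0.62

def combine_in_quadrature(*uncertainties):
    """Return the quadrature sum of uncertainties assumed to be independent."""
    return math.sqrt(sum(u * u for u in uncertainties))

total = combine_in_quadrature(stat, syst)
relative = total / central
print(f"total uncertainty: {total:.2f}e-5, relative: {relative:.0%}")
```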
-
Fully-gapped superconductivity with rotational symmetry breaking in pressurized kagome metal CsV$_3$Sb$_5$
Authors:
X. Y. Feng,
Z. Zhao,
J. Luo,
Y. Z. Zhou,
J. Yang,
A. F. Fang,
H. T. Yang,
H. -J. Gao,
R. Zhou,
Guo-qing Zheng
Abstract:
The discovery of the kagome metal CsV$_3$Sb$_5$ has generated significant interest in its complex physical properties, particularly its superconducting behavior under different pressures, though its nature remains debated. Here, we performed low-temperature, high-pressure $^{121/123}$Sb nuclear quadrupole resonance (NQR) measurements to explore the superconducting pairing symmetry in CsV$_3$Sb$_5$. At ambient pressure, we found that the spin-lattice relaxation rate 1/$T_1$ exhibits a kink at $T \sim$ 0.4 $T_\textrm{c}$ within the superconducting state and follows a $T^3$ variation as temperature further decreases. This suggests the presence of two superconducting gaps with line nodes in the smaller one. As pressure increases beyond $P_{\rm c} \sim 1.85$ GPa, where the charge-density wave phase is completely suppressed, 1/$T_1$ shows no Hebel-Slichter peak just below $T_\textrm{c}$, and decreases rapidly, even faster than $T^5$, indicating that the gap is fully opened for pressures above $P_{\rm c}$. In this high pressure region, the angular dependence of the in-plane upper critical magnetic field $H_{\rm c2}$ breaks the $C_6$ rotational symmetry. We propose the $s+id$ pairing at $P > P_{\rm c}$ which explains both the 1/$T_1$ and $H_{\rm c2}$ behaviors. Our findings indicate that CsV$_3$Sb$_5$ is an unconventional superconductor and its superconducting state is even more exotic at high pressures.
Submitted 2 April, 2025;
originally announced April 2025.
-
The Mini-SiTian Array: first-two-year operation
Authors:
Min He,
Hong Wu,
Liang Ge,
Jian-feng Tian,
Zheng Wang,
Hai-yang Mu,
Yu Zhang,
Yang Huang,
Jie Zheng,
Zhou Fan,
Zheng-yang Li,
Hong-hui Gu,
Heng-geng Han,
Kai Xiao,
Zhi-rui Li,
Jun-jie Jin,
Bei-chuan Wang,
Jun Ma,
Jin-hang Zou,
Ying Wu,
Jiu-peng Guo,
Li-guo Fang,
Zhi-gang Hou,
Bo-wen Zhang,
Yun-fei Xu
, et al. (48 additional authors not shown)
Abstract:
The SiTian project, designed to utilize 60 telescopes distributed across multiple sites in China, is a next-generation time-domain survey initiative. As a pathfinder for the SiTian project, the Mini-SiTian (MST) has been proposed and implemented to test the SiTian's brain and data pipeline, and to evaluate the feasibility of its technology and science cases. Mounted at the Xinglong Observatory, the MST project comprises three 30 cm telescopes and has been operating since November 2022. Each telescope of the MST has a large field of view of $2.29^{\circ}$ $\times$ $1.53^{\circ}$, and the three are equipped with $g'$, $r'$ and $i'$ filters, respectively. Acting as the pioneer of the forthcoming SiTian project, the MST is dedicated to the discovery of variable stars, transients, and outburst events, and has already obtained some interesting scientific results. In this paper, we summarize the first two years of operation of the MST project.
Submitted 2 April, 2025;
originally announced April 2025.
-
Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network
Authors:
Fubao Zhu,
Yang Zhang,
Gengmin Liang,
Jiaofen Nan,
Yanting Li,
Chuang Han,
Danyang Sun,
Zhiguo Wang,
Chen Zhao,
Wenxuan Zhou,
Jian He,
Yi Xu,
Iokfai Cheang,
Xu Zhu,
Yanli Zhou,
Weihua Zhou
Abstract:
Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study analyzed data from 204 patients (112 with pre-capillary PH, 32 with post-capillary PH, and 60 non-PH controls) at the First Affiliated Hospital of Nanjing Medical University. Diagnoses were confirmed through right heart catheterization. We selected 6 samples from each category for the test set (18 samples, 10%), with the remaining 186 samples used for the training set. This process was repeated 35 times for testing. This paper proposes a deep learning model that combines graph convolutional networks (GCN), convolutional neural networks (CNN), and Transformers. The model was developed to process multimodal data, including short-axis (SAX) sequences, four-chamber (4CH) sequences, and clinical parameters. Our model achieved an area under the receiver operating characteristic curve (AUC) of 0.81 $\pm$ 0.06 (standard deviation) and an accuracy (ACC) of 0.73 $\pm$ 0.06 on the test set. The discriminative abilities were as follows: non-PH subjects (AUC = 0.74 $\pm$ 0.11), pre-capillary PH (AUC = 0.86 $\pm$ 0.06), and post-capillary PH (AUC = 0.83 $\pm$ 0.10). It has the potential to support clinical decision-making by effectively integrating multimodal data to assist physicians in making accurate and timely diagnoses.
Submitted 27 March, 2025;
originally announced April 2025.
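The evaluation protocol described in the abstract above (6 test samples per class, the remainder for training, repeated 35 times) can be sketched as follows. This is an illustration only: the sample identifiers, the seed, and the function name `split_once` are assumptions, and the abstract does not say how the samples were actually drawn.

```python
import random

# Class sizes reported in the abstract.
CLASS_SIZES = {"non-PH": 60, "pre-capillary PH": 112, "post-capillary PH": 32}

def split_once(rng, per_class_test=6):
    """Draw `per_class_test` test samples per class; the rest go to training."""
    train, test = [], []
    for label, size in CLASS_SIZES.items():
        ids = [(label, i) for i in range(size)]  # placeholder sample IDs
        rng.shuffle(ids)
        test.extend(ids[:per_class_test])
        train.extend(ids[per_class_test:])
    return train, test

rng = random.Random(0)  # hypothetical seed, only to make the sketch repeatable
splits = [split_once(rng) for _ in range(35)]  # 35 repetitions, as in the abstract
train, test = splits[0]
print(len(train), len(test))
```

Sampling a fixed number per class keeps the test set balanced even though the classes themselves are not, which is one plausible reason for reporting per-class AUC alongside the overall figure.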
-
TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting
Authors:
Liangbin Xie,
Daniil Pakhomov,
Zhonghao Wang,
Zongze Wu,
Ziyan Chen,
Yuqian Zhou,
Haitian Zheng,
Zhifei Zhang,
Zhe Lin,
Jiantao Zhou,
Chao Dong
Abstract:
This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: https://liangbinxie.github.io/projects/TurboFill/
Submitted 1 April, 2025;
originally announced April 2025.
-
MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
Authors:
Juncheng Wu,
Wenlong Deng,
Xingxuan Li,
Sheng Liu,
Taomian Mi,
Yifan Peng,
Ziyang Xu,
Yi Liu,
Hyunjin Cho,
Chang-In Choi,
Yihan Cao,
Hui Ren,
Xiang Li,
Xiaoxiao Li,
Yuyin Zhou
Abstract:
Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or "thinking paths", which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code are available at https://github.com/UCSC-VLAA/MedReason.
Submitted 4 April, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Time-optimal Convexified Reeds-Shepp Paths on a Sphere
Authors:
Sixu Li,
Deepak Prakash Kumar,
Swaroop Darbha,
Yang Zhou
Abstract:
This article addresses time-optimal path planning for a vehicle capable of moving both forward and backward on a unit sphere with a unit maximum speed, and constrained by a maximum absolute turning rate $U_{max}$. The proposed formulation can be utilized for optimal attitude control of underactuated satellites, optimal motion planning for spherical rolling robots, and optimal path planning for mobile robots on spherical surfaces or uneven terrains. By utilizing Pontryagin's Maximum Principle and analyzing phase portraits, it is shown that for $U_{max}\geq1$, the optimal path connecting a given initial configuration to a desired terminal configuration falls within a sufficient list of 23 path types, each comprising at most 6 segments. These segments belong to the set $\{C,G,T\}$, where $C$ represents a tight turn with radius $r=\frac{1}{\sqrt{1+U_{max}^2}}$, $G$ represents a great circular arc, and $T$ represents a turn-in-place motion. Closed-form expressions for the angles of each path in the sufficient list are derived. The source code for solving the time-optimal path problem and visualization is publicly available at https://github.com/sixuli97/Optimal-Spherical-Convexified-Reeds-Shepp-Paths.
Submitted 1 April, 2025;
originally announced April 2025.
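The tight-turn radius quoted in the abstract above, $r=\frac{1}{\sqrt{1+U_{max}^2}}$, can be evaluated directly. The following is a minimal sketch and not the authors' code: the function name `tight_turn_radius` is hypothetical, and rejecting negative inputs is an assumption based on $U_{max}$ being a bound on an absolute turning rate.

```python
import math


def tight_turn_radius(u_max: float) -> float:
    """Radius of a tight turn (segment C) on the unit sphere.

    Implements r = 1 / sqrt(1 + U_max^2) from the abstract.
    Note: the abstract states its 23-path sufficient list only for U_max >= 1.
    """
    if u_max < 0:
        # Assumption: U_max bounds an absolute turning rate, so it is non-negative.
        raise ValueError("u_max must be non-negative")
    return 1.0 / math.sqrt(1.0 + u_max ** 2)


# Illustrative values (chosen here, not taken from the paper).
radius_at_one = tight_turn_radius(1.0)    # equals 1/sqrt(2)
radius_at_three = tight_turn_radius(3.0)  # equals 1/sqrt(10)
```

A larger turning-rate bound gives a smaller tight-turn radius, so `radius_at_three` is smaller than `radius_at_one`.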
-
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
Authors:
Xiaoke Huang,
Juncheng Wu,
Hui Liu,
Xianfeng Tang,
Yuyin Zhou
Abstract:
Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
Submitted 1 April, 2025;
originally announced April 2025.
-
CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification
Authors:
Yang Yang,
Xijie Xu,
Yixun Zhou,
Jie Zheng
Abstract:
Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at https://github.com/JieZheng-ShanghaiTech/CellVTA.
Submitted 1 April, 2025;
originally announced April 2025.
-
CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures
Authors:
Weifang Hu,
Xuanhua Shi,
Chang Wu,
Yunkai Zhang,
Xuan Peng,
Jiaqi Zhai,
Hai Jin,
Yongluan Zhou,
Xuehai Qian
Abstract:
This paper introduces CFP, a system that searches intra-operator parallelism configurations by leveraging runtime profiles of actual parallel programs. The key idea is to profile a limited space by identifying a new structure named ParallelBlock, which is a group of operators with the property of communication-free tensor partition propagation: the partition of its input tensor can propagate through all operators to its output tensor without introducing communication or synchronization. Based on this property, an optimal tensor partition of operators within a ParallelBlock should be inferred from the partition of the input tensor through partition propagation to prevent avoidable communication. Thus, the search space can be reduced by only profiling each ParallelBlock with different input tensor partitions at its entry, instead of enumerating all combinations among operators within the ParallelBlock. Moreover, the search space is further reduced by identifying ParallelBlock sequences (segments) with similar parallel behavior. CFP computes the overall performance of the model based on the profiles of all segments. On GPT, LLAMA, and MoE models, CFP achieves speedups of up to 1.51x, 1.31x, and 3.43x, respectively, over the state-of-the-art framework, Alpa.
Submitted 1 April, 2025;
originally announced April 2025.
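The search-space reduction described in the CFP abstract above can be illustrated with a simple count. This is a rough sketch under stated assumptions, not CFP's implementation: both function names are hypothetical, every operator is assumed to offer the same number of candidate partitions, and the example sizes (6 operators, 3 partitions) are invented for illustration.

```python
def exhaustive_configs(num_operators: int, partitions_per_operator: int) -> int:
    """Configurations when every operator's partition is chosen independently."""
    return partitions_per_operator ** num_operators


def parallelblock_configs(input_partitions: int) -> int:
    """Configurations when only the block's input partition is chosen.

    Assumption: inside a ParallelBlock, the remaining operators' partitions
    follow from the input partition by communication-free propagation.
    """
    return input_partitions


# Hypothetical block: 6 operators, 3 candidate partitions each.
full_space = exhaustive_configs(num_operators=6, partitions_per_operator=3)
reduced_space = parallelblock_configs(input_partitions=3)
print(f"enumerate all combinations: {full_space}, profile block entry only: {reduced_space}")
```

Under these assumptions the number of configurations to profile grows with the number of input partitions rather than exponentially with the number of operators in the block.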
-
Enhancing Fundus Image-based Glaucoma Screening via Dynamic Global-Local Feature Integration
Authors:
Yuzhuo Zhou,
Chi Liu,
Sheng Shen,
Siyu Le,
Liwen Yu,
Sihan Ouyang,
Zongyuan Ge
Abstract:
With the advancements in medical artificial intelligence (AI), fundus image classifiers are increasingly being applied to assist in ophthalmic diagnosis. While existing classification models have achieved high accuracy on specific fundus datasets, they struggle to address real-world challenges such as variations in image quality across different imaging devices, discrepancies between training and testing images across different racial groups, and the uncertain boundaries due to the characteristics of glaucomatous cases. In this study, we aim to address the above challenges posed by image variations by highlighting the importance of incorporating comprehensive fundus image information, including the optic cup (OC) and optic disc (OD) regions, and other key image patches. Specifically, we propose a self-adaptive attention window that autonomously determines optimal boundaries for enhanced feature extraction. Additionally, we introduce a multi-head attention mechanism to effectively fuse global and local features via feature linear readout, improving the model's discriminative capability. Experimental results demonstrate that our method achieves superior accuracy and robustness in glaucoma classification.
Submitted 1 April, 2025;
originally announced April 2025.
-
Collaborative LLM Numerical Reasoning with Local Data Protection
Authors:
Min Zhang,
Yuzhe Lu,
Yun Zhou,
Panpan Xu,
Lin Lee Cheong,
Chang-Tien Lu,
Haozhu Wang
Abstract:
Numerical reasoning over documents, which demands both contextual understanding and logical inference, is challenging for low-capacity local models deployed on computation-constrained devices. Although such complex reasoning queries could be routed to powerful remote models like GPT-4, exposing local data raises significant data leakage concerns. Existing mitigation methods generate problem descriptions or examples for remote assistance. However, the inherent complexity of numerical reasoning hinders the local model from generating logically equivalent queries and accurately inferring answers with remote guidance. In this paper, we present a model collaboration framework with two key innovations: (1) a context-aware synthesis strategy that shifts the query domains while preserving logical consistency; and (2) a tool-based answer reconstruction approach that reuses the remote-generated problem-solving pattern with code snippets. Experimental results demonstrate that our method achieves better reasoning accuracy than solely using local models while providing stronger data protection than fully relying on remote models. Furthermore, our method improves accuracy by 16.2% - 43.6% while reducing data leakage by 2.3% - 44.6% compared to existing data protection approaches.
Submitted 31 March, 2025;
originally announced April 2025.
-
Existence of Full Replica Symmetry Breaking for the Sherrington-Kirkpatrick Model at Low Temperature
Authors:
Yuxin Zhou
Abstract:
We prove the existence of full replica symmetry breaking (FRSB) for the Sherrington-Kirkpatrick (SK) model at low temperature. More specifically, we prove that slightly beyond the critical temperature, the Parisi measure for the SK model is supported on an interval starting at the origin and only has one jump discontinuity at the right endpoint.
Submitted 15 April, 2025; v1 submitted 31 March, 2025;
originally announced April 2025.
-
MiZero: The Shadowy Defender Against Text Style Infringements
Authors:
Ziwei Zhang,
Juan Wen,
Wanli Peng,
Zhengxian Wu,
Yinghan Zhou,
Yiming Xue
Abstract:
In-Context Learning (ICL) and efficient fine-tuning methods have significantly enhanced the efficiency of applying Large Language Models (LLMs) to downstream tasks. However, they also raise concerns about the imitation and infringement of personal creative data. Current methods for data copyright protection primarily focus on content security but lack effectiveness in protecting the copyrights of text styles. In this paper, we introduce a novel implicit zero-watermarking scheme, namely MiZero. This scheme establishes a precise watermark domain to protect the copyrighted style, surpassing traditional watermarking methods that distort the style characteristics. Specifically, we employ LLMs to extract condensed-lists utilizing the designed instance delimitation mechanism. These lists guide MiZero in generating the watermark. Extensive experiments demonstrate that MiZero effectively verifies text style copyright ownership against AI imitation.
Submitted 30 March, 2025;
originally announced April 2025.
-
Gravitational Waves from Massive Black Hole Mergers in ASTRID: Predictions for LISA
Authors:
Bonny Y. Wang,
Yihao Zhou,
William Chen,
Nianyi Chen,
Tiziana Di Matteo,
Rupert Croft,
Simeon Bird,
Yueying Ni
Abstract:
We use the ASTRID cosmological simulation to forecast massive black hole (MBH) mergers detectable by LISA down to $z=0$. ASTRID directly models MBH dynamical friction, allowing a realistic tracking of their trajectory. It also incorporates relatively low-mass MBH seeds down to $5\times 10^{4}\mathrm{M}_{\odot}$, providing a more complete picture of LISA MBH mergers. We find that LISA MBH mergers initially have high eccentricities, peaking around $e_0 = 0.8$ across all redshifts. Accounting for this boosts the event rate from 5.6/yr (if circular orbits are assumed) to 10.5/yr. This enhancement is largely due to additional inspiral sources that will coalesce after LISA's observation, which constitute 46% of detected events. This underscores the importance of LISA's sensitivity to the early inspiral phase, especially for eccentric binaries that emit gravitational waves across a wider frequency band. Most LISA events in ASTRID arise from $M_{\mathrm{BH}} \sim 10^{5-6}~\mathrm{M}_{\odot}$, low-redshift ($z<2$) and low mass-ratio ($q\sim 0.01-0.1$) mergers. Accounting for eccentricity broadens the detectable MBH mass range up to $10^9~\mathrm{M}_{\odot}$, and shifts the peak of detectable mergers to a lower redshift $z_{\rm peak} = 0.8$. This implies that the most massive LISA events may also be PTA sources. We predict LISA events to be in various galaxy environments, including many low-mass satellite galaxies. The EM counterparts of most LISA sources have AGN luminosities $L_{\rm bol}> 10^{42}$ erg/s, albeit only $1\%$ with $ > 10^{44}$ erg/s. The brightest AGN are those associated with the rare LISA/PTA events with $M_{\rm BH} > 10^{8}~\mathrm{M}_{\odot}$.
Submitted 26 April, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
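The event-rate figures quoted in the ASTRID abstract above can be cross-checked with a few lines of arithmetic. This is only a consistency sketch: the variable names are hypothetical, and it assumes that the 46% inspiral fraction applies to the 10.5/yr eccentric total, which the abstract does not state explicitly.

```python
# Rates quoted in the abstract (events per year).
circular_rate = 5.6    # assuming circular orbits
eccentric_rate = 10.5  # accounting for eccentricity
inspiral_fraction = 0.46

# Multiplicative boost and absolute increase from including eccentricity.
boost_factor = eccentric_rate / circular_rate
extra_events = eccentric_rate - circular_rate

# Assumption: the 46% share is taken of the eccentric total.
inspiral_events = inspiral_fraction * eccentric_rate

print(f"boost: {boost_factor:.3f}x, extra: {extra_events:.2f}/yr, "
      f"inspiral sources: {inspiral_events:.2f}/yr")
```

Under that assumption, the inspiral sources account for nearly all of the additional events, which is in line with the abstract's statement that the enhancement is "largely due to" them.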
-
HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
Authors:
Boyuan Wang,
Xiaofeng Wang,
Chaojun Ni,
Guosheng Zhao,
Zhiqin Yang,
Zheng Zhu,
Muyang Zhang,
Yukun Zhou,
Xinze Chen,
Guan Huang,
Lihong Liu,
Xingang Wang
Abstract:
Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on the dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, we introduce a novel LAMA loss. Together, these contributions improve FID by 62.4% and R-precision for top1, top2, and top3 by 41.8%, 26.3%, and 18.3%, respectively, thereby advancing both Text-to-Pose control accuracy and FID. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.
Submitted 31 March, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion
Authors:
Jin Zhou,
Yi Zhou,
Pengfei Xu,
Hui Huang
Abstract:
In the field of sketch generation, raster-format trained models often produce non-stroke artifacts, while vector-format trained models typically lack a holistic understanding of sketches, leading to compromised recognizability. Moreover, existing methods struggle to extract common features from similar elements (e.g., eyes of animals) appearing at varying positions across sketches. To address these challenges, we propose StrokeFusion, a two-stage framework for vector sketch generation. It contains a dual-modal sketch feature learning network that maps strokes into a high-quality latent space. This network decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps, representing sketches as sets of stroke feature vectors. Building upon this representation, our framework exploits a stroke-level latent diffusion model that simultaneously adjusts stroke position, scale, and trajectory during generation. This enables high-fidelity sketch generation while supporting stroke interpolation editing. Extensive experiments on the QuickDraw dataset demonstrate that our framework outperforms state-of-the-art techniques, validating its effectiveness in preserving structural integrity and semantic features. Code and models will be made publicly available upon publication.
Submitted 16 April, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks
Authors:
Yu Zhou,
Dian Zheng,
Qijie Mo,
Renjie Lu,
Kun-Yu Lin,
Wei-Shi Zheng
Abstract:
In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric task. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of the pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address this, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (as used in some works), we achieve state-of-the-art performance across various benchmarks. Moreover, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation, with great performance.
Submitted 31 March, 2025;
originally announced March 2025.
-
Exploring Temporal Dynamics in Event-based Eye Tracker
Authors:
Hongwei Ren,
Xiaopeng Lin,
Hongxiang Huang,
Yue Zhou,
Bojun Cheng
Abstract:
Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vision systems, are capable of perceiving eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for achieving high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker utilizes 3D convolutional neural networks to capture implicit short-term temporal dynamics and employs a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics. Ultimately, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset and secured third place in the CVPR event-based eye-tracking challenge 2025. Our code is available at https://github.com/rhwxmx/TDTracker.
Submitted 31 March, 2025;
originally announced March 2025.
-
Nonreciprocity and unidirectional invisibility in three optical modes with non-Markovian effects
Authors:
H. Yi,
T. Z. Luan,
W. Y. Hu,
Cheng Shang,
Yan-Hui Zhou,
Zhi-Cheng Shi,
H. Z. Shen
Abstract:
In this work, we construct a system of three coupled optical modes and obtain an effective Hamiltonian mediated by coherent dissipative coupling through adiabatic elimination of the mode with large dissipation. We investigate the cooperative effect of coherent and dissipative photon-photon couplings in an open cavity system, which leads to nonreciprocity with a considerably large isolation ratio and flexible controllability. We discover unidirectional invisibility for electromagnetic wave propagation, which appears at the zero-damping condition (ZDC) for hybrid photon-photon modes, and obtain the transmission spectrum at the ZDC. We study the influences of the parameters on the nonreciprocal transmission of the system to capture the generic physics of the interference between coherent and dissipative couplings, which accurately reproduces the results of numerical simulation over a broad range of parameters. Moreover, we extend the study of nonreciprocal transmission with the Markovian approximation to non-Markovian environments, which consist of a collection of oscillators (bosonic photonic modes), and give the adiabatic elimination method with non-Markovian effects. We illustrate that nonreciprocal transmission at the ZDC exhibits a crossover from the non-Markovian to the Markovian regime as the environmental spectral width is varied. This indicates a promising way to enhance or steer quantum nonreciprocal devices in optical cavities and provides potential applications for precision measurements and optical communications with non-Markovian effects.
Submitted 29 March, 2025;
originally announced March 2025.
-
When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?
Authors:
Tuo Liang,
Zhe Hu,
Jing Li,
Hao Zhang,
Yiren Lu,
Yunlai Zhou,
Yiran Qiao,
Disheng Liu,
Jeirui Peng,
Jing Ma,
Yu Yin
Abstract:
Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI's ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YesBut (V2), a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, comparative analysis, and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs' understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative understanding through comparative reasoning.
Submitted 29 March, 2025;
originally announced March 2025.
-
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Authors:
Jiahui Zhang,
Yurui Chen,
Yanpeng Zhou,
Yueming Xu,
Ze Huang,
Jilin Mei,
Junhui Chen,
Yu-Jie Yuan,
Xinyue Cai,
Guowei Huang,
Xingyue Quan,
Hang Xu,
Li Zhang
Abstract:
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
Submitted 9 May, 2025; v1 submitted 29 March, 2025;
originally announced March 2025.
-
Nested Stochastic Gradient Descent for (Generalized) Sinkhorn Distance-Regularized Distributionally Robust Optimization
Authors:
Yufeng Yang,
Yi Zhou,
Zhaosong Lu
Abstract:
Distributionally robust optimization (DRO) is a powerful technique to train robust models against data distribution shift. This paper aims to solve regularized nonconvex DRO problems, where the uncertainty set is modeled by a so-called generalized Sinkhorn distance and the loss function is nonconvex and possibly unbounded. Such a distance allows modeling the uncertainty of distributions with different probability supports and divergence functions. For this class of regularized DRO problems, we derive a novel dual formulation taking the form of nested stochastic programming, where the dual variable depends on the data sample. To solve the dual problem, we provide theoretical evidence to design a nested stochastic gradient descent (SGD) algorithm, which leverages stochastic approximation to estimate the nested stochastic gradients. We study the convergence rate of nested SGD and establish polynomial iteration and sample complexities that are independent of the data size and parameter dimension, indicating its potential for solving large-scale DRO problems. We conduct numerical experiments to demonstrate the efficiency and robustness of the proposed algorithm.
Submitted 28 March, 2025;
originally announced March 2025.
-
Dual Audio-Centric Modality Coupling for Talking Head Generation
Authors:
Ao Fu,
Ziqi Ni,
Yi Zhou
Abstract:
The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, demonstrating robust generalization across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising solution for high-quality, audio-driven talking head generation and present a scalable approach for creating realistic talking heads.
Submitted 26 March, 2025;
originally announced March 2025.
-
LiteBIRD Science Goals and Forecasts: constraining isotropic cosmic birefringence
Authors:
E. de la Hoz,
P. Diego-Palazuelos,
J. Errard,
A. Gruppuso,
B. Jost,
R. M. Sullivan,
M. Bortolami,
Y. Chinone,
L. T. Hergt,
E. Komatsu,
Y. Minami,
I. Obata,
D. Paoletti,
D. Scott,
P. Vielva,
D. Adak,
R. Akizawa,
A. Anand,
J. Aumont,
C. Baccigalupi,
A. J. Banday,
R. B. Barreiro,
N. Bartolo,
S. Basak,
A. Basyrov
, et al. (90 additional authors not shown)
Abstract:
Cosmic birefringence (CB) is the rotation of the photons' linear polarisation plane during propagation. Such an effect is a tracer of parity-violating extensions of standard electromagnetism and would probe the existence of a new cosmological field acting as dark matter or dark energy. It has become customary to employ cosmic microwave background (CMB) polarised data to probe such a phenomenon. Recent analyses on Planck and WMAP data provide a hint of detection of the isotropic CB angle with an amplitude of around $0.3^\circ$ at the level of $2.4$ to $3.6σ$. In this work, we explore the LiteBIRD capabilities in constraining such an effect, accounting for the impact of the more relevant systematic effects, namely foreground emission and instrumental polarisation angles. We build five semi-independent pipelines and test these against four different simulation sets with increasing complexity in terms of non-idealities. All the pipelines are shown to be robust and capable of returning the expected values of the CB angle within statistical fluctuations for all the cases considered. We find that the uncertainties in the CB estimates increase with more complex simulations. However, the trend is less pronounced for pipelines that account for the instrumental polarisation angles. For the most complex case analysed, we find that LiteBIRD will be able to detect a CB angle of $0.3^\circ$ with a statistical significance ranging from $5$ to $13 \, σ$, depending on the pipeline employed, where the latter uncertainty corresponds to a total error budget of the order of $0.02^\circ$.
Submitted 28 March, 2025;
originally announced March 2025.
-
Segment then Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting
Authors:
Yiren Lu,
Yunlai Zhou,
Yiran Qiao,
Chaoda Song,
Tuo Liang,
Jing Ma,
Yu Yin
Abstract:
Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality. However, most existing methods rely on 2D pixel-level parsing, leading to multi-view inconsistencies and poor 3D object retrieval. Moreover, they are limited to static scenes and struggle with dynamic scenes due to the complexities of motion modeling. In this paper, we propose Segment then Splat, a 3D-aware open vocabulary segmentation approach for both static and dynamic scenes based on Gaussian Splatting. Segment then Splat reverses the long established approach of "segmentation after reconstruction" by dividing Gaussians into distinct object sets before reconstruction. Once the reconstruction is complete, the scene is naturally segmented into individual objects, achieving true 3D segmentation. This approach not only eliminates Gaussian-object misalignment issues in dynamic scenes but also accelerates the optimization process, as it eliminates the need for learning a separate language field. After optimization, a CLIP embedding is assigned to each object to enable open-vocabulary querying. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method in both static and dynamic scenarios.
Submitted 28 March, 2025;
originally announced March 2025.
-
Uniform vector bundles over $\mathbb{P}^4$
Authors:
Rong Du,
Yuhang Zhou
Abstract:
There is a long-standing conjecture which states that every uniform algebraic vector bundle of rank $r<2n$ on the $n$-dimensional projective space $\mathbb{P}^n$ over an algebraically closed field of characteristic $0$ is homogeneous. This conjecture is valid for $n\leq3$. In this paper, we classify all uniform vector bundles of rank $r<8$ over $\mathbb{P}^4$ and show that the conjecture holds for $n=4$.
Submitted 28 March, 2025;
originally announced March 2025.
-
Updated model-independent measurement of the strong-phase differences between $D^0$ and $\bar{D}^0 \to K^{0}_{S/L}π^+π^-$ decays
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (696 additional authors not shown)
Abstract:
The strong-phase differences between $D^0\to K_{S/L}^0π^+π^-$ and $\bar{D}^0\to K_{S/L}^0π^+π^-$ decays are one of the most important inputs in measuring the $C\!P$ violating angle $γ$ via $B^- \to D K^-$ decays. They also play a key role in studies of charm mixing and indirect $C\!P$ violation. In this paper, the strong-phase differences are determined in a model-independent way with quantum-correlated $D^0$-$\bar{D}^0$ decays from 7.93 fb$^{-1}$ of $e^+e^-$ annihilation data at $\sqrt{s}$=3.773 GeV by the BESIII experiment. These results are the most precise to date and are expected to significantly reduce associated uncertainties in determining the $C\!P$ violating angle $γ$ and related charm mixing parameters.
Submitted 18 April, 2025; v1 submitted 27 March, 2025;
originally announced March 2025.
-
Performance Characterizations and Usage Guidelines of Samsung CXL Memory Module Hybrid Prototype
Authors:
Jianping Zeng,
Shuyi Pei,
Da Zhang,
Yuchen Zhou,
Amir Beygi,
Xuebin Yao,
Ramdas Kachare,
Tong Zhang,
Zongwang Li,
Marie Nguyen,
Rekha Pitchumani,
Yang Soek Ki,
Changhee Jung
Abstract:
The growing prevalence of data-intensive workloads, such as artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), in-memory databases, and real-time analytics, has exposed limitations in conventional memory technologies like DRAM. While DRAM offers low latency and high throughput, it is constrained by high costs, scalability challenges, and volatility, making it less viable for capacity-bound and persistent applications in modern datacenters.
Recently, Compute Express Link (CXL) has emerged as a promising alternative, enabling high-speed, cacheline-granular communication between CPUs and external devices. By leveraging CXL technology, NAND flash can now be used as memory expansion, offering three-fold benefits: byte-addressability, scalable capacity, and persistence at a low cost. Samsung's CXL Memory Module Hybrid (CMM-H) is the first product to deliver these benefits through a hardware-only solution, i.e., it does not incur any OS and IO overheads like conventional block devices. In particular, CMM-H integrates a DRAM cache with NAND flash in a single device to deliver near-DRAM latency. This paper presents the first publicly available study for comprehensive characterizations of an FPGA-based CMM-H prototype. Through this study, we address users' concerns about whether a wide variety of applications can successfully run on a memory device backed by NAND flash medium. Additionally, based on these characterizations, we provide key insights into how to best take advantage of the CMM-H device.
Submitted 27 March, 2025;
originally announced March 2025.
-
Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback
Authors:
Yuan Meng,
Xiangtong Yao,
Haihui Ye,
Yirui Zhou,
Shengqiang Zhang,
Zhenshan Bing,
Alois Knoll
Abstract:
Recent advances in language-conditioned robotic manipulation have leveraged imitation and reinforcement learning to enable robots to execute tasks from human commands. However, these methods often suffer from limited generalization, adaptability, and the lack of large-scale specialized datasets, unlike data-rich domains such as computer vision, making long-horizon task execution challenging. To address these gaps, we introduce DAHLIA, a data-agnostic framework for language-conditioned long-horizon robotic manipulation, leveraging large language models (LLMs) for real-time task planning and execution. DAHLIA employs a dual-tunnel architecture, where an LLM-powered planner collaborates with co-planners to decompose tasks and generate executable plans, while a reporter LLM provides closed-loop feedback, enabling adaptive re-planning and ensuring task recovery from potential failures. Moreover, DAHLIA integrates chain-of-thought (CoT) in task reasoning and temporal abstraction for efficient action execution, enhancing traceability and robustness. Our framework demonstrates state-of-the-art performance across diverse long-horizon tasks, achieving strong generalization in both simulated and real-world scenarios. Videos and code are available at https://ghiara.github.io/DAHLIA/.
Submitted 27 March, 2025;
originally announced March 2025.
-
Emergent Non-Markovian Gain in Open Quantum Systems
Authors:
H. Z. Shen,
Cheng Shang,
Yan-Hui Zhou,
X. X. Yi
Abstract:
Non-Markovian dynamics go beyond the Markovian approximation by capturing memory effects and information backflow in open quantum systems, which are crucial for describing realistic physical processes. In this work, we study the exact non-Markovian dynamics of a driven cavity coupled to an anisotropic three-dimensional photonic-crystal environment via counterrotating-wave interactions. We derive an exact analytical expression for the cavity amplitude satisfying the integro-differential equation, which includes the contributions of the bound states outside the continuum and the dissipative parts with the continuum spectrum. Based on the characteristic function method, we derive the exact non-Markovian master equation for the cavity, which contributes to the gain of the cavity. We give the physical origin of non-Markovian gain in the presence of bound states in the system consisting of cavity and environment, which has no Markovian counterparts due to the nonexponential gain in the non-Markovian structured environment. We find that three different types of bound states can be formed in the system, containing one bound state with no inversion of photon number, two bound states with the periodic equal-amplitude oscillation, and the gain with two complex roots without the bound states formation. We derive a current equation including the source from the driving field, the transient current induced by the change in the number of photons, and the two-photon current caused by the counterrotating-wave term. The results are compared with those given by the rotating-wave interactions and extended to a more general quantum network involving an arbitrary number of coupled cavities. Our findings may pave the way for a deeper understanding of non-Markovian dynamics with gain in quantum networks involving counterrotating-wave effects.
Submitted 27 March, 2025;
originally announced March 2025.
-
First observation of $Λ_{c}(2595)^{+} \to Λ^{+}_{c}π^0π^0$ and $Λ_{c}(2625)^{+}\to Λ^{+}_{c}π^0π^0$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (657 additional authors not shown)
Abstract:
By analysing $e^+e^-$ annihilation data corresponding to an integrated luminosity of 368.48 pb$^{-1}$ collected at the centre-of-mass energies of $\sqrt{s} = 4.918$ and $4.951$ GeV with the BESIII detector, we report the first observation of $Λ_{c}(2595)^{+}$ and $Λ_{c}(2625)^{+}\to Λ^{+}_{c}π^0π^0$ with statistical significances of 7.9$σ$ and 11.8$σ$, respectively. The branching fractions of $Λ_{c}(2595)^{+}$ and $Λ_{c}(2625)^{+}\to Λ^{+}_{c}π^0π^0$ are measured to be $(59.5 \pm 11.1_{\rm stat.} \pm 7.9_{\rm syst.}) \%$ and $(41.0 \pm 5.2_{\rm stat.} \pm 3.3_{\rm syst.}) \%$, respectively. The absolute branching fraction of $Λ_{c}(2595)^{+}$ is consistent, within uncertainty, with the expectation of the mechanism referred to as the threshold effect, proposed for the strong decays of $Λ_{c}(2595)^{+}$.
Submitted 27 March, 2025;
originally announced March 2025.
-
LandMarkSystem Technical Report
Authors:
Zhenxiang Ma,
Zhenyu Yang,
Miao Tao,
Yuanzhen Zhou,
Zeyu He,
Yuchang Zhang,
Rong Fu,
Hengjie Li
Abstract:
3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative algorithms. This comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction tasks. To facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository at https://github.com/InternLandMark/LandMarkSystem.
Submitted 28 March, 2025; v1 submitted 27 March, 2025;
originally announced March 2025.
-
Global analysis of fragmentation functions to light neutral hadrons
Authors:
Jun Gao,
ChongYang Liu,
Mengyang Li,
XiaoMin Shen,
Hongxi Xing,
Yuxiang Zhao,
Yiyu Zhou
Abstract:
Fragmentation functions (FFs) are crucial non-perturbative components in quantum chromodynamics (QCD), playing a vital role in predictions and understanding of the hadronization process. In this paper, we present the FFs for $K_S^0$, $η$, $π^0$ mesons, and $Λ$ baryons in the context of global QCD analysis. The data included in the fit are from single inclusive $e^+ e^-$ annihilation (SIA), semi-inclusive deep-inelastic scattering (SIDIS) and proton-proton collisions, with kinematic cuts carefully applied to ensure validity of collinear factorization and perturbative QCD expansion. For the first time, data from SIDIS and hadron-in-jet production in SIA have been incorporated into the extraction of FFs for light-flavor neutral hadrons. Our analysis reveals that these data play a critical role in constraining the gluon distribution, and in distinguishing between different quark flavors. Pulls from different datasets are also studied by performing alternative fits that systematically subtract groups of data from the nominal fit. For the quality of the fit, good $χ^2$ values are achieved for most of the datasets, and FFs are generally well constrained within the momentum fraction region $(0.1, 0.5)$. The extracted $K_S^0$ fragmentation functions, together with the $K_S^0$ FFs constructed from $K^{\pm}$ FFs via isospin symmetry, are used to test isospin symmetry in kaon fragmentation. Although a definitive conclusion cannot be reached yet, these studies have identified several potential measurements that can be performed at existing facilities, which may ultimately help us to arrive at a conclusive answer. With the comprehensive species of FFs extracted within the NPC framework, we are able to perform a test on the momentum sum rule with the light-flavor charged and neutral hadrons.
Submitted 27 March, 2025;
originally announced March 2025.
-
The Optimal Tradeoff Between PAPR and Ambiguity Functions for Generalized OFDM Waveform Set in ISAC Systems
Authors:
Bichai Wang,
Xiuhong Wei,
Xueru Li,
Yongxing Zhou
Abstract:
Integrated sensing and communications (ISAC) has been identified as one of the six usage scenarios for IMT-2030. Compared with communication performance, sensing performance is much more vulnerable to interference, and the received backscattered sensing signal with target information is usually too weak to be detected. It is interesting to understand the optimal tradeoff between interference rejection and signal strength improvement for the best sensing performance, but unfortunately it still remains unknown. In this paper, the trinity of auto-ambiguity function (AF), cross-AF and peak-to-average-power ratio (PAPR) is proposed to describe the interference and coverage related aspects for ISAC systems where multi-carrier waveform is usually assumed. We extend the existing orthogonal frequency division multiplexing (OFDM) waveforms in 5G to a generalized OFDM waveform set with some new members and a unified parametric representation. Then the optimal Pareto tradeoff between PAPR, auto-AF and cross-AF (i.e., the union bound) is developed for the generalized OFDM waveform set. To achieve the optimal Pareto union bound with reasonable computational complexity, we further propose a framework to optimize waveform parameters and sequences jointly. Finally, some practical design examples are provided and numerical results reveal that significant improvements can be achieved compared to the state-of-the-art 5G waveforms and sequences.
Submitted 27 March, 2025;
originally announced March 2025.
-
Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection
Authors:
Jiajie Quan,
Ao Tong,
Yuxuan Cai,
Xinwei He,
Yulong Wang,
Yang Zhou
Abstract:
In multi-class unsupervised anomaly detection (MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known "learning shortcut" issue when decoders fail to capture normal patterns and reconstruct both normal and abnormal samples naively. To address this, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at https://github.com/easyoo/Omni-AD.git
Submitted 28 March, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies
Authors:
Tianqi He,
Xiaohan Huang,
Yi Du,
Qingqing Long,
Ziyue Qiao,
Min Wu,
Yanjie Fu,
Yuanchun Zhou,
Meng Xiao
Abstract:
Feature Transformation is crucial for classic machine learning that aims to generate feature combinations to enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflow by minimizing human involvement. However, three challenges remain in those frameworks: (1) They depend predominantly on downstream task performance metrics, whose assessment is time-consuming, especially for large datasets. (2) The diversity of feature combinations will hardly be guaranteed after random exploration ends. (3) Rare significant transformations lead to sparse valuable feedback that hinders the learning processes or leads to less effective results. In response to these challenges, we introduce FastFT, an innovative framework that leverages a trio of advanced strategies. We first decouple the feature transformation evaluation from the outcomes of the generated datasets via the performance predictor. To address the issue of reward sparsity, we developed a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model's exploration of effective transformations, thereby improving the search productivity. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of our proposed framework, showcasing its superiority in handling complex feature transformation tasks.
Submitted 26 March, 2025;
originally announced March 2025.
-
Dewey Long Context Embedding Model: A Technical Report
Authors:
Dun Zhang,
Panxiang Zou,
Yudong Zhou
Abstract:
This technical report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model. The increasing demand for retrieval-augmented generation (RAG) systems and the expanding context window capabilities of large language models (LLMs) have created critical challenges for conventional embedding models. Current approaches often struggle to maintain semantic coherence when processing documents exceeding typical sequence length limitations, significantly impacting retrieval performance in knowledge-intensive applications. This paper presents dewey_en_beta, a novel text embedding model that achieves excellent performance on MTEB (Eng, v2) and LongEmbed benchmark while supporting 128K token sequences. Our technical contribution centers on chunk alignment training, an innovative methodology that enables the simultaneous generation of localized chunk embeddings and global document-level representations through distillation. Information regarding the model release can be found at https://huggingface.co/infgrad/dewey_en_beta.
Submitted 26 March, 2025;
originally announced March 2025.
-
Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos
Authors:
Jiaheng Zhou,
Yanfeng Zhou,
Wei Fang,
Yuxing Tang,
Le Lu,
Ge Yang
Abstract:
Ultrasound videos are an important form of clinical imaging data, and deep learning-based automated analysis can improve diagnostic accuracy and clinical efficiency. However, the scarcity of labeled data and the inherent challenges of video analysis have impeded the advancement of related methods. In this work, we introduce E-ViM$^3$, a data-efficient Vision Mamba network that preserves the 3D structure of video data, enhancing long-range dependencies and inductive biases to better model space-time correlations. With our design of Enclosure Global Tokens (EGT), the model captures and aggregates global features more effectively than competing methods. To further improve data efficiency, we employ masked video modeling for self-supervised pre-training, with the proposed Spatial-Temporal Chained (STC) masking strategy designed to adapt to various video scenarios. Experiments demonstrate that E-ViM$^3$ performs as the state-of-the-art in two high-level semantic analysis tasks across four datasets of varying sizes: EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS. Furthermore, our model achieves competitive performance with limited labels, highlighting its potential impact on real-world clinical applications.
Submitted 26 March, 2025;
originally announced March 2025.
-
Video Motion Graphs
Authors:
Haiyang Liu,
Zhan Xu,
Fa-Ting Hong,
Hsin-Ping Huang,
Yi Zhou,
Yang Zhou
Abstract:
We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp (i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and (ii) adopts condition progressive training to effectively leverage identity strong and weak conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectory. Results show that our Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. The project page can be found at https://h-liu1997.github.io/Video-Motion-Graphs/
Submitted 26 March, 2025;
originally announced March 2025.