-
Training-free Generation of Temporally Consistent Rewards from VLMs
Authors:
Yinuo Zhao,
Jiale Yuan,
Zhiyuan Xu,
Xiaoshuai Hao,
Xinyi Zhang,
Kun Wu,
Zhengping Che,
Chi Harold Liu,
Jian Tang
Abstract:
Recent advances in vision-language models (VLMs) have significantly improved performance in embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time applicability. To address this, we propose $\mathrm{T}^2$-VLM, a novel training-free, temporally consistent framework that generates accurate rewards through tracking the status changes in VLM-derived subgoals. Specifically, our method first queries the VLM to establish spatially aware subgoals and an initial completion estimate before each round of interaction. We then employ a Bayesian tracking algorithm to update the goal completion status dynamically, using subgoal hidden states to generate structured rewards for reinforcement learning (RL) agents. This approach enhances long-horizon decision-making and improves failure recovery capabilities with RL. Extensive experiments indicate that $\mathrm{T}^2$-VLM achieves state-of-the-art performance in two robot manipulation benchmarks, demonstrating superior reward accuracy with reduced computation consumption. We believe our approach not only advances reward generation techniques but also contributes to the broader field of embodied AI. Project website: https://t2-vlm.github.io/.
Submitted 7 July, 2025;
originally announced July 2025.
-
GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models
Authors:
Kai Yao,
Zhaorui Tan,
Penglei Gao,
Lichun Li,
Kaixin Wu,
Yinggui Wang,
Yuan Zhao,
Yixin Ji,
Wei Wang,
Jianke Zhu
Abstract:
With the rapid growth of large language models (LLMs), traditional centralized fine-tuning has emerged as a key technique for adapting these models to domain-specific challenges, but it poses privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), has been proposed to address these challenges: a weaker emulator is compressed from the original model and further fine-tuned with an adapter to enhance privacy. However, existing OT-based methods incur high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression, named GradOT. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.
Submitted 6 July, 2025;
originally announced July 2025.
-
PromptSR: Cascade Prompting for Lightweight Image Super-Resolution
Authors:
Wenyang Liu,
Chen Cai,
Jianjun Gao,
Kejun Wu,
Yi Wang,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at https://github.com/wenyang001/PromptSR.
Submitted 5 July, 2025;
originally announced July 2025.
-
Beyond Charging Anxiety: An Explainable Approach to Understanding User Preferences of EV Charging Stations Using Review Data
Authors:
Zifei Wang,
Emmanuel Abolarin,
Kai Wu,
Venkatarao Rebba,
Jian Hu,
Zhen Hu,
Shan Bao,
Feng Zhou
Abstract:
Electric vehicle (EV) charging infrastructure is directly related to the overall EV user experience and thus impacts the widespread adoption of EVs. Understanding key factors that affect EV users' charging experience is essential for building a robust and user-friendly EV charging infrastructure. This study leverages about $17,000$ charging station (CS) reviews on Google Maps to explore EV user preferences for charging stations, employing ChatGPT 4.0 for aspect-based sentiment analysis. We identify twelve key aspects influencing user satisfaction, ranging from accessibility and reliability to amenities and pricing. Two distinct preference models are developed: a micro-level model focused on individual user satisfaction and a macro-level model capturing collective sentiment towards specific charging stations. Both models utilize the LightGBM algorithm for user preference prediction, achieving strong performance compared to other machine learning approaches. To further elucidate the impact of each aspect on user ratings, we employ SHAP (SHapley Additive exPlanations), a game-theoretic approach for interpreting machine learning models. Our findings highlight the significant impact of positive sentiment towards "amenities and location", coupled with negative sentiment regarding "reliability and maintenance", on overall user satisfaction. These insights offer actionable guidance to charging station operators, policymakers, and EV manufacturers, empowering them to enhance user experience and foster wider EV adoption.
Submitted 3 July, 2025;
originally announced July 2025.
-
PCPP-Based Reconfiguration Inapproximability: Query Complexity vs. Soundness Gap Trade-offs
Authors:
Venkatesan Guruswami,
Xuandi Ren,
Kewen Wu
Abstract:
The Reconfiguration Inapproximability Hypothesis (RIH), recently established by Hirahara-Ohsaka (STOC'24) and Karthik-Manurangsi (ECCC'24), studies the hardness of reconfiguring one solution into another in constraint satisfaction problems (CSP) when restricted to approximate intermediate solutions. In this work, we make a tighter connection between RIH's soundness gap and that of probabilistically checkable proofs of proximity (PCPP). Consequently, we achieve an improved trade-off between soundness and query complexity in Gap CSP Reconfiguration. Our approach leverages a parallelization framework, which also appears in some recent parameterized inapproximability results.
Submitted 1 July, 2025;
originally announced July 2025.
-
Automated Vehicles Should be Connected with Natural Language
Authors:
Xiangbo Gao,
Keshu Wu,
Hao Zhang,
Kexin Tian,
Yang Zhou,
Zhengzhong Tu
Abstract:
Multi-agent collaborative driving promises improvements in traffic safety and efficiency through collective perception and decision making. However, existing communication media -- including raw sensor data, neural network features, and perception results -- suffer limitations in bandwidth efficiency, information completeness, and agent interoperability. Moreover, traditional approaches have largely ignored decision-level fusion, neglecting critical dimensions of collaborative driving. In this paper we argue that addressing these challenges requires a transition from purely perception-oriented data exchanges to explicit intent and reasoning communication using natural language. Natural language balances semantic density and communication bandwidth, adapts flexibly to real-time conditions, and bridges heterogeneous agent platforms. By enabling the direct communication of intentions, rationales, and decisions, it transforms collaborative driving from reactive perception-data sharing into proactive coordination, advancing safety, efficiency, and transparency in intelligent transportation systems.
Submitted 29 June, 2025;
originally announced July 2025.
-
Characterization of Rydberg-Atom Signal Reception of Dual-Frequency Signals Coupled with Two Energy Levels
Authors:
Hao Wu,
Chongwu Xie,
Xinyuan Yao,
Kang-Da Wu,
Shanchi Wu,
Rui Ni,
Guo-Yong Xiang,
Chen Gong
Abstract:
Rydberg atomic sensors have been adopted as a novel radio frequency (RF) measurement technique, and their ability to sense signals at multiple frequencies makes them attractive for multi-user communication. However, unlike traditional antennas, where signals at multiple frequencies are orthogonal, the received signals of atomic sensors corresponding to different energy levels are downconverted to the baseband simultaneously, resulting in multi-user interference. In this paper, we therefore analyze the mutual interference characteristics of two RF signals with different carrier frequencies coupling different energy levels. We introduce the joint response coefficient based on the receiver characteristics and analyze the interference of one user to another. We analyze the bit-error rate (BER) and symbol-error rate (SER) for two signals coupling two different energy levels. We also conduct experiments to validate the BER and SER results.
Submitted 26 June, 2025;
originally announced June 2025.
-
SPT-3G D1: CMB temperature and polarization power spectra and cosmology from 2019 and 2020 observations of the SPT-3G Main field
Authors:
E. Camphuis,
W. Quan,
L. Balkenhol,
A. R. Khalife,
F. Ge,
F. Guidi,
N. Huang,
G. P. Lynch,
Y. Omori,
C. Trendafilova,
A. J. Anderson,
B. Ansarinejad,
M. Archipley,
P. S. Barry,
K. Benabed,
A. N. Bender,
B. A. Benson,
F. Bianchini,
L. E. Bleem,
F. R. Bouchet,
L. Bryant,
M. G. Campitiello,
J. E. Carlstrom,
C. L. Chang,
P. Chaubal
, et al. (72 additional authors not shown)
Abstract:
We present measurements of the temperature and E-mode polarization angular power spectra of the cosmic microwave background (CMB) from observations of 4% of the sky with SPT-3G, the current camera on the South Pole Telescope (SPT). The maps used in this analysis are the deepest used in a CMB TT/TE/EE analysis to date. The maps and resulting power spectra have been validated through blind and unblind tests. The measurements of the lensed EE and TE spectra are the most precise to date at l=1800-4000 and l=2200-4000, respectively. Combining our TT/TE/EE spectra with previously published SPT-3G CMB lensing results, we find parameters for the standard LCDM model consistent with Planck and ACT-DR6 with comparable constraining power. We report a Hubble constant of $H_0=66.66\pm0.60$ km/s/Mpc from SPT-3G alone, 6.2 sigma away from local measurements from SH0ES. For the first time, combined ground-based (SPT+ACT) CMB primary and lensing data have reached Planck's constraining power on some parameters, a milestone for CMB cosmology. The combination of these three CMB experiments yields the tightest CMB constraints to date, with $H_0=67.24\pm0.35$ km/s/Mpc, and the amplitude of clustering $σ_8=0.8137\pm0.0038$. CMB data alone show no evidence for physics beyond LCDM; however, we observe a 2.8 sigma difference in LCDM between CMB and baryon acoustic oscillation (BAO) results from DESI-DR2, which is relaxed in extended models. The combination of CMB and BAO yields 2-3 sigma shifts from LCDM in the curvature of the universe, the amplitude of CMB lensing, or the dark energy equation of state. It also drives mild preferences for models that address the Hubble tension through modified recombination or variations in the electron mass in a non-flat universe. This work highlights the growing power of ground-based CMB experiments and lays a foundation for further cosmological analyses with SPT-3G.
Submitted 25 June, 2025;
originally announced June 2025.
-
WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
Authors:
Hongzhen Huang,
Kunming Zhang,
Hanlong Liao,
Kui Wu,
Guoming Tang
Abstract:
The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents WattsOnAI, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, WattsOnAI offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, WattsOnAI encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable "Green AI" practices. The code is available at https://github.com/SusCom-Lab/WattsOnAI.
Submitted 25 June, 2025;
originally announced June 2025.
-
Beyond 200 Gb/s/lane: An Analytical Approach to Optimal Detection in Shaped IM-DD Optical Links with Relative Intensity Noise
Authors:
Felipe Villenas,
Kaiquan Wu,
Yunus Can Gültekin,
Jamal Riani,
Alex Alvarado
Abstract:
Next-generation intensity-modulation (IM) and direct-detection (DD) systems used in data centers are expected to operate at 400 Gb/s/lane and beyond. Such rates can be achieved by increasing the system bandwidth or the modulation format, which in turn requires maintaining or increasing the signal-to-noise ratio (SNR). Such SNR requirements can be achieved by increasing the transmitted optical power. This increase in optical power causes the emergence of relative intensity noise (RIN), a signal-dependent impairment inherent to the transmitter laser, which ultimately limits the performance of the system. In this paper, we develop an analytical symbol error rate (SER) expression for the optimal detector for the IM-DD optical link under study. The developed expression takes into account the signal-dependent nature of RIN and does not make any assumptions on the geometry or probability distribution of the constellation. Our expression is therefore applicable to general probabilistically and/or geometrically shaped systems. Unlike results available in the literature, our proposed expression provides a perfect match to numerical simulations of probabilistic and geometrically shaped systems.
Submitted 24 June, 2025;
originally announced June 2025.
-
On Error Rate Approximations for FSO Systems with Weak Turbulence and Pointing Errors
Authors:
Carmen Álvarez Roa,
Yunus Can Gültekin,
Kaiquan Wu,
Cornelis Willem Korevaar,
Alex Alvarado
Abstract:
Atmospheric attenuation, atmospheric turbulence, geometric spread, and pointing errors degrade the performance of free-space optical transmission. In the weak turbulence regime, the probability density function describing the distribution of the channel fading coefficient that models these four effects is known in the literature. This function is an integral equation, which makes it difficult to find simple analytical expressions for important performance metrics such as the bit error rate (BER) and symbol error rate (SER). In this paper, we present simple and accurate approximations of the average BER and SER for pulse-amplitude modulation (PAM) in the weak turbulence regime for an intensity modulation and direct detection system. Our numerical results show that the proposed expressions exhibit excellent accuracy when compared against Monte Carlo simulations. To demonstrate the usefulness of the developed approximations, we perform two asymptotic analyses. First, we investigate the additional transmit power required to maintain the same SER when the spectral efficiency increases by 1 bit/symbol. Second, we study the asymptotic behavior of our SER approximation for dense PAM constellations and high transmit power.
Submitted 24 June, 2025;
originally announced June 2025.
-
AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration
Authors:
Xiangbo Gao,
Yuheng Wu,
Fengze Yang,
Xuewen Luo,
Keshu Wu,
Xinghao Chen,
Yuping Wang,
Chenxi Liu,
Yang Zhou,
Zhengzhong Tu
Abstract:
While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.
Submitted 2 July, 2025; v1 submitted 23 June, 2025;
originally announced June 2025.
-
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Authors:
Yuting Zhang,
Kaishen Yuan,
Hao Lu,
Yutao Yue,
Jintai Chen,
Kaishun Wu
Abstract:
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
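The abstract mentions a Jaccard Reward function but does not define it. As a rough illustration only, the sketch below computes the standard Jaccard index between a predicted and a reference set of diagnoses. The function name `jaccard_reward`, the use of plain string labels, and the choice to return 1.0 when both sets are empty are assumptions made for this example, not details taken from the paper.

```python
from typing import Iterable


def jaccard_reward(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Jaccard index |P & R| / |P | R| between two label sets.

    Hypothetical helper: the paper's actual reward may differ in
    normalization, label parsing, or handling of empty predictions.
    """
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        # Assumption: two empty diagnosis sets count as a perfect match.
        return 1.0
    return len(pred & ref) / len(union)


if __name__ == "__main__":
    # One shared label out of three distinct labels overall.
    print(jaccard_reward({"diabetes", "hypertension"}, {"hypertension", "anemia"}))
```

A set-overlap reward of this kind gives partial credit when only some of several co-occurring diseases are identified, which is one plausible reason to prefer it over exact-match scoring in a multi-disease setting.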
Submitted 23 June, 2025;
originally announced June 2025.
-
Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization
Authors:
Jiyao Wang,
Xiao Yang,
Hao Lu,
Dengbo He,
Kaishun Wu
Abstract:
Multi-source synsemantic domain generalization (MSSDG) for multi-task remote physiological measurement seeks to enhance the generalizability of these metrics and is attracting increasing attention. However, challenges like partial labeling and environmental noise may disrupt task-specific accuracy. Meanwhile, given that real-time adaptation is necessary for personalized products, test-time personalized adaptation (TTPA) after MSSDG is also worth exploring, although the gap between previous generalization and personalization methods is significant and the two are hard to fuse. Thus, we propose a unified framework for MSSDG and TTPA employing Priors (GAP) in biometrics and remote photoplethysmography (rPPG). We first disentangle information from face videos into invariant semantics, individual bias, and noise. Then, multiple modules incorporating priors and our observations are applied in different stages and for different facial information. Finally, based on the different principles of achieving generalization and personalization, our framework can simultaneously address MSSDG and TTPA under multi-task remote physiological estimation with minimal adjustments. We expand the MSSDG benchmark to the TTPA protocol on six publicly available datasets and introduce a new real-world driving dataset with complete labeling. Extensive experiments validate our approach, and the code along with the new dataset will be released.
Submitted 19 June, 2025;
originally announced June 2025.
-
A Far-Infrared Search for Planet Nine Using AKARI All-Sky Survey
Authors:
Amos Y. -A. Chen,
Tomotsugu Goto,
Issei Yamamura,
Takao Nakagawa,
Cossas K. -W. Wu,
Terry Long Phan,
Tetsuya Hashimoto,
Yuri Uno,
Simon C. -C. Ho,
Seong Jin Kim
Abstract:
An unusual orbital element clustering of Kuiper belt objects (KBOs) has been observed. The most promising dynamic solution is the presence of a giant planet in the outer Solar system, Planet Nine. However, due to its extreme distance, intensive searches in the optical have not been successful. We aim to find Planet Nine in the far-infrared, where its black-body radiation peaks, using the most sensitive all-sky far-infrared survey to date, AKARI. In contrast to optical searches, where the energy of reflected sunlight decreases as $d^{-4}$, thermal radiation in the infrared decreases only with the square of the heliocentric distance, as $d^{-2}$. We search for moving objects in the AKARI Single Scan Detection List. We select sources from a promising region suggested by an N-body simulation from Millholland and Laughlin 2017: $30^{\circ}<$ R.A. $<50^{\circ}$ and $-20^{\circ}<$ Dec. $<20^{\circ}$. Known sources are excluded by cross-matching AKARI sources with 9 optical and infrared catalogues. Furthermore, we select sources with small background strength to avoid sources in the cirrus. Since Planet Nine is stationary on a timescale of hours but moves on a monthly scale, our primary strategy is to select slowly moving objects that are stationary in 24 hours but not in six months, using multiple single scans by AKARI. The selected slowly moving AKARI sources are scrutinised for potential contamination from cosmic rays. Our analysis reveals two possible Planet Nine candidates whose positions and flux are within the theoretical prediction ranges. These candidates warrant further investigation through follow-up observations to confirm the existence and properties of Planet Nine.
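The distance scaling quoted in the abstract can be made explicit with a short back-of-the-envelope relation. This is a sketch under two stated assumptions that are not spelled out in the abstract: the observer's distance to the object is approximately the heliocentric distance $d$ (valid for $d \gg 1$ au), and the object's thermal luminosity $L_{\mathrm{th}}$ does not itself depend on $d$. The symbols $L_{\odot}$, $L_{\mathrm{th}}$, $F_{\mathrm{refl}}$, and $F_{\mathrm{th}}$ are introduced here for illustration.

```latex
% Reflected sunlight: dilution on the way out and again on the way back.
F_{\mathrm{refl}} \;\propto\; \underbrace{\frac{L_{\odot}}{d^{2}}}_{\text{sunlight reaching the object}}
                   \times \underbrace{\frac{1}{d^{2}}}_{\text{light returning to the observer}}
                   \;\propto\; d^{-4}

% Thermal emission: a single inverse-square dilution.
F_{\mathrm{th}} \;\propto\; \frac{L_{\mathrm{th}}}{d^{2}} \;\propto\; d^{-2}

% Consequence: doubling d dims the reflected signal by 2^4 = 16,
% but the thermal signal only by 2^2 = 4.
```

Under these assumptions the infrared signal falls off far more slowly with distance than the optical one, which is the motivation given for a far-infrared search.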
Submitted 15 June, 2025;
originally announced June 2025.
-
Construction of Kondo Chains by Engineering Porphyrin π-Radicals on Au(111)
Authors:
Yan Zhao,
Kaiyue Jiang,
Peng-Yi Liu,
Ruoning Li,
Jie Li,
Xin Li,
Xinchen Fang,
Anjing Zhao,
Yutong Zhu,
Hongxiang Xu,
Ting Chen,
Dong Wang,
Xiaodong Zhuang,
Shimin Hou,
Kai Wu,
Song Gao,
Qing-Feng Sun,
Yajie Zhang,
Yongfeng Wang
Abstract:
Quantum manipulation of molecular radical spins provides a crucial platform for exploring emergent phenomena in many-body systems. Here, we combine surface-confined synthesis with scanning tunneling microscopy (STM) tip-induced dehydrogenation to achieve atom-precise engineering of quasi-one-dimensional porphyrin-based Kondo chains (1-7 units) on Au(111). Key design innovations leverage large-sized porphyrins to suppress intrachain antiferromagnetic coupling, while ${Zn}^{2+}$ chelation at porphyrin cores enhances molecule-substrate interactions to amplify the Kondo effect. High-resolution STS measurements and low-energy effective modeling collectively demonstrate that $π$-radicals at each fused-porphyrin unit form Kondo singlets screened by conduction electrons. Adjacent singlets develop direct coherent coupling via quantum-state-overlap-enabled electron tunneling. Crucially, chiral symmetry in the effective model governs the zero-mode distribution (present in odd-length chains yet absent in even-length chains), which dictates pronounced odd-even quantum effects in STS spectra of finite chains. Furthermore, geometric control emerges through conformational distortions modulated by chain fusion width. This enables directional tuning of the competition between Kondo screening and magnetic exchange. Tilted single/fused-triple-porphyrin chains weaken spin exchange through enhanced Kondo coupling, while parallel fused-double-porphyrin chains suppress Kondo screening via increased spin exchange. This opposing modulation of Kondo versus exchange interactions establishes an inverse control paradigm. This work simultaneously resolves the dimensional dependence of many-body correlations in confined quantum systems and pioneers approaches for quantum-critical manipulation in molecular spin architectures.
Submitted 12 June, 2025;
originally announced June 2025.
-
Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities
Authors:
Elizabeth Demers,
Victor Xiaoqi Wang,
Kean Wu
Abstract:
Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. We share our lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and we provide detailed examples of using our data and code, including for fine-tuning a BERT model. Researchers can use our HC lexicon (or modify the code to capture another construct of interest) with their samples of corporate communications to address pertinent HC questions. We close with a discussion of future research opportunities related to HC management and disclosure.
Submitted 11 June, 2025;
originally announced June 2025.
-
Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
Authors:
Dingcheng Zhen,
Qian Qiao,
Tan Yu,
Kangxi Wu,
Ziwei Zhang,
Siyuan Liu,
Shunshun Yin,
Ming Tao
Abstract:
We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides 2x faster inference compared to state-of-the-art methods based on AR Transformer and 112x faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.
Submitted 15 June, 2025; v1 submitted 11 June, 2025;
originally announced June 2025.
-
The geometric bookkeeping guide to Feynman integral reduction and $\varepsilon$-factorised differential equations
Authors:
Iris Bree,
Federico Gasparotto,
Antonela Matijašić,
Pouria Mazloumi,
Dmytro Melnichenko,
Sebastian Pögel,
Toni Teschke,
Xing Wang,
Stefan Weinzierl,
Konglong Wu,
Xiaofeng Xu
Abstract:
We report on three improvements in the context of Feynman integral reduction and $\varepsilon$-factorised differential equations: Firstly, we show that with a specific choice of prefactors, we trivialise the $\varepsilon$-dependence of the integration-by-parts identities. Secondly, we observe that with a specific choice of order relation in the Laporta algorithm, we directly obtain a basis of master integrals, whose differential equation on the maximal cut is in Laurent polynomial form with respect to $\varepsilon$ and compatible with a particular filtration. Thirdly, we prove that such a differential equation can always be transformed to an $\varepsilon$-factorised form. This provides a systematic algorithm to obtain an $\varepsilon$-factorised differential equation for any Feynman integral. Furthermore, the choices for the prefactors and the order relation significantly improve the efficiency of the reduction algorithm.
Submitted 10 June, 2025;
originally announced June 2025.
-
FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency
Authors:
Yifei Su,
Ning Liu,
Dong Chen,
Zhen Zhao,
Kun Wu,
Meng Li,
Zhiyuan Xu,
Zhengping Che,
Jian Tang
Abstract:
Generative modeling-based visuomotor policies have been widely adopted in robotic manipulation owing to their ability to model multimodal action distributions. However, the high inference cost of multi-step sampling limits their applicability in real-time robotic systems. To address this issue, existing approaches accelerate the sampling process in generative modeling-based visuomotor policies by adapting acceleration techniques originally developed for image generation. Despite this progress, a major distinction remains: image generation typically involves producing independent samples without temporal dependencies, whereas robotic manipulation involves generating time-series action trajectories that require continuity and temporal coherence. To effectively exploit temporal information in robotic manipulation, we propose FreqPolicy, a novel approach that is the first to impose frequency consistency constraints on flow-based visuomotor policies. Our work enables the action model to capture temporal structure effectively while supporting efficient, high-quality one-step action generation. We introduce a frequency consistency constraint that enforces alignment of frequency-domain action features across different timesteps along the flow, thereby promoting convergence of one-step action generation toward the target distribution. In addition, we design an adaptive consistency loss to capture structural temporal variations inherent in robotic manipulation tasks. We assess FreqPolicy on 53 tasks across 3 simulation benchmarks, demonstrating its superiority over existing one-step action generators. We further integrate FreqPolicy into the vision-language-action (VLA) model and achieve acceleration without performance degradation on the 40 tasks of Libero. We also show efficiency and effectiveness in real-world robotic scenarios with an inference frequency of 93.5 Hz. The code will be publicly available.
Submitted 10 June, 2025;
originally announced June 2025.
-
Regularized Adaptive Graph Learning for Large-Scale Traffic Forecasting
Authors:
Kaiqi Wu,
Weiyang Kong,
Sen Zhang,
Yubao Liu,
Zitong Chen
Abstract:
Traffic prediction is a critical task in spatial-temporal forecasting with broad applications in travel planning and urban management. Adaptive graph convolution networks have emerged as mainstream solutions due to their ability to learn node embeddings in a data-driven manner and capture complex latent dependencies. However, existing adaptive graph learning methods for traffic forecasting often either ignore the regularization of node embeddings, which account for a significant proportion of model parameters, or face scalability issues from expensive graph convolution operations. To address these challenges, we propose a Regularized Adaptive Graph Learning (RAGL) model. First, we introduce a regularized adaptive graph learning framework that synergizes Stochastic Shared Embedding (SSE) and adaptive graph convolution via a residual difference mechanism, achieving both embedding regularization and noise suppression. Second, to ensure scalability on large road networks, we develop the Efficient Cosine Operator (ECO), which performs graph convolution based on the cosine similarity of regularized embeddings with linear time complexity. Extensive experiments on four large-scale real-world traffic datasets show that RAGL consistently outperforms state-of-the-art methods in terms of prediction accuracy and exhibits competitive computational efficiency.
Submitted 8 June, 2025;
originally announced June 2025.
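The abstract above does not spell out how the Efficient Cosine Operator reaches linear time, so the following is only a generic sketch of one common way to do it, not the authors' implementation. It assumes the adjacency is the plain cosine-similarity matrix of row-normalized embeddings, and uses associativity to compute $U(U^\top X)$ instead of $(UU^\top)X$, which avoids ever building the $N \times N$ matrix. All function names are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def unit_rows(embeddings, eps=1e-12):
    """Scale every embedding row to unit L2 norm (eps guards zero rows)."""
    out = []
    for row in embeddings:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / max(norm, eps) for x in row])
    return out

def cosine_conv_linear(embeddings, features):
    """Cosine-similarity graph convolution without forming the N x N matrix.

    Cost is O(N * d * F) for N nodes, embedding size d, feature size F.
    """
    unit = unit_rows(embeddings)
    summary = matmul(transpose(unit), features)  # d x F, one pass over nodes
    return matmul(unit, summary)                 # N x F

def cosine_conv_dense(embeddings, features):
    """Reference version that materializes the N x N cosine adjacency."""
    unit = unit_rows(embeddings)
    adjacency = matmul(unit, transpose(unit))    # N x N, quadratic in N
    return matmul(adjacency, features)
```

Both functions return the same result up to floating-point rounding; only the order of multiplication, and therefore the cost in the number of nodes, differs.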
-
Spark Transformer: Reactivating Sparsity in FFN and Attention
Authors:
Chong You,
Kan Wu,
Zhipeng Jia,
Lin Chen,
Srinadh Bhojanapalli,
Jiaxian Guo,
Utku Evci,
Jan Wassenberg,
Praneeth Netrapalli,
Jeremiah J. Willcock,
Suvinay Subramanian,
Felix Chern,
Alek Andreev,
Shreya Pathak,
Felix Yu,
Prateek Jain,
David E. Culler,
Henry M. Levy,
Sanjiv Kumar
Abstract:
The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, or complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges.
This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-$k$ masking for explicit control over sparsity level. Crucially, we introduce statistical top-$k$, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-$k$ operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
Submitted 6 June, 2025;
originally announced June 2025.
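The abstract describes statistical top-$k$ only as a linear-time approximation that avoids sorting. As a rough illustration of how such an operator could work, the sketch below assumes the activations are approximately Gaussian and picks a threshold from their mean and standard deviation so that about $k$ entries survive. The Gaussian assumption, the hard masking rule, and the function name are all assumptions made for this example and are not taken from the paper.

```python
import random
from statistics import NormalDist, fmean, pstdev

def statistical_top_k_mask(values, k):
    """Zero out all but roughly the k largest entries without sorting.

    Assumes the values are approximately Gaussian: the threshold is the
    (1 - k/n) quantile of a normal fit, found in one linear pass.
    """
    n = len(values)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(values)")
    mean = fmean(values)
    std = pstdev(values, mu=mean)
    # Standard-normal quantile that leaves a k/n fraction in the upper tail.
    z = NormalDist().inv_cdf(1.0 - k / n)
    threshold = mean + z * std
    return [v if v > threshold else 0.0 for v in values]

# Usage: keep roughly 8% of 10,000 synthetic Gaussian activations.
random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
masked = statistical_top_k_mask(activations, k=800)
kept = sum(1 for v in masked if v != 0.0)
```

Because the threshold is estimated rather than exact, `kept` is close to but generally not equal to `k`; that trade of exactness for a sort-free linear pass is what this sketch is meant to show.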
-
X-ray Polarization Detection of the Pulsar Wind Nebula in G21.5-0.9 with IXPE
Authors:
Niccolò Di Lalla,
Nicola Omodei,
Niccolò Bucciantini,
Jack T. Dinsmore,
Nicolò Cibrario,
Stefano Silvestri,
Josephine Wong,
Patrick Slane,
Tsunefumi Mizuno,
Michela Negro,
Roger W. Romani,
Riccardo Ferrazzoli,
Stephen Chi-Yung Ng,
Miltiadis Michailidis,
Yi-Jung Yang,
Fei Xie,
Martin C. Weisskopf,
Philip Kaaret,
Iván Agudo,
L. A. Antonelli,
Matteo Bachetti,
Luca Baldini,
Wayne H. Baumgartner,
Ronaldo Bellazzini,
Stefano Bianchi
, et al. (76 additional authors not shown)
Abstract:
We present the X-ray polarization observation of G21.5-0.9, a young Galactic supernova remnant (SNR), conducted with the Imaging X-ray Polarimetry Explorer (IXPE) in October 2023, with a total livetime of approximately 837 ks. Using different analysis methods, such as a space-integrated study of the entire region of the PWN and a space-resolved polarization map, we detect significant polarization from the pulsar wind nebula (PWN) at the center of the SNR, with an average polarization degree of ~10% oriented at ~33° (north through east). No significant energy-dependent variation in polarization is observed across the IXPE band (2-8 keV). The polarization map, corrected for the effect of polarization leakage, reveals a consistent pattern in both degree and angle, with little change across the nebula. Our findings indicate the presence of a highly polarized central torus, suggesting low levels of turbulence at particle acceleration sites. Unlike Vela, but similar to the Crab Nebula, we observe substantial differences between radio and X-ray polarization maps. This suggests a clear separation in energy of the emitting particle populations and hints at an important, yet poorly understood, role of instabilities in the turbulence dynamics of PWNe.
Submitted 5 June, 2025;
originally announced June 2025.
-
ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning
Authors:
Zhao Jin,
Zhengping Che,
Zhen Zhao,
Kun Wu,
Yuheng Zhang,
Yinuo Zhao,
Zehui Liu,
Qiang Zhang,
Xiaozhu Ju,
Jing Tian,
Yousong Xue,
Jian Tang
Abstract:
Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, with its applicability validated across imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/ .
Submitted 5 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
Handle-based Mesh Deformation Guided By Vision Language Model
Authors:
Xingpeng Sun,
Shiyang Jia,
Zherong Pan,
Kui Wu,
Aniket Bera
Abstract:
Mesh deformation is a fundamental tool in 3D content manipulation. Despite extensive prior research, existing approaches often suffer from low output quality, require significant manual tuning, or depend on data-intensive training. To address these limitations, we introduce a training-free, handle-based mesh deformation method. Our core idea is to leverage a Vision-Language Model (VLM) to interpret and manipulate a handle-based interface through prompt engineering. We begin by applying cone singularity detection to identify a sparse set of potential handles. The VLM is then prompted to select both the deformable sub-parts of the mesh and the handles that best align with user instructions. Subsequently, we query the desired deformed positions of the selected handles in screen space. To reduce uncertainty inherent in VLM predictions, we aggregate the results from multiple camera views using a novel multi-view voting scheme. Across a suite of benchmarks, our method produces deformations that align more closely with user intent, as measured by CLIP and GPTEval3D scores, while introducing low distortion -- quantified via membrane energy. In summary, our approach is training-free, highly automated, and consistently delivers high-quality mesh deformations.
Submitted 4 June, 2025;
originally announced June 2025.
-
SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models
Authors:
Meng Li,
Zhen Zhao,
Zhengping Che,
Fei Liao,
Kun Wu,
Zhiyuan Xu,
Pei Ren,
Zhao Jin,
Ning Liu,
Jian Tang
Abstract:
Robots deployed in dynamic environments must not only follow diverse language instructions but also adapt flexibly when user intent changes mid-execution. While recent Vision-Language-Action (VLA) models have advanced multi-task learning and instruction following, they typically assume static task intent, failing to respond when new instructions arrive during ongoing execution. This limitation hinders natural and robust interaction in dynamic settings, such as retail or household environments, where real-time intent changes are common. We propose SwitchVLA, a unified, execution-aware framework that enables smooth and reactive task switching without external planners or additional switch-specific data. We model task switching as a behavior modulation problem conditioned on execution state and instruction context. Expert demonstrations are segmented into temporally grounded contact phases, allowing the policy to infer task progress and adjust its behavior accordingly. A multi-behavior conditional policy is then trained to generate flexible action chunks under varying behavior modes through conditioned trajectory modeling. Experiments in both simulation and real-world robotic manipulation demonstrate that SwitchVLA enables robust instruction adherence, fluid task switching, and strong generalization, outperforming prior VLA baselines in both task success rate and interaction naturalness.
Submitted 4 June, 2025;
originally announced June 2025.
-
Text-guided Generation of Efficient Personalized Inspection Plans
Authors:
Xingpeng Sun,
Zherong Pan,
Xifeng Gao,
Kui Wu,
Aniket Bera
Abstract:
We propose a training-free, Vision-Language Model (VLM)-guided approach for efficiently generating trajectories to facilitate target inspection planning based on text descriptions. Unlike existing Vision-and-Language Navigation (VLN) methods designed for general agents in unknown environments, our approach specifically targets the efficient inspection of known scenes, with widespread applications in fields such as medical, marine, and civil engineering. Leveraging VLMs, our method first extracts points of interest (POIs) from the text description, then identifies a set of waypoints from which POIs are both salient and align with the spatial constraints defined in the prompt. Next, we interact with the VLM to iteratively refine the trajectory, preserving the visibility and prominence of the POIs. Further, we solve a Traveling Salesman Problem (TSP) to find the most efficient visitation order that satisfies the order constraint implied in the text description. Finally, we apply trajectory optimization to generate smooth, executable inspection paths for aerial and underwater vehicles. We have evaluated our method across a series of both handcrafted and real-world scanned environments. The results demonstrate that our approach effectively generates inspection planning trajectories that adhere to user instructions.
Submitted 3 June, 2025;
originally announced June 2025.
-
A New 5 bit/2D-symbol Modulation Format for Relative Intensity Noise-dominated IM-DD Systems
Authors:
Felipe Villenas,
Kaiquan Wu,
Yunus Can Gültekin,
Jamal Riani,
Alex Alvarado
Abstract:
We propose a novel 5-bit/2D-symbol modulation format based on PAM-6 optimized for IM-DD systems dominated by relative intensity noise. The proposed modulation scheme improves SNR by 0.94 dB compared to conventional PAM-6 and achieves near-optimal BER performance.
Submitted 2 June, 2025;
originally announced June 2025.
-
Modeling All-Atom Glycan Structures via Hierarchical Message Passing and Multi-Scale Pre-training
Authors:
Minghao Xu,
Jiaze Song,
Keming Wu,
Xiangxin Zhou,
Bin Cui,
Wentao Zhang
Abstract:
Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while they neglected the atomic structures underlying each monosaccharide, which are actually important indicators of glycan properties. We fill this gap by introducing the GlycanAA model for All-Atom-wise Glycan modeling. GlycanAA models a glycan as a heterogeneous graph with monosaccharide nodes representing its global backbone structure and atom nodes representing its local atomic-level structures. Based on such a graph, GlycanAA performs hierarchical message passing to capture interactions ranging from local atomic-level to global monosaccharide-level. To further enhance model capability, we pre-train GlycanAA on a high-quality unlabeled glycan dataset, deriving the PreGlycanAA model. We design a multi-scale mask prediction algorithm to endow the model with knowledge of different levels of dependencies in a glycan. Extensive benchmark results show the superiority of GlycanAA over existing glycan encoders and verify the further improvements achieved by PreGlycanAA. We maintain all resources at https://github.com/kasawa1234/GlycanAA
Submitted 2 June, 2025;
originally announced June 2025.
-
Millimeter-wave observations of Euclid Deep Field South using the South Pole Telescope: A data release of temperature maps and catalogs
Authors:
M. Archipley,
A. Hryciuk,
L. E. Bleem,
K. Kornoelje,
M. Klein,
A. J. Anderson,
B. Ansarinejad,
M. Aravena,
L. Balkenhol,
P. S. Barry,
K. Benabed,
A. N. Bender,
B. A. Benson,
F. Bianchini,
S. Bocquet,
F. R. Bouchet,
E. Camphuis,
M. G. Campitiello,
J. E. Carlstrom,
J. Cathey,
C. L. Chang,
S. C. Chapman,
P. Chaubal,
P. M. Chichura,
A. Chokshi
, et al. (86 additional authors not shown)
Abstract:
Context. The South Pole Telescope third-generation camera (SPT-3G) has observed over 10,000 square degrees of sky at 95, 150, and 220 GHz (3.3, 2.0, 1.4 mm, respectively) overlapping the ongoing 14,000 square-degree Euclid Wide Survey. The Euclid collaboration recently released Euclid Deep Field observations in the first quick data release (Q1). Aims. With the goal of releasing complementary millimeter-wave data and encouraging legacy science, we performed dedicated observations of a 57-square-degree field overlapping the Euclid Deep Field South (EDF-S). Methods. The observing time totaled 20 days and we reached noise depths of 4.3, 3.8, and 13.2 $μ$K-arcmin at 95, 150, and 220 GHz, respectively. Results. In this work we present the temperature maps and two catalogs constructed from these data. The emissive source catalog contains 601 objects (334 inside EDF-S) with 54% synchrotron-dominated sources and 46% thermal dust emission-dominated sources. The 5$σ$ detection thresholds are 1.7, 2.0, and 6.5 mJy in the three bands. The cluster catalog contains 217 cluster candidates (121 inside EDF-S) with median mass $M_{500c}=2.12 \times 10^{14} M_{\odot}/h_{70}$ and median redshift $z$ = 0.70, corresponding to an order-of-magnitude improvement in cluster density over previous tSZ-selected catalogs in this region (3.81 clusters per square degree). Conclusions. The overlap between SPT and Euclid data will enable a range of multiwavelength studies of the aforementioned source populations. This work serves as the first step towards joint projects between SPT and Euclid and provides a rich dataset containing information on galaxies, clusters, and their environments.
Submitted 30 May, 2025;
originally announced June 2025.
-
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Authors:
Yuting Zhang,
Hao Lu,
Qingyong Hu,
Yin Wang,
Kaishen Yuan,
Xin Liu,
Kaishun Wu
Abstract:
Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to 1) a lack of temporal modelling and 2) a conflict between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of varying difficulty for evaluating the cross-modal periodic capabilities of large models. Specifically, we adopt an "Easy to Hard Generalization" paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a "Resisting Logical Oblivion" optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code is available at https://github.com/keke-nice/Period-LLM.
Submitted 30 May, 2025;
originally announced May 2025.
-
TrackVLA: Embodied Visual Tracking in the Wild
Authors:
Shaoan Wang,
Jiazhao Zhang,
Minghan Li,
Jiahang Liu,
Anqi Li,
Kui Wu,
Fangwei Zhong,
Junzhi Yu,
Zhizheng Zhang,
He Wang
Abstract:
Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect diverse difficulty levels of recognition samples, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.
Submitted 29 May, 2025;
originally announced May 2025.
-
Can Large Language Models Match the Conclusions of Systematic Reviews?
Authors:
Christopher Polzak,
Alejandro Lozano,
Min Woo Sun,
James Burgess,
Yuhui Zhang,
Kevin Wu,
Serena Yeung-Levy
Abstract:
Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, medical specialist, and models across varying sizes (from 7B-700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.
Submitted 28 May, 2025;
originally announced May 2025.
-
PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models
Authors:
Junwen Chen,
Heyang Jiang,
Yanbin Wang,
Keming Wu,
Ji Li,
Chao Zhang,
Keiji Yanai,
Dong Chen,
Yuhui Yuan
Abstract:
Generating high-quality, multi-layer transparent images from text prompts can unlock a new level of creative control, allowing users to edit each layer as effortlessly as editing text outputs from LLMs. However, the development of multi-layer generative models lags behind that of conventional text-to-image models due to the absence of a large, high-quality corpus of multi-layer transparent data. In this paper, we address this fundamental challenge by: (i) releasing the first open, ultra-high-fidelity PrismLayers (PrismLayersPro) dataset of 200K (20K) multi-layer transparent images with accurate alpha mattes, (ii) introducing a training-free synthesis pipeline that generates such data on demand using off-the-shelf diffusion models, and (iii) delivering a strong, open-source multi-layer generation model, ART+, which matches the aesthetics of modern text-to-image generation models. The key technical contributions include: LayerFLUX, which excels at generating high-quality single transparent layers with accurate alpha mattes, and MultiLayerFLUX, which composes multiple LayerFLUX outputs into complete images, guided by human-annotated semantic layout. To ensure higher quality, we apply a rigorous filtering stage to remove artifacts and semantic mismatches, followed by human selection. Fine-tuning the state-of-the-art ART model on our synthetic PrismLayersPro yields ART+, which outperforms the original ART in 60% of head-to-head user study comparisons and even matches the visual quality of images generated by the FLUX.1-[dev] model. We anticipate that our work will establish a solid dataset foundation for the multi-layer transparent image generation task, enabling research and applications that require precise, editable, and visually compelling layered imagery.
Submitted 28 May, 2025;
originally announced May 2025.
-
Simulating the Unseen: Crash Prediction Must Learn from What Did Not Happen
Authors:
Zihao Li,
Xinyuan Cao,
Xiangbo Gao,
Kexin Tian,
Keshu Wu,
Mohammad Anis,
Hao Zhang,
Keke Long,
Jiwan Jiang,
Xiaopeng Li,
Yunlong Zhang,
Tianbao Yang,
Dominique Lord,
Zhengzhong Tu,
Yang Zhou
Abstract:
Traffic safety science has long been hindered by a fundamental data paradox: the crashes we most wish to prevent are precisely those events we rarely observe. Existing crash-frequency models and surrogate safety metrics rely heavily on sparse, noisy, and under-reported records, while even sophisticated, high-fidelity simulations undersample the long-tailed situations that trigger catastrophic outcomes such as fatalities. We argue that the path to achieving Vision Zero, i.e., the complete elimination of traffic fatalities and severe injuries, requires a paradigm shift from traditional crash-only learning to a new form of counterfactual safety learning: reasoning not only about what happened, but also about the vast set of plausible yet perilous scenarios that could have happened under slightly different circumstances. To operationalize this shift, our proposed agenda bridges macro to micro. Guided by crash-rate priors, generative scene engines, diverse driver models, and causal learning, near-miss events are synthesized and explained. A crash-focused digital twin testbed links micro scenes to macro patterns, while a multi-objective validator ensures that simulations maintain statistical realism. This pipeline transforms sparse crash data into rich signals for crash prediction, enabling the stress-testing of vehicles, roads, and policies before deployment. By learning from crashes that almost happened, we can shift traffic safety from reactive forensics to proactive prevention, advancing Vision Zero.
Submitted 27 May, 2025;
originally announced May 2025.
-
Out of the Past: An AI-Enabled Pipeline for Traffic Simulation from Noisy, Multimodal Detector Data and Stakeholder Feedback
Authors:
Rex Chen,
Karen Wu,
John McCartney,
Norman Sadeh,
Fei Fang
Abstract:
How can a traffic simulation be designed to faithfully reflect real-world traffic conditions? Past data-driven approaches to traffic simulation in the literature have relied on unrealistic or suboptimal heuristics. They also fail to adequately account for the effects of uncertainty and multimodality in the data on simulation outcomes. In this work, we integrate advances in AI to construct a three-step, end-to-end pipeline for generating a traffic simulation from detector data: computer vision for vehicle counting from camera footage, combinatorial optimization for vehicle route generation from multimodal data, and large language models for iterative simulation refinement from natural language feedback. Using a road network from Strongsville, Ohio as a testbed, we demonstrate that our pipeline can accurately capture the city's traffic patterns in a granular simulation. Beyond Strongsville, our traffic simulation framework can be generalized to other municipalities with different levels of data and infrastructure availability.
Submitted 27 May, 2025;
originally announced May 2025.
-
Wavelet Flow For Extragalactic Foreground Simulations
Authors:
M. Mebratu,
W. L. K. Wu
Abstract:
Extragalactic foregrounds in cosmic microwave background (CMB) observations are both a source of cosmological and astrophysical information and a nuisance to the CMB. Effective field-level modeling that captures their non-Gaussian statistical distributions is increasingly important for optimal information extraction, particularly given the precise and low-noise observations from current and upcoming experiments. We explore the use of Wavelet Flow (WF) models to tackle the novel task of modeling the field-level probability distributions of multi-component CMB secondaries. Specifically, we jointly train correlated CMB lensing convergence ($κ$) and cosmic infrared background (CIB) maps with a WF model and obtain a network that statistically recovers the input to high accuracy -- the trained network generates samples of $κ$ and CIB fields whose average power spectra are within a few percent of the inputs across all scales, and whose Minkowski functionals are similarly accurate compared to the inputs. Leveraging the multiscale architecture of these models, we fine-tune both the model parameters and the priors at each scale independently, optimizing performance across different resolutions. These results demonstrate that WF models can accurately simulate correlated components of CMB secondaries, supporting improved analysis of cosmological data. Our code and trained models can be found here (https://github.com/matiwosm/HybridPriorWavletFlow.git).
Submitted 27 May, 2025;
originally announced May 2025.
-
VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models
Authors:
Kui Wu,
Shuhang Xu,
Hao Chen,
Churan Wang,
Zhoujun Li,
Yizhou Wang,
Fangwei Zhong
Abstract:
We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Vision-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines the off-the-shelf active tracking methods with VLMs' reasoning capabilities, deploying a fast visual policy for normal tracking and activating VLM reasoning only upon failure detection. The framework features a memory-augmented self-reflection mechanism that enables the VLM to progressively improve by learning from past experiences, effectively addressing VLMs' limitations in 3D spatial reasoning. Experimental results demonstrate significant performance improvements, with our framework boosting success rates by 72% with state-of-the-art RL-based approaches and 220% with PID-based methods in challenging environments. This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, offering substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments. Project website: https://sites.google.com/view/evt-recovery-assistant.
Submitted 28 May, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
Hierarchical Instruction-aware Embodied Visual Tracking
Authors:
Kui Wu,
Hao Chen,
Churan Wang,
Fakhri Karray,
Zhoujun Li,
Yizhou Wang,
Fangwei Zhong
Abstract:
User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which bridges instruction comprehension and action generation using spatial goals as intermediaries. HIEVT first introduces an LLM-based Semantic-Spatial Goal Aligner to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then the RL-based Adaptive Goal-Aligned Policy, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at https://sites.google.com/view/hievt.
Submitted 27 May, 2025;
originally announced May 2025.
-
Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning
Authors:
Ruolin Shen,
Xiaozhong Ji,
Kai WU,
Jiangning Zhang,
Yijun He,
HaiHua Yang,
Xiaobin Hu,
Xiaoyu Sun
Abstract:
Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes which effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimics human visual camouflaged perception to progressively and iteratively 'refocus' on visually concealed content. The refocus is a progressive guidance mechanism enabling models to logically localize objects in visual images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior cognitive knowledge. In this paper, we propose a visual refocus reinforcement framework via the policy optimization algorithm to encourage multi-modal models to think and refocus more before answering, and achieve excellent reasoning abilities to align and even surpass human camouflaged perception systems. Our extensive experiments on camouflaged perception successfully demonstrate the emergence of refocus visual phenomena, characterized by multiple reasoning tokens and dynamic adjustment of the detection box. Besides, experimental results on both camouflaged object classification and detection tasks exhibit significantly superior performance compared to Supervised Fine-Tuning (SFT) baselines.
Submitted 26 May, 2025;
originally announced May 2025.
-
Water Level Sensing via Communication Signals in a Bi-Static System
Authors:
Zhongqin Wang,
J. Andrew Zhang,
Kai Wu,
Y. Jay Guo
Abstract:
Accurate water level sensing is essential for flood monitoring, agricultural irrigation, and water resource optimization. Traditional methods require dedicated sensor deployments, leading to high installation costs, vulnerability to interference, and limited resolution. This work proposes PMNs-WaterSense, a novel scheme leveraging Channel State Information (CSI) from existing mobile networks for water level sensing. Our scheme begins with a CSI-power method to eliminate phase offsets caused by clock asynchrony in bi-static systems. We then apply multi-domain filtering across the time (Doppler), frequency (delay), and spatial (Angle-of-Arrival, AoA) domains to extract phase features that finely capture variations in path length over water. To resolve the $2π$ phase ambiguity, we introduce a Kalman filter-based unwrapping technique. Additionally, we exploit transceiver geometry to convert path length variations into water level height changes, even with limited antenna configurations. We validate our framework through controlled experiments with 28 GHz mmWave and 3.1 GHz LTE signals in real time, achieving average height estimation errors of 0.025 cm and 0.198 cm, respectively. Moreover, real-world river monitoring with 2.6 GHz LTE signals achieves an average error of 4.8 cm for a 1-meter water level change, demonstrating its effectiveness in practical deployments.
Submitted 26 May, 2025;
originally announced May 2025.
-
DiffHairCard: Auto Hair Card Extraction with Differentiable Rendering
Authors:
Zhongtian Zheng,
Tao Huang,
Haozhe Su,
Xueqi Ma,
Yuefan Shen,
Tongtong Wang,
Yin Yang,
Xifeng Gao,
Zherong Pan,
Kui Wu
Abstract:
Hair cards remain a widely used representation for hair modeling in real-time applications, offering a practical trade-off between visual fidelity, memory usage, and performance. However, generating high-quality hair card models remains a challenging and labor-intensive task. This work presents an automated pipeline for converting strand-based hair models into hair card models with a limited number of cards and textures while preserving the hairstyle appearance. Our key idea is a novel differentiable representation where each strand is encoded as a projected 2D spline in the texture space, which enables efficient optimization with differentiable rendering and structured results respecting the hair geometry. Based on this representation, we develop a novel algorithm pipeline, where we first cluster hair strands into initial hair cards and project the strands into the texture space. We then conduct a two-stage optimization where our first stage optimizes the texture and geometry of each hair card separately, and after texture reduction, our second stage conducts joint optimization of all the cards for fine-tuning. Put together, our method is evaluated on a wide range of hairstyles, including straight, wavy, curly, and coily hairs. To better capture the appearance of short or coily hair, we additionally support hair cap and cross-card. Furthermore, our framework supports seamless LoD transitions via texture sharing, balancing texture memory efficiency and visual quality.
Submitted 24 May, 2025;
originally announced May 2025.
-
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
Authors:
Tencent Hunyuan Team,
Ao Liu,
Botong Zhou,
Can Xu,
Chayse Zhou,
ChenChen Zhang,
Chengcheng Xu,
Chenhao Wang,
Decheng Wu,
Dengpeng Wu,
Dian Jiao,
Dong Du,
Dong Wang,
Feng Zhang,
Fengzong Lian,
Guanghui Xu,
Guanwei Zhang,
Hai Wang,
Haipeng Luo,
Han Hu,
Huilin Xu,
Jiajia Wu,
Jianchen Zhu,
Jianfeng Yan,
Jiaqi Zhu
, et al. (230 additional authors not shown)
Abstract:
As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
Submitted 4 July, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks
Authors:
Yuanze Hu,
Zhaoxin Fan,
Xinyu Wang,
Gen Li,
Ye Qiu,
Zhichao Yang,
Wenjun Wu,
Kejian Wu,
Yifan Sun,
Xiaotie Deng,
Jin Dong
Abstract:
Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight models with limited representational capacity. In this work, we investigate this alignment bottleneck through the lens of mutual information, demonstrating that the constrained capacity of the language model inherently limits the Effective Mutual Information (EMI) between multimodal inputs and outputs, thereby compromising alignment quality. To address this challenge, we propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank to enrich multimodal inputs and enhance their alignment. Extensive empirical evaluations reveal that TinyAlign significantly reduces training loss, accelerates convergence, and enhances task performance. Remarkably, it allows models to achieve baseline-level performance with only 40% of the fine-tuning data, highlighting exceptional data efficiency. Our work thus offers a practical pathway for developing more capable lightweight VLMs while introducing a fresh theoretical lens to better understand and address alignment bottlenecks in constrained multimodal systems.
Submitted 30 June, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
CALM: Co-evolution of Algorithms and Language Model for Automatic Heuristic Design
Authors:
Ziyao Huang,
Weiwei Wu,
Kui Wu,
Jianping Wang,
Wei-Bin Lee
Abstract:
Tackling complex optimization problems often relies on expert-designed heuristics, typically crafted through extensive trial and error. Recent advances demonstrate that large language models (LLMs), when integrated into well-designed evolutionary search frameworks, can autonomously discover high-performing heuristics at a fraction of the traditional cost. However, existing approaches predominantly rely on verbal guidance, i.e., manipulating the prompt generation process, to steer the evolution of heuristics, without adapting the underlying LLM. We propose a hybrid framework that combines verbal and numerical guidance, the latter achieved by fine-tuning the LLM via reinforcement learning based on the quality of generated heuristics. This joint optimization allows the LLM to co-evolve with the search process. Our method outperforms state-of-the-art (SOTA) baselines across various optimization tasks, running locally on a single 24GB GPU using a 7B model with INT4 quantization. It surpasses methods that rely solely on verbal guidance, even when those use significantly more powerful API-based models.
Submitted 18 May, 2025;
originally announced May 2025.
-
MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports
Authors:
Kevin Wu,
Eric Wu,
Rahul Thapa,
Kevin Wei,
Angela Zhang,
Arvind Suresh,
Jacqueline J. Tao,
Min Woo Sun,
Alejandro Lozano,
James Zou
Abstract:
Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.
Submitted 20 May, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
-
Disentangling Reasoning and Knowledge in Medical Large Language Models
Authors:
Rahul Thapa,
Qingyang Wu,
Kevin Wu,
Harrison Zhang,
Angela Zhang,
Eric Wu,
Haotian Ye,
Suhana Bedi,
Nevin Aresh,
Joseph Boen,
Shriya Reddy,
Ben Athiwaratkun,
Shuaiwen Leon Song,
James Zou
Abstract:
Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, HuatuoGPT-o1 scores 56.9 on knowledge but only 44.8 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
Submitted 23 June, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
-
Approximation-First Timeseries Monitoring Query At Scale
Authors:
Zeying Zhu,
Jonathan Chamberlain,
Kenny Wu,
David Starobinski,
Zaoxing Liu
Abstract:
Timeseries monitoring systems such as Prometheus play a crucial role in gaining observability of the underlying system components. These systems collect timeseries metrics from various system components and perform monitoring queries over periodic window-based aggregations (i.e., rule queries). However, despite wide adoption, the operational costs and query latency of rule queries remain high. In this paper, we identify major bottlenecks associated with repeated data scans and query computations caused by window overlaps in rule queries, and present PromSketch, an approximation-first query framework that serves as an intermediate cache for monitoring systems. It enables low operational costs and query latency by combining approximate window-based query frameworks and sketch-based precomputation. PromSketch is implemented as a standalone module that can be integrated into Prometheus and VictoriaMetrics, covering 70% of Prometheus' aggregation-over-time queries. Our evaluation shows that PromSketch reduces query latency by up to two orders of magnitude compared with Prometheus and VictoriaMetrics, while lowering the operational dollar costs of query processing by two orders of magnitude compared to Prometheus and by at least 4x compared to VictoriaMetrics, with at most 5% average error across statistics. The source code is available at https://github.com/Froot-NetSys/promsketch.
Submitted 15 May, 2025;
originally announced May 2025.
-
Multiple parameter bifurcations in a modified Gower-Leslie predator-prey system with addictive Allee effect
Authors:
Xiaoling Wang,
Kuilin Wu,
Lan Zou
Abstract:
In this paper, we explore a modified Leslie-Gower type predator-prey model with Holling I functional response and additive Allee effect in prey. It is shown that the highest codimension of a nilpotent cusp is 4, and the model can undergo degenerate Bogdanov-Takens bifurcation of codimension 4. Besides, when the model has a center-type equilibrium, we show that it is a weak focus with order 4, and the model can exhibit Hopf bifurcation of codimension 5. Our results indicate that the additive Allee effect can induce not only richer dynamics and bifurcations, but also the coextinction of both populations for some positive initial densities. Finally, numerical simulations, including three limit cycles and four limit cycles, are presented to illustrate the theoretical results.
Submitted 14 May, 2025;
originally announced May 2025.
-
Generative AI for Autonomous Driving: Frontiers and Opportunities
Authors:
Yuping Wang,
Shuo Xing,
Cui Can,
Renjie Li,
Hongyuan Hua,
Kexin Tian,
Zhaobin Mo,
Xiangbo Gao,
Keshu Wu,
Sulong Zhou,
Hengxu You,
Juntong Peng,
Junge Zhang,
Zehao Wang,
Rui Song,
Mingxuan Yan,
Walter Zimmer,
Xingcheng Zhou,
Peiran Li,
Zhaohan Lu,
Chia-Ju Chen,
Yue Huang,
Ryan A. Rossi,
Lichao Sun,
Hongkai Yu
, et al. (22 additional authors not shown)
Abstract:
Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, and video generation, as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.
Submitted 13 May, 2025;
originally announced May 2025.