Search | arXiv e-print repository

Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

Authors: Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu

Abstract: Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.20997 [pdf, ps, other]

A Glimpse of Satellite Galaxies in the Milky Way with the 2.5-meter Wide Field Survey Telescope (WFST): Bootes III and Draco

Authors: Chao Yang, Zhizheng Pan, Min Fang, Xian Zhong Zheng, Binyang Liu, Guoliang Li, Tian-Rui Sun, Ji-An Jiang, Miaomiao Zhang, Zhen Wan, Shuang Liu, Han Qu, Ji Yang, Xu Kong, Wenhao Liu, Yiping Shu, Jiang Chang, Tinggui Wang, Lulu Fan, Yongquan Xue, Wentao Luo, Hongxin Zhang, Zheng Lou, Haibin Zhao, Bin Li, et al. (12 additional authors not shown)

Abstract: We carry out deep imaging of the Milky Way satellite galaxies, Bootes III and Draco, with WFST as one pilot observing program to demonstrate the capability of WFST. Combining catalogs with PS1 DR2 and Gaia DR3, we derive proper motions for candidate member stars in these two satellite galaxies over a 12-year time baseline, yielding uncertainties of ~1.8 mas/yr at 21 mag and ~3.0 mas/yr at 22 mag in the r band. The proper motions derived from bright and faint stars are consistent, indicating no significant variation in proper motion across stellar luminosity as these galaxies undergo tidal interactions with the MW. Meanwhile, we suggest that Bootes III represents the bound remnant of the progenitor galaxy that gave rise to the Styx stream, as evidenced by its elongated density profile and overdensity in both spatial and kinematic space. This is the first paper to use WFST to measure the proper motions of faint stars in Milky Way satellite galaxies. More detailed analyses will be presented in forthcoming papers from the wide field survey (WFS) program.

Submitted 26 June, 2025; originally announced June 2025.

Comments: 17 pages, 12 figures, 3 tables. Accepted for publication in ApJ

arXiv:2506.20981 [pdf, ps, other]

PrivacyGo: Privacy-Preserving Ad Measurement with Multidimensional Intersection

Authors: Jian Du, Haohao Qian, Shikun Zhang, Wen-jie Lu, Donghang Lu, Yongchuan Niu, Bo Jiang, Yongjun Zhao, Qiang Yan

Abstract: This paper tackles the challenging and practical problem of multi-identifier private user profile matching for privacy-preserving ad measurement, a cornerstone of modern advertising analytics. We introduce a comprehensive cryptographic framework leveraging reversed Oblivious Pseudorandom Functions (OPRF) and novel blind key rotation techniques to support secure matching across multiple identifiers. Our design prevents cross-identifier linkages and includes a differentially private mechanism to obfuscate intersection sizes, mitigating risks such as membership inference attacks. We present a concrete construction of our protocol that achieves both strong privacy guarantees and high efficiency. It scales to large datasets, offering a practical and scalable solution for privacy-centric applications like secure ad conversion tracking. By combining rigorous cryptographic principles with differential privacy, our work addresses a critical need in the advertising industry, setting a new standard for privacy-preserving ad measurement frameworks.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20963 [pdf, ps, other]

EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora

Authors: Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou

Abstract: Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. However, existing approaches typically assume a static corpus, requiring expensive full-graph reconstruction whenever new documents arrive, limiting their scalability in dynamic, evolving environments. To address these limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework that supports efficient and scalable dynamic updates. Our method leverages hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the original corpus into hierarchical graph structures, enabling efficient and localized insertions of new data without disrupting the existing topology. The design eliminates the need for retraining or costly recomputation while preserving high retrieval accuracy and low latency. Experiments on large-scale benchmarks demonstrate that EraRAG achieves up to an order of magnitude reduction in update time and token consumption compared to existing Graph-RAG systems, while providing superior accuracy. This work offers a practical path forward for RAG systems that must operate over continually growing corpora, bridging the gap between retrieval efficiency and adaptability. Our code and data are available at https://github.com/EverM0re/EraRAG-Official.

Submitted 25 June, 2025; originally announced June 2025.

Comments: Under review
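
Note on the EraRAG entry above: the abstract names hyperplane-based LSH as the partitioning step. The following is a minimal sketch of generic random-hyperplane hashing, not the authors' implementation. The function name `lsh_bucket_keys`, the dimensions, and the random data are all hypothetical and chosen only for illustration. It shows why an insertion can stay local: the hyperplanes are fixed once, so a new vector is hashed on its own and existing bucket assignments never change.

```python
import numpy as np

def lsh_bucket_keys(embeddings, hyperplanes):
    """Map each row vector to an integer bucket key.

    Bit i of the key is 1 when the vector lies on the non-negative
    side of hyperplane i (sign of the dot product with its normal).
    """
    bits = (embeddings @ hyperplanes.T) >= 0          # shape (n, k), boolean
    weights = 2 ** np.arange(hyperplanes.shape[0])    # 1, 2, 4, ...
    return bits.astype(np.int64) @ weights            # shape (n,)

rng = np.random.default_rng(0)
dim, n_planes = 16, 4                                 # illustrative sizes (assumption)
hyperplanes = rng.standard_normal((n_planes, dim))    # drawn once, then kept fixed

corpus = rng.standard_normal((100, dim))              # stand-in for document embeddings
keys = lsh_bucket_keys(corpus, hyperplanes)

# Inserting a new document: hash it with the same hyperplanes.
# Existing keys are untouched, so only one bucket is affected.
new_doc = rng.standard_normal((1, dim))
new_key = lsh_bucket_keys(new_doc, hyperplanes)[0]
```

With `n_planes` hyperplanes there are at most `2 ** n_planes` buckets; how EraRAG turns such buckets into a multi-layered graph is described in the paper, not in this sketch.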

arXiv:2506.20960 [pdf, ps, other]

OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs

Authors: Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, Kai Han

Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, 285 Chinese videos and 525 English videos; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. Then we conduct experiments on OmniEval with several omni-modality models. We hope that our OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. Code and data can be found at https://omnieval-benchmark.github.io/.

Submitted 29 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20761 [pdf, ps, other]

doi 10.1145/3717823.3718300

A Framework for Building Data Structures from Communication Protocols

Authors: Alexandr Andoni, Shunhua Jiang, Omri Weinstein

Abstract: We present a general framework for designing efficient data structures for high-dimensional pattern-matching problems ($\exists \;? i\in[n], f(x_i,y)=1$) through communication models in which $f(x,y)$ admits sublinear communication protocols with exponentially-small error. Specifically, we reduce the data structure problem to the Unambiguous Arthur-Merlin (UAM) communication complexity of $f(x,y)$ under product distributions. We apply our framework to the Partial Match problem (a.k.a. matching with wildcards), whose underlying communication problem is sparse set-disjointness. When the database consists of $n$ points in dimension $d$, and the number of $\star$'s in the query is at most $w = c\log n \;(\ll d)$, the fastest known linear-space data structure (Cole, Gottlieb and Lewenstein, STOC'04) had query time $t \approx 2^w = n^c$, which is nontrivial only when $c<1$. By contrast, our framework produces a data structure with query time $n^{1-1/(c \log^2 c)}$ and space close to linear. To achieve this, we develop a one-sided $\varepsilon$-error communication protocol for Set-Disjointness under product distributions with $\tilde{\Theta}(\sqrt{d\log(1/\varepsilon)})$ complexity, improving on the classical result of Babai, Frankl and Simon (FOCS'86). Building on this protocol, we show that the Unambiguous AM communication complexity of $w$-Sparse Set-Disjointness with $\varepsilon$-error under product distributions is $\tilde{O}(\sqrt{w \log(1/\varepsilon)})$, independent of the ambient dimension $d$, which is crucial for the Partial Match result. Our framework sheds further light on the power of data-dependent data structures, which is instrumental for reducing to the (much easier) case of product distributions.

Submitted 25 June, 2025; originally announced June 2025.

Comments: 53 pages, STOC 2025
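
Note on the entry above: the query-time identity quoted in the abstract can be checked in one line, assuming the logarithm in $w = c\log n$ is taken base 2 (the abstract does not state the base):

$$2^{w} = 2^{c\log_2 n} = \left(2^{\log_2 n}\right)^{c} = n^{c}.$$

For $c \ge 1$ this is at least $n$, the number of database points a linear scan would inspect, which is consistent with the abstract's remark that the bound is nontrivial only when $c<1$.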

arXiv:2506.20558 [pdf, ps, other]

CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency

Authors: Renyi Zhong, Yintong Huo, Wenwei Gu, Jinxi Kuang, Zhihan Jiang, Guangba Yu, Yichen Li, David Lo, Michael R. Lyu

Abstract: Comments within code serve as a crucial foundation for software documentation, helping developers communicate and understand the code effectively. However, code-comment inconsistency (CCI) can negatively affect software development, testing, and maintenance. Recent efforts to mitigate this issue have emerged, but existing studies often suffer from inaccurate datasets and inadequate solutions, weakening their practical effectiveness. In this study, we first conduct a quantitative analysis of existing datasets, revealing that a substantial portion of the sampled data is mislabeled. To address these data limitations, we introduce CCIBench, a refined dataset comprising high-quality data, to support the training and evaluation of method-level CCI methods. Furthermore, we present an innovative end-to-end LLM-based framework, CCISolver, designed to improve code quality by identifying and rectifying CCIs. Comprehensive evaluations demonstrate CCISolver's superior performance. For detection, it establishes a new state-of-the-art with an F1-score of 89.54%. In the fixing task, it achieves a remarkable 18.84% relative improvement in GLEU score over the strongest baseline. This superiority is confirmed by human evaluation, where CCISolver's fixing success rate of 0.6533 significantly surpasses existing methods. Critically, in a practical end-to-end setting, CCISolver's innovative architecture is approximately 36% faster for inference than the baseline model, underscoring its scalability and real-world applicability.

Submitted 25 June, 2025; originally announced June 2025.

Comments: This manuscript is under review

arXiv:2506.20532 [pdf, ps, other]

RapidGBM: An Efficient Tool for Fermi-GBM Visibility Checking and Data Analysis with a Case Study of EP240617a

Authors: Yun Wang, Jia Ren, Lu-Yao Jiang, Hao Zhou, Yi-Han Iris Yin, Yi-Fang Liang, Zhi-Ping Jin, Yi-Zhong Fan, Da-Ming Wei, Wei Chen, Hui Sun, Jing-Wei Hu, Dong-Yue Li, Jun Yang, Wen-Da Zhang, Yuan Liu, Wei-Min Yuan, Xue-Feng Wu

Abstract: We have developed a lightweight tool, RapidGBM, featuring a web-based interface and capabilities for rapid calculation of Fermi-GBM visibilities and basic data analysis. It has two key features: (1) immediately check the visibility of Fermi-GBM for new transients, and (2) check the light curve and perform spectral analysis after the hourly TTE data is released. The visibility check and the response matrix generation required for spectral analysis can be achieved through the historical pointing file after the orbit calculation, even when the real-time pointing file is not yet available. As a case, we apply the tool to EP240617a, an X-ray transient triggered by Einstein Probe (EP). We demonstrate the workflow of visibility checking, data processing, and spectral analysis for this event. The results suggest that EP240617a can be classified as an X-ray-rich GRB (XRR) and confirm the feasibility of using historical pointing files for rapid analysis. Further, we discuss possible physical interpretations of such events, including implications for jet launching and progenitor scenarios. Therefore, RapidGBM is expected to assist Einstein Probe Transient Advocates (EP-TAs), Space-based multi-band astronomical Variable Objects Monitor Burst Advocates (SVOM-BAs), and other members of the community in cross-checking high-energy transients. Based on prompt emission parameter relations (e.g. $E_{\rm p}$-$E_{\gamma,\rm iso}$), it can also help identify peculiar GRBs (e.g. long-short burst, magnetar giant flare, etc.) and provide useful references (e.g. more accurate $T_0$) for scheduling follow-up observations.

Submitted 25 June, 2025; originally announced June 2025.

Comments: 11 pages, 7 figures, 1 table

arXiv:2506.20510 [pdf]

High-temperature helical edge states in BiSbTeSe$_2$/graphene van der Waals heterostructure

Authors: Yoichi Tanabe, Ngoc Han Tu, Ming-Chun Jiang, Yi Ling Chiew, Mitsutaka Haruta, Kiyohiro Adachi, David Pomaranski, Ryo Ito, Yuya Shimazaki, Daisuke Hashizume, Xiuzhen Yu, Guang-Yu Guo, Ryotaro Arita, Michihisa Yamamoto

Abstract: Van der Waals heterostructures have been used to tailor atomic layers into various artificial materials through interactions at heterointerfaces. The interplay between the band gap created by the band folding of the interfacial potential and the band inversion driven by enhanced spin-orbit interaction (SOI) through band hybridization enables us to realize a two-dimensional topological insulator (2D-TI). Here we report the realization of graphene 2D-TIs by epitaxial growth of three-dimensional topological insulator (3D-TI) BiSbTeSe$_2$ ultrathin films on graphene. By increasing the BiSbTeSe$_2$ thickness from 2 nm to 9 nm to enhance SOI on graphene, the electronic state is altered from the trivial Kekulé insulator to the 2D-TI. The nonlocal transport reveals the helical edge conduction which survives up to 200 K at maximum. Our graphene 2D-TI is stable, easy to make electrical contacts, and of high quality. It offers various applications including spin-current conversion and platforms for Majorana fermions in junctions to superconductors.

Submitted 25 June, 2025; originally announced June 2025.

Comments: 31 pages, 4 figures, and 7 supporting figures

arXiv:2506.20502 [pdf]

Probing Solar Polar Regions

Authors: Yuanyong Deng, Hui Tian, Jie Jiang, Shuhong Yang, Hao Li, Robert Cameron, Laurent Gizon, Louise Harra, Robert F. Wimmer-Schweingruber, Frédéric Auchère, Xianyong Bai, Luis Bellot Rubio, Linjie Chen, Pengfei Chen, Lakshmi Pradeep Chitta, Jackie Davies, Fabio Favata, Li Feng, Xueshang Feng, Weiqun Gan, Don Hassler, Jiansen He, Junfeng Hou, Zhenyong Hou, Chunlan Jin, et al. (23 additional authors not shown)

Abstract: The magnetic fields and dynamical processes in the solar polar regions play a crucial role in the solar magnetic cycle and in supplying mass and energy to the fast solar wind, ultimately being vital in controlling solar activities and driving space weather. Despite numerous efforts to explore these regions, to date no imaging observations of the Sun's poles have been achieved from vantage points out of the ecliptic plane, leaving their behavior and evolution poorly understood. This observation gap has left three top-level scientific questions unanswered: 1) How does the solar dynamo work and drive the solar magnetic cycle? 2) What drives the fast solar wind? 3) How do space weather processes globally originate from the Sun and propagate throughout the solar system? The Solar Polar-orbit Observatory (SPO) mission, a solar polar exploration spacecraft, is proposed to address these three unanswered scientific questions by imaging the Sun's poles from high heliolatitudes. In order to achieve its scientific goals, SPO will carry six remote-sensing and four in-situ instruments to measure the vector magnetic fields and Doppler velocity fields in the photosphere, to observe the Sun in the extreme ultraviolet, X-ray, and radio wavelengths, to image the corona and the heliosphere up to 45 $R_\odot$, and to perform in-situ detection of magnetic fields, and low- and high-energy particles in the solar wind.

Submitted 28 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted for publication in Chinese Journal of Space Science

arXiv:2506.20479 [pdf, ps, other]

The MALATANG survey: Dense gas distribution on sub-kiloparsec scales across the disk of M82

Authors: Jian-Fa Wang, Yu Gao, Qing-Hua Tan, Xue-Jian Jiang, Li Ji, Zhi-Yu Zhang, Jun-Zhi Wang, Jun-Feng Wang, R. Thomas Greve, Yan Jiang, Ashley Bemis, Elias Brinks, Aeree Chung, J. Malcolm Currie, Richard de Grijs, Taotao Fang, C. Luis Ho, Bumhyun Lee, Satoki Matsushita, Michał Michałowski, Soojong Pak, Panomporn Poojon, G. Mark Rawlings, Amelie Saintonge, Yi-Chen Sun, et al. (1 additional author not shown)

Abstract: We present observations of HCN J=4-3 and HCO^+ J=4-3 lines obtained with the James Clerk Maxwell Telescope as part of the MALATANG survey, combined with archival HCN J=1-0 and HCO^+ J=1-0 data from the Green Bank Telescope, to study the spatial distribution and excitation conditions of dense molecular gas in the disk of M82. We detect HCN J=4-3 and HCO^+ J=4-3 emission within the central region (< 500 pc) of the galaxy, while the J=1-0 emission lines exhibit a more extended spatial distribution (> 700 pc). The dense gas shows a clear double-lobed structure in both spatial distribution and kinematics, with the HCN and HCO^+ J=4-3 lines in the southwest lobe blueshifted by ~ 40 km/s relative to the J=1-0 lines. The HCN J=4-3/1-0 and HCO^+ J=4-3/1-0 line-luminosity ratios range from 0.09 to 0.53 and from 0.14 to 0.87, respectively, with mean values of 0.18 +/- 0.04 and 0.36 +/- 0.06. The HCN ratio is lower than the typical average observed in nearby star-forming galaxies, whereas the HCO^+ ratio is comparatively higher, suggesting that the high-J HCN emission in M82 is significantly sub-thermally excited. Spatially, the peak values of the J=4-3/1-0 ratios are found in the northwest region of M82, coinciding with the galaxy-scale outflow. Elevated HCN/HCO^+ ratios are also detected in roughly the same area, potentially tracing local excitation enhancements driven by the outflow. The HCN/HCO^+ J=4-3 ratio across all detected regions ranges from 0.19 to 1.07 with a mean value of 0.41 +/- 0.11, which is significantly lower than the average J=1-0 ratio of 0.76 +/- 0.08. Both ratios are significantly lower than the average values observed in nearby star-forming galaxies, which could be related to the relatively low gas density and the presence of an extended photo-dissociation region in M82.

Submitted 26 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20414 [pdf, ps, other]

Shell effects in nuclear charge radii based on Skyrme density functionals

Authors: Rong An, Shuai Sun, Xiang Jiang, Na Tang, Li-Gang Cao, Feng-Shou Zhang

Abstract: A unified description of the charge radii throughout the entire nuclide chart plays an essential role in our understanding of nuclear structure and fundamental nuclear interactions. In this work, the influence of a new term, which captures the spirit of neutron and proton pair condensation around the Fermi surface, on the charge radii has been investigated based on the Skyrme density functionals with the effective forces SLy5 and SkM$^{*}$. The differential charge radii of even-even Ca, Ni, Sn, and Pb isotopes are employed to evaluate the validity of this theoretical model. Meanwhile, the results obtained by the relativistic density functional with the effective Lagrangian NL3 are also shown for quantitative comparison. The calculated results suggest that the modified model can improve the trend of changes of the differential charge radii along Ca, Ni, Sn, and Pb isotopic chains, especially the shell closure effect at the neutron numbers $N=28$, 82 and 126. The shell quenching phenomena of charge radii can also be predicted at the neutron number $N=50$ along the corresponding Ni and Sn isotopes, respectively. The inverted parabolic-like shapes between the two fully filled shells can also be observed, but the amplitude is gradually weakened from Ca to Pb isotopic chains. Combined with the existing literature, this suggests that the discontinuous behavior in nuclear charge radii can be described well by considering the influence of neutron Cooper pair condensation around the Fermi surface.

Submitted 25 June, 2025; originally announced June 2025.

Comments: 29 pages, 4 figures, 1 table, to appear in Physical Review C

arXiv:2506.20392 [pdf]

Transport Evidence for Wigner Crystals in Monolayer MoTe2

Authors: Mingjie Zhang, Zhenyu Wang, Yifan Jiang, Yaotian Liu, Kenji Watanabe, Takashi Taniguchi, Song Liu, Shiming Lei, Yongqing Li, Yang Xu

Abstract: The crystallization of charge carriers, dubbed the Wigner crystal, is anticipated at low densities in clean two-dimensional electronic systems (2DES). While there has been extensive investigation across diverse platforms, probing spontaneous charge and spin ordering is hindered by disorder effects and limited interaction energies. Here, we report transport evidence for Wigner crystals with antiferromagnetic exchange interactions in high-quality, hexagonal boron nitride encapsulated monolayer MoTe2, a system that achieves a large interaction parameter (r_s) at proper hole densities. A density-tuned metal-insulator transition (MIT) occurring at 3.1 × 10^11 cm^-2 (corresponding to r_s~32) and pronounced nonlinear charge transport in the insulating regime at low temperatures signify the formation of Wigner crystals. Thermal melting of the crystalline phase is observed below approximately 2 K via temperature-dependent nonlinear transport. Magnetoresistance measurements further reveal a substantial enhancement of spin susceptibility on approaching the MIT. The temperature dependence of spin susceptibility in the Wigner crystal phase closely follows the Curie-Weiss law, with the extracted negative Weiss constant illustrating antiferromagnetic exchange interactions. Furthermore, we have found the system exhibits metallic-like differential resistivity under finite DC bias, possibly indicating the existence of a non-equilibrium coherent state in the depinning of Wigner crystals. Our observations establish monolayer MoTe2 as a promising platform for exploring magnetic and dynamic properties of Wigner crystals.

Submitted 25 June, 2025; originally announced June 2025.

Comments: 25 pages, 4 figures and 8 supplemental figures

arXiv:2506.20344 [pdf, ps, other]

A Complete Loss Landscape Analysis of Regularized Deep Matrix Factorization

Authors: Po Chen, Rujun Jiang, Peng Wang

Abstract: Despite its wide range of applications across various domains, the optimization foundations of deep matrix factorization (DMF) remain largely open. In this work, we aim to fill this gap by conducting a comprehensive study of the loss landscape of the regularized DMF problem. Toward this goal, we first provide a closed-form expression of all critical points. Building on this, we establish precise conditions under which a critical point is a local minimizer, a global minimizer, a strict saddle point, or a non-strict saddle point. Leveraging these results, we derive a necessary and sufficient condition under which each critical point is either a local minimizer or a strict saddle point. This provides insights into why gradient-based methods almost always converge to a local minimizer of the regularized DMF problem. Finally, we conduct numerical experiments to visualize its loss landscape under different settings to support our theory.

Submitted 25 June, 2025; originally announced June 2025.

Comments: 35 pages, 3 figures

arXiv:2506.20332 [pdf, ps, other]

Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards

Authors: Jihao Gu, Qihang Ai, Yingyao Wang, Pi Bu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Ziming Wang, Yingxiu Zhao, Ming-Liang Zhang, Jun Song, Yuning Jiang, Bo Zheng

Abstract: Vision-language model-based mobile agents have gained the ability to not only understand complex instructions and mobile screenshots, but also optimize their action outputs via thinking and reasoning, benefiting from reinforcement learning, such as Group Relative Policy Optimization (GRPO). However, existing research centers on offline reinforcement learning training or online optimization using action-level rewards, which limits the agent's dynamic interaction with the environment. This often results in agents settling into local optima, thereby weakening their ability for exploration and error action correction. To address these challenges, we introduce an approach called Mobile-R1, which employs interactive multi-turn reinforcement learning with task-level rewards for mobile agents. Our training framework consists of three stages: initial format finetuning, single-step online training via action-level reward, followed by online training via task-level reward based on multi-turn trajectories. This strategy is designed to enhance the exploration and error correction capabilities of Mobile-R1, leading to significant performance improvements. Moreover, we have collected a dataset covering 28 Chinese applications with 24,521 high-quality manual annotations and established a new benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weights, and code: https://mobile-r1.github.io/Mobile-R1/.

Submitted 27 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: 14 pages, 12 figures

arXiv:2506.20265

Two-dimensional transition metal selenides family M2Se: A platform for superconductivity, band topology, and charge density waves

Authors: Shu-Xiang Qiao, Kai-Yue Jiang, Yu-Lin Han, Na Jiao, Ying-Jie Chen, Hong-Yan Lu, Ping Zhang

Abstract: MXenes and MBenes, which are two-dimensional (2D) transition metal carbides/nitrides and borides, have been extensively studied for their impressive properties. Recently, we reported a family of transition metal sulfides MSene (M2S) with rich properties [Phys. Rev. B 111, L041404 (2025)]; it is therefore worth studying whether selenides with a similar structure also have rich properties. In this work, through high-throughput screening, we present a novel family of 2D transition metal selenides, M2Se. In this family, there are fifty-eight candidate materials, of which ten are stable and metallic. Notably, eight exhibit superconductivity, among which four are superconducting topological metals. Besides, eight show charge density wave (CDW) behavior, among which five also exhibit antiferromagnetism. It is revealed that CDW originates from electron-phonon coupling rather than Fermi surface nesting. Moreover, strain can be applied to regulate the competition between CDW and superconductivity. Our findings reveal the rich properties of superconductivity, band topology, CDW, and magnetism in M2Se, providing a new platform for the controllable integration of multifunctional quantum states.

Submitted 25 June, 2025; originally announced June 2025.

Comments: 10 pages, 5 figures

arXiv:2506.20263

Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

Authors: Ning Luo, Meiyin Hu, Huan Wan, Yanyan Yang, Zhuohang Jiang, Xin Wei

Abstract: Few-shot fine-grained image classification (FS-FGIC) presents a significant challenge, requiring models to distinguish visually similar subclasses with limited labeled examples. Existing methods have critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods fail to utilize hierarchical feature information and lack mechanisms to focus on discriminative regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), which integrates dual-layer feature reconstruction with mask-enhanced feature processing to improve fine-grained classification. HMDRN incorporates a dual-layer feature reconstruction and fusion module that leverages complementary visual information from different network hierarchies. Through learnable fusion weights, the model balances high-level semantic representations from the last layer with mid-level structural details from the penultimate layer. Additionally, we design a spatial binary mask-enhanced transformer self-reconstruction module that processes query features through adaptive thresholding while maintaining complete support features, enhancing focus on discriminative regions while filtering background noise. Extensive experiments on three challenging fine-grained datasets demonstrate that HMDRN consistently outperforms state-of-the-art methods across Conv-4 and ResNet-12 backbone architectures.
Comprehensive ablation studies validate the effectiveness of each proposed component, revealing that dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations. Visualization results provide evidence of HMDRN's superior feature reconstruction capabilities.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20260

Argumentative Ensembling for Robust Recourse under Model Multiplicity

Authors: Junqi Jiang, Antonio Rago, Francesco Leofante, Francesca Toni

Abstract: In machine learning, it is common to obtain multiple equally performing models for the same prediction task, e.g., when training neural networks with different random seeds. Model multiplicity (MM) is the situation which arises when these competing models differ in their predictions for the same input, for which ensembling is often employed to determine an aggregation of the outputs. Providing recourse recommendations via counterfactual explanations (CEs) under MM thus becomes complex, since the CE may not be valid across all models, i.e., the CEs are not robust under MM. In this work, we formalise the problem of providing recourse under MM, which we name recourse-aware ensembling (RAE). We propose the idea that under MM, CEs for each individual model should be considered alongside their predictions so that the aggregated prediction and recourse are decided in tandem. Centred around this intuition, we introduce six desirable properties for solutions to this problem. For solving RAE, we propose a novel argumentative ensembling method which guarantees the robustness of CEs under MM. Specifically, our method leverages computational argumentation to explicitly represent the conflicts between models and counterfactuals regarding prediction results and CE validity. It then uses argumentation semantics to resolve the conflicts and obtain the final solution, in a manner which is parametric to the chosen semantics. Our method also allows for the specification of preferences over the models under MM, allowing further customisation of the ensemble.
In a comprehensive theoretical analysis, we characterise the behaviour of argumentative ensembling with four different argumentation semantics. We then empirically demonstrate the effectiveness of our approach in satisfying desirable properties with eight instantiations of our method. (Abstract is shortened for arXiv.)

Submitted 25 June, 2025; originally announced June 2025.

Comments: arXiv admin note: substantial text overlap with arXiv:2312.15097

arXiv:2506.20200

MS-IQA: A Multi-Scale Feature Fusion Network for PET/CT Image Quality Assessment

Authors: Siqiao Li, Chen Hui, Wei Zhang, Rui Liang, Chenyue Song, Feng Jiang, Haiqi Zhu, Zhixuan Li, Hong Huang, Xiang Li

Abstract: Positron Emission Tomography / Computed Tomography (PET/CT) plays a critical role in medical imaging, combining functional and anatomical information to aid in accurate diagnosis. However, image quality degradation due to noise, compression and other factors could potentially lead to diagnostic uncertainty and increase the risk of misdiagnosis. When evaluating the quality of a PET/CT image, both low-level features like distortions and high-level features like organ anatomical structures affect the diagnostic value of the image. However, existing medical image quality assessment (IQA) methods are unable to account for both feature types simultaneously. In this work, we propose MS-IQA, a novel multi-scale feature fusion network for PET/CT IQA, which utilizes multi-scale features from various intermediate layers of ResNet and Swin Transformer, enhancing its ability to perceive both local and global information. In addition, a multi-scale feature fusion module is introduced to effectively combine high-level and low-level information through a dynamically weighted channel attention mechanism. Finally, to fill the gap in PET/CT IQA datasets, we construct PET-CT-IQA-DS, a dataset containing 2,700 varying-quality PET/CT images with quality scores assigned by radiologists. Experiments on our dataset and the publicly available LDCTIQAC2023 dataset demonstrate that our proposed model achieves superior performance against existing state-of-the-art methods in various IQA metrics. This work provides an accurate and efficient IQA method for PET/CT.
Our code and dataset are available at https://github.com/MS-IQA/MS-IQA/.

Submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted to MICCAI 2025

arXiv:2506.20109

Evaluating Disassembly Errors With Only Binaries

Authors: Lambang Akbar Wijayadi, Yuancheng Jiang, Roland H. C. Yap, Zhenkai Liang, Zhuohao Liu

Abstract: Disassemblers are crucial in the analysis and modification of binaries. Existing works showing disassembler errors largely rely on practical implementation without specific guarantees and assume source code and compiler toolchains to evaluate ground truth. However, the assumption of source code is contrary to typical binary scenarios where only the binary is available. In this work, we investigate an approach with minimal assumptions and a sound approach to disassembly error evaluation that does not require source code. Any source code does not address the fundamental problem of binary disassembly and fails when only the binary exists. As far as we know, this is the first work to evaluate disassembly errors using only the binary. We propose TraceBin, which uses dynamic execution to find disassembly errors. TraceBin targets the use case where the disassembly is used in an automated fashion for security tasks on a target binary, such as static binary instrumentation, binary hardening, automated code repair, and so on, which may be affected by disassembly errors. Discovering disassembly errors in the target binary aids in reducing problems caused by such errors. Furthermore, we are not aware of existing approaches that can evaluate errors given only a target binary, as they require source code.
Our evaluation shows TraceBin finds: (i) errors consistent with existing studies even without source; (ii) disassembly errors due to control flow; (iii) new interesting errors; (iv) errors in non-C/C++ binaries; (v) errors in closed-source binaries; and (vi) show that disassembly errors can have significant security implications. Overall, our experimental results show that TraceBin finds many errors in existing popular disassemblers. It is also helpful in automated security tasks on (closed source) binaries relying on disassemblers.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.20103

BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos

Authors: Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang

Abstract: Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial for both automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI generated videos. Existing datasets either restrict themselves to video or frame level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high quality ground truth. Our experiments show that training state of the art artifact detection models and multi modal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models.
The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.

Submitted 24 June, 2025; originally announced June 2025.

Comments: 7 pages, 4 figures, 2 tables

ACM Class: I.4

arXiv:2506.20095

Solving Infinite Families of Dual Conformal Integrals and Periods

Authors: Song He, Xuhang Jiang

Abstract: We compute infinite families of all-loop planar, dual conformal invariant (DCI) integrals, which contribute to four-point Coulomb-branch amplitudes and correlators in ${\cal N}=4$ supersymmetric Yang-Mills theory, by solving ``boxing" differential equations via package HyperlogProcedures; this amounts to an ``inverse-boxing" operation/integration recursively acting on lower-loop cases (with the box integral as the starting point), and the resulting single-valued harmonic polylogarithmic functions (SVHPL) are nicely labeled by ``binary" strings of $0$ and $1$ without consecutive $1$'s. These functions are special cases of the so-called generalized ladders studied in arXiv:1207.3824, where extended Steinmann relations are imposed due to planarity, and they are counted by the Fibonacci sequence. Our results can be viewed as ``two-dimensional" extensions of the well-known ladder integrals to many more infinite families of DCI integrals: the ladders have strings with a single $1$ followed by all $0$'s, and the other extreme, which nicely evaluate to the ``zigzag" SVHPL functions with alternating $1$'s and $0$'s, are nothing but the four-point DCI integrals from the very special family of anti-prism $f$-graphs.
We also study periods of these integrals: while their periods are in general complicated single-valued multiple zeta values (SVMZV), the ``zigzag" DCI integrals from anti-prism give exactly the famous ``zigzag" periods proportional to $ζ_{2L{+}1}$, and empirically it provides a numerical lower-bound for $L$-loop periods of any binary string, with the upper-bound given by that of the ladder. Based on $f$-graphs as a tool for studying these periods, we discuss several interesting facts and observations about these (motivic) SVMZV and relations among them to all loops, and enumerate a basis for them up to $L=10$.
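Editorial illustration (not part of the arXiv listing or the paper): the abstract states that the functions are labeled by binary strings without consecutive $1$'s and are counted by the Fibonacci sequence. The sketch below checks the standard combinatorial fact behind such a count, namely that the number of length-n binary strings with no two adjacent 1s equals F(n+2) with F(1) = F(2) = 1. The function names are hypothetical, and the exact indexing used in the paper (for example, whether strings must begin with a 1, or how string length relates to loop order) is an assumption that the abstract does not specify.

```python
from itertools import product


def strings_without_consecutive_ones(n):
    """Enumerate all length-n binary strings that contain no two adjacent 1s."""
    candidates = ("".join(bits) for bits in product("01", repeat=n))
    return [s for s in candidates if "11" not in s]


def fibonacci(k):
    """Return F(k) with the convention F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


# Brute-force check of the identity count(n) == F(n + 2) for small n.
for n in range(1, 9):
    count = len(strings_without_consecutive_ones(n))
    assert count == fibonacci(n + 2)
    print(n, count)
```

Brute-force enumeration is exponential in n and is used here only to make the counting statement concrete for small cases.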

Submitted 24 June, 2025; originally announced June 2025.

Comments: 44 pages, many figures

arXiv:2506.19884

MNN-AECS: Energy Optimization for LLM Decoding on Mobile Devices via Adaptive Core Selection

Authors: Zhengxiang Huang, Chaoyue Niu, Zhaode Wang, Jiarui Xue, Hanming Zhang, Yugang Wang, Zewei Xin, Xiaotang Jiang, Chengfei Lv, Fan Wu, Guihai Chen

Abstract: As the demand for on-device Large Language Model (LLM) inference grows, energy efficiency has become a major concern, especially for battery-limited mobile devices. Our analysis shows that the memory-bound LLM decode phase dominates energy use, and yet most existing works focus on accelerating the prefill phase, neglecting energy concerns. We introduce Adaptive Energy-Centric Core Selection (AECS) and integrate it into MNN to create the energy-efficient version, MNN-AECS, the first engine-level system solution without requiring root access or OS modifications for energy-efficient LLM decoding. MNN-AECS is designed to reduce LLM decoding energy while keeping decode speed within an acceptable slowdown threshold by dynamically selecting low-power CPU cores. MNN-AECS is evaluated across 5 Android and 2 iOS devices on 5 popular LLMs of various sizes. Compared to original MNN, MNN-AECS cuts down energy use by 23% without slowdown averaged over all 7 devices and 4 datasets. Against other engines, including llama.cpp, executorch, mllm, and MediaPipe, MNN-AECS delivers 39% to 78% energy saving and 12% to 363% speedup on average.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19874

Towards Provable (In)Secure Model Weight Release Schemes

Authors: Xin Yang, Bintao Tang, Yuhao Wang, Zimo Ji, Terry Jingchen Zhang, Wenyuan Jiang

Abstract: Recent secure weight release schemes claim to enable open-source model distribution while protecting model ownership and preventing misuse. However, these approaches lack rigorous security foundations and provide only informal security guarantees. Inspired by established works in cryptography, we formalize the security of weight release schemes by introducing several concrete security definitions. We then demonstrate our definition's utility through a case study of TaylorMLP, a prominent secure weight release scheme. Our analysis reveals vulnerabilities that allow parameter extraction, thus showing that TaylorMLP fails to achieve its informal security goals. We hope this work will advocate for rigorous research at the intersection of machine learning and security communities and provide a blueprint for how future weight release schemes should be designed and evaluated.

Submitted 26 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

Comments: 8 pages, 2 figures; author name typos and institutions corrected

arXiv:2506.19707

Enhanced Image Recognition Using Gaussian Boson Sampling

Authors: Si-Qiu Gong, Ming-Cheng Chen, Hua-Liang Liu, Hao Su, Yi-Chao Gu, Hao-Yang Tang, Meng-Hao Jia, Yu-Hao Deng, Qian Wei, Hui Wang, Han-Sen Zhong, Xiao Jiang, Li Li, Nai-Le Liu, Chao-Yang Lu, Jian-Wei Pan

Abstract: Gaussian boson sampling (GBS) has emerged as a promising quantum computing paradigm, demonstrating its potential in various applications. However, most existing works focus on theoretical aspects or simple tasks, with limited exploration of its capabilities in solving real-world practical problems. In this work, we propose a novel GBS-based image recognition scheme inspired by the extreme learning machine (ELM) to enhance the performance of the perceptron and implement it using our latest GBS device, Jiuzhang. Our approach utilizes an 8176-mode temporal-spatial hybrid encoding photonic processor, achieving approximately 2200 average photon clicks in the quantum computational advantage regime. We apply this scheme to classify images from the MNIST and Fashion-MNIST datasets, achieving a testing accuracy of 95.86% on MNIST and 85.95% on Fashion-MNIST. These results surpass those of the classical SVC method with a linear kernel and previous physical ELM-based experiments. Additionally, we explore the influence of three hyperparameters and the efficiency of GBS in our experiments. This work not only demonstrates the potential of GBS in real-world machine learning applications but also aims to inspire further advancements in powerful machine learning schemes utilizing GBS technology.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19694

UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation

Authors: Yue Zhou, Yuan Bi, Wenjuan Tong, Wei Wang, Nassir Navab, Zhongliang Jiang

Abstract: Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19683

Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance

Authors: Xuesong Li, Dianye Huang, Yameng Zhang, Nassir Navab, Zhongliang Jiang

Abstract: Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries oriented to clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary users and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for ordinary users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers.
The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinary users.

Submitted 26 June, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19681

Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images

Authors: Cheng Jin, Fengtao Zhou, Yunfang Yu, Jiabo Ma, Yihui Wang, Yingxue Xu, Huajun Zhou, Hao Jiang, Luyang Luo, Luhui Mao, Zifan He, Xiuming Zhang, Jing Zhang, Ronald Chan, Herui Yao, Hao Chen

Abstract: Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSI) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction using only WSIs at inference. Through extensive evaluation across 49 molecular oncology tasks using 11,257 cases among 20 cohorts, PathLUPI demonstrated superior performance compared to conventional methods trained solely on WSIs. Crucially, it achieves AUC $\geq$ 0.80 in 14 of the biomarker prediction and molecular subtyping tasks and C-index $\geq$ 0.70 in survival cohorts of 5 major cancer types. Moreover, PathLUPI embeddings reveal distinct cellular morphological signatures associated with specific genotypes and related biological pathways within WSIs. By effectively encoding molecular context to refine WSI representations, PathLUPI overcomes a key limitation of existing models and offers a novel strategy to bridge molecular insights with routine pathology workflows for wider clinical application.

Submitted 24 June, 2025; originally announced June 2025.

Comments: Under Review

arXiv:2506.19643

Unsupervised Data Generation for Offline Reinforcement Learning: A Perspective from Model

Authors: Shuncheng He, Hongchang Zhang, Jianzhun Shao, Yuhang Jiang, Xiangyang Ji

Abstract: Offline reinforcement learning (RL) has recently gained growing interest from RL researchers. However, the performance of offline RL suffers from the out-of-distribution problem, which can be corrected by feedback in online RL. Previous offline RL research focuses on restricting the offline algorithm to in-distribution or even in-sample action sampling. In contrast, less work pays attention to the influence of the batch data. In this paper, we first build a bridge between the batch data and the performance of offline RL algorithms theoretically, from the perspective of model-based offline RL optimization. We draw a conclusion that, with mild assumptions, the distance between the state-action pair distribution generated by the behavioural policy and the distribution generated by the optimal policy accounts for the performance gap between the policy learned by model-based offline RL and the optimal policy. Secondly, we reveal that in task-agnostic settings, a series of policies trained by unsupervised RL can minimize the worst-case regret in the performance gap. Inspired by the theoretical conclusions, UDG (Unsupervised Data Generation) is devised to generate data and select proper data for offline training under task-agnostic settings. Empirical results demonstrate that UDG can outperform supervised data generation on solving unknown tasks.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19500

NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling

Authors: Yan Jiang, Hao Zhou, LiZhong GU, Ai Han, TianLong Li

Abstract: LLMs' reliance on static knowledge and fragile tool invocation severely hinders the orchestration of complex, heterogeneous toolchains, particularly at large scales. Existing methods typically use rigid single-path execution, resulting in poor error recovery and exponentially growing search spaces. We introduce NaviAgent, a graph-navigated bilevel planning architecture for robust function calling, comprising a Multi-Path Decider and Graph-Encoded Navigator. As an LLM-powered agent, the Multi-Path Decider defines a four-dimensional decision space and continuously perceives environmental states, dynamically selecting the optimal action to fully cover all tool invocation scenarios. The Graph-Encoded Navigator constructs a Tool Dependency Heterogeneous Graph (TDHG), where node embeddings explicitly fuse API schema structure with historical invocation behavior. It also integrates a novel heuristic search strategy that guides the Decider toward efficient and highly successful toolchains, even for unseen tool combinations. Experiments show that NaviAgent consistently achieves the highest task success rate (TSR) across all foundation models and task complexities, outperforming the average baselines (ReAct, ToolLLM, α-UMI) by 13.5%, 16.4%, and 19.0% on Qwen2.5-14B, Qwen2.5-32B, and Deepseek-V3, respectively. Its execution steps are typically within one step of the most efficient baseline, ensuring a strong balance between quality and efficiency.
Notably, a fine-tuned Qwen2.5-14B model achieves a TSR of 49.5%, surpassing the much larger 32B model (44.9%) under our architecture. Incorporating the Graph-Encoded Navigator further boosts TSR by an average of 2.4 points, with gains of over 9 points on complex tasks for larger models (Deepseek-V3 and GPT-4o), highlighting its essential role in toolchain orchestration.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19488 [pdf, ps, other]

SceneCrafter: Controllable Multi-View Driving Scene Editing

Authors: Zehao Zhu, Yuliang Zou, Chiyu Max Jiang, Bo Sun, Vincent Casser, Xiukun Huang, Jiahao Wang, Zhenpei Yang, Ruiqi Gao, Leonidas Guibas, Mingxing Tan, Dragomir Anguelov

Abstract: Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty inspiring confidence in the relevance of their outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning "empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through novel masked training and a multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.

Submitted 24 June, 2025; originally announced June 2025.

Comments: CVPR 2025

arXiv:2506.19449 [pdf, ps, other]

A broadband platform to search for hidden photons

Authors: Daqing Liu, Bin Tang, Xingfang Jiang, Xianyun Liu, Ning Ma

Abstract: The optical behavior of a structure consisting of graphene sheets embedded in media was studied, and the differences between the structure and an ordinary birefringent crystal, notably the double zero-reflectance point, were identified. We showed the changes in the optical behavior of the structure due to the existence of hidden photons. When radiation illuminates the structure, only radiation satisfying $ω^2/ω_p^2>1+\frac{m_X^2 c^4 χ^2}{ε_r\hbar^2ω_p^2}$ can propagate through the structure. This provides a broadband platform for detecting hidden photons, where the sensitivity increases with the mass of the hidden photon. In contrast, if the mass of the hidden photon is small, one can use a method similar to the light-shining-through-thin-wall technique. The structure is a platform to actively search for hidden photons since the operating point of the structure does not have to match the mass shell of hidden photons.

Submitted 24 June, 2025; originally announced June 2025.

Comments: 8 pages, 5 figures

arXiv:2506.19425 [pdf, ps, other]

What Makes the Best Decomposition? Investigating Binary Decomposition Under FCG Variance

Authors: Ang Jia, He Jiang, Zhilei Ren, Xiaochen Li, Ming Fan, Ting Liu

Abstract: Binary decomposition, which decomposes binary files into modules, plays a critical role in binary reuse detection. Existing binary decomposition works either apply anchor-based methods, extending anchor functions to generate modules, or apply clustering-based methods, using clustering algorithms to group binary functions. Both rely on the assumption that reused code shares similar function call relationships. However, we find that function call graphs (FCGs) vary a lot when using different compilation settings, especially with diverse function inlining decisions. In this work, we conduct the first systematic empirical study on the variance of FCGs compiled by various compilation settings and explore its effect on binary decomposition methods. We first construct a dataset compiled by 17 compilers, using 6 optimizations for 4 architectures, and analyze the changes and mappings of the FCGs. We find that the size of FCGs changes dramatically, while the FCGs are still linked by three different kinds of mappings. Then we evaluate the existing works under the FCG variance, and results show that existing works face great challenges when conducting cross-compiler evaluation with diverse optimization settings. Finally, we propose a method to identify the optimal decomposition and compare the existing decomposition works with the optimal decomposition. Existing works either suffer from low coverage or cannot generate stable community similarities.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19368 [pdf, ps, other]

Yotta: A Large-Scale Trustless Data Trading Scheme for Blockchain System

Authors: Xiang Liu, Zhanpeng Guo, Liangxi Liu, Mengyao Zheng, Yiming Qiu, Linshan Jiang

Abstract: Data trading is one of the key focuses of Web 3.0. However, all the current methods that rely on blockchain-based smart contracts for data exchange cannot support large-scale data trading while ensuring data security, which falls short of fulfilling the spirit of Web 3.0. Even worse, there is currently a lack of discussion on the essential properties that large-scale data trading should satisfy. In this work, we are the first to formalize the property requirements for enabling data trading in Web 3.0. Based on these requirements, we are the first to propose Yotta, a complete batch data trading scheme for blockchain, which features a data trading design that leverages our innovative cryptographic workflow with IPFS and zk-SNARK. Our simulation results demonstrate that Yotta outperforms baseline approaches by up to 130 times and exhibits excellent scalability while satisfying all the properties.

Submitted 24 June, 2025; originally announced June 2025.

Comments: 9 pages, 2 figures, Exploratory Paper

Journal ref: Nanyang Blockchain Conference 2025

arXiv:2506.19296 [pdf, ps, other]

The Effect of Depth on the Expressivity of Deep Linear State-Space Models

Authors: Zeyu Bao, Penghao Yu, Haotian Jiang, Qianxiao Li

Abstract: Deep state-space models (SSMs) have gained increasing popularity in sequence modelling. While there are numerous theoretical investigations of shallow SSMs, how the depth of the SSM affects its expressiveness remains a crucial problem. In this paper, we systematically investigate the role of depth and width in deep linear SSMs, aiming to characterize how they influence the expressive capacity of the architecture. First, we rigorously prove that in the absence of parameter constraints, increasing depth and increasing width are generally equivalent, provided that the parameter count remains within the same order of magnitude. However, under the assumption that the parameter norms are constrained, the effects of depth and width differ significantly. We show that a shallow linear SSM with large parameter norms can be represented by a deep linear SSM with smaller norms using a constructive method. In particular, this demonstrates that deep SSMs are more capable of representing targets with large norms than shallow SSMs under norm constraints. Finally, we derive upper bounds on the minimal depth required for a deep linear SSM to represent a given shallow linear SSM under constrained parameter norms. We also validate our theoretical results with numerical experiments.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19257 [pdf, ps, other]

MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Authors: Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.19205 [pdf]

Deposition-Dependent Coverage and Performance of Phosphonic Acid Interface Modifiers in Halide Perovskite Optoelectronics

Authors: Hannah Contreras, Aidan O'Brien, Margherita Taddei, Yangwei Shi, Fangyuan Jiang, Robert J. E. Westbrook, Yadong Zhang, Rajiv Giridharagopal, Paul A. Lee, Stephen Barlow, Seth R. Marder, Neal R. Armstrong, David S. Ginger

Abstract: In this work, we study the effect of various deposition methods for phosphonic acid interface modifiers commonly pursued as self-assembled monolayers in high-performance metal halide perovskite photovoltaics and light-emitting diodes. We compare the deposition of (2-(3,6-diiodo-9H-carbazol-9-yl)ethyl)phosphonic acid onto indium tin oxide (ITO) bottom contacts by varying three parameters: the method of deposition, specifically spin coating or prolonged dip coating, ITO surface treatment via HCl/FeCl3 etching, and use in combination with a second modifier, 1,6-hexylenediphosphonic acid. We demonstrate that varying these modification protocols can impact time-resolved photoluminescence carrier lifetimes and quasi-Fermi level splitting of perovskite films deposited onto the phosphonic-acid-modified ITO. Ultraviolet photoelectron spectroscopy shows an increase in effective work function after phosphonic acid modification and clear evidence for photoemission from carbazole functional groups at the ITO surface. We use X-ray photoelectron spectroscopy to probe differences in phosphonic acid coverage on the metal oxide contact and show that perovskite samples grown on ITO with the highest phosphonic acid coverage exhibit the longest carrier lifetimes. Finally, we establish that device performance follows these same trends. These results indicate that the reactivity, heterogeneity, and composition of the bottom contact help to control recombination rates and therefore power conversion efficiencies. ITO etching, prolonged deposition times for phosphonic acids via dip coating, and the use of a secondary, more hydrophilic bis-phosphonic acid, all contribute to improvements in surface coverage, carrier lifetime, and device efficiency. These improvements each have a positive impact, and we achieve the best results when all three strategies are implemented.

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.19180 [pdf, ps, other]

Precise Measurement of the $Λ$ Electric Dipole Moment through the Entangled Strange Baryon-Antibaryon System

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (696 additional authors not shown)

Abstract: The dominance of matter over antimatter in the universe has consistently driven the pursuit of new physics beyond the Standard Model that violates charge-parity symmetry. Unlike the well-constrained electrons and neutrons, strange baryons (hyperons) remain a largely unexplored territory, in which interactions between hyperons and particles from new physics could induce a non-trivial electric dipole moment (EDM). However, direct measurements of hyperon EDMs through spin precession are highly challenging due to their short lifetimes. In this paper, we present a novel method to extract the EDM of the lightest hyperon, $Λ$, using the entangled $Λ\overline{Λ}$ system. Our result is consistent with zero, achieving a three-order-of-magnitude improvement over the previous upper limit established in the 1980s with comparable statistics, providing stringent constraints on potential new physics.

Submitted 28 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18898 [pdf, ps, other]

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com

Submitted 23 June, 2025; originally announced June 2025.

Comments: Project page: https://tar.csuhan.com

arXiv:2506.18871 [pdf, ps, other]

OmniGen2: Exploration to Advanced Multimodal Generation

Authors: Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu

Abstract: In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2

Submitted 25 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18856 [pdf, ps, other]

RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base

Authors: Kuanning Wang, Yuqian Fu, Tianyu Wang, Yanwei Fu, Longfei Liang, Yu-Gang Jiang, Xiangyang Xue

Abstract: Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose roughly contains three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD rendered images and also attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose .

Submitted 23 June, 2025; originally announced June 2025.

Comments: Accepted by IROS 2025

arXiv:2506.18786 [pdf, ps, other]

Flow-Aware Diffusion for Real-Time VR Restoration: Enhancing Spatiotemporal Coherence and Efficiency

Authors: Yitong Zhu, Guanxuan Jiang, Zhuowen Liang, Yuyang Wang

Abstract: Cybersickness remains a critical barrier to the widespread adoption of Virtual Reality (VR), particularly in scenarios involving intense or artificial motion cues. Among the key contributors is excessive optical flow, that is, perceived visual motion that, when unmatched by vestibular input, leads to sensory conflict and discomfort. While previous efforts have explored geometric or hardware-based mitigation strategies, such methods often rely on predefined scene structures, manual tuning, or intrusive equipment. In this work, we propose U-MAD, a lightweight, real-time, AI-based solution that suppresses perceptually disruptive optical flow directly at the image level. Unlike prior handcrafted approaches, this method learns to attenuate high-intensity motion patterns from rendered frames without requiring mesh-level editing or scene-specific adaptation. Designed as a plug-and-play module, U-MAD integrates seamlessly into existing VR pipelines and generalizes well to procedurally generated environments. The experiments show that U-MAD consistently reduces average optical flow and enhances temporal stability across diverse scenes. A user study further confirms that reducing visual motion leads to improved perceptual comfort and alleviated cybersickness symptoms. These findings demonstrate that perceptually guided modulation of optical flow provides an effective and scalable approach to creating more user-friendly immersive experiences. The code will be released at https://github.com/XXXXX (upon publication).

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18734 [pdf, ps, other]

Can boundary configuration be tuned to optimize directional quantum steering harvesting?

Authors: Xiao-Li Huang, Xiao-Ying Jiang, Yu-Xuan Wang, Si-Yu Liu, Zejun Wang, Shu-Min Wu

Abstract: We investigate the harvesting of quantum steering and its asymmetry between two static detectors locally interacting with a vacuum massless scalar field near an infinite, perfectly reflecting boundary. The detectors are arranged either parallel or orthogonal to the boundary, with detector $B$ assumed to have an energy gap greater than or equal to that of detector $A$. It is interesting to observe that, with increasing distance between the detectors and the boundary, the boundary tends to suppress quantum steering in one direction while enhancing it in the opposite direction. In the case of identical detectors, steering is symmetric when they are aligned parallel to the boundary. However, orthogonal alignment breaks this symmetry due to their unequal spatial proximity to the boundary. For non-identical detectors in the parallel configuration, the steering from $A$ to $B$ ($A \rightarrow B$) generally surpasses that from $B$ to $A$ ($B \rightarrow A$). In contrast, when the detectors are oriented orthogonally to the boundary, the relative strength of $A \rightarrow B$ and $B \rightarrow A$ steerability depends on the interplay between the boundary effects and the detectors' energy gap difference. Across most of the parameter space, the orthogonal alignment tends to enhance $B \rightarrow A$ steering while suppressing $A \rightarrow B$ steering compared to the parallel setup. These findings suggest that boundary configurations should be flexibly adjusted according to the directional dependence of steering harvesting in order to optimize quantum information extraction.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 20 pages, 7 figures

arXiv:2506.18701 [pdf, ps, other]

Matrix-Game: Interactive World Foundation Model

Authors: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou

Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.

Submitted 23 June, 2025; originally announced June 2025.

Comments: Technical Report

arXiv:2506.18636 [pdf]

A comparative analysis of plasmonic and dielectric metasurface sensing platforms powered by bound states in the continuum

Authors: Tao Jiang, Angana Bhattacharya, Martin Barkey, Andreas Aigner, Thomas Weber, Juan Wang, Stefan A. Maier, Andreas Tittl

Abstract: Nanophotonic platforms based on surface-enhanced infrared absorbance spectroscopy (SEIRAS) have emerged as an effective tool for molecular detection. Sensitive nanophotonic sensors with robust resonant modes and amplified electromagnetic near fields are essential for spectroscopy, especially in lossy environments. Metasurfaces driven by bound states in the continuum (BICs) have unlocked a powerful platform for molecular detection due to their exceptional spectral selectivity. While plasmonic BIC metasurfaces are preferred for molecular spectroscopy due to their high surface fields, which enhance the interaction with analytes, dielectric BICs have become popular due to their high quality factors and thus high sensitivity. However, their sensing performance has largely been demonstrated in air, neglecting the intrinsic infrared (IR) losses found in common solvents. This study evaluates the suitability of plasmonic versus dielectric platforms for in-situ molecular spectroscopy. Here, the sensing performance of plasmonic (gold) and dielectric (silicon) metasurfaces is assessed across liquid environments with varying losses resembling typical solvents. The results show that dielectric metasurfaces excel in dry conditions, while plasmonic BIC metasurfaces outperform them in lossy solvents, with a distinct crossover point where both show similar performance. Our results provide a framework for selecting the optimal metasurface material platform for SEIRAS studies based on environmental conditions.

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18606 [pdf, ps, other]

A hybrid nonet with $J^{PC}=1^{-+}$ or a tetraquark 81-plet

Authors: Niu Su, Er-Liang Cui, Yi-Wei Jiang, Hua-Xing Chen

Abstract: Confirming the existence of hybrid states remains challenging due to their experimental indistinguishability from tightly bound tetraquarks and loosely bound molecules. To address this issue, we employ QCD sum rules to systematically investigate the $π_1(1600)$ and $η_1(1855)$ as candidate tetraquark states with exotic quantum numbers $J^{PC} = 1^{-+}$. Within the hybrid framework, an $SU(3)$ flavor nonet is expected, featuring two isoscalar configurations, $q\bar{q}g$ and $s\bar{s}g$, where $q = u/d$. In contrast, the tetraquark scenario predicts an $SU(3)$ flavor 81-plet comprising three isoscalar states: $qq\bar{q}\bar{q}$, $qs\bar{q}\bar{s}$, and $ss\bar{s}\bar{s}$. Our analysis yields a mass of $2.22^{+0.18}_{-0.26}$ GeV for the $ss\bar{s}\bar{s}$ tetraquark state, which is expected to decay predominantly into the $φφ$ and $ηf_1(1420)$ final states. Therefore, experimental scrutiny of their invariant mass spectra is pivotal for distinguishing between hybrid and tetraquark interpretations.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 6 pages, 3 figures, suggestions and comments welcome

arXiv:2506.18506 [pdf]

Detection of subsurface structures with a vehicle-based atom gravity gradiometer

Authors: Xiaowei Zhang, Jiaqi Zhong, Muyan Wang, Huilin Wan, Hui Xiong, Dandan Jiang, Zhi Li, Dekai Mao, Bin Gao, Biao Tang, Xi Chen, Jin Wang, Mingsheng Zhan

Abstract: High-precision mobile gravity gradiometers are very useful in geodesy and geophysics. Atom gravity gradiometers (AGGs) could be among the most accurate mobile gravity gradiometers but are currently constrained by the trade-off between portability and sensitivity. Here, we present a high-sensitivity mobile AGG featuring an ultra-compact sensor head with a volume of only 94 L. In the laboratory, it achieves a sensitivity of 77 E/$\sqrt{Hz}$ (1 E=1$\times10^{-9}$/s$^2$) and a long-term stability of better than 0.5 E. We integrated the instrument in a minivan, enabling efficient mobile field surveys with excellent maneuverability in confined spaces. Using this vehicular system, we surveyed the gravitational field over a set of subsurface structures within a small wooded area, successfully resolving their structural signatures with a signal-to-noise ratio of 57 and quantifying the water depth in a reservoir with an accuracy of $\pm$0.23 m. Compared with previous observations using a CG-5 gravimeter, the superior spatial resolution inherent in gradiometry is clearly demonstrated. This work paves the way for bringing AGGs to practical field applications.

Submitted 25 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

Comments: 13 pages, 8 figures

arXiv:2506.18478 [pdf, ps, other]

High-Dimensional Multi-Study Robust Factor Model for Analyzing RNA Sequencing Data from Heterogeneous Sources

Authors: Xiaolu Jiang, Wei Liu

Abstract: The amount of high-dimensional large-scale RNA sequencing data derived from multiple heterogeneous sources has increased exponentially in biological science. During data collection, significant technical noise or errors may occur. To robustly extract meaningful features from this type of data, we introduce a high-dimensional multi-study robust factor model, called MultiRFM, which learns latent features and accounts for the heterogeneity among sources. MultiRFM demonstrates significantly greater robustness compared to existing multi-study factor models and is capable of estimating study-specific factors that are overlooked by single-study robust factor models. Specifically, we utilize a multivariate t-distribution to model errors, capturing potential heavy tails, and incorporate both study-shared and study-specified factors to represent common and specific information among studies. For parameter estimation, we have designed a computationally efficient variational estimation approach. A step-wise singular value ratio method is proposed to determine the discrete tuning parameters. Extensive simulation studies indicate that MultiRFM surpasses state-of-the-art methods in terms of estimation accuracy across various scenarios. Real-world applications involving two RNA sequencing datasets demonstrate that MultiRFM outperforms competing methods in model fitting, prediction, and computational efficiency, significantly facilitating downstream tasks.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 36 pages, 4 figures

arXiv:2506.18476 [pdf, ps, other]

Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding

Authors: Yaokun Zhong, Siyu Jiang, Jian Zhu, Jian-Fang Hu

Abstract: Semi-Supervised Video Paragraph Grounding (SSVPG) aims to localize multiple sentences in a paragraph from an untrimmed video with limited temporal annotations. Existing methods focus on teacher-student consistency learning and video-level contrastive loss, but they overlook the importance of perturbing query contexts to generate strong supervisory signals. In this work, we propose a novel Context Consistency Learning (CCL) framework that unifies the paradigms of consistency regularization and pseudo-labeling to enhance semi-supervised learning. Specifically, we first conduct teacher-student learning where the student model takes as inputs strongly-augmented samples with sentences removed and is enforced to learn from the adequately strong supervisory signals from the teacher model. Afterward, we conduct model retraining based on the generated pseudo labels, where the mutual agreement between the original and augmented views' predictions is utilized as the label confidence. Extensive experiments show that CCL outperforms existing methods by a large margin.

Submitted 23 June, 2025; originally announced June 2025.

Comments: Accepted by ICME2025

arXiv:2506.18420 [pdf, ps, other]

Incompressible Euler limit from the Boltzmann equation with Maxwell reflection boundary condition in the half-space

Authors: Ning Jiang, Chao Wang, Yulong Wu, Zhifei Zhang

Abstract: In this paper, we rigorously justify the incompressible Euler limit of the Boltzmann equation with the general Maxwell reflection boundary condition in the half-space. The accommodation coefficient $α\in (0,1]$ is assumed to be $O(1)$. Our construction of solutions includes the interior fluid part and Knudsen-Prandtl coupled boundary layers. The corresponding solutions to the nonlinear Euler and nonlinear Prandtl systems are taken to be shear flows. Due to the presence of the nonlinear Prandtl layer, the remainder equation loses one order of normal derivative. The key technical novelty lies in employing the full conservation laws to convert this loss of the normal derivative into a loss of tangential spatial derivative, avoiding any loss of regularity in time. By working within an analytic $L^2 \mbox{-} L^\infty$ framework, we establish the uniform estimate on the remainder equations, thus justifying the validity of the incompressible Euler limit from the Boltzmann equation for the shear flow case.

Submitted 23 June, 2025; originally announced June 2025.

MSC Class: 35B25; 35F20; 35Q20; 76N15; 82C40

Showing 51–100 of 26,487 results for author: Jiang