Search | arXiv e-print repository

Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie

Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image underst… ▽ More Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models. △ Less

Submitted 12 June, 2025; originally announced June 2025.

Comments: Unified image understanding and generation model

arXiv:2506.10316 [pdf, ps, other]

Search for sub-GeV invisible particles in inclusive decays of $J/ψ$ to $φ$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (704 additional authors not shown)

Abstract: A search for an invisible particle, $X$, with a mass between 0 and 0.96 $\textrm{GeV}/\textit{c}^{2}$, is performed in the process $J/ψ\rightarrowφ+ X$ using $(8774.0\pm39.4)\times10^{6}$ $J/ψ$ events collected with the BESIII detector from 2017 to 2019. The $φ$ meson is fully reconstructed and an efficient veto of photons, neutral and charged hadrons up to twice the $K_L^0$ mass is applied to the… ▽ More A search for an invisible particle, $X$, with a mass between 0 and 0.96 $\textrm{GeV}/\textit{c}^{2}$, is performed in the process $J/ψ\rightarrowφ+ X$ using $(8774.0\pm39.4)\times10^{6}$ $J/ψ$ events collected with the BESIII detector from 2017 to 2019. The $φ$ meson is fully reconstructed and an efficient veto of photons, neutral and charged hadrons up to twice the $K_L^0$ mass is applied to the rest of the events, and the recoil mass against the $φ$ is obtained precisely from the kinematic constraint in the event. No significant signal is observed in the investigated region and the upper limit on the inclusive branching fraction of $J/ψ\rightarrowφ+ X$ is determined to be $7.5\times10^{-8}$ at 90% confidence level. Upper limits at a 90% confidence level are also given for this branching fraction as a function of the invisible particle mass, varying from $9\times10^{-9}$ to $4\times10^{-8}$ over the investigated mass range. Additionally, a 90% confidence level upper limit on the branching fraction of $η\rightarrow \rm{invisible}$ is determined to $2.6\times10^{-5}$, which improves the previous best results by more than four times. The analysis technique in this work offers a clean window to search for sub-GeV invisible particles, which can be adapted for other $J/ψ$ decays and direct $e^+e^-$ annihilation experiments in future studies, and improve the sensitivity by orders of magnitude. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 10 pages, 3 figures

arXiv:2506.10300 [pdf]

Revisiting the electron affinity of selenium

Authors: Rui Zhang, Wenru Jie, Jiayi Chen, Qihan Liu, Chuangang Ning

Abstract: The electron affinity (EA) of atomic selenium, previously established as 16,297.276(9) cm-1 based on the laser photodetachment microscopy (LPM) measurements in 2012, exhibited a significant deviation from other earlier experimental values, yet it remained the accepted reference standard for over a decade. In this letter, we re-examined the EA of Se using the slow-electron velocity-map imaging meth… ▽ More The electron affinity (EA) of atomic selenium, previously established as 16,297.276(9) cm-1 based on the laser photodetachment microscopy (LPM) measurements in 2012, exhibited a significant deviation from other earlier experimental values, yet it remained the accepted reference standard for over a decade. In this letter, we re-examined the EA of Se using the slow-electron velocity-map imaging method and revealed a substantial deviation in the LPM result. Measurements for the different isotopes of Se and the energy-level splitting of the neutral Se atom's 3P2 - 3P1 further verified the accuracy and robustness of our SEVI method. Based on these experimental evidences, we recommended a revised EA(Se) value of 16,297.78(4) cm-1, which is in excellent agreement with the previous laser photodetachment threshold (LPT) experimental results. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 5 pages, 4 figures

arXiv:2506.10035 [pdf, ps, other]

FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training

Authors: Fuhan Cai, Yong Guo, Jie Li, Wenbo Li, Xiangzhong Fang, Jian Chen

Abstract: Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance… ▽ More Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20\% of the hierarchy pruned. Our code will be available soon. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 14 pages

arXiv:2506.09924 [pdf, ps, other]

On the Linear Programming Model for Dynamic Stochastic Matching and Its Application on Pricing

Authors: Junlin Chen, Chiwei Yan, Hai Jiang

Abstract: Important pricing problems in centralized matching markets -- such as carpooling and food delivery platforms -- often exhibit a bi-level structure. At the upper level, the platform sets prices for different types of agents (e.g., riders with various origins and destinations, or delivery orders from diverse restaurants to customer locations). The lower level then matches converted agents to minimiz… ▽ More Important pricing problems in centralized matching markets -- such as carpooling and food delivery platforms -- often exhibit a bi-level structure. At the upper level, the platform sets prices for different types of agents (e.g., riders with various origins and destinations, or delivery orders from diverse restaurants to customer locations). The lower level then matches converted agents to minimize operational costs, such as pooling riders with similar itineraries into the same vehicle or consolidating multiple delivery orders into a single trip. Motivated by these applications, we study the optimal value (cost) function of a linear programming model with respect to agent arrival rates, originally proposed by Aouad and Saritac (2022) for cost-minimizing dynamic stochastic matching under limited time. In particular, we investigate the concavity of this cost function. We show that weak concavity holds when there are fewer than or equal to three agent types with identical patience level. For an arbitrary number of agent types, (weak) concavity is more likely to hold when agents have high or similar patience, or when agents are more heterogeneous in their matching efficiencies. Building on these theoretical insights, we develop a Minorize-Maximization (MM) algorithm for the pricing problem that requires little stepsize tuning, and demonstrate substantial performance improvements over projected gradient-based methods on a large-scale, real-world ridesharing dataset. This makes it a compelling algorithmic choice for solving such pricing problems in practice. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09724 [pdf, ps, other]

The Four Color Theorem for Cell Instance Segmentation

Authors: Ye Zhang, Yu Zhou, Yifeng Wang, Jun Xiao, Ziyue Wang, Yongbing Zhang, Jianxu Chen

Abstract: Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. I… ▽ More Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on various modes demonstrate our approach achieves state-of-the-art performance. The code is available at https://github.com/zhangye-zoe/FCIS. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: Accepted at ICML 2025

arXiv:2506.09677 [pdf, ps, other]

Reasoning Models Are More Easily Gaslighted Than You Think

Authors: Bin Zhu, Hailong Yin, Jingjing Chen, Yu-Gang Jiang

Abstract: Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three… ▽ More Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models' susceptibility to defend their belief under gaslighting negation prompt. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09553 [pdf, ps, other]

GLD-Road:A global-local decoding road network extraction model for remote sensing images

Authors: Ligao Deng, Yupeng Deng, Yu Meng, Jingbo Chen, Zhihao Xi, Diyou Liu, Qifeng Chu

Abstract: Road networks are crucial for mapping, autonomous driving, and disaster response. While manual annotation is costly, deep learning offers efficient extraction. Current methods include postprocessing (prone to errors), global parallel (fast but misses nodes), and local iterative (accurate but slow). We propose GLD-Road, a two-stage model combining global efficiency and local precision. First, it de… ▽ More Road networks are crucial for mapping, autonomous driving, and disaster response. While manual annotation is costly, deep learning offers efficient extraction. Current methods include postprocessing (prone to errors), global parallel (fast but misses nodes), and local iterative (accurate but slow). We propose GLD-Road, a two-stage model combining global efficiency and local precision. First, it detects road nodes and connects them via a Connect Module. Then, it iteratively refines broken roads using local searches, drastically reducing computation. Experiments show GLD-Road outperforms state-of-the-art methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++ (local). The experimental results are available at https://github.com/ucas-dlg/GLD-Road. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09525 [pdf, ps, other]

Beyond Personalization: Federated Recommendation with Calibration via Low-rank Decomposition

Authors: Jundong Chen, Honglei Zhang, Haoxuan Li, Chunxu Zhang, Zhiwei Li, Yidong Li

Abstract: Federated recommendation (FR) is a promising paradigm to protect user privacy in recommender systems. Distinct from general federated scenarios, FR inherently needs to preserve client-specific parameters, i.e., user embeddings, for privacy and personalization. However, we empirically find that globally aggregated item embeddings can induce skew in user embeddings, resulting in suboptimal performan… ▽ More Federated recommendation (FR) is a promising paradigm to protect user privacy in recommender systems. Distinct from general federated scenarios, FR inherently needs to preserve client-specific parameters, i.e., user embeddings, for privacy and personalization. However, we empirically find that globally aggregated item embeddings can induce skew in user embeddings, resulting in suboptimal performance. To this end, we theoretically analyze the user embedding skew issue and propose Personalized Federated recommendation with Calibration via Low-Rank decomposition (PFedCLR). Specifically, PFedCLR introduces an integrated dual-function mechanism, implemented with a buffer matrix, to jointly calibrate local user embedding and personalize global item embeddings. To ensure efficiency, we employ a low-rank decomposition of the buffer matrix to reduce the model overhead. Furthermore, for privacy, we train and upload the local model before personalization, preventing the server from accessing sensitive information. Extensive experiments demonstrate that PFedCLR effectively mitigates user embedding skew and achieves a desirable trade-off among performance, efficiency, and privacy, outperforming state-of-the-art (SOTA) methods. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09518 [pdf, other]

HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

Authors: Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Chengxuan Qian, Juyuan Kang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang

Abstract: Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: r… ▽ More Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppresses redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency. △ Less

Submitted 11 June, 2025; originally announced June 2025.

arXiv:2506.09454 [pdf, ps, other]

NDCG-Consistent Softmax Approximation with Accelerated Convergence

Authors: Yuanhao Pu, Defu Lian, Xiaolong Chen, Xu Huang, Jin Chen, Enhong Chen

Abstract: Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along… ▽ More Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along with its flexibility across diverse application scenarios. However, despite its effectiveness, SM Loss often suffers from significant computational overhead and scalability limitations when applied to large-scale object spaces. To address this challenge, we propose novel loss formulations that align directly with ranking metrics: the Ranking-Generalizable \textbf{squared} (RG$^2$) Loss and the Ranking-Generalizable interactive (RG$^\times$) Loss, both derived through Taylor expansions of the SM Loss. Notably, RG$^2$ reveals the intrinsic mechanisms underlying weighted squared losses (WSL) in ranking methods and uncovers fundamental connections between sampling-based and non-sampling-based loss paradigms. Furthermore, we integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method, providing both generalization guarantees and convergence rate analyses. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance relative to SM Loss, while significantly accelerating convergence. This framework offers the similarity learning community both theoretical insights and practically efficient tools, with methodologies applicable to a broad range of tasks where balancing ranking quality and computational efficiency is essential. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 35 pages

arXiv:2506.09386 [pdf, ps, other]

Search for the charmonium weak decays $J/ψ\to D_{s}^{-}ρ^{+}+c.c.$ and $J/ψ\to D_{s}^{-}π^{+}+c.c.$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (705 additional authors not shown)

Abstract: Based on $(10087\pm44)\times 10^6$ $J/ψ$ events recorded with the BESIII detector, we search for the rare charmonium weak decays $J/ψ\to D_{s}^{-}ρ^{+}+c.c.$ and $J/ψ\to D_{s}^{-}π^{+}+c.c.$ No signal is observed, and upper limits on the branching fractions at the $90\%$ confidence level are set as $\mathcal{B}(J/ψ\to D_{s}^{-}ρ^{+}+c.c.)<8.0\times10^{-7}$ and… ▽ More Based on $(10087\pm44)\times 10^6$ $J/ψ$ events recorded with the BESIII detector, we search for the rare charmonium weak decays $J/ψ\to D_{s}^{-}ρ^{+}+c.c.$ and $J/ψ\to D_{s}^{-}π^{+}+c.c.$ No signal is observed, and upper limits on the branching fractions at the $90\%$ confidence level are set as $\mathcal{B}(J/ψ\to D_{s}^{-}ρ^{+}+c.c.)<8.0\times10^{-7}$ and $\mathcal{B}(J/ψ\to D_{s}^{-}π^{+}+c.c.)<4.1\times10^{-7}$. Our results provide the most stringent experimental constraints on these decays. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 18 pages, 3 figures

arXiv:2506.09352 [pdf, ps, other]

doi 10.3847/1538-4357/addec3

New Symbiotic Stars from LAMOST DR10 Spectra and Multi-band Photometry

Authors: Jing Chen, Liang Wang, Yin-Bi Li, Xiao-Xiao Ma, A-Li Luo, Zi-Chong Zhang, Ming-Yi Ding, Kai Zhang

Abstract: Symbiotic star (SySt) is long-period interacting binary system, typically consisting of a white dwarf and a red giant surrounded by a nebula. These systems are natural astrophysical laboratories for investigating binary star evolution. In this paper, we identified nine SySts from the LAMOST DR10 low-resolution spectra survey, seven of which were previously known, while two are newly identified. In… ▽ More Symbiotic star (SySt) is long-period interacting binary system, typically consisting of a white dwarf and a red giant surrounded by a nebula. These systems are natural astrophysical laboratories for investigating binary star evolution. In this paper, we identified nine SySts from the LAMOST DR10 low-resolution spectra survey, seven of which were previously known, while two are newly identified. Initially, we selected LAMOST spectra exhibiting typical SySt emission lines (e.g., $\rm H_α, ~H_β, ~H_γ, ~and ~He II$). Subsequently, we utilized the distribution of known SySts on the HR diagram to select SySt candidates, and visually inspected their spectra. Ultimately, we classified all nine as S-type SySts using the $J - H$ vs. $H - K$ diagram. Additionally, based on multi-band photometric data from GALEX, Gaia, 2MASS, ALLWISE, and several X-ray catalogs, we found 12 accreting-only SySt (acc-SySt) candidates, characterized by concurrent ultraviolet and infrared excess and accretion process. Furthermore, we estimated the white dwarf temperatures by fitting their observed SEDs using a combination of Kurucz stellar atmosphere model and Koester white dwarf model. We compared the accretion rates of acc-SySt candidates and confirmed SySts, and found they have similar accretion rate distribution, providing evidence that these acc-SySt candidates constitute bona fide SySts. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 13 pages, 8 figures, Accepted by ApJ

arXiv:2506.09344 [pdf, ps, other]

Ming-Omni: A Unified Multimodal Model for Perception and Generation

Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan , et al. (33 additional authors not shown)

Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single… ▽ More We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 18 pages,8 figures

arXiv:2506.09240 [pdf]

Electron mobility in AlN from first principles

Authors: Amanda Wang, Nick Pant, Woncheol Lee, Jie-Cheng Chen, Feliciano Giustino, Emmanouil Kioupakis

Abstract: Aluminum nitride is a promising ultra-wide band gap semiconductor for optoelectronics and power electronics. However, its practical applications have been limited by challenges with doping and achieving high electrical conductivity. Recent advances in crystal quality and defect control have led to improvements in experimentally measured mobilities. In this work, we apply first-principles calculati… ▽ More Aluminum nitride is a promising ultra-wide band gap semiconductor for optoelectronics and power electronics. However, its practical applications have been limited by challenges with doping and achieving high electrical conductivity. Recent advances in crystal quality and defect control have led to improvements in experimentally measured mobilities. In this work, we apply first-principles calculations to determine the upper limits of the electron mobility in AlN as a function of temperature, doping, and crystallographic orientation. We account for the combined effects of electron scattering by phonons and ionized impurity to model doped systems, and examine both full and partial ionization conditions. Our results show that the piezoelectric interaction from the long-range component of the acoustic modes is the dominant source of electron-phonon scattering at room temperature. Ionized-impurity scattering starts to dominate scattering at dopant concentrations above $10^{16}$ cm$^{-3}$, reducing the mobility by more than an order of magnitude in the high doping regime. Our calculated Hall mobility values are in good agreement with experimental data for samples with comparable dopant concentrations. We also find that electron mobilities as high as $956$ cm$^2$/V$\cdot$s could be achievable at lower dopant concentrations. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 18 pages, 2 figures in main text, 2 in supplementary

arXiv:2506.09175 [pdf, ps, other]

PHRASED: Phrase Dictionary Biasing for Speech Translation

Authors: Peidong Wang, Jian Xue, Rui Zhao, Junkun Chen, Aswin Shanmugam Subramanian, Jinyu Li

Abstract: Phrases are essential to understand the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to t… ▽ More Phrases are essential to understand the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to two types of widely adopted models, a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that the phrase dictionary biasing method outperforms phrase list biasing by 21% relatively for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving 85% relative improvement in phrase recall. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.09114 [pdf, other]

TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval

Authors: Jialin Chen, Ziyu Zhao, Gaukhar Nurbek, Aosong Feng, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying

Abstract: The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation… ▽ More The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation models by retrieval-augmented generation (RAG). Despite the increasing demand, time-series retrieval remains largely underexplored. Existing methods often lack semantic grounding, struggle to align heterogeneous modalities, and have limited capacity for handling multi-channel signals. To address this gap, we propose TRACE, a generic multimodal retriever that grounds time-series embeddings in aligned textual context. TRACE enables fine-grained channel-level alignment and employs hard negative mining to facilitate semantically meaningful retrieval. It supports flexible cross-modal retrieval modes, including Text-to-Timeseries and Timeseries-to-Text, effectively linking linguistic descriptions with complex temporal patterns. By retrieving semantically relevant pairs, TRACE enriches downstream models with informative context, leading to improved predictive accuracy and interpretability. Beyond a static retrieval engine, TRACE also serves as a powerful standalone encoder, with lightweight task-specific tuning that refines context-aware representations while maintaining strong cross-modal alignment. These representations achieve state-of-the-art performance on downstream forecasting and classification tasks. Extensive experiments across multiple domains highlight its dual utility, as both an effective encoder for downstream applications and a general-purpose retriever to enhance time-series models. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.09080 [pdf, other]

FinHEAR: Human Expertise and Adaptive Risk-Aware Temporal Reasoning for Financial Decision-Making

Authors: Jiaxiang Chen, Mingxi Zou, Zhuo Wang, Qifan Wang, Dongning Sun, Chi Zhang, Zenglin Xu

Abstract: Financial decision-making presents unique challenges for language models, demanding temporal reasoning, adaptive risk assessment, and responsiveness to dynamic events. While large language models (LLMs) show strong general reasoning capabilities, they often fail to capture behavioral patterns central to human financial decisions-such as expert reliance under information asymmetry, loss-averse sens… ▽ More Financial decision-making presents unique challenges for language models, demanding temporal reasoning, adaptive risk assessment, and responsiveness to dynamic events. While large language models (LLMs) show strong general reasoning capabilities, they often fail to capture behavioral patterns central to human financial decisions-such as expert reliance under information asymmetry, loss-averse sensitivity, and feedback-driven temporal adjustment. We propose FinHEAR, a multi-agent framework for Human Expertise and Adaptive Risk-aware reasoning. FinHEAR orchestrates specialized LLM-based agents to analyze historical trends, interpret current events, and retrieve expert-informed precedents within an event-centric pipeline. Grounded in behavioral economics, it incorporates expert-guided retrieval, confidence-adjusted position sizing, and outcome-based refinement to enhance interpretability and robustness. Empirical results on curated financial datasets show that FinHEAR consistently outperforms strong baselines across trend prediction and trading tasks, achieving higher accuracy and better risk-adjusted returns. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.09052 [pdf]

Llama-Affinity: A Predictive Antibody Antigen Binding Model Integrating Antibody Sequences with Llama3 Backbone Architecture

Authors: Delower Hossain, Ehsan Saghapour, Kevin Song, Jake Y. Chen

Abstract: Antibody-facilitated immune responses are central to the body's defense against pathogens, viruses, and other foreign invaders. The ability of antibodies to specifically bind and neutralize antigens is vital for maintaining immunity. Over the past few decades, bioengineering advancements have significantly accelerated therapeutic antibody development. These antibody-derived drugs have shown remark… ▽ More Antibody-facilitated immune responses are central to the body's defense against pathogens, viruses, and other foreign invaders. The ability of antibodies to specifically bind and neutralize antigens is vital for maintaining immunity. Over the past few decades, bioengineering advancements have significantly accelerated therapeutic antibody development. These antibody-derived drugs have shown remarkable efficacy, particularly in treating cancer, SARS-CoV-2, autoimmune disorders, and infectious diseases. Traditionally, experimental methods for affinity measurement have been time-consuming and expensive. With the advent of artificial intelligence, in silico medicine has been revolutionized; recent developments in machine learning, particularly the use of large language models (LLMs) for representing antibodies, have opened up new avenues for AI-based design and improved affinity prediction. Herein, we present an advanced antibody-antigen binding affinity prediction model (LlamaAffinity), leveraging an open-source Llama 3 backbone and antibody sequence data sourced from the Observed Antibody Space (OAS) database. The proposed approach shows significant improvement over existing state-of-the-art (SOTA) methods (AntiFormer, AntiBERTa, AntiBERTy) across multiple evaluation metrics. Specifically, the model achieved an accuracy of 0.9640, an F1-score of 0.9643, a precision of 0.9702, a recall of 0.9586, and an AUC-ROC of 0.9936. Moreover, this strategy unveiled higher computational efficiency, with a five-fold average cumulative training time of only 0.46 hours, significantly lower than in previous studies. △ Less

Submitted 17 May, 2025; originally announced June 2025.

Comments: 7 Pages

arXiv:2506.09022 [pdf, ps, other]

Do Multiple Instance Learning Models Transfer?

Authors: Daniel Shao, Richard J. Chen, Andrew H. Song, Joel Runevic, Ming Y. Lu, Tong Ding, Faisal Mahmood

Abstract: Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the t… ▽ More Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource which standardizes the implementation of MIL models and collection of pretrained model weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab △ Less

Submitted 11 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: ICML 2025 (Spotlight). 20 pages, 8 figures

arXiv:2506.08967 [pdf, ps, other]

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang , et al. (51 additional authors not shown)

Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du… ▽ More Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoder in enhancing overall performance for AQAA tasks. △ Less

Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: 12 pages, 3 figures

arXiv:2506.08898 [pdf, ps, other]

Preference-Driven Multi-Objective Combinatorial Optimization with Conditional Computation

Authors: Mingfeng Fan, Jianan Zhou, Yifeng Zhang, Yaoxin Wu, Jinbiao Chen, Guillaume Adrien Sartoretti

Abstract: Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector. However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space… ▽ More Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector. However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space and thus leading to suboptimal performance. To overcome the limitation, we propose POCCO, a novel plug-and-play framework that enables adaptive selection of model structures for subproblems, which are subsequently optimized based on preference signals rather than explicit reward values. Specifically, we design a conditional computation block that routes subproblems to specialized neural architectures. Moreover, we propose a preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions. We evaluate the efficacy and versatility of POCCO by applying it to two state-of-the-art neural methods for MOCOPs. Experimental results across four classic MOCOP benchmarks demonstrate its significant superiority and strong generalization. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 22 pages, 6 figures, under review

arXiv:2506.08806 [pdf]

doi 10.1103/PhysRevB.111.235415

Ferromagnetic Two-dimensional Electron Gases with Magnetic Doping and Proximity Effects

Authors: Zixin Fan, Jiale Chen, Qiangtao Sui, Haoming Ling, Zihao Wang, Lingyuan Kong, Dingyi Li, Fang Yang, Run Zhao, Hanghui Chen, Pan Chen, Yan Liang, Jiandi Zhang

Abstract: The advent of magnetic two-dimensional electron gases (2DEGs) at oxide interfaces has provided new opportunities in the field of spintronics. The enhancement of magnetism in 2DEGs at oxide interfaces continues to be a significant challenge, as exemplified by the relatively weak magnetism observed in the classical LaAlO3/SrTiO3 interface. Here, we present ferromagnetic (FM) 2DEGs at the interface f… ▽ More The advent of magnetic two-dimensional electron gases (2DEGs) at oxide interfaces has provided new opportunities in the field of spintronics. The enhancement of magnetism in 2DEGs at oxide interfaces continues to be a significant challenge, as exemplified by the relatively weak magnetism observed in the classical LaAlO3/SrTiO3 interface. Here, we present ferromagnetic (FM) 2DEGs at the interface fabricated between the FM insulator EuTiO3 (ETO) and the strong spin-orbit coupled (SOC) perovskite insulator KTaO3 (KTO). With the combined effects of magnetic atom doping and magnetic proximity from ETO films, the coercive field of 2DEGs can be significantly enhanced. Magnetoresistance (MR) curve with a high coercive field of 1000 Oe has been observed, in conjunction with a temperature-dependent unambiguous hysteresis loop in the anomalous Hall effect (AHE). Furthermore, within the 2DEGs, we have identified a synergetic interplay between magnetic scattering and the weak antilocalization (WAL) effect on transport. This study provides fresh insights into the formation of FM 2DEGs at ETO/KTO interfaces, and introduce an innovative pathway for creating high-performance magnetic 2DEGs at oxide interfaces. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 21 pages and 5 figures

Journal ref: Phys. Rev. B 111, 235415 (2025)

arXiv:2506.08624 [pdf, ps, other]

Measurement of $ψ(2S)$ to $J/ψ$ cross-section ratio as function of multiplicity in $p$Pb collisions at$\sqrt{s_{NN}} = 8.16$ TeV

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis, L. An , et al. (1137 additional authors not shown)

Abstract: The production ratio of $ψ(2S)$ to $J/ψ$ charmonium states is presented as a function of multiplicity in proton-lead collisions at a centre-of-mass energy of $\sqrt{s_{NN}}=8.16$ TeV, for both prompt and nonprompt sources. The total luminosity recorded by the LHCb experiment corresponds to 13.6 $pb^{-1}$ for $p$Pb collisions and 20.8 $pb^{-1}$ for Pb$p$ collisions, where the first particle indicat… ▽ More The production ratio of $ψ(2S)$ to $J/ψ$ charmonium states is presented as a function of multiplicity in proton-lead collisions at a centre-of-mass energy of $\sqrt{s_{NN}}=8.16$ TeV, for both prompt and nonprompt sources. The total luminosity recorded by the LHCb experiment corresponds to 13.6 $pb^{-1}$ for $p$Pb collisions and 20.8 $pb^{-1}$ for Pb$p$ collisions, where the first particle indicates the forward direction of the detector. Measurements are performed in the dimuon final state at forward (backward) centre-of-mass rapidity $1.5<y^*<4.0$ ($-5.0<y^*<-2.5$) for $p$Pb (Pb$p$) collisions.A multiplicity dependence of the prompt production ratio is observed in $p$Pb collisions, whereas no dependence is found in nonprompt production, nor in either prompt or nonprompt production in Pb$p$ collisions. These results suggest that in the Pb-going direction additional suppression mechanisms beyond comover effects may be present, possibly related to the formation of quark-gluon plasma. This highlights a transition from small to large collision systems and provides important insight into the suppression of charmonia in proton-nucleus collisions. △ Less

Submitted 12 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: All figures and tables, along with machine-readable versions and any supplementary material and additional information, are available at https://lbfence.cern.ch/alcm/public/analysis/full-details/4177/ (LHCb public pages)

Report number: LHCb-PAPER-2025-011, CERN-EP-2025-114

arXiv:2506.08576 [pdf, ps, other]

Measurement of the $η$ transition form factor through $η' \rightarrow π^+π^-η$ decay

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (680 additional authors not shown)

Abstract: Based on a sample of $(1.0087\pm0.0044)\times10^{10}$ $J/ψ$ events collected at BESIII, the transition form factor of the $η$ meson is extracted by analyzing $J/ψ\toγη',~η'\toπ^+π^-η,~η\toγl^+l^-$ ($l$=$e$, $μ$) events. The measured slope of the transition form factor is $Λ^{-2}=1.645\pm0.093_{\rm stat.}\pm {0.024_{\rm sys.}}$ (GeV/$c^2$)$^{-2}$ for the di-electron channel and… ▽ More Based on a sample of $(1.0087\pm0.0044)\times10^{10}$ $J/ψ$ events collected at BESIII, the transition form factor of the $η$ meson is extracted by analyzing $J/ψ\toγη',~η'\toπ^+π^-η,~η\toγl^+l^-$ ($l$=$e$, $μ$) events. The measured slope of the transition form factor is $Λ^{-2}=1.645\pm0.093_{\rm stat.}\pm {0.024_{\rm sys.}}$ (GeV/$c^2$)$^{-2}$ for the di-electron channel and $Λ^{-2}=1.645\pm0.343_{\rm stat.}\pm0.017_{\rm sys.}$ (GeV/$c^2$)$^{-2}$ for the di-muon channel. The branching fractions for $η\rightarrowγe^+e^-$ and $η\rightarrowγμ^+μ^-$ are measured to be $\mathcal{B}(η\toγe^+e^-)=(6.79\pm0.04_{\rm stat.}\pm0.36_{\rm sys.})\times 10^{-3}$ and $\mathcal{B}(η\toγμ^+μ^-)=(2.97\pm0.11_{\rm stat.}\pm0.07_{\rm sys.})\times 10^{-4}$. By combining with the results based on the $J/ψ\toγη,~η\toγe^+e^-$ events from the previous BESIII measurement, we determine $Λ^{-2}=1.707\pm0.076_{\rm stat.}\pm0.029_{\rm sys.}$ (GeV/$c^2$)$^{-2}$ and $\mathcal{B}(η\toγe^+e^-)=(6.93\pm0.28_{\rm tot.})\times 10^{-3}$. In addition, we search for the dark photon ($A'$) using the combined events. No significant signal is observed, and the upper limits on $\mathcal{B}(η\toγA',~A'\to e^+e^-)$ are set at 90\% confidence level for different $A'$ mass hypotheses. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08534 [pdf, ps, other]

DCD: A Semantic Segmentation Model for Fetal Ultrasound Four-Chamber View

Authors: Donglian Li, Hui Guo, Minglang Chen, Huizhen Chen, Jialing Chen, Bocheng Liang, Pengchen Liang, Ying Tan

Abstract: Accurate segmentation of anatomical structures in the apical four-chamber (A4C) view of fetal echocardiography is essential for early diagnosis and prenatal evaluation of congenital heart disease (CHD). However, precise segmentation remains challenging due to ultrasound artifacts, speckle noise, anatomical variability, and boundary ambiguity across different gestational stages. To reduce the workl… ▽ More Accurate segmentation of anatomical structures in the apical four-chamber (A4C) view of fetal echocardiography is essential for early diagnosis and prenatal evaluation of congenital heart disease (CHD). However, precise segmentation remains challenging due to ultrasound artifacts, speckle noise, anatomical variability, and boundary ambiguity across different gestational stages. To reduce the workload of sonographers and enhance segmentation accuracy, we propose DCD, an advanced deep learning-based model for automatic segmentation of key anatomical structures in the fetal A4C view. Our model incorporates a Dense Atrous Spatial Pyramid Pooling (Dense ASPP) module, enabling superior multi-scale feature extraction, and a Convolutional Block Attention Module (CBAM) to enhance adaptive feature representation. By effectively capturing both local and global contextual information, DCD achieves precise and robust segmentation, contributing to improved prenatal cardiac assessment. △ Less

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08527 [pdf, ps, other]

The Compton-thick AGN Population and the $N_{\rm H}$ Distribution of Low-mass AGN in our Cosmic Backyard

Authors: A. Annuar, D. M. Alexander, P. Gandhi, G. B. Lansbury, M. N. Rosli, D. Stern, D. Asmus, D. R. Ballantyne, M. Baloković, F. E. Bauer, P. G. Boorman, W. N. Brandt, M. Brightman, C. T. J. Chen, A. Del Moro, D. Farrah, F. A. Harrison, M. J. Koss, L. Lanz, S. Marchesi, P. Mohanadas, E. Nardini, C. Ricci, L. Zappacosta

Abstract: We present a census of the Compton-thick (CT) active galactic nucleus (AGN) population and the column density ($N_{\rm{H}}$) distribution of AGN in our cosmic backyard using a mid-infrared selected AGN sample within 15 Mpc. The column densities are measured from broadband X-ray spectral analysis, mainly using data from $\textit{Chandra}$ and $\textit{NuSTAR}$. Our sample probes AGN with intrinsic… ▽ More We present a census of the Compton-thick (CT) active galactic nucleus (AGN) population and the column density ($N_{\rm{H}}$) distribution of AGN in our cosmic backyard using a mid-infrared selected AGN sample within 15 Mpc. The column densities are measured from broadband X-ray spectral analysis, mainly using data from $\textit{Chandra}$ and $\textit{NuSTAR}$. Our sample probes AGN with intrinsic 2-10 keV luminosities of $L_{\rm 2-10, int} = 10^{37}$-$10^{43}$ erg s$^{-1}$, reaching a parameter space inaccessible to more distant samples. We directly measure a 32$^{+30}_{-18}\%$ CT AGN fraction and obtain an $N_{\rm{H}}$ distribution that agrees with that inferred by the $\textit{Swift}$-BAT survey. Restricting the sample to the largely unexplored domain of low-luminosity AGN with $L_{\rm 2-10, int}$ $\leq$ $10^{42}$ erg s$^{-1}$, we found a CT fraction of 19$^{+30}_{-14}\%$, consistent with those observed at higher luminosities. Comparing the host-galaxy properties between the two samples, we find consistent star formation rates, though the majority of our galaxy have lower stellar masses (by $\approx 0.3$ dex). In contrast, the two samples have very different black hole mass ($M_{\rm BH}$) distributions, with our sample having $\approx$1.5 dex lower mean mass ($M_{\rm BH}$ $\sim$ 10$^{6}$ $M_\odot$). Additionally, our sample contains a significantly higher number of LINERs and H$_{\rm{II}}$-type nuclei. The Eddington ratio range probed by our sample, however, is the same as $\textit{Swift}$-BAT, although the latter dominates at higher accretion rates, and our sample is more evenly distributed. The majority of our sample with $λ_{\rm Edd} \ge$ 10$^{-3}$ tend to be CT, while those with $λ_{\rm Edd} <$ 10$^{-3}$ are mostly unobscured or mildly obscured. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 22 pages, 18 figures, 3 tables. Accepted for publication in MNRAS on 5 June 2025

arXiv:2506.08403 [pdf, ps, other]

TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Authors: Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen, Nuanqiao Shan, Xiaoqian Liu, Anping Liu, Huajie Liu, Hu Song, Linfeng Zhang

Abstract: Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtask… ▽ More Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for T ranslation A gents with Cognitive- T heoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at https://github.com/weiyali126/TACTIC. △ Less

Submitted 11 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

Comments: 20 pages, 4 figures, Under review. Code: https://github.com/weiyali126/TACTIC

arXiv:2506.08383 [pdf]

Network Threat Detection: Addressing Class Imbalanced Data with Deep Forest

Authors: Jiaqi Chen, Rongbin Ye

Abstract: With the rapid expansion of Internet of Things (IoT) networks, detecting malicious traffic in real-time has become a critical cybersecurity challenge. This research addresses the detection challenges by presenting a comprehensive empirical analysis of machine learning techniques for malware detection using the IoT-23 dataset provided by the Stratosphere Laboratory. We address the significant class… ▽ More With the rapid expansion of Internet of Things (IoT) networks, detecting malicious traffic in real-time has become a critical cybersecurity challenge. This research addresses the detection challenges by presenting a comprehensive empirical analysis of machine learning techniques for malware detection using the IoT-23 dataset provided by the Stratosphere Laboratory. We address the significant class imbalance within the dataset through three resampling strategies. We implement and compare a few machine learning techniques. Our findings demonstrate that the combination of appropriate imbalance treatment techniques with ensemble methods, particularly gcForest, achieves better detection performance compared to traditional approaches. This work contributes significantly to the development of more intelligent and efficient automated threat detection systems for IoT environments, helping to secure critical infrastructure against sophisticated cyber attacks while optimizing computational resource usage. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07999 [pdf, ps, other]

MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation

Authors: Junhao Chen, Yulia Tsvetkov, Xiaochuang Han

Abstract: Recent progress in multimodal generation has increasingly combined autoregressive (AR) and diffusion-based approaches, leveraging their complementary strengths: AR models capture long-range dependencies and produce fluent, context-aware outputs, while diffusion models operate in continuous latent spaces to refine high-fidelity visual details. However, existing hybrids often lack systematic guidanc… ▽ More Recent progress in multimodal generation has increasingly combined autoregressive (AR) and diffusion-based approaches, leveraging their complementary strengths: AR models capture long-range dependencies and produce fluent, context-aware outputs, while diffusion models operate in continuous latent spaces to refine high-fidelity visual details. However, existing hybrids often lack systematic guidance on how and why to allocate model capacity between these paradigms. In this work, we introduce MADFormer, a Mixed Autoregressive and Diffusion Transformer that serves as a testbed for analyzing AR-diffusion trade-offs. MADFormer partitions image generation into spatial blocks, using AR layers for one-pass global conditioning across blocks and diffusion layers for iterative local refinement within each block. Through controlled experiments on FFHQ-1024 and ImageNet, we identify two key insights: (1) block-wise partitioning significantly improves performance on high-resolution images, and (2) vertically mixing AR and diffusion layers yields better quality-efficiency balances--improving FID by up to 75% under constrained inference compute. Our findings offer practical design principles for future hybrid generative models. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07992 [pdf, ps, other]

PairEdit: Learning Semantic Variations for Exemplar-based Image Editing

Authors: Haoguang Lu, Jiacheng Chen, Zhenguo Yang, Aurele Tohokantche Gnanha, Fu Lee Wang, Li Qing, Xudong Mao

Abstract: Recent advancements in text-guided image editing have achieved notable success by leveraging natural language prompts for fine-grained semantic control. However, certain editing semantics are challenging to specify precisely using textual descriptions alone. A practical alternative involves learning editing semantics from paired source-target examples. Existing exemplar-based editing methods still… ▽ More Recent advancements in text-guided image editing have achieved notable success by leveraging natural language prompts for fine-grained semantic control. However, certain editing semantics are challenging to specify precisely using textual descriptions alone. A practical alternative involves learning editing semantics from paired source-target examples. Existing exemplar-based editing methods still rely on text prompts describing the change within paired examples or learning implicit text-based editing instructions. In this paper, we introduce PairEdit, a novel visual editing method designed to effectively learn complex editing semantics from a limited number of image pairs or even a single image pair, without using any textual guidance. We propose a target noise prediction that explicitly models semantic variations within paired images through a guidance direction term. Moreover, we introduce a content-preserving noise schedule to facilitate more effective semantic learning. We also propose optimizing distinct LoRAs to disentangle the learning of semantic variations from content. Extensive qualitative and quantitative evaluations demonstrate that PairEdit successfully learns intricate semantics while significantly improving content consistency compared to baseline methods. Code will be available at https://github.com/xudonmao/PairEdit. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07907 [pdf, ps, other]

A novel measurement of the strong-phase difference between $D^0\to K^-π^+$ and $\bar{D}^0\to K^-π^+$ decays using $C$-even and $C$-odd quantum-correlated $D\bar{D}$ pairs

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (707 additional authors not shown)

Abstract: A novel measurement technique of strong-phase differences between the decay amplitudes of $D^0$ and $\bar{D}^0$ mesons is introduced which exploits quantum-correlated $D\bar{D}$ pairs produced by $e^+e^-$ collisions at energies above the $ψ(3770)$ production threshold, where $D\bar{D}$ pairs are produced in both even and odd eigenstates of the charge-conjugation symmetry. Employing this technique,… ▽ More A novel measurement technique of strong-phase differences between the decay amplitudes of $D^0$ and $\bar{D}^0$ mesons is introduced which exploits quantum-correlated $D\bar{D}$ pairs produced by $e^+e^-$ collisions at energies above the $ψ(3770)$ production threshold, where $D\bar{D}$ pairs are produced in both even and odd eigenstates of the charge-conjugation symmetry. Employing this technique, the first determination of a $D^0$-$\bar{D^0}$ relative strong phase is reported with such data samples. The strong-phase difference between $D^0\to K^-π^+$ and $\bar{D}^0\to K^-π^+$ decays, $δ^{D}_{Kπ}$, is measured to be $δ^{D}_{Kπ}=\left(192.8^{+11.0 + 1.9}_{-12.4 -2.4}\right)^\circ$, using a dataset corresponding to an integrated luminosity of 7.13 $\text{fb}^{-1}$ collected at center-of-mass energies between $4.13-4.23 \text{ GeV}$ by the BESIII experiment. △ Less

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07906 [pdf, ps, other]

First observation of quantum correlations in $e^+e^-\to XD\bar{D}$ and $C$-even constrained $D\bar{D}$ pairs

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (707 additional authors not shown)

Abstract: The study of meson pairs produced with quantum correlations gives direct access to parameters that are challenging to measure in other systems. In this Letter, the existence of quantum correlations due to charge-conjugation symmetry $C$ are demonstrated in $D\bar{D}$ pairs produced through the processes $e^+e^-\to D\bar{D}$, $e^+e^- \to D^{*}\bar{D}$, and $e^+e^- \to D^{*} \bar{D}^*$, where the la… ▽ More The study of meson pairs produced with quantum correlations gives direct access to parameters that are challenging to measure in other systems. In this Letter, the existence of quantum correlations due to charge-conjugation symmetry $C$ are demonstrated in $D\bar{D}$ pairs produced through the processes $e^+e^-\to D\bar{D}$, $e^+e^- \to D^{*}\bar{D}$, and $e^+e^- \to D^{*} \bar{D}^*$, where the lack of charge superscripts refers to an admixture of neutral-charm-meson particle and antiparticle states, using $7.13 \text{ fb}^{-1}$ of $e^+e^-$ collision data collected by the BESIII experiment between center-of-mass energies of $4.13-4.23 \text{ GeV}$. Processes with either $C$-even or $C$-odd constraints are identified and separated. A procedure is presented that harnesses the entangled production process to enable measurements of $D^0$-meson hadronic parameters. This study provides the first confirmation of quantum correlations in $e^+e^-\to X D\bar{D}$ processes and the first observation of a $C$-even constrained $D\bar{D}$ system. The procedure is applied to measure $δ^{D}_{Kπ}$, the strong phase between the $D^0\to K^-π^+$ and $\bar{D}^0\to K^-π^+$ decay amplitudes, which results in the determination of $δ^{D}_{Kπ}=\left(192.8^{+11.0 + 1.9}_{-12.4 -2.4}\right)^\circ$. The potential for measurements of other hadronic decay parameters and charm mixing with these and future datasets is also discussed. △ Less

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07820 [pdf, ps, other]

Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation

Authors: Jiaxiang Chen, Zhuo Wang, Mingxi Zou, Qifan Wang, Zenglin Xu

Abstract: Human reasoning is flexible, adaptive, and grounded in prior experience-qualities that large language models (LLMs) still struggle to emulate. Existing methods either explore diverse reasoning paths at inference time or search for optimal workflows through expensive operations, but both fall short in leveraging multiple reusable strategies in a structured, efficient manner. We propose Guideline Fo… ▽ More Human reasoning is flexible, adaptive, and grounded in prior experience-qualities that large language models (LLMs) still struggle to emulate. Existing methods either explore diverse reasoning paths at inference time or search for optimal workflows through expensive operations, but both fall short in leveraging multiple reusable strategies in a structured, efficient manner. We propose Guideline Forest, a framework that enhances LLMs reasoning by inducing structured reasoning strategies-called guidelines-from verified examples and executing them via step-wise aggregation. Unlike test-time search or single-path distillation, our method draws on verified reasoning experiences by inducing reusable guidelines and expanding each into diverse variants. Much like human reasoning, these variants reflect alternative thought patterns, are executed in parallel, refined via self-correction, and aggregated step by step-enabling the model to adaptively resolve uncertainty and synthesize robust solutions.We evaluate Guideline Forest on four benchmarks-GSM8K, MATH-500, MBPP, and HumanEval-spanning mathematical and programmatic reasoning. Guideline Forest consistently outperforms strong baselines, including CoT, ReAct, ToT, FoT, and AFlow. Ablation studies further highlight the effectiveness of multi-path reasoning and stepwise aggregation, underscoring the Guideline Forest's adaptability and generalization potential. △ Less

Submitted 9 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07652 [pdf, ps, other]

FMaMIL: Frequency-Driven Mamba Multi-Instance Learning for Weakly Supervised Lesion Segmentation in Medical Images

Authors: Hangbei Cheng, Xiaorong Dong, Xueyu Liu, Jianan Zhang, Xuetao Ma, Mingqiang Wei, Liansheng Wang, Junxin Chen, Yongfei Wu

Abstract: Accurate lesion segmentation in histopathology images is essential for diagnostic interpretation and quantitative analysis, yet it remains challenging due to the limited availability of costly pixel-level annotations. To address this, we propose FMaMIL, a novel two-stage framework for weakly supervised lesion segmentation based solely on image-level labels. In the first stage, a lightweight Mamba-… ▽ More Accurate lesion segmentation in histopathology images is essential for diagnostic interpretation and quantitative analysis, yet it remains challenging due to the limited availability of costly pixel-level annotations. To address this, we propose FMaMIL, a novel two-stage framework for weakly supervised lesion segmentation based solely on image-level labels. In the first stage, a lightweight Mamba-based encoder is introduced to capture long-range dependencies across image patches under the MIL paradigm. To enhance spatial sensitivity and structural awareness, we design a learnable frequency-domain encoding module that supplements spatial-domain features with spectrum-based information. CAMs generated in this stage are used to guide segmentation training. In the second stage, we refine the initial pseudo labels via a CAM-guided soft-label supervision and a self-correction mechanism, enabling robust training even under label noise. Extensive experiments on both public and private histopathology datasets demonstrate that FMaMIL outperforms state-of-the-art weakly supervised methods without relying on pixel-level annotations, validating its effectiveness and potential for digital pathology applications. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07517 [pdf, ps, other]

doi 10.1145/3711896.3736832

Addressing Correlated Latent Exogenous Variables in Debiased Recommender Systems

Authors: Shuqiang Zhang, Yuchao Zhang, Jinkun Chen, Haochen Sui

Abstract: Recommendation systems (RS) aim to provide personalized content, but they face a challenge in unbiased learning due to selection bias, where users only interact with items they prefer. This bias leads to a distorted representation of user preferences, which hinders the accuracy and fairness of recommendations. To address the issue, various methods such as error imputation based, inverse propensity… ▽ More Recommendation systems (RS) aim to provide personalized content, but they face a challenge in unbiased learning due to selection bias, where users only interact with items they prefer. This bias leads to a distorted representation of user preferences, which hinders the accuracy and fairness of recommendations. To address the issue, various methods such as error imputation based, inverse propensity scoring, and doubly robust techniques have been developed. Despite the progress, from the structural causal model perspective, previous debiasing methods in RS assume the independence of the exogenous variables. In this paper, we release this assumption and propose a learning algorithm based on likelihood maximization to learn a prediction model. We first discuss the correlation and difference between unmeasured confounding and our scenario, then we propose a unified method that effectively handles latent exogenous variables. Specifically, our method models the data generation process with latent exogenous variables under mild normality assumptions. We then develop a Monte Carlo algorithm to numerically estimate the likelihood function. Extensive experiments on synthetic datasets and three real-world datasets demonstrate the effectiveness of our proposed method. The code is at https://github.com/WallaceSUI/kdd25-background-variable. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '25), August 3--7, 2025, Toronto, ON, Canada

arXiv:2506.07503 [pdf]

Large Language Models for Multilingual Vulnerability Detection: How Far Are We?

Authors: Honglin Shu, Michael Fu, Junji Yu, Dong Wang, Chakkrit Tantithamthavorn, Junjie Chen, Yasutaka Kamei

Abstract: Various deep learning-based approaches utilizing pre-trained language models (PLMs) have been proposed for automated vulnerability detection. With recent advancements in large language models (LLMs), several studies have begun exploring their application to vulnerability detection tasks. However, existing studies primarily focus on specific programming languages (e.g., C/C++) and function-level de… ▽ More Various deep learning-based approaches utilizing pre-trained language models (PLMs) have been proposed for automated vulnerability detection. With recent advancements in large language models (LLMs), several studies have begun exploring their application to vulnerability detection tasks. However, existing studies primarily focus on specific programming languages (e.g., C/C++) and function-level detection, leaving the strengths and weaknesses of PLMs and LLMs in multilingual and multi-granularity scenarios largely unexplored. To bridge this gap, we conduct a comprehensive fine-grained empirical study evaluating the effectiveness of state-of-the-art PLMs and LLMs for multilingual vulnerability detection. Using over 30,000 real-world vulnerability-fixing patches across seven programming languages, we systematically assess model performance at both the function-level and line-level. Our key findings indicate that GPT-4o, enhanced through instruction tuning and few-shot prompting, significantly outperforms all other evaluated models, including CodeT5P. Furthermore, the LLM-based approach demonstrates superior capability in detecting unique multilingual vulnerabilities, particularly excelling in identifying the most dangerous and high-severity vulnerabilities. These results underscore the promising potential of adopting LLMs for multilingual vulnerability detection at function-level and line-level, revealing their complementary strengths and substantial improvements over PLM approaches. This first empirical evaluation of PLMs and LLMs for multilingual vulnerability detection highlights LLMs' value in addressing real-world software security challenges. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: 33 pages, 9 figures

arXiv:2506.07463 [pdf, ps, other]

CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Authors: Guang Liu, Liangdong Wang, Jijie Li, Yang Yu, Yao Xu, Jiabei Chen, Yu Bai, Feng Liao, Yonghua Lin

Abstract: We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectory. CCI4.0 occupies roughly $35$ TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a $5.2$ TB carefully curated Chinese web corpus, a $22.5$ TB English subset from Nemotron-CC, and diverse sources fr… ▽ More We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectory. CCI4.0 occupies roughly $35$ TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a $5.2$ TB carefully curated Chinese web corpus, a $22.5$ TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic and require extensive expert experience and labor to process. So, we propose a novel pipeline justifying data quality mainly based on models through two-stage deduplication, multiclassifier quality scoring, and domain-aware fluency filtering. We extract $4.5$ billion pieces of CoT(Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing from the distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained in CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on automatically processing pretraining corpora. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07431 [pdf, ps, other]

FAMSeg: Fetal Femur and Cranial Ultrasound Segmentation Using Feature-Aware Attention and Mamba Enhancement

Authors: Jie He, Minglang Chen, Minying Lu, Bocheng Liang, Junming Wei, Guiyan Peng, Jiaxi Chen, Ying Tan

Abstract: Accurate ultrasound image segmentation is a prerequisite for precise biometrics and accurate assessment. Relying on manual delineation introduces significant errors and is time-consuming. However, existing segmentation models are designed based on objects in natural scenes, making them difficult to adapt to ultrasound objects with high noise and high similarity. This is particularly evident in sma… ▽ More Accurate ultrasound image segmentation is a prerequisite for precise biometrics and accurate assessment. Relying on manual delineation introduces significant errors and is time-consuming. However, existing segmentation models are designed based on objects in natural scenes, making them difficult to adapt to ultrasound objects with high noise and high similarity. This is particularly evident in small object segmentation, where a pronounced jagged effect occurs. Therefore, this paper proposes a fetal femur and cranial ultrasound image segmentation model based on feature perception and Mamba enhancement to address these challenges. Specifically, a longitudinal and transverse independent viewpoint scanning convolution block and a feature perception module were designed to enhance the ability to capture local detail information and improve the fusion of contextual information. Combined with the Mamba-optimized residual structure, this design suppresses the interference of raw noise and enhances local multi-dimensional scanning. The system builds global information and local feature dependencies, and is trained with a combination of different optimizers to achieve the optimal solution. After extensive experimental validation, the FAMSeg network achieved the fastest loss reduction and the best segmentation performance across images of varying sizes and orientations. △ Less

Submitted 14 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07368 [pdf, other]

C3S3: Complementary Competition and Contrastive Selection for Semi-Supervised Medical Image Segmentation

Authors: Jiaying He, Yitong Lin, Jiahe Chen, Honghui Xu, Jianwei Zheng

Abstract: For the immanent challenge of insufficiently annotated samples in the medical field, semi-supervised medical image segmentation (SSMIS) offers a promising solution. Despite achieving impressive results in delineating primary target areas, most current methodologies struggle to precisely capture the subtle details of boundaries. This deficiency often leads to significant diagnostic inaccuracies. To… ▽ More For the immanent challenge of insufficiently annotated samples in the medical field, semi-supervised medical image segmentation (SSMIS) offers a promising solution. Despite achieving impressive results in delineating primary target areas, most current methodologies struggle to precisely capture the subtle details of boundaries. This deficiency often leads to significant diagnostic inaccuracies. To tackle this issue, we introduce C3S3, a novel semi-supervised segmentation model that synergistically integrates complementary competition and contrastive selection. This design significantly sharpens boundary delineation and enhances overall precision. Specifically, we develop an $\textit{Outcome-Driven Contrastive Learning}$ module dedicated to refining boundary localization. Additionally, we incorporate a $\textit{Dynamic Complementary Competition}$ module that leverages two high-performing sub-networks to generate pseudo-labels, thereby further improving segmentation quality. The proposed C3S3 undergoes rigorous validation on two publicly accessible datasets, encompassing the practices of both MRI and CT scans. The results demonstrate that our method achieves superior performance compared to previous cutting-edge competitors. Especially, on the 95HD and ASD metrics, our approach achieves a notable improvement of at least $6\%$, highlighting the significant advancements. The code is available at https://github.com/Y-TARL/C3S3. △ Less

Submitted 8 June, 2025; originally announced June 2025.

Comments: 6 pages, 4 figures, ICME2025

arXiv:2506.07351 [pdf, ps, other]

Decentralized Optimization on Compact Submanifolds by Quantized Riemannian Gradient Tracking

Authors: Jun Chen, Lina Liu, Tianyi Zhu, Yong Liu, Guang Dai, Yunliang Jiang, Ivor W. Tsang

Abstract: This paper considers the problem of decentralized optimization on compact submanifolds, where a finite sum of smooth (possibly non-convex) local functions is minimized by $n$ agents forming an undirected and connected graph. However, the efficiency of distributed optimization is often hindered by communication bottlenecks. To mitigate this, we propose the Quantized Riemannian Gradient Tracking (Q-… ▽ More This paper considers the problem of decentralized optimization on compact submanifolds, where a finite sum of smooth (possibly non-convex) local functions is minimized by $n$ agents forming an undirected and connected graph. However, the efficiency of distributed optimization is often hindered by communication bottlenecks. To mitigate this, we propose the Quantized Riemannian Gradient Tracking (Q-RGT) algorithm, where agents update their local variables using quantized gradients. The introduction of quantization noise allows our algorithm to bypass the constraints of the accurate Riemannian projection operator (such as retraction), further improving iterative efficiency. To the best of our knowledge, this is the first algorithm to achieve an $\mathcal{O}(1/K)$ convergence rate in the presence of quantization, matching the convergence rate of methods without quantization. Additionally, we explicitly derive lower bounds on decentralized consensus associated with a function of quantization levels. Numerical experiments demonstrate that Q-RGT performs comparably to non-quantized methods while reducing communication bottlenecks and computational overhead. △ Less

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2506.07346 [pdf, ps, other]

Another look at quasilinear Schrödinger equations with prescribed mass via dual method

Authors: Jianhua Chen, Vicentiu D. Radulescu, Jijiang Sun, Jian Zhang

Abstract: In this paper, we aim to study the existence of ground state normalized solutions for the following quasilinear Schrödinger equation $-Δu-Δ(u^2)u=h(u)+λu,\,\, x\in\R^N$, under the mass constraint $\int_{\R^N}|u|^2\text{d}x=a,$ where $N\geq2$, $a>0$ is a given mass, $λ$ is a Lagrange multiplier and $h$ is a nonlinear reaction term with some suitable conditions. By employing a suitable transformatio… ▽ More In this paper, we aim to study the existence of ground state normalized solutions for the following quasilinear Schrödinger equation $-Δu-Δ(u^2)u=h(u)+λu,\,\, x\in\R^N$, under the mass constraint $\int_{\R^N}|u|^2\text{d}x=a,$ where $N\geq2$, $a>0$ is a given mass, $λ$ is a Lagrange multiplier and $h$ is a nonlinear reaction term with some suitable conditions. By employing a suitable transformation $u=f(v)$, we reformulate the original problem into the equivalent form $-Δv =h(f(v))f'(v)+λf(v)f'(v),\,\, x\in\R^N,$ with prescribed mass $ \int_{\R^N}|f(v)|^2\text{d}x=a. $ To address the challenge posed by the $L^2$-norm $\|f(v)\|^2_2$ not necessarily equaling $a$, we introduce a novel stretching mapping: $ v_t(x):=f^{-1}(t^{N/2}f(v(tx))). $ This construction, combined with a dual method and detailed analytical techniques, enables us to establish the following existence results: (1)Existence of solutions via constrained minimization using dual methods; (2) Existence of ground state normalized solutions under general $L^2$-supercritical growth conditions, along with nonexistence results, analyzed via dual methods; (3)Existence of normalized solutions under critical growth conditions, treated via dual methods. Additionally, we analyze the asymptotic behavior of the ground state energy obtained in {\bf(P2)}. Our results extend and refine those of Colin-Jeanjean-Squassina [Nonlinearity 20: 1353-1385, 2010], of Jeanjean-Luo-Wang [J. Differ. Equ. 259: 3894-3928, 2015], of Li-Zou [Pacific J. Math. 322: 99-138, 2023], of Zhang-Li-Wang [Topol. Math. Nonl. Anal. 61: 465-489, 2023] and so on. We believe that the methodology developed here can be adapted to study related problems concerning the existence of normalized solutions for quasilinear Schrödinger equations via the dual method. △ Less

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2506.07165 [pdf, ps, other]

AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models

Authors: Qi Liu, Jingqing Ruan, Hao Li, Haodong Zhao, Desheng Wang, Jiansong Chen, Wan Guanglu, Xunliang Cai, Zhi Zheng, Tong Xu

Abstract: Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dy… ▽ More Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO's capability to achieve dimension-aware preference alignment, highlighting its superiority. Our codes and datasets are available at https://github.com/Javkonline/AMoPO. △ Less

Submitted 8 June, 2025; originally announced June 2025.

Comments: Accepted by ACL 2025

arXiv:2506.07084 [pdf, ps, other]

The PML method for calculating the propagating modes of electromagnetic wave in periodic structures

Authors: Lide Cai, Junqing Chen, Yanpeng Gao

Abstract: When the electromagnetic wave is incident on the periodic structures, in addition to the scattering field, some propagating modes that are traveling in the periodic medium could be generated. In the present paper, we study the calculation of propagating modes. We formulate the problem as a nonlinear eigenvalue problem in an unbounded periodic domain. Then we use perfectly matched layers to truncat… ▽ More When the electromagnetic wave is incident on the periodic structures, in addition to the scattering field, some propagating modes that are traveling in the periodic medium could be generated. In the present paper, we study the calculation of propagating modes. We formulate the problem as a nonlinear eigenvalue problem in an unbounded periodic domain. Then we use perfectly matched layers to truncate the unbounded domain, recast the problem to a quadratic eigenvalue problem, and prove the approximation property of the truncation. Finally, we formulate the quadratic eigenvalue problem to a general eigenvalue problem, use the finite element method to discrete the truncation problem, and show numerical examples to verify theoretical results. △ Less

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2506.06952 [pdf, ps, other]

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Authors: Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

Abstract: Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of… ▽ More Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models. △ Less

Submitted 7 June, 2025; originally announced June 2025.

Comments: Unified multimodal model, Flow-matching

arXiv:2506.06664 [pdf, ps, other]

Generalized Trajectory Scoring for End-to-end Multimodal Planning

Authors: Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, Jose M. Alvarez

Abstract: End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approac… ▽ More End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approaches face significant limitations in generalization. Static vocabularies provide effective coarse discretization but struggle to make fine-grained adaptation, while dynamic proposals offer detailed precision but fail to capture broader trajectory distributions. To overcome these challenges, we propose GTRS (Generalized Trajectory Scoring), a unified framework for end-to-end multi-modal planning that combines coarse and fine-grained trajectory evaluation. GTRS consists of three complementary innovations: (1) a diffusion-based trajectory generator that produces diverse fine-grained proposals; (2) a vocabulary generalization technique that trains a scorer on super-dense trajectory sets with dropout regularization, enabling its robust inference on smaller subsets; and (3) a sensor augmentation strategy that enhances out-of-domain generalization while incorporating refinement training for critical trajectory discrimination. As the winning solution of the Navsim v2 Challenge, GTRS demonstrates superior performance even with sub-optimal sensor inputs, approaching privileged methods that rely on ground-truth perception. Code will be available at https://github.com/NVlabs/GTRS. △ Less

Submitted 7 June, 2025; originally announced June 2025.

Comments: The 1st place solution of the End-to-end Driving Track at the CVPR 2025 Autonomous Grand Challenge

arXiv:2506.06541 [pdf, ps, other]

KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Authors: Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Sivaprasad Sudhir, Om Chabra, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska

Abstract: Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and… ▽ More Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and understanding capabilities. However, it remains unclear to what extent these capabilities translate into successful design and execution of such complex pipelines. We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines spanning 1700 data files from 24 data sources in 6 different domains. We show that these pipelines test the end-to-end capabilities of AI systems on data processing, requiring data discovery, wrangling and cleaning, efficient processing, statistical reasoning, and orchestrating data processing steps given a high-level task. Our evaluation tests 5 general models and 3 code generation models using our reference framework, DS-GURU, which instructs the AI model to decompose a question into a sequence of subtasks, reason through each step, and synthesize Python code that implements the proposed design. Our results on KRAMABENCH show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, when extensive data processing and domain knowledge are required to construct real-world data science pipelines, existing out-of-box models fall short. Progress on KramaBench represents crucial steps towards developing autonomous data science agents for real-world applications. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench. △ Less

Submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.06362 [pdf, ps, other]

CR-BLEA: Contrastive Ranking for Adaptive Resource Allocation in Bilevel Evolutionary Algorithms

Authors: Dejun Xu, Jijia Chen, Gary G. Yen, Min Jiang

Abstract: Bilevel optimization poses a significant computational challenge due to its nested structure, where each upper-level candidate solution requires solving a corresponding lower-level problem. While evolutionary algorithms (EAs) are effective at navigating such complex landscapes, their high resource demands remain a key bottleneck -- particularly the redundant evaluation of numerous unpromising lowe… ▽ More Bilevel optimization poses a significant computational challenge due to its nested structure, where each upper-level candidate solution requires solving a corresponding lower-level problem. While evolutionary algorithms (EAs) are effective at navigating such complex landscapes, their high resource demands remain a key bottleneck -- particularly the redundant evaluation of numerous unpromising lower-level tasks. Despite recent advances in multitasking and transfer learning, resource waste persists. To address this issue, we propose a novel resource allocation framework for bilevel EAs that selectively identifies and focuses on promising lower-level tasks. Central to our approach is a contrastive ranking network that learns relational patterns between paired upper- and lower-level solutions online. This knowledge guides a reference-based ranking strategy that prioritizes tasks for optimization and adaptively controls resampling based on estimated population quality. Comprehensive experiments across five state-of-the-art bilevel algorithms show that our framework significantly reduces computational cost while preserving -- or even enhancing -- solution accuracy. This work offers a generalizable strategy to improve the efficiency of bilevel EAs, paving the way for more scalable bilevel optimization. △ Less

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.06295 [pdf, ps, other]

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Authors: Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang

Abstract: Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniqu… ▽ More Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1 x speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Codes are provided in the supplementary material and will be released publicly on GitHub. △ Less

Submitted 17 May, 2025; originally announced June 2025.

arXiv:2506.06283 [pdf, other]

Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow

Authors: Juexiao Zhou, Zhongyi Han, Mankun Xin, Xingwei He, Guotao Wang, Jiaoyan Song, Gongning Luo, Wenjia He, Xintong Li, Yuetan Chu, Juanwen Chen, Bo Wang, Xia Wu, Wenwen Duan, Zhixia Guo, Liyan Bai, Yilin Pan, Xuefei Bi, Lu Liu, Long Feng, Xiaonan He, Xin Gao

Abstract: Global population aging presents increasing challenges to healthcare systems, with coronary artery disease (CAD) responsible for approximately 17.8 million deaths annually, making it a leading cause of global mortality. As CAD is largely preventable, early detection and proactive management are essential. In this work, we introduce DigitalShadow, an advanced early warning system for CAD, powered b… ▽ More Global population aging presents increasing challenges to healthcare systems, with coronary artery disease (CAD) responsible for approximately 17.8 million deaths annually, making it a leading cause of global mortality. As CAD is largely preventable, early detection and proactive management are essential. In this work, we introduce DigitalShadow, an advanced early warning system for CAD, powered by a fine-tuned facial foundation model. The system is pre-trained on 21 million facial images and subsequently fine-tuned into LiveCAD, a specialized CAD risk assessment model trained on 7,004 facial images from 1,751 subjects across four hospitals in China. DigitalShadow functions passively and contactlessly, extracting facial features from live video streams without requiring active user engagement. Integrated with a personalized database, it generates natural language risk reports and individualized health recommendations. With privacy as a core design principle, DigitalShadow supports local deployment to ensure secure handling of user data. △ Less

Submitted 23 April, 2025; originally announced June 2025.

Showing 51–100 of 10,762 results for author: Chen, J