Search | arXiv e-print repository

Search for $e^+e^-\to K_S^0 K_S^0 h_c$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (642 additional authors not shown)

Abstract: Using $e^+e^-$ collision data at 13 center-of-mass energies ranging from 4.600 to 4.950 GeV collected with the BESIII detector, we search for the unmeasured $e^+e^-\to K_S^0 K_S^0 h_c$ process. No significant signal is observed, and the upper limits of the Born cross sections at each center-of-mass energy are presented.

Submitted 11 February, 2025; originally announced February 2025.

arXiv:2502.07319 [pdf, ps, other]

Learnable Residual-based Latent Denoising in Semantic Communication

Authors: Mingkai Xu, Yongpeng Wu, Yuxuan Shi, Xiang-Gen Xia, Wenjun Zhang, Ping Zhang

Abstract: A latent denoising semantic communication (SemCom) framework is proposed for robust image transmission over noisy channels. By incorporating a learnable latent denoiser into the receiver, the received signals are preprocessed to effectively remove the channel noise and recover the semantic information, thereby enhancing the quality of the decoded images. Specifically, a latent denoising mapping is established by an iterative residual learning approach to improve the denoising efficiency while ensuring stable performance. Moreover, channel signal-to-noise ratio (SNR) is utilized to estimate and predict the latent similarity score (SS) for conditional denoising, where the number of denoising steps is adapted based on the predicted SS sequence, further reducing the communication latency. Finally, simulations demonstrate that the proposed framework can effectively and efficiently remove the channel noise at various levels and reconstruct visually appealing images.

Submitted 29 April, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: This paper has been accepted by IEEE Wireless Communications Letters

arXiv:2502.07317 [pdf, other]

doi 10.1016/j.nima.2025.170548

Position reconstruction and surface background model for the PandaX-4T detector

Authors: Zhicheng Qian, Linhui Gu, Chen Cheng, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingjie Fan, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou, et al. (78 additional authors not shown)

Abstract: We report the position reconstruction methods and surface background model for the PandaX-4T dark matter direct search experiment. This work develops two position reconstruction algorithms: template matching (TM) method and photon acceptance function (PAF) method. Both methods determine the horizontal position of events based on the light pattern of secondary scintillation collected by the light sensors. After a comprehensive evaluation of resolution, uniformity, and robustness, the PAF method was selected for position reconstruction, while the TM method was employed for verification. The PAF method achieves a bulk event resolution of 1.0 mm and a surface event resolution of 4.4 mm for a typical $S2$ signal with a bottom charge of 1500 PE (about 14 keV). The uniformity is around 20\%. Robustness studies reveal average deviations of 5.1 mm and 8.8 mm for the commissioning run (Run0) and the first science run (Run1), respectively, due to the deactivation of certain PMTs. A data-driven surface background model is developed based on the PAF method. The surface background is estimated to be $0.09 \pm 0.06$ events for Run0 (0.54 tonne$\cdot$year) and $0.17 \pm 0.11$ events for Run1 (1.00 tonne$\cdot$year).

Submitted 11 February, 2025; originally announced February 2025.

Comments: 22 pages, 15 figures, 2 tables

arXiv:2502.07239 [pdf, other]

Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation

Authors: Pinxin Liu, Pengfei Zhang, Hyeongwoo Kim, Pablo Garrido, Ari Sharpio, Kyle Olszewski

Abstract: Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interactions by synchronizing gestures with speech. Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connection to link gesture keypoints to improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1. Project Page: https://andypinxinliu.github.io/Contextual-Gesture/.

Submitted 10 February, 2025; originally announced February 2025.

arXiv:2502.07027 [pdf, other]

Representational Alignment with Chemical Induced Fit for Molecular Relational Learning

Authors: Peiliang Zhang, Jingling Yuan, Qing Xie, Yongjun Zhu, Lin Li

Abstract: Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting structural features. The representational similarity between substructure pairs determines the functional compatibility of molecular binding sites. Nevertheless, aligning substructure representations by attention mechanisms lacks guidance from chemical knowledge, resulting in unstable model performance in chemical space (\textit{e.g.}, functional group, scaffold) shifted data. With theoretical justification, we propose the \textbf{Re}presentational \textbf{Align}ment with Chemical Induced \textbf{Fit} (ReAlignFit) to enhance the stability of MRL. ReAlignFit dynamically aligns substructure representation in MRL by introducing chemical Induced Fit-based inductive bias. In the induction process, we design the Bias Correction Function based on substructure edge reconstruction to align representations between substructure pairs by simulating chemical conformational changes (dynamic combination of substructures). ReAlignFit further integrates the Subgraph Information Bottleneck during the fit process to refine and optimize substructure pairs exhibiting high chemical functional compatibility, leveraging them to generate molecular embeddings. Experimental results on nine datasets demonstrate that ReAlignFit outperforms state-of-the-art models in two tasks and significantly enhances the model's stability in both rule-shifted and scaffold-shifted data distributions.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.06877 [pdf, other]

WirelessGPT: A Generative Pre-trained Multi-task Learning Framework for Wireless Communication

Authors: Tingting Yang, Ping Zhang, Mengfan Zheng, Yuxuan Shi, Liwen Jing, Jianbo Huang, Nan Li

Abstract: This paper introduces WirelessGPT, a pioneering foundation model specifically designed for multi-task learning in wireless communication and sensing. Specifically, WirelessGPT leverages large-scale wireless channel datasets for unsupervised pretraining and extracting universal channel representations that capture complex spatiotemporal dependencies. In fact, this task-agnostic design adapts WirelessGPT seamlessly to a wide range of downstream tasks, using a unified representation with minimal fine-tuning. By unifying communication and sensing functionalities, WirelessGPT addresses the limitations of task-specific models, offering a scalable and efficient solution for integrated sensing and communication (ISAC). With an initial parameter size of around 80 million, WirelessGPT demonstrates significant improvements over conventional methods and smaller AI models, reducing reliance on large-scale labeled data. As the first foundation model capable of supporting diverse tasks across different domains, WirelessGPT establishes a new benchmark, paving the way for future advancements in multi-task wireless systems.

Submitted 8 February, 2025; originally announced February 2025.

Comments: 8 pages, 4 figures

arXiv:2502.06155 [pdf, other]

Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

Authors: Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang

Abstract: Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes for generating a single video of 29 frames. This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data. We identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that holds a linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation. We split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to conjoin the low-complexity attention and few-step generation capacities. Notably, with 0.1% pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x-7.8x faster for 29 and 93 frames 720p video generation with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.

Submitted 17 February, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

arXiv:2502.06145 [pdf, other]

Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

Authors: Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, Liefeng Bo

Abstract: Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. Beyond extracting motion signals from source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region with the exclusion of characters and our model generates characters to populate these regions while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending for feature injection. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.

Submitted 9 February, 2025; originally announced February 2025.

Comments: Project Page: https://humanaigc.github.io/animate-anyone-2/

arXiv:2502.06090

Tip-Enhanced Raman Spectroscopy of Cell Wall Heterogeneity for Aspergillus Fumigatus

Authors: Zhenfei Jiang, Jizhou Wang, Zhe He, Peng Zhang, Zhenhuan Yi, Alexei V. Sokolov, Marlan O. Scully

Abstract: Tip-enhanced Raman spectroscopy (TERS) enables nanoscale chemical mapping of biological structures, providing high-resolution, high-signal-to-noise ratio imaging into molecular distribution and interactions beyond the capabilities of conventional Raman imaging. However, challenges such as the deformation of fragile biological cells and the complexity of signal interpretation increase the difficulty of investigating biological samples with TERS. Here, we demonstrate using TERS to investigate the cell wall heterogeneity of Aspergillus fumigatus spores. Using TERS imaging and spectral analysis, we map the chemical components including melanin within the fungal cell wall. The results reveal distinct spectral features associated with polysaccharides, lipids, and proteins. Furthermore, by comparing the wild-type and albino mutant spores, we illuminate the biochemical characteristics of Dihydroxynaphthalene melanin (DHN-melanin) in the fungal cell wall.

Submitted 7 May, 2025; v1 submitted 9 February, 2025; originally announced February 2025.

Comments: need a major revision

arXiv:2502.05783 [pdf, other]

WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch

Authors: Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu

Abstract: While just-in-time interventions (JITIs) have effectively targeted common health behaviors, individuals often have unique needs to intervene in personal undesirable actions that can negatively affect physical, mental, and social well-being. We present WatchGuardian, a smartwatch-based JITI system that empowers users to define custom interventions for these personal actions with a small number of samples. For the model to detect new actions based on limited new data samples, we developed a few-shot learning pipeline that finetuned a pre-trained inertial measurement unit (IMU) model on public hand-gesture datasets. We then designed a data augmentation and synthesis process to train additional classification layers for customization. Our offline evaluation with 26 participants showed that with three, five, and ten examples, our approach achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of 74.8%, 84.2%, and 87.2%. We then conducted a four-hour intervention study to compare WatchGuardian against a rule-based intervention. Our results demonstrated that our system led to a significant reduction by 64.0 ± 22.6% in undesirable actions, substantially outperforming the baseline by 29.0%. Our findings underscore the effectiveness of a customizable, AI-driven JITI system for individuals in need of behavioral intervention in personal undesirable actions. We envision that our work can inspire broader applications of user-defined personalized intervention with advanced AI solutions.

Submitted 9 February, 2025; originally announced February 2025.

Comments: Under submission

MSC Class: 68U35 ACM Class: H.5.2; I.2.1

arXiv:2502.05657 [pdf, other]

Ideas and Requirements for the Global Cosmic-Ray Observatory (GCOS)

Authors: Markus Ahlers, Ingo Allekotte, Jaime Alvarez-Muniz, Gioacchino Alex Anastasi, Luis Anchordoqui, Rita de Cassia Dos Anjos, Hari Haran Balakrishnan, Rafael Alves Batista, Jose Bellido, Mario Bertaina, Sonali Bhatnagar, Pierre Billoir, Kathrin Bismark, Teresa Bister, Martina Bohacova, Carla Bonifazi, Fraser Bradfield, Antonella Castellina, Lorenzo Cazon, Kevin Almeida Cheminant, Alan Coleman, Fabio Convenga, Darko Veberič, Paramita Dasgupta, Kai Daumiller, et al. (114 additional authors not shown)

Abstract: After a successful kick-off meeting in 2021, two workshops in 2022 and 2023 on the future Global Cosmic-Ray Observatory (GCOS) focused mainly on a straw-man design of the detector and science possibilities for astro- and particle physics. About 100 participants gathered for in-person and hybrid panel discussions. In this report, we summarize these discussions, present a preliminary straw-man design for GCOS and collect short write-ups of the flash talks given during the focus sessions.

Submitted 8 February, 2025; originally announced February 2025.

Comments: 48 pages, 27 figures

arXiv:2502.05422 [pdf]

Magnetic transition in marcasite FeTe$_{2}$ induced by the competition between crystal field splitting and Coulomb repulsion

Authors: Yue-Fei Hou, Zhibin Shao, Minghu Pan, Shiyang Wu, Fawei Zheng, Zhen-Guo Fu, Ping Zhang

Abstract: The magnetic ground states in crystalline systems are significant for both fundamental condensed matter physics and practical materials engineering. Marcasite FeTe$_{2}$, characterized as a small-gap semiconductor, exhibits anomalous magnetic behaviors in low-temperature experiments. In this study, first-principles density functional theory calculations combined with scanning tunneling microscopy/spectroscopy are employed to investigate the magnetic ground state of marcasite FeTe$_{2}$. It is revealed that the competition between crystal field splitting and Coulomb repulsion plays the key role in the formation of localized magnetic moments in FeTe$_{2}$. The ground state of FeTe$_{2}$ bulk is confirmed to be nonmagnetic, while the previously observed magnetic responses of FeTe$_{2}$ are suggested to be related to the magnetic Fe atoms on the crystal surfaces. Our work proposes a straightforward theoretical criterion for determining ground-state magnetism of various localized-moment systems.

Submitted 10 February, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

Comments: 6 figures, 13 pages. Minor revisions were performed. Comments are welcome

arXiv:2502.05173 [pdf, other]

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

Abstract: While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce \textbf{VideoRoPE}, with a \textit{3D structure} designed to preserve spatio-temporal relationships. VideoRoPE features \textit{low-frequency temporal allocation} to mitigate periodic oscillations, a \textit{diagonal layout} to maintain spatial symmetry, and \textit{adjustable temporal spacing} to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at \href{https://github.com/Wiselnn570/VideoRoPE}{https://github.com/Wiselnn570/VideoRoPE}.

Submitted 27 April, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.04848 [pdf, other]

Broadband $γ$-ray spectrum of supernova remnant Cassiopeia A

Authors: Zhen Cao, F. Aharonian, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, W. Bian, A. V. Bukevich, C. M. Cai, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, H. X. Chen, Liang Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. Chen, S. H. Chen, S. Z. Chen, et al. (293 additional authors not shown)

Abstract: The core-collapse supernova remnant (SNR) Cassiopeia A (Cas A) is one of the brightest galactic radio sources with an angular radius of $\sim$ 2.5 $\arcmin$. Although no extension of this source has been detected in the $γ$-ray band, using more than 1000 days of LHAASO data above $\sim 0.8$ TeV, we find that its spectrum is significantly softer than those obtained with Imaging Air Cherenkov Telescopes (IACTs) and its flux near $\sim 1$ TeV is about two times higher. In combination with analyses of more than 16 years of \textit{Fermi}-LAT data covering $0.1 \, \mathrm{GeV} - 1 \, \mathrm{TeV}$, we find that the spectrum above 30 GeV deviates significantly from a single power-law, and is best described by a smoothly broken power-law with a spectral index of $1.90 \pm 0.15_\mathrm{stat}$ ($3.41 \pm 0.19_\mathrm{stat}$) below (above) a break energy of $0.63 \pm 0.21_\mathrm{stat} \, \mathrm{TeV}$. Given differences in the angular resolution of LHAASO-WCDA and IACTs, TeV $γ$-ray emission detected with LHAASO may have a significant contribution from regions surrounding the SNR illuminated by particles accelerated earlier, which, however, are treated as background by IACTs. Detailed modelling can be used to constrain acceleration processes of TeV particles in the early stage of SNR evolution.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.04681 [pdf, other]

CALF-SBM: A Covariate-Assisted Latent Factor Stochastic Block Model

Authors: Sydney Louit, Evan Clark, Alexander Gelbard, Niketna Vivek, Jun Yan, Panpan Zhang

Abstract: We propose a novel network generative model extended from the standard stochastic block model by concurrently utilizing observed node-level information and accounting for network-enabled nodal heterogeneity. The proposed model is the so-called covariate-assisted latent factor stochastic block model (CALF-SBM). The inference for the proposed model is done in a fully Bayesian framework. The primary application of CALF-SBM in the present research is focused on community detection, where a model-selection-based approach is employed to estimate the number of communities, which is assumed unknown in practice. To assess the performance of CALF-SBM, an extensive simulation study is carried out, including comparisons with multiple classical and modern network clustering algorithms. Lastly, the paper presents two real data applications, respectively based on an extremely new network data demonstrating collaborative relationships of otolaryngologists in the United States and a traditional aviation network data containing information about direct flights between airports in the United States and Canada.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.04674 [pdf, other]

AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts

Authors: Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura

Abstract: Effective linguistic choices that attract potential customers play crucial roles in advertising success. This study aims to explore the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including their content, such as brand names, and their linguistic styles, making analysis challenging. Second, publicly available ad text datasets that include human preferences are lacking, such as ad performance metrics and human feedback, which reflect people's interests. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in terms of wording and style. This dataset allows for preference analysis that focuses on the differences in linguistic features. Our analysis revealed that ad texts preferred by human judges have higher fluency, longer length, more nouns, and use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that considers these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase.

Submitted 11 February, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

Comments: Accepted to NAACL2025 Findings

arXiv:2502.04649 [pdf, other]

End-to-End Learning Framework for Solving Non-Markovian Optimal Control

Authors: Xiaole Zhang, Peiyu Zhang, Xiongye Xiao, Shixuan Li, Vasileios Tzoumas, Vijay Gupta, Paul Bogdan

Abstract: Integer-order calculus often falls short in capturing the long-range dependencies and memory effects found in many real-world processes. Fractional calculus addresses these gaps via fractional-order integrals and derivatives, but fractional-order dynamical systems pose substantial challenges in system identification and optimal control due to the lack of standard control methodologies. In this paper, we theoretically derive the optimal control via linear quadratic regulator (LQR) for fractional-order linear time-invariant (FOLTI) systems and develop an end-to-end deep learning framework based on this theoretical foundation. Our approach establishes a rigorous mathematical model, derives analytical solutions, and incorporates deep learning to achieve data-driven optimal control of FOLTI systems. Our key contributions include: (i) proposing an innovative system identification method control strategy for FOLTI systems, (ii) developing the first end-to-end data-driven learning framework, Fractional-Order Learning for Optimal Control (FOLOC), that learns control policies from observed trajectories, and (iii) deriving a theoretical analysis of sample complexity to quantify the number of samples required for accurate optimal control in complex real-world problems. Experimental results indicate that our method accurately approximates fractional-order system behaviors without relying on Gaussian noise assumptions, pointing to promising avenues for advanced optimal control.

Submitted 1 May, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

arXiv:2502.04507 [pdf, other]

Fast Video Generation with Sliding Tile Attention

Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang

Abstract: Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.

Submitted 6 February, 2025; originally announced February 2025.

arXiv:2502.04288 [pdf]

Leveraging Geolocation in Clinical Records to Improve Alzheimer's Disease Diagnosis Using DMV Framework

Authors: Peng Zhang, Divya Chaudhary

Abstract: Alzheimer's Disease (AD) early detection is critical for enabling timely intervention and improving patient outcomes. This paper presents a DMV framework using Llama3-70B and GPT-4o as embedding models to analyze clinical notes and predict a continuous risk score associated with early AD onset. Framing the task as a regression problem, we model the relationship between linguistic features in clinical notes (inputs) and a target variable (data value) that answers specific questions related to AD risk within certain topic categories. By leveraging a multi-faceted feature set that includes geolocation data, we capture additional environmental context potentially linked to AD. Our results demonstrate that the integration of the geolocation information significantly decreases the error of predicting early AD risk scores over prior models by 28.57% (Llama3-70B) and 33.47% (GPT-4o). Our findings suggest that this combined approach can enhance the predictive accuracy of AD risk assessment, supporting early diagnosis and intervention in clinical settings. Additionally, the framework's ability to incorporate geolocation data provides a more comprehensive risk assessment model that could help healthcare providers better understand and address environmental factors contributing to AD development.

Submitted 6 February, 2025; originally announced February 2025.

arXiv:2502.04268 [pdf, other]

Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances

Authors: Yi Yu, Botao Ren, Peiyuan Zhang, Mingxin Liu, Junwei Luo, Shaofeng Zhang, Feipeng Da, Junchi Yan, Xue Yang

Abstract: With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At the core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through watershed on Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To the best of our knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is elegant and lightweight, yet it is expected to give a competitive performance especially in densely packed scenes: 62.61%/86.15%/34.71% on DOTA/HRSC/FAIR1M. Code is available at https://github.com/VisionXLab/point2rbox-v2.

Submitted 6 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

Comments: 11 pages, 5 figures, 10 tables

arXiv:2502.03828 [pdf, ps, other]

doi 10.1103/PhysRevD.111.L071101

Observation of $D\to \bar{K}_{1}(1270)μ^+ν_μ$ and test of lepton flavor universality with $D\to \bar{K}_1(1270) \ell^{+} ν_{\ell}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (646 additional authors not shown)

Abstract: By analyzing 7.93 $\rm fb^{-1}$ of $e^+e^-$ collision data collected at the center-of-mass energy of 3.773 GeV with the BESIII detector operated at the BEPCII collider, we report the observation of the semimuonic decays of $D^+\to \bar K_1(1270)^0μ^+ν_μ$ and $D^0\to K_1(1270)^-μ^+ν_μ$ with statistical significances of $12.5σ$ and $6.0σ$, respectively. Their decay branching fractions are determined to be ${\mathcal B}[D^{+}\to \bar{K}_1(1270)^0 μ^{+}ν_μ]=(2.36\pm0.20^{+0.18}_{-0.27}\pm 0.48)\times10^{-3}$ and ${\mathcal B}[D^{0}\to K_1(1270)^{-} μ^{+}ν_μ]=(0.78\pm0.11^{+0.05}_{-0.09}\pm 0.15)\times10^{-3}$, where the first and second uncertainties are statistical and systematic, respectively, and the third originates from the input branching fraction of $\bar K_{1}(1270)^0\to K^- π^+π^0$ or $K_1(1270)^-\to K^-π^+π^-$. Combining our branching fractions with the previous measurements of ${\mathcal B}[D^+\to \bar K_1(1270)^0e^+ν_{e}]$ and ${\mathcal B}[D^0\to K_1(1270)^-e^+ν_{e}]$, we determine the branching fraction ratios to be ${\mathcal B}[D^+\to \bar K_1(1270)^0μ^+ν_μ]/{\mathcal B}[D^+\to \bar K_1(1270)^0e^+ν_{e}]=1.03 \pm 0.14 \substack{+0.11\\-0.15}$ and ${\mathcal B}[D^0\to K_1(1270)^-μ^+ν_μ]/{\mathcal B}[D^0\to K_1(1270)^-e^+ν_{e}]=0.74\pm 0.13 \substack{+0.08\\-0.13}$. Using the branching fractions measured in this work and the world-average lifetimes of the $D^+$ and $D^0$ mesons, we determine the semimuonic partial decay width ratio to be $Γ[D^+\to \bar K_1(1270)^0 μ^+ν_μ]/Γ[D^0\to K_1(1270)^- μ^+ν_μ]=1.22\pm 0.10\substack{+0.06\\-0.09}$, which is consistent with unity as predicted by isospin conservation.

Submitted 18 April, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

Comments: 11 pages, 5 figures

Journal ref: Phys. Rev. D 111, L071101 (2025)

arXiv:2502.03732 [pdf, other]

More Modality, More AI: Exploring Design Opportunities of AI-Based Multi-modal Remote Monitoring Technologies for Early Detection of Mental Health Sequelae in Youth Concussion Patients

Authors: Bingsheng Yao, Menglin Zhao, Yuling Sun, Weidan Cao, Changchang Yin, Stephen Intille, Xuhai Xu, Ping Zhang, Jingzhen Yang, Dakuo Wang

Abstract: Anxiety, depression, and suicidality are common mental health sequelae following concussion in youth patients, often exacerbating concussion symptoms and prolonging recovery. Despite the critical need for early detection of these mental health symptoms, clinicians often face challenges in accurately collecting patients' mental health data and making clinical decisions in a timely manner. Today's remote patient monitoring (RPM) technologies offer opportunities to objectively monitor patients' activities, but they were not specifically designed for youth concussion patients; moreover, the large amount of data collected by RPM technologies may also impose significant workloads on clinicians to keep up with and use the data. To address these gaps, we employed a three-stage study consisting of a formative study, interface design, and design evaluation. We first conducted a formative study through semi-structured interviews with six highly professional concussion clinicians and identified clinicians' key challenges in remotely collecting patient information and accessing patient treatment compliance. Subsequently, we proposed preliminary clinician-facing interface designs with the integration of AI-based RPM technologies (AI-RPM), followed by design evaluation sessions with highly professional concussion clinicians. Clinicians underscored the value of integrating multi-modal AI-RPM technologies to support their decision-making while emphasizing the importance of customizable interfaces through collaborative design and multiple responsible design considerations.

Submitted 3 April, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

arXiv:2502.03017 [pdf, other]

Search for Double Beta Decay of $^{136}$Xe to the $0^+_1$ Excited State of $^{136}$Ba with PandaX-4T

Authors: PandaX Collaboration, Lingyin Luo, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingji Fang, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou, Yu Hou, et al. (76 additional authors not shown)

Abstract: We perform a search for double beta decay of $^{136}$Xe to the excited state, $0^+_1$, of $^{136}$Ba (2$νββ$-0$_1^+$), using the dual-phase xenon detector of PandaX-4T with the first 94.9-day commissioning data. The multi-site events are reconstructed up to the MeV energy scale, which helps to improve the background model significantly. The background contribution from the stainless steel platform outside the PandaX-4T cryostat is evaluated for the first time. No significant evidence for 2$νββ$-$0_1^+$ is observed, resulting in a lower limit on half-life of $7.5 \times 10^{22}$ yr at the 90% confidence level. This is the first experimental limit on such a rare decay in a natural xenon-based detector.

Submitted 7 March, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

arXiv:2502.01326 [pdf, other]

Flyby-induced displacement: analytic solution

Authors: P. -M. Zhang, Z. K. Silagadze, P. A. Horvathy

Abstract: We describe the scattering of particles by a sandwich gravitational wave generated during a flyby using an analytical approach. The derivative-of-the-Gaussian profile proposed by Gibbons and Hawking is approximated by the hyperbolic scarf potential, which allows for an exact analytic solution via the Nikiforov-Uvarov method. Our results confirm the prediction of Zel'dovich and Polnarev about certain "magical" amplitudes of the potential.

Submitted 13 May, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

Comments: Affiliation updated

arXiv:2501.18850 [pdf, other]

Equivariant Hypergraph Diffusion for Crystal Structure Prediction

Authors: Yang Liu, Chuan Zhou, Shuai Zhang, Peng Zhang, Xixun Lin, Shirui Pan

Abstract: Crystal Structure Prediction (CSP) remains a fundamental challenge with significant implications for the development of new materials and the advancement of various scientific disciplines. Recent developments have shown that generative models, particularly diffusion models, hold great promise for CSP. However, traditional graph-based representations, where atomic bonds are modeled as pairwise graph edges, fail to fully capture the intricate high-order interactions essential for accurately representing crystal structures. In this work, we propose a novel approach that utilizes hypergraphs to represent crystal structures, providing a more expressive abstraction for modeling multi-way atomic interactions. By adopting hypergraphs, we can effectively capture complex high-order relationships and symmetries, such as permutation and periodic translation invariance, which are crucial for characterizing crystal structures. In this work, we propose the Equivariant Hypergraph Diffusion Model (EH-Diff), a generative model designed to take advantage of the symmetry-preserving properties of hypergraphs. EH-Diff exploits these features to offer an efficient and accurate method for predicting crystal structures with a strong theoretical justification to preserve invariance properties. Empirically, we conduct extensive experiments on four benchmark datasets, and the results demonstrate that EH-Diff outperforms state-of-the-art CSP methods with only one sample.

Submitted 30 January, 2025; originally announced January 2025.

Comments: 14 pages, 4 figures

arXiv:2501.18801 [pdf, other]

Every Image Listens, Every Image Dances: Music-Driven Image Animation

Authors: Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak

Abstract: Image animation has become a promising area in multimodal research, with a focus on generating videos from reference images. While prior work has largely emphasized generic video generation guided by text, music-driven dance video generation remains underexplored. In this paper, we introduce MuseDance, an innovative end-to-end model that animates reference images using both music and text inputs. This dual input enables MuseDance to generate personalized videos that follow text descriptions and synchronize character movements with the music. Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels. To advance research in this field, we present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new baseline for the music-driven image animation task.

Submitted 30 January, 2025; originally announced January 2025.

arXiv:2501.16330 [pdf, other]

RelightVid: Temporal-Consistent Diffusion Model for Video Relighting

Authors: Ye Fang, Zeyi Sun, Shangzhan Zhang, Tong Wu, Yinghao Xu, Pan Zhang, Jiaqi Wang, Gordon Wetzstein, Dahua Lin

Abstract: Diffusion models have demonstrated remarkable success in image generation and editing, with recent advancements enabling albedo-preserving image relighting. However, applying these models to video relighting remains challenging due to the lack of paired video relighting datasets and the high demands for output fidelity and temporal consistency, further complicated by the inherent randomness of diffusion models. To address these challenges, we introduce RelightVid, a flexible framework for video relighting that can accept background video, text prompts, or environment maps as relighting conditions. Trained on in-the-wild videos with carefully designed illumination augmentations and rendered videos under extreme dynamic lighting, RelightVid achieves arbitrary video relighting with high temporal consistency without intrinsic decomposition while preserving the illumination priors of its image backbone.

Submitted 27 January, 2025; originally announced January 2025.

arXiv:2501.16103 [pdf, ps, other]

Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

Authors: Yinghan Li, Yifei Li, Jiejing Zhang, Bujiao Chen, Xiaotong Chen, Lian Duan, Yejun Jin, Zheng Li, Xuanyu Liu, Haoyu Wang, Wente Wang, Yajie Wang, Jiacheng Yang, Peiyang Zhang, Laiwen Zheng, Wenyuan Yu

Abstract: It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on GPUs. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized and efficient CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on NVIDIA H800 GPU and 95% on NVIDIA H20 GPU.

Submitted 27 January, 2025; originally announced January 2025.

Comments: 11 pages

ACM Class: D.1.3; I.2.6

arXiv:2501.15907 [pdf, other]

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

Abstract: Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Besides, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing diverse speaker timbre and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.

Submitted 27 January, 2025; originally announced January 2025.

Comments: Extended version of arXiv:2407.05361, submitted to TASLP, dataset is available at: https://huggingface.co/datasets/amphion/Emilia-Dataset

arXiv:2501.15898 [pdf, ps, other]

Homotopy categories and fibrant model structures

Authors: Xue-Song Lu, Pu Zhang

Abstract: The homotopy category of a model structure on a weakly idempotent complete additive category is proved to be equivalent to the additive quotient of the category of cofibrant-fibrant objects with respect to the subcategory of cofibrant-fibrant-trivial objects. A model structure on a pointed category is fibrant if every object is a fibrant object. Fibrant model structures are explicitly described by trivial cofibrations, and also by fibrations. Fibrantly weak factorization systems are introduced, fibrant model structures are constructed via fibrantly weak factorization systems, and a one-to-one correspondence between fibrantly weak factorization systems and fibrant model structures is given. Applications are given to rediscover the $ω$-model structures and the $\mathcal W$-model structures, and their relations with exact model structures are discussed.

Submitted 27 January, 2025; originally announced January 2025.

arXiv:2501.15875 [pdf, other]

LCTG Bench: LLM Controlled Text Generation Benchmark

Authors: Kentaro Kurihara, Masato Mita, Peinan Zhang, Shota Sasaki, Ryosuke Ishigami, Naoaki Okazaki

Abstract: The rise of large language models (LLMs) has led to more diverse and higher-quality machine-generated text. However, their high expressive power makes it difficult to control outputs based on specific business instructions. In response, benchmarks focusing on the controllability of LLMs have been developed, but several issues remain: (1) They primarily cover major languages like English and Chinese, neglecting low-resource languages like Japanese; (2) Current benchmarks employ task-specific evaluation metrics, lacking a unified framework for selecting models based on controllability across different use cases. To address these challenges, this research introduces LCTG Bench, the first Japanese benchmark for evaluating the controllability of LLMs. LCTG Bench provides a unified framework for assessing control performance, enabling users to select the most suitable model for their use cases based on controllability. By evaluating nine diverse Japanese-specific and multilingual LLMs like GPT-4, we highlight the current state and challenges of controllability in Japanese LLMs and reveal the significant gap between multilingual models and Japanese-specific models.

Submitted 27 January, 2025; originally announced January 2025.

Comments: 15 pages, 11 figures. Project page: this [URL](https://github.com/CyberAgentAILab/LCTG-Bench)

arXiv:2501.15447 [pdf, ps, other]

Observation of $h_{c}$ radiative decays to multiple light hadrons and the tensor state $f_2(1270)$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, et al. (666 additional authors not shown)

Abstract: Using $ψ(3686)\rightarrow π^{0} h_{c}$ decays from a data sample of $(27.12\pm0.14)\times10^{8}$ $ψ(3686)$ events collected by the BESIII detector at the BEPCII collider, $h_c$ radiative decays to $γπ^{+}π^{-},~γπ^{+}π^{-}η,~γ2(π^{+}π^{-})$, and $γp\bar{p}$ are observed for the first time, each with a significance greater than $5σ$. The corresponding branching fractions are measured. Furthermore, intermediate states below 2.8 GeV/$c^{2}$ are investigated, leading to the first observation of the decay process of $h_c\rightarrow γf_{2}(1270)\rightarrow γπ^{+}π^{-}$ with a significance of $5.5\,σ$. This observation represents the first instance of $h_c$ radiative decay to a tensor state.

Submitted 26 January, 2025; originally announced January 2025.

arXiv:2501.15368 [pdf, other]

Baichuan-Omni-1.5 Technical Report

Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, et al. (68 additional authors not shown)

Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.

Submitted 25 January, 2025; originally announced January 2025.

arXiv:2501.15069

Magnetic Field induced control and Multiple Magnomechanically Induced Transparency in Single Cavity

Authors: Ghaisud Din, Muqaddar Abbas, Yunlong Wang, Feiran Wang, Pei Zhang

Abstract: We investigate magnomechanically induced transparency (MMIT) in a microwave 3D copper cavity with two YIG spheres under varying interaction parameters. Numerical simulations show that the steady-state magnon number increases with stronger coupling between cavity photons and magnons, and is sensitive to both bias and drive magnetic fields. Pronounced peaks in the magnon population near resonant fields highlight the importance of the bias field in energy transfer. The transparency windows are tunable, with up to quadruple windows depending on the coupling and magnon-phonon interactions, as seen in the transmission spectrum. Dispersion analysis reveals normal and anomalous regions, enabling slow and fast light propagation modulated by coupling strength. Phase and group delay variations, influenced by the drive field, further validate the tunability of transparency windows. This study demonstrates the potential of MMIT for precise control without any additional non-linearity over light-matter interactions, with applications in quantum information processing and optical communications.

Submitted 6 May, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

Comments: Withdrawn due to errors in Section 2 and Appendix A. Section 2 omits a key coupling term in the Hamiltonian, affecting predictions. Appendix A contains a flaw in the linearization step used to derive the fluctuations. We are revising the analysis and will resubmit

arXiv:2501.14989 [pdf, ps, other]

Redefining Coherent Risk Measures: From Gauge Optimization to Regularization

Authors: Ningji Wei, Xian Yu, Peter Zhang

Abstract: It is well understood that each coherent risk measure can be represented as the expectation with respect to the worst-case reweighted density function, chosen from an abstract risk envelope. This paper introduces an equivalent but more explicit definition of the risk envelope that uses gauge sets (i.e., a type of convex set widely utilized in convex analysis and gauge optimization) to provide a generalized measure of distance between any reweighting function and the nominal one. Using the primal gauge set reweighting problem, we provide a unified framework for various existing methods in optimization under uncertainty, including risk-neutral/risk-averse stochastic programming, robust optimization, and distributionally robust optimization with moment-based and distance-based ambiguity sets. On the other hand, the associated dual problem offers an intuitive interpretation from the regularization perspective. This approach not only simplifies the derivation of classic results but also provides a versatile framework for robustness design via manipulations of the gauge sets (e.g., intersection, union, summation, convex combination, and function basis enforcement). To demonstrate this flexibility, we present approaches for customizing robustness to specific managerial needs, including methods for selecting flexible tail behaviors, addressing spatial distributional ambiguities, combining multiple robustness metrics, and achieving heterogeneous distributional robustness. We also discuss general reformulation techniques and computational approaches for this unified framework.

Submitted 18 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

MSC Class: 90C17; 90C15; 91G70

arXiv:2501.14384 [pdf]

Efficiently charting the space of mixed vacancy-ordered perovskites by machine-learning encoded atomic-site information

Authors: Fan Zhang, Li Fu, Weiwei Gao, Peihong Zhang, Jijun Zhao

Abstract: Vacancy-ordered double perovskites (VODPs) are promising alternatives to three-dimensional lead halide perovskites for optoelectronic and photovoltaic applications. Mixing these materials creates a vast compositional space, allowing for highly tunable electronic and optical properties. However, the extensive chemical landscape poses significant challenges in efficiently screening candidates with target properties. In this study, we illustrate the diversity of electronic and optical characteristics as well as the nonlinear mixing effects on electronic structures within mixed VODPs. For mixed systems with limited local environment options, the information regarding atomic-site occupation in principle determines both structural configurations and all essential properties. Building upon this concept, we have developed a model that integrates a data-augmentation scheme with a transformer-inspired graph neural network (GNN), which encodes atomic-site information from mixed systems. This approach enables us to accurately predict band gaps and formation energies for test samples, achieving Root Mean Square Errors (RMSE) of 21 meV and 3.9 meV/atom, respectively. Trained with datasets that include (up to) ternary mixed systems and supercells with fewer than 72 atoms, our model can be generalized to medium- and high-entropy mixed VODPs (with 4 to 6 principal mixing elements) and large supercells containing more than 200 atoms. Furthermore, our model successfully reproduces experimentally observed bandgap bowing in Sn-based mixed VODPs and reveals an unconventional mixing effect that can result in smaller band gaps compared to those found in pristine systems.

Submitted 24 January, 2025; originally announced January 2025.

Comments: 22 pages, 9 figures

arXiv:2501.14206 [pdf, ps, other]

Cross section measurement of $e^{+}e^{-} \to f_{1}(1285)π^{+}π^{-}$ at center-of-mass energies between $3.808$ and $4.951\rm GeV$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere , et al. (639 additional authors not shown)

Abstract: Using data samples collected by the \mbox{BESIII} detector located at the Beijing Electron Positron Collider, the cross sections of the process $e^+e^-\to f_{1}(1285)π^+π^-$ are measured at forty-five center-of-mass energies from $3.808$ to $4.951 {\rm GeV}$. An investigation of the cross section line shape is performed, and no significant structure is observed.

Submitted 23 January, 2025; originally announced January 2025.

arXiv:2501.14187 [pdf, ps, other]

Linear enhanced dissipation for the 2D Taylor-Couette flow in the exterior region: A supplementary example for Gearhart-Prüss type lemma

Authors: Te Li, Ping Zhang, Yibin Zhang

Abstract: From the perspective of asymptotic stability at high Reynolds numbers, Taylor-Couette flow, as a typical rotating shear flow, exhibits rich decay behaviors. Previously, for the extensively studied Couette flow or the Taylor-Couette flow in bounded annular domains, methods based on resolvent estimates could derive exponential decay asymptotics for the solutions of the linearized system. However, unlike the Couette flow or the Taylor-Couette flow in bounded annular domains, the Taylor-Couette flow in exterior regions exhibits degeneration of derivatives of any order at infinity. In this paper, we present in Theorem 1.1 that the linearized system of the 2D Taylor-Couette flow in the exterior region exhibits space-time coupled polynomial decay asymptotics. We also prove that the solution to this system, when it contains inhomogeneous terms, cannot be expected to exhibit space-time coupled exponential decay, as detailed in Theorem 1.2. The result of Theorem 1.2 indicates that, even if we can obtain sharp resolvent estimates in different weighted spaces, the Gearhart-Prüss type lemma no longer applies. This suggests that resolvent estimates may not be very effective for handling degenerate shear flows. Furthermore, Theorem 1.2 also implies that, for the transition threshold problem of the 2D Taylor-Couette flow in exterior regions, we can at most expect the solution to exhibit long-time behavior with space-time coupled polynomial decay. Finally, we present a generalization of Theorem 1.2, as detailed in Theorem 1.3.

Submitted 23 January, 2025; originally announced January 2025.

arXiv:2501.13898 [pdf, other]

PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection

Authors: Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li

Abstract: With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on the views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model's ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at https://github.com/ZpyWHU/PointOBB-v3.

Submitted 23 January, 2025; originally announced January 2025.

Comments: 16 pages, 5 figures, 10 tables

arXiv:2501.13339 [pdf, ps, other]

Joint Beamforming and Position Optimization for Fluid RIS-aided ISAC Systems

Authors: Junjie Ye, Peichang Zhang, Xiao-Peng Li, Lei Huang, Yuanwei Liu

Abstract: A fluid reconfigurable intelligent surface (fRIS)-aided integrated sensing and communications (ISAC) system is proposed to enhance multi-target sensing and multi-user communication. Unlike the conventional RIS, the fRIS incorporates movable elements whose positions can be flexibly adjusted to provide extra spatial degrees of freedom. In this system, a joint optimization problem is formulated to minimize sensing beampattern mismatch and communication symbol estimation error by optimizing the symbol estimator, transmit beamformer, fRIS phase shifts, and element positions. To solve this problem, an algorithm based on alternating minimization is devised, where subproblems are solved using the augmented Lagrangian method, quadratic programming, semidefinite relaxation, and majorization-minimization techniques. A key challenge is that the fRIS element positions affect both the incident and reflective channels, leading to high-order composite functions of the positions. As a remedy, it is proved that the high-order terms can be transformed to linear and linear-difference forms using the characteristics of fRIS and structural channels, which facilitates the position optimization. Numerical results validate the effectiveness of the proposed scheme as compared to the conventional RIS-aided ISAC systems and other benchmarks.

Submitted 24 January, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

Comments: 13 pages, 10 figures, submitted to an IEEE journal for possible publication

arXiv:2501.12948 [pdf, other]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu , et al. (175 additional authors not shown)

Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Submitted 22 January, 2025; originally announced January 2025.

arXiv:2501.12696 [pdf, other]

doi 10.1109/JSAC.2025.3531406

SoundSpring: Loss-Resilient Audio Transceiver with Dual-Functional Masked Language Modeling

Authors: Shengshi Yao, Jincheng Dai, Xiaoqi Qin, Sixian Wang, Siye Wang, Kai Niu, Ping Zhang

Abstract: In this paper, we propose "SoundSpring", a cutting-edge error-resilient audio transceiver that marries the robustness benefits of joint source-channel coding (JSCC) with compatibility with current digital communication systems. Unlike recent deep JSCC transceivers, which learn to directly map audio signals to analog channel-input symbols via neural networks, our SoundSpring adopts the layered architecture that delineates audio compression from digital coded transmission, but it sufficiently exploits the impressive in-context predictive capabilities of large language (foundation) models. Integrated with the causal-order mask learning strategy, our single model operates on the latent feature domain and serves dual functionalities: as an efficient audio compressor at the transmitter and as an effective mechanism for packet loss concealment at the receiver. By jointly optimizing towards both audio compression efficiency and transmission error resiliency, we show that mask-learned language models are indeed powerful contextual predictors, and our dual-functional compression and concealment framework offers fresh perspectives on the application of foundation language models in audio communication. Through extensive experimental evaluations, we establish that SoundSpring apparently outperforms contemporary audio transmission systems in terms of signal fidelity metrics and perceptual quality scores. These new findings not only advocate for the practical deployment of SoundSpring in learning-based audio communication systems but also inspire the development of future audio semantic transceivers.

Submitted 22 January, 2025; originally announced January 2025.

Comments: To appear in IEEE JSAC

arXiv:2501.12614 [pdf, other]

Electric field reconstruction with three polarizations for the radio detection of ultra-high energy particles

Authors: Kewen Zhang, Tim Huege, Ramesh Koirala, Pengxiong Ma, Matías Tueros, Xin Xu, Chao Zhang, Pengfei Zhang, Yi Zhang

Abstract: The amplitude, polarization, frequency spectrum and energy fluence carried by the electric field at a given measurement position are the key parameters for retrieving information from radio signals generated by extensive air showers. Accurate reconstruction of the electric field from the signals recorded by the antennas is therefore essential for the radio detection technique. Conventional reconstruction methods primarily focus on electric field reconstruction for antennas with two horizontal polarizations. In this paper, we introduce an analytical least-squares ($χ^2$) reconstruction method that operates with both two and three polarizations, providing the reconstructed electric field at each antenna. This solution has been verified for simple and realistic antenna responses, with a particular focus on inclined air showers. Our method achieves an accuracy better than 4\% in determining the Hilbert peak amplitude of the electric field and better than 6\% in estimating the energy fluence, with minimal bias. Additionally, the method was found to be reliable for almost any arrival direction, and the direction dependence has been investigated. This work also demonstrates that incorporating vertically polarized antennas enhances the precision of reconstruction, leading to a more accurate and reliable electric field estimation for inclined air showers. Consequently, the method enhances our ability to extract information about cosmic rays from the detected signals in current and future experiments.

Submitted 24 January, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

arXiv:2501.12368 [pdf, other]

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang

Abstract: Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer

Submitted 21 January, 2025; originally announced January 2025.

Comments: Tech Report

arXiv:2501.10705 [pdf, other]

Secure Communication in Dynamic RDARS-Driven Systems

Authors: Ziqian Pei, Jintao Wang, Pingping Zhang, Zheng Shi, Guanghua Yang, Shaodan Ma

Abstract: In this letter, we investigate a dynamic reconfigurable distributed antenna and reflection surface (RDARS)-driven secure communication system, where the working mode of the RDARS can be flexibly configured. We aim to maximize the secrecy rate by jointly designing the active beamforming vectors, reflection coefficients, and the channel-aware mode selection matrix. To address the non-convex binary and cardinality constraints introduced by dynamic mode selection, we propose an efficient alternating optimization (AO) framework that employs penalty-based fractional programming (FP) and successive convex approximation (SCA) transformations. Simulation results demonstrate the potential of RDARS in enhancing the secrecy rate and show its superiority compared to existing reflection surface-based schemes.

Submitted 18 January, 2025; originally announced January 2025.

Comments: 5 pages, 5 figures

arXiv:2501.10182 [pdf, other]

Secure Semantic Communication With Homomorphic Encryption

Authors: Rui Meng, Dayu Fan, Haixiao Gao, Yifan Yuan, Bizhu Wang, Xiaodong Xu, Mengying Sun, Chen Dong, Xiaofeng Tao, Ping Zhang, Dusit Niyato

Abstract: In recent years, Semantic Communication (SemCom), which aims to achieve efficient and reliable transmission of meaning between agents, has garnered significant attention from both academia and industry. To ensure the security of communication systems, encryption techniques are employed to safeguard confidentiality and integrity. However, traditional cryptography-based encryption algorithms encounter obstacles when applied to SemCom. Motivated by this, this paper explores the feasibility of applying homomorphic encryption to SemCom. Initially, we review the encryption algorithms utilized in mobile communication systems and analyze the challenges associated with their application to SemCom. Subsequently, we employ scale-invariant feature transform to demonstrate that semantic features can be preserved in homomorphic encrypted ciphertext. Based on this finding, we propose a task-oriented SemCom scheme secured through homomorphic encryption. We design a privacy-preserving deep joint source-channel coding (JSCC) encoder and decoder, and the frequency of key updates can be adjusted according to service requirements without compromising transmission performance. Simulation results validate that the proposed scheme achieves almost the same classification accuracy on homomorphic ciphertext images as on plaintext images. Furthermore, we provide potential future research directions for homomorphic encrypted SemCom.

Submitted 17 January, 2025; originally announced January 2025.

Comments: 8 pages, 3 figures

arXiv:2501.10130 [pdf, other]

Study of $η\rightarrowπ^+π^-l^+l^-$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere , et al. (637 additional authors not shown)

Abstract: Using a sample of $(10087\pm44)\times10^{6}$ $J/ψ$ events accumulated with the BESIII detector, we analyze the decays $η\rightarrowπ^+π^-l^+l^-$ ($l=e$ or $μ$) via the process $J/ψ\rightarrowγη$. The branching fraction of $η\rightarrowπ^+π^-e^+e^-$ is measured to be $\mathcal{B}(η\rightarrowπ^+π^-e^+e^-)=(3.07\pm0.12_{\rm{stat.}}\pm0.19_{\rm{syst.}}) \times10^{-4}$. No signal events are observed for the $η\rightarrowπ^{+}π^{-}μ^{+}μ^{-}$ decay, leading to an upper limit on the branching fraction of $\mathcal{B}(η\rightarrowπ^{+}π^{-}μ^{+}μ^{-})<4.0\times10^{-7}$ at the 90\% confidence level. Furthermore, the $CP$-violation asymmetry parameter is found to be $\mathcal{A}_{CP}(η\rightarrowπ^{+}π^{-}e^{+}e^{-})=(-4.04\pm4.69_{\rm{stat.}}\pm0.14_{\rm{syst.}})\%$, showing no evidence of $CP$-violation with current statistics. Additionally, we extract the transition form factor from the decay amplitude of $η\rightarrowπ^+π^-e^+e^-$. Finally, axion-like particles are searched for via the decay $η\rightarrowπ^+π^-a, a\rightarrow e^+e^-$, and upper limits on this branching fraction relative to that of $η\rightarrowπ^+π^-e^+e^-$ are presented as a function of the axion-like particle mass in the range $5-200\ \mathrm{MeV}/c^{2}$.

Submitted 17 January, 2025; originally announced January 2025.

arXiv:2501.10119 [pdf, other]

Gluon skewed generalized parton distributions of proton from a light-front Hamiltonian approach

Authors: Pengxiang Zhang, Yiping Liu, Siqi Xu, Chandan Mondal, Xingbo Zhao, James P. Vary

Abstract: We calculate all leading-twist gluon generalized parton distributions (GPDs) inside the proton at nonzero skewness using the basis light-front quantization framework. The proton's light-front wave functions are derived from a light-front quantized Hamiltonian incorporating Quantum Chromodynamics inputs. Our results show that the qualitative behaviors of the GPDs are consistent with those from other theoretical calculations. Additionally, we analyze the GPDs in the boost-invariant longitudinal coordinate, $σ=\frac{1}{2} b^- P^+$, which serves as the Fourier conjugate of the skewness. The GPDs in $σ$-space exhibit diffraction patterns, reminiscent of optical wave diffraction.

Submitted 17 January, 2025; originally announced January 2025.

Comments: 11 pages, 3 figures, and 1 table

arXiv:2501.09400 [pdf, ps, other]

Joint Antenna Selection and Beamforming Design for Active RIS-aided ISAC Systems

Authors: Wei Ma, Peichang Zhang, Junjie Ye, Rouyang Guan, Xiao-Peng Li, Lei Huang

Abstract: The active reconfigurable intelligent surface (A-RIS)-aided integrated sensing and communications (ISAC) system has been considered a promising paradigm to improve spectrum efficiency. However, massive energy-hungry radio frequency (RF) chains hinder its large-scale deployment. To address this issue, an A-RIS-aided ISAC system with antenna selection (AS) is proposed in this work, where a target is sensed while multiple communication users are served with specifically selected antennas. Specifically, a cuckoo search-based scheme is first utilized to select the antennas associated with high-gain channels. Subsequently, with the properly selected antennas, the weighted sum-rate (WSR) of the system is optimized under constraints on the radar probing power level and the power budgets of the A-RIS and transmitter. To solve the highly non-convex optimization problem, we develop an efficient algorithm based on weighted minimum mean square error (WMMSE) and fractional programming (FP). Simulation results show that the proposed AS scheme and the algorithm are effective, reducing the number of RF chains without significant performance degradation.

Submitted 16 January, 2025; originally announced January 2025.

arXiv:2501.09079 [pdf, other]

Demonstrating quantum error mitigation on logical qubits

Authors: Aosai Zhang, Haipeng Xie, Yu Gao, Jia-Nan Yang, Zehang Bao, Zitian Zhu, Jiachen Chen, Ning Wang, Chuanyu Zhang, Jiarun Zhong, Shibo Xu, Ke Wang, Yaozu Wu, Feitong Jin, Xuhao Zhu, Yiren Zou, Ziqi Tan, Zhengyi Cui, Fanhao Shen, Tingting Li, Yihang Han, Yiyang He, Gongyu Liu, Jiayuan Shen, Han Wang , et al. (10 additional authors not shown)

Abstract: A long-standing challenge in quantum computing is developing technologies to overcome the inevitable noise in qubits. To enable meaningful applications in the early stages of fault-tolerant quantum computing, devising methods to suppress post-correction logical failures is becoming increasingly crucial. In this work, we propose and experimentally demonstrate the application of zero-noise extrapolation, a practical quantum error mitigation technique, to error correction circuits on state-of-the-art superconducting processors. By amplifying the noise on physical qubits, the circuits yield outcomes that exhibit a predictable dependence on noise strength, following a polynomial function determined by the code distance. This property enables the effective application of polynomial extrapolation to mitigate logical errors. Our experiments demonstrate a universal reduction in logical errors across various quantum circuits, including fault-tolerant circuits of repetition and surface codes. We observe a favorable performance in multi-round error correction circuits, indicating that this method remains effective when the circuit depth increases. These results advance the frontier of quantum error suppression technologies, opening a practical way to achieve reliable quantum computing in the early fault-tolerant era.

Submitted 15 January, 2025; originally announced January 2025.

Showing 201–250 of 4,528 results for author: Zhang, P