-
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Authors:
Longtao Zheng,
Yifan Zhang,
Hanzhong Guo,
Jiachun Pan,
Zhenxiong Tan,
Jiahao Lu,
Chuanxin Tang,
Bo An,
Shuicheng Yan
Abstract:
Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
Submitted 5 December, 2024;
originally announced December 2024.
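As a rough illustration of the memory idea in the MEMO abstract above, the following minimal sketch keeps a fixed-size running state that summarizes all past key/value pairs and reads from it with linear attention. It is a generic linear-attention accumulator, not MEMO's actual module: the class name `LinearAttentionMemory`, the `elu(x) + 1` feature map, and the toy dimensions are assumptions made for illustration.

```python
import math


def feature_map(x):
    """elu(x) + 1: a positive feature map (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionMemory:
    """Hypothetical fixed-size memory: its size does not grow with context length."""

    def __init__(self, key_dim, value_dim):
        # state[i][j] accumulates phi(k)_i * v_j over all past frames.
        self.state = [[0.0] * value_dim for _ in range(key_dim)]
        # norm[i] accumulates phi(k)_i, used to normalize the read-out.
        self.norm = [0.0] * key_dim

    def update(self, key, value):
        """Fold one past (key, value) pair into the memory state."""
        phi_k = feature_map(key)
        for i, pk in enumerate(phi_k):
            self.norm[i] += pk
            for j, v in enumerate(value):
                self.state[i][j] += pk * v

    def read(self, query):
        """Linear-attention read-out; assumes update() was called at least once."""
        phi_q = feature_map(query)
        denom = sum(pq * n for pq, n in zip(phi_q, self.norm))
        value_dim = len(self.state[0])
        return [
            sum(phi_q[i] * self.state[i][j] for i in range(len(phi_q))) / denom
            for j in range(value_dim)
        ]


memory = LinearAttentionMemory(key_dim=2, value_dim=2)
memory.update([0.3, -0.2], [1.0, 0.0])   # toy "past frame" 1
memory.update([-1.0, 0.4], [0.0, 1.0])   # toy "past frame" 2
print(memory.read([0.5, 0.1]))           # a convex mix of the two stored values
```

Because the feature map is positive, each read-out is a weighted average of the stored values, and the cost of a read depends only on the state size, not on how many past frames were folded in.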
-
Pre-train, Align, and Disentangle: Empowering Sequential Recommendation with Large Language Models
Authors:
Yuhao Wang,
Junwei Pan,
Pengyue Jia,
Wanyu Wang,
Maolin Wang,
Zhixiang Feng,
Xiaotian Li,
Jie Jiang,
Xiangyu Zhao
Abstract:
Sequential Recommendation (SR) aims to leverage the sequential patterns in users' historical interactions to accurately track their preferences. However, the primary reliance of existing SR methods on collaborative data results in challenges such as the cold-start problem and sub-optimal performance. Concurrently, despite the proven effectiveness of large language models (LLMs), their integration into commercial recommender systems is impeded by issues such as high inference latency, incomplete capture of all distribution statistics, and catastrophic forgetting. To address these issues, we introduce a novel Pre-train, Align, and Disentangle (PAD) framework to enhance SR models with LLMs. In particular, we initially pre-train both the SR and LLM models to obtain collaborative and textual embeddings. Subsequently, we propose a characteristic recommendation-anchored alignment loss using multi-kernel maximum mean discrepancy with Gaussian kernels. Lastly, a triple-experts architecture, comprising aligned and modality-specific experts with disentangled embeddings, is fine-tuned in a frequency-aware manner. Experimental results on three public datasets validate the efficacy of PAD, indicating substantial enhancements and compatibility with various SR backbone models, particularly for cold items. The code and datasets are accessible for reproduction at https://github.com/Applied-Machine-Learning-Lab/PAD.
Submitted 25 April, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
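The PAD abstract above mentions an alignment loss built on multi-kernel maximum mean discrepancy (MMD) with Gaussian kernels. The sketch below shows only the generic building block, a biased estimate of squared MK-MMD between two sets of embeddings. It is not PAD's recommendation-anchored loss; the function names, the bandwidth values, and the toy embeddings are assumptions.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Uniform average of Gaussian kernels over several bandwidths."""
    return sum(gaussian_kernel(x, y, s) for s in bandwidths) / len(bandwidths)


def mk_mmd_squared(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased (V-statistic) estimate of squared MMD between sample sets xs and ys."""

    def mean_kernel(a_set, b_set):
        total = sum(multi_kernel(a, b, bandwidths) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy stand-ins for collaborative and textual embeddings (illustrative values only).
collab = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]
text_near = [[0.1, 0.1], [0.0, -0.1], [0.2, 0.0]]
text_far = [[3.0, 3.1], [3.2, 2.9], [3.1, 3.0]]

print(mk_mmd_squared(collab, text_near))  # small: the two sets overlap
print(mk_mmd_squared(collab, text_far))   # larger: the two sets are far apart
```

Minimizing a quantity like this pulls the two embedding distributions together; mixing several bandwidths avoids committing to a single length scale.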
-
Understanding Particles From Video: Property Estimation of Granular Materials via Visuo-Haptic Learning
Authors:
Zeqing Zhang,
Guangze Zheng,
Xuebo Ji,
Guanqi Chen,
Ruixing Jia,
Wentao Chen,
Guanhua Chen,
Liangjun Zhang,
Jia Pan
Abstract:
Granular materials (GMs) are ubiquitous in daily life. Understanding their properties is also important, especially in agriculture and industry. However, existing works require dedicated measurement equipment and also need large human efforts to handle a large number of particles. In this paper, we introduce a method for estimating the relative values of particle size and density from the video of the interaction with GMs. It is trained on a visuo-haptic learning framework inspired by a contact model, which reveals the strong correlation between GM properties and the visual-haptic data during the probe-dragging in the GMs. After training, the network can map the visual modality well to the haptic signal and implicitly characterize the relative distribution of particle properties in its latent embeddings, as interpreted in that contact model. Therefore, we can analyze GM properties using the trained encoder, and only visual information is needed without extra sensory modalities and human efforts for labeling. The presented GM property estimator has been extensively validated via comparison and ablation experiments. The generalization capability has also been evaluated and a real-world application on the beach is also demonstrated. Experiment videos are available at https://sites.google.com/view/gmwork/vhlearning.
Submitted 2 December, 2024;
originally announced December 2024.
-
FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
Authors:
Hao Li,
Xiang Chen,
Jiangxin Dong,
Jinhui Tang,
Jinshan Pan
Abstract:
Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: real-world samples with larger-scale, and degradation types with higher diversity. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.
Submitted 2 December, 2024;
originally announced December 2024.
-
Transversal Logical Clifford gates on rotated surface codes with reconfigurable neutral atom arrays
Authors:
Zi-Han Chen,
Ming-Cheng Chen,
Chao-Yang Lu,
Jian-Wei Pan
Abstract:
We propose hardware-efficient schemes for implementing logical H and S gates transversally on rotated surface codes with reconfigurable neutral atom arrays. For logical H gates, we develop a simple strategy to rotate code patches efficiently with two sets of 2D-acousto-optic deflectors (2D-AODs). Our protocol for logical S gates utilizes the time-dynamics of the data and ancilla qubits during syndrome extraction (SE). In particular, we break away from traditional schemes where transversal logical gates take place between two SE rounds and instead embed our fold-transversal logical operation inside a single SE round, leveraging the fact that data and ancilla qubits can be morphed to an unrotated surface code state at half-cycle. Under circuit noise, we observe the performance of our S gate protocol is on par with the quantum memory. Together with transversal logical CNOT gates, our protocols complete a transversal logical Clifford gate set on rotated surface codes and admit efficient implementation on neutral atom array platforms.
Submitted 2 December, 2024;
originally announced December 2024.
-
BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning
Authors:
Jianming Pan,
Zeqi Ye,
Xiao Yang,
Xu Yang,
Weiqing Liu,
Lewen Wang,
Jiang Bian
Abstract:
Data-driven decision-making processes increasingly utilize end-to-end learnable deep neural networks to render final decisions. Sometimes, the output of the forward functions in certain layers is determined by the solutions to mathematical optimization problems, leading to the emergence of differentiable optimization layers that permit gradient back-propagation. However, real-world scenarios often involve large-scale datasets and numerous constraints, presenting significant challenges. Current methods for differentiating optimization problems typically rely on implicit differentiation, which necessitates costly computations on the Jacobian matrices, resulting in low efficiency. In this paper, we introduce BPQP, a differentiable convex optimization framework designed for efficient end-to-end learning. To enhance efficiency, we reformulate the backward pass as a simplified and decoupled quadratic programming problem by leveraging the structural properties of the KKT matrix. This reformulation enables the use of first-order optimization algorithms in calculating the backward pass gradients, allowing our framework to potentially utilize any state-of-the-art solver. As solver technologies evolve, BPQP can continuously adapt and improve its efficiency. Extensive experiments on both simulated and real-world datasets demonstrate that BPQP achieves a significant improvement in efficiency--typically an order of magnitude faster in overall execution time compared to other differentiable optimization layers. Our results not only highlight the efficiency gains of BPQP but also underscore its superiority over differentiable optimization layer baselines.
Submitted 29 December, 2024; v1 submitted 28 November, 2024;
originally announced November 2024.
-
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
Authors:
Junyang Chen,
Jinshan Pan,
Jiangxin Dong
Abstract:
Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs to align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, facilitating the encoder to extract useful features that coincide with diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.
Submitted 27 November, 2024;
originally announced November 2024.
-
A tale of Bethe logarithms: leptonic widths of $χ_{cJ}$ and Lamb shift
Authors:
Yu Jia,
Jichen Pan
Abstract:
The rare annihilation decays of $P$-wave spin-triplet quarkonia into a lepton pair have to proceed via a two-photon intermediate state, which is plagued with an infrared (IR) divergence. We recognize that the physical root of the IR divergence and its remedy are the same as for the Lamb shift in QED. In this work we provide a complete solution to this IR problem by including the effect of the higher Fock component of the $χ_{cJ}$ state, viz., the $c\bar{c}(^3S_1^{(1)})$ pair accompanied with a very long wavelength photon. Adding the contributions from the leading and next-to-leading order Fock components together, we arrive at the IR finite and factorization scale independent predictions for leptonic widths of $χ_{cJ}(nP)$. The Bethe logarithms associated with these exclusive reactions are found to have rather different traits from those associated with the Lamb shift. We present numerical predictions by employing several influential quark potential models. The predicted leptonic widths of $χ_{cJ}(nP)$ are sizable and the observation prospect for $e^+e^-\to χ_{cJ}(1P,2P)$ at BESIII looks bright. The future observation of $e^+e^-\to X(3872)$ will shed important light on the $c\bar{c}$ content of the $X(3872)$ meson.
Submitted 27 November, 2024;
originally announced November 2024.
-
A Haptic-Based Proximity Sensing System for Buried Object in Granular Material
Authors:
Zeqing Zhang,
Ruixing Jia,
Youcan Yan,
Ruihua Han,
Shijie Lin,
Qian Jiang,
Liangjun Zhang,
Jia Pan
Abstract:
The proximity perception of objects in granular materials is significant, especially for applications like minesweeping. However, due to particles' opacity and complex properties, existing proximity sensors suffer from high costs from sophisticated hardware and high user-cost from unintuitive results. In this paper, we propose a simple yet effective proximity sensing system for buried objects based on the haptic feedback of the sensor-granules interaction. We study and exploit a characteristic of granular particles, the failure wedge zone, and combine it with a machine learning method, Gaussian process regression, to identify the force signal changes induced by the proximity of objects, so as to achieve near-field perception. Furthermore, we design a novel trajectory to control the probe searching in granules for a wide range of perception. Also, our proximity sensing system can adaptively determine optimal parameters for robust operation in different particles. Experiments demonstrate our system can perceive underground objects over 0.5 to 7 cm in advance among various materials.
Submitted 25 November, 2024;
originally announced November 2024.
-
String breaking mechanism in a lattice Schwinger model simulator
Authors:
Ying Liu,
Wei-Yong Zhang,
Zi-Hang Zhu,
Ming-Gen He,
Zhen-Sheng Yuan,
Jian-Wei Pan
Abstract:
String breaking is a fundamental concept in gauge theories, describing the decay of a flux string connecting two charges through the production of particle-antiparticle pairs. This phenomenon is particularly important in particle physics, notably in Quantum Chromodynamics, and plays a crucial role in condensed matter physics. However, achieving a theoretical understanding of this non-perturbative effect is challenging, as conventional numerical approaches often fall short and require substantial computational resources. On the experimental side, studying these effects necessitates advanced setups, such as high-energy colliders, which makes direct observation difficult. Here, we report an experimental investigation of the string breaking mechanism in a one-dimensional U(1) lattice gauge theory using an optical lattice quantum simulator. By deterministically preparing initial states of varying lengths with fixed charges at each end, and adiabatically tuning the mass and string tension, we observed in situ microscopic confined phases that exhibit either string or broken-string states. Further analysis reveals that string breaking occurs under a resonance condition, leading to the creation of new particle-antiparticle pairs. These findings offer compelling evidence of string breaking and provide valuable insights into the intricate dynamics of lattice gauge theories. Our work underscores the potential of optical lattices as controllable quantum simulators, enabling the exploration of complex gauge theories and their associated phenomena.
Submitted 22 November, 2024;
originally announced November 2024.
-
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Authors:
Zehua Pei,
Hui-Ling Zhen,
Xianzhi Yu,
Sinno Jialin Pan,
Mingxuan Yuan,
Bei Yu
Abstract:
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains through the extensive scaling of model parameters. Recent works observe the redundancy across the transformer blocks and develop compression methods by structured pruning of the unimportant blocks. However, such straightforward elimination will always provide irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology to recycle the pruned transformer blocks to further recover the model performance. Firstly we introduce a new importance detection metric, Macro Influence (MI), to detect the long-term influence of each transformer block by calculating their loss of information after removal. Then we propose group-level layers fusion, which adopts the parameters in layers of the unimportant blocks and injects them into the corresponding layers inside the neighboring blocks. The fusion is not one-off but through iterative parameter updates by lightweight group-level fine-tuning. Specifically, these injected parameters are frozen but weighted with learnable rank decomposition matrices to reduce the overhead during fine-tuning. Our approach not only works well on large language models but also on large multimodal models. The experiments have shown that, by using modest amounts of data, FuseGPT can outperform previous works in both perplexity and zero-shot task performance.
Submitted 21 November, 2024;
originally announced November 2024.
-
Independent Optical Frequency Combs Powered 546 km Field Test of Twin-Field Quantum Key Distribution
Authors:
Lai Zhou,
Jinping Lin,
Chengfang Ge,
Yuanbin Fan,
Zhiliang Yuan,
Hao Dong,
Yang Liu,
Di Ma,
Jiu-Peng Chen,
Cong Jiang,
Xiang-Bin Wang,
Li-Xing You,
Qiang Zhang,
Jian-Wei Pan
Abstract:
Owing to its repeater-like rate-loss scaling, twin-field quantum key distribution (TF-QKD) has repeatedly exhibited in the laboratory its superiority for secure communication over record fiber lengths. Field trials, however, pose a new set of challenges, which must be addressed before the technology's roll-out into the real world. Here, we verify in the field the viability of using independent optical frequency combs -- installed at sites separated by a straight-line distance of 300 km -- to achieve a versatile TF-QKD setup that has no need for optical frequency dissemination and thus enables an open and network-friendly fiber configuration. Over 546 and 603 km symmetric links, we record a finite-size secure key rate (SKR) of 0.53 bit/s and an asymptotic SKR of 0.12 bit/s, respectively. Of practical importance, the setup is demonstrated to support 44 km fiber asymmetry in the 452 km link. Our work marks an important step towards incorporation of long-haul fiber links into large quantum networks.
Submitted 21 November, 2024;
originally announced November 2024.
-
Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study
Authors:
Xibo Sun,
Jiarui Fang,
Aoyu Li,
Jinzhe Pan
Abstract:
The increased model capacity of Diffusion Transformers (DiTs) and the demand for generating higher resolutions of images and videos have led to a significant rise in inference latency, impacting real-time performance adversely. While prior research has highlighted the presence of high similarity in activation values between adjacent diffusion steps (referred to as redundancy) and proposed various caching mechanisms to mitigate computational overhead, the exploration of redundancy in existing literature remains limited, with findings often not generalizable across different DiT models. This study aims to address this gap by conducting a comprehensive investigation into redundancy across a broad spectrum of mainstream DiT models. Our experimental analysis reveals substantial variations in the distribution of redundancy across diffusion steps among different DiT models. Interestingly, within a single model, the redundancy distribution remains stable regardless of variations in input prompts, step counts, or scheduling strategies. Given the lack of a consistent pattern across diverse models, caching strategies designed for a specific group of models may not easily transfer to others. To overcome this challenge, we introduce a tool for analyzing the redundancy of individual models, enabling subsequent research to develop tailored caching strategies for specific model architectures. The project is publicly available at https://github.com/xdit-project/DiTCacheAnalysis.
Submitted 17 November, 2024;
originally announced November 2024.
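The DiT abstract above defines redundancy as high similarity in activation values between adjacent diffusion steps. A minimal sketch of how such a per-step profile could be measured follows. The abstract does not name a similarity metric, so cosine similarity, the `0.99` threshold, the function names, and the toy activations are all assumptions; this is not the DiTCacheAnalysis tool's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adjacent_step_similarity(activations):
    """Similarity between each pair of consecutive diffusion steps."""
    return [
        cosine_similarity(activations[t], activations[t + 1])
        for t in range(len(activations) - 1)
    ]


def cacheable_steps(activations, threshold=0.99):
    """Steps whose activation is nearly identical to the previous step's."""
    sims = adjacent_step_similarity(activations)
    return [t + 1 for t, s in enumerate(sims) if s >= threshold]


# Toy flattened activations of one layer at four diffusion steps (illustrative only).
toy_activations = [
    [1.0, 0.0, 0.0],
    [1.0, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.02],
]
print(adjacent_step_similarity(toy_activations))
print(cacheable_steps(toy_activations))
```

Since the abstract reports that the redundancy distribution differs across models but is stable within one model, a profile like this would be computed once per model and then used to choose which steps to cache.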
-
Probing false vacuum decay on a cold-atom gauge-theory quantum simulator
Authors:
Zi-Hang Zhu,
Ying Liu,
Gianluca Lagnese,
Federica Maria Surace,
Wei-Yong Zhang,
Ming-Gen He,
Jad C. Halimeh,
Marcello Dalmonte,
Siddhardh C. Morampudi,
Frank Wilczek,
Zhen-Sheng Yuan,
Jian-Wei Pan
Abstract:
In the context of quantum electrodynamics, the decay of false vacuum leads to the production of electron-positron pair, a phenomenon known as the Schwinger effect. In practical experimental scenarios, producing a pair requires an extremely strong electric field, thus suppressing the production rate and making this process very challenging to observe. Here we report an experimental investigation, in a cold-atom quantum simulator, of the effect of the background field on pair production from the infinite-mass vacuum in a $1+1$D $\mathrm{U}(1)$ lattice gauge theory. The ability to tune the background field allows us to study pair production in a large production rate regime. Furthermore, we find that the energy spectrum of the time-evolved observables in the zero mass limit displays excitation peaks analogous to bosonic modes in the Schwinger model. Our work opens the door to quantum-simulation experiments that can controllably tune the production of pairs and manipulate their far-from-equilibrium dynamics.
Submitted 19 November, 2024;
originally announced November 2024.
-
Towards Unifying Feature Interaction Models for Click-Through Rate Prediction
Authors:
Yu Kang,
Junwei Pan,
Jipeng Jin,
Shudong Huang,
Xiaofeng Gao,
Lei Xiao
Abstract:
Modeling feature interactions plays a crucial role in accurately predicting click-through rates (CTR) in advertising systems. To capture the intricate patterns of interaction, many existing models employ matrix-factorization techniques to represent features as lower-dimensional embedding vectors, enabling the modeling of interactions as products between these embeddings. In this paper, we propose a general framework called IPA to systematically unify these models. Our framework comprises three key components: the Interaction Function, which facilitates feature interaction; the Layer Pooling, which constructs higher-level interaction layers; and the Layer Aggregator, which combines the outputs of all layers to serve as input for the subsequent classifier. We demonstrate that most existing models can be categorized within our framework by making specific choices for these three components. Through extensive experiments and a dimensional collapse analysis, we evaluate the performance of these choices. Furthermore, by leveraging the most powerful components within our framework, we introduce a novel model that achieves competitive results compared to state-of-the-art CTR models. PFL gets significant GMV lift during online A/B test in Tencent's advertising platform and has been deployed as the production model in several primary scenarios.
Submitted 19 November, 2024;
originally announced November 2024.
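The CTR abstract above describes modeling feature interactions as products between embedding vectors. One familiar instance is the factorization-machine-style second-order term, sketched below together with the standard sum-of-squares identity that computes it in linear time. This illustrates one possible choice of interaction function, not the IPA framework or its proposed model; the function names and the toy embeddings are assumptions.

```python
def pairwise_inner_products(embeddings):
    """Sum of inner products over all distinct feature pairs (quadratic time)."""
    total = 0.0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            total += sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
    return total


def pairwise_via_identity(embeddings):
    """Same quantity via 0.5 * ((sum e)^2 - sum e^2) per dimension (linear time)."""
    dim = len(embeddings[0])
    total = 0.0
    for d in range(dim):
        s = sum(e[d] for e in embeddings)
        sq = sum(e[d] ** 2 for e in embeddings)
        total += 0.5 * (s * s - sq)
    return total


# Toy embeddings for three active features of one ad impression (illustrative only).
feature_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pairwise_inner_products(feature_embeddings))  # 67.0
print(pairwise_via_identity(feature_embeddings))    # 67.0
```

Both functions return the same score; the second avoids enumerating pairs, which matters when an impression has many active features.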
-
Modified Gravity Constraints from the Full Shape Modeling of Clustering Measurements from DESI 2024
Authors:
M. Ishak,
J. Pan,
R. Calderon,
K. Lodha,
G. Valogiannis,
A. Aviles,
G. Niz,
L. Yi,
C. Zheng,
C. Garcia-Quintero,
A. de Mattia,
L. Medina-Varela,
J. L. Cervantes-Cota,
U. Andrade,
D. Huterer,
H. E. Noriega,
G. Zhao,
A. Shafieloo,
W. Fang,
S. Ahlen,
D. Bianchi,
D. Brooks,
E. Burtin,
E. Chaussidon,
T. Claybaugh
, et al. (45 additional authors not shown)
Abstract:
We present cosmological constraints on deviations from general relativity (GR) from the first-year of clustering observations from the Dark Energy Spectroscopic Instrument (DESI) in combination with other datasets. We first consider the $μ(a,k)$-$Σ(a,k)$ modified gravity (MG) parametrization (as well as $η(a,k)$) in flat $Λ$CDM and $w_0 w_a$CDM backgrounds. Using a functional form for time-only evolution gives $μ_0= 0.11^{+0.44}_{-0.54}$ from DESI(FS+BAO)+BBN and a wide prior on $n_{s}$. Using DESI(FS+BAO)+CMB+DESY3+DESY5-SN, we obtain $μ_0 = 0.05\pm 0.22$ and $Σ_0 = 0.008\pm 0.045$ in the $Λ$CDM background. In $w_0 w_a$CDM, we obtain $μ_0 =-0.24^{+0.32}_{-0.28}$ and $Σ_0 = 0.006\pm 0.043$, consistent with GR, and we still find a preference of the data for dynamical dark energy with $w_0>-1$ and $w_a<0$. We then use binned forms in the two backgrounds starting with two bins in redshift and then combining them with two bins in scale for a total of 4 and 8 MG parameters, respectively. All MG parameters are found consistent with GR. We also find that the tension reported for $Σ_0$ with GR when using Planck PR3 goes away when we use the recent LoLLiPoP+HiLLiPoP likelihoods. As noted previously, this seems to indicate that the tension is related to the CMB lensing anomaly in PR3 which is also resolved when using these likelihoods. We then constrain the class of Horndeski theory in the effective field theory of dark energy. We consider both EFT-basis and $α$-basis. Assuming a power law parametrization for the function $Ω$, which controls non-minimal coupling, we obtain $Ω_0 = 0.012^{+0.001}_{-0.012}$ and $s_0 = 0.996^{+0.54}_{-0.20}$ from DESI(FS+BAO)+DESY5SN+CMB in a $Λ$CDM background. Similar results are obtained when using the $α$-basis, where we constrain $c_M<1.14$, and are all consistent with GR. [Abridged.]
Submitted 20 December, 2024; v1 submitted 18 November, 2024;
originally announced November 2024.
-
DESI 2024 VII: Cosmological Constraints from the Full-Shape Modeling of Clustering Measurements
Authors:
DESI Collaboration,
A. G. Adame,
J. Aguilar,
S. Ahlen,
S. Alam,
D. M. Alexander,
C. Allende Prieto,
M. Alvarez,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
S. Avila,
A. Aviles,
H. Awan,
B. Bahr-Kalus,
S. Bailey,
C. Baltay,
A. Bault,
J. Behera,
S. BenZvi,
F. Beutler,
D. Bianchi,
C. Blake,
R. Blum
, et al. (188 additional authors not shown)
Abstract:
We present cosmological results from the measurement of clustering of galaxy, quasar and Lyman-$α$ forest tracers from the first year of observations with the Dark Energy Spectroscopic Instrument (DESI Data Release 1). We adopt the full-shape (FS) modeling of the power spectrum, including the effects of redshift-space distortions, in an analysis which has been validated in a series of supporting papers. In the flat $Λ$CDM cosmological model, DESI (FS+BAO), combined with a baryon density prior from Big Bang Nucleosynthesis and a weak prior on the scalar spectral index, determines matter density to $Ω_\mathrm{m}=0.2962\pm 0.0095$, and the amplitude of mass fluctuations to $σ_8=0.842\pm 0.034$. The addition of the cosmic microwave background (CMB) data tightens these constraints to $Ω_\mathrm{m}=0.3056\pm 0.0049$ and $σ_8=0.8121\pm 0.0053$, while further addition of the joint clustering and lensing analysis from the Dark Energy Survey Year-3 (DESY3) data leads to a 0.4% determination of the Hubble constant, $H_0 = (68.40\pm 0.27)\,{\rm km\,s^{-1}\,Mpc^{-1}}$. In models with a time-varying dark energy equation of state, combinations of DESI (FS+BAO) with CMB and type Ia supernovae continue to show the preference, previously found in the DESI DR1 BAO analysis, for $w_0>-1$ and $w_a<0$ with similar levels of significance. DESI data, in combination with the CMB, impose an upper limit on the sum of the neutrino masses of $\sum m_ν< 0.071\,{\rm eV}$ at 95% confidence. DESI data alone measure the modified-gravity parameter that controls the clustering of massive particles, $μ_0=0.11^{+0.45}_{-0.54}$, while the combination of DESI with the CMB and the clustering and lensing analysis from DESY3 constrains both modified-gravity parameters, giving $μ_0 = 0.04\pm 0.22$ and $Σ_0 = 0.044\pm 0.047$, in agreement with general relativity. [Abridged.]
Submitted 21 November, 2024; v1 submitted 18 November, 2024;
originally announced November 2024.
-
DESI 2024 V: Full-Shape Galaxy Clustering from Galaxies and Quasars
Authors:
DESI Collaboration,
A. G. Adame,
J. Aguilar,
S. Ahlen,
S. Alam,
D. M. Alexander,
M. Alvarez,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
S. Avila,
A. Aviles,
H. Awan,
S. Bailey,
C. Baltay,
A. Bault,
J. Behera,
S. BenZvi,
F. Beutler,
D. Bianchi,
C. Blake,
R. Blum,
S. Brieden,
A. Brodzeller
, et al. (174 additional authors not shown)
Abstract:
We present the measurements and cosmological implications of the galaxy two-point clustering using over 4.7 million unique galaxy and quasar redshifts in the range $0.1<z<2.1$ divided into six redshift bins over a $\sim 7,500$ square degree footprint, from the first year of observations with the Dark Energy Spectroscopic Instrument (DESI Data Release 1). By fitting the full power spectrum, we extend previous DESI DR1 baryon acoustic oscillation (BAO) measurements to include redshift-space distortions and signals from the matter-radiation equality scale. For the first time, this Full-Shape analysis is blinded at the catalogue-level to avoid confirmation bias and the systematic errors are accounted for at the two-point clustering level, which automatically propagates them into any cosmological parameter. When analysing the data in terms of compressed model-agnostic variables, we obtain a combined precision of 4.7\% on the amplitude of the redshift space distortion signal, reaching with just one year of DESI data a precision similar to that of 20 years of observation from previous generation surveys. We analyse the data to directly constrain the cosmological parameters within the $Λ$CDM model using perturbation theory and combine this information with the reconstructed DESI DR1 galaxy BAO. Using a Big Bang Nucleosynthesis Gaussian prior on the baryon density parameter, and a Gaussian prior on the spectral index, we constrain the matter density to $Ω_m=0.296\pm 0.010$ and the Hubble constant to $H_0=(68.63 \pm 0.79)[{\rm km\, s^{-1}Mpc^{-1}}]$. Additionally, we measure the amplitude of clustering $σ_8=0.841 \pm 0.034$. The DESI DR1 results are in agreement with the $Λ$CDM model based on general relativity with parameters consistent with those from Planck. The cosmological interpretation of these results in combination with external datasets is presented in a companion paper.
Submitted 11 March, 2025; v1 submitted 18 November, 2024;
originally announced November 2024.
-
DESI 2024 II: Sample Definitions, Characteristics, and Two-point Clustering Statistics
Authors:
DESI Collaboration,
A. G. Adame,
J. Aguilar,
S. Ahlen,
S. Alam,
D. M. Alexander,
M. Alvarez,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
S. Avila,
A. Aviles,
H. Awan,
S. Bailey,
C. Baltay,
A. Bault,
J. Behera,
S. BenZvi,
F. Beutler,
D. Bianchi,
C. Blake,
R. Blum,
S. Brieden,
A. Brodzeller
, et al. (178 additional authors not shown)
Abstract:
We present the samples of galaxies and quasars used for DESI 2024 cosmological analyses, drawn from the DESI Data Release 1 (DR1). We describe the construction of large-scale structure (LSS) catalogs from these samples, which include matched sets of synthetic reference `randoms' and weights that account for variations in the observed density of the samples due to experimental design and varying instrument performance. We detail how we correct for variations in observational completeness, the input `target' densities due to imaging systematics, and the ability to confidently measure redshifts from DESI spectra. We then summarize how remaining uncertainties in the corrections can be translated to systematic uncertainties for particular analyses. We describe the weights added to maximize the signal-to-noise of DESI DR1 2-point clustering measurements. We detail measurement pipelines applied to the LSS catalogs that obtain 2-point clustering measurements in configuration and Fourier space. The resulting 2-point measurements depend on window functions and normalization constraints particular to each sample, and we present the corrections required to match models to the data. We compare the configuration- and Fourier-space 2-point clustering of the data samples to that recovered from simulations of DESI DR1 and find they are, generally, in statistical agreement to within 2\% in the inferred real-space over-density field. The LSS catalogs, 2-point measurements, and their covariance matrices will be released publicly with DESI DR1.
Submitted 18 November, 2024;
originally announced November 2024.
-
EROAM: Event-based Camera Rotational Odometry and Mapping in Real-time
Authors:
Wanli Xing,
Shijie Lin,
Linhan Yang,
Zeqing Zhang,
Yanjun Du,
Maolin Lei,
Yipeng Pan,
Jia Pan
Abstract:
This paper presents EROAM, a novel event-based rotational odometry and mapping system that achieves real-time, accurate camera rotation estimation. Unlike existing approaches that rely on event generation models or contrast maximization, EROAM employs a spherical event representation by projecting events onto a unit sphere and introduces Event Spherical Iterative Closest Point (ES-ICP), a novel geometric optimization framework designed specifically for event camera data. The spherical representation simplifies rotational motion formulation while enabling continuous mapping for enhanced spatial resolution. Combined with parallel point-to-line optimization, EROAM achieves efficient computation without compromising accuracy. Extensive experiments on both synthetic and real-world datasets show that EROAM significantly outperforms state-of-the-art methods in terms of accuracy, robustness, and computational efficiency. Our method maintains consistent performance under challenging conditions, including high angular velocities and extended sequences, where other methods often fail or show significant drift. Additionally, EROAM produces high-quality panoramic reconstructions with preserved fine structural details.
Submitted 17 November, 2024;
originally announced November 2024.
-
Synthesis Method for Obtaining Characteristic Modes of Multi-Structure Systems via independent Structure T-Matrix
Authors:
Chenbo Shi,
Xin Gu,
Shichen Liang,
Jin Pan,
Le Zuo
Abstract:
This paper presents a novel and efficient method for characteristic mode decomposition in multi-structure systems. By leveraging the translation and rotation matrices of vector spherical wavefunctions, our approach enables the synthesis of a composite system's characteristic modes using independently computed simulations of its constituent structures. The computationally intensive translation process is simplified by decomposing it into three streamlined sub-tasks: rotation, z-axis translation, and inverse rotation, collectively achieving significant improvements in computational efficiency. Furthermore, this method facilitates the exploration of structural orientation effects without incurring additional computational overhead. A series of illustrative numerical examples is provided to validate the accuracy of the proposed method and underscore its substantial advantages in both computational efficiency and practical applicability.
Submitted 21 March, 2025; v1 submitted 29 October, 2024;
originally announced November 2024.
-
Generalized Scattering Matrix of Antenna: Moment Solution, Compression Storage and Application
Authors:
Chenbo Shi,
Jin Pan,
Xin Gu,
Shichen Liang,
Le Zuo
Abstract:
This paper presents a method for computing the generalized scattering matrix (GSM) based on integral equations and the method of moments (MoM), specifically designed for antennas excited through waveguide ports. By leveraging two distinct formulations -- magnetic-type and electric-type integral equations -- we establish concise algebraic relations linking the GSM directly to the impedance matrices obtained from MoM. To address practical challenges in storing GSM data across wide frequency bands and multiple antenna scenarios, we propose an efficient compression scheme. This approach alleviates memory demands by selectively storing the dominant eigencomponents that govern scattering behavior. Numerical validation examples confirm the accuracy of our method by comparison with full-wave simulation results. Furthermore, we introduce an efficient iterative procedure to predict antenna array performance, highlighting remarkable improvements in computational speed compared to conventional numerical methods. These results collectively demonstrate the GSM framework's strong potential for antenna-array design processes.
Submitted 23 April, 2025; v1 submitted 29 October, 2024;
originally announced November 2024.
-
DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions
Authors:
Shu-Tong Niu,
Jun Du,
Ruo-Yu Wang,
Gao-Bin Yang,
Tian Gao,
Jia Pan,
Yu Hu
Abstract:
We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end automatic speech recognition (ASR), combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization module effectively. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries. Additionally, we incorporate an optional multi-input multi-output speech enhancement module (MIMO-SE) within the DCF-DS framework, which offers further performance gains. Finally, we enhance diarization results by re-clustering DCF-DS outputs, improving ASR accuracy. By incorporating the DCF-DS method, we achieved first place in the realistic single-channel track of the CHiME-8 NOTSOFAR-1 challenge. We also perform the evaluation on the open LibriCSS dataset, achieving a new state-of-the-art single-channel speech recognition performance.
Submitted 27 December, 2024; v1 submitted 10 November, 2024;
originally announced November 2024.
-
State Chrono Representation for Enhancing Generalization in Reinforcement Learning
Authors:
Jianda Chen,
Wen Zheng Terence Ng,
Zichen Chen,
Sinno Jialin Pan,
Tianwei Zhang
Abstract:
In reinforcement learning with image-based inputs, it is crucial to establish a robust and generalizable state representation. Recent advancements in metric learning, such as deep bisimulation metric approaches, have shown promising results in learning a structured low-dimensional representation space from pixel observations, where the distance between states is measured based on task-relevant features. However, these approaches face challenges in demanding generalization tasks and scenarios with non-informative rewards. This is because they fail to capture sufficient long-term information in the learned representations. To address these challenges, we propose a novel State Chrono Representation (SCR) approach. SCR augments state metric-based representations by incorporating extensive temporal information into the update step of bisimulation metric learning. It learns state distances within a temporal framework that considers both future dynamics and cumulative rewards over current and long-term future states. Our learning strategy effectively incorporates future behavioral information into the representation space without introducing a significant number of additional parameters for modeling dynamics. Extensive experiments conducted in DeepMind Control and Meta-World environments demonstrate that SCR achieves better performance compared to other recent metric-based methods in demanding generalization tasks. The code for SCR is available at https://github.com/jianda-chen/SCR.
Submitted 9 November, 2024;
originally announced November 2024.
-
Processing and Decoding Rydberg Decay Error with MBQC
Authors:
Cheng-Cheng Yu,
Zi-Han Chen,
Yu-Hao Deng,
Ming-Cheng Chen,
Chao-Yang Lu,
Jian-Wei Pan
Abstract:
Achieving fault-tolerant quantum computing with neutral atoms necessitates careful consideration of the errors inherent to this system. One typical error is the leakage from Rydberg states during the implementation of multi-qubit gates, which may propagate to multiple correlated errors and deteriorate the performance of error correction. To address this, researchers have proposed an erasure conversion protocol that employs fast leakage detection and continuous atomic replacement to convert leakage errors into benign erasure errors. While this method achieves a high threshold and a favorable error distance d_e = d, its applicability is restricted to certain atom species. In this work, we present a novel approach to manage Rydberg decay errors in measurement-based quantum computation (MBQC). From a hardware perspective, we utilize practical experimental techniques along with an adaptation of the Pauli twirling approximation (PTA) to mitigate the impacts of leakage error, which propagates similarly to Pauli error without degrading the error distance. From a decoding perspective, we leverage the inherent structure of topological cluster states and final leakage detection information to locate propagated errors from Rydberg decay error. This approach eliminates the need for mid-circuit leakage detection, while maintaining an error distance d_e = d and achieving a high threshold of 3.617(3)% per CZ gate for pure Rydberg decay. In the presence of additional Pauli errors, we demonstrate the performance of our protocol in logical error rate within a reasonable range of physical errors and draw a comparison with erasure conversion. The results show comparable performance within a modest R_e, which suggests a possible application of our method on near-term platforms.
Submitted 9 March, 2025; v1 submitted 7 November, 2024;
originally announced November 2024.
-
Differential absorption ozone Lidar with 4H-SiC single-photon detectors
Authors:
Xian-Song Zhao,
Chao Yu,
Chong Wang,
Tianyi Li,
Bo Liu,
Hai Lu,
Rong Zhang,
Xiankang Dou,
Jun Zhang,
Jian-Wei Pan
Abstract:
Differential absorption Lidar (DIAL) in the ultraviolet (UV) region is an effective approach for monitoring tropospheric ozone. 4H-SiC single-photon detectors (SPDs) are emergent devices for UV single-photon detection. Here, we demonstrate a 4H-SiC SPD-based ozone DIAL. We design and fabricate the 4H-SiC single-photon avalanche diode with a beveled mesa structure and optimized layer thickness. An active quenching circuit with a quenching time of 1.03 ns is developed to significantly mitigate the afterpulsing effect while enhancing the maximum count rate. After characterization, the SPD exhibits excellent performance with a photon detection efficiency of 16.6% at 266 nm, a dark count rate of 138 kcps, a maximum count rate of 13 Mcps, and an afterpulse probability of 2.7% at room temperature. Then, we apply two 4H-SiC SPDs in an ozone DIAL. The measured ozone concentrations at altitudes of 1-3.5 km agree well with the results of a commercial ozone DIAL. Our work provides an alternative solution for general UV Lidar applications.
Submitted 6 March, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Authors:
Yingzi Ma,
Jiongxiao Wang,
Fei Wang,
Siyuan Ma,
Jiazhao Li,
Jinsheng Pan,
Xiujun Li,
Furong Huang,
Lichao Sun,
Bo Li,
Yejin Choi,
Muhao Chen,
Chaowei Xiao
Abstract:
Machine unlearning has emerged as an effective strategy for forgetting specific information in the training data. However, with the increasing integration of visual data, privacy concerns in Vision Language Models (VLMs) remain underexplored. To address this, we introduce Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms under the Right to be Forgotten setting. Specifically, we formulate the VLM unlearning task via constructing the Fictitious Facial Identity VQA dataset and apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels. In terms of evaluation, since VLMs support many ways of asking questions with the same semantic meaning, we also provide robust evaluation metrics including membership inference attacks and carefully designed adversarial privacy attacks to evaluate the performance of algorithms. Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance, with significant trade-offs between model utility and forget quality. Furthermore, our findings also highlight the importance of privacy attacks for robust evaluations. We hope FIUBench will drive progress in developing more effective VLM unlearning algorithms.
Submitted 7 March, 2025; v1 submitted 5 November, 2024;
originally announced November 2024.
-
xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
Authors:
Jiarui Fang,
Jinzhe Pan,
Xibo Sun,
Aoyu Li,
Jiannan Wang
Abstract:
Diffusion models are pivotal for generating high-quality images and videos. Inspired by the success of OpenAI's Sora, the backbone of diffusion models is evolving from U-Net to Transformer, known as Diffusion Transformers (DiTs). However, generating high-quality content necessitates longer sequence lengths, quadratically increasing the computation required for the attention mechanism, and escalating DiTs inference latency. Parallel inference is essential for real-time DiTs deployments, but relying on a single parallel method is impractical due to poor scalability at large scales. This paper introduces xDiT, a comprehensive parallel inference engine for DiTs. After thoroughly investigating existing DiTs parallel approaches, xDiT chooses Sequence Parallel (SP) and PipeFusion, a novel Patch-level Pipeline Parallel method, as intra-image parallel strategies, alongside CFG parallel for inter-image parallelism. xDiT can flexibly combine these parallel approaches in a hybrid manner, offering a robust and scalable solution. Experimental results on two 8xL40 GPU (PCIe) nodes interconnected by Ethernet and an 8xA100 (NVLink) node showcase xDiT's exceptional scalability across five state-of-the-art DiTs. Notably, we are the first to demonstrate DiTs scalability on Ethernet-connected GPU clusters. xDiT is available at https://github.com/xdit-project/xDiT.
Submitted 3 November, 2024;
originally announced November 2024.
-
Thermodynamics of the Kerr-AdS black hole from an ensemble-averaged theory
Authors:
Peng Cheng,
Jindong Pan,
Haichen Xu,
Si-Jiang Yang
Abstract:
Exploring the universal structure of the gravitational path integral beyond semi-classical saddles and uncovering a compelling statistical interpretation of black hole thermodynamics have long been significant challenges. We investigate the statistical interpretation of the Kerr-AdS black hole thermodynamics through an ensemble-averaged theory. By extending the phase space to include all possible states with conical singularities in their Euclidean counterparts, we derive the probability distribution of different states inherited from the Euclidean gravitational path integral. Moreover, we can define a density matrix of all states in the phase space. By ensemble-averaging over all states, we show that the black hole phase transition naturally arises in the semi-classical limit. Away from the semi-classical regime, the ensemble-averaged theory exhibits a notable deviation from the conventional phase transition. Expanding around the classical saddles yields the subleading-order correction to the Gibbs free energy, which is half of the Hawking temperature. We demonstrate that the half Hawking temperature correction is a universal feature inherent to black holes in asymptotically AdS spacetime. With the subleading-order correction to the Gibbs free energy, we also suggest that the whole of black hole thermodynamics should be corrected accordingly.
Submitted 22 April, 2025; v1 submitted 30 October, 2024;
originally announced October 2024.
-
FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks
Authors:
Jiongxiao Wang,
Fangzhou Wu,
Wendi Li,
Jinsheng Pan,
Edward Suh,
Z. Morley Mao,
Muhao Chen,
Chaowei Xiao
Abstract:
Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications. However, integrating external information into LLM-integrated applications raises significant security concerns. Among these, prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire. While both training-time and test-time defense methods have been developed to mitigate such attacks, the unaffordable training costs associated with training-time methods and the limited effectiveness of existing test-time methods make them impractical. This paper introduces a novel test-time defense strategy, named Formatting AuThentication with Hash-based tags (FATH). Unlike existing approaches that prevent LLMs from answering additional instructions in external text, our method implements an authentication system, requiring LLMs to answer all received instructions with a security policy and selectively filter out responses to user instructions as the final output. To achieve this, we utilize hash-based authentication tags to label each response, facilitating accurate identification of responses according to the user's instructions and improving the robustness against adaptive attacks. Comprehensive experiments demonstrate that our defense method can effectively defend against indirect prompt injection attacks, achieving state-of-the-art performance under Llama3 and GPT3.5 models across various attack methods. Our code is released at: https://github.com/Jayfeather1024/FATH
Submitted 25 November, 2024; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Multi-modal AI for comprehensive breast cancer prognostication
Authors:
Jan Witowski,
Ken G. Zeng,
Joseph Cappadona,
Jailan Elayoubi,
Khalil Choucair,
Elena Diana Chiru,
Nancy Chan,
Young-Joon Kang,
Frederick Howard,
Irina Ostrovnaya,
Carlos Fernandez-Granda,
Freya Schnabel,
Zoe Steinsnyder,
Ugur Ozerdem,
Kangning Liu,
Waleed Abdulsattar,
Yu Zong,
Lina Daoud,
Rafic Beydoun,
Anas Saad,
Nitya Thakore,
Mohammad Sadic,
Frank Yeung,
Elisa Liu,
Theodore Hill
, et al. (26 additional authors not shown)
Abstract:
Treatment selection in breast cancer is guided by molecular subtypes and clinical characteristics. However, current tools including genomic assays lack the accuracy required for optimal clinical decision-making. We developed a novel artificial intelligence (AI)-based approach that integrates digital pathology images with clinical data, providing a more robust and effective method for predicting the risk of cancer recurrence in breast cancer patients. Specifically, we utilized a vision transformer pan-cancer foundation model trained with self-supervised learning to extract features from digitized H&E-stained slides. These features were integrated with clinical data to form a multi-modal AI test predicting cancer recurrence and death. The test was developed and evaluated using data from a total of 8,161 female breast cancer patients across 15 cohorts originating from seven countries. Of these, 3,502 patients from five cohorts were used exclusively for evaluation, while the remaining patients were used for training. Our test accurately predicted our primary endpoint, disease-free interval, in the five evaluation cohorts (C-index: 0.71 [0.68-0.75], HR: 3.63 [3.02-4.37, p<0.001]). In a direct comparison (n=858), the AI test was more accurate than Oncotype DX, the standard-of-care 21-gene assay, achieving a C-index of 0.67 [0.61-0.74] versus 0.61 [0.49-0.73], respectively. Additionally, the AI test added independent prognostic information to Oncotype DX in a multivariate analysis (HR: 3.11 [1.91-5.09, p<0.001]). The test demonstrated robust accuracy across major molecular breast cancer subtypes, including TNBC (C-index: 0.71 [0.62-0.81], HR: 3.81 [2.35-6.17, p=0.02]), where no diagnostic tools are currently recommended by clinical guidelines. These results suggest that our AI test improves upon the accuracy of existing prognostic tests, while being applicable to a wider range of patients.
Submitted 2 March, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Fractal and Turbulent Feature Extraction and NFT Label Generation for Pollock Style Migration Paintings Based on VGG19
Authors:
Yiquan Wang,
Xu Wang,
Jiazhuo Pan
Abstract:
This paper presents an approach that fuses deep learning, fractal analysis, and turbulence feature extraction to create abstract artworks in the style of Pollock. The content and style characteristics of the image are extracted with the MindSpore deep learning framework and a pre-trained VGG19 model. An optimisation process then generates high-quality Pollock-style images by combining content loss, style loss and total variation loss to achieve accurate style migration. Furthermore, this paper implements a fractal dimension calculation based on the differential box-counting method, which estimates the fractal dimension of an image through edge extraction and fractal analysis. A two-dimensional discrete wavelet transform with a Haar wavelet decomposes the image to extract information at different frequencies. Multiple features are then combined to generate unique non-fungible token (NFT) labels for the authentication and protection of digital artwork. The experimental results demonstrate that the generated artworks exhibit significant diversity and complexity in terms of fractal dimensions and turbulence features, while the generated NFT tags ensure the uniqueness and tamper resistance of each digital collection. The method combines computer vision, digital signal processing and blockchain technology to provide a new solution for the creation and authentication of digital artworks.
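The box-counting idea behind the fractal dimension estimate can be sketched in a few lines. This is not the authors' implementation: it uses plain box counting on a binary (edge) grid rather than the grey-level variant named in the abstract, and the function name, box sizes, and test grids are assumptions made for illustration.

```python
import math


def box_counting_dimension(grid, box_sizes=(1, 2, 4, 8)):
    """Estimate the box-counting dimension of a square binary grid.

    grid is a list of rows holding truthy values for foreground (edge) pixels.
    """
    n = len(grid)
    xs, ys = [], []
    for s in box_sizes:
        count = 0
        # Count the s-by-s boxes that contain at least one foreground pixel.
        for r in range(0, n, s):
            for c in range(0, n, s):
                if any(grid[i][j]
                       for i in range(r, min(r + s, n))
                       for j in range(c, min(c + s, n))):
                    count += 1
        if count == 0:
            raise ValueError("grid has no foreground pixels")
        xs.append(math.log(1.0 / s))
        ys.append(math.log(count))
    # The dimension is the least-squares slope of log N(s) against log(1/s).
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Sanity checks on shapes with known dimension (16x16 grids).
filled = [[1] * 16 for _ in range(16)]
diagonal = [[1 if i == j else 0 for j in range(16)] for i in range(16)]
dim_filled = box_counting_dimension(filled)
dim_diagonal = box_counting_dimension(diagonal)
```

A filled square should come out close to dimension 2 and a straight line close to 1. Textured edge maps such as drip paintings would be expected to fall in between.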
Submitted 3 November, 2024; v1 submitted 27 October, 2024;
originally announced October 2024.
-
AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction
Authors:
Hongru Wang,
Rui Wang,
Boyang Xue,
Heming Xia,
Jingtao Cao,
Zeming Liu,
Jeff Z. Pan,
Kam-Fai Wong
Abstract:
Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. Previous research primarily focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationship between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources (e.g., different Apps in the iPhone), especially for complex user instructions. In this paper, we introduce \texttt{AppBench}, the first benchmark to evaluate LLMs' ability to plan and execute multiple APIs from various sources in order to complete the user's task. Specifically, we consider two significant challenges in multiple APIs: \textit{1) graph structures:} some APIs can be executed independently while others need to be executed one by one, resulting in graph-like execution order; and \textit{2) permission constraints:} which source is authorized to execute the API call. Experimental results on 9 distinct LLMs (e.g., GPT-4o achieves only a 2.0\% success rate on the most complex instructions) reveal that existing state-of-the-art LLMs still cannot perform well in this situation, even with the help of in-context learning and finetuning. Our code and data are publicly available at https://github.com/ruleGreen/AppBench.
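The graph-like execution order described in this abstract can be illustrated with a topological sort: calls with no unmet dependencies may run together, while dependent calls must wait. The sketch below is not taken from AppBench. The API names and the dependency map are hypothetical, and only the standard-library `graphlib` module is used.

```python
from graphlib import TopologicalSorter


def execution_batches(dependencies):
    """Group API calls into batches that can run independently.

    dependencies maps each call to the set of calls it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # calls whose inputs are all available
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical plan: look up a contact and a route, then create an event,
# then send a message about that event.
plan = {
    "calendar.create_event": {"maps.get_route", "contacts.lookup"},
    "messages.send": {"calendar.create_event"},
    "maps.get_route": set(),
    "contacts.lookup": set(),
}
batches = execution_batches(plan)
```

In this toy plan the two lookups form the first batch, followed by the event creation and then the message. A permission check, such as a map from each call to the app allowed to issue it, could be applied per batch.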
Submitted 10 October, 2024;
originally announced October 2024.
-
Predicting Liquidity Coverage Ratio with Gated Recurrent Units: A Deep Learning Model for Risk Management
Authors:
Zhen Xu,
Jingming Pan,
Siyuan Han,
Hongju Ouyang,
Yuan Chen,
Mohan Jiang
Abstract:
With the global economic integration and the high interconnection of financial markets, financial institutions are facing unprecedented challenges, especially liquidity risk. This paper proposes a liquidity coverage ratio (LCR) prediction model based on the gated recurrent unit (GRU) network to help financial institutions manage their liquidity risk more effectively. By utilizing the GRU network in deep learning technology, the model can automatically learn complex patterns from historical data and accurately predict LCR for a period of time in the future. The experimental results show that compared with traditional methods, the GRU model proposed in this study shows significant advantages in mean absolute error (MAE), proving its higher accuracy and robustness. This not only provides financial institutions with a more reliable liquidity risk management tool but also provides support for regulators to formulate more scientific and reasonable policies, which helps to improve the stability of the entire financial system.
Submitted 24 October, 2024;
originally announced October 2024.
-
Hook-valued tableaux uncrowding and tableau switching
Authors:
Jihyeug Jang,
Jang Soo Kim,
Jianping Pan,
Joseph Pappe,
Anne Schilling
Abstract:
Refined canonical stable Grothendieck polynomials were introduced by Hwang, Jang, Kim, Song, and Song. There exist two combinatorial models for these polynomials: one using hook-valued tableaux and the other using pairs of a semistandard Young tableau and (what we call) an exquisite tableau. An uncrowding algorithm on hook-valued tableaux was introduced by Pan, Pappe, Poh, and Schilling. In this paper, we discover a novel connection between the two models via the uncrowding and Goulden--Greene's jeu de taquin algorithms, using a classical result of Benkart, Sottile, and Stroomer on tableau switching. This connection reveals a hidden symmetry of the uncrowding algorithm defined on hook-valued tableaux. As a corollary, we obtain another combinatorial model for the refined canonical stable Grothendieck polynomials in terms of biflagged tableaux, which naturally appear in the characterization of the image of the uncrowding map.
Submitted 23 October, 2024;
originally announced October 2024.
-
Magnetoresistance oscillations in vertical junctions of 2D antiferromagnetic semiconductor CrPS$_4$
Authors:
Pengyuan Shi,
Xiaoyu Wang,
Lihao Zhang,
Wenqin Song,
Kunlin Yang,
Shuxi Wang,
Ruisheng Zhang,
Liangliang Zhang,
Takashi Taniguchi,
Kenji Watanabe,
Sen Yang,
Lei Zhang,
Lei Wang,
Wu Shi,
Jie Pan,
Zhe Wang
Abstract:
Magnetoresistance (MR) oscillations serve as a hallmark of intrinsic quantum behavior, traditionally observed only in conducting systems. Here we report the discovery of MR oscillations in an insulating system, the vertical junctions of CrPS$_4$, which is a two-dimensional (2D) A-type antiferromagnetic semiconductor. Systematic investigations of MR peaks under varying conditions, including electrode materials, magnetic field direction, temperature, voltage bias and layer number, elucidate a correlation between MR oscillations and spin-canted states in CrPS$_4$. Experimental data and analysis point to the important role of the in-gap electronic states in generating MR oscillations, and we propose that spin-selected interlayer hopping of localized defect states may be responsible for them. Our findings not only illuminate the unusual electronic transport in CrPS$_4$ but also underscore the potential of van der Waals magnets for exploring interesting phenomena.
Submitted 19 November, 2024; v1 submitted 23 October, 2024;
originally announced October 2024.
-
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
Authors:
Xintong Wang,
Jingheng Pan,
Liang Ding,
Longyue Wang,
Longqin Jiang,
Xingshan Li,
Chris Biemann
Abstract:
Large Language Models (LLMs) achieve remarkable performance through pretraining on extensive data. This enables efficient adaptation to diverse downstream tasks. However, the lack of interpretability in their underlying mechanisms limits the ability to effectively steer LLMs for specific applications. In this work, we investigate the intrinsic mechanisms of LLMs from a cognitive perspective using eye movement measures. Specifically, we analyze the layer-wise correlation between human cognitive indicators and LLM representations. Building on these insights, we propose a heuristic approach for selecting the optimal steering layer to modulate LLM semantics. To this end, we introduce an efficient selective layer intervention based on prominent parameter-efficient fine-tuning methods, which conventionally adjust either all layers or only the final layer. Additionally, we present an implicit layer contrastive intervention during inference to steer LLMs away from toxic outputs. Extensive experiments on natural language understanding, reasoning, and generation tasks, conducted on GPT-2, LLaMa2-7B, and Mixtral-7B, demonstrate the effectiveness and efficiency of our approach. As a model-agnostic framework, it enhances the interpretability of LLMs while improving efficiency for safe deployment.
Submitted 18 February, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Atomic Fact Decomposition Helps Attributed Question Answering
Authors:
Zhichao Yan,
Jiapu Wang,
Jiaoyan Chen,
Xiaoli Li,
Ru Li,
Jeff Z. Pan
Abstract:
Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach, including two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, RTR-based AQA often suffers from irrelevant knowledge and rapidly changing information, even when LLMs are adopted, while post-hoc retrieval-based AQA struggles with comprehending long-form answers with complex logic, precisely identifying the content needing revision, and preserving the original intent. To tackle these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which uses instruction-tuned LLMs to decompose the generated long-form answers into molecular clauses and atomic facts. Notably, the instruction-tuned LLMs are fine-tuned using a well-constructed dataset generated from large-scale Knowledge Graphs (KGs). This process involves extracting one-hop neighbors from a given set of entities and transforming the result into coherent long-form text. Subsequently, ARE leverages a search engine to retrieve evidence related to atomic facts, inputting this evidence into an LLM-based verifier to determine whether the facts require expansion for re-retrieval or editing. Furthermore, the edited facts are backtracked into the original answer, with evidence aggregated based on the relationship between molecular clauses and atomic facts. Extensive evaluations demonstrate the superior performance of our proposed method over the state of the art on several datasets, with an additionally proposed new metric $Attr_{p}$ for evaluating the precision of evidence attribution.
Submitted 22 October, 2024;
originally announced October 2024.
-
Search for gravitational waves emitted from SN 2023ixf
Authors:
The LIGO Scientific Collaboration,
the Virgo Collaboration,
the KAGRA Collaboration,
A. G. Abac,
R. Abbott,
I. Abouelfettouh,
F. Acernese,
K. Ackley,
S. Adhicary,
N. Adhikari,
R. X. Adhikari,
V. K. Adkins,
D. Agarwal,
M. Agathos,
M. Aghaei Abchouyeh,
O. D. Aguiar,
I. Aguilar,
L. Aiello,
A. Ain,
T. Akutsu,
S. Albanesi,
R. A. Alfaidi,
A. Al-Jodah,
C. Alléné,
A. Allocca
, et al. (1758 additional authors not shown)
Abstract:
We present the results of a search for gravitational-wave transients associated with core-collapse supernova SN 2023ixf, which was observed in the galaxy Messier 101 via optical emission on 2023 May 19th, during the LIGO-Virgo-KAGRA 15th Engineering Run. We define a five-day on-source window during which an accompanying gravitational-wave signal may have occurred. No gravitational waves have been identified in data when at least two gravitational-wave observatories were operating, which covered $\sim 14\%$ of this five-day window. We report the search detection efficiency for various possible gravitational-wave emission models. Considering the distance to M101 (6.7 Mpc), we derive constraints on the gravitational-wave emission mechanism of core-collapse supernovae across a broad frequency spectrum, ranging from 50 Hz to 2 kHz where we assume the gravitational-wave emission occurred when coincident data are available in the on-source window. Considering an ellipsoid model for a rotating proto-neutron star, our search is sensitive to gravitational-wave energy $1 \times 10^{-4} M_{\odot} c^2$ and luminosity $2.6 \times 10^{-4} M_{\odot} c^2/s$ for a source emitting at 82 Hz. These constraints are around an order of magnitude more stringent than those obtained so far with gravitational-wave data. The constraint on the ellipticity of the proto-neutron star that is formed is as low as 1.08, at frequencies above 1200 Hz, surpassing past results.
Submitted 11 March, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
Cohomotopy Sets of $(n-1)$-connected $(2n+2)$-manifolds for small $n$
Authors:
Pengcheng Li,
Jianzhong Pan,
Jie Wu
Abstract:
Let $M$ be a closed orientable $(n-1)$-connected $(2n+2)$-manifold, $n\geq 2$. In this paper we combine the Postnikov tower of spheres and the homotopy decomposition of the reduced suspension space $ΣM$ to investigate the cohomotopy sets $π^\ast(M)$ for $n=2,3,4$, under the assumption that $M$ has $2$-torsion-free homology. All cohomotopy sets $π^i(M)$ of such manifolds $M$ are characterized except $π^4(M)$ for $n=3,4$.
Submitted 5 May, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
MAC Revivo: Artificial Intelligence Paves the Way
Authors:
Jinzhe Pan,
Jingqing Wang,
Zelin Yun,
Zhiyong Xiao,
Yuehui Ouyang,
Wenchi Cheng,
Wei Zhang
Abstract:
The vast adoption of Wi-Fi and/or Bluetooth capabilities in Internet of Things (IoT) devices, along with the rapid growth of deployed smart devices, has caused significant interference and congestion in the industrial, scientific, and medical (ISM) bands. Traditional Wi-Fi Medium Access Control (MAC) design faces significant challenges in managing increasingly complex wireless environments while ensuring network Quality of Service (QoS) performance. This paper explores the potential integration of advanced Artificial Intelligence (AI) methods into the design of Wi-Fi MAC protocols. We propose AI-MAC, an innovative approach that employs machine learning algorithms to dynamically adapt to changing network conditions, optimize channel access, mitigate interference, and ensure deterministic latency. By intelligently predicting and managing interference, AI-MAC aims to provide a robust solution for the next generation of Wi-Fi networks, enabling seamless connectivity and enhanced QoS. Our experimental results demonstrate that AI-MAC significantly reduces both interference and latency, paving the way for more reliable and efficient wireless communications in the increasingly crowded ISM band.
Submitted 21 October, 2024;
originally announced October 2024.
-
On the topology of manifolds with nonnegative Ricci curvature and linear volume growth
Authors:
Dimitri Navarro,
Jiayin Pan,
Xingyu Zhu
Abstract:
Understanding the relationships between geometry and topology is a central theme in Riemannian geometry. We establish two results on the fundamental groups of open (complete and noncompact) $n$-manifolds with nonnegative Ricci curvature and linear volume growth. First, we show that the fundamental group of such a manifold contains a subgroup $\mathbb{Z}^k$ of finite index, where $0\le k\le n-1$. Second, we prove that if the Ricci curvature is positive everywhere, then the fundamental group is finite. The proofs are based on an analysis of the equivariant asymptotic geometry of successive covering spaces and a plane/halfplane rigidity result for RCD spaces.
Submitted 20 October, 2024;
originally announced October 2024.
-
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
Authors:
Xiongtao Zhou,
Jie He,
Lanyu Chen,
Jingyu Li,
Haojing Chen,
Víctor Gutiérrez-Basulto,
Jeff Z. Pan,
Hanjie Chen
Abstract:
Multimodal Chain of Thought (MCoT) is a popular prompting strategy for improving the performance of multimodal large language models (MLLMs) across a range of complex reasoning tasks. Despite its popularity, there is a notable absence of automated methods for evaluating the quality of reasoning steps in MCoT. To address this gap, we propose Multimodal Chain-of-Thought Evaluation (MiCEval), a framework designed to assess the correctness of reasoning chains by evaluating the quality of both the description and each reasoning step. The evaluation of the description component focuses on the accuracy of the image descriptions, while the reasoning step evaluates the quality of each step as it is conditionally generated based on the preceding steps. MiCEval is built upon a fine-grained dataset with annotations that rate each step according to correctness, relevance, and informativeness. Extensive experiments on four state-of-the-art MLLMs show that step-wise evaluations using MiCEval align more closely with human judgments compared to existing methods based on cosine similarity or fine-tuning approaches. MiCEval datasets and code can be found in https://github.com/alenai97/MiCEval.
Submitted 28 February, 2025; v1 submitted 18 October, 2024;
originally announced October 2024.
-
DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation
Authors:
Hanbo Cheng,
Limin Lin,
Chenyu Liu,
Pengcheng Xia,
Pengfei Hu,
Jiefeng Ma,
Jun Du,
Jia Pan
Abstract:
Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.
Submitted 26 March, 2025; v1 submitted 17 October, 2024;
originally announced October 2024.
-
Super-resolving Real-world Image Illumination Enhancement: A New Dataset and A Conditional Diffusion Model
Authors:
Yang Liu,
Yaofang Liu,
Jinshan Pan,
Yuxiang Hui,
Fan Jia,
Raymond H. Chan,
Tieyong Zeng
Abstract:
Most existing super-resolution methods and datasets have been developed to improve image quality in well-lit conditions. However, these methods do not work well in real-world low-light conditions, as images captured in such conditions lose most important information and contain significant unknown noise. To solve this problem, we propose the SRRIIE dataset together with an efficient method based on conditional diffusion probabilistic models. The proposed dataset contains 4800 paired low-high quality images. To ensure that the dataset is able to model real-world image degradation in low-illumination environments, we capture images using an ILDC camera and an optical zoom lens with exposure levels ranging from -6 EV to 0 EV and ISO levels ranging from 50 to 12800. We comprehensively evaluate with various reconstruction and perceptual metrics and demonstrate the practicability of the SRRIIE dataset for deep learning-based methods. We show that most existing methods are less effective in preserving the structures and sharpness of images restored from complicated noise. To overcome this problem, we revise the condition for Raw sensor data and propose a novel time-melding condition for the diffusion probabilistic model. Comprehensive quantitative and qualitative experimental results on real-world benchmark datasets demonstrate the feasibility and effectiveness of the proposed conditional diffusion probabilistic model on Raw sensor data. Code and dataset will be available at https://github.com/Yaofang-Liu/Super-Resolving
Submitted 16 October, 2024;
originally announced October 2024.
-
DaDiff: Domain-aware Diffusion Model for Nighttime UAV Tracking
Authors:
Haobo Zuo,
Changhong Fu,
Guangze Zheng,
Liangliang Yao,
Kunhan Lu,
Jia Pan
Abstract:
Domain adaptation is an inspiring solution to the misalignment issue of day/night image features for nighttime UAV tracking. However, the one-step adaptation paradigm is inadequate in addressing the prevalent difficulties posed by low-resolution (LR) objects when viewed from the UAVs at night, owing to the blurry edge contour and limited detail information. Moreover, these approaches struggle to perceive LR objects disturbed by nighttime noise. To address these challenges, this work proposes a novel progressive alignment paradigm, named domain-aware diffusion model (DaDiff), aligning nighttime LR object features to the daytime by virtue of progressive and stable generations. The proposed DaDiff includes an alignment encoder to enhance the detail information of nighttime LR objects, a tracking-oriented layer designed to achieve close collaboration with tracking tasks, and a successive distribution discriminator presented to distinguish different feature distributions at each diffusion timestep successively. Furthermore, an elaborate nighttime UAV tracking benchmark is constructed for LR objects, namely NUT-LR, consisting of 100 annotated sequences. Exhaustive experiments have demonstrated the robustness and feature alignment ability of the proposed DaDiff. The source code and video demo are available at https://github.com/vision4robotics/DaDiff.
Submitted 16 October, 2024;
originally announced October 2024.
-
Improving the Generalization of Unseen Crowd Behaviors for Reinforcement Learning based Local Motion Planners
Authors:
Wen Zheng Terence Ng,
Jianda Chen,
Sinno Jialin Pan,
Tianwei Zhang
Abstract:
Deploying a safe mobile robot policy in scenarios with human pedestrians is challenging due to their unpredictable movements. Current Reinforcement Learning-based motion planners rely on a single policy to simulate pedestrian movements and could suffer from the over-fitting issue. Alternatively, framing the collision avoidance problem as a multi-agent framework, where agents generate dynamic movements while learning to reach their goals, can lead to conflicts with human pedestrians due to their homogeneity.
To tackle this problem, we introduce an efficient method that enhances agent diversity within a single policy by maximizing an information-theoretic objective. This diversity enriches each agent's experiences, improving its adaptability to unseen crowd behaviors. In assessing an agent's robustness against unseen crowds, we propose diverse scenarios inspired by pedestrian crowd behaviors. Our behavior-conditioned policies outperform existing works in these challenging scenes, reducing potential collisions without additional time or travel.
Submitted 16 October, 2024;
originally announced October 2024.
-
DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution
Authors:
Zhengxue Wang,
Zhiqiang Yan,
Jinshan Pan,
Guangwei Gao,
Kai Zhang,
Jian Yang
Abstract:
Recent RGB-guided depth super-resolution methods have achieved impressive performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth data often suffer from unconventional and unknown degradation due to sensor limitations and complex imaging environments (e.g., low reflective surfaces, varying illumination). Consequently, the performance of these methods significantly declines when real-world degradation deviates from their assumptions. In this paper, we propose the Degradation Oriented and Regularized Network (DORNet), a novel framework designed to adaptively address unknown degradation in real-world scenes through implicit degradation representations. Our approach begins with the development of a self-supervised degradation learning strategy, which models the degradation representations of low-resolution depth data using routing selection-based degradation regularization. To facilitate effective RGB-D fusion, we further introduce a degradation-oriented feature transformation module that selectively propagates RGB content into the depth data based on the learned degradation priors. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our DORNet in handling unknown degradation, outperforming existing methods. The code is available at https://github.com/yanzq95/DORNet.
Submitted 19 March, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Towards Understanding Why FixMatch Generalizes Better Than Supervised Learning
Authors:
Jingyang Li,
Jiachun Pan,
Vincent Y. F. Tan,
Kim-Chuan Toh,
Pan Zhou
Abstract:
Semi-supervised learning (SSL), exemplified by FixMatch (Sohn et al., 2020), has shown significant generalization advantages over supervised learning (SL), particularly in the context of deep neural networks (DNNs). However, it is still unclear, from a theoretical standpoint, why FixMatch-like SSL algorithms generalize better than SL on DNNs. In this work, we present the first theoretical justification for the enhanced test accuracy observed in FixMatch-like SSL applied to DNNs by taking convolutional neural networks (CNNs) on classification tasks as an example. Our theoretical analysis reveals that the semantic feature learning processes in FixMatch and SL are rather different. In particular, FixMatch learns all the discriminative features of each semantic class, while SL only randomly captures a subset of features due to the well-known lottery ticket hypothesis. Furthermore, we show that our analysis framework can be applied to other FixMatch-like SSL methods, e.g., FlexMatch, FreeMatch, Dash, and SoftMatch. Inspired by our theoretical analysis, we develop an improved variant of FixMatch, termed Semantic-Aware FixMatch (SA-FixMatch). Experimental results corroborate our theoretical findings and the enhanced generalization capability of SA-FixMatch.
Submitted 9 March, 2025; v1 submitted 14 October, 2024;
originally announced October 2024.
-
Tunable Einstein-Bohr recoiling-slit gedankenexperiment at the quantum limit
Authors:
Yu-Chen Zhang,
Hao-Wen Cheng,
Zhao-Qiu Zengxu,
Zhan Wu,
Rui Lin,
Yu-Cheng Duan,
Jun Rui,
Ming-Cheng Chen,
Chao-Yang Lu,
Jian-Wei Pan
Abstract:
In 1927, during the fifth Solvay Conference, Einstein and Bohr described a double-slit interferometer with a "movable slit" that can detect the momentum recoil of one photon. Here, we report a faithful realization of the Einstein-Bohr interferometer using a single atom in an optical tweezer, cooled to the motional ground state in three dimensions. The single atom has an intrinsic momentum uncertainty comparable to a single photon, which serves as a movable slit obeying the minimum Heisenberg uncertainty principle. The atom's momentum wavefunction is dynamically tunable by the tweezer laser power, which enables observation of an interferometric visibility reduction at a shallower trap, demonstrating the quantum nature of this interferometer. We further identify classical noise due to atom heating and precession, illustrating a quantum-to-classical transition.
Submitted 14 October, 2024;
originally announced October 2024.