VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression

Authors: Qiang Hu, Houqiang Zhong, Zihan Zheng, Xiaoyun Zhang, Zhengxue Cheng, Li Song, Guangtao Zhai, Yanfeng Wang

Abstract: Neural Radiance Field (NeRF)-based volumetric video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences that provide audiences with unprecedented immersion and interactivity. However, the substantial data volumes pose significant challenges for storage and transmission. Existing solutions typically optimize NeRF representation and compression independently or focus on a single fixed rate-distortion (RD) tradeoff. In this paper, we propose VRVVC, a novel end-to-end joint optimization variable-rate framework for volumetric video compression that achieves variable bitrates using a single model while maintaining superior RD performance. Specifically, VRVVC introduces a compact tri-plane implicit residual representation for inter-frame modeling of long-duration dynamic scenes, effectively reducing temporal redundancy. We further propose a variable-rate residual representation compression scheme that leverages a learnable quantization and a tiny MLP-based entropy model. This approach enables variable bitrates through the utilization of predefined Lagrange multipliers to manage the quantization error of all latent representations. Finally, we present an end-to-end progressive training strategy combined with a multi-rate-distortion loss function to optimize the entire framework. Extensive experiments demonstrate that VRVVC achieves a wide range of variable bitrates within a single model and surpasses the RD performance of existing methods across various datasets.

Submitted 15 December, 2024; originally announced December 2024.

arXiv:2412.10680 [pdf, other]

UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval

Authors: Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai, Xian-Sheng Hua

Abstract: Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen domains and classes without semantic labels, ensuring robust generalization. Existing methods commonly employ prompt tuning with pre-trained vision-language models but are inherently limited by static prompts, reducing adaptability. We propose UCDR-Adapter, which enhances pre-trained models with adapters and dynamic prompt generation through a two-phase training strategy. First, Source Adapter Learning integrates class semantics with domain-specific visual knowledge using a Learnable Textual Semantic Template and optimizes Class and Domain Prompts via momentum updates and dual loss functions for robust alignment. Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling seamless adaptation to unseen domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts to evolving data distributions, enhancing both flexibility and generalization. During inference, only the image branch and generated prompts are used, eliminating reliance on textual inputs for highly efficient retrieval. Extensive benchmark experiments show that UCDR-Adapter consistently outperforms ProS in most cases and other state-of-the-art methods on UCDR, U(c)CDR, and U(d)CDR settings.

Submitted 13 December, 2024; originally announced December 2024.

Comments: Accepted to WACV 2025. Project link: https://github.com/fine68/UCDR2024

arXiv:2412.08781 [pdf, other]

GMem: A Modular Approach for Ultra-Efficient Generative Models

Authors: Yi Tang, Peng Sun, Zhenglin Cheng, Tao Lin

Abstract: Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, leading to a substantial increase in computational demands, which in turn become the primary bottleneck in both training and inference of diffusion models. To this end, we introduce GMem: A Modular Approach for Ultra-Efficient Generative Models. GMem decouples the memory capacity from the model and implements it as a separate, immutable memory set that preserves the essential semantic information in the data. The results are significant: GMem improves training efficiency, sampling efficiency, and generation diversity. This design reduces the reliance on the network to memorize complex data distributions, thereby enhancing both training and sampling efficiency. On ImageNet at $256 \times 256$ resolution, GMem achieves a $50\times$ training speedup compared to SiT, reaching FID $=7.66$ in fewer than $28$ epochs ($\sim 4$ hours training time), while SiT requires $1400$ epochs. Without classifier-free guidance, GMem achieves state-of-the-art (SoTA) performance FID $=1.53$ in $160$ epochs with only $\sim 20$ hours of training, outperforming LightningDiT which requires $800$ epochs and $\sim 95$ hours to attain FID $=2.17$.

Submitted 11 February, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

Comments: 9 pages, 5 figures, 3 tables

arXiv:2412.08086 [pdf, other]

Frequency-resolved Transient Absorption Spectroscopy for High Pressure System

Authors: Zi-Qian Cheng, Xiao-Shuang Yin, Liu-Xiang Yang, Hui Dong

Abstract: The dynamics of materials under high-pressure conditions have been an important focus of materials science, especially on the picosecond and femtosecond timescales of electronic and vibrational motion, which are typically probed by ultrafast laser pulses. Probing such dynamics requires integrating high-pressure devices with an ultrafast laser system. In this work, we construct a frequency-resolved high-pressure transient absorption spectroscopy system based on a diamond anvil cell (DAC) with transmissive detection. In this setup, we use a narrowband laser as the pump beam and supercontinuum white light as the probe beam. To effectively eliminate the scattering noise from the pump light, we design a double-chopper operating mode, which allows us to obtain signals in the complete frequency domain, including the region that overlaps with the pump pulse. We test the system with a Rhodamine B solution, using a probe wavelength range of 450-750 nm and a 550 nm pump, and observe that the intensity of the signal peak corresponding to the monomer at 560 nm decreases continuously relative to the signal peak corresponding to the dimer at 530 nm. This indicates that the portion of Rhodamine B molecules in the dimer form increases with increasing pressure. Additionally, we find two dynamic components in the signal peaks of both the monomer and the dimer: as the pressure increases, the short-lifetime component increases and the long-lifetime component decreases.

Submitted 10 December, 2024; originally announced December 2024.

arXiv:2412.08074 [pdf, other]

EM-Net: Gaze Estimation with Expectation Maximization Algorithm

Authors: Zhang Cheng, Yanxia Wang, Guoyu Xia

Abstract: In recent years, the accuracy of gaze estimation techniques has gradually improved, but existing methods often rely on large datasets or large models to improve performance, which leads to high demands on computational resources. To address this issue, this paper proposes EM-Net, a lightweight gaze estimation model that combines deep learning with a traditional machine learning algorithm, the Expectation Maximization (EM) algorithm. First, the proposed Global Attention Mechanism (GAM) is added to extract features related to gaze estimation, improving the model's ability to capture global dependencies and thus its performance. Second, by learning hierarchical feature representations through the EM module, the model gains strong generalization ability, which reduces the number of training samples it needs. Experiments confirm that, using only 50% of the training data, EM-Net improves performance on the Gaze360, MPIIFaceGaze, and RT-Gene datasets by 2.2%, 2.02%, and 2.03%, respectively, compared with GazeNAS-ETH. It also shows good robustness to Gaussian noise interference.

Submitted 10 December, 2024; originally announced December 2024.

arXiv:2412.05970 [pdf]

doi 10.1002/adma.202502575

Robust magnetoelectric coupling in altermagnetic-ferroelectric type-III multiferroics

Authors: Wei Sun, Wenxuan Wang, Changhong Yang, Ying Liu, Xiaotian Wang, Shifeng Huang, Zhenxiang Cheng

Abstract: Multiferroic materials, characterized by the coexistence of ferroelectric polarization (breaking spatial inversion symmetry) and magnetism (breaking time-reversal symmetry), with strong magnetoelectric coupling, are highly sought after for advanced technological applications. Novel altermagnets, distinct from conventional magnets, have recently been revealed to exhibit unique spin polarization protected by crystal symmetry, which naturally overcomes the isolation of magnetism from ferroelectrics associated with spatial symmetry. In this study, we propose a novel class of type-III multiferroics, where ferroelectricity and altermagnetism are inherently interlocked by crystal symmetry, setting them apart from conventional multiferroics. Through first-principles calculations, ferroelectric switching is shown to fully invert the spin polarization of altermagnets, equivalent to a 180° reversal of magnetic spin. This strong magnetoelectric coupling is further supported by the magneto-optical Kerr effect, revealing a new class of multiferroics with robust, symmetry-driven magnetoelectric coupling and providing a theoretical foundation for the design of next-generation spintronic devices leveraging altermagnetism.

Submitted 8 December, 2024; originally announced December 2024.

Comments: 16 pages, 4 figures

Journal ref: Adv. Mater. 2025, 2502575

arXiv:2412.05798 [pdf, other]

A new pathway to impact ionization in a photo-excited one-dimensional ionic Hubbard model

Authors: Zhenyu Cheng, Li Yang, Xiang Hu, Hantao Lu, Zhongbing Huang, Liang Du

Abstract: Using the time-dependent Lanczos method, we study the non-equilibrium dynamics of the half-filled one-dimensional ionic Hubbard model, deep within the Mott insulating regime, under the influence of a transient laser pulse. In equilibrium, increasing the staggered potential in the Mott regime reduces the Mott gap and broadens the Hubbard bands, creating favorable conditions for impact ionization. After laser excitation, impact ionization is observed, with its occurrence depending on both the staggered potential and the laser pump frequency. By analyzing the time evolution of the kinetic, ionic, and Coulomb interaction energies, we identify a novel mechanism for impact ionization, in which excess ionic potential energy is converted into additional double occupancy, distinct from the conventional mechanism where excess kinetic energy drives this process. We further show that impact ionization arises from interference between excited states driven by photon excitation of the same order. These results present a new pathway for realizing impact ionization in strongly correlated electron systems.

Submitted 7 December, 2024; originally announced December 2024.

Comments: 6 pages, 3 figures

arXiv:2412.04414 [pdf, other]

Emergent unitary designs for encoded qubits from coherent errors and syndrome measurements

Authors: Zihan Cheng, Eric Huang, Vedika Khemani, Michael J. Gullans, Matteo Ippoliti

Abstract: Unitary $k$-designs are distributions of unitary gates that match the Haar distribution up to its $k$-th statistical moment. They are a crucial resource for randomized quantum protocols. However, their implementation on encoded logical qubits is nontrivial due to the need for magic gates, which can require a large resource overhead. In this work, we propose an efficient approach to generate unitary designs for encoded qubits in surface codes by applying local unitary rotations ("coherent errors") on the physical qubits followed by syndrome measurement and error correction. We prove that under some conditions on the coherent errors (notably including all single-qubit unitaries) and on the error correcting code, this process induces a unitary transformation of the logical subspace. We numerically show that the ensemble of logical unitaries (indexed by the random syndrome outcomes) converges to a unitary design in the thermodynamic limit, provided the density or strength of coherent errors is above a finite threshold. This "unitary design" phase transition coincides with the code's coherent error threshold under optimal decoding. Furthermore, we propose a classical algorithm to simulate the protocol based on a "staircase" implementation of the surface code encoder and decoder circuits. This enables a mapping to a 1+1D monitored circuit, where we observe an entanglement phase transition (and thus a classical complexity phase transition of the decoding algorithm) coinciding with the aforementioned unitary design phase transition. Our results provide a practical way to realize unitary designs on encoded qubits, with applications including quantum state tomography and benchmarking in error correcting codes.

Submitted 5 December, 2024; originally announced December 2024.

Comments: 15+3 pages, 8+2 figures

arXiv:2412.02335 [pdf, other]

An Adaptive Grasping Force Tracking Strategy for Nonlinear and Time-Varying Object Behaviors

Authors: Ziyang Cheng, Xiangyu Tian, Ruomin Sui, Tiemin Li, Yao Jiang

Abstract: Accurate grasp force control is one of the key skills for ensuring successful and damage-free robotic grasping of objects. Although existing methods have conducted in-depth research on slip detection and grasping force planning, they often overlook the issue of adaptive tracking of the actual force to the target force when handling objects with different material properties. The optimal parameters of a force tracking controller are significantly influenced by the object's stiffness, and many adaptive force tracking algorithms rely on stiffness estimation. However, real-world objects often exhibit viscous, plastic, or other more complex nonlinear time-varying behaviors, and existing studies provide insufficient support for these materials in terms of stiffness definition and estimation. To address this, this paper introduces the concept of generalized stiffness, extending the definition of stiffness to nonlinear time-varying grasp system models, and proposes an online generalized stiffness estimator based on Long Short-Term Memory (LSTM) networks. Based on generalized stiffness, this paper proposes an adaptive parameter adjustment strategy using a PI controller as an example, enabling dynamic force tracking for objects with varying characteristics. Experimental results demonstrate that the proposed method achieves high precision and short probing time, while showing better adaptability to non-ideal objects compared to existing methods. The method effectively solves the problem of grasp force tracking in unknown, nonlinear, and time-varying grasp systems, demonstrating the generalization capability of our neural network and enhancing the robotic grasping ability in unstructured environments.

Submitted 25 April, 2025; v1 submitted 3 December, 2024; originally announced December 2024.

arXiv:2412.01165 [pdf, other]

Double-Directional V2V Channel Measurement using ReRoMA at 60 GHz

Authors: Hussein Hammoud, Yuning Zhang, Zihang Cheng, Seun Sangodoyin, Markus Hofer, Faruk Pasic, Thomas M. Pohl, Radek Závorka, Ales Prokes, Thomas Zemen, Christoph F. Mecklenbräuker, Andreas F. Molisch

Abstract: The coordination of vehicles is a crucial element of autonomous driving, as it enhances the efficiency, convenience, and safety of road traffic. In order to fully exploit the capabilities of such coordination, communication with high data rate and low latency is required. It can be reasonably argued that millimeter-wave (mm-wave) vehicle-to-vehicle (V2V) systems are capable of fulfilling the aforementioned requirements. Nevertheless, in order to develop a system that can be deployed in real-world scenarios and to gain an understanding of the various effects of mm-wave propagation, it is necessary to perform radio propagation measurements and to derive radio channel models from them across a range of scenarios and environments. To this end, we have conducted measurement campaigns at 60 GHz in a variety of situations, including driving in a convoy, driving in opposite directions on a six-lane road, and overtaking. These measurements employ a channel sounder based on ReRoMA, a recently introduced concept that enables the real-time measurement of dynamic double-directional radio channels. The evaluations presented herein encompass key channel parameters, including the path loss (path loss coefficient of approximately 1.9), the root mean square (RMS) delay spread (within a range of 5 ns to 110 ns), the angular spreads (in a range of 0.05 to 0.4), the power distribution among multipath components, and the channel stationarity time (multiple seconds).

Submitted 3 December, 2024; v1 submitted 2 December, 2024; originally announced December 2024.

Comments: 15 pages

arXiv:2411.18064 [pdf, other]

doi 10.1109/IJCNN60899.2024.10651446

Lightweight Gaze Estimation Model Via Fusion Global Information

Authors: Zhang Cheng, Yanxia Wang

Abstract: Deep learning-based appearance gaze estimation methods are gaining popularity due to their high accuracy and fewer constraints from the environment. However, existing high-precision models often rely on deeper networks, leading to problems such as large parameter counts, long training times, and slow convergence. To address this issue, this paper proposes a novel lightweight gaze estimation model, FGI-Net (Fusion Global Information). The model fuses global information into the CNN, effectively compensating for the need of multi-layer convolution and pooling to indirectly capture global information, while reducing the complexity of the model and improving its accuracy and convergence speed. To validate the performance of the model, a large number of experiments are conducted, comparing accuracy with existing classical models and lightweight models, comparing convergence speed with models of different architectures, and conducting ablation experiments. Experimental results show that, compared with GazeCaps, the latest gaze estimation model, FGI-Net achieves a smaller angle error with 87.1% and 79.1% reductions in parameters and FLOPs, respectively (3.74° on MPIIFaceGaze, 5.15° on EyeDiap, 10.50° on Gaze360, and 6.02° on RT-Gene). Moreover, compared with models of different architectures such as CNN and Transformer, FGI-Net converges quickly to a higher accuracy range with fewer training iterations. When achieving optimal accuracy on the Gaze360 and EyeDiap datasets, FGI-Net requires 25% and 37.5% fewer training iterations than GazeTR, respectively.

Submitted 27 November, 2024; originally announced November 2024.

arXiv:2411.18061 [pdf, other]

Multi-task Gaze Estimation Via Unidirectional Convolution

Authors: Zhang Cheng, Yanxia Wang

Abstract: Using lightweight models as backbone networks in gaze estimation tasks often results in significant performance degradation. The main reason is that the number of feature channels in lightweight networks is usually small, which limits the model's expressive ability. In order to improve the performance of lightweight models in gaze estimation tasks, a network model named Multitask-Gaze is proposed. The main components of Multitask-Gaze include Unidirectional Convolution (UC), Spatial and Channel Attention (SCA), Global Convolution Module (GCM), and Multi-task Regression Module (MRM). UC not only significantly reduces the number of parameters and FLOPs, but also extends the receptive field and improves the long-distance modeling capability of the model, thereby improving the model performance. SCA highlights gaze-related features and suppresses gaze-irrelevant features. The GCM replaces the pooling layer and avoids the performance degradation due to information loss. MRM improves the accuracy of individual tasks and strengthens the connections between tasks for overall performance improvement. The experimental results show that, compared with the state-of-the-art method SUGE, the performance of Multitask-Gaze on the MPIIFaceGaze and Gaze360 datasets is improved by 1.71% and 2.75%, respectively, while the number of parameters and FLOPs are significantly reduced by 75.5% and 86.88%.

Submitted 8 December, 2024; v1 submitted 27 November, 2024; originally announced November 2024.

arXiv:2411.17697 [pdf, other]

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Authors: Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu

Abstract: Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, and the face embeddings are further refined by interacting with the image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.

Submitted 27 November, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

arXiv:2411.17474 [pdf, other]

Probing the Mid-level Vision Capabilities of Self-Supervised Learning

Authors: Xuweiyi Chen, Markus Marks, Zezhou Cheng

Abstract: Mid-level vision capabilities - such as generic object localization and 3D geometric understanding - are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.

Submitted 16 December, 2024; v1 submitted 25 November, 2024; originally announced November 2024.

Comments: Project Page: https://midvision-probe.cs.virginia.edu/

arXiv:2411.17467 [pdf, ps, other]

Learning 3D Representations from Procedural 3D Programs

Authors: Xuweiyi Chen, Zezhou Cheng

Abstract: Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations. Remarkably, despite lacking semantic content, the 3D representations learned from the procedurally generated 3D shapes perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. We provide a detailed analysis on factors that make a good 3D procedural program. Extensive experiments further suggest that current self-supervised learning methods on point clouds do not rely on the semantics of 3D shapes, shedding light on the nature of 3D representations learned.

Submitted 4 June, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

Comments: SynData4CV @ CVPR2025 | Project Page: https://point-mae-zero.cs.virginia.edu/

arXiv:2411.17091 [pdf, other]

LESS: Efficient Log Storage System Based on Learned Model and Minimum Attribute Tree

Authors: Zhiyang Cheng, Zizhen Zhu, Haoran Dang, Hai Wan, Xibin Zhao

Abstract: In recent years, cyber attacks have become increasingly sophisticated and persistent. Detection and investigation based on the provenance graph can effectively mitigate cyber intrusion. However, in the long time span of defenses, the sheer size of the provenance graph will pose significant challenges to the storage systems. Faced with long-term storage tasks, existing methods are unable to simultaneously achieve lossless information, efficient compression, and fast query support. In this paper, we propose a novel provenance graph storage system, LESS, which consumes smaller storage space and supports faster storage and queries compared to current approaches. We innovatively partition the provenance graph into two distinct components, the graph structure and attribute, and store them separately. Based on their respective characteristics, we devise two appropriate storage schemes: the provenance graph structure storage method based on machine learning and the use of the minimal spanning tree to store the graph attributes. Compared with the state-of-the-art approach, LEONARD, LESS reduces storage time by 6.29 times, while also achieving a 5.24 times reduction in disk usage and an 18.3 times faster query speed, using only 11.5% of the memory on the DARPA TC dataset.

Submitted 25 November, 2024; originally announced November 2024.

arXiv:2411.16833 [pdf, other]

Open Vocabulary Monocular 3D Object Detection

Authors: Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng

Abstract: In this work, we pioneer the study of open-vocabulary monocular 3D object detection, a novel task that aims to detect and localize objects in 3D space from a single RGB image without limiting detection to a predefined set of categories. We formalize this problem, establish baseline methods, and introduce a class-agnostic approach that leverages open-vocabulary 2D detectors and lifts 2D bounding boxes into 3D space. Our approach decouples the recognition and localization of objects in 2D from the task of estimating 3D bounding boxes, enabling generalization across unseen categories. Additionally, we propose a target-aware evaluation protocol to address inconsistencies in existing datasets, improving the reliability of model performance assessment. Extensive experiments on the Omni3D dataset demonstrate the effectiveness of the proposed method in zero-shot 3D detection for novel object categories, validating its robust generalization capabilities. Our method and evaluation protocols contribute towards the development of open-vocabulary object detection models that can effectively operate in real-world, category-diverse environments.

Submitted 25 November, 2024; originally announced November 2024.

Comments: Project page: https://cvlab.cs.virginia.edu/ovmono3d

arXiv:2411.15614 [pdf, ps, other]

Constructing topological biquandles via skew braces

Authors: Zhiyun Cheng

Abstract: In this short note, we construct some nontrivial examples of topological biquandles. The key ingredient of the construction is the notion of skew brace.

Submitted 23 November, 2024; originally announced November 2024.

Comments: 9 pages, no figures

MSC Class: 57K12; 16T25

arXiv:2411.15333 [pdf]

doi 10.1038/s41567-024-02770-z

Unconventional gapping behavior in a kagome superconductor

Authors: Md Shafayat Hossain, Qi Zhang, Eun Sang Choi, Danilo Ratkovski, Bernhard Lüscher, Yongkai Li, Yu-Xiao Jiang, Maksim Litskevich, Zi-Jia Cheng, Jia-Xin Yin, Tyler A. Cochran, Brian Casas, Byunghoon Kim, Xian Yang, Jinjin Liu, Yugui Yao, Ali Bangura, Zhiwei Wang, Mark H. Fischer, Titus Neupert, Luis Balicas, M. Zahid Hasan

Abstract: Determining the types of superconducting order in quantum materials is a challenge, especially when multiple degrees of freedom, such as bands or orbitals, contribute to the fermiology and when superconductivity competes, intertwines, or coexists with other symmetry-breaking orders. Here, we study the Kagome-lattice superconductor CsV3Sb5, in which multiband superconductivity coexists with a charge order that substantially reduces the compound's space group symmetries. Through a combination of thermodynamic as well as electrical and thermal transport measurements, we uncover two superconducting regimes with distinct transport and thermodynamic characteristics, while finding no evidence for a phase transition separating them. Thermodynamic measurements reveal substantial quasiparticle weight in a high-temperature regime. At lower temperatures, this weight is removed via the formation of a second gap. The two regimes are sharply distinguished by a pronounced enhancement of the upper critical field at low temperatures and by a switch in the anisotropy of the longitudinal thermal conductivity as a function of in-plane magnetic field orientation. We argue that the band with a gap opening at lower temperatures continues to host low-energy quasiparticles, possibly due to a nodal structure of the gap. Taken together, our results present evidence for band-selective superconductivity with remarkable decoupling of the (two) superconducting gaps.
The commonly employed multiband scenario, whereby superconductivity emerges in a primary band and is then induced in other bands appears to fail in this unconventional kagome superconductor. Instead, band-selective superconducting pairing is a paradigm that seems to unify seemingly contradicting results in this intensely studied family of materials and beyond.

Submitted 22 November, 2024; originally announced November 2024.

Comments: Nature Physics (2024); in press

Journal ref: Nature Physics 21, 556 (2025)

arXiv:2411.14355 [pdf, other]

Measurement of two-neutrino double electron capture half-life of $^{124}$Xe with PandaX-4T

Authors: PandaX Collaboration, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingjie Fan, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou, Yu Hou, Xiangdong Ji , et al. (77 additional authors not shown)

Abstract: Detailed studies of two-neutrino double electron capture (2$ν$DEC) are a crucial step towards searching for the neutrino-less mode to explore the Majorana nature of neutrinos. We have measured precisely the half-life of the 2$ν$DEC process in $^{124}$Xe, utilizing a total exposure of 1.73 tonne$\cdot$year from the commissioning run and the first science run of the PandaX-4T experiment. A time-dependent background model in the $\mathcal{O}$(10 keV) energy is constructed for the first time in PandaX-4T data. With an unbinned maximum likelihood fit, we determine the half-life of the 2$ν$DEC process to be $(1.03\pm0.15_{\rm stat}\pm0.08_{\rm sys})\times 10^{22}$$\,$yr. Furthermore, we have evaluated the branching ratio for both electrons captured from the $K$ shell ($KK$) to be $(65\pm5)\%$, which aligns with the $^{124}$Xe nuclear model calculations within 1.8$\,$$σ$.

Submitted 16 May, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

Comments: 19 pages, 5 figures, 4 tables; version3 accepted by JHEP

arXiv:2411.13057 [pdf, other]

Branches, Assemble! Multi-Branch Cooperation Network for Large-Scale Click-Through Rate Prediction at Taobao

Authors: Xu Chen, Zida Cheng, Yuangang Pan, Shuai Xiao, Xiaoming Liu, Jinsong Lan, Qingwen Liu, Ivor W. Tsang

Abstract: Existing click-through rate (CTR) prediction works have studied the role of feature interaction through a variety of techniques. Each interaction technique exhibits its own strength, and solely using one type could constrain the model's capability to capture the complex feature relationships, especially for industrial large-scale data with enormous users and items. Recent research shows that effective CTR models often combine an MLP network with a dedicated feature interaction network in a two-parallel structure. However, the interplay and cooperative dynamics between different streams or branches remain under-researched. In this work, we introduce a novel Multi-Branch Cooperation Network (MBCnet) which enables multiple branch networks to collaborate with each other for better complex feature interaction modeling. Specifically, MBCnet consists of three branches: the Expert-based Feature Grouping and Crossing (EFGC) branch that promotes the model's memorization ability of specific feature fields, the low rank Cross Net branch and Deep branch to enhance both explicit and implicit feature crossing for improved generalization. Among branches, a novel cooperation scheme is proposed based on two principles: branch co-teaching and moderate differentiation. Branch co-teaching encourages well-learned branches to support poorly-learned ones on specific training samples. Moderate differentiation advocates branches to maintain a reasonable level of difference in their feature representations.
The cooperation strategy improves learning through mutual knowledge sharing via co-teaching and boosts the discovery of diverse feature interactions across branches. Extensive experiments on large-scale industrial datasets and online A/B test demonstrate MBCnet's superior performance, delivering a 0.09 point increase in CTR, 1.49% growth in deals, and 1.62% rise in GMV. Core codes will be released soon.

Submitted 20 November, 2024; originally announced November 2024.

Comments: 10 pages

arXiv:2411.12170 [pdf]

doi 10.15302/frontphys.2025.024207

Layered semiconducting electrides in p-block metal oxides

Authors: Jiaqi Dai, Feng Yang, Cong Wang, Fei Pang, Zhihai Cheng, Wei Ji

Abstract: In conventional electrides, excess electrons are localized in crystal voids to serve as anions. Most of these electrides are metallic and the metal cations are primarily from the s-block, d-block, or rare-earth elements. Here, we report a class of p-block metal-based electrides found in bilayer SnO and PbO, which are semiconducting and feature electride states in both the valence band (VB) and conduction band (CB), referred to as 2D "bipolar" electrides. These bilayers are hybrid electrides where excess electrons are localized in the interlayer region and hybridize with the orbitals of Sn atoms in the VB, exhibiting strong covalent-like interactions with neighboring metal atoms. Compared to previously studied hybrid electrides, the higher electronegativity of Sn and Pb enhances these covalent-like interactions, leading to a largely enhanced semiconducting bandgap of up to 2.5 eV. Moreover, the CBM primarily arises from the overlap between metal states and interstitial charges, denoting a potential electride and forming a free-electron-like (FEL) state with small effective mass. This state offers high carrier mobilities for both electrons and holes in bilayer SnO, suggesting its potential as a promising p-type semiconductor material.

Submitted 18 November, 2024; originally announced November 2024.

arXiv:2411.08147 [pdf, other]

Large Language Models Can Self-Improve in Long-context Reasoning

Authors: Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, Wai Lam

Abstract: Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose SEALONG, an approach specifically designed for this purpose. This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of SEALONG, with an absolute improvement of $4.2$ points for Llama-3.1-8B-Instruct. Furthermore, SEALONG achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.

Submitted 12 November, 2024; originally announced November 2024.

Comments: Project Page: https://github.com/SihengLi99/SEALONG

arXiv:2411.05322 [pdf, other]

Rate-aware Compression for NeRF-based Volumetric Video

Authors: Zhiyu Zhang, Guo Lu, Huanxiong Liang, Zhengxue Cheng, Anni Tang, Li Song

Abstract: The neural radiance fields (NeRF) have advanced the development of 3D volumetric video technology, but the large data volumes they involve pose significant challenges for storage and transmission. To address these problems, the existing solutions typically compress these NeRF representations after the training stage, leading to a separation between representation training and compression. In this paper, we try to directly learn a compact NeRF representation for volumetric video in the training stage based on the proposed rate-aware compression framework. Specifically, for volumetric video, we use a simple yet effective modeling strategy to reduce temporal redundancy for the NeRF representation. Then, during the training phase, an implicit entropy model is utilized to estimate the bitrate of the NeRF representation. This entropy model is then encoded into the bitstream to assist in the decoding of the NeRF representation. This approach enables precise bitrate estimation, thereby leading to a compact NeRF representation. Furthermore, we propose an adaptive quantization strategy and learn the optimal quantization step for the NeRF representations. Finally, the NeRF representation can be optimized by using the rate-distortion trade-off. Our proposed compression framework can be used for different representations and experimental results demonstrate that our approach significantly reduces the storage size with marginal distortion and achieves state-of-the-art rate-distortion performance for volumetric video on the HumanRF and ReRF datasets.
Compared to the previous state-of-the-art method TeTriRF, we achieved an approximately -80% BD-rate on the HumanRF dataset and -60% BD-rate on the ReRF dataset.

Submitted 7 November, 2024; originally announced November 2024.

Comments: Accepted by ACM MM 2024 (Oral)

arXiv:2411.03887 [pdf, ps, other]

Reclaiming "Open AI" -- AI Model Serving Can Be Open Access, Yet Monetizable and Loyal

Authors: Zerui Cheng, Edoardo Contente, Ben Finch, Oleg Golev, Jonathan Hayase, Andrew Miller, Niusha Moshrefi, Anshul Nasery, Sandeep Nailwal, Sewoong Oh, Himanshu Tyagi, Pramod Viswanath

Abstract: The rapid rise of AI has split model serving between open-weight distribution, which often lacks owner control and monetization, and opaque API-based approaches that risk user privacy and model transparency, forming a dichotomy that hinders an equitable AI ecosystem. This position paper introduces, rigorously formulates, and champions the Open-access, Monetizable, and Loyal (OML) paradigm for AI model serving: a foundational shift to securely distribute and serve AI models by synthesizing transparency with granular monetization and critical safety controls. We survey diverse OML constructions from theory and practice, analyze their security, performance, and practical trade-offs, outline a conceptual OML deployment protocol, and discuss market and policy implications. We assert that OML can foster a democratized, self-sustaining, and innovative AI landscape, mitigating centralized power risks. Finally, we call on the research community to further explore the broad design space of OML, spanning cryptographic, AI-native, and socio-economic mechanisms, to realize its full potential for a collaborative, accountable, and resilient AI future.

Submitted 3 June, 2025; v1 submitted 1 November, 2024; originally announced November 2024.

Comments: 54 pages

arXiv:2410.23872 [pdf, other]

Pressure-dependent magnetotransport measurement in Kagome metal Yb$_{0.5}$Co$_3$Ge$_3$

Authors: Zhiyuan Cheng, Yaojia Wang, Heng Wu, Mazhar N. Ali, Julia Y. Chan, Semonti Bhattacharyya

Abstract: Kagome materials are known to be an ideal platform that hosts a plethora of interesting phases such as topological states, electronic correlation, and magnetism, owing to their unique band structure and geometry. We report magnetotransport measurement in Kagome metal Yb$_{0.5}$Co$_3$Ge$_3$ as a function of pressure. Below $\sim25$ K the temperature dependence of resistance shows an upturn that is accompanied by a strong negative magnetoresistance, which could be attributed to the Kondo effect. Upon pressurization above 1 GPa the resistance shows a reduction as a function of temperature below $4$ K, along with a further enhanced negative magnetoresistance. This might indicate an onset of a pressure-induced Kondo coherence effect.

Submitted 31 October, 2024; originally announced October 2024.

arXiv:2410.23170 [pdf, other]

Functional Gradient Flows for Constrained Sampling

Authors: Shiyue Zhang, Longlin Yu, Ziheng Cheng, Cheng Zhang

Abstract: Recently, through a unified gradient flow perspective of Markov chain Monte Carlo (MCMC) and variational inference (VI), particle-based variational inference methods (ParVIs) have been proposed that tend to combine the best of both worlds. While typical ParVIs such as Stein Variational Gradient Descent (SVGD) approximate the gradient flow within a reproducing kernel Hilbert space (RKHS), many attempts have been made recently to replace RKHS with more expressive function spaces, such as neural networks. While successful, these methods are mainly designed for sampling from unconstrained domains. In this paper, we offer a general solution to constrained sampling by introducing a boundary condition for the gradient flow which would confine the particles within the specific domain. This allows us to propose a new functional gradient ParVI method for constrained sampling, called constrained functional gradient flow (CFG), with provable continuous-time convergence in total variation (TV). We also present novel numerical strategies to handle the boundary integral term arising from the domain constraints. Our theory and experiments demonstrate the effectiveness of the proposed framework.

Submitted 30 October, 2024; originally announced October 2024.

Comments: NeurIPS 2024 camera-ready (30 pages, 26 figures)

arXiv:2410.22823 [pdf]

Coexistence of superconductivity and sliding polar metal state in HgPSe3

Authors: Xiaohui Yu, Wei Zhong, Saori Kawaguchi, Hirokazu Kadobayashi, Xiaolin Wang, Zhenxiang Cheng, Changfeng Chen, Binbin Yue, Jian-Tao Wang, Ho-Kwang Mao, Fang Hong

Abstract: The simultaneous presence of polarity and metallicity in a material signifies an exotic polar metal state, but such materials are extremely rare, especially in bulk form, due to the mutually exclusive nature of the fundamental defining properties. Here, we report experimental findings that HgPSe3 is a robust bulk polar metal at room temperature with a chiral structure stabilized by pressure and, remarkably, this polar metal hosts superconductivity with critical temperature Tc up to 11 K. Theoretical analysis reveals a two-step interlayer sliding-then-compressing mechanism for coexistence of polarity and metallicity in HgPSe3. This work unveils a new paradigm for creating the bulk polar metal state and simultaneous presence of coexisting quantum orders, raising the prospect of discovering novel emergent physics using pressure as a tuning knob.

Submitted 30 October, 2024; originally announced October 2024.

Comments: 19 pages, 4 main figures + 6 extended figures

arXiv:2410.22211 [pdf, other]

ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

Authors: Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, Teruko Mitamura

Abstract: Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recording of procedural activities coupled with their corresponding instruction. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.

Submitted 29 October, 2024; originally announced October 2024.

Comments: 18 pages, 11 figures

arXiv:2410.19636 [pdf]

Pomeranchuk instability of a topological crystal

Authors: Md Shafayat Hossain, Zahir Muhammad, Rajibul Islam, Zi-Jia Cheng, Yu-Xiao Jiang, Maksim Litskevich, Tyler A. Cochran, Xian P. Yang, Byunghoon Kim, Fei Xue, Ilias E. Perakis, Weisheng Zhao, Mehdi Kargarian, Luis Balicas, Titus Neupert, M. Zahid Hasan

Abstract: Nematic quantum fluids appear in strongly interacting systems and break the rotational symmetry of the crystallographic lattice. In metals, this is connected to a well-known instability of the Fermi liquid, the Pomeranchuk instability. Using scanning tunneling microscopy, we identified this instability in a highly unusual setting: on the surface of an elemental topological metal, arsenic. By directly visualizing the Fermi surface of the surface state via scanning tunneling spectroscopy and photoemission spectroscopy, we find that the Fermi surface gets deformed and becomes elliptical at the energies where the nematic state is present. Known instances of nematic instability typically need van-Hove singularities or multi-orbital physics as drivers. In contrast, the surface states of arsenic are essentially indistinguishable from well-confined isotropic Rashba bands near the Fermi level, rendering our finding the first realization of Pomeranchuk instability of the topological surface state.

Submitted 25 October, 2024; originally announced October 2024.

arXiv:2410.19394 [pdf]

Analysis of Financial Risk Behavior Prediction Using Deep Learning and Big Data Algorithms

Authors: Haowei Yang, Zhan Cheng, Zhaoyang Zhang, Yuanshuai Luo, Shuaishuai Huang, Ao Xiang

Abstract: As the complexity and dynamism of financial markets continue to grow, traditional financial risk prediction methods increasingly struggle to handle large datasets and intricate behavior patterns. This paper explores the feasibility and effectiveness of using deep learning and big data algorithms for financial risk behavior prediction. First, the application and advantages of deep learning and big data algorithms in the financial field are analyzed. Then, a deep learning-based big data risk prediction framework is designed and experimentally validated on actual financial datasets. The experimental results show that this method significantly improves the accuracy of financial risk behavior prediction and provides valuable support for risk management in financial institutions. Challenges in the application of deep learning are also discussed, along with potential directions for future research.

Submitted 22 December, 2024; v1 submitted 25 October, 2024; originally announced October 2024.

arXiv:2410.17935 [pdf, other]

Semi-Implicit Functional Gradient Flow for Efficient Sampling

Authors: Shiyue Zhang, Ziheng Cheng, Cheng Zhang

Abstract: Particle-based variational inference methods (ParVIs) use nonparametric variational families represented by particles to approximate the target distribution according to the kernelized Wasserstein gradient flow for the Kullback-Leibler (KL) divergence. Although functional gradient flows have been introduced to expand the kernel space for better flexibility, the deterministic updating mechanism may limit exploration and require expensive repetitive runs for new samples. In this paper, we propose Semi-Implicit Functional Gradient flow (SIFG), a functional gradient ParVI method that uses perturbed particles with Gaussian noise as the approximation family. We show that the corresponding functional gradient flow, which can be estimated via denoising score matching with neural networks, exhibits strong theoretical convergence guarantees due to a higher-order smoothness brought to the approximation family via Gaussian perturbation. In addition, we present an adaptive version of our method that automatically selects the appropriate noise magnitude during sampling, striking a good balance between exploration efficiency and approximation accuracy. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness and efficiency of the proposed framework.

Submitted 21 March, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

Comments: 46 pages, 13 figures

arXiv:2410.17243 [pdf, other]

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Authors: Zesen Cheng, Hang Zhang, Kehan Li, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

Abstract: Contrastive loss is a powerful approach for representation learning, where larger batch sizes enhance performance by providing more negative samples to better distinguish between similar and dissimilar data. However, scaling batch sizes is constrained by the quadratic growth in GPU memory consumption, primarily due to the full instantiation of the similarity matrix. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into arbitrarily small blocks, avoiding full materialization of the similarity matrix. Furthermore, we introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems, employing ring-based communication at the GPU level to optimize synchronization and fused kernels at the CUDA core level to reduce I/O overhead. Experimental results show that the proposed method scales batch sizes to unprecedented levels. For instance, it enables contrastive training of a CLIP-ViT-L/14 model with a batch size of 4M or 12M using 8 or 32 A800 80GB without sacrificing any accuracy. Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed. The code will be made publicly available.

Submitted 22 October, 2024; originally announced October 2024.

arXiv:2410.17193 [pdf, other]

Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios

Authors: Kai Wang, Zekai Li, Zhi-Qi Cheng, Samir Khaki, Ahmad Sajedi, Ramakrishna Vedantam, Konstantinos N Plataniotis, Alexander Hauptmann, Yang You

Abstract: Dataset distillation has demonstrated strong performance on simple datasets like CIFAR, MNIST, and TinyImageNet but struggles to achieve similar results in more complex scenarios. In this paper, we propose EDF (emphasizes the discriminative features), a dataset distillation method that enhances key discriminative regions in synthetic images using Grad-CAM activation maps. Our approach is inspired by a key observation: in simple datasets, high-activation areas typically occupy most of the image, whereas in complex scenarios, the size of these areas is much smaller. Unlike previous methods that treat all pixels equally when synthesizing images, EDF uses Grad-CAM activation maps to enhance high-activation areas. From a supervision perspective, we downplay supervision signals that have lower losses, as they contain common patterns. Additionally, to help the DD community better explore complex scenarios, we build the Complex Dataset Distillation (Comp-DD) benchmark by meticulously selecting sixteen subsets, eight easy and eight hard, from ImageNet-1K. In particular, EDF consistently outperforms SOTA results in complex scenarios, such as ImageNet-1K subsets. Hopefully, more researchers will be inspired and encouraged to improve the practicality and efficacy of DD. Our code and benchmark will be made public at https://github.com/NUS-HPC-AI-Lab/EDF.

Submitted 31 March, 2025; v1 submitted 22 October, 2024; originally announced October 2024.

Comments: 24 pages, 13 figures

arXiv:2410.15392 [pdf, other]

EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting

Authors: Bohao Liao, Wei Zhai, Zengyu Wan, Zhixin Cheng, Wenfei Yang, Tianzhu Zhang, Yang Cao, Zheng-Jun Zha

Abstract: Scene reconstruction from casually captured videos has wide applications in real-world scenarios. With recent advancements in differentiable rendering techniques, several methods have attempted to simultaneously optimize scene representations (NeRF or 3DGS) and camera poses. Despite recent progress, existing methods relying on traditional camera input tend to fail in high-speed (or equivalently low-frame-rate) scenarios. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event camera to aid scene construction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, supervising the rendered views observed by the event stream. Second, we adopt the Contrast Maximization (CMax) framework in a piece-wise manner to extract motion information by maximizing the contrast of the Image of Warped Events (IWE), thereby calibrating the estimated poses. Besides, based on the Linear Event Generation Model (LEGM), the brightness information encoded in the IWE is also utilized to constrain the 3DGS in the gradient domain. Third, to mitigate the absence of color information of events, we introduce photometric bundle adjustment (PBA) to ensure view consistency across events and frames. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our project page is https://lbh666.github.io/ef-3dgs/.

Submitted 23 March, 2025; v1 submitted 20 October, 2024; originally announced October 2024.

Comments: Project Page: https://lbh666.github.io/ef-3dgs/

arXiv:2410.15389 [pdf, other]

doi 10.1126/sciadv.adr9527

Observation of quantum superposition of topological defects in a trapped ion quantum simulator

Authors: Zhijie Cheng, Yukai Wu, Shijiao Li, Quanxin Mei, Bowen Li, Gangxi Wang, Yue Jiang, Binxiang Qi, Zichao Zhou, Panyu Hou, Luming Duan

Abstract: Topological defects are discontinuities of a system protected by global properties, with wide applications in mathematics and physics. While previous experimental studies mostly focused on their classical properties, it has been predicted that topological defects can exhibit quantum superposition. Despite the fundamental interest and potential applications in understanding symmetry-breaking dynamics of quantum phase transitions, its experimental realization still remains a challenge. Here, we report the observation of quantum superposition of topological defects in a trapped-ion quantum simulator. By engineering long-range spin-spin interactions, we observe a spin kink splitting into a superposition of kinks at different positions, creating a "Schrödinger kink" that manifests non-locality and quantum interference. Furthermore, by preparing superposition states of neighboring kinks with different phases, we observe the propagation of the wave packet in different directions, thus unambiguously verifying the quantum coherence in the superposition states. Our work provides useful tools for non-equilibrium dynamics in quantum Kibble-Zurek physics.

Submitted 20 October, 2024; originally announced October 2024.

Comments: 8 pages, 6 figures, already published in Science Advances

Journal ref: Sci. Adv. 10, eadr9527 (2024)

arXiv:2410.14966 [pdf, other]

Attack as Defense: Run-time Backdoor Implantation for Image Content Protection

Authors: Haichuan Zhang, Meiyu Lin, Zhaoyi Liu, Renyuan Li, Zhiyuan Cheng, Carl Yang, Mingjie Tang

Abstract: As generative models achieve great success, tampering and modifying the sensitive image contents (e.g., human faces, artist signatures, commercial logos, etc.) have induced a significant threat with social impact. The backdoor attack is a method that implants vulnerabilities in a target model, which can be activated through a trigger. In this work, we innovatively prevent the abuse of image content modification by implanting the backdoor into image-editing models. Once the protected sensitive content on an image is modified by an editing model, the backdoor will be triggered, making the editing fail. Unlike traditional backdoor attacks that use data poisoning, to enable protection on individual images and eliminate the need for model training, we developed the first framework for run-time backdoor implantation, which is both time- and resource-efficient. We generate imperceptible perturbations on the images to inject the backdoor and define the protected area as the only backdoor trigger. Editing other unprotected insensitive areas will not trigger the backdoor, which minimizes the negative impact on legal image modifications. Evaluations with state-of-the-art image editing models show that our protective method can increase the CLIP-FID of generated images from 12.72 to 39.91, or reduce the SSIM from 0.503 to 0.167 when subjected to malicious editing. At the same time, our method exhibits minimal impact on benign editing, which demonstrates the efficacy of our proposed framework. The proposed run-time backdoor can also achieve effective protection on the latest diffusion models. Code is available.

Submitted 18 October, 2024; originally announced October 2024.

Comments: 10 pages, 6 figures

arXiv:2410.14894 [pdf, other]

Soft-Label Integration for Robust Toxicity Classification

Authors: Zelei Cheng, Xian Wu, Jiahao Yu, Shuo Han, Xin-Qiang Cai, Xinyu Xing

Abstract: Toxicity classification in textual content remains a significant problem. Data with labels from a single annotator fall short of capturing the diversity of human perspectives. Therefore, there is a growing need to incorporate crowdsourced annotations for training an effective toxicity classifier. Additionally, the standard approach to training a classifier using empirical risk minimization (ERM) may fail to address the potential shifts between the training set and testing set due to exploiting spurious correlations. This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution (OOD) risk. We theoretically prove the convergence of our bi-level optimization algorithm. Experimental results demonstrate that our approach outperforms existing baseline methods in terms of both average and worst-group accuracy, confirming its effectiveness in leveraging crowdsourced annotations to achieve more effective and robust toxicity classification.

Submitted 7 November, 2024; v1 submitted 18 October, 2024; originally announced October 2024.

Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

arXiv:2410.14259 [pdf, other]

doi 10.1145/3696410.3714770

Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement

Authors: Zihao Cheng, Li Zhou, Feng Jiang, Benyou Wang, Haizhou Li

Abstract: The rapid development of large language models (LLMs), like ChatGPT, has resulted in the widespread presence of LLM-generated content on social media platforms, raising concerns about misinformation, data biases, and privacy violations, which can undermine trust in online discourse. While detecting LLM-generated content is crucial for mitigating these risks, current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content. This approach introduces two novel tasks: LLM Role Recognition (LLM-RR), a multi-class classification task that identifies specific roles of LLM in content generation, and LLM Influence Measurement (LLM-IM), a regression task that quantifies the extent of LLM involvement in content creation. To support these tasks, we propose LLMDetect, a benchmark designed to evaluate detectors' performance on these new tasks. LLMDetect includes the Hybrid News Detection Corpus (HNDC) for training detectors, as well as DetectEval, a comprehensive evaluation suite that considers five distinct cross-context variations and two multi-intensity variations within the same LLM role. This allows for a thorough assessment of detectors' generalization and robustness across diverse contexts. Our empirical validation of 10 baseline detection methods demonstrates that fine-tuned PLM-based models consistently outperform others on both tasks, while advanced LLMs face challenges in accurately detecting their own generated content. Our experimental results and analysis offer insights for developing more effective detection models for LLM-generated content. This research enhances the understanding of LLM-generated content and establishes a foundation for more nuanced detection methodologies.

Submitted 6 February, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

Comments: Social Media, Large Language Models, LLM-generated Text Detection, AI-assisted News Detection; Accepted by WWW2025

Journal ref: Proceedings of the ACM Web Conference 2025 (WWW '25), April 28-May 2, 2025, Sydney, NSW, Australia

arXiv:2410.12787 [pdf, other]

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Authors: Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing

Abstract: Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.

Submitted 16 October, 2024; originally announced October 2024.

Comments: Project Page: cmm-damovl.site

arXiv:2410.10366 [pdf, other]

Affinity-Graph-Guided Contractive Learning for Pretext-Free Medical Image Segmentation with Minimal Annotation

Authors: Zehua Cheng, Di Yuan, Thomas Lukasiewicz

Abstract: The combination of semi-supervised learning (SemiSL) and contrastive learning (CL) has been successful in medical image segmentation with limited annotations. However, these works often rely on pretext tasks that lack the specificity required for pixel-level segmentation, and still face overfitting issues due to insufficient supervision signals resulting from too few annotations. Therefore, this paper proposes an affinity-graph-guided semi-supervised contrastive learning framework (Semi-AGCL) by establishing additional affinity-graph-based supervision signals between the student and teacher network, to achieve medical image segmentation with minimal annotations without pretext. The framework first designs an average-patch-entropy-driven inter-patch sampling method, which can provide a robust initial feature space without relying on pretext tasks. Furthermore, the framework designs an affinity-graph-guided loss function, which can improve the quality of the learned representation and the model generalization ability by exploiting the inherent structure of the data, thus mitigating overfitting. Our experiments indicate that with merely 10% of the complete annotation set, our model approaches the accuracy of the fully annotated baseline, manifesting a marginal deviation of only 2.52%. Under the stringent conditions where only 5% of the annotations are employed, our model exhibits a significant enhancement in performance surpassing the second best baseline by 23.09% on the dice metric and achieving an improvement of 26.57% on the notably arduous CRAG and ACDC datasets.

Submitted 14 October, 2024; originally announced October 2024.

Comments: BIBM 2024

arXiv:2410.09583 [pdf, other]

POPoS: Improving Efficient and Robust Facial Landmark Detection with Parallel Optimal Position Search

Authors: Chong-Yang Xiang, Jun-Yan He, Zhi-Qi Cheng, Xiao Wu, Xian-Sheng Hua

Abstract: Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces Parallel Optimal Position Search (POPoS), a high-precision encoding-decoding framework designed to address the limitations of traditional FLD methods. POPoS employs three key contributions: (1) Pseudo-range multilateration is utilized to correct heatmap errors, improving landmark localization accuracy. By integrating multiple anchor points, it reduces the impact of individual heatmap inaccuracies, leading to robust overall positioning. (2) To enhance the pseudo-range accuracy of selected anchor points, a new loss function, named multilateration anchor loss, is proposed. This loss function enhances the accuracy of the distance map, mitigates the risk of local optima, and ensures optimal solutions. (3) A single-step parallel computation algorithm is introduced, boosting computational efficiency and reducing processing time. Extensive evaluations across five benchmark datasets demonstrate that POPoS consistently outperforms existing methods, particularly excelling in low-resolution heatmap scenarios with minimal computational overhead. These advantages make POPoS a highly efficient and accurate tool for FLD, with broad applicability in real-world scenarios.

Submitted 20 December, 2024; v1 submitted 12 October, 2024; originally announced October 2024.

Comments: Accepted to AAAI 2025, 9 pages, 6 figures. Code: https://github.com/teslatasy/POPoS

arXiv:2410.08565 [pdf, other]

Baichuan-Omni Technical Report

Authors: Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu , et al. (2 additional authors not shown)

Abstract: The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with a 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.

Submitted 27 December, 2024; v1 submitted 11 October, 2024; originally announced October 2024.

arXiv:2410.08454 [pdf, other]

doi 10.1109/ACCESS.2025.3547759

HorGait: A Hybrid Model for Accurate Gait Recognition in LiDAR Point Cloud Planar Projections

Authors: Jiaxing Hao, Yanxi Wang, Zhigang Chang, Hongmin Gao, Zihao Cheng, Chen Wu, Xin Zhao, Peiye Fang, Rachmat Muwardi

Abstract: Gait recognition is a remote biometric technology that utilizes the dynamic characteristics of human movement to identify individuals even under various extreme lighting conditions. Due to the limitation in spatial perception capability inherent in 2D gait representations, LiDAR can directly capture 3D gait features and represent them as point clouds, reducing environmental and lighting interference in recognition while significantly advancing privacy protection. For complex 3D representations, shallow networks fail to achieve accurate recognition, making vision Transformers the most prevalent method. However, the prevalence of dumb patches has limited the widespread use of Transformer architecture in gait recognition. This paper proposes a method named HorGait, which utilizes a hybrid model with a Transformer architecture for gait recognition on the planar projection of 3D point clouds from LiDAR. Specifically, it employs a hybrid model structure called LHM Block to achieve input adaptation, long-range, and high-order spatial interaction of the Transformer architecture. Additionally, it uses large convolutional kernel CNNs to segment the input representation, replacing attention windows to reduce dumb patches. We conducted extensive experiments, and the results show that HorGait achieves state-of-the-art performance among Transformer architecture methods on the SUSTech1K dataset, verifying that the hybrid model can complete the full Transformer process and perform better in point cloud planar projection. The outstanding performance of HorGait offers new insights for the future application of the Transformer architecture in gait recognition.

Submitted 23 October, 2024; v1 submitted 10 October, 2024; originally announced October 2024.

arXiv:2410.07946 [pdf]

Field-free spin-orbit switching of canted magnetization in Pt/Co/Ru/RuO2(101) multilayers

Authors: Yunzhuo Wu, Tong Wu, Haoran Chen, Yongwei Cui, Hongyue Xu, Nan Jiang, Zhen Cheng, Yizheng Wu

Abstract: Enabling field-free current-induced switching of perpendicular magnetization is essential for advancing spin-orbit-torque magnetic random access memory technology. Our research on the Pt/Co/Ru/RuO2(101) system has successfully demonstrated field-free switching through current injection along the RuO2[010] axis. We discovered that the system exhibits a tilted easy axis, inclined from the out-of-plane towards the RuO2[-101] direction. The application of current perpendicular to this tilted axis generates a substantial out-of-plane effective field, which facilitates field-free magnetization switching. Our results also indicate that adjusting the thickness of the Ru layer to optimize the tilt angle can significantly reduce the critical switching current density. This work provides a viable strategy for controlling the tilting magnetization, essential for the development of RuO2-based magnetic devices.

Submitted 10 October, 2024; originally announced October 2024.

arXiv:2410.04109 [pdf]

Radiative cooling capacity on Earth

Authors: Cunhai Wang, Hao Chen, Yanyan Feng, Ziming Cheng, Jingchong Liu, Fuqiang Wang

Abstract: By passively dissipating thermal emission into the ultracold deep space, radiative cooling (RC) is an environment-friendly means for gaining cooling capacity, paving a bright future for global energy saving and carbon dioxide reduction. However, assessing the global RC capacity at the day-to-annual scale remains challenging as the RC capacity significantly depends on geographic and environmental conditions. To our knowledge, no analysis of global RC capacity has been reported. Herein, we show the distribution of RC capacity on Earth by establishing a precise assessment model for evaluating the performance of a radiative cooler. Our assessment is comprehensively validated against experimental data and extended to elucidate the capacity of representative broadband and selective coolers. We also categorize the global RC capacity into five representative regions based on the year-round cooling power. Our assessment can inform trade-offs between design and practical application for the RC systems, alongside promoting RC-based technologies to tackle worldwide energy and environment challenges.

Submitted 5 October, 2024; originally announced October 2024.

Comments: Four figures

arXiv:2410.04040 [pdf, other]

Flatbands from Bound States in the Continuum for Orbital Angular Momentum Localization

Authors: Weiwei Zhu, Hongyu Zou, Yong Ge, Yin Wang, Zheyu Cheng, Bing-bing Wang, Shou-qi Yuan, Hong-xiang Sun, Haoran Xue, Baile Zhang

Abstract: A flatband material is a system characterized by energy bands with zero dispersion, allowing for the compact localization of wavefunctions in real space. This compact localization significantly enhances inter-particle correlations and light-matter interactions, leading to notable advancements such as fractional Chern insulators in condensed matter systems and flat-band lasers in photonics. Previous flatband platforms, including twisted bilayer graphene and artificial kagome/Lieb lattices, typically focused on nondegenerate flatbands, lacking access to the high degeneracy that can facilitate the localization of orbital angular momentum (OAM). Here, we propose a general framework to construct highly degenerate flatbands from bound states in the continuum (BICs)--a concept originating from quantum theory but significantly developed in photonics and acoustics in recent years. The degeneracy of flatbands is determined by the number of BICs within each unit cell in a lattice. We experimentally validate this approach in two-dimensional (2D) and three-dimensional (3D) acoustic crystals, demonstrating flatbands with 4-fold and 12-fold degeneracies, respectively. The high degeneracy provides sufficient internal degrees of freedom, enabling the selective excitation of localized OAM at any position in any direction. Our results pave the way for exploring BIC-constructed flatbands and their localization properties.

Submitted 5 October, 2024; originally announced October 2024.

Comments: 15 pages, 4 figures

arXiv:2410.02758 [pdf, other]

Pseudoentanglement from tensor networks

Authors: Zihan Cheng, Xiaozhou Feng, Matteo Ippoliti

Abstract: Pseudoentangled states are defined by their ability to hide their entanglement structure: they are indistinguishable from random states to any observer with polynomial resources, yet can have much less entanglement than random states. Existing constructions of pseudoentanglement based on phase- and/or subset-states are limited in the entanglement structures they can hide: e.g., the states may have low entanglement on a single cut, on all cuts at once, or on local cuts in one dimension. Here we introduce new constructions of pseudoentangled states based on (pseudo)random tensor networks that afford much more flexibility in the achievable entanglement structures. We illustrate our construction with the simplest example of a matrix product state, realizable as a staircase circuit of pseudorandom unitary gates, which exhibits pseudo-area-law scaling of entanglement in one dimension. We then generalize our construction to arbitrary tensor network structures that admit an isometric realization. A notable application of this result is the construction of pseudoentangled `holographic' states whose entanglement entropy obeys a Ryu-Takayanagi `minimum-cut' formula, answering a question posed in [Aaronson et al., arXiv:2211.00747].

Submitted 16 October, 2024; v1 submitted 3 October, 2024; originally announced October 2024.

Comments: 5+6 pages, 3 figures. v2: fixed typos and minor issues

arXiv:2410.01300 [pdf]

Atmospheric Pressure Ammonia Synthesis on AuRu Catalysts Enabled by Plasmon-Controlled Hydrogenation and Nitrogen-species Desorption

Authors: Lin Yuan, Briley B. Bourgeois, Elijah Begin, Yirui Zhang, Alan X. Dai, Zhihua Cheng, Amy S. McKeown-Green, Zhichen Xue, Yi Cui, Kun Xu, Yu Wang, Matthew R. Jones, Yi Cui, Arun Majumdar, Junwei Lucas Bao, Jennifer A. Dionne

Abstract: Ammonia is a key component of fertilizer and a potential clean fuel and hydrogen carrier. The Haber-Bosch process for ammonia synthesis consumes more than half of industrial hydrogen and contributes up to ~3% of global greenhouse gas emissions. Light-driven reactions via surface plasmon resonances offer a less energy-intensive pathway for ammonia production by altering reaction intermediates. Here, we report gold-ruthenium plasmonic bimetallic alloys for ammonia synthesis at room temperature and pressure, driven by visible light. We use colloidal synthesis to create AuRu$_x$ alloys (x=0.1, 0.2, 0.3) and disperse these nanoparticles on MgO supports for gas-phase ammonia synthesis. We observe a ~60 $μ$mol/g/h reactivity and ~0.12% external quantum efficiency on a AuRu$_{0.2}$ sample under 100 mW/cm$^2$ visible light. In-situ diffuse reflective infrared Fourier transform spectroscopic measurements show that hydrogenation of nitrogen adsorbates is accelerated under light compared to thermocatalysis. Combining wavelength-dependent reactivity and spectroscopic findings with semi-classical electromagnetic modeling, we show plasmonic bimetallic alloys expedite ammonia synthesis by aiding hydrogenation of adsorbed nitrogen species via plasmon-mediated hot electrons. Quantum mechanical calculations reveal hydrogen-assisted N$_2$ splitting in the excited state is key to activating the reaction under ambient conditions. Therefore, light or H$_2$ alone cannot dissociate N$_2$ -- the key bottleneck to breaking N$_2$'s triple bond. Our findings are consistent with recent hypotheses on how nitrogenase enzymes catalyze ammonia production at mild conditions and provide insights for sustainable photochemical transformations.

Submitted 2 October, 2024; originally announced October 2024.

Comments: 21 pages, 4 figures, journal article submission soon

arXiv:2409.19043 [pdf, other]

Parallel Quantum Signal Processing Via Polynomial Factorization

Authors: John M. Martyn, Zane M. Rossi, Kevin Z. Cheng, Yuan Liu, Isaac L. Chuang

Abstract: Quantum signal processing (QSP) is a methodology for constructing polynomial transformations of a linear operator encoded in a unitary. Applied to an encoding of a state $ρ$, QSP enables the evaluation of nonlinear functions of the form $\text{tr}(P(ρ))$ for a polynomial $P(x)$, which encompasses relevant properties like entropies and fidelity. However, QSP is a sequential algorithm: implementing a degree-$d$ polynomial necessitates $d$ queries to the encoding, equating to a query depth $d$. Here, we reduce the depth of these property estimation algorithms by developing Parallel Quantum Signal Processing. Our algorithm parallelizes the computation of $\text{tr} (P(ρ))$ over $k$ systems and reduces the query depth to $d/k$, thus enabling a family of time-space tradeoffs for QSP. This furnishes a property estimation algorithm suitable for distributed quantum computers, and is realized at the expense of increasing the number of measurements by a factor $O( \text{poly}(d) 2^{O(k)} )$. We achieve this result by factorizing $P(x)$ into a product of $k$ smaller polynomials of degree $O(d/k)$, which are each implemented in parallel with QSP, and subsequently multiplied together with a swap test to reconstruct $P(x)$. We characterize the achievable class of polynomials by appealing to the fundamental theorem of algebra, and demonstrate application to canonical problems including entropy estimation and partition function evaluation.

Submitted 27 September, 2024; originally announced September 2024.

Report number: MIT-CTP/5780
