Search | arXiv e-print repository

Towards Flexible 3D Perception: Object-Centric Occupancy Completion Augments 3D Object Detection

Authors: Chaoda Zheng, Feng Wang, Naiyan Wang, Shuguang Cui, Zhen Li

Abstract: While 3D object bounding box (bbox) representation has been widely used in autonomous driving perception, it lacks the ability to capture the precise details of an object's intrinsic geometry. Recently, occupancy has emerged as a promising alternative for 3D scene perception. However, constructing a high-resolution occupancy map remains infeasible for large scenes due to computational constraints. Recognizing that foreground objects only occupy a small portion of the scene, we introduce object-centric occupancy as a supplement to object bboxes. This representation not only provides intricate details for detected objects but also enables higher voxel resolution in practical applications. We advance the development of object-centric occupancy perception from both data and algorithm perspectives. On the data side, we construct the first object-centric occupancy dataset from scratch using an automated pipeline. From the algorithmic standpoint, we introduce a novel object-centric occupancy completion network equipped with an implicit shape decoder that manages dynamic-size occupancy generation. This network accurately predicts the complete object-centric occupancy volume for inaccurate object proposals by leveraging temporal information from long sequences. Our method demonstrates robust performance in completing object shapes under noisy detection and tracking conditions. Additionally, we show that our occupancy features significantly enhance the detection results of state-of-the-art 3D object detectors, especially for incomplete or distant objects in the Waymo Open Dataset.

Submitted 6 December, 2024; originally announced December 2024.

Comments: NeurIPS 2024

arXiv:2412.01253 [pdf, other]

Yi-Lightning Technical Report

Authors: Alan Wake, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Fan Zhou, Feng Hu, Ge Zhang, Guoyin Wang, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qicheng Hu, Shawn Wang, Shijun Zhou, et al. (19 additional authors not shown)

Abstract: This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert segmentation and routing mechanisms coupled with optimized KV-caching techniques. Our development process encompasses comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), where we devise deliberate strategies for multi-stage training, synthetic data construction, and reward modeling. Furthermore, we implement RAISE (Responsible AI Safety Engine), a four-component framework to address safety issues across pre-training, post-training, and serving phases. Empowered by our scalable super-computing infrastructure, all these innovations substantially reduce training, deployment and inference costs while maintaining high-performance standards. With further evaluations on public academic benchmarks, Yi-Lightning demonstrates competitive performance against top-tier LLMs, while we observe a notable disparity between traditional, static benchmark results and real-world, dynamic human preferences. This observation prompts a critical reassessment of conventional benchmarks' utility in guiding the development of more intelligent and powerful AI systems for practical applications. Yi-Lightning is now available through our developer platform at https://platform.lingyiwanwu.com.

Submitted 22 January, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

arXiv:2412.00542 [pdf, other]

Rethinking Generalizability and Discriminability of Self-Supervised Learning from Evolutionary Game Theory Perspective

Authors: Jiangmeng Li, Zehua Zang, Qirui Ji, Chuxiong Sun, Wenwen Qiang, Junge Zhang, Changwen Zheng, Fuchun Sun, Hui Xiong

Abstract: Representations learned by self-supervised approaches are generally considered to possess sufficient generalizability and discriminability. However, we disclose a nontrivial mutual-exclusion relationship between these critical representation properties through an exploratory demonstration on self-supervised learning. State-of-the-art self-supervised methods tend to enhance either generalizability or discriminability but not both simultaneously. Thus, learning representations jointly possessing strong generalizability and discriminability presents a specific challenge for self-supervised learning. To this end, we revisit the learning paradigm of self-supervised learning from the perspective of evolutionary game theory (EGT) and outline the theoretical roadmap to achieve a desired trade-off between these representation properties. EGT performs well in analyzing the trade-off point in a two-player game by utilizing dynamic system modeling. However, the EGT analysis requires sufficient annotated data, which contradicts the principle of self-supervised learning, i.e., the EGT analysis cannot be conducted without the annotations of the specific target domain for self-supervised learning. Thus, to enhance the methodological generalization, we propose a novel self-supervised learning method that leverages advancements in reinforcement learning to jointly benefit from the general guidance of EGT and sequentially optimize the model to chase the consistent improvement of generalizability and discriminability for specific target domains during pre-training. Theoretically, we establish that the proposed method tightens the generalization error upper bound of self-supervised learning. Empirically, our method achieves state-of-the-art performance on various benchmarks.

Submitted 30 November, 2024; originally announced December 2024.

Comments: Accepted by IJCV, 2024

arXiv:2411.18613 [pdf, other]

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Authors: Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, Aleksander Holynski

Abstract: We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos. See our project page for results and interactive demos: https://cat-4d.github.io/.

Submitted 18 December, 2024; v1 submitted 27 November, 2024; originally announced November 2024.

Comments: Project page: https://cat-4d.github.io/

arXiv:2411.17773 [pdf, other]

Efficient Multi-modal Large Language Models via Visual Token Grouping

Authors: Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng

Abstract: The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. Existing methods conduct token reduction in the feature alignment phase. In this paper, we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens to represent image semantic segments after the linear projection layer before feeding into the vision encoder. Besides, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens utilizing the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG, maintaining 98.1% of the original performance while achieving a reduction of over 27% in inference time.

Submitted 2 December, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

arXiv:2411.16301 [pdf, other]

DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation

Authors: Yuxuan Yang, Jingyao Wang, Tao Geng, Wenwen Qiang, Changwen Zheng, Fuchun Sun

Abstract: Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transfer-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.

Submitted 25 November, 2024; originally announced November 2024.

Comments: 32 pages

arXiv:2411.15526 [pdf, other]

Multi-scale Cascaded Large-Model for Whole-body ROI Segmentation

Authors: Rui Hao, Dayu Tan, Yansen Su, Chunhou Zheng

Abstract: Organs-at-risk segmentation is critical for ensuring the safety and precision of radiotherapy and surgical procedures. However, existing methods for organs-at-risk image segmentation often suffer from uncertainties and biases in target selection, as well as insufficient model validation experiments, limiting their generality and reliability in practical applications. To address these issues, we propose an innovative cascaded network architecture called the Multi-scale Cascaded Fusing Network (MCFNet), which effectively captures complex multi-scale and multi-resolution features. MCFNet includes a Sharp Extraction Backbone and a Flexible Connection Backbone, which respectively enhance feature extraction in the downsampling and skip-connection stages. This design not only improves segmentation accuracy but also ensures computational efficiency, enabling precise detail capture even in low-resolution images. We conduct experiments using the A6000 GPU on diverse datasets from 671 patients, including 36,131 image-mask pairs across 10 different datasets. MCFNet demonstrates strong robustness, performing consistently well across 10 datasets. Additionally, MCFNet exhibits excellent generalizability, maintaining high accuracy in different clinical scenarios. We also introduce an adaptive loss aggregation strategy to further optimize the model training process, improving both segmentation accuracy and efficiency. Through extensive validation, MCFNet demonstrates superior performance compared to existing methods, providing more reliable image-guided support. Our solution aims to significantly improve the precision and safety of radiotherapy and surgical procedures, advancing personalized treatment. The code has been made available on GitHub: https://github.com/Henry991115/MCFNet.

Submitted 23 November, 2024; originally announced November 2024.

arXiv:2411.15183 [pdf, other]

Balancing property optimization and constraint satisfaction for constrained multi-property molecular optimization

Authors: Xin Xia, Yajie Zhang, Xiangxiang Zeng, Xingyi Zhang, Chunhou Zheng, Yansen Su

Abstract: Molecular optimization, which aims to discover improved molecules from a vast chemical search space, is a critical step in chemical development. Various artificial intelligence technologies have demonstrated high effectiveness and efficiency on molecular optimization tasks. However, few of these technologies focus on balancing property optimization with constraint satisfaction, making it difficult to obtain high-quality molecules that not only possess desirable properties but also meet various constraints. To address this issue, we propose a constrained multi-property molecular optimization framework (CMOMO), which is a flexible and efficient method to simultaneously optimize multiple molecular properties while satisfying several drug-like constraints. CMOMO improves multiple properties of molecules with constraints based on dynamic cooperative optimization, which dynamically handles the constraints across various scenarios. Besides, CMOMO evaluates multiple properties within discrete chemical spaces cooperatively with the evolution of molecules within an implicit molecular space to guide the evolutionary search. Experimental results show the superior performance of the proposed CMOMO over five state-of-the-art molecular optimization methods on two benchmark tasks of simultaneously optimizing multiple non-biological activity properties while satisfying two structural constraints. Furthermore, the practical applicability of CMOMO is verified on two practical tasks, where it identified a collection of candidate ligands of $β$2-adrenoceptor GPCR and candidate inhibitors of glycogen synthase kinase-3$β$ with high properties and under drug-like constraints.

Submitted 18 November, 2024; originally announced November 2024.

arXiv:2411.12994 [pdf, other]

Revisiting the activity-rotation relation for evolved stars

Authors: Henggeng Han, Song Wang, Xue Li, Chuanjie Zheng, Jifeng Liu

Abstract: The magnetic dynamo mechanism of giant stars remains an open question, which can be explored by investigating their activity-rotation relations with multiple proxies. By using the data from the LAMOST and GALEX surveys, we carried out a comprehensive study of activity-rotation relations of evolved stars based on Ca II H&K lines, Hα lines and near ultraviolet (NUV) emissions. Our results show that evolved stars and dwarfs obey a similar power-law in the unsaturated region of the activity-rotation relation, indicating a common dynamo mechanism in both giants and dwarfs. There is no clear difference in the activity levels between red giant branch stars and red clump stars, nor between single giants and those in binaries. Additionally, our results show that the NUV activity levels of giants are comparable to those of G- and K-type dwarfs and are higher than those of M dwarfs.

Submitted 19 November, 2024; originally announced November 2024.

Comments: ApJ accepted

arXiv:2411.12026 [pdf, other]

Modified Gravity Constraints from the Full Shape Modeling of Clustering Measurements from DESI 2024

Authors: M. Ishak, J. Pan, R. Calderon, K. Lodha, G. Valogiannis, A. Aviles, G. Niz, L. Yi, C. Zheng, C. Garcia-Quintero, A. de Mattia, L. Medina-Varela, J. L. Cervantes-Cota, U. Andrade, D. Huterer, H. E. Noriega, G. Zhao, A. Shafieloo, W. Fang, S. Ahlen, D. Bianchi, D. Brooks, E. Burtin, E. Chaussidon, T. Claybaugh, et al. (45 additional authors not shown)

Abstract: We present cosmological constraints on deviations from general relativity (GR) from the first year of clustering observations from the Dark Energy Spectroscopic Instrument (DESI) in combination with other datasets. We first consider the $μ(a,k)$-$Σ(a,k)$ modified gravity (MG) parametrization (as well as $η(a,k)$) in flat $Λ$CDM and $w_0 w_a$CDM backgrounds. Using a functional form for time-only evolution gives $μ_0= 0.11^{+0.44}_{-0.54}$ from DESI(FS+BAO)+BBN and a wide prior on $n_{s}$. Using DESI(FS+BAO)+CMB+DESY3+DESY5-SN, we obtain $μ_0 = 0.05\pm 0.22$ and $Σ_0 = 0.008\pm 0.045$ in the $Λ$CDM background. In $w_0 w_a$CDM, we obtain $μ_0 =-0.24^{+0.32}_{-0.28}$ and $Σ_0 = 0.006\pm 0.043$, consistent with GR, and we still find a preference of the data for dynamical dark energy with $w_0>-1$ and $w_a<0$. We then use binned forms in the two backgrounds starting with two bins in redshift and then combining them with two bins in scale for a total of 4 and 8 MG parameters, respectively. All MG parameters are found consistent with GR. We also find that the tension reported for $Σ_0$ with GR when using Planck PR3 goes away when we use the recent LoLLiPoP+HiLLiPoP likelihoods. As noted previously, this seems to indicate that the tension is related to the CMB lensing anomaly in PR3 which is also resolved when using these likelihoods. We then constrain the class of Horndeski theory in the effective field theory of dark energy. We consider both EFT-basis and $α$-basis. Assuming a power law parametrization for the function $Ω$, which controls non-minimal coupling, we obtain $Ω_0 = 0.012^{+0.001}_{-0.012}$ and $s_0 = 0.996^{+0.54}_{-0.20}$ from DESI(FS+BAO)+DESY5SN+CMB in a $Λ$CDM background. Similar results are obtained when using the $α$-basis, where we constrain $c_M<1.14$, and are all consistent with GR. [Abridged.]

Submitted 20 December, 2024; v1 submitted 18 November, 2024; originally announced November 2024.

Comments: 55 pages, 13 figures. This DESI Collaboration Publication is part of the 2024 publication series using the first year of observations (see https://data.desi.lbl.gov/doc/papers/). Added 3 figures and more discussions

arXiv:2411.11621 [pdf, other]

Plasma acceleration of polarized particle beams

Authors: Lars Reichwein, Zheng Gong, Chuan Zheng, Liangliang Ji, Alexander Pukhov, Markus Büscher

Abstract: Spin-polarized particle beams are of interest for applications like deep-inelastic scattering, e.g. to gain further understanding of the proton's nuclear structure. With the advent of high-intensity laser facilities, laser-plasma-based accelerators offer a promising alternative to common radiofrequency-based accelerators, as they can shorten the required acceleration length significantly. However, in the scope of spin-polarized particles, they bring unique challenges. This paper reviews the developments in the field of spin-polarized particles on the basis of the interaction of laser pulses and high-energy particle beams with plasma. The relevant scaling laws for spin-dependent effects in laser-plasma interaction, as well as acceleration schemes for polarized leptons, ions and gamma quanta are discussed.

Submitted 18 November, 2024; originally announced November 2024.

Comments: 42 pages, 14 figures, submitted to Reports on Progress in Physics

arXiv:2411.08837 [pdf, other]

A massive white dwarf or low-mass neutron star discovered by LAMOST

Authors: Xinlin Zhao, Song Wang, Pengfei Wang, Chuanjie Zheng, Haibo Yuan, Jifeng Liu

Abstract: We report the discovery of a close binary J0606+2132 (Gaia DR3 3423365496448406272) with $P_{\rm obs}=2.77$ days containing a possible massive white dwarf or a neutron star using the LAMOST spectroscopic data. By a joint fitting of the radial velocity from LAMOST and the light curve from TESS, we derived a circular Keplerian orbit with an inclination of $i=81.31^{\circ}{}^{+6.26^{\circ}}_{-7.85^{\circ}}$, which is consistent with that derived from $v\sin I$. Together with the mass of the visible star, we derived the mass of the invisible object to be $1.34^{+0.35}_{-0.40}\,M_{\odot}$. Spectral disentangling with the LAMOST medium-resolution spectra shows no absorption feature from an additional component, suggesting the presence of a compact object. No X-ray or radio pulsed signal is detected from ROSAT and FAST archive observations. J0606+2132 could evolve into either a Type Ia supernova or a neutron star through accretion-induced collapse if it is a white dwarf, or into an intermediate-mass X-ray binary if it is a neutron star.

Submitted 13 November, 2024; originally announced November 2024.

Comments: 17 pages, 8 figures, accepted for publication in ApJ

arXiv:2411.06746 [pdf, other]

Neuromodulated Meta-Learning

Authors: Jingyao Wang, Huijie Guo, Wenwen Qiang, Jiangmeng Li, Changwen Zheng, Hui Xiong, Gang Hua

Abstract: Humans excel at adapting perceptions and actions to diverse environments, enabling efficient interaction with the external world. This adaptive capability relies on the biological nervous system (BNS), which activates different brain regions for distinct tasks. Meta-learning similarly trains machines to handle multiple tasks but relies on a fixed network structure, which is not as flexible as the BNS. To investigate the role of flexible network structure (FNS) in meta-learning, we conduct extensive empirical and theoretical analyses, finding that model performance is tied to structure, with no universally optimal pattern across tasks. This reveals the crucial role of FNS in meta-learning, ensuring that meta-learning generates the optimal structure for each task, thereby maximizing the performance and learning efficiency of meta-learning. Motivated by this insight, we propose to define, measure, and model FNS in meta-learning. First, we define that an effective FNS should possess frugality, plasticity, and sensitivity. Then, to quantify FNS in practice, we present three measurements for these properties, collectively forming the structure constraint with theoretical support. Building on this, we finally propose Neuromodulated Meta-Learning (NeuronML) to model FNS in meta-learning. It utilizes bi-level optimization to update both weights and structure with the structure constraint. Extensive theoretical and empirical evaluations demonstrate the effectiveness of NeuronML on various tasks. Code is publicly available at https://github.com/WangJingyao07/NeuronML.

Submitted 11 November, 2024; originally announced November 2024.

arXiv:2411.06307 [pdf, other]

Acoustic Volume Rendering for Neural Impulse Response Fields

Authors: Zitong Lan, Chenhao Zheng, Zhiwei Zheng, Mingmin Zhao

Abstract: Realistic audio synthesis that captures accurate acoustic phenomena is essential for creating immersive experiences in virtual and augmented reality. Synthesizing the sound received at any position relies on the estimation of impulse response (IR), which characterizes how sound propagates in one scene along different paths before arriving at the listener's position. In this paper, we present Acoustic Volume Rendering (AVR), a novel approach that adapts volume rendering techniques to model acoustic impulse responses. While volume rendering has been successful in modeling radiance fields for images and neural scene representations, IRs present unique challenges as time-series signals. To address these challenges, we introduce frequency-domain volume rendering and use spherical integration to fit the IR measurements. Our method constructs an impulse response field that inherently encodes wave propagation principles and achieves state-of-the-art performance in synthesizing impulse responses for novel poses. Experiments show that AVR surpasses current leading methods by a substantial margin. Additionally, we develop an acoustic simulation platform, AcoustiX, which provides more accurate and realistic IR simulations than existing simulators. Code for AVR and AcoustiX is available at https://zitonglan.github.io/avr.

Submitted 9 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024 Spotlight

arXiv:2411.04924 [pdf, other]

MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

Authors: Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, Jianfei Cai

Abstract: We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views. To evaluate MVSplat360's performance, we introduce a new benchmark using the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping or even 360° NVS tasks. Experiments on the existing benchmark RealEstate10K also confirm the effectiveness of our model. The video results are available on our project page: https://donydchen.github.io/mvsplat360.

Submitted 7 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024, Project page: https://donydchen.github.io/mvsplat360, Code: https://github.com/donydchen/mvsplat360

arXiv:2411.01107 [pdf]

High-space-bandwidth product characterization of metalenses with Fourier ptychographic microscopy

Authors: Chuanjian Zheng, Wenli Wang, Yanfang Ji, Yao Hu, Shaohui Zhang, Qun Hao

Abstract: Large numerical aperture (NA) and large aperture metalenses have shown significant performance and abundant applications in biomedical and astronomical imaging fields. However, the high space-bandwidth product (SBP) requirements for measuring the phase of these metalenses, characterized by small phase periods and large apertures, have resulted in no effective techniques for sufficient characterization. In this paper, we propose a high SBP phase characterization technique using Fourier ptychographic microscopy (FPM), enabling a high spatial resolution and wide field of view simultaneously. To demonstrate the feasibility and effectiveness of this technique, we achieve a high SBP (4.91 megapixels) measurement and characterization for focusing and focusing vortex metalenses, quantitatively displaying the effect of fabrication error on their typical optical performance. Furthermore, we characterize the aberration type and amount of wavefront deviations caused by fabrication. We also analyze compensation methods for different aberrations based on the wavefront characterization results, providing a targeted alignment strategy for optimizing overall optical system performance. We believe that our high SBP characterization technique can not only help to improve metalens design but also optimize its fabrication processing, which will pave the way for the diversified applications of metalenses.

Submitted 1 November, 2024; originally announced November 2024.

arXiv:2410.21549 [pdf, other]

Semantic Search Evaluation

Authors: Chujie Zheng, Jeffrey Wang, Shuqian Albee Zhang, Anand Kishore, Siddharth Singh

Abstract: We propose a novel method for evaluating the performance of a content search system that measures the semantic match between a query and the results returned by the search system. We introduce a metric called "on-topic rate" to measure the percentage of results that are relevant to the query. To achieve this, we design a pipeline that defines a golden query set, retrieves the top K results for each query, and sends calls to GPT 3.5 with formulated prompts. Our semantic evaluation pipeline helps identify common failure patterns and goals against the metric for relevance improvements.

Submitted 28 October, 2024; originally announced October 2024.

Comments: Accepted by 3rd International Workshop on Industrial Recommendation Systems (at CIKM 2024)
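
The "on-topic rate" described in the abstract above is the share of returned results judged relevant to their query. The following is only a minimal sketch of that ratio, not the authors' pipeline: the function names, the toy data, and the keyword-based judge (a stand-in for the GPT 3.5 calls the abstract mentions) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def on_topic_rate(
    results_by_query: Dict[str, List[str]],
    is_on_topic: Callable[[str, str], bool],
) -> float:
    """Fraction of (query, result) pairs that the judge marks as relevant."""
    total = 0
    relevant = 0
    for query, results in results_by_query.items():
        for result in results:
            total += 1
            if is_on_topic(query, result):
                relevant += 1
    # An empty result set has no meaningful rate; report 0.0 by convention.
    return relevant / total if total else 0.0


def keyword_judge(query: str, result: str) -> bool:
    """Hypothetical stand-in judge: relevant if any query word appears in the result."""
    text = result.lower()
    return any(word in text for word in query.lower().split())


# Hypothetical golden query set with the top-2 results for each query.
golden_results = {
    "remote jobs": ["Remote engineer jobs", "Cooking tips"],
    "python tutorial": ["Learn Python fast", "Python tutorial for beginners"],
}

rate = on_topic_rate(golden_results, keyword_judge)
print(f"on-topic rate: {rate:.2f}")
```

In a real evaluation the judge would be replaced by whatever relevance classifier is in use; keeping it as a callable leaves the metric itself independent of that choice.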

arXiv:2410.20230 [pdf, other]

FRTree Planner: Robot Navigation in Cluttered and Unknown Environments with Tree of Free Regions

Authors: Yulin Li, Zhicheng Song, Chunxin Zheng, Zhihai Bi, Kai Chen, Michael Yu Wang, Jun Ma

Abstract: In this work, we present FRTree planner, a novel robot navigation framework that leverages a tree structure of free regions, specifically designed for navigation in cluttered and unknown environments with narrow passages. The framework continuously incorporates real-time perceptive information to identify distinct navigation options and dynamically expands the tree toward explorable and traversable directions. This dynamically constructed tree incrementally encodes the geometric and topological information of the collision-free space, enabling efficient selection of the intermediate goals, navigating around dead-end situations, and avoidance of dynamic obstacles without a prior map. Crucially, our method performs a comprehensive analysis of the geometric relationship between free regions and the robot during online replanning. In particular, the planner assesses the accessibility of candidate passages based on the robot's geometries, facilitating the effective selection of the most viable intermediate goals through accessible narrow passages while minimizing unnecessary detours. By combining the free region information with a bi-level trajectory optimization tailored for robots with specific geometries, our approach generates robust and adaptable obstacle avoidance strategies in confined spaces. Through extensive simulations and real-world experiments, FRTree demonstrates its superiority over benchmark methods in generating safe, efficient motion plans through highly cluttered and unknown terrains with narrow gaps.

Submitted 13 February, 2025; v1 submitted 26 October, 2024; originally announced October 2024.

arXiv:2410.19577 [pdf, ps, other]

doi 10.1021/acs.nanolett.4c04461

Landau-Level Quantization and Band Splitting of FeSe Monolayers Revealed by Scanning Tunneling Spectroscopy

Authors: Wantong Huang, Haicheng Lin, Yuguo Yin, Cheng Zheng, Wei Chen, Lichen Ji, Jack Hughes, Fedor Kusmartsev, Anna Kusmartseva, Qi-Kun Xue, Xi Chen, Shuai-Hua Ji

Abstract: Two-dimensional (2D) superconductors that reside on substrates must be influenced by Rashba spin-orbit coupling (SOC). The intriguing effect of Rashba-type SOCs on iron-based superconductors (IBSs) has remained largely a mystery. In this work, we unveil modified Landau-level spectroscopy and the intricate band splitting of FeSe monolayers through the precision of scanning tunneling spectroscopy, which unequivocally demonstrates the presence of Rashba SOC. The discovery sheds light on a nonparabolic electron band at the X/Y point, displaying a distinctive Landau quantization behavior characterized by $E_n\propto(nB)^{4/3}$. The theoretical model aligns with our experimental insights, positing that the k$^4$-term of the electron band becomes predominant and profoundly reshapes the band structure. Our research underscores the pivotal role of the Rashba SOC effect on 2D superconductors and sets the stage to probe new quantum states in systems with remarkably low carrier concentrations.

Submitted 25 October, 2024; originally announced October 2024.

Comments: 21 pages, 5 figures

arXiv:2410.13032 [pdf, other]

Hypothesis Testing the Circuit Hypothesis in LLMs

Authors: Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, David M. Blei

Abstract: Large language models (LLMs) demonstrate surprising capabilities, but we do not understand how they are implemented. One hypothesis suggests that these capabilities are primarily executed by small subnetworks within the LLM, known as circuits. But how can we evaluate this hypothesis? In this paper, we formalize a set of criteria that a circuit is hypothesized to meet and develop a suite of hypothesis tests to evaluate how well circuits satisfy them. The criteria focus on the extent to which the LLM's behavior is preserved, the degree of localization of this behavior, and whether the circuit is minimal. We apply these tests to six circuits described in the research literature. We find that synthetic circuits -- circuits that are hard-coded in the model -- align with the idealized properties. Circuits discovered in Transformer models satisfy the criteria to varying degrees. To facilitate future empirical studies of circuits, we created the circuitry package, a wrapper around the TransformerLens library, which abstracts away lower-level manipulations of hooks and activations. The software is available at https://github.com/blei-lab/circuitry.

Submitted 16 October, 2024; originally announced October 2024.

Comments: Code available here: https://github.com/blei-lab/circuitry

arXiv:2410.10527 [pdf, other]

Motion-guided small MAV detection in complex and non-planar scenes

Authors: Hanqing Guo, Canlun Zheng, Shiyu Zhao

Abstract: In recent years, there has been a growing interest in the visual detection of micro aerial vehicles (MAVs) due to its importance in numerous applications. However, the existing methods based on either appearance or motion features encounter difficulties when the background is complex or the MAV is too small. In this paper, we propose a novel motion-guided MAV detector that can accurately identify small MAVs in complex and non-planar scenes. This detector first exploits a motion feature enhancement module to capture the motion features of small MAVs. Then it uses multi-object tracking and trajectory filtering to eliminate false positives caused by motion parallax. Finally, an appearance-based classifier and an appearance-based detector that operates on the cropped regions are used to achieve precise detection results. Our proposed method can effectively and efficiently detect extremely small MAVs from dynamic and complex backgrounds because it aggregates pixel-level motion features and eliminates false positives based on the motion and appearance features of MAVs. Experiments on the ARD-MAV dataset demonstrate that the proposed method could achieve high performance in small MAV detection under challenging conditions and outperform other state-of-the-art methods across various metrics.

Submitted 14 October, 2024; originally announced October 2024.

Comments: 8 pages, 6 figures

Journal ref: Pattern Recognition Letters 2024

arXiv:2410.10102 [pdf, other]

Trust-Region Eigenvalue Filtering for Projected Newton

Authors: Honglin Chen, Hsueh-Ti Derek Liu, Alec Jacobson, David I. W. Levin, Changxi Zheng

Abstract: We introduce a novel adaptive eigenvalue filtering strategy to stabilize and accelerate the optimization of Neo-Hookean energy and its variants under the Projected Newton framework. For the first time, we show that Newton's method, Projected Newton with eigenvalue clamping and Projected Newton with absolute eigenvalue filtering can be unified using ideas from the generalized trust region method. Based on the trust-region fit, our model adaptively chooses the correct eigenvalue filtering strategy to apply during the optimization. Our method is simple but effective, requiring only two lines of code change in the existing Projected Newton framework. We validate that our model outperforms stand-alone variants across a number of experiments on quasistatic simulation of deformable solids over a large dataset.

Submitted 13 October, 2024; originally announced October 2024.

Comments: SIGGRAPH Asia 2024 (Conference track). Project page: https://www.cs.columbia.edu/cg/trust-region/

arXiv:2410.10015 [pdf, other]

doi 10.1051/0004-6361/202451825

New JWST redshifts for the host galaxies of CDF-S XT1 and XT2: understanding their nature

Authors: J. Quirola-Vásquez, F. E. Bauer, P. G. Jonker, A. Levan, W. N. Brandt, M. Ravasio, D. Eappachen, Y. Q. Xue, X. C. Zheng

Abstract: CDF-S XT1 and XT2 are considered two canonical extragalactic fast X-ray transients (FXTs). In this work, we report new constraints on both FXTs, based on recent JWST NIRCam and MIRI photometry, as well as NIRSpec spectroscopy for CDF-S XT2 that allow us to improve our understanding of their distances, energetics, and host galaxy properties compared to the pre-JWST era. We use the available HST and JWST archival data to determine the host properties and constrain the energetics of each FXT based on spectral energy distribution (SED) photometric fitting. The host of CDF-S XT1 is now constrained to lie at $z$=2.76, implying a host absolute magnitude $M_R=-19.14$~mag, stellar mass $M_{*}=$1.8e8~$M_\odot$, and star formation rate SFR$=0.62 M_\odot$/yr. These properties lie at the upper end of previous estimates, leaving CDF-S XT1 with a peak X-ray luminosity of 2.8e47 erg/s. We argue that the best progenitor scenario for XT1 is a low-luminosity gamma-ray burst (GRB), although we do not fully rule out a proto-magnetar association or a jetted tidal disruption event involving a white dwarf and an intermediate-mass black hole. In the case of CDF-S XT2, JWST imaging reveals a new highly obscured component of the host galaxy, previously missed in HST images, while NIRSpec spectroscopy securely places the host at $z$=3.4598. The new redshift implies a host with $M_R=-21.76$~mag, $M_*=5.5e10 M_\odot$, SFR=160~$M_\odot$/yr, and FXT $L_{X,peak}=1.4e47$~erg/s. The revised energetics, similarity to X-ray flash event light curves, small host offset, and high host SFR favor a low-luminosity collapsar progenitor for CDF-S XT2. Although a magnetar model is not ruled out, it appears improbable. While these HST and JWST observations shed light on the host galaxies of XT1 and XT2, and by extension, on the nature of FXTs, a unique explanation for both sources remains elusive.

Submitted 24 February, 2025; v1 submitted 13 October, 2024; originally announced October 2024.

Comments: The manuscript was accepted by Astronomy & Astrophysics in January 2025

Journal ref: A&A 695, A279 (2025)

arXiv:2410.08935 [pdf, other]

Voxel-SLAM: A Complete, Accurate, and Versatile LiDAR-Inertial SLAM System

Authors: Zheng Liu, Haotian Li, Chongjian Yuan, Xiyuan Liu, Jiarong Lin, Rundong Li, Chunran Zheng, Bingyang Zhou, Wenyi Liu, Fu Zhang

Abstract: In this work, we present Voxel-SLAM: a complete, accurate, and versatile LiDAR-inertial SLAM system that fully utilizes short-term, mid-term, long-term, and multi-map data associations to achieve real-time estimation and high precision mapping. The system consists of five modules: initialization, odometry, local mapping, loop closure, and global mapping, all employing the same map representation, an adaptive voxel map. The initialization provides an accurate initial state estimation and a consistent local map for subsequent modules, enabling the system to start with a highly dynamic initial state. The odometry, exploiting the short-term data association, rapidly estimates current states and detects potential system divergence. The local mapping, exploiting the mid-term data association, employs a local LiDAR-inertial bundle adjustment (BA) to refine the states (and the local map) within a sliding window of recent LiDAR scans. The loop closure detects previously visited places in the current and all previous sessions. The global mapping refines the global map with an efficient hierarchical global BA. The loop closure and global mapping both exploit long-term and multi-map data associations. We conducted a comprehensive benchmark comparison with other state-of-the-art methods across 30 sequences from three representative scenes, including narrow indoor environments using hand-held equipment, large-scale wilderness environments with aerial robots, and urban environments on vehicle platforms. Other experiments demonstrate the robustness and efficiency of the initialization, the capacity to work in multiple sessions, and relocalization in degenerated environments.

Submitted 11 October, 2024; originally announced October 2024.

arXiv:2410.06854 [pdf, other]

Focal Surface Holographic Light Transport using Learned Spatially Adaptive Convolutions

Authors: Chuanjun Zheng, Yicheng Zhan, Liang Shi, Ozan Cakmakci, Kaan Akşit

Abstract: Computer-Generated Holography (CGH) is a set of algorithmic methods for identifying holograms that reconstruct Three-Dimensional (3D) scenes in holographic displays. CGH algorithms decompose 3D scenes into multiplanes at different depth levels and rely on simulations of light propagating from a source plane to a targeted plane. Thus, for n planes, CGH typically optimizes holograms using n plane-to-plane light transport simulations, leading to major time and computational demands. Our work replaces multiple planes with a focal surface and introduces a learned light transport model that can propagate a light field from a source plane to the focal surface in a single inference. Our learned light transport model leverages spatially adaptive convolution to achieve depth-varying propagation demanded by targeted focal surfaces. The proposed model accelerates the hologram optimization process by up to 1.5x, which contributes to hologram dataset generation and the training of future learned CGH models.

Submitted 14 October, 2024; v1 submitted 9 October, 2024; originally announced October 2024.

Comments: SIGGRAPH Asia 2024 Technical Communications

arXiv:2410.05739 [pdf, other]

Array2BR: An End-to-End Noise-immune Binaural Audio Synthesis from Microphone-array Signals

Authors: Cheng Chi, Xiaoyu Li, Andong Li, Yuxuan Ke, Xiaodong Li, Chengshi Zheng

Abstract: Telepresence technology aims to provide an immersive virtual presence for remote conference applications, and it is extremely important to synthesize high-quality binaural audio signals for this aim. Because the ambient noise is often inevitable in practical application scenarios, it is highly desired that binaural audio signals without noise can be obtained from microphone-array signals directly. For this purpose, this paper proposes a new end-to-end noise-immune binaural audio synthesis framework from microphone-array signals, abbreviated as Array2BR, and experimental results show that binaural cues can be correctly mapped and noise can be well suppressed simultaneously using the proposed framework. Compared with existing methods, the proposed method achieved better performance in terms of both objective and subjective metric scores.

Submitted 8 October, 2024; originally announced October 2024.

arXiv:2410.04798 [pdf, other]

DAPE V2: Process Attention Score as Feature Map for Length Extrapolation

Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li

Abstract: The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by the key-query products. However, this work's occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.

Submitted 10 October, 2024; v1 submitted 7 October, 2024; originally announced October 2024.

Comments: Tech Report. Compared to DAPE, this work (DAPE V2) further analyzes the length extrapolation problem and translates the length extrapolation issue into a well-understood feature map processing problem. arXiv admin note: text overlap with arXiv:2405.14722

arXiv:2410.03090 [pdf, other]

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

Authors: Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong

Abstract: Deploying large language models (LLMs) is challenging due to their high memory and computational demands, especially during long-context inference. While key-value (KV) caching accelerates inference by reusing previously computed keys and values, it also introduces significant memory overhead. Existing KV cache compression methods such as eviction and merging typically compress the KV cache after it is generated and overlook the eviction of hidden states, failing to improve the speed of the prefilling stage. Additionally, applying a uniform compression rate across different attention heads can harm crucial retrieval heads in needle-in-a-haystack tasks due to excessive compression. In this paper, we propose UNComp, an uncertainty-aware compression scheme that leverages matrix entropy to estimate model uncertainty across layers and heads at the token sequence level. By grouping layers and heads based on their uncertainty, UNComp adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size, resulting in a 6.4x increase in throughput and a 1.4x speedup in inference with only a 1.41% performance loss. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size. Our approach offers an efficient, training-free Grouped-Query Attention paradigm that can be seamlessly integrated into existing KV cache schemes.

Submitted 3 October, 2024; originally announced October 2024.

arXiv:2410.02719 [pdf, other]

UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation

Authors: Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang, Chengming Li, Zhenan Sun, Lingpeng Kong, Ngai Wong

Abstract: We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG) that utilizes Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks. This span uncertainty enhances model calibration, improving robustness and mitigating semantic inconsistencies introduced by random chunking. Leveraging this insight, we propose an efficient unsupervised learning technique to train the retrieval model, alongside an effective data sampling and scaling strategy. UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results while using only 4% of the training data compared to other advanced open-source retrieval models under distribution shift settings. Our method demonstrates strong calibration through span uncertainty, leading to improved generalization and robustness in long-context RAG tasks. Additionally, UncertaintyRAG provides a lightweight retrieval model that can be integrated into any large language model with varying context window lengths, without the need for fine-tuning, showcasing the flexibility of our approach.

Submitted 3 October, 2024; originally announced October 2024.

arXiv:2410.00772 [pdf, other]

On the Generalization and Causal Explanation in Self-Supervised Learning

Authors: Wenwen Qiang, Zeen Song, Ziyin Gu, Jiangmeng Li, Changwen Zheng, Fuchun Sun, Hui Xiong

Abstract: Self-supervised learning (SSL) methods learn from unlabeled data and achieve high generalization performance on downstream tasks. However, they may also suffer from overfitting to their training data and lose the ability to adapt to new tasks. To investigate this phenomenon, we conduct experiments on various SSL methods and datasets and make two observations: (1) Overfitting occurs abruptly in later layers and epochs, while generalizing features are learned in early layers for all epochs; (2) Coding rate reduction can be used as an indicator to measure the degree of overfitting in SSL models. Based on these observations, we propose Undoing Memorization Mechanism (UMM), a plug-and-play method that mitigates overfitting of the pre-trained feature extractor by aligning the feature distributions of the early and the last layers to maximize the coding rate reduction of the last layer output. The learning process of UMM is a bi-level optimization process. We provide a causal analysis of UMM to explain how UMM can help the pre-trained feature extractor overcome overfitting and recover generalization. We also demonstrate that UMM significantly improves the generalization performance of SSL methods on various downstream tasks.

Submitted 1 October, 2024; originally announced October 2024.

arXiv:2409.19699 [pdf, other]

Efficient Verification of Stabilizer Code Subspaces with Local Measurements

Authors: Congcong Zheng, Xutao Yu, Zaichen Zhang, Ping Xu, Kun Wang

Abstract: We address the task of verifying whether a quantum computer, designed to be protected by a specific stabilizer code, correctly encodes the corresponding logical qubits. To achieve this, we develop a general framework for subspace verification and explore several stabilizer code subspaces of practical significance. First, we present two efficient verification strategies for general stabilizer code subspaces, utilizing measurements of their stabilizer generators and stabilizer groups, respectively. Then, building on the observation that certain tests can be conducted in parallel when the subspace exhibits specific structural properties, we propose a coloring strategy tailored to graph code subspaces and an XZ strategy tailored to Calderbank-Shor-Steane (CSS) code subspaces. Compared to stabilizer-based strategies, these new strategies require significantly fewer measurement settings and consume fewer state copies, approaching near-global optimality. Notably, all the strategies employ a limited number of Pauli measurements, are non-adaptive, and work on mixed states, enabling efficient experimental certification of both logical qubits and logical operations in noisy quantum computers. This work contributes to the first systematic study of efficient verification of stabilizer code subspaces with local measurements.

Submitted 7 December, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

Comments: After the submission of this work, we have become aware of a related work by Chen et al. in arXiv:2410.12551

arXiv:2409.19676 [pdf, other]

See Detail Say Clear: Towards Brain CT Report Generation via Pathological Clue-driven Representation Learning

Authors: Chengxin Zheng, Junzhong Ji, Yanzhao Shi, Xiaodan Zhang, Liangqiong Qu

Abstract: Brain CT report generation is significant to aid physicians in diagnosing cranial diseases. Recent studies concentrate on handling the consistency between visual and textual pathological features to improve the coherence of reports. However, there exist some challenges: 1) Redundant visual representing: Massive irrelevant areas in 3D scans distract models from representing salient visual contexts. 2) Shifted semantic representing: Limited medical corpus causes difficulties for models to transfer the learned textual representations to generative layers. This study introduces a Pathological Clue-driven Representation Learning (PCRL) model to build cross-modal representations based on pathological clues and naturally adapt them for accurate report generation. Specifically, we construct pathological clues from perspectives of segmented regions, pathological entities, and report themes, to fully grasp visual pathological patterns and learn cross-modal feature representations. To adapt the representations for the text generation task, we bridge the gap between representation learning and report generation by using a unified large language model (LLM) with task-tailored instructions. These crafted instructions enable the LLM to be flexibly fine-tuned across tasks and smoothly transfer the semantic representation for report generation. Experiments demonstrate that our method outperforms previous methods and achieves SoTA performance. Our code is available at "https://github.com/Chauncey-Jheng/PCRL-MRG".

Submitted 1 October, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

Comments: Our work has been accepted by EMNLP2024 findings

arXiv:2409.17830 [pdf, other]

Unsupervised Learning Based Multi-Scale Exposure Fusion

Authors: Chaobing Zheng, Shiqian Wu, Zhengguo Li

Abstract: Unsupervised learning based multi-scale exposure fusion (ULMEF) is efficient for fusing differently exposed low dynamic range (LDR) images into a higher quality LDR image for a high dynamic range (HDR) scene. Unlike supervised learning, loss functions play a crucial role in the ULMEF. In this paper, novel loss functions are proposed for the ULMEF and they are defined by using all the images to be fused and other differently exposed images from the same HDR scene. The proposed loss functions can guide the proposed ULMEF to learn more reliable information from the HDR scene than existing loss functions which are defined by only using the set of images to be fused. As such, the quality of the fused image is significantly improved. The proposed ULMEF also adopts a multi-scale strategy that includes a multi-scale attention module to effectively preserve the scene depth and local contrast in the fused image. Meanwhile, the proposed ULMEF can be adopted to achieve exposure interpolation and exposure extrapolation. Extensive experiments show that the proposed ULMEF algorithm outperforms state-of-the-art exposure fusion algorithms.

Submitted 26 September, 2024; originally announced September 2024.

Comments: 11 pages

arXiv:2409.16997 [pdf, other]

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

Authors: Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang

Abstract: As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats such as INT4. Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data format.

Submitted 26 September, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

arXiv:2409.15269 [pdf, other]

ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild

Authors: Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, Otmar Hilliges

Abstract: While previous years have seen great progress in the 3D reconstruction of humans from monocular videos, few of the state-of-the-art methods are able to handle loose garments that exhibit large non-rigid surface deformations during articulation. This limits the application of such methods to humans that are dressed in standard pants or T-shirts. Our method, ReLoo, overcomes this limitation and reconstructs high-quality 3D models of humans dressed in loose garments from monocular in-the-wild videos. To tackle this problem, we first establish a layered neural human representation that decomposes clothed humans into a neural inner body and outer clothing. On top of the layered neural representation, we further introduce a non-hierarchical virtual bone deformation module for the clothing layer that can freely move, which allows the accurate recovery of non-rigidly deforming loose clothing. A global optimization jointly optimizes the shape, appearance, and deformations of the human body and clothing via multi-layer differentiable volume rendering. To evaluate ReLoo, we record subjects with dynamically deforming garments in a multi-view capture studio. This evaluation, both on existing and our novel dataset, demonstrates ReLoo's clear superiority over prior art on both indoor datasets and in-the-wild videos.

Submitted 28 September, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

Comments: Project page: https://moygcc.github.io/ReLoo/

arXiv:2409.14741 [pdf, other]

Less yet robust: crucial region selection for scene recognition

Authors: Jianqi Zhang, Mengxuan Wang, Jingyao Wang, Lingyu Si, Changwen Zheng, Fanjiang Xu

Abstract: Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works that focus on convolutional neural networks have been shown to be able to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features. Thus, the model can perform learning via these regions to avoid interference. We implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.

Submitted 20 October, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.14228 [pdf, other]

Mentigo: An Intelligent Agent for Mentoring Students in the Creative Problem Solving Process

Authors: Siyu Zha, Yujia Liu, Chengbo Zheng, Jiaqi XU, Fuze Yu, Jiangtao Gong, Yingqing XU

Abstract: With the increasing integration of large language models (LLMs) in education, there is growing interest in using AI agents to support student learning in creative tasks. This study presents an interactive Mentor Agent system named Mentigo, which is designed to assist middle school students in the creative problem solving (CPS) process. We created a comprehensive dataset of real classroom interactions between students and mentors, which includes structured CPS task management, diverse guidance techniques, and personalized feedback mechanisms. Based on this dataset, we create an agentic workflow for the Mentigo system. The system's effectiveness was evaluated through a comparative experiment with 12 students and reviewed by five expert teachers. The Mentigo system demonstrated significant improvements in student engagement and creative outcomes. The findings provide design implications for leveraging LLMs to support CPS and offer insights into the application of AI mentor agents in educational contexts.

Submitted 21 September, 2024; originally announced September 2024.

Comments: 19 pages, 5 figures. Submitted to CHI 2025

MSC Class: 68U35 (Primary); 68T50 (Secondary) ACM Class: H.5.2; K.3.1

arXiv:2409.11505 [pdf, other]

Perceptions of Edinburgh: Capturing Neighbourhood Characteristics by Clustering Geoparsed Local News

Authors: Andreas Grivas, Claire Grover, Richard Tobin, Clare Llewellyn, Eleojo Oluwaseun Abubakar, Chunyu Zheng, Chris Dibben, Alan Marshall, Jamie Pearce, Beatrice Alex

Abstract: The communities that we live in affect our health in ways that are complex and hard to define. Moreover, our understanding of the place-based processes affecting health and inequalities is limited. This undermines the development of robust policy interventions to improve local health and well-being. News media provides social and community information that may be useful in health studies. Here we propose a methodology for characterising neighbourhoods by using local news articles. More specifically, we show how we can use Natural Language Processing (NLP) to unlock further information about neighbourhoods by analysing, geoparsing and clustering news articles. Our work is novel because we combine street-level geoparsing tailored to the locality with clustering of full news articles, enabling a more detailed examination of neighbourhood characteristics. We evaluate our outputs and show via a confluence of evidence, both from a qualitative and a quantitative perspective, that the themes we extract from news articles are sensible and reflect many characteristics of the real world. This is significant because it allows us to better understand the effects of neighbourhoods on health. Our findings on neighbourhood characterisation using news data will support a new generation of place-based research which examines a wider set of spatial processes and how they affect health, enabling new epidemiological research.

Submitted 17 September, 2024; originally announced September 2024.

Comments: Preprint - paper under submission

arXiv:2409.08474 [pdf, other]

Rethinking Meta-Learning from a Learning Lens

Authors: Jingyao Wang, Wenwen Qiang, Changwen Zheng, Hui Xiong, Gang Hua

Abstract: Meta-learning seeks to learn a well-generalized model initialization from training tasks to solve unseen tasks. From the "learning to learn" perspective, the quality of the initialization is modeled with one-step gradient descent in the inner loop. However, contrary to theoretical expectations, our empirical analysis reveals that this may expose meta-learning to underfitting. To bridge the gap between theoretical understanding and practical implementation, we reconsider meta-learning from the "Learning" lens. We propose that the meta-learning model comprises two interrelated components: parameters for model initialization and a meta-layer for task-specific fine-tuning. These components will lead to the risks of overfitting and underfitting depending on tasks, and their solutions, fewer parameters vs. more meta-layer, are often in conflict. To address this, we aim to regulate the task information the model receives without modifying the data or model structure. Our theoretical analysis indicates that models adapted to different tasks can mutually reinforce each other, highlighting the effective information. Based on this insight, we propose TRLearner, a plug-and-play method that leverages task relation to calibrate meta-learning. It first extracts task relation matrices and then applies relation-aware consistency regularization to guide optimization. Extensive theoretical and empirical evaluations demonstrate its effectiveness.

Submitted 6 May, 2025; v1 submitted 12 September, 2024; originally announced September 2024.

arXiv:2409.05310 [pdf, other]

Neural Surface Reconstruction and Rendering for LiDAR-Visual Systems

Authors: Jianheng Liu, Chunran Zheng, Yunfei Wan, Bowen Wang, Yixi Cai, Fu Zhang

Abstract: This paper presents a unified surface reconstruction and rendering framework for LiDAR-visual systems, integrating Neural Radiance Fields (NeRF) and Neural Distance Fields (NDF) to recover both appearance and structural information from posed images and point clouds. We address the structural visible gap between NeRF and NDF by utilizing a visible-aware occupancy map to classify space into the free, occupied, visible unknown, and background regions. This classification facilitates the recovery of a complete appearance and structure of the scene. We unify the training of the NDF and NeRF using a spatial-varying scale SDF-to-density transformation for levels of detail for both structure and appearance. The proposed method leverages the learned NDF for structure-aware NeRF training by an adaptive sphere tracing sampling strategy for accurate structure rendering. In return, NeRF further refines the structure by recovering missing or fuzzy structures in the NDF. Extensive experiments demonstrate the superior quality and versatility of the proposed method across various scenarios. To benefit the community, the codes will be released at https://github.com/hku-mars/M2Mapping.

Submitted 8 September, 2024; originally announced September 2024.

arXiv:2409.04679 [pdf, other]

Neural Augmentation Based Panoramic High Dynamic Range Stitching

Authors: Chaobing Zheng, Yilun Xu, Weihai Chen, Shiqian Wu, Sen Zhang, Zhengguo Li

Abstract: Due to saturated regions of inputting low dynamic range (LDR) images and large intensity changes among the LDR images caused by different exposures, it is challenging to produce an information enriched panoramic LDR image without visual artifacts for a high dynamic range (HDR) scene through stitching multiple geometrically synchronized LDR images with different exposures and pairwise overlapping fields of views (OFOVs). Fortunately, the stitching of such images is innately a perfect scenario for the fusion of a physics-driven approach and a data-driven approach due to their OFOVs. Based on this new insight, a novel neural augmentation based panoramic HDR stitching algorithm is proposed in this paper. The physics-driven approach is built up using the OFOVs. Different exposed images of each view are initially generated by using the physics-driven approach, are then refined by a data-driven approach, and are finally used to produce panoramic LDR images with different exposures. All the panoramic LDR images with different exposures are combined together via a multi-scale exposure fusion algorithm to produce the final panoramic LDR image. Experimental results demonstrate the proposed algorithm outperforms existing panoramic stitching algorithms.

Submitted 20 February, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

Comments: 11 pages

arXiv:2409.02795 [pdf, other]

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Authors: Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Shanghaoran Quan, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, Houfeng Wang, Zhifang Sui, Peiyi Wang, Tianyu Liu, Baobao Chang

Abstract: Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.

Submitted 31 October, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

Comments: 23 pages, 6 figures

arXiv:2409.00992 [pdf, other]

MFCalib: Single-shot and Automatic Extrinsic Calibration for LiDAR and Camera in Targetless Environments Based on Multi-Feature Edge

Authors: Tianyong Ye, Wei Xu, Chunran Zheng, Yukang Cui

Abstract: This paper presents MFCalib, an innovative extrinsic calibration technique for LiDAR and RGB camera that operates automatically in targetless environments with a single data capture. At the heart of this method is using a rich set of edge information, significantly enhancing calibration accuracy and robustness. Specifically, we extract both depth-continuous and depth-discontinuous edges, along with intensity-discontinuous edges on planes. This comprehensive edge extraction strategy ensures our ability to achieve accurate calibration with just one round of data collection, even in complex and varied settings. Addressing the uncertainty of depth-discontinuous edges, we delve into the physical measurement principles of LiDAR and develop a beam model, effectively mitigating the issue of edge inflation caused by the LiDAR beam. Extensive experiment results demonstrate that MFCalib outperforms the state-of-the-art targetless calibration methods across various scenes, achieving and often surpassing the precision of multi-scene calibrations in a single-shot collection. To support community development, we make our code available open-source on GitHub.

Submitted 2 September, 2024; originally announced September 2024.

Comments: 8 pages, 10 figures, accepted by IROS2024

arXiv:2408.16228 [pdf, other]

Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation

Authors: Vivek Myers, Bill Chunyuan Zheng, Oier Mees, Sergey Levine, Kuan Fang

Abstract: Learned language-conditioned robot policies often struggle to effectively adapt to new real-world tasks even when pre-trained across a diverse set of instructions. We propose a novel approach for few-shot adaptation to unseen tasks that exploits the semantic understanding of task decomposition provided by vision-language models (VLMs). Our method, Policy Adaptation via Language Optimization (PALO), combines a handful of demonstrations of a task with proposed language decompositions sampled from a VLM to enable rapid nonparametric adaptation, avoiding the need for a larger fine-tuning dataset. We evaluate PALO on extensive real-world experiments consisting of challenging unseen, long-horizon robot manipulation tasks. We find that PALO is able to consistently complete long-horizon, multi-tier tasks in the real world, outperforming state-of-the-art pre-trained generalist policies and methods that have access to the same demonstrations.

Submitted 28 August, 2024; originally announced August 2024.

Comments: 27 pages, 14 figures

Journal ref: Conference on Robot Learning, 2024

arXiv:2408.14089 [pdf, other]

Mini-Slot-Assisted Short Packet URLLC: Differential or Coherent Detection?

Authors: Canjian Zheng, Fu-Chun Zheng, Jingjing Luo, Pengcheng Zhu, Xiaohu You, Daquan Feng

Abstract: One of the primary challenges in short packet ultra-reliable and low-latency communications (URLLC) is to achieve reliable channel estimation and data detection while minimizing the impact on latency performance. Given the small packet size in mini-slot-assisted URLLC, relying solely on pilot-based coherent detection is almost impossible to meet the seemingly contradictory requirements of high channel estimation accuracy, high reliability, low training overhead, and low latency. In this paper, we explore differential modulation both in the frequency domain and in the time domain, and propose adopting an adaptive approach that integrates both differential and coherent detection to achieve mini-slot-assisted short packet URLLC, striking a balance among training overhead, system performance, and computational complexity. Specifically, differential (especially in the frequency domain) and coherent detection schemes can be dynamically activated based on application scenarios, channel statistics, information payloads, mini-slot deployment options, and service requirements. Furthermore, we derive the block error rate (BLER) for pilot-based, frequency domain, and time domain differential OFDM using non-asymptotic information-theoretic bounds. Simulation results validate the feasibility and effectiveness of adaptive differential and coherent detection.

Submitted 26 August, 2024; originally announced August 2024.

Comments: 14 pages, 8 figures, journal

arXiv:2408.14035 [pdf, other]

FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry

Authors: Chunran Zheng, Wei Xu, Zuhao Zou, Tong Hua, Chongjian Yuan, Dongjiao He, Bingyang Zhou, Zheng Liu, Jiarong Lin, Fangcheng Zhu, Yunfan Ren, Rong Wang, Fanle Meng, Fu Zhang

Abstract: This paper proposes FAST-LIVO2: a fast, direct LiDAR-inertial-visual odometry framework to achieve accurate and robust state estimation in SLAM tasks and provide great potential in real-time, onboard robotic applications. FAST-LIVO2 fuses the IMU, LiDAR and image measurements efficiently through an ESIKF. To address the dimension mismatch between the heterogeneous LiDAR and image measurements, we use a sequential update strategy in the Kalman filter. To enhance the efficiency, we use direct methods for both the visual and LiDAR fusion, where the LiDAR module registers raw points without extracting edge or plane features and the visual module minimizes direct photometric errors without extracting ORB or FAST corner features. The fusion of both visual and LiDAR measurements is based on a single unified voxel map where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points. To enhance the accuracy of image alignment, we use plane priors from the LiDAR points in the voxel map (and even refine the plane prior) and update the reference patch dynamically after new images are aligned. Furthermore, to enhance the robustness of image alignment, FAST-LIVO2 employs an on-demanding raycast operation and estimates the image exposure time in real time.
Lastly, we detail three applications of FAST-LIVO2: UAV onboard navigation demonstrating the system's computation efficiency for real-time onboard navigation, airborne mapping showcasing the system's mapping accuracy, and 3D model rendering (mesh-based and NeRF-based) underscoring the suitability of our reconstructed dense map for subsequent rendering tasks. We open source our code, dataset and application on GitHub to benefit the robotics community.

Submitted 28 August, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

Comments: 30 pages, 31 figures, due to the limitation that 'The abstract field cannot exceed 1,920 characters', the abstract presented here is shorter than the one in the PDF file

arXiv:2408.13912 [pdf, other]

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Authors: Brandon Smart, Chuanxia Zheng, Iro Laina, Victor Adrian Prisacariu

Abstract: In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we build Splatt3R upon a "foundation" 3D geometry reconstruction method, MASt3R, by extending it to deal with both 3D structure and appearance. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud's geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.

Submitted 27 August, 2024; v1 submitted 25 August, 2024; originally announced August 2024.

Comments: Our project page can be found at: https://splatt3r.active.vision/

arXiv:2408.13861 [pdf, ps, other]

Topological rigidity of closures of certain sparse unipotent orbits in finite-volume quotients of $\prod_{i=1}^k\operatorname{SL}_2(\mathbb R)$

Authors: Cheng Zheng

Abstract: We give a simple proof about the topological rigidity of closures of certain sparse unipotent orbits in $G/Γ$ where $G=\prod_{i=1}^k\operatorname{SL}_2(\mathbb R)$ and $Γ$ is an irreducible lattice in $G$.

Submitted 25 August, 2024; originally announced August 2024.

Comments: 18 pages

MSC Class: Primary 37A17; Secondary 11J99

arXiv:2408.13598 [pdf, other]

Advancing Gamma-Ray Burst Identification through Transfer Learning with Convolutional Neural Networks

Authors: Peng Zhang, Bing Li, Ren-zhou Gui, Shao-lin Xiong, Yu Wang, Yan-qiu Zhang, Chen-wei Wang, Jia-cong Liu, Wang-chen Xue, Chao Zheng, Zheng-hang Yu, Wen-long Zhang

Abstract: The rapid and accurate identification of Gamma-Ray Bursts (GRBs) is crucial for unraveling their origins. However, current burst search algorithms frequently miss low-threshold signals or lack universality for observations. In this study, we propose a novel approach utilizing a transfer learning experiment based on a convolutional neural network (CNN) to establish a universal GRB identification method, which we validate successfully using GECAM-B data. By employing data augmentation techniques, we enhance the diversity and quantity of the GRB sample. We develop a 1D CNN model with a multi-scale feature cross fusion module (MSCFM) to extract features from samples and perform classification. The comparative results demonstrate significant performance improvements following pre-training and transferring on a large-scale dataset. Our optimal model achieved an accuracy of 96.41% on the source dataset of GECAM-B, and identified three previously undiscovered GRBs by contrast with manual analysis of GECAM-B observations. The transfer learning and data augmentation methods presented in this work hold promise for applications in multi-satellite exploration scenarios characterized by limited data sets and a scarcity of labeled samples in high-energy astronomy.

Submitted 24 August, 2024; originally announced August 2024.

Comments: 17 pages, 7 figures

arXiv:2408.10519 [pdf, other]

Almost Optimal Algorithms for Token Collision in Anonymous Networks

Authors: Sirui Bai, Xinyu Fu, Xudong Wu, Penghui Yao, Chaodong Zheng

Abstract: In distributed systems, situations often arise where some nodes each hold a collection of tokens, and all nodes collectively need to determine whether all tokens are distinct. For example, if each token represents a logged-in user, the problem corresponds to checking whether there are duplicate logins. Similarly, if each token represents a data object or a timestamp, the problem corresponds to checking whether there are conflicting operations in distributed databases. In distributed computing theory, unique identifier generation is also related to this problem: each node generates one token, which is its identifier, then a verification phase is needed to ensure all identifiers are unique. In this paper, we formalize and initiate the study of token collision. In this problem, a collection of $k$ tokens, each represented by some length-$L$ bit string, is distributed to $n$ nodes of an anonymous CONGEST network in an arbitrary manner. The nodes need to determine whether there are tokens with an identical value. We present near-optimal deterministic algorithms for the token collision problem with $\tilde{O}(D+k\cdot L/\log{n})$ round complexity, where $D$ denotes the network diameter. Besides high efficiency, the prior knowledge required by our algorithms is also limited. For completeness, we further present a near-optimal randomized algorithm for token collision.

Submitted 19 August, 2024; originally announced August 2024.

Showing 101–150 of 922 results for author: Zheng, C