-
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Authors:
Wenxuan Zhu,
Bing Li,
Cheng Zheng,
Jinjie Mai,
Jun Chen,
Letian Jiang,
Abdullah Hamdi,
Sara Rojas Martinez,
Chia-Wen Lin,
Mohamed Elhoseiny,
Bernard Ghanem
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.
Submitted 22 March, 2025;
originally announced March 2025.
-
RustMap: Towards Project-Scale C-to-Rust Migration via Program Analysis and LLM
Authors:
Xuemeng Cai,
Jiakun Liu,
Xiping Huang,
Yijun Yu,
Haitao Wu,
Chunmiao Li,
Bo Wang,
Imam Nur Bani Yusuf,
Lingxiao Jiang
Abstract:
Migrating existing C programs into Rust is increasingly desired, as Rust offers superior memory safety while maintaining C's high performance. However, vastly different features between C and Rust--e.g., distinct definitions and usages of pointers and references--pose significant challenges beyond mere syntactic translation. Existing automated translation tools, such as C2Rust, may rely too much on syntactic, template-based translation and generate unsafe Rust code that is hard for human developers to read, maintain, or even compile. More semantic-aware translation that produces safer, idiomatic, and runnable Rust code is much needed. This paper introduces a novel dependency-guided and large language model (LLM)-based C-to-Rust translation approach, RustMap, based on three key ideas: (1) Utilize LLM capabilities to produce idiomatic Rust code from given small pieces of C code, (2) Mitigate LLM limitations in handling large codebases by breaking project-scale C programs into smaller units for translation according to their usage dependencies and composing them into a runnable Rust program, and (3) Enhance the correctness of the translated Rust program by using test cases to check input/output equivalence, isolate faulty code when execution states deviate, and iteratively refine the translation using feedback from compilation and test errors. We empirically evaluate RustMap on 126 real-world programs, including 125 from Rosetta Code and a 7000+ line bzip2 implementation using GPT-4o as the LLM. RustMap shows promising results, guiding GPT-4o to produce idiomatic, readable, and functional Rust code with significantly less unsafe code than other tools, and revealing non-trivial translation patterns reusable for future research.
Submitted 22 March, 2025;
originally announced March 2025.
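
The second idea in the RustMap abstract above, splitting a project into units and translating them according to their usage dependencies, can be illustrated with a topological ordering. This is only a sketch under stated assumptions: the unit names are hypothetical, and translating dependencies before the units that use them is an assumed policy that the abstract does not spell out. It is not RustMap's implementation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def translation_order(uses):
    """Return units so that every unit appears after the units it uses.

    `uses` maps a unit name to the set of unit names it depends on.
    """
    # TopologicalSorter treats the mapped values as predecessors,
    # so static_order() yields dependencies before their dependents.
    return list(TopologicalSorter(uses).static_order())


# Hypothetical usage graph for a small C project (names are illustrative).
example_uses = {
    "main": {"compress", "read_input"},
    "compress": {"bit_writer"},
    "read_input": set(),
    "bit_writer": set(),
}

order = translation_order(example_uses)
print(order)
```

With an ordering like this, each small unit could be handed to the LLM only after the units it calls already have Rust counterparts.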
-
PT-PINNs: A Parametric Engineering Turbulence Solver based on Physics-Informed Neural Networks
Authors:
Liang Jiang,
Yuzhou Cheng,
Kun Luo,
Jianren Fan
Abstract:
Physics-informed neural networks (PINNs) demonstrate promising potential in parameterized engineering turbulence optimization problems but face challenges, such as high data requirements and low computational accuracy, when applied to engineering turbulence problems. This study proposes a framework that enhances the ability of PINNs to solve parametric turbulence problems without training datasets from experiments or CFD: Parametric Turbulence PINNs (PT-PINNs). Two key methods are introduced to improve the accuracy and robustness of this framework. The first is a soft constraint method for turbulent viscosity calculation. The second is a pre-training method based on the conservation of flow rate in the flow field. The effectiveness of PT-PINNs is validated using a three-dimensional backward-facing step (BFS) turbulence problem with two varying parameters (Re = 3000-200000, ER = 1.1-1.5). PT-PINNs produce predictions that closely match experimental data and computational fluid dynamics (CFD) results across various conditions. Moreover, PT-PINNs offer a computational efficiency advantage over traditional CFD methods. The total time required to construct the parametric BFS turbulence model is 39 hours, one-sixteenth of the time required by traditional numerical methods. The inference time for a single-condition prediction is just 40 seconds, only 0.5% of a single CFD computation. These findings highlight the potential of PT-PINNs for future applications in engineering turbulence optimization problems.
Submitted 22 March, 2025;
originally announced March 2025.
-
Criteria for unbiased estimation: applications to noise-agnostic sensing and learnability of quantum channel
Authors:
Hyukgun Kwon,
Kento Tsubouchi,
Chia-Tung Chu,
Liang Jiang
Abstract:
We establish the necessary and sufficient conditions for unbiased estimation in multi-parameter estimation tasks. More specifically, we first consider quantum state estimation, where multiple parameters are encoded in a quantum state, and derive two equivalent necessary and sufficient conditions for an unbiased estimation: one formulated in terms of the quantum Fisher information matrix (QFIM) and the other based on the derivatives of the encoded state. Furthermore, we introduce a generalized quantum Cramér-Rao bound, which provides a fundamental achievable lower bound on the estimation error even when the QFIM is non-invertible. To demonstrate the utility of our framework, we consider phase estimation under unknown Pauli noise. We show that while unbiased phase estimation is infeasible with a naive scheme, employing an entangled probe with a noiseless ancilla enables unbiased estimation. Next, we extend our analysis to quantum channel estimation (equivalently, quantum channel learning), where the goal is to estimate parameters characterizing an unknown quantum channel. We establish the necessary and sufficient condition for unbiased estimation of these parameters. Notably, by interpreting unbiased estimation as learnability, our result applies to the fundamental learnability of parameters in general quantum channels. As a concrete application, we investigate the learnability of noise affecting non-Clifford gates via cycle benchmarking.
Submitted 21 March, 2025;
originally announced March 2025.
-
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Authors:
Liming Jiang,
Qing Yan,
Yumin Jia,
Zichuan Liu,
Hao Kang,
Xin Lu
Abstract:
Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
Submitted 20 March, 2025;
originally announced March 2025.
-
ALLMod: Exploring Area-Efficiency of LUT-based Large Number Modular Reduction via Hybrid Workloads
Authors:
Fangxin Liu,
Haomin Li,
Zongwu Wang,
Bo Zhang,
Mingzhe Zhang,
Shoumeng Yan,
Li Jiang,
Haibing Guan
Abstract:
Modular arithmetic, particularly modular reduction, is widely used in cryptographic applications such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). High-bit-width operations are crucial for enhancing security; however, they are computationally intensive due to the large number of modular operations required. The lookup-table-based (LUT-based) approach, a "space-for-time" technique, reduces computational load by segmenting the input number into smaller bit groups, pre-computing modular reduction results for each segment, and storing these results in LUTs. While effective, this method incurs significant hardware overhead due to extensive LUT usage. In this paper, we introduce ALLMod, a novel approach that improves the area efficiency of LUT-based large-number modular reduction by employing hybrid workloads. Inspired by the iterative method, ALLMod splits the bit groups into two distinct workloads, achieving lower area costs without compromising throughput. We first develop a template to facilitate workload splitting and ensure balanced distribution. Then, we conduct design space exploration to evaluate the optimal timing for fusing workload results, enabling us to identify the most efficient design under specific constraints. Extensive evaluations show that ALLMod achieves up to 1.65× and 3× improvements in area efficiency over conventional LUT-based methods for bit-widths of 128 and 8,192, respectively.
Submitted 27 May, 2025; v1 submitted 20 March, 2025;
originally announced March 2025.
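
The conventional LUT-based reduction that the ALLMod abstract above takes as its baseline (segment the input into bit groups, pre-compute each group's reduced contribution, sum the looked-up values) can be sketched in software as follows. This is a minimal illustration, not ALLMod's hybrid-workload design and not a hardware description; the function names, the modulus 65521, and the 4-bit group width are assumptions chosen for the example.

```python
def build_luts(modulus, total_bits, group_bits):
    """Pre-compute, for each bit group, (value << group_offset) mod modulus."""
    luts = []
    for shift in range(0, total_bits, group_bits):
        # One table per group: entry v holds the reduced weight of v at this offset.
        luts.append([(v << shift) % modulus for v in range(1 << group_bits)])
    return luts


def lut_mod_reduce(x, modulus, luts, group_bits):
    """Reduce x modulo `modulus` using one table lookup per bit group."""
    mask = (1 << group_bits) - 1
    acc = 0
    for i, table in enumerate(luts):
        group = (x >> (i * group_bits)) & mask
        acc += table[group]
    # acc is below len(luts) * modulus, so only a small final reduction remains;
    # `%` is used here for brevity in place of repeated conditional subtraction.
    return acc % modulus


# Illustrative parameters (assumptions, not taken from the paper).
MODULUS = 65521
luts = build_luts(MODULUS, total_bits=64, group_bits=4)
print(lut_mod_reduce(0x123456789ABCDEF0, MODULUS, luts, group_bits=4))
```

The table count grows with the input width divided by the group width, and each table has 2^group_bits entries, which is the storage cost the abstract refers to as hardware overhead.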
-
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
Authors:
Hengjia Li,
Lifan Jiang,
Xi Xiao,
Tianyang Wang,
Hongwei Yi,
Boxi Wu,
Deng Cai
Abstract:
Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce MagicID, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
Submitted 16 March, 2025;
originally announced March 2025.
-
Virtual purification complements quantum error correction in quantum metrology
Authors:
Hyukgun Kwon,
Changhun Oh,
Youngrong Lim,
Hyunseok Jeong,
Seung-Woo Lee,
Liang Jiang
Abstract:
A practical realization of quantum metrology, enhancing the sensitivity of parameter estimation beyond the classical limit, is significantly hindered by the effect of noise. To tackle this challenge, quantum error correction (QEC) has been considered; however, noise that is indistinguishable from the signal and the bias induced by unknown noise prevent it from recovering the enhanced precision in practice. Meanwhile, virtual purification (VP), an error mitigation technique, has been recently shown to mitigate the bias induced by noise in quantum metrology. In this work, we comparatively analyze the performance of QEC and VP in a realistic quantum metrology scenario. We show that while an ideal QEC setup fails to correct noise that is indistinguishable from the signal and induces bias, VP can mitigate such indistinguishable noise and bias, resulting in more accurate estimations. We then demonstrate that VP with a stabilizer state probe in the 5-qubit GHZ state and the 7-qubit Steane code state can efficiently suppress the bias under local depolarizing noise. Our result highlights that VP along with encoded probe states can effectively suppress the effect of noise in realistic setups, where error distinguishability poses significant challenges.
Submitted 16 March, 2025;
originally announced March 2025.
-
Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution
Authors:
Zhi Chen,
Wei Ma,
Lingxiao Jiang
Abstract:
AI-driven software development has rapidly advanced with the emergence of software development agents that leverage large language models (LLMs) to tackle complex, repository-level software engineering tasks. These agents go beyond just generation of final code; they engage in multi-step reasoning, utilize various tools for code modification and debugging, and interact with execution environments to diagnose and iteratively resolve issues. However, most existing evaluations focus primarily on static analyses of final code outputs, yielding limited insights into the agents' dynamic problem-solving processes. To fill this gap, we conduct an in-depth empirical study on 3,977 solving-phase trajectories and 3,931 testing-phase logs from 8 top-ranked agents evaluated on 500 GitHub issues in the SWE-Bench benchmark. Our exploratory analysis shows that Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues (e.g., IntegrityError) that demand significantly more debugging effort. Furthermore, we have discovered 3 bugs in the SWE-Bench platform that affect benchmark fairness and accuracy; these issues have been reported to and confirmed by the maintainers. To promote transparency and foster future research, we publicly share our datasets and analysis scripts.
Submitted 19 March, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
Fast Sideband Control of a Weakly Coupled Multimode Bosonic Memory
Authors:
Jordan Huang,
Thomas J. DiNapoli,
Gavin Rockwood,
Ming Yuan,
Prathyankara Narasimhan,
Eesh Gupta,
Mustafa Bal,
Francesco Crisa,
Sabrina Garattoni,
Yao Lu,
Liang Jiang,
Srivatsan Chakram
Abstract:
Circuit quantum electrodynamics (cQED) with superconducting cavities coupled to nonlinear circuits like transmons offers a promising platform for hardware-efficient quantum information processing. We address critical challenges in realizing this architecture by weakening the dispersive coupling while also demonstrating fast, high-fidelity multimode control by dynamically amplifying gate speeds through transmon-mediated sideband interactions. This approach enables transmon-cavity SWAP gates, for which we achieve speeds up to 30 times larger than the bare dispersive coupling. Combined with transmon rotations, this allows for efficient, universal state preparation in a single cavity mode, though achieving unitary gates and extending control to multiple modes remains a challenge. In this work, we overcome this by introducing two sideband control strategies: (1) a shelving technique that prevents unwanted transitions by temporarily storing populations in sideband-transparent transmon states and (2) a method that exploits the dispersive shift to synchronize sideband transition rates across chosen photon-number pairs to implement transmon-cavity SWAP gates that are selective on photon number. We leverage these protocols to prepare Fock and binomial code states across any of ten modes of a multimode cavity with millisecond cavity coherence times. We demonstrate the encoding of a qubit from a transmon into arbitrary vacuum and Fock state superpositions, as well as entangled NOON states of cavity mode pairs, a scheme extendable to arbitrary multimode Fock encodings. Furthermore, we implement a new binomial encoding gate that converts arbitrary transmon superpositions into binomial code states in 4 μs (less than $1/\chi$), achieving an average post-selected final state fidelity of 96.3% across different fiducial input states.
Submitted 13 March, 2025;
originally announced March 2025.
-
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Authors:
Hao He,
Ceyuan Yang,
Shanchuan Lin,
Yinghao Xu,
Meng Wei,
Liangke Gui,
Qi Zhao,
Gordon Wetzstein,
Lu Jiang,
Hongsheng Li
Abstract:
This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and a limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within individual video clips, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve the dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.
Submitted 13 March, 2025;
originally announced March 2025.
-
Long Context Tuning for Video Generation
Authors:
Yuwei Guo,
Ceyuan Yang,
Ziyan Yang,
Zhibei Ma,
Zhijie Lin,
Zhenheng Yang,
Dahua Lin,
Lu Jiang
Abstract:
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
Submitted 13 March, 2025;
originally announced March 2025.
-
Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild
Authors:
Damien Teney,
Liangze Jiang,
Florin Gogianu,
Ehsan Abbasnejad
Abstract:
Neural architectures tend to fit their data with relatively simple functions. This "simplicity bias" is widely regarded as key to their success. This paper explores the limits of this principle. Building on recent findings that the simplicity bias stems from ReLU activations [96], we introduce a method to meta-learn new activation functions and inductive biases better suited to specific tasks.
Findings: We identify multiple tasks where the simplicity bias is inadequate and ReLUs suboptimal. In these cases, we learn new activation functions that perform better by inducing a prior of higher complexity. Interestingly, these cases correspond to domains where neural networks have historically struggled: tabular data, regression tasks, cases of shortcut learning, and algorithmic grokking tasks. In comparison, the simplicity bias induced by ReLUs proves adequate on image tasks where the best learned activations are nearly identical to ReLUs and GeLUs.
Implications: Contrary to popular belief, the simplicity bias of ReLU networks is not universally useful. It is near-optimal for image classification, but other inductive biases are sometimes preferable. We showed that activation functions can control these inductive biases, but future tailored architectures might provide further benefits. Advances are still needed to characterize a model's inductive biases beyond "complexity", and their adequacy with the data.
Submitted 13 March, 2025;
originally announced March 2025.
-
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Authors:
Yuanxin Liu,
Rui Zhu,
Shuhuai Ren,
Jiacong Wang,
Haoyuan Guo,
Xu Sun,
Lu Jiang
Abstract:
With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at https://github.com/bytedance/UVE.
Submitted 21 March, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention
Authors:
Yuhan Wang,
Fangzhou Hong,
Shuai Yang,
Liming Jiang,
Wayne Wu,
Chen Change Loy
Abstract:
Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.
Submitted 11 March, 2025;
originally announced March 2025.
-
JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
Authors:
Runjian Chen,
Wenqi Shao,
Bo Zhang,
Shaoshuai Shi,
Li Jiang,
Ping Luo
Abstract:
Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real-world data is notoriously time- and energy-consuming to annotate and lacks corner cases like rare traffic participants. In contrast, in simulators like CARLA, generating labeled LiDAR point clouds with corner cases is a piece of cake. However, introducing synthetic point clouds to improve real perception is non-trivial. This stems from two challenges: 1) the sample efficiency of simulation datasets and 2) simulation-to-real gaps. To overcome both challenges, we propose a plug-and-play method called JiSAM, shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. In extensive experiments conducted on the famous AD dataset NuScenes, we demonstrate that, with a SOTA 3D object detector, JiSAM is able to use the simulation data and labels on only 2.5% of the available real data to achieve comparable performance to models trained on all real data. Additionally, JiSAM achieves more than 15 mAPs on the objects not labeled in the real training set. We will release models and code.
Submitted 13 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Balanced Image Stylization with Style Matching Score
Authors:
Yuxin Jiang,
Liming Jiang,
Shuai Yang,
Jia-Wei Liu,
Ivor Tsang,
Mike Zheng Shou
Abstract:
We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.
Submitted 10 March, 2025;
originally announced March 2025.
-
Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies
Authors:
Luyi Jiang,
Jiayuan Chen,
Lu Lu,
Xinwei Peng,
Lihao Liu,
Junjun He,
Jie Xu
Abstract:
The evaluation and improvement of medical large language models (LLMs) are critical for their real-world deployment, particularly in ensuring accuracy, safety, and ethical alignment. Existing frameworks inadequately dissect domain-specific error patterns or address cross-modal challenges. This study introduces a granular error taxonomy through systematic analysis of the top 10 models on MedBench, categorizing incorrect responses into eight types: Omissions, Hallucination, Format Mismatch, Causal Reasoning Deficiency, Contextual Inconsistency, Unanswered, Output Error, and Deficiency in Medical Language Generation. Evaluation of 10 leading models reveals vulnerabilities: despite achieving 0.86 accuracy in medical knowledge recall, critical reasoning tasks show 96.3% omission, while safety ethics evaluations expose alarming inconsistency (robustness score: 0.79) under option shuffling. Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning. To address these, we propose a tiered optimization strategy spanning four levels, from prompt engineering and knowledge-augmented retrieval to hybrid neuro-symbolic architectures and causal reasoning frameworks. This work establishes an actionable roadmap for developing clinically robust LLMs while redefining evaluation paradigms through error-driven insights, ultimately advancing the safety and trustworthiness of AI in high-stakes medical environments.
Submitted 10 March, 2025;
originally announced March 2025.
-
Political Neutrality in AI Is Impossible- But Here Is How to Approximate It
Authors:
Jillian Fisher,
Ruth E. Appel,
Chan Young Park,
Yujin Potter,
Liwei Jiang,
Taylor Sorensen,
Shangbin Feng,
Yulia Tsvetkov,
Margaret E. Roberts,
Jennifer Pan,
Dawn Song,
Yejin Choi
Abstract:
AI systems often exhibit political bias, influencing users' opinions and decisions. While political neutrality, defined as the absence of bias, is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two concrete applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models.
Submitted 3 June, 2025; v1 submitted 18 February, 2025;
originally announced March 2025.
-
Implementation of a quantum addressable router using superconducting qubits
Authors:
Connie Miao,
Sébastien Léger,
Ziqian Li,
Gideon Lee,
Liang Jiang,
David I. Schuster
Abstract:
The implementation of a quantum router capable of performing both quantum signal routing and quantum addressing (a Q2-router) represents a key step toward building quantum networks and quantum random access memories. We realize a Q2-router that uses fixed-frequency transmon qubits to implement a routing protocol based on two native controlled-iSWAP gates. These gates leverage a large ZZ interaction to selectively route information according to a quantum address. We find an estimated average routing fidelity of 95.3%, with errors arising primarily from decoherence or state preparation and measurement. We present a comprehensive calibration and characterization of both the c-iSWAP gates and the overall routing protocol through randomized benchmarking techniques and state tomography.
Submitted 3 April, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
Compact and fully functional high-frequency sine wave gating InGaAs/InP single-photon detector module
Authors:
Qi Xu,
Chao Yu,
Dajian Cui,
Xuan-Yi Zhang,
Wei Chen,
Yu-Qiang Fang,
Lianjun Jiang,
Qixia Tong,
Jianglin Zhao,
Jun Zhang
Abstract:
High-frequency sine wave gating (SWG) InGaAs/InP single-photon detectors (SPDs) are widely used for synchronous near-infrared single-photon detection. For practical use, the size of SPD is one of the most concerning features for system integration. Here we present, to the best of our knowledge, the most compact and fully functional high-frequency SWG InGaAs/InP SPD. We develop a sine wave gating integrated circuit (SWGIC) using system-in-package technology that supports functions including large amplitude sine wave gate generation, coincidence gate generation, phase regulation, amplitude monitoring, and amplitude modulation. Moreover, we design and fabricate a high-performance multi-mode fiber coupled InGaAs/InP single-photon avalanche diode (SPAD) with a compact butterfly package. Furthermore, we implement a monolithically integrated readout circuit (MIRC) to extract the weak avalanche signal from large capacitance response of SWG. Finally, the SWGIC, SPAD, MIRC, and the affiliated circuits are integrated into a single module with a size of 6 cm x 5.7 cm x 1.7 cm. After characterization, the SPD module exhibits a photon detection efficiency of 40%, a dark count rate of 9 kcps, and an afterpulse probability of 4.6% at an operation temperature of 238 K and a hold-off time of 160 ns. Our work provides a practical solution for applications necessitating highly integrated near-infrared single-photon detection.
Submitted 6 March, 2025;
originally announced March 2025.
-
BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions
Authors:
Chi Hang,
Ruiqi Deng,
Lavender Yao Jiang,
Zihao Yang,
Anton Alyakin,
Daniel Alber,
Eric Karl Oermann
Abstract:
Clinical measurements such as blood pressures and respiration rates are critical in diagnosing and monitoring patient outcomes. They are an important component of biomedical data, which can be used to train transformer-based language models (LMs) for improving healthcare delivery. It is, however, unclear whether LMs can effectively interpret and use clinical measurements. We investigate two questions: First, can LMs effectively leverage clinical measurements to answer related medical questions? Second, how can an LM's performance be enhanced on medical question-answering (QA) tasks that involve measurements? We performed a case study on blood pressure readings (BPs), a vital sign routinely monitored by medical professionals. We evaluated the performance of four LMs: BERT, BioBERT, MedAlpaca, and GPT-3.5, on our newly developed dataset, BPQA (Blood Pressure Question Answering). BPQA contains 100 medical QA pairs that were verified by medical students and designed to rely on BPs. We found that GPT-3.5 and MedAlpaca (larger and medium sized LMs) benefit more from the inclusion of BPs than BERT and BioBERT (small sized LMs). Further, augmenting measurements with labels improves the performance of BioBERT and MedAlpaca (domain specific LMs), suggesting that retrieval may be useful for improving domain-specific LMs.
Submitted 6 March, 2025;
originally announced March 2025.
-
Acoustic phonon phase gates with number-resolving phonon detection
Authors:
Hong Qiao,
Zhaoyou Wang,
Gustav Andersson,
Alexander Anferov,
Christopher R. Conner,
Yash J. Joshi,
Shiheng Li,
Jacob M. Miller,
Xuntao Wu,
Haoxiong Yan,
Liang Jiang,
Andrew N. Cleland
Abstract:
Linear optical quantum computing (LOQC) provides a compelling approach to quantum information processing, with a short list of physical requirements; however, experimental implementations have faced significant challenges. Itinerant phonons in quantum acoustics, combined with superconducting qubits, offer a compelling alternative to the quantum optics approach. Here we demonstrate key advances in the ability to manipulate and measure acoustic phonon quantum states: First, we demonstrate deterministic phase control of itinerant one- and two-phonon qubit states, measured using an acoustic Mach-Zehnder interferometer. We implement phonon phase control using the frequency-dependent scattering of phonon states from a superconducting transmon qubit. The acoustic interferometer used to measure the resulting phonon phase achieves a noise-floor-limited Hong-Ou-Mandel (HOM) interference visibility of 98.1%, representing a significant improvement over our previous demonstration. Additionally, we propose and implement a multi-phonon detection scheme that enables coherent conversion between itinerant one- and two-phonon Fock states and transmon qutrit states, transforming for example the Hong-Ou-Mandel two-phonon entangled output state $|02\rangle - |20\rangle$ into the entangled state of two transmons. The tight integration of quantum acoustics with superconducting circuits native to our implementation promises further advances, including deterministic phonon quantum gates with direct applications to quantum computing.
Submitted 5 March, 2025;
originally announced March 2025.
-
Medium-band Astrophysics with the Grism of NIRCam In Frontier fields (MAGNIF): Spectroscopic Census of H$α$ Luminosity Functions and Cosmic Star Formation at $z\sim 4.5$ and 6.3
Authors:
Shuqi Fu,
Fengwu Sun,
Linhua Jiang,
Xiaojing Lin,
Jose M. Diego,
Lukas J. Furtak,
Mathilde Jauzac,
Anton M. Koekemoer,
Mingyu Li,
Masamune Oguri,
Nency R. Patel,
Christopher N. A. Willmer,
Rogier A. Windhorst,
Adi Zitrin,
Franz E. Bauer,
Chian-Chou Chen,
Wenlei Chen,
Cheng Cheng,
Christopher J. Conselice,
Daniel J. Eisenstein,
Eiichi Egami,
Daniel Espada,
Xiaohui Fan,
Seiji Fujimoto,
Tiger Yu-Yang Hsiao
, et al. (13 additional authors not shown)
Abstract:
We measure H$α$ luminosity functions (LFs) at redshifts $z \sim 4.5$ and 6.3 using the JWST MAGNIF (Medium-band Astrophysics with the Grism of NIRCam In Frontier fields) survey. MAGNIF obtained NIRCam grism spectra with the F360M and F480M filters in four Frontier Fields. We identify 248 H$α$ emitters based on the grism spectra and photometric redshifts from combined HST and JWST imaging data. The numbers of the H$α$ emitters show a large field-to-field variation, highlighting the necessity of multiple fields to mitigate cosmic variance. We calculate both observed and dust-corrected H$α$ LFs in the two redshift bins. Thanks to the gravitational lensing, the measured H$α$ LFs span three orders of magnitude in luminosity, and the faint-end luminosity reaches $L_{\mathrm{H}α} \sim 10^{40.3} \mathrm{erg} \mathrm{s}^{-1}$ at $z \sim 4.5$ and $10^{41.5} \mathrm{erg} \mathrm{s}^{-1}$ at $z \sim 6.3$, corresponding to star-formation rates (SFRs) of $\sim$ 0.1 and 1.7 $\mathrm{M}_\odot \mathrm{yr}^{-1}$. We find no or only weak redshift evolution of the faint-end slope of the H$α$ LF across $z\simeq0.4-6.3$, and the comparison with the faint-end slopes of the UV LF indicates a stochastic star formation history among low-mass H$α$ emitters. The derived cosmic SFR densities are $0.058^{+0.008}_{-0.006}\ \ M_\odot\ \mathrm{yr}^{-1}\ \mathrm{Mpc}^{-3}$ at $z \sim 4.5$ and $0.025^{+0.009}_{-0.007}\ \ M_\odot\ \mathrm{yr}^{-1}\ \mathrm{Mpc}^{-3}$ at $z \sim 6.3$. These are approximately 2.2 times higher than previous estimates based on dust-corrected UV LFs, but consistent with recent measurements from infrared surveys. We discuss uncertainties in the H$α$ LF measurements, including those propagated from the lens models, cosmic variance, and AGN contribution.
Submitted 5 March, 2025;
originally announced March 2025.
-
Efficient quantum tomography of a polynomial subspace
Authors:
Yat Wong,
Ming Yuan,
Kevin He,
Srivatsan Chakram,
Alireza Seif,
David I. Schuster,
Liang Jiang
Abstract:
Quantum tomography is crucial for characterizing the quantum states of multipartite systems, but its practicality is often limited by the exponentially large dimension of the Hilbert space. Most existing approaches, such as compressed sensing and tensor network-based tomography, impose structural constraints on the state to enable more resource-efficient characterization. However, not all physical states can be well-approximated with highly structured states. Here, we develop a partial quantum tomography method based on direct fidelity estimation (DFE) that focuses on a neighborhood subspace -- the subspace spanned by states physically close to a given target state. Using this generalized DFE method, we estimate elements of the density operator within this subspace in a self-verifying manner. We investigate the efficiency of this approach under different sets of available measurements for various states and find that the set of available measurements significantly impacts the cost of DFE. For example, we show that Pauli measurements alone are insufficient for performing efficient DFE on all product states, whereas the full set of product measurements is sufficient. This method can be applied in many situations, including characterizing quantum systems with confined dynamics and verifying preparations of quantum states and processes.
Submitted 28 February, 2025;
originally announced March 2025.
-
PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
Authors:
Fangxu Yu,
Lai Jiang,
Shenyi Huang,
Zhen Wu,
Xinyu Dai
Abstract:
The ability to understand and predict the mental states of oneself and others, known as the Theory of Mind (ToM), is crucial for effective social scenarios. Although recent studies have evaluated ToM in Large Language Models (LLMs), existing benchmarks focus on simplified settings (e.g., Sally-Anne-style tasks) and overlook the complexity of real-world social interactions. To mitigate this gap, we propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues. Our framework contains two core tasks: ToM Reasoning, which tests tracking of evolving desires, beliefs, and intentions; and ToM Application, which assesses the use of inferred mental states to predict and evaluate persuasion strategies. Experiments across eight leading LLMs reveal that while models excel on multiple questions, they struggle with the tasks that need tracking the dynamics and shifts of mental states and understanding the mental states in the whole dialogue comprehensively. Our aim with PersuasiveToM is to allow an effective evaluation of the ToM reasoning ability of LLMs with more focus on complex psychological activities. Our code is available at https://github.com/Yu-Fangxu/PersuasiveToM.
Submitted 25 May, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
Digital Player: Evaluating Large Language Models based Human-like Agent in Games
Authors:
Jiawei Wang,
Kai Wang,
Shaojie Lin,
Runze Wu,
Bihan Xu,
Lingeng Jiang,
Shiwei Zhao,
Renyu Zhu,
Haoyu Liu,
Zhipeng Hu,
Zhong Fan,
Le Li,
Tangjie Lyu,
Changjie Fan
Abstract:
With the rapid advancement of Large Language Models (LLMs), LLM-based autonomous agents have shown the potential to function as digital employees, such as digital analysts, teachers, and programmers. In this paper, we develop an application-level testbed based on the open-source strategy game "Unciv", which has millions of active players, to enable researchers to build a "data flywheel" for studying human-like agents in the "digital players" task. This "Civilization"-like game features expansive decision-making spaces along with rich linguistic interactions such as diplomatic negotiations and acts of deception, posing significant challenges for LLM-based agents in terms of numerical reasoning and long-term planning. Another challenge for "digital players" is to generate human-like responses for social interaction, collaboration, and negotiation with human players. The open-source project can be found at https://github.com/fuxiAIlab/CivAgent.
Submitted 28 February, 2025;
originally announced February 2025.
-
On the Extremely X-ray Variable Active Galactic Nuclei in the XMM-LSS Field
Authors:
Zijian Zhang,
Bin Luo,
Linhua Jiang,
W. N. Brandt,
Jian Huang,
Qingling Ni
Abstract:
We present a systematic investigation of extremely X-ray variable active galactic nuclei (AGNs) in the $\approx 5.3~{\rm deg}^2$ XMM-SERVS XMM-LSS region. Eight variable AGNs are identified with rest-frame 2 keV flux density variability amplitudes around 6-12. We comprehensively analyze the X-ray and multiwavelength data to probe the origin of their extreme X-ray variability. It is found that their extreme X-ray variability can be ascribed to changing accretion state or changing obscuration from dust-free absorbers. For five AGNs, their X-ray variability is attributed to changing accretion state, supported by contemporaneous multiwavelength variability and the absence of X-ray absorption in the low-state spectra. With new Multiple Mirror Telescope (MMT) spectra for four of these sources, we confirm one changing-look AGN. One MMT AGN lacks multi-epoch spectroscopic observations, while the other two AGNs do not exhibit changing-look behavior, likely because the MMT observations did not capture their high states. The X-ray variability of the other three AGNs is explained by changing obscuration, and they show only mild long-term optical/IR variability. The absorbers of these sources are likely clumpy accretion-disk winds, with variable column densities and covering factors along the lines of sight.
Submitted 27 February, 2025;
originally announced February 2025.
-
Lightweight Contrastive Distilled Hashing for Online Cross-modal Retrieval
Authors:
Jiaxing Li,
Lin Jiang,
Zeqi Ma,
Kaihang Jiang,
Xiaozhao Fang,
Jie Wen
Abstract:
Deep online cross-modal hashing has recently gained much attention from researchers, owing to its promising applications, low storage requirements, fast retrieval efficiency, and cross-modal adaptivity. However, there still exist some technical hurdles that hinder its applications, e.g., 1) how to extract the coexistent semantic relevance of cross-modal data, 2) how to achieve competitive performance when handling real-time data streams, and 3) how to transfer the knowledge learned from offline to online training in a lightweight manner. To address these problems, this paper proposes lightweight contrastive distilled hashing (LCDH) for cross-modal retrieval, which bridges offline and online cross-modal hashing by similarity matrix approximation in a knowledge distillation framework. Specifically, in the teacher network, LCDH first extracts the cross-modal features by contrastive language-image pre-training (CLIP), which are further fed into an attention module for representation enhancement after feature fusion. Then, the output of the attention module is fed into an FC layer to obtain hash codes for aligning the sizes of similarity matrices for online and offline training. In the student network, LCDH extracts the visual and textual features by lightweight models, and then the features are fed into an FC layer to generate binary codes. Finally, by approximating the similarity matrices, the performance of online hashing in the lightweight student network can be enhanced by the supervision of coexistent semantic relevance that is distilled from the teacher network. Experimental results on three widely used datasets demonstrate that LCDH outperforms some state-of-the-art methods.
Submitted 27 February, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
Hierarchical Semantic Compression for Consistent Image Semantic Restoration
Authors:
Shengxi Li,
Zifu Zhang,
Mai Xu,
Lai Jiang,
Yufan Liu,
Ce Zhu
Abstract:
The emerging semantic compression has been receiving increasing research efforts most recently, capable of achieving high fidelity restoration during compression, even at extremely low bitrates. However, existing semantic compression methods typically combine standard pipelines with either pre-defined or high-dimensional semantics, thus suffering from deficiency in compression. To address this issue, we propose a novel hierarchical semantic compression (HSC) framework that purely operates within intrinsic semantic spaces from generative models, which is able to achieve efficient compression for consistent semantic restoration. More specifically, we first analyse the entropy models for the semantic compression, which motivates us to employ a hierarchical architecture based on a newly developed general inversion encoder. Then, we propose the feature compression network (FCN) and semantic compression network (SCN), such that the middle-level semantic feature and core semantics are hierarchically compressed to restore both accuracy and consistency of image semantics, via an entropy model progressively shared by channel-wise context. Experimental results demonstrate that the proposed HSC framework achieves the state-of-the-art performance on subjective quality and consistency for human vision, together with superior performances on machine vision tasks given compressed bitstreams. This essentially coincides with human visual system in understanding images, thus providing a new framework for future image/video compression paradigms. Our code shall be released upon acceptance.
Submitted 23 February, 2025;
originally announced February 2025.
-
CipherPrune: Efficient and Scalable Private Transformer Inference
Authors:
Yancheng Zhang,
Jiaqi Xue,
Mengxin Zheng,
Mimi Xie,
Mingzhe Zhang,
Lei Jiang,
Qian Lou
Abstract:
Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning; however, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs (scalability issues). We observe that the Transformer's operational complexity scales quadratically with the number of input tokens, making it essential to reduce the input token length. Notably, each token varies in importance, and many inputs contain redundant tokens. Additionally, prior private inference methods that rely on high-degree polynomial approximations for non-linear activations are computationally expensive. Therefore, reducing the polynomial degree for less important tokens can significantly accelerate private inference. Building on these observations, we propose CipherPrune, an efficient and scalable private inference framework that includes a secure encrypted token pruning protocol, a polynomial reduction protocol, and corresponding Transformer network optimizations. At the protocol level, encrypted token pruning adaptively removes unimportant tokens from encrypted inputs in a progressive, layer-wise manner. Additionally, encrypted polynomial reduction assigns lower-degree polynomials to less important tokens after pruning, enhancing efficiency without decryption. At the network level, we introduce protocol-aware network optimization via a gradient-based search to maximize pruning thresholds and polynomial reduction conditions while maintaining the desired accuracy. Our experiments demonstrate that CipherPrune reduces the execution overhead of private Transformer inference by approximately $6.1\times$ for 128-token inputs and $10.6\times$ for 512-token inputs, compared to previous methods, with only a marginal drop in accuracy. The code is publicly available at https://github.com/UCF-Lou-Lab-PET/cipher-prune-inference.
Submitted 5 March, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
Diagnosing Moral Reasoning Acquisition in Language Models: Pragmatics and Generalization
Authors:
Guangliang Liu,
Lei Jiang,
Xitong Zhang,
Kristen Marie Johnson
Abstract:
Ensuring that Large Language Models (LLMs) return just responses which adhere to societal values is crucial for their broader application. Prior research has shown that LLMs often fail to perform satisfactorily on tasks requiring moral cognizance, such as ethics-based judgments. While current approaches have focused on fine-tuning LLMs with curated datasets to improve their capabilities on such tasks, choosing the optimal learning paradigm to enhance the ethical responses of LLMs remains an open research debate. In this work, we aim to address this fundamental question: can current learning paradigms enable LLMs to acquire sufficient moral reasoning capabilities? Drawing from distributional semantics theory and the pragmatic nature of moral discourse, our analysis indicates that performance improvements follow a mechanism similar to that of semantic-level tasks, and therefore remain affected by the pragmatic nature of morals latent in discourse, a phenomenon we name the pragmatic dilemma. We conclude that this pragmatic dilemma imposes significant limitations on the generalization ability of current learning paradigms, making it the primary bottleneck for moral reasoning acquisition in LLMs.
Submitted 6 March, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
H$α$ Variability of AB Aur b with the Hubble Space Telescope: Probing the Nature of a Protoplanet Candidate with Accretion Light Echoes
Authors:
Brendan P. Bowler,
Yifan Zhou,
Lauren I. Biddle,
Lillian Yushu Jiang,
Jaehan Bae,
Laird M. Close,
Katherine B. Follette,
Kyle Franson,
Adam L. Kraus,
Aniket Sanghi,
Quang Tran,
Kimberly Ward-Duong,
Ya-Lin Wu,
Zhaohuan Zhu
Abstract:
Giant planets generate accretion luminosity as they form. Much of this energy is radiated in strong H$α$ line emission, which has motivated direct imaging surveys at optical wavelengths to search for accreting protoplanets. However, compact disk structures can mimic accreting planets by scattering emission from the host star. This can complicate the interpretation of H$α$ point sources, especially if the host star itself is accreting. We describe an approach to distinguish accreting protoplanets from scattered-light disk features using "accretion light echoes." This method relies on variable H$α$ emission from a stochastically accreting host star to search for a delayed brightness correlation with a candidate protoplanet. We apply this method to the candidate protoplanet AB Aur b with a dedicated Hubble Space Telescope Wide Field Camera 3 program designed to sequentially sample the host star and the candidate planet in H$α$ while accounting for the light travel time delay and orbital geometry of the source within the protoplanetary disk. Across five epochs spanning 14 months, AB Aur b is over 20 times more variable than its host star; AB Aur's H$α$ emission changes by 15% while AB Aur b varies by 330%. These brightness changes are not correlated, which rules out unobstructed scattered starlight from the host star as the only source of AB Aur b's H$α$ emission and is consistent with tracing emission from an independently accreting protoplanet, inner disk shadowing effects, or a physically evolving compact disk structure. More broadly, accretion light echoes offer a novel tool to explore the nature of protoplanet candidates with well-timed observations of the host star prior to deep imaging in H$α$.
Submitted 20 February, 2025;
originally announced February 2025.
-
Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
Authors:
Xiaojie Xu,
Zongyuan Li,
Chang Lu,
Runnan Qi,
Yanan Ni,
Lumin Jiang,
Xiangbei Liu,
Xuebo Zhang,
Yongchun Fang,
Kuihua Huang,
Xian Guo,
Zhanghua Wu,
Zhenya Li
Abstract:
StarCraft II is a complex and dynamic real-time strategy (RTS) game environment that is well suited to artificial intelligence and reinforcement learning research. To address the problem of Large Language Model (LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes (ROE) framework based on expert experience and self-experience. The framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the previous experience to obtain new self-experience. In our experiments, the method beat the robot under the Very Hard difficulty in TextStarCraft II. We analyze the LLM's data during gameplay in detail and verify the method's effectiveness.
Submitted 18 February, 2025;
originally announced February 2025.
-
Hierarchical Expert Prompt for Large-Language-Model: An Approach Defeat Elite AI in TextStarCraft II for the First Time
Authors:
Zongyuan Li,
Chang Lu,
Xiaojie Xu,
Runnan Qi,
Yanan Ni,
Lumin Jiang,
Xiangbei Liu,
Xuebo Zhang,
Yongchun Fang,
Kuihua Huang,
Xian Guo
Abstract:
Since the emergence of Large Language Models (LLMs), they have been widely used in fields such as writing, translation, and search. However, there is still great potential for LLM-based methods in handling complex tasks such as decision-making in the StarCraft II environment. To address problems such as lack of relevant knowledge and poor control over subtasks of varying importance, we propose a Hierarchical Expert Prompt (HEP) for LLMs. Our method improves the understanding of game situations through expert-level tactical knowledge and improves the processing quality of tasks of varying importance through a hierarchical framework. Our approach defeated the highest level (Elite) standard built-in agent in TextStarCraft II for the first time and consistently outperformed the baseline method at other difficulties. Our experiments suggest that the proposed method is a practical solution for tackling complex decision-making challenges. The replay video can be viewed at https://www.bilibili.com/video/BV1uz42187EF and https://youtu.be/dO3PshWLV5M, and our code is open-sourced at https://github.com/luchang1113/HEP-LLM-play-StarCraftII.
Submitted 16 February, 2025;
originally announced February 2025.
-
Nonreciprocal routing induced by chirality in an atom-dimer waveguide-QED system
Authors:
Shi-Yu Liu,
Lin-Lin Jiang,
Hai Zhu,
Jie-Qiao Liao,
Jin-Feng Huang
Abstract:
The implementation of quantum routers is an important and desired task in quantum information science, since quantum routers are important components of quantum networks. Here, we propose a scheme for implementing single-photon routers in a waveguide-QED system, which consists of two coupled two-level atoms coupled to two waveguides to form a four-port quantum device. We obtain the exact analytical expressions of the single-photon scattering amplitudes using the real-space method. By neglecting or including the propagation time of photons between the two coupling points, we consider the system working in the Markovian and non-Markovian regimes, respectively. In addition, we introduce chiral coupling, which breaks the symmetry of the waveguide model, to manipulate the transmission of single photons. We find that when the system works in the non-Markovian regime, the single photon can be transmitted on demand by adjusting the asymmetry coefficient. More interestingly, complete single-photon routing in this device does not require ideal chiral coupling, which relaxes the photon transport conditions. This work will motivate studies of nonreciprocal and chiral quantum devices in the waveguide-QED platform.
Submitted 14 February, 2025;
originally announced February 2025.
-
Janus: Collaborative Vision Transformer Under Dynamic Network Environment
Authors:
Linyi Jiang,
Silvery D. Fu,
Yifei Zhu,
Bo Li
Abstract:
Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Network architectures and achieved state-of-the-art results in various computer vision tasks. Since ViTs are computationally expensive, the models either have to be pruned to run on resource-limited edge devices only or have to be executed on remote cloud servers after receiving the raw data transmitted over fluctuating networks. The resulting degraded performance or high latency all hinder their widespread applications. In this paper, we present Janus, the first framework for low-latency cloud-device collaborative Vision Transformer inference over dynamic networks. Janus overcomes the intrinsic model limitations of ViTs and realizes collaboratively executing ViT models on both cloud and edge devices, achieving low latency, high accuracy, and low communication overhead. Specifically, Janus judiciously combines token pruning techniques with a carefully designed fine-to-coarse model splitting policy and non-static mixed pruning policy. It attains a balance between accuracy and latency by dynamically selecting the optimal pruning level and split point. Experimental results across various tasks demonstrate that Janus enhances throughput by up to 5.15 times and reduces latency violation ratios by up to 98.7% when compared with baseline approaches under various network environments.
Submitted 14 February, 2025;
originally announced February 2025.
-
Constant-Overhead Fault-Tolerant Bell-Pair Distillation using High-Rate Codes
Authors:
J. Pablo Bonilla Ataides,
Hengyun Zhou,
Qian Xu,
Gefen Baranes,
Bikun Li,
Mikhail D. Lukin,
Liang Jiang
Abstract:
We present a fault-tolerant Bell-pair distillation scheme achieving constant overhead through high-rate quantum low-density parity-check (qLDPC) codes. Our approach maintains a constant distillation rate equal to the code rate - as high as $1/3$ in our implementations - while requiring no additional overhead beyond the physical qubits of the code. Full circuit-level analysis demonstrates fault-tolerance for input Bell pair infidelities below a threshold $\sim 5\%$, readily achievable with near-term capabilities. Unlike previous proposals, our scheme keeps the output Bell pairs encoded in qLDPC codes at each node, eliminating decoding overhead and enabling direct use in distributed quantum applications through recent advances in qLDPC computation. These results establish qLDPC-based distillation as a practical route toward resource-efficient quantum networks and distributed quantum computing.
Submitted 13 February, 2025;
originally announced February 2025.
-
Automatic Pruning via Structured Lasso with Class-wise Information
Authors:
Xiang Liu,
Mingchen Li,
Xia Li,
Leigang Qu,
Zifan Peng,
Yijun Song,
Zemin Liu,
Linshan Jiang,
Jialin Li
Abstract:
Most pruning methods concentrate on unimportant filters of neural networks. However, they face the loss of statistical information due to a lack of consideration for class-wise data. In this paper, from the perspective of leveraging precise class-wise information for model pruning, we utilize structured lasso with guidance from Information Bottleneck theory. Our approach ensures that statistical information is retained during the pruning process. With these techniques, we introduce two innovative adaptive network pruning schemes: sparse graph-structured lasso pruning with Information Bottleneck (\textbf{sGLP-IB}) and sparse tree-guided lasso pruning with Information Bottleneck (\textbf{sTLP-IB}). The key aspect is pruning model filters using sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to multiple state-of-the-art methods, our approaches demonstrate superior performance across three datasets and six model architectures in extensive experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain an accuracy of 94.10% (0.14% higher than the original model); we reduce the parameters by 55% with the accuracy at 76.12% using the ResNet architecture on ImageNet (only drops 0.03%). In summary, we successfully reduce model size and computational resource usage while maintaining accuracy. Our codes are at https://anonymous.4open.science/r/IJCAI-8104.
Submitted 13 February, 2025;
originally announced February 2025.
-
One-shot Federated Learning Methods: A Practical Guide
Authors:
Xiang Liu,
Zhenheng Tang,
Xia Li,
Yijun Song,
Sijie Ji,
Zemin Liu,
Bo Han,
Linshan Jiang,
Jialin Li
Abstract:
One-shot Federated Learning (OFL) is a distributed machine learning paradigm that constrains client-server communication to a single round, addressing privacy and communication overhead issues associated with multiple rounds of data exchange in traditional Federated Learning (FL). OFL demonstrates the practical potential for integration with future approaches that require collaborative training models, such as large language models (LLMs). However, current OFL methods face two major challenges: data heterogeneity and model heterogeneity, which result in subpar performance compared to conventional FL methods. Worse still, despite numerous studies addressing these limitations, a comprehensive summary is still lacking. To address these gaps, this paper presents a systematic analysis of the challenges faced by OFL and thoroughly reviews the current methods. We also offer an innovative categorization method and analyze the trade-offs of various techniques. Additionally, we discuss the most promising future directions and the technologies that should be integrated into the OFL field. This work aims to provide guidance and insights for future research.
Submitted 13 February, 2025;
originally announced February 2025.
-
Quantum communication over bandwidth-and-time-limited channels
Authors:
Aditya Gandotra,
Zhaoyou Wang,
Aashish A. Clerk,
Liang Jiang
Abstract:
Standard communication systems have transmission spectra that characterize their ability to perform frequency multiplexing over a finite bandwidth. Realistic quantum signals in quantum communication systems like transducers are inherently limited in time due to intrinsic decoherence and finite latency, which hinders the direct implementation of frequency-multiplexed encoding. We investigate quantum channel capacities for bandwidth-and-time-limited (BTL) channels to establish the optimal communication strategy in a realistic setting. For pure-loss bosonic channels, we derive analytical solutions of the optimal encoding and decoding modes for Lorentzian and box transmission spectra, along with numerical solutions for various other transmissions. Our findings reveal a general feature of sequential activation of quantum channels as the input signal duration increases, as well as the existence of optimal signal length for scenarios where only a limited number of channels are in use.
Submitted 12 February, 2025;
originally announced February 2025.
-
Quantum learning advantage on a scalable photonic platform
Authors:
Zheng-Hao Liu,
Romain Brunel,
Emil E. B. Østergaard,
Oscar Cordero,
Senrui Chen,
Yat Wong,
Jens A. H. Nielsen,
Axel B. Bregnsbo,
Sisi Zhou,
Hsin-Yuan Huang,
Changhun Oh,
Liang Jiang,
John Preskill,
Jonas S. Neergaard-Nielsen,
Ulrik L. Andersen
Abstract:
Recent advancements in quantum technologies have opened new horizons for exploring the physical world in ways once deemed impossible. Central to these breakthroughs is the concept of quantum advantage, where quantum systems outperform their classical counterparts in solving specific tasks. While much attention has been devoted to computational speedups, quantum advantage in learning physical systems remains a largely untapped frontier. Here, we present a photonic implementation of a quantum-enhanced protocol for learning the probability distribution of a multimode bosonic displacement process. By harnessing the unique properties of continuous-variable quantum entanglement, we obtain a massive advantage in sample complexity with respect to conventional methods without entangled resources. With approximately 5 dB of two-mode squeezing -- corresponding to imperfect Einstein--Podolsky--Rosen (EPR) entanglement -- we learn a 100-mode bosonic displacement process using 11.8 orders of magnitude fewer samples than a conventional scheme. Our results demonstrate that even with non-ideal, noisy entanglement, a significant quantum advantage can be realized in continuous-variable quantum systems. This marks an important step towards practical quantum-enhanced learning protocols with implications for quantum metrology, certification, and machine learning.
Submitted 16 February, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Online Covariance Estimation in Nonsmooth Stochastic Approximation
Authors:
Liwei Jiang,
Abhishek Roy,
Krishna Balasubramanian,
Damek Davis,
Dmitriy Drusvyatskiy,
Sen Na
Abstract:
We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.
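The batch-means idea can be sketched in its simplest offline form. This is not the online estimator of Zhu et al. (2023): it assumes a fixed batch size, equal-length non-overlapping batches, and access to all iterates at once, and the name `batch_means_covariance` is hypothetical.

```python
import numpy as np

def batch_means_covariance(iterates, batch_size):
    """Classical batch-means estimate of the limiting covariance of the average.

    iterates: (n, d) array of SA iterates; trailing iterates that do not fill
    a complete batch are dropped. Fixed batch size is a simplifying assumption.
    """
    x = np.asarray(iterates, dtype=float)
    n, d = x.shape
    num_batches = n // batch_size
    if num_batches < 2:
        raise ValueError("need at least two full batches")
    used = x[: num_batches * batch_size]
    # Mean of each batch, shape (num_batches, d).
    batch_avgs = used.reshape(num_batches, batch_size, d).mean(axis=1)
    centered = batch_avgs - used.mean(axis=0)
    # Scale the sample covariance of batch means by the batch size.
    return batch_size * centered.T @ centered / (num_batches - 1)

# Sanity check on i.i.d. standard normal "iterates": the target is the identity.
rng = np.random.default_rng(0)
samples = rng.standard_normal((20000, 2))
estimate = batch_means_covariance(samples, batch_size=100)
```

For correlated iterates, the batch size has to grow with the sample size for the estimate to be consistent, which is the role the batch size sequence plays in the abstract.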
Submitted 7 February, 2025;
originally announced February 2025.
-
S$^2$-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency
Authors:
Yuting Zeng,
Weizhe Huang,
Lei Jiang,
Tongxuan Liu,
Xitai Jin,
Chen Tianying Tiana,
Jing Li,
Xiaohua Xu
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various natural language processing (NLP) scenarios, but they still face challenges when handling complex arithmetic and logical reasoning tasks. While Chain-Of-Thought (CoT) reasoning, self-consistency (SC) and self-correction strategies have attempted to guide models in sequential, multi-step reasoning, Multi-agent Debate (MAD) has emerged as a viable approach for enhancing the reasoning capabilities of LLMs. By increasing both the number of agents and the frequency of debates, the performance of LLMs improves significantly. However, this strategy results in a significant increase in token costs, presenting a barrier to scalability. To address this challenge, we introduce a novel sparsification strategy designed to reduce token costs within MAD. This approach minimizes ineffective exchanges of information and unproductive discussions among agents, thereby enhancing the overall efficiency of the debate process. We conduct comparative experiments on multiple datasets across various models, demonstrating that our approach substantially reduces token costs in MAD. Specifically, compared to MAD, our approach reduces token costs by up to 94.5\% while keeping performance degradation below 2.0\%.
Submitted 9 April, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering
Authors:
Longquan Jiang,
Junbo Huang,
Cedric Möller,
Ricardo Usbeck
Abstract:
Most existing Knowledge Graph Question Answering (KGQA) approaches are designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the heterogeneity of the underlying graph schema, topology and assertions, most KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without resource-intensive training data. We present OntoSCPrompt, a novel Large Language Model (LLM)-based KGQA approach with a two-stage architecture that separates semantic parsing from KG-dependent interactions. OntoSCPrompt first generates a SPARQL query structure (including SPARQL keywords such as SELECT, ASK, WHERE and placeholders for missing tokens) and then fills them with KG-specific information. To enhance the understanding of the underlying KG, we present an ontology-guided, hybrid prompt learning strategy that integrates KG ontology into the learning process of hybrid prompts (e.g., discrete and continuous vectors). We also present several task-specific decoding strategies to ensure the correctness and executability of generated SPARQL queries in both stages. Experimental results demonstrate that OntoSCPrompt performs as well as SOTA approaches without retraining on a number of KGQA datasets such as CWQ, WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG. Code: https://github.com/LongquanJiang/OntoSCPrompt
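The two-stage split can be pictured as a query skeleton followed by slot filling. The sketch below only illustrates that shape; the bracketed placeholder syntax, the function `fill_skeleton`, and the example bindings (Wikidata-style identifiers) are assumptions, not OntoSCPrompt's actual format.

```python
import re

def fill_skeleton(skeleton, bindings):
    """Replace each [name] placeholder with a KG-specific value (stage two)."""
    def substitute(match):
        key = match.group(1)
        if key not in bindings:
            raise KeyError(f"no binding for placeholder [{key}]")
        return bindings[key]
    return re.sub(r"\[(\w+)\]", substitute, skeleton)

# Stage one (hypothetical output): KG-independent structure with placeholders.
skeleton = "SELECT ?x WHERE { [ent] [rel] ?x }"
# Stage two: bind placeholders to identifiers from a particular KG.
query = fill_skeleton(skeleton, {"ent": "wd:Q42", "rel": "wdt:P19"})
```

Keeping the skeleton free of KG identifiers is what would let the first stage be reused when only the bindings change for a new KG.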
Submitted 6 February, 2025;
originally announced February 2025.
-
Iron-corrected Single-epoch Black Hole Masses of DESI Quasars at low redshift
Authors:
Zhiwei Pan,
Linhua Jiang,
Wei-Jian Guo,
Shengxiu Sun,
Małgorzata Siudek,
Jessica Nicole Aguilar,
Steven Ahlen,
David Brooks,
Todd Claybaugh,
Axel de la Macorra,
Peter Doel,
Enrique Gaztañaga,
Satya Gontcho A Gontcho,
Stephanie Juneau,
Theodore Kisner,
Andrew Lambert,
Martin Landriau,
Laurent Le Guillou,
Marc Manera,
Paul Martini,
Aaron Meisner,
Ramon Miquel,
John Moustakas,
Adam Myers,
Claire Poppett
, et al. (9 additional authors not shown)
Abstract:
We present a study on the possible overestimation of single-epoch supermassive black hole (SMBH) masses in previous works, based on more than 55,000 type 1 quasars at $0.25 < z < 0.8$ from the Dark Energy Spectroscopic Instrument (DESI). We confirm that iron emission strength serves as a good tracer of the Eddington ratio, and estimate SMBH masses using an iron-corrected $R$-$L$ relation for H$β$, where $R$ is the broad line region size and $L$ is the continuum luminosity. Compared to our measurements, previous canonical measurements without the iron correction are overestimated by a factor of 1.5 on average. The overestimation can be up to a factor of 5 for super-Eddington quasars. The fraction of super-Eddington quasars in our sample is about 5%, significantly higher than 0.4% derived from the canonical measurements. Using a sample featuring both H$β$ and MgII emission lines, we calibrate MgII-based SMBH masses using iron-corrected, H$β$-based SMBH masses and establish an iron-corrected $R$-$L$ relation for MgII. The new relation adds an extra term of $-0.34R_{\mathrm{Fe}}$ to the $R$-$L$ relation, where $R_{\mathrm{Fe}}$ denotes the relative iron strength. We use this formula to build a catalog of about 0.5 million DESI quasars at $0.6<z<1.6$. If these iron-corrected $R$-$L$ relations for H$β$ and MgII are valid at high redshift, current mass measurements of luminous quasars at $z\ge6$ would have been overestimated by a factor of 2.3 on average, alleviating the tension between SMBH mass and growth history in the early universe.
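As a rough illustration of how the extra term enters for MgII, the relation and the single-epoch virial mass can be written as below. Here $K$ and $\alpha$ stand for the intercept and slope of the uncorrected relation, $f$ for the virial factor, and $\Delta V$ for the broad-line width; none of these values are given in the abstract, and base-10 logarithms with suitably normalized $R$ and $L$ are assumed.

$$
\log R = K + \alpha \log L - 0.34\,R_{\mathrm{Fe}}, \qquad M_{\mathrm{BH}} = f\,\frac{R\,\Delta V^{2}}{G}
$$

Because $M_{\mathrm{BH}}$ is linear in $R$, under these assumptions the correction lowers the inferred mass by a factor of $10^{0.34 R_{\mathrm{Fe}}}$ at fixed $L$ and $\Delta V$, which is about 2.2 for $R_{\mathrm{Fe}} = 1$.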
Submitted 5 February, 2025;
originally announced February 2025.
-
Ruling out AGNs as the dominant source of cosmic reionization with JWST
Authors:
Danyang Jiang,
Linhua Jiang,
Shengxiu Sun,
Weiyang Liu,
Shuqi Fu
Abstract:
Cosmic reionization represents the latest phase transition of the intergalactic medium (IGM) in the Universe. It has long been debated whether galaxies or active galactic nuclei (AGNs) are the major source of Lyman continuum (LyC) photons responsible for reionization. Previous observations slightly favored galaxies as the major ionizing source. However, the James Webb Space Telescope (JWST) recently discovered an unexpectedly high density of AGN candidates at high redshift, which has largely enhanced the influence of AGNs. Here we derive a definitive upper bound on the AGN contribution to reionization using the latest JWST data, and conclusively rule out AGNs as the dominant ionizing source during the epoch of reionization (EoR). We build a sample of objects (including galaxies and AGNs) in a specific redshift range between 7.15 and 7.75 that has a high completeness. Each object is then decomposed into a point-source component and an extended component in their rest-frame far-UV JWST images. Assuming all point-source components are AGNs, we obtain an absolute upper limit for the density of the AGN population. This fiducial AGN sample reaches an unprecedentedly low luminosity of $M_{\rm UV} \approx -15$ mag. Based on this sample, we find that AGNs can contribute at most one third of the LyC photons required to ionize the Universe in this redshift range. Our result implies that galaxies dominate the ionizing source during the EoR.
Submitted 5 February, 2025;
originally announced February 2025.
-
HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment
Authors:
Lifan Jiang,
Boxi Wu,
Jiahui Zhang,
Xiaotong Guan,
Shuang Chen
Abstract:
With the rapid development of AIGC technology, significant progress has been made in diffusion model-based technologies for text-to-image (T2I) and text-to-video (T2V). In recent years, a few studies have introduced the strategy of Direct Preference Optimization (DPO) into T2I tasks, significantly improving the alignment of generated images with human preferences. However, existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences using DPO strategies. Additionally, challenges such as the scarcity of paired video preference data hinder effective model training. At the same time, the lack of training datasets risks insufficient flexibility and poor quality in the generated videos. To address these problems, our work proposes three targeted solutions. 1) Our work is the first to introduce the DPO strategy into T2V tasks. By deriving a carefully structured loss function, we utilize human feedback to align video generation with human preferences. We refer to this new method as HuViDPO. 2) Our work constructs small-scale human preference datasets for each action category and fine-tunes the model on them, improving the aesthetic quality of the generated videos while reducing training costs. 3) We adopt a First-Frame-Conditioned strategy, leveraging the rich information from the first frame to guide the generation of subsequent frames, enhancing flexibility in video generation. At the same time, we employ a SparseCausal Attention mechanism to enhance the quality of the generated videos. More details and examples can be accessed on our website: https://tankowa.github.io/HuViDPO. github.io/.
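For orientation, the following is a minimal sketch of the standard DPO objective for a single preference pair, not the HuViDPO loss itself, which the abstract does not state. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions. For diffusion models the exact log-likelihoods are usually not available and are typically replaced by a surrogate, which is not shown here.

```python
import math


def dpo_loss(logp_win: float, logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    All names are hypothetical. Inputs are log-probabilities of the
    preferred and rejected samples under the trained policy and under
    a frozen reference model.
    """
    # Scaled difference of log-ratios between preferred and rejected samples.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), evaluated stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: no preference signal yet.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy already favors the preferred sample: the loss is lower.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss decreases as the policy raises the likelihood of the preferred sample relative to the reference model faster than it does for the rejected one.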
Submitted 2 February, 2025;
originally announced February 2025.
-
Assouad dimension of the Takagi function
Authors:
Lai Jiang
Abstract:
For any integer $b\geq2$ and real sequence $\{c_n\}$ such that $\sum_{n=0}^\infty|c_n|<\infty$, the generalized Takagi function $f_{{\mathbf c},b}(x)$ is defined by $$ f_{{\mathbf c},b}(x):=\sum_{n=0}^\infty c_n\varphi(b^n x), \quad x\in [0,1], $$ where $\varphi(x)=\operatorname{dist}(x,\mathbb{Z})$ is the distance from $x$ to the nearest integer. The collection of functions of this form is called the Takagi class. In this paper, we show that if $\varlimsup_{n \to \infty} b^n |c_n|<\infty$, then the Assouad dimension of the graph ${\mathcal G} f_{{\mathbf c},b}=\{(x,f_{{\mathbf c},b}(x)):x\in[0,1]\}$ of the generalized Takagi function $f_{{\mathbf c},b}(x)$ is equal to one, that is, $$ \dim_A {\mathcal G} f_{{\mathbf c},b}=1. $$ In particular, for each $0<a<1$ and integer $b \geq 2$, we define the Takagi function $T_{a,b}$ as follows: $$ T_{a,b}(x):=\sum_{n=0}^\infty a^n \varphi(b^n x), \quad x\in [0,1]. $$ Then $\dim_A {\mathcal G} T_{a,b}=1$ if and only if $0<a \leq 1/b$.
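As a numerical illustration of the definition above, the sketch below evaluates a truncated partial sum of $T_{a,b}$ and checks the growth condition for $c_n = a^n$. In that case $b^n |c_n| = (ab)^n$, which stays bounded exactly when $ab \leq 1$, matching the threshold $a \leq 1/b$. The function names and the default number of terms are illustrative assumptions, and floating-point evaluation only approximates the infinite series.

```python
def dist_to_nearest_integer(x: float) -> float:
    """Distance from x to the nearest integer, i.e. dist(x, Z)."""
    return abs(x - round(x))


def takagi_partial_sum(x: float, a: float, b: int, terms: int = 40) -> float:
    """Approximate T_{a,b}(x) by its first `terms` summands.

    Hypothetical helper: the truncation error is at most
    0.5 * a**terms / (1 - a) for 0 < a < 1.
    """
    return sum(a**n * dist_to_nearest_integer(b**n * x) for n in range(terms))


def coefficients_satisfy_growth_condition(a: float, b: int) -> bool:
    """For c_n = a**n, b**n * |c_n| = (a*b)**n is bounded iff a*b <= 1."""
    return a * b <= 1


if __name__ == "__main__":
    # Classical Takagi function (a = 1/2, b = 2) sampled on a coarse grid.
    for k in range(9):
        x = k / 8
        print(f"x = {x:.3f}  T(x) ~ {takagi_partial_sum(x, 0.5, 2):.6f}")
    print(coefficients_satisfy_growth_condition(0.5, 2))
    print(coefficients_satisfy_growth_condition(0.75, 2))
```

The truncation length trades accuracy for cost; for $a$ close to 1 more terms are needed, and for large $b^n$ the products $b^n x$ eventually lose floating-point precision.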
Submitted 14 March, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control
Authors:
Lifan Jiang,
Shuang Chen,
Boxi Wu,
Xiaotong Guan,
Jiahui Zhang
Abstract:
With the advancement of generative artificial intelligence, previous studies have achieved the task of generating aesthetic images from hand-drawn sketches, fulfilling the public's needs for drawing. However, these methods are limited to static images and lack the ability to control video animation generation using hand-drawn sketches. To address this gap, we propose VidSketch, the first method capable of generating high-quality video animations directly from any number of hand-drawn sketches and simple text prompts, bridging the divide between ordinary users and professional artists. Specifically, our method introduces a Level-Based Sketch Control Strategy to automatically adjust the guidance strength of sketches during the generation process, accommodating users with varying drawing skills. Furthermore, a TempSpatial Attention mechanism is designed to enhance the spatiotemporal consistency of generated video animations, significantly improving the coherence across frames. You can find more detailed cases on our official website.
Submitted 17 February, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.