-
Motion planning for highly-dynamic unconditioned reflexes based on chained Signed Distance Functions
Authors:
Ken Lin,
Qi Ye,
Tin Lun Lam,
Zhibin Li,
Jiming Chen,
Gaofeng Li
Abstract:
The unconditioned reflex (e.g., protective reflex), which is the innate reaction of the organism and is usually performed through the spinal cord rather than the brain, enables organisms to escape harm from their environments. In this paper, we propose an online, highly-dynamic motion planning algorithm to endow manipulators with highly-dynamic unconditioned reflexes to humans and/or environments. Our method is based on a chained version of Signed Distance Functions (SDFs), which can be pre-computed and stored. Our proposed algorithm is divided into two stages. In the offline stage, we create 3 groups of local SDFs to store the geometric information of the manipulator and its working environment. In the online stage, the pre-computed local SDFs are chained together according to the configuration of the manipulator to provide global geometric information about the environment, while the point clouds of the dynamic objects serve as query points to look up these local SDFs for quickly generating escape velocity. Then we propose a modified geometric Jacobian matrix and use the Jacobian-pseudo-inverse method to generate real-time reflex behaviors to avoid the static and dynamic obstacles in the environment. The benefits of our method are validated in both static and dynamic scenarios. In the static scenario, our method identifies the path solutions with lower time consumption and shorter trajectory length compared to existing solutions. In the dynamic scenario, our method can reliably pursue the dynamic target point, avoid dynamic obstacles, and react to these obstacles within 1ms, which surpasses the unconditioned reflex reaction time of humans.
Submitted 18 February, 2025; v1 submitted 15 February, 2025;
originally announced February 2025.
-
Quantum automated learning with provable and explainable trainability
Authors:
Qi Ye,
Shuangyue Geng,
Zizhao Han,
Weikang Li,
L. -M. Duan,
Dong-Ling Deng
Abstract:
Machine learning is widely believed to be one of the most promising practical applications of quantum computing. Existing quantum machine learning schemes typically employ a quantum-classical hybrid approach that relies crucially on gradients of model parameters. Such an approach lacks provable convergence to global minima and will become infeasible as quantum learning models scale up. Here, we introduce quantum automated learning, where no variational parameter is involved and the training process is converted to quantum state preparation. In particular, we encode training data into unitary operations and iteratively evolve a random initial state under these unitaries and their inverses, with a target-oriented perturbation towards higher prediction accuracy sandwiched in between. Under reasonable assumptions, we rigorously prove that the evolution converges exponentially to the desired state corresponding to the global minimum of the loss function. We show that such a training process can be understood from the perspective of preparing quantum states by imaginary time evolution, where the data-encoded unitaries together with target-oriented perturbations would train the quantum learning model in an automated fashion. We further prove that the quantum automated learning paradigm features good generalization ability with the generalization error upper bounded by the ratio between a logarithmic function of the Hilbert space dimension and the number of training samples. In addition, we carry out extensive numerical simulations on real-life images and quantum data to demonstrate the effectiveness of our approach and validate the assumptions. Our results establish an unconventional quantum learning strategy that is gradient-free with provable and explainable trainability, which would be crucial for large-scale practical applications of quantum computing in machine learning scenarios.
Submitted 7 February, 2025;
originally announced February 2025.
-
E-3SFC: Communication-Efficient Federated Learning with Double-way Features Synthesizing
Authors:
Yuhao Zhou,
Yuxin Tian,
Mingjia Shi,
Yuanxi Li,
Yanan Sun,
Qing Ye,
Jiancheng Lv
Abstract:
The exponential growth in model sizes has significantly increased the communication burden in Federated Learning (FL). Existing methods to alleviate this burden by transmitting compressed gradients often face high compression errors, which slow down the model's convergence. To simultaneously achieve high compression effectiveness and lower compression errors, we study the gradient compression problem from a novel perspective. Specifically, we propose a systematic algorithm termed Extended Single-Step Synthetic Features Compressing (E-3SFC), which consists of three sub-components, i.e., the Single-Step Synthetic Features Compressor (3SFC), a double-way compression algorithm, and a communication budget scheduler. First, we regard the process of gradient computation of a model as decompressing gradients from corresponding inputs, while the inverse process is considered as compressing the gradients. Based on this, we introduce a novel gradient compression method termed 3SFC, which utilizes the model itself as a decompressor, leveraging training priors such as model weights and objective functions. 3SFC compresses raw gradients into tiny synthetic features in a single-step simulation, incorporating error feedback to minimize overall compression errors. To further reduce communication overhead, 3SFC is extended to E-3SFC, allowing double-way compression and dynamic communication budget scheduling. Our theoretical analysis under both strongly convex and non-convex conditions demonstrates that 3SFC achieves linear and sub-linear convergence rates with aggregation noise. Extensive experiments across six datasets and six models reveal that 3SFC outperforms state-of-the-art methods by up to 13.4% while reducing communication costs by 111.6 times. These findings suggest that 3SFC can significantly enhance communication efficiency in FL without compromising model performance.
Submitted 5 February, 2025;
originally announced February 2025.
-
GraphMinNet: Learning Dependencies in Graphs with Light Complexity Minimal Architecture
Authors:
Md Atik Ahamed,
Andrew Cheng,
Qiang Ye,
Qiang Cheng
Abstract:
Graph Neural Networks (GNNs) have demonstrated remarkable success in various applications, yet they often struggle to capture long-range dependencies (LRD) effectively. This paper introduces GraphMinNet, a novel GNN architecture that generalizes the idea of minimal Gated Recurrent Units to graph-structured data. Our approach achieves efficient LRD modeling with linear computational complexity while maintaining permutation equivariance and stability. The model incorporates both structural and positional information through a unique combination of feature and positional encodings, leading to provably stronger expressiveness than the 1-WL test. Theoretical analysis establishes that GraphMinNet maintains non-decaying gradients over long distances, ensuring effective long-range information propagation. Extensive experiments on ten diverse datasets, including molecular graphs, image graphs, and synthetic networks, demonstrate that GraphMinNet achieves state-of-the-art performance while being computationally efficient. Our results show superior performance on 6 out of 10 datasets and competitive results on the others, validating the effectiveness of our approach in capturing both local and global graph structures.
Submitted 31 January, 2025;
originally announced February 2025.
-
FUNU: Boosting Machine Unlearning Efficiency by Filtering Unnecessary Unlearning
Authors:
Zitong Li,
Qingqing Ye,
Haibo Hu
Abstract:
Machine unlearning is an emerging field that selectively removes specific data samples from a trained model. This capability is crucial for addressing privacy concerns, complying with data protection regulations, and correcting errors or biases introduced by certain data. Unlike traditional machine learning, where models are typically static once trained, machine unlearning facilitates dynamic updates that enable the model to ``forget'' information without requiring complete retraining from scratch. There are various machine unlearning methods, some of which are more time-efficient when data removal requests are fewer.
To decrease the execution time of such machine unlearning methods, we aim to reduce the size of data removal requests based on the fundamental assumption that the removal of certain data would not result in a distinguishable retrained model. We first propose the concept of unnecessary unlearning, which indicates that the model would not alter noticeably after removing some data points. Subsequently, we review existing solutions that can be used to solve our problem. We highlight their limitations in adaptability to different unlearning scenarios and their reliance on manually selected parameters. We consequently put forward FUNU, a method to identify data points that lead to unnecessary unlearning. FUNU circumvents the limitations of existing solutions. The idea is to discover data points within the removal requests that have similar neighbors in the remaining dataset. Inspired by work on model memorization, we utilize a reference model to set parameters for finding neighbors. We provide a theoretical analysis of the privacy guarantee offered by FUNU and conduct extensive experiments to validate its efficacy.
Submitted 27 January, 2025;
originally announced January 2025.
-
Quantum disorder induced by nuclear tunneling in lattice
Authors:
Yu-Cheng Zhu,
Jia-Xi Zeng,
Qi-Jun Ye,
Xin-Zheng Li
Abstract:
Lattice degrees of freedom (DoFs) may induce quantum disorder (QD) when nuclear tunneling outvies long-range order, but conventional phonon theory is incapable of describing such QD phases. Here we develop a method based on path-integral molecular dynamics to solve this problem. Its accuracy is verified in a double-well chain model and it is applied to a real material from first principles. A quantum order-disorder-order phase transition sequence is demonstrated when varying the strength of quantum fluctuations using the lattice constants as the tuning factor. Combining the excitation spectra and Rényi entanglement entropy, we pinpoint the QD region. This picture may be general in lattice systems having soft phonon modes, not limited to quantum paraelectricity, in which novel entangled lattice motion and its coupling with other DoFs can be expected.
Submitted 15 January, 2025;
originally announced January 2025.
-
Fine-tuning is Not Fine: Mitigating Backdoor Attacks in GNNs with Limited Clean Data
Authors:
Jiale Zhang,
Bosen Rao,
Chengcheng Zhu,
Xiaobing Sun,
Qingming Li,
Haibo Hu,
Xiapu Luo,
Qingqing Ye,
Shouling Ji
Abstract:
Graph Neural Networks (GNNs) have achieved remarkable performance through their message-passing mechanism. However, recent studies have highlighted the vulnerability of GNNs to backdoor attacks, which can lead the model to misclassify graphs with attached triggers as the target class. The effectiveness of recent promising defense techniques, such as fine-tuning or distillation, is heavily contingent on having comprehensive knowledge of the sufficient training dataset. Empirical studies have shown that fine-tuning methods require a clean dataset of 20% to reduce attack accuracy to below 25%, while distillation methods require a clean dataset of 15%. However, obtaining such a large amount of clean data is commonly impractical.
In this paper, we propose a practical backdoor mitigation framework, denoted as GRAPHNAD, which can capture high-quality intermediate-layer representations in GNNs to enhance the distillation process with limited clean data. To achieve this, we address the following key questions: How to identify the appropriate attention representations in graphs for distillation? How to enhance distillation with limited data? By adopting the graph attention transfer method, GRAPHNAD can effectively align the intermediate-layer attention representations of the backdoored model with those of the teacher model, forcing the backdoor neurons to transform into benign ones. In addition, we extract the relation maps from intermediate-layer transformation and enforce the relation maps of the backdoored model to be consistent with those of the teacher model, thereby ensuring model accuracy while further reducing the influence of backdoors. Extensive experimental results show that by fine-tuning a teacher model with only 3% of the clean data, GRAPHNAD can reduce the attack success rate to below 5%.
Submitted 10 January, 2025;
originally announced January 2025.
-
VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation
Authors:
Zhengnan Sun,
Zhaotai Shi,
Jiayin Chen,
Qingtao Liu,
Yu Cui,
Qi Ye,
Jiming Chen
Abstract:
Bimanual dexterous manipulation remains a significant challenge in robotics due to the high DoFs of each hand and the need to coordinate them. Existing single-hand manipulation techniques often leverage human demonstrations to guide RL methods but fail to generalize to complex bimanual tasks involving multiple sub-skills. In this paper, we introduce VTAO-BiManip, a novel framework that combines visual-tactile-action pretraining with object understanding to facilitate curriculum RL and enable human-like bimanual manipulation. We improve prior learning by incorporating hand motion data, providing more effective guidance for dual-hand coordination than binary tactile feedback. Our pretraining model predicts future actions as well as object pose and size using masked multimodal inputs, facilitating cross-modal regularization. To address the multi-skill learning challenge, we introduce a two-stage curriculum RL approach to stabilize training. We evaluate our method on a bottle-cap unscrewing task, demonstrating its effectiveness in both simulated and real-world environments. Our approach achieves a success rate that surpasses existing visual-tactile pretraining methods by over 20%.
Submitted 7 January, 2025;
originally announced January 2025.
-
Structure-Preference Enabled Graph Embedding Generation under Differential Privacy
Authors:
Sen Zhang,
Qingqing Ye,
Haibo Hu
Abstract:
Graph embedding generation techniques aim to learn low-dimensional vectors for each node in a graph and have recently gained increasing research attention. Publishing low-dimensional node vectors enables various graph analysis tasks, such as structural equivalence and link prediction. Yet, improper publication opens a backdoor to malicious attackers, who can infer sensitive information of individuals from the low-dimensional node vectors. Existing methods tackle this issue by developing deep graph learning models with differential privacy (DP). However, they often suffer from large noise injections and cannot provide structural preferences consistent with mining objectives. Recently, skip-gram based graph embedding generation techniques have been widely used due to their ability to extract customizable structures. Based on skip-gram, we present SE-PrivGEmb, a structure-preference enabled graph embedding generation method under DP. For arbitrary structure preferences, we design a unified noise tolerance mechanism via perturbing non-zero vectors. This mechanism mitigates utility degradation caused by high sensitivity. By carefully designing negative sampling probabilities in skip-gram, we theoretically demonstrate that skip-gram can preserve arbitrary proximities, which quantify structural features in graphs. Extensive experiments show that our method outperforms existing state-of-the-art methods under structural equivalence and link prediction tasks.
Submitted 6 January, 2025;
originally announced January 2025.
-
Electron-Phonon Temperature Inversion in Nanostructures under Pulsed Photoexcitation
Authors:
Qian Ye,
Stephen K. Sanders,
Andrea Schirato,
Alessandro Alabastri
Abstract:
Photoexcitation of metallic nanostructures with short optical pulses can drive non-thermal electronic states, which, upon decay, lead to elevated electronic temperatures ($T_e \gtrapprox 1000\,\mathrm{K}$) eventually equilibrating with the lattice ($T_p$) through electron-phonon scattering. Here, we show that, in spatially extended nanostructures, the lattice temperature can locally exceed that of the electrons, a seemingly counterintuitive transient effect termed hereafter ``temperature inversion'' ($T_p > T_e$). This phenomenon, fundamentally due to inhomogeneous absorption patterns and absent in smaller particles, emerges from a complex spatio-temporal interplay between the electron-phonon coupling and competing electronic thermal diffusion. By combining rigorous three-dimensional (3D) finite-element-method-based simulations with practical reduced zero-dimensional (0D) analytical models, we identify the electron-phonon coupling coefficient ($G_{e-p}$) as the critical parameter governing this behavior. An optimal $G_{e-p}$ range allows the inversion, whereas a weak or overly strong coupling suppresses it. Among common plasmonic metals, platinum (Pt) exhibits the most pronounced and long-lived inversion, while gold (Au) and silver (Ag) show no significant inversion. Moreover, the close agreement between the 0D and 3D results, once an appropriate characteristic length is selected, highlights that the essential physics governing the inversion can be captured without full spatial complexity. These results provide insights for optimizing nanoscale energy transfer and hot-carrier-driven processes, guiding the strategic design of materials, geometries, and excitation conditions for enhanced ultrafast photothermal control.
Submitted 4 January, 2025;
originally announced January 2025.
-
PrivDPR: Synthetic Graph Publishing with Deep PageRank under Differential Privacy
Authors:
Sen Zhang,
Haibo Hu,
Qingqing Ye,
Jianliang Xu
Abstract:
The objective of privacy-preserving synthetic graph publishing is to safeguard individuals' privacy while retaining the utility of original data. Most existing methods focus on graph neural networks under differential privacy (DP), and yet two fundamental problems in generating synthetic graphs remain open. First, the current research often encounters high sensitivity due to the intricate relationships between nodes in a graph. Second, DP is usually achieved through advanced composition mechanisms that tend to converge prematurely when working with a small privacy budget. In this paper, inspired by the simplicity, effectiveness, and ease of analysis of PageRank, we design PrivDPR, a novel privacy-preserving deep PageRank for graph synthesis. In particular, we achieve DP by adding noise to the gradient for a specific weight during learning. Utilizing weight normalization as a bridge, we theoretically reveal that increasing the number of layers in PrivDPR can effectively mitigate the high sensitivity and privacy budget splitting. Through formal privacy analysis, we prove that the synthetic graph generated by PrivDPR satisfies node-level DP. Experiments on real-world graph datasets show that PrivDPR preserves high data utility across multiple graph structural properties.
Submitted 4 January, 2025;
originally announced January 2025.
-
Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
Authors:
Bingliang Li,
Fengyu Yang,
Yuxin Mao,
Qingwen Ye,
Hongkai Chen,
Yiran Zhong
Abstract:
Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, which significantly outperforms existing state-of-the-art V2A methods that typically generate mono audio for a fixed duration.
Submitted 29 December, 2024;
originally announced December 2024.
-
Data Poisoning Attacks to Local Differential Privacy Protocols for Graphs
Authors:
Xi He,
Kai Huang,
Qingqing Ye,
Haibo Hu
Abstract:
Graph analysis has become increasingly popular with the prevalence of big data and machine learning. Traditional graph data analysis methods often assume the existence of a trusted third party to collect and store the graph data, which does not align with real-world situations. To address this, some research has proposed utilizing Local Differential Privacy (LDP) to collect graph data or graph metrics (e.g., clustering coefficient). This line of research focuses on collecting two atomic graph metrics (the adjacency bit vectors and node degrees) from each node locally under LDP to synthesize an entire graph or generate graph metrics. However, they have not considered the security issues of LDP for graphs.
In this paper, we bridge the gap by demonstrating that an attacker can inject fake users into LDP protocols for graphs and design data poisoning attacks to degrade the quality of graph metrics. In particular, we present three data poisoning attacks to LDP protocols for graphs. As a proof of concept, we focus on data poisoning attacks on two classical graph metrics: degree centrality and clustering coefficient. We further design two countermeasures for these data poisoning attacks. Experimental study on real-world datasets demonstrates that our attacks can largely degrade the quality of collected graph metrics, and the proposed countermeasures cannot effectively offset the effect, which calls for the development of new defenses.
Submitted 23 December, 2024;
originally announced December 2024.
-
Federated Heavy Hitter Analytics with Local Differential Privacy
Authors:
Yuemin Zhang,
Qingqing Ye,
Haibo Hu
Abstract:
Federated heavy hitter analytics enables service providers to better understand the preferences of cross-party users by analyzing the most frequent items. As with federated learning, it faces challenges of privacy concerns, statistical heterogeneity, and expensive communication. Local differential privacy (LDP), as the de facto standard for privacy-preserving data collection, solves the privacy challenge by letting each user perturb her data locally and report the sanitized version. However, in federated settings, applying LDP complicates the other two challenges, because the injected LDP noise degrades utility and the perturbation mechanism increases communication/computation costs. To tackle these problems, we propose a novel target-aligning prefix tree mechanism satisfying $ε$-LDP, for federated heavy hitter analytics. In particular, we propose an adaptive extension strategy to address the inconsistencies between covering necessary prefixes and estimating heavy hitters within a party to enhance the utility. We also present a consensus-based pruning strategy that utilizes noisy prior knowledge from other parties to further align the inconsistency between finding heavy hitters in each party and providing reasonable frequency information to identify the global ones. To the best of our knowledge, our study is the first solution to the federated heavy hitter analytics in a cross-party setting while satisfying the stringent $ε$-LDP. Comprehensive experiments on both real-world and synthetic datasets confirm the effectiveness of our proposed mechanism.
Submitted 2 January, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
-
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
Authors:
Xiaoxi Li,
Jiajie Jin,
Yujia Zhou,
Yongkang Wu,
Zhonghua Li,
Qi Ye,
Zhicheng Dou
Abstract:
Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose \textbf{RetroLLM}, a unified framework that integrates retrieval and generation into a single, cohesive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks. The code is available at \url{https://github.com/sunnynexus/RetroLLM}.
Submitted 16 December, 2024;
originally announced December 2024.
-
CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis
Authors:
Mu Zhang,
Yunfan Liu,
Yue Liu,
Yuzhong Zhao,
Qixiang Ye
Abstract:
Existing image synthesis methods for natural scenes focus primarily on foreground control, often reducing the background to simplistic textures. Consequently, these approaches tend to overlook the intrinsic correlation between foreground and background, which may lead to incoherent and unrealistic synthesis results in remote sensing (RS) scenarios. In this paper, we introduce CC-Diff, a Diffusion Model-based approach for RS image generation with enhanced Context Coherence. Specifically, we propose a novel Dual Re-sampler for feature extraction, with a built-in 'Context Bridge' to explicitly capture the intricate interdependency between foreground and background. Moreover, we reinforce their connection by employing a foreground-aware attention mechanism during the generation of background features, thereby enhancing the plausibility of the synthesized context. Extensive experiments show that CC-Diff outperforms state-of-the-art methods across critical quality metrics, excelling in the RS domain and effectively generalizing to natural images. Remarkably, CC-Diff also shows high trainability, boosting detection accuracy by 1.83 mAP on DOTA and 2.25 mAP on the COCO benchmark.
Submitted 10 March, 2025; v1 submitted 11 December, 2024;
originally announced December 2024.
-
Membership Inference Attacks and Defenses in Federated Learning: A Survey
Authors:
Li Bai,
Haibo Hu,
Qingqing Ye,
Haoyang Li,
Leixia Wang,
Jianliang Xu
Abstract:
Federated learning is a decentralized machine learning approach where clients train models locally and share model updates to develop a global model. This enables low-resource devices to collaboratively build a high-quality model without requiring direct access to the raw training data. However, despite only sharing model updates, federated learning still faces several privacy vulnerabilities. One of the key threats is membership inference attacks, which target clients' privacy by determining whether a specific example is part of the training set. These attacks can compromise sensitive information in real-world applications, such as medical diagnoses within a healthcare system. Although there has been extensive research on membership inference attacks, a comprehensive and up-to-date survey specifically focused on them within federated learning is still absent. To fill this gap, we categorize and summarize membership inference attacks and their corresponding defense strategies based on their characteristics in this setting. We introduce a unique taxonomy of existing attack research and provide a systematic overview of various countermeasures. For these studies, we thoroughly analyze the strengths and weaknesses of different approaches. Finally, we identify and discuss key future research directions for readers interested in advancing the field.
Submitted 8 December, 2024;
originally announced December 2024.
-
No-Free-Lunch Theories for Tensor-Network Machine Learning Models
Authors:
Jing-Chuan Wu,
Qi Ye,
Dong-Ling Deng,
Li-Wei Yu
Abstract:
Tensor network machine learning models have shown remarkable versatility in tackling complex data-driven tasks, ranging from quantum many-body problems to classical pattern recognition. Despite their promising performance, a comprehensive understanding of the underlying assumptions and limitations of these models is still lacking. In this work, we focus on the rigorous formulation of their no-free-lunch theorem -- essential yet notoriously challenging to formalize for specific tensor network machine learning models. In particular, we rigorously analyze the generalization risks of learning target output functions from input data encoded in tensor network states. We first prove a no-free-lunch theorem for machine learning models based on matrix product states, i.e., the one-dimensional tensor network states. Furthermore, we circumvent the challenging issue of calculating the partition function for the two-dimensional Ising model, and prove the no-free-lunch theorem for the case of two-dimensional projected entangled-pair states, by introducing the combinatorial method associated with the "puzzle of polyominoes". Our findings reveal the intrinsic limitations of tensor network-based learning models in a rigorous fashion, and open up an avenue for future analytical exploration of both the strengths and limitations of quantum-inspired machine learning frameworks.
Submitted 7 December, 2024;
originally announced December 2024.
-
Mixture of Physical Priors Adapter for Parameter-Efficient Fine-Tuning
Authors:
Zhaozhi Wang,
Conghu Li,
Qixiang Ye,
Tong Zhang
Abstract:
Most parameter-efficient fine-tuning (PEFT) methods rely on low-rank representations to adapt models. However, these approaches often oversimplify representations, particularly when the underlying data has high-rank or high-frequency components. This limitation hinders the model's ability to capture complex data interactions effectively. In this paper, we propose a novel approach that models network weights by leveraging a combination of physical priors, enabling more accurate approximations. We use three foundational equations -- heat diffusion, wave propagation, and Poisson's steady-state equation -- each contributing distinctive modeling properties: heat diffusion enforces local smoothness, wave propagation facilitates long-range interactions, and Poisson's equation captures global equilibrium. To combine these priors effectively, we introduce the Mixture of Physical Priors Adapter (MoPPA), using an efficient Discrete Cosine Transform (DCT) implementation. To dynamically balance these priors, a route regularization mechanism is designed to adaptively tune their contributions. MoPPA serves as a lightweight, plug-and-play module that seamlessly integrates into transformer architectures, with adaptable complexity depending on the local context. Specifically, using MAE pre-trained ViT-B, MoPPA improves PEFT accuracy by up to 2.1% on VTAB-1K image classification with a comparable number of trainable parameters, and advantages are further validated through experiments across various vision backbones, showcasing MoPPA's effectiveness and adaptability. The code will be made publicly available.
Submitted 3 December, 2024;
originally announced December 2024.
-
Multi-robot autonomous 3D reconstruction using Gaussian splatting with Semantic guidance
Authors:
Jing Zeng,
Qi Ye,
Tianle Liu,
Yang Xu,
Jin Li,
Jinming Xu,
Liang Li,
Jiming Chen
Abstract:
Implicit neural representations and 3D Gaussian splatting (3DGS) have shown great potential for scene reconstruction. Recent studies have expanded their applications in autonomous reconstruction through task assignment methods. However, these methods are mainly limited to a single robot, and rapid reconstruction of large-scale scenes remains challenging. Additionally, task-driven planning based on surface uncertainty is prone to being trapped in local optima. To this end, we propose the first 3DGS-based centralized multi-robot autonomous 3D reconstruction framework. To further reduce the time cost of task generation and improve reconstruction quality, we integrate online open-vocabulary semantic segmentation with surface uncertainty of 3DGS, focusing view sampling on regions with high instance uncertainty. Finally, we develop a multi-robot collaboration strategy with mode and task assignments, improving reconstruction quality while ensuring planning efficiency. Our method demonstrates the highest reconstruction quality among all planning methods and superior planning efficiency compared to existing multi-robot methods. We deploy our method on multiple robots, and results show that it can effectively plan view paths and reconstruct scenes with high quality.
Submitted 3 December, 2024;
originally announced December 2024.
-
Learning Locally, Revising Globally: Global Reviser for Federated Learning with Noisy Labels
Authors:
Yuxin Tian,
Mouxing Yang,
Yuhao Zhou,
Jian Wang,
Qing Ye,
Tongliang Liu,
Gang Niu,
Jiancheng Lv
Abstract:
The success of most federated learning (FL) methods heavily depends on label quality, which is often inaccessible in real-world scenarios, such as medicine, leading to the federated label-noise (F-LN) problem. In this study, we observe that the global model of FL memorizes the noisy labels slowly. Based on the observations, we propose a novel approach dubbed Global Reviser for Federated Learning with Noisy Labels (FedGR) to enhance the label-noise robustness of FL. In brief, FedGR employs three novel modules to achieve noisy label sniffing and refining, local knowledge revising, and local model regularization. Specifically, the global model is adopted to infer local data proxies for global sample selection and refine incorrect labels. To maximize the utilization of local knowledge, we leverage the global model to revise the local exponential moving average (EMA) model of each client and distill it into the clients' models. Additionally, we introduce a global-to-local representation regularization to mitigate the overfitting of noisy labels. Extensive experiments on three F-LNL benchmarks against seven baseline methods demonstrate the effectiveness of the proposed FedGR.
Submitted 30 November, 2024;
originally announced December 2024.
-
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
Authors:
Feng Liu,
Shiwei Zhang,
Xiaofeng Wang,
Yujie Wei,
Haonan Qiu,
Yuzhong Zhao,
Yingya Zhang,
Qixiang Ye,
Fang Wan
Abstract:
As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings to ensure that their differences better approximate those of model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% Vbench score) degradation of visual quality.
Submitted 18 March, 2025; v1 submitted 28 November, 2024;
originally announced November 2024.
-
RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model
Authors:
Huiyang Hu,
Peijin Wang,
Hanbo Bi,
Boyuan Tong,
Zhaozhi Wang,
Wenhui Diao,
Hao Chang,
Yingchao Feng,
Ziqi Zhang,
Yaowei Wang,
Qixiang Ye,
Kun Fu,
Xian Sun
Abstract:
Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with large-scale remote sensing images. To overcome these, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84%, reduce FLOPs by 24%, and improve throughput by 2.7 times. The code will be made publicly available.
Submitted 25 June, 2025; v1 submitted 26 November, 2024;
originally announced November 2024.
-
Cavity-Quantum Electrodynamics with Moiré Flatband Photonic Crystals
Authors:
Yu-Tong Wang,
Qi-Hang Ye,
Jun-Yong Yan,
Yufei Qiao,
Chen Chen,
Xiao-Tian Cheng,
Chen-Hui Li,
Zi-Jian Zhang,
Cheng-Nian Huang,
Yun Meng,
Kai Zou,
Wen-Kang Zhan,
Chao Zhao,
Xiaolong Hu,
Clarence Augustine T H Tee,
Wei E. I. Sha,
Zhixiang Huang,
Huiyun Liu,
Chao-Yuan Jin,
Lei Ying,
Feng Liu
Abstract:
Quantum emitters are a key component in photonic quantum technologies. Enhancing their single-photon emission by engineering the photonic environment using cavities can significantly improve the overall efficiency in quantum information processing. However, this enhancement is often constrained by the need for precise nanoscale control over the emitter's position within micro- or nano-cavities. Inspired by the fascinating physics of moiré patterns, we present an approach to strongly modify the spontaneous emission rate of a quantum emitter using a finely designed multilayer moiré photonic crystal with a robust isolated-flatband dispersion. Theoretical analysis reveals that, due to its nearly infinite photonic density of states, the moiré cavity can simultaneously achieve a high Purcell factor and exhibit large tolerance over the emitter's position. We experimentally demonstrate the coupling between this moiré cavity and a quantum dot through the cavity-determined polarization of the dot's emission. The radiative lifetime of the quantum dot can be tuned by a factor of 40, ranging from 42 ps to 1692 ps, which is attributed to strong Purcell enhancement and Purcell inhibition effects. Our findings pave the way for moiré flatband cavity-enhanced quantum light sources, quantum optical switches, and quantum nodes for quantum internet applications.
Submitted 6 June, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
ClickTrack: Towards Real-time Interactive Single Object Tracking
Authors:
Kuiran Wang,
Xuehui Yu,
Wenwen Yu,
Guorong Li,
Xiangyuan Lan,
Qixiang Ye,
Jianbin Jiao,
Zhenjun Han
Abstract:
Single object tracking (SOT) relies on precise object bounding box initialization. In this paper, we reconsider the deficiencies in current approaches to initializing single object trackers and propose ClickTrack, a new paradigm for single object tracking that uses clicking interaction for real-time scenarios. Moreover, a click as an input type inherently lacks hierarchical information. To address ambiguity in certain special scenarios, we design the Guided Click Refiner (GCR), which accepts a point and optional textual information as inputs and transforms the point into the bounding box expected by the operator. The bounding box is then used as the input of single object trackers. Experiments on the LaSOT and GOT-10k benchmarks show that a tracker combined with GCR achieves stable performance in real-time interactive scenarios. Furthermore, we explore the integration of GCR into the Segment Anything Model (SAM), significantly reducing ambiguity issues when SAM receives point inputs.
Submitted 24 November, 2024; v1 submitted 20 November, 2024;
originally announced November 2024.
-
The Volatile Composition and Activity Evolution of Main-Belt Comet 358P/PANSTARRS
Authors:
Henry H. Hsieh,
John W. Noonan,
Michael S. P. Kelley,
Dennis Bodewits,
Jana Pittichova,
Audrey Thirouin,
Marco Micheli,
Matthew M. Knight,
Michele T. Bannister,
Colin O. Chandler,
Carrie E. Holt,
Matthew J. Hopkins,
Yaeji Kim,
Nicholas A. Moskovitz,
William J. Oldroyd,
Jack Patterson,
Scott S. Sheppard,
Nicole Tan,
Chadwick A. Trujillo,
Quanzhi Ye
Abstract:
We report the detection of water vapor associated with main-belt comet 358P/PANSTARRS on UT 2024 January 8-9 using the NIRSPEC instrument aboard JWST. We derive a water production rate of Q(H2O)=(5.0+/-0.2)x10^25 molecules/s, marking only the second direct detection of sublimation products of any kind from a main-belt comet, after 238P/Read. Similar to 238P, we find a remarkable absence of hypervolatile species, finding Q(CO2)<7.6x10^22 molecules/s, corresponding to Q(CO2)/Q(H2O)<0.2%. Upper limits on CH3OH and CO emission are also estimated. Photometry from ground-based observations shows that the dust coma brightened and faded slowly over ~250 days in 2023-2024, consistent with photometric behavior observed in 2012-2013, but also indicates a ~2.5x decline in the dust production rate between these two periods. Dynamical dust modeling shows that the coma's morphology as imaged by JWST's NIRCAM instrument on 2023 November 22 can be reproduced by asymmetric dust emission from a nucleus with a mid-range obliquity (~80 deg) with a steady-state mass loss rate of ~0.8 kg/s. Finally, we find similar Afrho-to-gas ratios of log10(Afrho/Q(H2O))=-24.8+/-0.2 for 358P and log10(Afrho/Q(H2O))=-24.4+/-0.2 for 238P, suggesting that Afrho could serve as an effective proxy for estimating water production rates in other active main-belt comets. The confirmation of water vapor outgassing in both main-belt comets observed by JWST to date reinforces the use of recurrent activity near perihelion as an indicator of sublimation-driven activity in active asteroids.
Submitted 11 November, 2024;
originally announced November 2024.
-
Classification Done Right for Vision-Language Pre-Training
Authors:
Zilong Huang,
Qinghao Ye,
Bingyi Kang,
Jiashi Feng,
Haoqi Fan
Abstract:
We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts with a text encoder, SuperClass directly utilizes tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Due to the absence of text encoding as a contrastive target, SuperClass does not require a text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrated superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language downstream tasks. We further explored the scaling behavior of SuperClass on model size, training length, or data size, and reported encouraging results and comparisons to CLIP. https://github.com/x-cls/superclass
Submitted 6 November, 2024; v1 submitted 5 November, 2024;
originally announced November 2024.
-
The Latent Road to Atoms: Backmapping Coarse-grained Protein Structures with Latent Diffusion
Authors:
Xu Han,
Yuancheng Sun,
Kai Chen,
Kang Liu,
Qiwei Ye
Abstract:
Coarse-grained (CG) molecular dynamics simulations offer computational efficiency for exploring protein conformational ensembles and thermodynamic properties. Though coarse representations enable large-scale simulations across extended temporal and spatial ranges, the sacrifice of atomic-level details limits their utility in tasks such as ligand docking and protein-protein interaction prediction. Backmapping, the process of reconstructing all-atom structures from coarse-grained representations, is crucial for recovering these fine details. While recent machine learning methods have made strides in protein structure generation, challenges persist in reconstructing diverse atomistic conformations that maintain geometric accuracy and chemical validity. In this paper, we present Latent Diffusion Backmapping (LDB), a novel approach leveraging denoising diffusion within latent space to address these challenges. By combining discrete latent encoding with diffusion, LDB bypasses the need for equivariant and internal coordinate manipulation, significantly simplifying the training and sampling processes as well as facilitating better and wider exploration in configuration space. We evaluate LDB's state-of-the-art performance on three distinct protein datasets, demonstrating its ability to efficiently reconstruct structures with high structural accuracy and chemical validity. Moreover, LDB shows exceptional versatility in capturing diverse protein ensembles, highlighting its capability to explore intricate conformational spaces. Our results position LDB as a powerful and scalable approach for backmapping, effectively bridging the gap between CG simulations and atomic-level analyses in computational biology.
Submitted 17 October, 2024;
originally announced October 2024.
-
On the sample complexity of purity and inner product estimation
Authors:
Weiyuan Gong,
Jonas Haferkamp,
Qi Ye,
Zhihan Zhang
Abstract:
We study the sample complexity of two prototypical tasks: quantum purity estimation and quantum inner product estimation. In purity estimation, we are to estimate $tr(ρ^2)$ of an unknown quantum state $ρ$ to additive error $ε$. Meanwhile, for quantum inner product estimation, Alice and Bob are to estimate $tr(ρσ)$ to additive error $ε$ given copies of unknown quantum states $ρ$ and $σ$ using classical communication and restricted quantum communication.
In this paper, we show a strong connection between the sample complexity of purity estimation with bounded quantum memory and inner product estimation with bounded quantum communication and unentangled measurements. We propose a protocol that solves quantum inner product estimation with $k$-qubit one-way quantum communication and unentangled local measurements using $O(median\{1/ε^2,2^{n/2}/ε,2^{n-k}/ε^2\})$ copies of $ρ$ and $σ$. Our protocol can be modified to estimate the purity of an unknown quantum state $ρ$ using $k$-qubit quantum memory with the same complexity. We prove that arbitrary protocols with $k$-qubit quantum memory that estimate purity to error $ε$ require $Ω(median\{1/ε^2,2^{n/2}/\sqrt{ε},2^{n-k}/ε^2\})$ copies of $ρ$. This indicates the same lower bound for quantum inner product estimation with one-way $k$-qubit quantum communication and classical communication, and unentangled local measurements. For purity estimation, we further improve the lower bound to $Ω(\max\{1/ε^2,2^{n/2}/ε\})$ for any protocols using an identical single-copy projection-valued measurement.
Additionally, we investigate a decisional variant of quantum distributed inner product estimation without quantum communication for mixed state and provide a lower bound on the sample complexity.
Submitted 16 October, 2024;
originally announced October 2024.
-
New Paradigm of Adversarial Training: Releasing Accuracy-Robustness Trade-Off via Dummy Class
Authors:
Yanyun Wang,
Li Liu,
Zi Liang,
Yi R. Fung,
Qingqing Ye,
Haibo Hu
Abstract:
Adversarial Training (AT) is one of the most effective methods to enhance the robustness of Deep Neural Networks (DNNs). However, existing AT methods suffer from an inherent accuracy-robustness trade-off. Previous works have studied this issue under the current AT paradigm, but still face over 10% accuracy reduction without significant robustness improvement over simple baselines such as PGD-AT. This inherent trade-off raises a question: Whether the current AT paradigm, which assumes to learn corresponding benign and adversarial samples as the same class, inappropriately mixes clean and robust objectives that may be essentially inconsistent. In fact, our empirical results show that up to 40% of CIFAR-10 adversarial samples always fail to satisfy such an assumption across various AT methods and robust models, explicitly indicating the room for improvement of the current AT paradigm. To relax from this overstrict assumption and the tension between clean and robust learning, in this work, we propose a new AT paradigm by introducing an additional dummy class for each original class, aiming to accommodate hard adversarial samples with shifted distribution after perturbation. The robustness w.r.t. these adversarial samples can be achieved by runtime recovery from the predicted dummy classes to the corresponding original ones, without conflicting with the clean objective on accuracy of benign samples. Finally, based on our new paradigm, we propose a novel DUmmy Classes-based Adversarial Training (DUCAT) method that concurrently improves accuracy and robustness in a plug-and-play manner only relevant to logits, loss, and a proposed two-hot soft label-based supervised signal. Our method outperforms state-of-the-art (SOTA) benchmarks, effectively releasing the current trade-off. The code is available at https://github.com/FlaAI/DUCAT.
Submitted 26 May, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
LLaVA-Critic: Learning to Evaluate Multimodal Models
Authors:
Tianyi Xiong,
Xiyao Wang,
Dong Guo,
Qinghao Ye,
Haoqi Fan,
Quanquan Gu,
Heng Huang,
Chunyuan Li
Abstract:
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
Submitted 3 March, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
-
RobustEMD: Domain Robust Matching for Cross-domain Few-shot Medical Image Segmentation
Authors:
Yazhou Zhu,
Minxian Li,
Qiaolin Ye,
Shidong Wang,
Tong Xin,
Haofeng Zhang
Abstract:
Few-shot medical image segmentation (FSMIS) aims to learn from limited annotated data in medical image analysis. Despite the progress that has been achieved, current FSMIS models are all trained and deployed on the same data domain, which is not consistent with the clinical reality that medical imaging data always spans different data domains (e.g. imaging modalities, institutions and equipment sequences). How can FSMIS models be enhanced to generalize well across different specific medical imaging domains? In this paper, we focus on the matching mechanism of few-shot semantic segmentation models and introduce an Earth Mover's Distance (EMD) calculation based domain robust matching mechanism for the cross-domain scenario. Specifically, we formulate the EMD transportation process between the foreground support-query features. A texture structure aware weights generation method, which performs Sobel-based image gradient calculation over the nodes, is introduced in the EMD matching flow to restrain the domain relevant nodes. Besides, a point set level distance measurement metric is introduced to calculate the cost of transportation from support set nodes to query set nodes. To evaluate the performance of our model, we conduct experiments on three scenarios (i.e., cross-modal, cross-sequence and cross-institution), which include eight medical datasets and involve three body regions, and the results demonstrate that our model achieves SoTA performance against the compared models.
Submitted 25 March, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
Preconditioning for Accelerated Gradient Descent Optimization and Regularization
Authors:
Qiang Ye
Abstract:
Accelerated training algorithms, such as adaptive learning rates and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard optimizers like adaptive learning rates may not perform effectively. This raises the need for alternative regularization approaches and the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning as follows: (1) We explain how preconditioning with AdaGrad, RMSProp, and Adam accelerates training; (2) We explore the interaction between regularization and preconditioning, outlining different options for selecting the variables for regularization, and in particular we discuss how to implement that for the gradient regularization; and (3) We demonstrate how normalization methods accelerate training by improving Hessian conditioning, and discuss how this perspective can lead to new preconditioning training algorithms. Our findings offer a unified mathematical framework for understanding various acceleration techniques and deriving appropriate regularization schemes.
Submitted 30 September, 2024;
originally announced October 2024.
-
Minor planets, asteroids, comets and interplanetary dust within 30 au
Authors:
Quanzhi Ye
Abstract:
Our Solar System includes the Sun, eight major planets and their moons, along with numerous asteroids, comets, and dust particles, collectively known as the small Solar System bodies. Small bodies are relics from the birth of the Solar System and offer valuable insights into planetary formation and the origins of life. This chapter explores this important component of our Solar System, discussing the formation and evolution of key small body populations and their interrelations.
Submitted 14 September, 2024;
originally announced September 2024.
-
AdaptiveFusion: Adaptive Multi-Modal Multi-View Fusion for 3D Human Body Reconstruction
Authors:
Anjun Chen,
Xiangyu Wang,
Zhi Xu,
Kun Shi,
Yan Qin,
Yuchi Huo,
Jiming Chen,
Qi Ye
Abstract:
Recent advancements in sensor technology and deep learning have led to significant progress in 3D human body reconstruction. However, most existing approaches rely on data from a specific sensor, which can be unreliable due to the inherent limitations of individual sensing modalities. Additionally, existing multi-modal fusion methods generally require customized designs based on the specific sensor combinations or setups, which limits the flexibility and generality of these methods. Furthermore, conventional point-image projection-based and Transformer-based fusion networks are susceptible to the influence of noisy modalities and sensor poses. To address these limitations and achieve robust 3D human body reconstruction in various conditions, we propose AdaptiveFusion, a generic adaptive multi-modal multi-view fusion framework that can effectively incorporate arbitrary combinations of uncalibrated sensor inputs. By treating different modalities from various viewpoints as equal tokens and using a handcrafted modality sampling module that leverages the inherent flexibility of Transformer models, AdaptiveFusion is able to cope with arbitrary numbers of inputs and accommodate noisy modalities with only a single training network. Extensive experiments on large-scale human datasets demonstrate the effectiveness of AdaptiveFusion in achieving high-quality 3D human body reconstruction in various environments. In addition, our method achieves superior accuracy compared to state-of-the-art fusion methods.
Submitted 13 March, 2025; v1 submitted 7 September, 2024;
originally announced September 2024.
-
"Yes, My LoRD." Guiding Language Model Extraction with Locality Reinforced Distillation
Authors:
Zi Liang,
Qingqing Ye,
Yanyun Wang,
Sen Zhang,
Yaxin Xiao,
Ronghua Li,
Jianliang Xu,
Haibo Hu
Abstract:
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt the extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm specifically designed for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that utilizes the responses of the victim model as the signal to guide the crafting of preferences for the local model. Theoretical analyses demonstrate that I) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and II) LoRD can reduce query complexity while mitigating watermark protection through our exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at: https://github.com/liangzid/LoRD-MEA .
Submitted 19 May, 2025; v1 submitted 4 September, 2024;
originally announced September 2024.
-
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning
Authors:
Bohao Xing,
Zitong Yu,
Xin Liu,
Kaishen Yuan,
Qilang Ye,
Weicheng Xie,
Huanjing Yue,
Jingyu Yang,
Heikki Kälviäinen
Abstract:
Facial expression recognition (FER) is an important research topic in emotional artificial intelligence. In recent decades, researchers have made remarkable progress. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making their application in multimodal emotion understanding and human-computer interaction difficult. Multimodal Large Language Models (MLLMs) have recently achieved success, offering advantages in addressing these issues and potentially overcoming the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges. Our zero-shot evaluations of existing open-source MLLMs on FER indicate a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs' capabilities in understanding facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we utilize a handcrafted prompt to introduce age-gender-race attributes, considering the emotional differences across different human groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets. The instruction dataset and code are available at https://github.com/xxtars/EMO-LLaMA.
Submitted 21 August, 2024;
originally announced August 2024.
-
Vision Calorimeter: Migrating Visual Object Detector to High-energy Particle Images
Authors:
Hongtian Yu,
Yangu Li,
Yunfan Liu,
Yunxuan Song,
Xiaorui Lyu,
Qixiang Ye
Abstract:
In high-energy physics, accurately estimating the kinematic parameters (position and momentum) of anti-neutrons ($\bar{n}$) is essential for exploring the fundamental governing principles. However, this process is particularly challenging when using an electromagnetic calorimeter (EMC) as the energy detector, due to their limited accuracy and efficiency in interacting with $\bar{n}$. To address this issue, we propose Vision Calorimeter (ViC), a data-driven framework which migrates visual object detection techniques to high-energy particle images. To accommodate the unique characteristics of particle images, we introduce the heat-conduction operator (HCO) into both the backbone and the head of the conventional object detector and conduct significant structural improvements. HCO enjoys the advantage of both radial prior and global attention, as it is inspired by physical heat conduction which naturally aligns with the pattern of particle incidence. Implemented via the Discrete Cosine Transform (DCT), HCO extracts frequency-domain features, bridging the distribution gap between the particle images and the natural images on which visual object detectors are pre-trained. Experimental results demonstrate that ViC significantly outperforms traditional approaches, reducing the incident position prediction error by 46.16% (from 17.31$^{\circ}$ to 9.32$^{\circ}$) and providing the first baseline result with an incident momentum regression error of 21.48%. This study underscores ViC's great potential as a general-purpose particle parameter estimator in high-energy physics. Code is available at https://github.com/yuhongtian17/ViC.
Submitted 16 February, 2025; v1 submitted 20 August, 2024;
originally announced August 2024.
-
Depth-guided Texture Diffusion for Image Semantic Segmentation
Authors:
Wei Sun,
Yuan Li,
Qixiang Ye,
Jianbin Jiao,
Yanzhao Zhou
Abstract:
Depth information provides valuable insights into 3D structure, especially the outlines of objects, which can be utilized to improve semantic segmentation tasks. However, a naive fusion of depth information can disrupt features and compromise accuracy due to the modality gap between depth and vision. In this work, we introduce a Depth-guided Texture Diffusion approach that effectively tackles the outlined challenge. Our method extracts low-level features from edges and textures to create a texture image. This image is then selectively diffused across the depth map, enhancing structural information vital for precisely extracting object outlines. By integrating this enriched depth map with the original RGB image into a joint feature embedding, our method effectively bridges the disparity between the depth map and the image, enabling more accurate semantic segmentation. We conduct comprehensive experiments across diverse, commonly-used datasets spanning a wide range of semantic segmentation tasks, including Camouflaged Object Detection (COD), Salient Object Detection (SOD), and indoor semantic segmentation. With source-free estimated depth or depth captured by depth cameras, our method consistently outperforms existing baselines and achieves new state-of-the-art results, demonstrating the effectiveness of our Depth-guided Texture Diffusion for image semantic segmentation.
Submitted 17 August, 2024;
originally announced August 2024.
-
Correspondence-Guided SfM-Free 3D Gaussian Splatting for NVS
Authors:
Wei Sun,
Xiaosong Zhang,
Fang Wan,
Yanzhao Zhou,
Yuan Li,
Qixiang Ye,
Jianbin Jiao
Abstract:
Novel View Synthesis (NVS) without Structure-from-Motion (SfM) pre-processed camera poses--referred to as SfM-free methods--is crucial for promoting rapid response capabilities and enhancing robustness against variable operating conditions. Recent SfM-free methods have integrated pose optimization, designing end-to-end frameworks for joint camera pose estimation and NVS. However, most existing works rely on per-pixel image loss functions, such as L2 loss. In SfM-free methods, inaccurate initial poses lead to misalignment issues, which, under the constraints of per-pixel image loss functions, result in excessive gradients, causing unstable optimization and poor convergence for NVS. In this study, we propose a correspondence-guided SfM-free 3D Gaussian splatting method for NVS. We use correspondences between the target and the rendered result to achieve better pixel alignment, facilitating the optimization of relative poses between frames. We then apply the learned poses to optimize the entire scene. Each 2D screen-space pixel is associated with its corresponding 3D Gaussians through approximated surface rendering to facilitate gradient back propagation. Experimental results underline the superior performance and time efficiency of the proposed approach compared to the state-of-the-art baselines.
Submitted 16 August, 2024;
originally announced August 2024.
-
Stabilizer bootstrapping: A recipe for efficient agnostic tomography and magic estimation
Authors:
Sitan Chen,
Weiyuan Gong,
Qi Ye,
Zhihan Zhang
Abstract:
We study the task of agnostic tomography: given copies of an unknown $n$-qubit state $ρ$ which has fidelity $τ$ with some state in a given class $C$, find a state which has fidelity $\ge τ- ε$ with $ρ$. We give a new framework, stabilizer bootstrapping, for designing computationally efficient protocols for this task, and use this to get new agnostic tomography protocols for the following classes:
Stabilizer states: We give a protocol that runs in time $\mathrm{poly}(n,1/ε)\cdot (1/τ)^{O(\log(1/τ))}$, answering an open question posed by Grewal, Iyer, Kretschmer, Liang [43] and Anshu and Arunachalam [6]. Previous protocols ran in time $\mathrm{exp}(Θ(n))$ or required $τ>\cos^2(π/8)$.
States with stabilizer dimension $n - t$: We give a protocol that runs in time $n^3\cdot(2^t/τ)^{O(\log(1/ε))}$, extending recent work on learning quantum states prepared by circuits with few non-Clifford gates, which only applied in the realizable setting where $τ= 1$ [33, 40, 49, 66].
Discrete product states: If $C = K^{\otimes n}$ for some $μ$-separated discrete set $K$ of single-qubit states, we give a protocol that runs in time $(n/μ)^{O((1 + \log (1/τ))/μ)}/ε^2$. This strictly generalizes a prior guarantee which applied to stabilizer product states [42]. For stabilizer product states, we give a further improved protocol that runs in time $(n^2/ε^2)\cdot (1/τ)^{O(\log(1/τ))}$.
As a corollary, we give the first protocol for estimating stabilizer fidelity, a standard measure of magic for quantum states, to error $ε$ in $n^3 \mathrm{quasipoly}(1/ε)$ time.
Submitted 4 December, 2024; v1 submitted 13 August, 2024;
originally announced August 2024.
-
Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models
Authors:
Zi Liang,
Haibo Hu,
Qingqing Ye,
Yaxin Xiao,
Haoyang Li
Abstract:
The drastic increase in the parameters of large language models (LLMs) has led to a new research direction of fine-tuning-free downstream customization by prompts, i.e., task descriptions. While these prompt-based services (e.g. OpenAI's GPTs) play an important role in many businesses, there are growing concerns about prompt leakage, which undermines the intellectual property of these services and enables downstream attacks. In this paper, we analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defense strategies. By exploring the scaling laws in prompt extraction, we analyze key attributes that influence prompt extraction, including model sizes, prompt lengths, as well as the types of prompts. Then we propose two hypotheses that explain how LLMs expose their prompts. The first is attributed to the perplexity, i.e. the familiarity of LLMs with texts, whereas the second is based on the straightforward token translation path in attention matrices. To defend against such threats, we investigate whether alignments can undermine the extraction of prompts. We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks, even under the most straightforward user attacks. Therefore, we put forward several defense strategies inspired by our findings, which achieve 83.8% and 71.0% drops in the prompt extraction rate for Llama2-7B and GPT-3.5, respectively. Source code is available at https://github.com/liangzid/PromptExtractionEval.
Submitted 12 February, 2025; v1 submitted 5 August, 2024;
originally announced August 2024.
-
Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack
Authors:
Xiaoyue Xu,
Qinyuan Ye,
Xiang Ren
Abstract:
We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline.
Task Haystack draws inspiration from the widely-adopted "needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges. It requires models (1) to utilize the contexts at a deeper level, rather than resorting to simple copying and pasting; (2) to navigate through long streams of evolving topics and tasks, proxying the complexities and dynamism of contexts in real-world scenarios. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively.
We benchmark 14 long-context LMs using Task Haystack, finding that frontier models like GPT-4o still struggle with the setting, failing on 15% of cases on average. Most open-weight models further lag behind by a large margin, with failure rates reaching up to 61%. In our controlled analysis, we identify factors such as distraction and recency bias as contributors to these failure cases. Further, performance declines when task instructions are paraphrased at test time or when ICL demonstrations are repeated excessively, raising concerns about the robustness, instruction understanding, and true context utilization of long-context LMs.
Submitted 2 December, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
A Fast and Accurate Solver for the Fractional Fokker-Planck Equation with Dirac-Delta Initial Conditions
Authors:
Qihao Ye,
Xiaochuan Tian,
Dong Wang
Abstract:
The classical Fokker-Planck equation (FPE) is a key tool in physics for describing systems influenced by drag forces and Gaussian noise, with applications spanning multiple fields. We consider the fractional Fokker-Planck equation (FFPE), which models the time evolution of probability densities for systems driven by Lévy processes, relevant in scenarios where Gaussian assumptions fail. The paper presents an efficient and accurate numerical approach for the free-space FFPE with constant coefficients and Dirac-delta initial conditions. This method utilizes the integral representation of the solutions and enables the efficient handling of very high-dimensional problems using fast algorithms. Our work is the first to present a high-precision numerical solver for the free-space FFPE with Dirac-delta initial conditions. This opens the door for future research on more complex scenarios, including those with variable coefficients and other types of initial conditions.
Submitted 21 July, 2024;
originally announced July 2024.
-
PriPL-Tree: Accurate Range Query for Arbitrary Distribution under Local Differential Privacy
Authors:
Leixia Wang,
Qingqing Ye,
Haibo Hu,
Xiaofeng Meng
Abstract:
Answering range queries in the context of Local Differential Privacy (LDP) is a widely studied problem in Online Analytical Processing (OLAP). Existing LDP solutions all assume a uniform data distribution within each domain partition, which may not align with real-world scenarios where data distribution is varied, resulting in inaccurate estimates. To address this problem, we introduce PriPL-Tree, a novel data structure that combines hierarchical tree structures with piecewise linear (PL) functions to answer range queries for arbitrary distributions. PriPL-Tree precisely models the underlying data distribution with a few line segments, leading to more accurate results for range queries. Furthermore, we extend it to multi-dimensional cases with novel data-aware adaptive grids. These grids leverage the insights from marginal distributions obtained through PriPL-Trees to partition the grids adaptively, adapting to the density of the underlying distributions. Our extensive experiments on both real and synthetic datasets demonstrate the effectiveness and superiority of PriPL-Tree over state-of-the-art solutions in answering range queries across arbitrary data distributions.
Submitted 24 August, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
Pre-training with Fractional Denoising to Enhance Molecular Property Prediction
Authors:
Yuyan Ni,
Shikun Feng,
Xin Hong,
Yuancheng Sun,
Wei-Ying Ma,
Zhi-Ming Ma,
Qiwei Ye,
Yanyan Lan
Abstract:
Deep learning methods have been considered promising for accelerating molecular screening in drug discovery and material design. Due to the limited availability of labelled data, various self-supervised molecular pre-training methods have been presented. While many existing methods utilize common pre-training tasks in computer vision (CV) and natural language processing (NLP), they often overlook the fundamental physical principles governing molecules. In contrast, applying denoising in pre-training can be interpreted as an equivalent force learning, but the limited noise distribution introduces bias into the molecular distribution. To address this issue, we introduce a molecular pre-training framework called fractional denoising (Frad), which decouples noise design from the constraints imposed by force learning equivalence. In this way, the noise becomes customizable, allowing for incorporating chemical priors to significantly improve molecular distribution modeling. Experiments demonstrate that our framework consistently outperforms existing methods, establishing state-of-the-art results across force prediction, quantum chemical properties, and binding affinity tasks. The refined noise design enhances force accuracy and sampling coverage, which contribute to the creation of physically consistent molecular representations, ultimately leading to superior predictive performance.
Submitted 14 July, 2024;
originally announced July 2024.
-
Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge
Authors:
Sriram Yenamandra,
Arun Ramachandran,
Mukul Khanna,
Karmesh Yadav,
Jay Vakil,
Andrew Melnik,
Michael Büttner,
Leon Harz,
Lyon Brown,
Gora Chand Nandi,
Arjun PS,
Gaurav Kumar Yadav,
Rahul Kala,
Robert Haschke,
Yang Luo,
Jinxin Zhu,
Yansen Han,
Bingyi Lu,
Xuan Gu,
Qinyuan Liu,
Yaping Zhao,
Qiting Ye,
Chenxiao Dou,
Yansong Chua,
Volodymyr Kuzma
, et al. (20 additional authors not shown)
Abstract:
In order to develop robots that can effectively serve as versatile and capable home assistants, it is crucial for them to reliably perceive and interact with a wide variety of objects across diverse environments. To this end, we proposed Open Vocabulary Mobile Manipulation as a key benchmark task for robotics: finding any object in a novel environment and placing it on any receptacle surface within that environment. We organized a NeurIPS 2023 competition featuring both simulation and real-world components to evaluate solutions to this task. Our baselines on the most challenging version of this task, using real perception in simulation, achieved only a 0.8% success rate; by the end of the competition, the best participants achieved a 10.8% success rate, a 13x improvement. We observed that the most successful teams employed a variety of methods, yet two common threads emerged among the best solutions: enhancing error detection and recovery, and improving the integration of perception with decision-making processes. In this paper, we detail the results and methodologies used, both in simulation and real-world settings. We discuss the lessons learned and their implications for future research. Additionally, we compare performance in real and simulated environments, emphasizing the necessity for robust generalization to novel settings.
Submitted 9 July, 2024;
originally announced July 2024.
-
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Authors:
Zhaorun Chen,
Yichao Du,
Zichen Wen,
Yiyang Zhou,
Chenhang Cui,
Zhenzhen Weng,
Haoqin Tu,
Chaoqi Wang,
Zhengwei Tong,
Qinglan Huang,
Canyu Chen,
Qinghao Ye,
Zhihong Zhu,
Yuqing Zhang,
Jiawei Zhou,
Zhuokai Zhao,
Rafael Rafailov,
Chelsea Finn,
Huaxiu Yao
Abstract:
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs (e.g. LLaVA family), and closed-source VLMs (e.g. GPT-4o, Claude 3) on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies in feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language (Likert-scale) than numerical scales. Notably, human evaluations on end-to-end fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench. All data, code, and models are available at https://huggingface.co/MJ-Bench.
Submitted 5 July, 2024;
originally announced July 2024.
-
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
Authors:
Mingxiang Liao,
Hannan Lu,
Xinyu Zhang,
Fang Wan,
Tianyu Wang,
Yuzhong Zhao,
Wangmeng Zuo,
Qixiang Ye,
Jingdong Wang
Abstract:
Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at https://github.com/MingXiangL/DEVIL.
Submitted 1 July, 2024;
originally announced July 2024.
-
LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multimodal Large Language Models
Authors:
Mengdan Zhu,
Raasikh Kanjiani,
Jiahui Lu,
Andrew Choi,
Qirui Ye,
Liang Zhao
Abstract:
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. Our approach perturbs latent variables, interprets changes in generated data, and uses multimodal large language models (MLLMs) to produce human-understandable explanations. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations for latent variables. The results highlight the effectiveness of incorporating inductive biases and uncertainty quantification, significantly enhancing model interpretability.
Submitted 27 May, 2025; v1 submitted 21 June, 2024;
originally announced June 2024.