-
FreeDriveRF: Monocular RGB Dynamic NeRF without Poses for Autonomous Driving via Point-Level Dynamic-Static Decoupling
Authors:
Yue Wen,
Liang Song,
Yijia Liu,
Siting Zhu,
Yanzi Miao,
Lijun Han,
Hesheng Wang
Abstract:
Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate poses inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriv…
▽ More
Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate poses inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriveRF, which reconstructs dynamic driving scenes using only sequential RGB images without requiring poses inputs. We innovatively decouple dynamic and static parts at the early sampling level using semantic supervision, mitigating image blurring and artifacts. To overcome the challenges posed by object motion and occlusion in monocular camera, we introduce a warped ray-guided dynamic object rendering consistency loss, utilizing optical flow to better constrain the dynamic modeling process. Additionally, we incorporate estimated dynamic flow to constrain the pose optimization process, improving the stability and accuracy of unbounded scene reconstruction. Extensive experiments conducted on the KITTI and Waymo datasets demonstrate the superior performance of our method in dynamic scene modeling for autonomous driving.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Deep-ICE: The first globally optimal algorithm for empirical risk minimization of two-layer maxout and ReLU networks
Authors:
Xi He,
Yi Miao,
Max A. Little
Abstract:
This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be can be generalized to accommoda…
▽ More
This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance, with a 20-30\% reduction in misclassifications for both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models).
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
FalconWing: An Open-Source Platform for Ultra-Light Fixed-Wing Aircraft Research
Authors:
Yan Miao,
Will Shen,
Hang Cui,
Sayan Mitra
Abstract:
We present FalconWing -- an open-source, ultra-lightweight (150 g) fixed-wing platform for autonomy research. The hardware platform integrates a small camera, a standard airframe, offboard computation, and radio communication for manual overrides. We demonstrate FalconWing's capabilities by developing and deploying a purely vision-based control policy for autonomous landing (without IMU or motion…
▽ More
We present FalconWing -- an open-source, ultra-lightweight (150 g) fixed-wing platform for autonomy research. The hardware platform integrates a small camera, a standard airframe, offboard computation, and radio communication for manual overrides. We demonstrate FalconWing's capabilities by developing and deploying a purely vision-based control policy for autonomous landing (without IMU or motion capture) using a novel real-to-sim-to-real learning approach. Our learning approach: (1) constructs a photorealistic simulation environment via 3D Gaussian splatting trained on real-world images; (2) identifies nonlinear dynamics from vision-estimated real-flight data; and (3) trains a multi-modal Vision Transformer (ViT) policy through simulation-only imitation learning. The ViT architecture fuses single RGB image with the history of control actions via self-attention, preserving temporal context while maintaining real-time 20 Hz inference. When deployed zero-shot on the hardware platform, this policy achieves an 80% success rate in vision-based autonomous landings. Together with the hardware specifications, we also open-source the system dynamics, the software for photorealistic simulator and the learning approach.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Distributed Retrieval-Augmented Generation
Authors:
Chenhao Xu,
Longxiang Gao,
Yuan Miao,
Xi Zheng
Abstract:
As large language models (LLMs) become increasingly adopted on edge devices, Retrieval-Augmented Generation (RAG) is gaining prominence as a solution to address factual deficiencies and hallucinations by integrating external knowledge. However, centralized RAG architectures face significant challenges in data privacy and scalability. For instance, smart healthcare services often rely on collecting…
▽ More
As large language models (LLMs) become increasingly adopted on edge devices, Retrieval-Augmented Generation (RAG) is gaining prominence as a solution to address factual deficiencies and hallucinations by integrating external knowledge. However, centralized RAG architectures face significant challenges in data privacy and scalability. For instance, smart healthcare services often rely on collecting sensitive patient data and building a centralized knowledge base to provide better diagnosis and treatment advice, while privacy concerns significantly impede this process. Besides, maintaining a comprehensive and continuously updated knowledge base is costly, particularly in response to regional epidemics and rapidly mutating viruses. To address these challenges, this paper introduces Distributed Retrieval-Augmented Generation (DRAG), a novel framework that improves data privacy by eliminating the need for a centralized knowledge base and restoring data control to owners. DRAG incorporates a Topic-Aware Random Walk (TARW) algorithm that leverages LLMs to extract query topics and facilitate targeted peer discovery within a peer-to-peer network, enabling efficient knowledge retrieval in decentralized environments. Extensive experiments across three diverse datasets and LLMs demonstrate that DRAG with TARW achieves near-centralized RAG performance by using half as many messages as flooding. The code is available at https://github.com/xuchenhao001/DRAG.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Authors:
Toghrul Abbasli,
Kentaroh Toyoda,
Yuan Wang,
Leon Witt,
Muhammad Asif Ali,
Yukai Miao,
Dan Li,
Qingsong Wei
Abstract:
Large Language Models (LLMs) have been transformative across many domains. However, hallucination -- confidently outputting incorrect information -- remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty a…
▽ More
Large Language Models (LLMs) have been transformative across many domains. However, hallucination -- confidently outputting incorrect information -- remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark to enable insightful comparison among existing solutions. In this work, we fill this gap via a systematic survey of representative prior works on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods, which justify the significant findings of our review. Finally, we provide outlooks for key future directions and outline open challenges. To the best of our knowledge, this survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
DeepGreen: Effective LLM-Driven Green-washing Monitoring System Designed for Empirical Testing -- Evidence from China
Authors:
Congluo Xu,
Yu Miao,
Yiling Xiao,
Chengmengjia Lin
Abstract:
This paper proposes DeepGreen, an Large Language Model Driven (LLM-Driven) system for detecting corporate green-washing behaviour. Utilizing dual-layer LLM analysis, DeepGreen preliminarily identifies potential green keywords in financial statements and then assesses their implementation degree via iterative semantic analysis of LLM. A core variable GreenImplement is derived from the ratio from th…
▽ More
This paper proposes DeepGreen, an Large Language Model Driven (LLM-Driven) system for detecting corporate green-washing behaviour. Utilizing dual-layer LLM analysis, DeepGreen preliminarily identifies potential green keywords in financial statements and then assesses their implementation degree via iterative semantic analysis of LLM. A core variable GreenImplement is derived from the ratio from the two layers' output. We extract 204 financial statements of 68 companies from A-share market over three years, comprising 89,893 words, and analyse them through DeepGreen. Our analysis, supported by violin plots and K-means clustering, reveals insights and validates the variable against the Huazheng ESG rating. It offers a novel perspective for regulatory agencies and investors, serving as a proactive monitoring tool that complements traditional methods.Empirical tests show that green implementation can significantly boost the asset return rate of companies, but there is heterogeneity in scale. Small and medium-sized companies have limited contribution to asset return via green implementation, so there is a stronger motivation for green-washing.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
Kimi-VL Technical Report
Authors:
Kimi Team,
Angang Du,
Bohong Yin,
Bowei Xing,
Bowen Qu,
Bowen Wang,
Cheng Chen,
Chenlin Zhang,
Chenzhuang Du,
Chu Wei,
Congcong Wang,
Dehao Zhang,
Dikang Du,
Dongliang Wang,
Enming Yuan,
Enzhe Lu,
Fang Li,
Flood Sung,
Guangda Wei,
Guokun Lai,
Han Zhu,
Hao Ding,
Hao Hu,
Hao Yang,
Hao Zhang
, et al. (68 additional authors not shown)
Abstract:
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-…
▽ More
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
△ Less
Submitted 15 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking
Authors:
Chang Nie,
Yiqing Xu,
Guangming Wang,
Zhe Liu,
Yanzi Miao,
Hesheng Wang
Abstract:
Moving object segmentation plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the ab…
▽ More
Moving object segmentation plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This innovative approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where the multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5\% on J\&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
Authors:
Yanhao Dong,
Yubo Miao,
Weinan Li,
Xiao Zheng,
Chao Wang,
Feng Lyu
Abstract:
Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method to break through the memory bandwidth bottleneck in LLM inference through computation-load overlap. By strategically scheduling idle memory bandwidth during act…
▽ More
Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method to break through the memory bandwidth bottleneck in LLM inference through computation-load overlap. By strategically scheduling idle memory bandwidth during active computation windows, our method proactively prefetches required KV Cache into GPU L2 cache, enabling high-speed L2 cache hits for subsequent accesses and effectively hiding HBM access latency within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves 2.15x improvement in attention kernel efficiency and up to 1.97x end-to-end throughput enhancement, surpassing state-of-the-art baseline FlashAttention-3. Notably, our solution maintains orthogonality to existing optimization techniques and can be integrated with current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Authors:
Yubo Li,
Xiaobin Shen,
Xinyu Yao,
Xueying Ding,
Yidi Miao,
Ramayya Krishnan,
Rema Padman
Abstract:
Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains…
▽ More
Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs.
△ Less
Submitted 13 May, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Authors:
Yubo Li,
Yidi Miao,
Xueying Ding,
Ramayya Krishnan,
Rema Padman
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score…
▽ More
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results demonstrate that CARG significantly improves response stability without sacrificing accuracy, underscoring its potential for reliable LLM deployment in critical applications.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model
Authors:
Yingmao Miao,
Zhanpeng Huang,
Rui Han,
Zibin Wang,
Chenhao Lin,
Chao Shen
Abstract:
While virtual try-on for clothes and shoes with diffusion models has gained attraction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances…
▽ More
While virtual try-on for clothes and shoes with diffusion models has gained attraction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances between ornaments and models. This paper proposes the task of virtual try-on for ornaments and presents a method to improve the geometric and appearance preservation of ornament virtual try-ons. Specifically, we estimate an accurate wearing mask to improve the alignments between ornaments and models in an iterative scheme alongside the denoising process. To preserve structure details, we further regularize attention layers to map the reference ornament mask to the wearing mask in an implicit way. Experimental results demonstrate that our method successfully wears ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
FedSAF: A Federated Learning Framework for Enhanced Gastric Cancer Detection and Privacy Preservation
Authors:
Yuxin Miao,
Xinyuan Yang,
Hongda Fan,
Yichun Li,
Yishu Hong,
Xiechen Guo,
Ali Braytee,
Weidong Huang,
Ali Anaissi
Abstract:
Gastric cancer is one of the most commonly diagnosed cancers and has a high mortality rate. Due to limited medical resources, developing machine learning models for gastric cancer recognition provides an efficient solution for medical institutions. However, such models typically require large sample sizes for training and testing, which can challenge patient privacy. Federated learning offers an e…
▽ More
Gastric cancer is one of the most commonly diagnosed cancers and has a high mortality rate. Due to limited medical resources, developing machine learning models for gastric cancer recognition provides an efficient solution for medical institutions. However, such models typically require large sample sizes for training and testing, which can challenge patient privacy. Federated learning offers an effective alternative by enabling model training across multiple institutions without sharing sensitive patient data. This paper addresses the limited sample size of publicly available gastric cancer data with a modified data processing method. This paper introduces FedSAF, a novel federated learning algorithm designed to improve the performance of existing methods, particularly in non-independent and identically distributed (non-IID) data scenarios. FedSAF incorporates attention-based message passing and the Fisher Information Matrix to enhance model accuracy, while a model splitting function reduces computation and transmission costs. Hyperparameter tuning and ablation studies demonstrate the effectiveness of this new algorithm, showing improvements in test accuracy on gastric cancer datasets, with FedSAF outperforming existing federated learning methods like FedAMP, FedAvg, and FedProx. The framework's robustness and generalization ability were further validated across additional datasets (SEED, BOT, FashionMNIST, and CIFAR-10), achieving high performance in diverse environments.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
StructVizor: Interactive Profiling of Semi-Structured Textual Data
Authors:
Yanwei Huang,
Yan Miao,
Di Weng,
Adam Perer,
Yingcai Wu
Abstract:
Data profiling plays a critical role in understanding the structure of complex datasets and supporting numerous downstream tasks, such as social media analytics and financial fraud detection. While existing research predominantly focuses on structured data formats, a substantial portion of semi-structured textual data still requires ad-hoc and arduous manual profiling to extract and comprehend its…
▽ More
Data profiling plays a critical role in understanding the structure of complex datasets and supporting numerous downstream tasks, such as social media analytics and financial fraud detection. While existing research predominantly focuses on structured data formats, a substantial portion of semi-structured textual data still requires ad-hoc and arduous manual profiling to extract and comprehend its internal structures. In this work, we propose StructVizor, an interactive profiling system that facilitates sensemaking and transformation of semi-structured textual data. Our tool mainly addresses two challenges: a) extracting and visualizing the diverse structural patterns within data, such as how information is organized or related, and b) enabling users to efficiently perform various wrangling operations on textual data. Through automatic data parsing and structure mining, StructVizor enables visual analytics of structural patterns, while incorporating novel interactions to enable profile-based data wrangling. A comparative user study involving 12 participants demonstrates the system's usability and its effectiveness in supporting exploratory data analysis and transformation tasks.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
Provable Robust Overfitting Mitigation in Wasserstein Distributionally Robust Optimization
Authors:
Shuang Liu,
Yihan Wang,
Yifan Zhu,
Yibo Miao,
Xiao-Shan Gao
Abstract:
Wasserstein distributionally robust optimization (WDRO) optimizes against worst-case distributional shifts within a specified uncertainty set, leading to enhanced generalization on unseen adversarial examples, compared to standard adversarial training which focuses on pointwise adversarial perturbations. However, WDRO still suffers fundamentally from the robust overfitting problem, as it does not…
▽ More
Wasserstein distributionally robust optimization (WDRO) optimizes against worst-case distributional shifts within a specified uncertainty set, leading to enhanced generalization on unseen adversarial examples, compared to standard adversarial training which focuses on pointwise adversarial perturbations. However, WDRO still suffers fundamentally from the robust overfitting problem, as it does not consider statistical error. We address this gap by proposing a novel robust optimization framework under a new uncertainty set for adversarial noise via Wasserstein distance and statistical error via Kullback-Leibler divergence, called the Statistically Robust WDRO. We establish a robust generalization bound for the new optimization framework, implying that out-of-distribution adversarial performance is at least as good as the statistically robust training loss with high probability. Furthermore, we derive conditions under which Stackelberg and Nash equilibria exist between the learner and the adversary, giving an optimal robust model in certain sense. Finally, through extensive experiments, we demonstrate that our method significantly mitigates robust overfitting and enhances robustness within the framework of WDRO.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability
Authors:
Lijia Yu,
Yibo Miao,
Yifan Zhu,
Xiao-Shan Gao,
Lijun Zhang
Abstract:
The primary objective of learning methods is generalization. Classic uniform generalization bounds, which rely on VC-dimension or Rademacher complexity, fail to explain the significant attribute that over-parameterized models in deep learning exhibit nice generalizability. On the other hand, algorithm-dependent generalization bounds, like stability bounds, often rely on strict assumptions. To esta…
▽ More
The primary objective of learning methods is generalization. Classic uniform generalization bounds, which rely on VC-dimension or Rademacher complexity, fail to explain the significant attribute that over-parameterized models in deep learning exhibit nice generalizability. On the other hand, algorithm-dependent generalization bounds, like stability bounds, often rely on strict assumptions. To establish generalizability under less stringent assumptions, this paper investigates the generalizability of neural networks that minimize or approximately minimize empirical risk. We establish a lower bound for population accuracy based on the expressiveness of these networks, which indicates that with an adequate large number of training samples and network sizes, these networks, including over-parameterized ones, can generalize effectively. Additionally, we provide a necessary condition for generalization, demonstrating that, for certain data distributions, the quantity of training data required to ensure generalization exceeds the network size needed to represent the corresponding data distribution. Finally, we provide theoretical insights into several phenomena in deep learning, including robust generalization, importance of over-parameterization, and effect of loss function on generalization.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Zero-Shot Sim-to-Real Visual Quadrotor Control with Hard Constraints
Authors:
Yan Miao,
Will Shen,
Sayan Mitra
Abstract:
We present the first framework demonstrating zero-shot sim-to-real transfer of visual control policies learned in a Neural Radiance Field (NeRF) environment for quadrotors to fly through racing gates. Robust transfer from simulation to real flight poses a major challenge, as standard simulators often lack sufficient visual fidelity. To address this, we construct a photorealistic simulation environ…
▽ More
We present the first framework demonstrating zero-shot sim-to-real transfer of visual control policies learned in a Neural Radiance Field (NeRF) environment for quadrotors to fly through racing gates. Robust transfer from simulation to real flight poses a major challenge, as standard simulators often lack sufficient visual fidelity. To address this, we construct a photorealistic simulation environment of quadrotor racing tracks, called FalconGym, which provides effectively unlimited synthetic images for training. Within FalconGym, we develop a pipelined approach for crossing gates that combines (i) a Neural Pose Estimator (NPE) coupled with a Kalman filter to reliably infer quadrotor poses from single-frame RGB images and IMU data, and (ii) a self-attention-based multi-modal controller that adaptively integrates visual features and pose estimation. This multi-modal design compensates for perception noise and intermittent gate visibility. We train this controller purely in FalconGym with imitation learning and deploy the resulting policy to real hardware with no additional fine-tuning. Simulation experiments on three distinct tracks (circle, U-turn and figure-8) demonstrate that our controller outperforms a vision-only state-of-the-art baseline in both success rate and gate-crossing accuracy. In 30 live hardware flights spanning three tracks and 120 gates, our controller achieves a 95.8% success rate and an average error of just 10 cm when flying through 38 cm-radius gates.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
CREATE-FFPE: Cross-Resolution Compensated and Multi-Frequency Enhanced FS-to-FFPE Stain Transfer for Intraoperative IHC Images
Authors:
Yiyang Lin,
Danling Jiang,
Xinyu Liu,
Yun Miao,
Yixuan Yuan
Abstract:
In the immunohistochemical (IHC) analysis during surgery, frozen-section (FS) images are used to determine the benignity or malignancy of the tumor. However, FS image faces problems such as image contamination and poor nuclear detail, which may disturb the pathologist's diagnosis. In contrast, formalin-fixed and paraffin-embedded (FFPE) image has a higher staining quality, but it requires quite a…
▽ More
In the immunohistochemical (IHC) analysis during surgery, frozen-section (FS) images are used to determine the benignity or malignancy of the tumor. However, FS image faces problems such as image contamination and poor nuclear detail, which may disturb the pathologist's diagnosis. In contrast, formalin-fixed and paraffin-embedded (FFPE) image has a higher staining quality, but it requires quite a long time to prepare and thus is not feasible during surgery. To help pathologists observe IHC images with high quality in surgery, this paper proposes a Cross-REsolution compensATed and multi-frequency Enhanced FS-to-FFPE (CREATE-FFPE) stain transfer framework, which is the first FS-to-FFPE method for the intraoperative IHC images. To solve the slide contamination and poor nuclear detail mentioned above, we propose the cross-resolution compensation module (CRCM) and the wavelet detail guidance module (WDGM). Specifically, CRCM compensates for information loss due to contamination by providing more tissue information across multiple resolutions, while WDGM produces the desirable details in a wavelet way, and the details can be used to guide the stain transfer to be more precise. Experiments show our method can beat all the competing methods on our dataset. In addition, the FID has decreased by 44.4%, and KID*100 has decreased by 71.2% by adding the proposed CRCM and WDGM in ablation studies, and the performance of a downstream microsatellite instability prediction task with public dataset can be greatly improved by performing our FS-to-FFPE stain transfer.
△ Less
Submitted 1 March, 2025;
originally announced March 2025.
-
Trinity: A Scalable and Forward-Secure DSSE for Spatio-Temporal Range Query
Authors:
Zhijun Li,
Kuizhi Liu,
Minghui Xu,
Xiangyu Wang,
Yinbin Miao,
Jianfeng Ma,
Xiuzhen Cheng
Abstract:
Cloud-based outsourced Location-based services have profound impacts on various aspects of people's lives but bring security concerns. Existing spatio-temporal data secure retrieval schemes have significant shortcomings regarding dynamic updates, either compromising privacy through leakage during updates (forward insecurity) or incurring excessively high update costs that hinder practical applicat…
▽ More
Cloud-based outsourced Location-based services have profound impacts on various aspects of people's lives but bring security concerns. Existing spatio-temporal data secure retrieval schemes have significant shortcomings regarding dynamic updates, either compromising privacy through leakage during updates (forward insecurity) or incurring excessively high update costs that hinder practical application. Under these circumstances, we first propose a basic filter-based spatio-temporal range query scheme \TrinityI that supports low-cost dynamic updates and automatic expansion. Furthermore, to improve security, reduce storage cost, and false positives, we propose a forward secure and verifiable scheme \TrinityII that simultaneously minimizes storage overhead. A formal security analysis proves that \TrinityI and \TrinityII are Indistinguishable under Selective Chosen-Plaintext Attack (IND-SCPA). Finally, extensive experiments demonstrate that our design \TrinityII significantly reduces storage requirements by 80\%, enables data retrieval at the 1 million-record level in just 0.01 seconds, and achieves 10 $\times$ update efficiency than state-of-art.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Multispectral Remote Sensing for Weed Detection in West Australian Agricultural Lands
Authors:
Haitian Wang,
Muhammad Ibrahim,
Yumeng Miao,
D ustin Severtson,
Atif Mansoor,
Ajmal S. Mian
Abstract:
The Kondinin region in Western Australia faces significant agricultural challenges due to pervasive weed infestations, causing economic losses and ecological impacts. This study constructs a tailored multispectral remote sensing dataset and an end-to-end framework for weed detection to advance precision agriculture practices. Unmanned aerial vehicles were used to collect raw multispectral data fro…
▽ More
The Kondinin region in Western Australia faces significant agricultural challenges due to pervasive weed infestations, causing economic losses and ecological impacts. This study constructs a tailored multispectral remote sensing dataset and an end-to-end framework for weed detection to advance precision agriculture practices. Unmanned aerial vehicles were used to collect raw multispectral data from two experimental areas (E2 and E8) over four years, covering 0.6046 km^{2} and ground truth annotations were created with GPS-enabled vehicles to manually label weeds and crops. The dataset is specifically designed for agricultural applications in Western Australia. We propose an end-to-end framework for weed detection that includes extensive preprocessing steps, such as denoising, radiometric calibration, image alignment, orthorectification, and stitching. The proposed method combines vegetation indices (NDVI, GNDVI, EVI, SAVI, MSAVI) with multispectral channels to form classification features, and employs several deep learning models to identify weeds based on the input features. Among these models, ResNet achieves the highest performance, with a weed detection accuracy of 0.9213, an F1-Score of 0.8735, an mIOU of 0.7888, and an mDC of 0.8865, validating the efficacy of the dataset and the proposed weed detection method.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Multi-Agent Collaboration for Multilingual Code Instruction Tuning
Authors:
Jian Yang,
Wei Zhang,
Jiaxi Yang,
Yibo Miao,
Shanghaoran Quan,
Zhenhe Wu,
Qiyao Peng,
Liqun Yang,
Tianyu Liu,
Zeyu Cui,
Binyuan Hui,
Junyang Lin
Abstract:
Recent advancement in code understanding and generation demonstrates that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most previous existing methods mainly view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among diff…
▽ More
Recent advancement in code understanding and generation demonstrates that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most previous existing methods mainly view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language-specific instruction data from the code snippets and then provide the generated data as the seed data for language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (A new programming language or existing programming language), To further encourage the cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5-xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
Authors:
Yuchun Miao,
Sen Zhang,
Liang Ding,
Yuqi Zhang,
Lefei Zhang,
Dacheng Tao
Abstract:
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundati…
▽ More
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
△ Less
Submitted 4 February, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Authors:
Yujia Qin,
Yining Ye,
Junjie Fang,
Haoming Wang,
Shihao Liang,
Shizuo Tian,
Junda Zhang,
Jiahao Li,
Yunxin Li,
Shijue Huang,
Wanjun Zhong,
Kuanye Li,
Jiale Yang,
Yu Miao,
Woyu Lin,
Longxiang Liu,
Xu Jiang,
Qianli Ma,
Jingyu Li,
Xiaojun Xiao,
Kai Cai,
Chuang Li,
Yaowei Zheng,
Chaolin Jin,
Chen Li
, et al. (10 additional authors not shown)
Abstract:
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks.…
▽ More
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
△ Less
Submitted 21 January, 2025;
originally announced January 2025.
-
GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction
Authors:
Yuwei Miao,
Yuzhi Guo,
Hehuan Ma,
Jingquan Yan,
Feng Jiang,
Rui Liao,
Junzhou Huang
Abstract:
Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D-structure…
▽ More
Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D-structures or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring Gene Ontology graph and annotation with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes the function prediction to gene and gene products. Specifically, two pre-train tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. Specified masking and recovering task helps GoBERT in finding implicit patterns among functions. The pre-trained GoBERT possess the ability to predict novel functions for various gene and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.
△ Less
Submitted 3 January, 2025;
originally announced January 2025.
-
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Authors:
Shanghaoran Quan,
Jiaxi Yang,
Bowen Yu,
Bo Zheng,
Dayiheng Liu,
An Yang,
Xuancheng Ren,
Bofei Gao,
Yibo Miao,
Yunlong Feng,
Zekun Wang,
Jian Yang,
Zeyu Cui,
Yang Fan,
Yichang Zhang,
Binyuan Hui,
Junyang Lin
Abstract:
With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of…
▽ More
With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on our CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 25 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.
△ Less
Submitted 3 January, 2025; v1 submitted 2 January, 2025;
originally announced January 2025.
-
FasterSTS: A Faster Spatio-Temporal Synchronous Graph Convolutional Networks for Traffic flow Forecasting
Authors:
Ben-Ao Dai,
Nengchao Lyu,
Yongchao Miao
Abstract:
Accurate traffic flow prediction heavily relies on the spatio-temporal correlation of traffic flow data. Most current studies separately capture correlations in spatial and temporal dimensions, making it difficult to capture complex spatio-temporal heterogeneity, and often at the expense of increasing model complexity to improve prediction accuracy. Although there have been groundbreaking attempts…
▽ More
Accurate traffic flow prediction heavily relies on the spatio-temporal correlation of traffic flow data. Most current studies separately capture correlations in spatial and temporal dimensions, making it difficult to capture complex spatio-temporal heterogeneity, and often at the expense of increasing model complexity to improve prediction accuracy. Although there have been groundbreaking attempts in the field of spatio-temporal synchronous modeling, significant limitations remain in terms of performance and complexity control.This study proposes a quicker and more effective spatio-temporal synchronous traffic flow forecast model to address these issues.
△ Less
Submitted 1 January, 2025;
originally announced January 2025.
-
PowerMLP: An Efficient Version of KAN
Authors:
Ruichen Qiu,
Yibo Miao,
Shiwen Wang,
Lijia Yu,
Yifan Zhu,
Xiao-Shan Gao
Abstract:
The Kolmogorov-Arnold Network (KAN) is a new network architecture known for its high accuracy in several tasks such as function fitting and PDE solving. The superior expressive capability of KAN arises from the Kolmogorov-Arnold representation theorem and learnable spline functions. However, the computation of spline functions involves multiple iterations, which renders KAN significantly slower th…
▽ More
The Kolmogorov-Arnold Network (KAN) is a new network architecture known for its high accuracy in several tasks such as function fitting and PDE solving. The superior expressive capability of KAN arises from the Kolmogorov-Arnold representation theorem and learnable spline functions. However, the computation of spline functions involves multiple iterations, which renders KAN significantly slower than MLP, thereby increasing the cost associated with model training and deployment. The authors of KAN have also noted that ``the biggest bottleneck of KANs lies in its slow training. KANs are usually 10x slower than MLPs, given the same number of parameters.'' To address this issue, we propose a novel MLP-type neural network PowerMLP that employs simpler non-iterative spline function representation, offering approximately the same training time as MLP while theoretically demonstrating stronger expressive power than KAN. Furthermore, we compare the FLOPs of KAN and PowerMLP, quantifying the faster computation speed of PowerMLP. Our comprehensive experiments demonstrate that PowerMLP generally achieves higher accuracy and a training speed about 40 times faster than KAN in various tasks.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
UniEntrezDB: Large-scale Gene Ontology Annotation Dataset and Evaluation Benchmarks with Unified Entrez Gene Identifiers
Authors:
Yuwei Miao,
Yuzhi Guo,
Hehuan Ma,
Jingquan Yan,
Feng Jiang,
Weizhi An,
Jean Gao,
Junzhou Huang
Abstract:
Gene studies are crucial for fields such as protein structure prediction, drug discovery, and cancer genomics, yet they face challenges in fully utilizing the vast and diverse information available. Gene studies require clean, factual datasets to ensure reliable results. Ontology graphs, neatly organized domain terminology graphs, provide ideal sources for domain facts. However, available gene ont…
▽ More
Gene studies are crucial for fields such as protein structure prediction, drug discovery, and cancer genomics, yet they face challenges in fully utilizing the vast and diverse information available. Gene studies require clean, factual datasets to ensure reliable results. Ontology graphs, neatly organized domain terminology graphs, provide ideal sources for domain facts. However, available gene ontology annotations are currently distributed across various databases without unified identifiers for genes and gene products. To address these challenges, we introduce Unified Entrez Gene Identifier Dataset and Benchmarks (UniEntrezDB), the first systematic effort to unify large-scale public Gene Ontology Annotations (GOA) from various databases using unique gene identifiers. UniEntrezDB includes a pre-training dataset and four downstream tasks designed to comprehensively evaluate gene embedding performance from gene, protein, and cell levels, ultimately enhancing the reliability and applicability of LLMs in gene research and other professional settings.
△ Less
Submitted 17 December, 2024;
originally announced December 2024.
-
ExecRepoBench: Multi-level Executable Code Completion Evaluation
Authors:
Jian Yang,
Jiajun Zhang,
Jiaxi Yang,
Ke Jin,
Lei Zhang,
Qiyao Peng,
Ken Deng,
Yibo Miao,
Tianyu Liu,
Zeyu Cui,
Binyuan Hui,
Junyang Lin
Abstract:
Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce…
▽ More
Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark ExecRepoBench and the instruction corpora Repo-Instruct, aim at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. Plus, we present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units (e.g. statements, expressions, and functions). Then, we fine-tune the open-source LLM with 7B parameters on Repo-Instruct to produce a strong code completion baseline model Qwen2.5-Coder-Instruct-C based on the open-source model. Qwen2.5-Coder-Instruct-C is rigorously evaluated against existing benchmarks, including MultiPL-E and ExecRepoBench, which consistently outperforms prior baselines across all programming languages. The deployment of \ourmethod{} can be used as a high-performance, local service for programming development\footnote{\url{https://execrepobench.github.io/}}.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
Finite-PINN: A Physics-Informed Neural Network Architecture for Solving Solid Mechanics Problems with General Geometries
Authors:
Haolin Li,
Yuyang Miao,
Zahra Sharif Khodaei,
M. H. Aliabadi
Abstract:
PINN models have demonstrated impressive capabilities in addressing fluid PDE problems, and their potential in solid mechanics is beginning to emerge. This study identifies two key challenges when using PINN to solve general solid mechanics problems. These challenges become evident when comparing the limitations of PINN with the well-established numerical methods commonly used in solid mechanics,…
▽ More
PINN models have demonstrated impressive capabilities in addressing fluid PDE problems, and their potential in solid mechanics is beginning to emerge. This study identifies two key challenges when using PINN to solve general solid mechanics problems. These challenges become evident when comparing the limitations of PINN with the well-established numerical methods commonly used in solid mechanics, such as the finite element method (FEM). Specifically: a) PINN models generate solutions over an infinite domain, which conflicts with the finite boundaries typical of most solid structures; and b) the solution space utilised by PINN is Euclidean, which is inadequate for addressing the complex geometries often present in solid structures.
This work proposes a PINN architecture used for general solid mechanics problems, termed the Finite-PINN model. The proposed model aims to effectively address these two challenges while preserving as much of the original implementation of PINN as possible. The unique architecture of the Finite-PINN model addresses these challenges by separating the approximation of stress and displacement fields, and by transforming the solution space from the traditional Euclidean space to a Euclidean-topological joint space. Several case studies presented in this paper demonstrate that the Finite-PINN model provides satisfactory results for a variety of problem types, including both forward and inverse problems, in both 2D and 3D contexts. The developed Finite-PINN model offers a promising tool for addressing general solid mechanics problems, particularly those not yet well-explored in current research.
△ Less
Submitted 12 December, 2024;
originally announced December 2024.
-
Evaluating and Aligning CodeLLMs on Human Preference
Authors:
Jian Yang,
Jiaxi Yang,
Ke Jin,
Yibo Miao,
Lei Zhang,
Liqun Yang,
Zeyu Cui,
Yichang Zhang,
Binyuan Hui,
Junyang Lin
Abstract:
Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, the current code LLMs focus on synthesizing the correct code snippet, ignoring the align…
▽ More
Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, the current code LLMs focus on synthesizing the correct code snippet, ignoring the alignment with human preferences, where the query should be sampled from the practical application scenarios and the model-generated responses should satisfy the human preference. To bridge the gap between the model-generated response and human preference, we present a rigorous human-curated benchmark CodeArena to emulate the complexity and diversity of real-world coding tasks, where 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose a diverse synthetic instruction corpus SynCode-Instruct (nearly 20B tokens) by scaling instructions from the website to verify the effectiveness of the large-scale synthetic instruction fine-tuning, where Qwen2.5-SynCoder totally trained on synthetic instruction data can achieve top-tier performance of open-source code LLMs. The results find performance differences between execution-based benchmarks and CodeArena. Our systematic experiments of CodeArena on 40+ LLMs reveal a notable performance gap between open SOTA code LLMs (e.g. Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of the human preference alignment.\footnote{\url{https://codearenaeval.github.io/ }}
△ Less
Submitted 6 December, 2024;
originally announced December 2024.
-
From Principles to Practice: A Deep Dive into AI Ethics and Regulations
Authors:
Nan Sun,
Yuantian Miao,
Hao Jiang,
Ming Ding,
Jun Zhang
Abstract:
In the rapidly evolving domain of Artificial Intelligence (AI), the complex interaction between innovation and regulation has become an emerging focus of our society. Despite tremendous advancements in AI's capabilities to excel in specific tasks and contribute to diverse sectors, establishing a high degree of trust in AI-generated outputs and decisions necessitates meticulous caution and continuo…
▽ More
In the rapidly evolving domain of Artificial Intelligence (AI), the complex interaction between innovation and regulation has become an emerging focus of our society. Despite tremendous advancements in AI's capabilities to excel in specific tasks and contribute to diverse sectors, establishing a high degree of trust in AI-generated outputs and decisions necessitates meticulous caution and continuous oversight. A broad spectrum of stakeholders, including governmental bodies, private sector corporations, academic institutions, and individuals, have launched significant initiatives. These efforts include developing ethical guidelines for AI and engaging in vibrant discussions on AI ethics, both among AI practitioners and within the broader society. This article thoroughly analyzes the ground-breaking AI regulatory framework proposed by the European Union. It delves into the fundamental ethical principles of safety, transparency, non-discrimination, traceability, and environmental sustainability for AI developments and deployments. Considering the technical efforts and strategies undertaken by academics and industry to uphold these principles, we explore the synergies and conflicts among the five ethical principles. Through this lens, work presents a forward-looking perspective on the future of AI regulations, advocating for a harmonized approach that safeguards societal values while encouraging technological advancement.
△ Less
Submitted 6 February, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
Authors:
Da Ma,
Lu Chen,
Situo Zhang,
Yuxun Miao,
Su Zhu,
Zhi Chen,
Hongshen Xu,
Hanqi Li,
Shuai Fan,
Lei Pan,
Kai Yu
Abstract:
The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly regarding memory and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tok…
▽ More
The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly regarding memory and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation. In this paper, we propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them.We address two challenges: 1) investigating the distribution of important tokens in the context, discovering recent tokens are more important than distant tokens in context, and 2) optimizing resources for distant tokens by sharing attention scores across layers. The experiments show that our method saves $35\%$ KV cache without compromising the performance.
△ Less
Submitted 3 December, 2024;
originally announced December 2024.
-
Holistic Understanding of 3D Scenes as Universal Scene Description
Authors:
Anna-Maria Halacheva,
Yang Miao,
Jan-Nico Zaech,
Xi Wang,
Luc Van Gool,
Danda Pani Paudel
Abstract:
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task…
▽ More
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered by current works. In this work, we address this shortcoming and introduce (1) an expertly curated dataset in the Universal Scene Description (USD) format, featuring high-quality manual annotations, for instance, segmentation and articulation on 280 indoor scenes; (2) a learning-based model together with a novel baseline capable of predicting part segmentation along with a full specification of motion attributes, including motion type, articulated and interactable parts, and motion parameters; (3) a benchmark serving to compare upcoming methods for the task at hand. Overall, our dataset provides 8 types of annotations - object and part segmentations, motion types, movable and interactable parts, motion parameters, connectivity, and object mass annotations. With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models. All data is provided in the USD format, allowing interoperability and easy integration with downstream tasks. We provide open access to our dataset, benchmark, and method's source code.
△ Less
Submitted 2 December, 2024;
originally announced December 2024.
-
From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events
Authors:
Yan Miao,
Georgios Fainekos,
Bardh Hoxha,
Hideki Okamoto,
Danil Prokhorov,
Sayan Mitra
Abstract:
Testing Automated Driving Systems (ADS) in simulation with realistic driving scenarios is important for verifying their performance. However, converting real-world driving videos into simulation scenarios is a significant challenge due to the complexity of interpreting high-dimensional video data and the time-consuming nature of precise manual scenario reconstruction. In this work, we propose a no…
▽ More
Testing Automated Driving Systems (ADS) in simulation with realistic driving scenarios is important for verifying their performance. However, converting real-world driving videos into simulation scenarios is a significant challenge due to the complexity of interpreting high-dimensional video data and the time-consuming nature of precise manual scenario reconstruction. In this work, we propose a novel framework that automates the conversion of real-world car crash videos into detailed simulation scenarios for ADS testing. Our approach leverages prompt-engineered Video Language Models(VLM) to transform dashcam footage into SCENIC scripts, which define the environment and driving behaviors in the CARLA simulator, enabling the generation of realistic simulation scenarios. Importantly, rather than solely aiming for one-to-one scenario reconstruction, our framework focuses on capturing the essential driving behaviors from the original video while offering flexibility in parameters such as weather or road conditions to facilitate search-based testing. Additionally, we introduce a similarity metric that helps iteratively refine the generated scenario through feedback by comparing key features of driving behaviors between the real and simulated videos. Our preliminary results demonstrate substantial time efficiency, finishing the real-to-sim conversion in minutes with full automation and no human intervention, while maintaining high fidelity to the original driving events.
△ Less
Submitted 27 January, 2025; v1 submitted 24 November, 2024;
originally announced November 2024.
-
COST CA20120 INTERACT Framework of Artificial Intelligence Based Channel Modeling
Authors:
Ruisi He,
Nicola D. Cicco,
Bo Ai,
Mi Yang,
Yang Miao,
Mate Boban
Abstract:
Accurate channel models are the prerequisite for communication-theoretic investigations as well as system design. Channel modeling generally relies on statistical and deterministic approaches. However, there are still significant limits for the traditional modeling methods in terms of accuracy, generalization ability, and computational complexity. The fundamental reason is that establishing a quan…
▽ More
Accurate channel models are the prerequisite for communication-theoretic investigations as well as system design. Channel modeling generally relies on statistical and deterministic approaches. However, there are still significant limits for the traditional modeling methods in terms of accuracy, generalization ability, and computational complexity. The fundamental reason is that establishing a quantified and accurate mapping between physical environment and channel characteristics becomes increasing challenging for modern communication systems. Here, in the context of COST CA20120 Action, we evaluate and discuss the feasibility and implementation of using artificial intelligence (AI) for channel modeling, and explore where the future of this field lies. Firstly, we present a framework of AI-based channel modeling to characterize complex wireless channels. Then, we highlight in detail some major challenges and present the possible solutions: i) estimating the uncertainty of AI-based channel predictions, ii) integrating prior knowledge of propagation to improve generalization capabilities, and iii) interpretable AI for channel modeling. We present and discuss illustrative numerical results to showcase the capabilities of AI-based channel modeling.
△ Less
Submitted 31 October, 2024;
originally announced November 2024.
-
Audience Reach of Scientific Data Visualizations in Planetarium-Screened Films
Authors:
Kalina Borkiewicz,
Eric Jensen,
Yiwen Miao,
Stuart Levy,
J. P. Naiman,
Jeff Carpenter,
Katherine E. Isaacs
Abstract:
Quantifying the global reach of planetarium dome shows presents significant challenges due to the lack of standardized viewership tracking mechanisms across diverse planetarium venues. We present an analysis of the global impact of dome shows, presenting data regarding four documentary films from a single visualization lab. Specifically, we designed and administered a viewership survey of four lon…
▽ More
Quantifying the global reach of planetarium dome shows presents significant challenges due to the lack of standardized viewership tracking mechanisms across diverse planetarium venues. We present an analysis of the global impact of dome shows, presenting data regarding four documentary films from a single visualization lab. Specifically, we designed and administered a viewership survey of four long-running shows that contained cinematic scientific visualizations. Reported survey data shows that between 1.2 - 2.6 million people have viewed these four films across the 68 responding planetariums (mean: 1.9 million). When we include estimates and extrapolate for the 315 planetariums that licensed these shows, we arrive at an estimate of 16.5 - 24.1 million people having seen these films (mean: 20.3 million).
△ Less
Submitted 30 October, 2024;
originally announced November 2024.
-
Generalizability of Memorization Neural Networks
Authors:
Lijia Yu,
Xiao-Shan Gao,
Lijun Zhang,
Yibo Miao
Abstract:
The neural network memorization problem is to study the expressive power of neural networks to interpolate a finite dataset. Although memorization is widely believed to have a close relationship with the strong generalizability of deep learning when using over-parameterized models, to the best of our knowledge, there exists no theoretical study on the generalizability of memorization neural networ…
▽ More
The neural network memorization problem is to study the expressive power of neural networks to interpolate a finite dataset. Although memorization is widely believed to have a close relationship with the strong generalizability of deep learning when using over-parameterized models, to the best of our knowledge, there exists no theoretical study on the generalizability of memorization neural networks. In this paper, we give the first theoretical analysis of this topic. Since using i.i.d. training data is a necessary condition for a learning algorithm to be generalizable, memorization and its generalization theory for i.i.d. datasets are developed under mild conditions on the data distribution. First, algorithms are given to construct memorization networks for an i.i.d. dataset, which have the smallest number of parameters and even a constant number of parameters. Second, we show that, in order for the memorization networks to be generalizable, the width of the network must be at least equal to the dimension of the data, which implies that the existing memorization networks with an optimal number of parameters are not generalizable. Third, a lower bound for the sample complexity of general memorization algorithms and the exact sample complexity for memorization algorithms with constant number of parameters are given. It is also shown that there exist data distributions such that, to be generalizable for them, the memorization network must have an exponential number of parameters in the data dimension. Finally, an efficient and generalizable memorization algorithm is given when the number of training samples is greater than the efficient memorization sample complexity of the data distribution.
△ Less
Submitted 1 November, 2024;
originally announced November 2024.
-
Aligning CodeLLMs with Direct Preference Optimization
Authors:
Yibo Miao,
Bofei Gao,
Shanghaoran Quan,
Junyang Lin,
Daoguang Zan,
Jiaheng Liu,
Jian Yang,
Tianyu Liu,
Zhijie Deng
Abstract:
The last year has witnessed the rapid progress of large language models (LLMs) across diverse domains. Among them, CodeLLMs have garnered particular attention because they can not only assist in completing various programming tasks but also represent the decision-making and logical reasoning capabilities of LLMs. However, current CodeLLMs mainly focus on pre-training and supervised fine-tuning sce…
▽ More
The last year has witnessed the rapid progress of large language models (LLMs) across diverse domains. Among them, CodeLLMs have garnered particular attention because they can not only assist in completing various programming tasks but also represent the decision-making and logical reasoning capabilities of LLMs. However, current CodeLLMs mainly focus on pre-training and supervised fine-tuning scenarios, leaving the alignment stage, which is important for post-training LLMs, under-explored. This work first identifies that the commonly used PPO algorithm may be suboptimal for the alignment of CodeLLM because the involved reward rules are routinely coarse-grained and potentially flawed. We then advocate addressing this using the DPO algorithm. Based on only preference data pairs, DPO can render the model rank data automatically, giving rise to a fine-grained rewarding pattern more robust than human intervention. We also contribute a pipeline for collecting preference pairs for DPO on CodeLLMs. Studies show that our method significantly improves the performance of existing CodeLLMs on benchmarks such as MBPP and HumanEval.
△ Less
Submitted 24 October, 2024;
originally announced October 2024.
-
Predicting from Strings: Language Model Embeddings for Bayesian Optimization
Authors:
Tung Nguyen,
Qiuyi Zhang,
Bangding Yang,
Chansoo Lee,
Jorg Bornschein,
Yingjie Miao,
Sagi Perel,
Yutian Chen,
Xingyou Song
Abstract:
Bayesian Optimization is ubiquitous in the field of experimental design and blackbox optimization for improving search efficiency, but has been traditionally restricted to regression models which are only applicable to fixed search spaces and tabular input features. We propose Embed-then-Regress, a paradigm for applying in-context regression over string inputs, through the use of string embedding…
▽ More
Bayesian Optimization is ubiquitous in the field of experimental design and blackbox optimization for improving search efficiency, but has been traditionally restricted to regression models which are only applicable to fixed search spaces and tabular input features. We propose Embed-then-Regress, a paradigm for applying in-context regression over string inputs, through the use of string embedding capabilities of pretrained language models. By expressing all inputs as strings, we are able to perform general-purpose regression for Bayesian Optimization over various domains including synthetic, combinatorial, and hyperparameter optimization, obtaining comparable results to state-of-the-art Gaussian Process-based algorithms. Code can be found at https://github.com/google-research/optformer/tree/main/optformer/embed_then_regress.
△ Less
Submitted 15 October, 2024; v1 submitted 14 October, 2024;
originally announced October 2024.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Authors:
Bofei Gao,
Feifan Song,
Zhe Yang,
Zefan Cai,
Yibo Miao,
Qingxiu Dong,
Lei Li,
Chenghao Ma,
Liang Chen,
Runxin Xu,
Zhengyang Tang,
Benyou Wang,
Daoguang Zan,
Shanghaoran Quan,
Ge Zhang,
Lei Sha,
Yichang Zhang,
Xuancheng Ren,
Tianyu Liu,
Baobao Chang
Abstract:
Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8\% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benc…
▽ More
Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8\% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54\% and 52.55\% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.
△ Less
Submitted 23 December, 2024; v1 submitted 10 October, 2024;
originally announced October 2024.
-
RespDiff: An End-to-End Multi-scale RNN Diffusion Model for Respiratory Waveform Estimation from PPG Signals
Authors:
Yuyang Miao,
Zehua Chen,
Chang Li,
Danilo Mandic
Abstract:
Respiratory rate (RR) is a critical health indicator often monitored under inconvenient scenarios, limiting its practicality for continuous monitoring. Photoplethysmography (PPG) sensors, increasingly integrated into wearable devices, offer a chance to continuously estimate RR in a portable manner. In this paper, we propose RespDiff, an end-to-end multi-scale RNN diffusion model for respiratory wa…
▽ More
Respiratory rate (RR) is a critical health indicator often monitored under inconvenient scenarios, limiting its practicality for continuous monitoring. Photoplethysmography (PPG) sensors, increasingly integrated into wearable devices, offer a chance to continuously estimate RR in a portable manner. In this paper, we propose RespDiff, an end-to-end multi-scale RNN diffusion model for respiratory waveform estimation from PPG signals. RespDiff does not require hand-crafted features or the exclusion of low-quality signal segments, making it suitable for real-world scenarios. The model employs multi-scale encoders, to extract features at different resolutions, and a bidirectional RNN to process PPG signals and extract respiratory waveform. Additionally, a spectral loss term is introduced to optimize the model further. Experiments conducted on the BIDMC dataset demonstrate that RespDiff outperforms notable previous works, achieving a mean absolute error (MAE) of 1.18 bpm for RR estimation while others range from 1.66 to 2.15 bpm, showing its potential for robust and accurate respiratory monitoring in real-world applications.
△ Less
Submitted 6 October, 2024;
originally announced October 2024.
-
RL-GSBridge: 3D Gaussian Splatting Based Real2Sim2Real Method for Robotic Manipulation Learning
Authors:
Yuxuan Wu,
Lei Pan,
Wenhua Wu,
Guangming Wang,
Yanzi Miao,
Fan Xu,
Hesheng Wang
Abstract:
Sim-to-Real refers to the process of transferring policies learned in simulation to the real world, which is crucial for achieving practical robotics applications. However, recent Sim2real methods either rely on a large amount of augmented data or large learning models, which is inefficient for specific tasks. In recent years, with the emergence of radiance field reconstruction methods, especially…
▽ More
Sim-to-Real refers to the process of transferring policies learned in simulation to the real world, which is crucial for achieving practical robotics applications. However, recent Sim2real methods either rely on a large amount of augmented data or large learning models, which is inefficient for specific tasks. In recent years, with the emergence of radiance field reconstruction methods, especially 3D Gaussian splatting, it has become possible to construct realistic real-world scenes. To this end, we propose RL-GSBridge, a novel real-to-sim-to-real framework which incorporates 3D Gaussian Splatting into the conventional RL simulation pipeline, enabling zero-shot sim-to-real transfer for vision-based deep reinforcement learning. We introduce a mesh-based 3D GS method with soft binding constraints, enhancing the rendering quality of mesh models. Then utilizing a GS editing approach to synchronize the rendering with the physics simulator, RL-GSBridge could reflect the visual interactions of the physical robot accurately. Through a series of sim-to-real experiments, including grasping and pick-and-place tasks, we demonstrate that RL-GSBridge maintains a satisfactory success rate in real-world task completion during sim-to-real transfer. Furthermore, a series of rendering metrics and visualization results indicate that our proposed mesh-based 3D GS reduces artifacts in unstructured objects, demonstrating more realistic rendering performance.
△ Less
Submitted 22 February, 2025; v1 submitted 30 September, 2024;
originally announced September 2024.
-
Qwen2.5-Coder Technical Report
Authors:
Binyuan Hui,
Jian Yang,
Zeyu Cui,
Jiaxi Yang,
Dayiheng Liu,
Lei Zhang,
Tianyu Liu,
Jiajun Zhang,
Bowen Yu,
Keming Lu,
Kai Dang,
Yang Fan,
Yichang Zhang,
An Yang,
Rui Men,
Fei Huang,
Bo Zheng,
Yibo Miao,
Shanghaoran Quan,
Yunlong Feng,
Xingzhang Ren,
Xuancheng Ren,
Jingren Zhou,
Junyang Lin
Abstract:
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data genera…
▽ More
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.
△ Less
Submitted 12 November, 2024; v1 submitted 18 September, 2024;
originally announced September 2024.
-
Ground-roll Separation From Land Seismic Records Based on Convolutional Neural Network
Authors:
Zhuang Jia,
Wenkai Lu,
Meng Zhang,
Yongkang Miao
Abstract:
Ground-roll wave is a common coherent noise in land field seismic data. This Rayleigh-type surface wave usually has low frequency, low apparent velocity, and high amplitude, therefore obscures the reflection events of seismic shot gathers. Commonly used techniques focus on the differences of ground-roll and reflection in transformed domain such as $f-k$ domain, wavelet domain, or curvelet domain.…
▽ More
Ground-roll wave is a common coherent noise in land field seismic data. This Rayleigh-type surface wave usually has low frequency, low apparent velocity, and high amplitude, therefore obscures the reflection events of seismic shot gathers. Commonly used techniques focus on the differences of ground-roll and reflection in transformed domain such as $f-k$ domain, wavelet domain, or curvelet domain. These approaches use a series of fixed atoms or bases to transform the data in time-space domain into transformed domain to separate different waveforms, thus tend to suffer from the complexity for a delicate design of the parameters of the transform domain filter. To deal with these problems, a novel way is proposed to separate ground-roll from reflections using convolutional neural network (CNN) model based method to learn to extract the features of ground-roll and reflections automatically based on training data. In the proposed method, low-pass filtered seismic data which is contaminated by ground-roll wave is used as input of CNN, and then outputs both ground-roll component and low-frequency part of reflection component simultaneously. Discriminative loss is applied together with similarity loss in the training process to enhance the similarity to their train labels as well as the difference between the two outputs. Experiments are conducted on both synthetic and real data, showing that CNN based method can separate ground roll from reflections effectively, and has generalization ability to a certain extent.
△ Less
Submitted 5 September, 2024;
originally announced September 2024.
-
VQ-DeepVSC: A Dual-Stage Vector Quantization Framework for Video Semantic Communication
Authors:
Yongyi Miao,
Zhongdang Li,
Yang Wang,
Die Hu,
Jun Yan,
Youfang Wang
Abstract:
In response to the rapid growth of global videomtraffic and the limitations of traditional wireless transmission systems, we propose a novel dual-stage vector quantization framework, VQ-DeepVSC, tailored to enhance video transmission over wireless channels. In the first stage, we design the adaptive keyframe extractor and interpolator, deployed respectively at the transmitter and receiver, which i…
▽ More
In response to the rapid growth of global videomtraffic and the limitations of traditional wireless transmission systems, we propose a novel dual-stage vector quantization framework, VQ-DeepVSC, tailored to enhance video transmission over wireless channels. In the first stage, we design the adaptive keyframe extractor and interpolator, deployed respectively at the transmitter and receiver, which intelligently select key frames to minimize inter-frame redundancy and mitigate the cliff-effect under challenging channel conditions. In the second stage, we propose the semantic vector quantization encoder and decoder, placed respectively at the transmitter and receiver, which efficiently compress key frames using advanced indexing and spatial normalization modules to reduce redundancy. Additionally, we propose adjustable index selection and recovery modules, enhancing compression efficiency and enabling flexible compression ratio adjustment. Compared to the joint source-channel coding (JSCC) framework, the proposed framework exhibits superior compatibility with current digital communication systems. Experimental results demonstrate that VQ-DeepVSC achieves substantial improvements in both Multi-Scale Structural Similarity (MS-SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) metrics than the H.265 standard, particularly under low channel signal-to-noise ratio (SNR) or multi-path channels, highlighting the significantly enhanced transmission capabilities of our approach.
△ Less
Submitted 5 September, 2024;
originally announced September 2024.
-
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Authors:
Bofei Gao,
Feifan Song,
Yibo Miao,
Zefan Cai,
Zhe Yang,
Liang Chen,
Helan Hu,
Runxin Xu,
Qingxiu Dong,
Ce Zheng,
Shanghaoran Quan,
Wen Xiao,
Ge Zhang,
Daoguang Zan,
Keming Lu,
Bowen Yu,
Dayiheng Liu,
Zeyu Cui,
Jian Yang,
Lei Sha,
Houfeng Wang,
Zhifang Sui,
Peiyi Wang,
Tianyu Liu,
Baobao Chang
Abstract:
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to unde…
▽ More
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.
△ Less
Submitted 31 October, 2024; v1 submitted 4 September, 2024;
originally announced September 2024.
-
ProphetFuzz: Fully Automated Prediction and Fuzzing of High-Risk Option Combinations with Only Documentation via Large Language Model
Authors:
Dawei Wang,
Geng Zhou,
Li Chen,
Dan Li,
Yukai Miao
Abstract:
Vulnerabilities related to option combinations pose a significant challenge in software security testing due to their vast search space. Previous research primarily addressed this challenge through mutation or filtering techniques, which inefficiently treated all option combinations as having equal potential for vulnerabilities, thus wasting considerable time on non-vulnerable targets and resultin…
▽ More
Vulnerabilities related to option combinations pose a significant challenge in software security testing due to their vast search space. Previous research primarily addressed this challenge through mutation or filtering techniques, which inefficiently treated all option combinations as having equal potential for vulnerabilities, thus wasting considerable time on non-vulnerable targets and resulting in low testing efficiency. In this paper, we utilize carefully designed prompt engineering to drive the large language model (LLM) to predict high-risk option combinations (i.e., more likely to contain vulnerabilities) and perform fuzz testing automatically without human intervention. We developed a tool called ProphetFuzz and evaluated it on a dataset comprising 52 programs collected from three related studies. The entire experiment consumed 10.44 CPU years. ProphetFuzz successfully predicted 1748 high-risk option combinations at an average cost of only \$8.69 per program. Results show that after 72 hours of fuzzing, ProphetFuzz discovered 364 unique vulnerabilities associated with 12.30\% of the predicted high-risk option combinations, which was 32.85\% higher than that found by state-of-the-art in the same timeframe. Additionally, using ProphetFuzz, we conducted persistent fuzzing on the latest versions of these programs, uncovering 140 vulnerabilities, with 93 confirmed by developers and 21 awarded CVE numbers.
△ Less
Submitted 1 September, 2024;
originally announced September 2024.
-
Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
Authors:
Yanting Miao,
William Loh,
Suraj Kothawade,
Pascal Poupart,
Abdullah Rashwan,
Yeqing Li
Abstract:
Text-to-image generative models have recently attracted considerable interest, enabling the synthesis of high-quality images from textual prompts. However, these models often lack the capability to generate specific subjects from given reference images or to synthesize novel renditions under varying conditions. Methods like DreamBooth and Subject-driven Text-to-Image (SuTI) have made significant p…
▽ More
Text-to-image generative models have recently attracted considerable interest, enabling the synthesis of high-quality images from textual prompts. However, these models often lack the capability to generate specific subjects from given reference images or to synthesize novel renditions under varying conditions. Methods like DreamBooth and Subject-driven Text-to-Image (SuTI) have made significant progress in this area. Yet, both approaches primarily focus on enhancing similarity to reference images and require expensive setups, often overlooking the need for efficient training and avoiding overfitting to the reference images. In this work, we present the $λ$-Harmonic reward function, which provides a reliable reward signal and enables early stopping for faster training and effective regularization. By combining the Bradley-Terry preference model, the $λ$-Harmonic reward function also provides preference labels for subject-driven generation tasks. We propose Reward Preference Optimization (RPO), which offers a simpler setup (requiring only $3\%$ of the negative samples used by DreamBooth) and fewer gradient steps for fine-tuning. Unlike most existing methods, our approach does not require training a text encoder or optimizing text embeddings and achieves text-image alignment by fine-tuning only the U-Net component. Empirically, $λ$-Harmonic proves to be a reliable approach for model selection in subject-driven generation tasks. Based on preference labels and early stopping validation from the $λ$-Harmonic reward function, our algorithm achieves a state-of-the-art CLIP-I score of 0.833 and a CLIP-T score of 0.314 on DreamBench.
△ Less
Submitted 22 December, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Toward Availability Attacks in 3D Point Clouds
Authors:
Yifan Zhu,
Yibo Miao,
Yinpeng Dong,
Xiao-Shan Gao
Abstract:
Despite the great progress of 3D vision, data privacy and security issues in 3D deep learning are not explored systematically. In the domain of 2D images, many availability attacks have been proposed to prevent data from being illicitly learned by unauthorized deep models. However, unlike images represented on a fixed dimensional grid, point clouds are characterized as unordered and unstructured s…
▽ More
Despite the great progress of 3D vision, data privacy and security issues in 3D deep learning are not explored systematically. In the domain of 2D images, many availability attacks have been proposed to prevent data from being illicitly learned by unauthorized deep models. However, unlike images represented on a fixed dimensional grid, point clouds are characterized as unordered and unstructured sets, posing a significant challenge in designing an effective availability attack for 3D deep learning. In this paper, we theoretically show that extending 2D availability attacks directly to 3D point clouds under distance regularization is susceptible to the degeneracy, rendering the generated poisons weaker or even ineffective. This is because in bi-level optimization, introducing regularization term can result in update directions out of control. To address this issue, we propose a novel Feature Collision Error-Minimization (FC-EM) method, which creates additional shortcuts in the feature space, inducing different update directions to prevent the degeneracy of bi-level optimization. Moreover, we provide a theoretical analysis that demonstrates the effectiveness of the FC-EM attack. Extensive experiments on typical point cloud datasets, 3D intracranial aneurysm medical dataset, and 3D face dataset verify the superiority and practicality of our approach. Code is available at https://github.com/hala64/fc-em.
△ Less
Submitted 26 June, 2024;
originally announced July 2024.