-
Generative 4D Scene Gaussian Splatting with Object View-Synthesis Priors
Authors:
Wen-Hsuan Chu,
Lei Ke,
Jianmeng Liu,
Mingxiao Huo,
Pavel Tokmakov,
Katerina Fragkiadaki
Abstract:
We tackle the challenge of generating dynamic 4D scenes from monocular, multi-object videos with heavy occlusions, and introduce GenMOJO, a novel approach that integrates rendering-based deformable 3D Gaussian optimization with generative priors for view synthesis. While existing models perform well on novel view synthesis for isolated objects, they struggle to generalize to complex, cluttered sce…
▽ More
We tackle the challenge of generating dynamic 4D scenes from monocular, multi-object videos with heavy occlusions, and introduce GenMOJO, a novel approach that integrates rendering-based deformable 3D Gaussian optimization with generative priors for view synthesis. While existing models perform well on novel view synthesis for isolated objects, they struggle to generalize to complex, cluttered scenes. To address this, GenMOJO decomposes the scene into individual objects, optimizing a differentiable set of deformable Gaussians per object. This object-wise decomposition allows leveraging object-centric diffusion models to infer unobserved regions in novel viewpoints. It performs joint Gaussian splatting to render the full scene, capturing cross-object occlusions, and enabling occlusion-aware supervision. To bridge the gap between object-centric priors and the global frame-centric coordinate system of videos, GenMOJO uses differentiable transformations that align generative and rendering constraints within a unified framework. The resulting model generates 4D object reconstructions over space and time, and produces accurate 2D and 3D point tracks from monocular input. Quantitative evaluations and perceptual human studies confirm that GenMOJO generates more realistic novel views of scenes and produces more accurate point tracks compared to existing approaches.
△ Less
Submitted 15 June, 2025;
originally announced June 2025.
-
Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning
Authors:
Qiang Zhu,
Kuan Lu,
Menghao Huo,
Yuxiao Li
Abstract:
Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of t…
▽ More
Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of transformers. To guide the translation process, we condition the model on image embeddings extracted from a pre-trained CLIP encoder, allowing for fine-grained and structurally consistent translations without relying on text or class labels. We incorporate both a CLIP similarity loss to enforce semantic consistency and an LPIPS perceptual loss to enhance visual fidelity during training. We validate our approach on two benchmark datasets: face2comics, which translates real human faces to comic-style illustrations, and edges2shoes, which translates edge maps to realistic shoe images. Experimental results demonstrate that DiT, combined with CLIP-based conditioning and perceptual similarity objectives, achieves high-quality, semantically faithful translations, offering a promising alternative to GAN-based models for paired image-to-image translation tasks.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
lmgame-Bench: How Good are LLMs at Playing Games?
Authors:
Lanxiang Hu,
Mingjia Huo,
Yuxuan Zhang,
Haoyang Yu,
Eric P. Xing,
Ion Stoica,
Tajana Rosing,
Haojian Jin,
Hao Zhang
Abstract:
Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential…
▽ More
Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
△ Less
Submitted 3 June, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework
Authors:
Xinyuan Song,
Yangfan He,
Sida Li,
Jianhui Wang,
Hongyang He,
Xinhang Yuan,
Ruoyu Wang,
Jiaqi Chen,
Keqin Li,
Kuan Lu,
Menghao Huo,
Binxu Li,
Pei Liu
Abstract:
Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-…
▽ More
Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-specific tokens are particularly effective in preserving continuity across frames at low training cost. In this work, we want to provide a general theoretical framework for adapters that maintain frame consistency in DDIM-based models under a temporal consistency loss. First, we prove that the temporal consistency objective is differentiable under bounded feature norms, and we establish a Lipschitz bound on its gradient. Second, we show that gradient descent on this objective decreases the loss monotonically and converges to a local minimum if the learning rate is within an appropriate range. Finally, we analyze the stability of modules in the DDIM inversion procedure, showing that the associated error remains controlled. These theoretical findings will reinforce the reliability of diffusion-based video editing methods that rely on adapter strategies and provide theoretical insights in video generation tasks.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Twin Co-Adaptive Dialogue for Progressive Image Generation
Authors:
Jianhui Wang,
Yangfan He,
Yan Zhong,
Xinyuan Song,
Jiayi Su,
Yuheng Feng,
Hongyang He,
Wenyu Zhu,
Xinhang Yuan,
Kuan Lu,
Menghao Huo,
Miao Zhang,
Keqin Li,
Jiaqi Chen,
Tianyu Shi,
Xueqian Wang
Abstract:
Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, i…
▽ More
Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Enhancing Customer Contact Efficiency with Graph Neural Networks in Credit Card Fraud Detection Workflow
Authors:
Menghao Huo,
Kuan Lu,
Qiang Zhu,
Zhenrui Chen
Abstract:
Credit card fraud has been a persistent issue since the last century, causing significant financial losses to the industry. The most effective way to prevent fraud is by contacting customers to verify suspicious transactions. However, while these systems are designed to detect fraudulent activity, they often mistakenly flag legitimate transactions, leading to unnecessary declines that disrupt the…
▽ More
Credit card fraud has been a persistent issue since the last century, causing significant financial losses to the industry. The most effective way to prevent fraud is by contacting customers to verify suspicious transactions. However, while these systems are designed to detect fraudulent activity, they often mistakenly flag legitimate transactions, leading to unnecessary declines that disrupt the user experience and erode customer trust. Frequent false positives can frustrate customers, resulting in dissatisfaction, increased complaints, and a diminished sense of security. To address these limitations, we propose a fraud detection framework incorporating Relational Graph Convolutional Networks (RGCN) to enhance the accuracy and efficiency of identifying fraudulent transactions. By leveraging the relational structure of transaction data, our model reduces the need for direct customer confirmation while maintaining high detection performance. Our experiments are conducted using the IBM credit card transaction dataset to evaluate the effectiveness of this approach.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
SCORE: Story Coherence and Retrieval Enhancement for AI Narratives
Authors:
Qiang Yi,
Yangfan He,
Jianhui Wang,
Xinyuan Song,
Shiyao Qian,
Xinhang Yuan,
Li Sun,
Yi Xin,
Jingqun Tang,
Keqin Li,
Kuan Lu,
Menghao Huo,
Jiaqi Chen,
Tianyu Shi
Abstract:
Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating…
▽ More
Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating episode summaries, SCORE uses a Retrieval-Augmented Generation (RAG) approach, incorporating TF-IDF and cosine similarity to identify related episodes and enhance the overall story structure. Results from testing multiple LLM-generated stories demonstrate that SCORE significantly improves the consistency and stability of narrative coherence compared to baseline GPT models, providing a more robust method for evaluating and refining AI-generated narratives.
△ Less
Submitted 12 June, 2025; v1 submitted 30 March, 2025;
originally announced March 2025.
-
ConQuer: A Framework for Concept-Based Quiz Generation
Authors:
Yicheng Fu,
Zikui Wang,
Liuxin Yang,
Meiqing Huo,
Zhongdongming Dai
Abstract:
Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated qu…
▽ More
Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experiment results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework. Code available at https://github.com/sofyc/ConQuer.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
Multi-Cali Anything: Dense Feature Multi-Frame Structure-from-Motion for Large-Scale Camera Array Calibration
Authors:
Jinjiang You,
Hewei Wang,
Yijie Li,
Mingxiao Huo,
Long Van Tran Ha,
Mingyuan Ma,
Jinfeng Xu,
Puzhen Wu,
Shubham Garg,
Wei Pu
Abstract:
Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration metho…
▽ More
Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the necessity for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes, and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything
△ Less
Submitted 2 March, 2025;
originally announced March 2025.
-
Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy
Authors:
Yangfan He,
Jianhui Wang,
Yijin Wang,
Kun Li,
Yan Zhong,
Xinyuan Song,
Li Sun,
Jingyuan Lu,
Jingqun Tang,
Miao Zhang,
Tianyu Shi,
Xinhang Yuan,
Yi Xin,
Kuan Lu,
Menghao Huo,
Keqin Li,
Jiaqi Chen
Abstract:
Today's image generation systems are capable of producing realistic and high-quality images. However, user prompts often contain ambiguities, making it difficult for these systems to interpret users' actual intentions. Consequently, many users must modify their prompts several times to ensure the generated images meet their expectations. While some methods focus on enhancing prompts to make the ge…
▽ More
Today's image generation systems are capable of producing realistic and high-quality images. However, user prompts often contain ambiguities, making it difficult for these systems to interpret users' actual intentions. Consequently, many users must modify their prompts several times to ensure the generated images meet their expectations. While some methods focus on enhancing prompts to make the generated images fit user needs, the model is still hard to understand users' real needs, especially for non-expert users. In this research, we aim to enhance the visual parameter-tuning process, making the model user-friendly for individuals without specialized knowledge and better understand user needs. We propose a human-machine co-adaption strategy using mutual information between the user's prompts and the pictures under modification as the optimizing target to make the system better adapt to user needs. We find that an improved model can reduce the necessity for multiple rounds of adjustments. We also collect multi-round dialogue datasets with prompts and images pairs and user intent. Various experiments demonstrate the effectiveness of the proposed method in our proposed dataset. Our annotation tools and several examples of our dataset are available at https://zenodo.org/records/14876029 for easier review. We will make open source our full dataset and code.
△ Less
Submitted 12 June, 2025; v1 submitted 25 January, 2025;
originally announced January 2025.
-
Beyond Speaker Identity: Text Guided Target Speech Extraction
Authors:
Mingyue Huo,
Abhinav Jain,
Cong Phuoc Huynh,
Fanjie Kong,
Pichao Wang,
Zhu Liu,
Vimal Bhat
Abstract:
Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity like enrollment audio, face images, or videos, which may not always be available. In this paper, we propose a text-guided TSE model StyleTSE that uses natural language descriptions of speaking style in addition to the audio clue to extract the desired speech from a given mixture. Our model integrates…
▽ More
Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity like enrollment audio, face images, or videos, which may not always be available. In this paper, we propose a text-guided TSE model StyleTSE that uses natural language descriptions of speaking style in addition to the audio clue to extract the desired speech from a given mixture. Our model integrates a speech separation network adapted from SepFormer with a bi-modality clue network that flexibly processes both audio and text clues. To train and evaluate our model, we introduce a new dataset TextrolMix with speech mixtures and natural language descriptions. Experimental results demonstrate that our method effectively separates speech based not only on who is speaking, but also on how they are speaking, enhancing TSE in scenarios where traditional audio clues are absent. Demos are at: https://mingyue66.github.io/TextrolMix/demo/
△ Less
Submitted 15 January, 2025;
originally announced January 2025.
-
CT-PatchTST: Channel-Time Patch Time-Series Transformer for Long-Term Renewable Energy Forecasting
Authors:
Menghao Huo,
Kuan Lu,
Yuxiao Li,
Qiang Zhu,
Zhenrui Chen
Abstract:
Accurately predicting renewable energy output is crucial for the efficient integration of solar and wind power into modern energy systems. This study develops and evaluates an advanced deep learning model, Channel-Time Patch Time-Series Transformer (CT-PatchTST), to forecast the power output of photovoltaic and wind energy systems using annual offshore wind power, onshore wind power, and solar pow…
▽ More
Accurately predicting renewable energy output is crucial for the efficient integration of solar and wind power into modern energy systems. This study develops and evaluates an advanced deep learning model, Channel-Time Patch Time-Series Transformer (CT-PatchTST), to forecast the power output of photovoltaic and wind energy systems using annual offshore wind power, onshore wind power, and solar power generation data from Denmark. While the original Patch Time-Series Transformer(PatchTST) model employs a channel-independent (CI) approach, it tends to overlook inter-channel relationships during training, potentially leading to a loss of critical information. To address this limitation and further leverage the benefits of increased data granularity brought by CI, we propose CT-PatchTST. This enhanced model improves the processing of inter-channel information while maintaining the advantages of the channel-independent approach. The predictive performance of CT-PatchTST is rigorously analyzed, demonstrating its ability to provide precise and reliable energy forecasts. This work contributes to improving the predictability of renewable energy systems, supporting their broader adoption and integration into energy grids.
△ Less
Submitted 18 May, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion
Authors:
Yangfan He,
Sida Li,
Jianhui Wang,
Kun Li,
Xinyuan Song,
Xinhang Yuan,
Keqin Li,
Kuan Lu,
Menghao Huo,
Jingqun Tang,
Yi Xin,
Jiaqi Chen,
Miao Zhang,
Xueqian Wang
Abstract:
Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame-independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-ba…
▽ More
Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame-independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal-spatial and semantic consistency with Baliteral DDIM inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions via temporally-aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; and (3) Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment using shared prompt tokens and frame-specific tokens. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset. Additionally, it achieves enhanced fidelity and frame-to-frame coherence, offering a practical solution for T2V editing.
△ Less
Submitted 11 June, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.
-
Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey
Authors:
Junqiao Wang,
Zeng Zhang,
Yangfan He,
Zihao Zhang,
Yuyang Song,
Tianyu Shi,
Yuchen Li,
Hengyuan Xu,
Kunyu Wu,
Xin Yi,
Zhongwei Wan,
Xinhang Yuan,
Kuan Lu,
Menghao Huo,
Tang Jingqun,
Guangwu Qian,
Keqin Li,
Qiuwu Chen,
Lewei He
Abstract:
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing large language models (LLMs) in code generation and optimization. This survey systematically reviews RL-driven techniques across the code development lifecycle, from compiler-level optimizations and resource allocation strategies to end-to-end code synthesis frameworks. We first examine classical and modern RL algorithms…
▽ More
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing large language models (LLMs) in code generation and optimization. This survey systematically reviews RL-driven techniques across the code development lifecycle, from compiler-level optimizations and resource allocation strategies to end-to-end code synthesis frameworks. We first examine classical and modern RL algorithms -- spanning policy gradients, actor-critic methods, human-feedback alignment, and preference-based optimization -- and their adaptations to the unique challenges of code generation, such as sparse and delayed rewards. Next, we analyze key benchmarks, datasets, and evaluation metrics that drive progress in RL-augmented Code LLMs. Finally, we identify open problems, including the need for richer feedback sources, support for low-level and domain-specific languages, and methods to reduce computational overhead. By consolidating current insights and outlining future directions, this work aims to guide researchers and practitioners in leveraging RL to produce more robust, efficient, and human-aligned code generation systems.
△ Less
Submitted 11 June, 2025; v1 submitted 29 December, 2024;
originally announced December 2024.
-
Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning
Authors:
Yixiao Wang,
Yifei Zhang,
Mingxiao Huo,
Ran Tian,
Xiang Zhang,
Yichen Xie,
Chenfeng Xu,
Pengliang Ji,
Wei Zhan,
Mingyu Ding,
Masayoshi Tomizuka
Abstract:
The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). B…
▽ More
The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulations and real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and codes can be found in https://forrest-110.github.io/sparse_diffusion_policy/.
△ Less
Submitted 24 October, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
Authors:
Yuheng Zhang,
Dian Yu,
Baolin Peng,
Linfeng Song,
Ye Tian,
Mingyue Huo,
Nan Jiang,
Haitao Mi,
Dong Yu
Abstract:
Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-th…
▽ More
Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
△ Less
Submitted 2 March, 2025; v1 submitted 30 June, 2024;
originally announced July 2024.
-
Composition Vision-Language Understanding via Segment and Depth Anything Model
Authors:
Mingxiao Huo,
Pengliang Ji,
Haotian Lin,
Junchen Liu,
Yixiao Wang,
Yijun Chen
Abstract:
We introduce a pioneering unified library that leverages depth anything, segment anything models to augment neural comprehension in language-vision model zero-shot understanding. This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V, enhancing multimodal tasks such as vision-question-answering (VQA) and composition reasoning. Through t…
▽ More
We introduce a pioneering unified library that leverages depth anything, segment anything models to augment neural comprehension in language-vision model zero-shot understanding. This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V, enhancing multimodal tasks such as vision-question-answering (VQA) and composition reasoning. Through the fusion of segmentation and depth analysis at the symbolic instance level, our library provides nuanced inputs for language models, significantly advancing image interpretation. Validated across a spectrum of in-the-wild real-world images, our findings showcase progress in vision-language models through neural-symbolic integration. This novel approach melds visual and language analysis in an unprecedented manner. Overall, our library opens new directions for future research aimed at decoding the complexities of the real world through advanced multimodal technologies and our code is available at \url{https://github.com/AnthonyHuo/SAM-DAM-for-Compositional-Reasoning}.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Joint Pedestrian Trajectory Prediction through Posterior Sampling
Authors:
Haotian Lin,
Yixiao Wang,
Mingxiao Huo,
Chensheng Peng,
Zhiyuan Liu,
Masayoshi Tomizuka
Abstract:
Joint pedestrian trajectory prediction has long grappled with the inherent unpredictability of human behaviors. Recent investigations employing variants of conditional diffusion models in trajectory prediction have exhibited notable success. Nevertheless, the heavy dependence on accurate historical data results in their vulnerability to noise disturbances and data incompleteness. To improve the ro…
▽ More
Joint pedestrian trajectory prediction has long grappled with the inherent unpredictability of human behaviors. Recent investigations employing variants of conditional diffusion models in trajectory prediction have exhibited notable success. Nevertheless, the heavy dependence on accurate historical data results in their vulnerability to noise disturbances and data incompleteness. To improve the robustness and reliability, we introduce the Guided Full Trajectory Diffuser (GFTD), a novel diffusion model framework that captures the joint full (historical and future) trajectory distribution. By learning from the full trajectory, GFTD can recover the noisy and missing data, hence improving the robustness. In addition, GFTD can adapt to data imperfections without additional training requirements, leveraging posterior sampling for reliable prediction and controllable generation. Our approach not only simplifies the prediction process but also enhances generalizability in scenarios with noise and incomplete inputs. Through rigorous experimental evaluation, GFTD exhibits superior performance in both trajectory prediction and controllable generation.
△ Less
Submitted 3 September, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models
Authors:
Mingjia Huo,
Sai Ashish Somayajula,
Youwei Liang,
Ruisi Zhang,
Farinaz Koushanfar,
Pengtao Xie
Abstract:
Large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing AI-generated and human-written texts. Watermarking is pivotal in this context, which involves embedding hidden markers in texts during the LLM inference phase, which is imperceptible to humans. Achieving both the detectability of inserted watermarks and the se…
▽ More
Large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing AI-generated and human-written texts. Watermarking is pivotal in this context, which involves embedding hidden markers in texts during the LLM inference phase, which is imperceptible to humans. Achieving both the detectability of inserted watermarks and the semantic quality of generated texts is challenging. While current watermarking algorithms have made promising progress in this direction, there remains significant scope for improvement. To address these challenges, we introduce a novel multi-objective optimization (MOO) approach for watermarking that utilizes lightweight networks to generate token-specific watermarking logits and splitting ratios. By leveraging MOO to optimize for both detection and semantic objective functions, our method simultaneously achieves detectability and semantic integrity. Experimental results show that our method outperforms current watermarking techniques in enhancing the detectability of texts generated by LLMs while maintaining their semantic coherence. Our code is available at https://github.com/mignonjia/TS_watermark.
△ Less
Submitted 6 June, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
Human-oriented Representation Learning for Robotic Manipulation
Authors:
Mingxiao Huo,
Mingyu Ding,
Chenfeng Xu,
Thomas Tian,
Xinghao Zhu,
Yao Mu,
Lingfeng Sun,
Masayoshi Tomizuka,
Wei Zhan
Abstract:
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited fo…
▽ More
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited for learning robot manipulation policies compared to current state-of-the-art visual representations purely based on self-supervised objectives. We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce Task Fusion Decoder as a plug-and-play embedding translator that utilizes the underlying relationships among these perceptual skills to guide the representation learning towards encoding meaningful structure for what's important for all perceptual skills, ultimately empowering learning of downstream robotic manipulation tasks. Extensive experiments across a range of robotic tasks and embodiments, in both simulations and real-world environments, show that our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders including R3M, MVP, and EgoVLP, for downstream manipulation policy-learning. Project page: https://sites.google.com/view/human-oriented-robot-learning
△ Less
Submitted 4 October, 2023;
originally announced October 2023.
-
AbHE: All Attention-based Homography Estimation
Authors:
Mingxiao Huo,
Zhihao Zhang,
Xinyang Ren,
Xianqiang Yang
Abstract:
Homography estimation is a basic computer vision task, which aims to obtain the transformation from multi-view images for image alignment. Unsupervised learning homography estimation trains a convolution neural network for feature extraction and transformation matrix regression. While the state-of-theart homography method is based on convolution neural networks, few work focuses on transformer whi…
▽ More
Homography estimation is a basic computer vision task, which aims to obtain the transformation from multi-view images for image alignment. Unsupervised learning homography estimation trains a convolution neural network for feature extraction and transformation matrix regression. While the state-of-theart homography method is based on convolution neural networks, few work focuses on transformer which shows superiority in highlevel vision tasks. In this paper, we propose a strong-baseline model based on the Swin Transformer, which combines convolution neural network for local features and transformer module for global features. Moreover, a cross non-local layer is introduced to search the matched features within the feature maps coarsely. In the homography regression stage, we adopt an attention layer for the channels of correlation volume, which can drop out some weak correlation feature points. The experiment shows that in 8 Degree-of-Freedoms(DOFs) homography estimation our method overperforms the state-of-the-art method.
△ Less
Submitted 5 February, 2023; v1 submitted 6 December, 2022;
originally announced December 2022.
-
A Note on Lower Digits Extraction Polynomial for Bootstrapping
Authors:
Mingjia Huo,
Kewen Wu,
Qi Ye
Abstract:
Bootstrapping is a crucial but computationally expensive step for realizing Fully Homomorphic Encryption (FHE). Recently, Chen and Han (Eurocrypt 2018) introduced a family of low-degree polynomials to extract the lowest digit with respect to a certain congruence, which helps improve the bootstrapping for both FV and BGV schemes.
In this note, we present the following relevant findings about the…
▽ More
Bootstrapping is a crucial but computationally expensive step for realizing Fully Homomorphic Encryption (FHE). Recently, Chen and Han (Eurocrypt 2018) introduced a family of low-degree polynomials to extract the lowest digit with respect to a certain congruence, which helps improve the bootstrapping for both FV and BGV schemes.
In this note, we present the following relevant findings about the work of Chen and Han (referred to as CH18):
1. We provide a simpler construction of the low-degree polynomials that serve the same purpose and match the asymptotic bound achieved in CH18;
2. We show the optimality and limit of our approach by solving a minimal polynomial degree problem;
3. We consider the problem of extracting other low-order digits using polynomials, and provide negative results.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.