Showing 1–22 of 22 results for author: Huo, M

Searching in archive cs.
  1. arXiv:2506.12716  [pdf, ps, other]

    cs.CV

    Generative 4D Scene Gaussian Splatting with Object View-Synthesis Priors

    Authors: Wen-Hsuan Chu, Lei Ke, Jianmeng Liu, Mingxiao Huo, Pavel Tokmakov, Katerina Fragkiadaki

    Abstract: We tackle the challenge of generating dynamic 4D scenes from monocular, multi-object videos with heavy occlusions, and introduce GenMOJO, a novel approach that integrates rendering-based deformable 3D Gaussian optimization with generative priors for view synthesis. While existing models perform well on novel view synthesis for isolated objects, they struggle to generalize to complex, cluttered sce…

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: This is an updated and extended version of our CVPR paper "Robust Multi-Object 4D Generation in Complex Video Scenarios"

  2. arXiv:2505.16001  [pdf, ps, other]

    cs.CV

    Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning

    Authors: Qiang Zhu, Kuan Lu, Menghao Huo, Yuxiao Li

    Abstract: Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of t…

    Submitted 21 May, 2025; originally announced May 2025.

  3. arXiv:2505.15146  [pdf, ps, other]

    cs.AI

    lmgame-Bench: How Good are LLMs at Playing Games?

    Authors: Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang

    Abstract: Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential…

    Submitted 3 June, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  4. arXiv:2504.16016  [pdf, ps, other]

    cs.CV

    Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework

    Authors: Xinyuan Song, Yangfan He, Sida Li, Jianhui Wang, Hongyang He, Xinhang Yuan, Ruoyu Wang, Jiaqi Chen, Keqin Li, Kuan Lu, Menghao Huo, Binxu Li, Pei Liu

    Abstract: Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2501.04606

  5. arXiv:2504.14868  [pdf, ps, other]

    cs.CV

    Twin Co-Adaptive Dialogue for Progressive Image Generation

    Authors: Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Hongyang He, Wenyu Zhu, Xinhang Yuan, Kuan Lu, Menghao Huo, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang

    Abstract: Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, i…

    Submitted 21 April, 2025; originally announced April 2025.

  6. arXiv:2504.02275  [pdf, other]

    cs.LG

    Enhancing Customer Contact Efficiency with Graph Neural Networks in Credit Card Fraud Detection Workflow

    Authors: Menghao Huo, Kuan Lu, Qiang Zhu, Zhenrui Chen

    Abstract: Credit card fraud has been a persistent issue since the last century, causing significant financial losses to the industry. The most effective way to prevent fraud is by contacting customers to verify suspicious transactions. However, while these systems are designed to detect fraudulent activity, they often mistakenly flag legitimate transactions, leading to unnecessary declines that disrupt the…

    Submitted 3 April, 2025; originally announced April 2025.

  7. arXiv:2503.23512  [pdf, ps, other]

    cs.CL

    SCORE: Story Coherence and Retrieval Enhancement for AI Narratives

    Authors: Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Li Sun, Yi Xin, Jingqun Tang, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, Tianyu Shi

    Abstract: Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating…

    Submitted 12 June, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

  8. arXiv:2503.14662  [pdf, other]

    cs.CL cs.AI

    ConQuer: A Framework for Concept-Based Quiz Generation

    Authors: Yicheng Fu, Zikui Wang, Liuxin Yang, Meiqing Huo, Zhongdongming Dai

    Abstract: Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated qu…

    Submitted 18 March, 2025; originally announced March 2025.

  9. arXiv:2503.00737  [pdf, other]

    cs.CV

    Multi-Cali Anything: Dense Feature Multi-Frame Structure-from-Motion for Large-Scale Camera Array Calibration

    Authors: Jinjiang You, Hewei Wang, Yijie Li, Mingxiao Huo, Long Van Tran Ha, Mingyuan Ma, Jinfeng Xu, Puzhen Wu, Shubham Garg, Wei Pu

    Abstract: Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration metho…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: 8 pages

  10. arXiv:2501.15167  [pdf, ps, other]

    cs.CV

    Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy

    Authors: Yangfan He, Jianhui Wang, Yijin Wang, Kun Li, Yan Zhong, Xinyuan Song, Li Sun, Jingyuan Lu, Jingqun Tang, Miao Zhang, Tianyu Shi, Xinhang Yuan, Yi Xin, Kuan Lu, Menghao Huo, Keqin Li, Jiaqi Chen

    Abstract: Today's image generation systems are capable of producing realistic and high-quality images. However, user prompts often contain ambiguities, making it difficult for these systems to interpret users' actual intentions. Consequently, many users must modify their prompts several times to ensure the generated images meet their expectations. While some methods focus on enhancing prompts to make the ge…

    Submitted 12 June, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

  11. arXiv:2501.09169  [pdf, other]

    eess.AS cs.SD

    Beyond Speaker Identity: Text Guided Target Speech Extraction

    Authors: Mingyue Huo, Abhinav Jain, Cong Phuoc Huynh, Fanjie Kong, Pichao Wang, Zhu Liu, Vimal Bhat

    Abstract: Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity like enrollment audio, face images, or videos, which may not always be available. In this paper, we propose a text-guided TSE model StyleTSE that uses natural language descriptions of speaking style in addition to the audio clue to extract the desired speech from a given mixture. Our model integrates…

    Submitted 15 January, 2025; originally announced January 2025.

    Comments: Accepted by ICASSP 2025

  12. arXiv:2501.08620  [pdf, other]

    cs.LG

    CT-PatchTST: Channel-Time Patch Time-Series Transformer for Long-Term Renewable Energy Forecasting

    Authors: Menghao Huo, Kuan Lu, Yuxiao Li, Qiang Zhu, Zhenrui Chen

    Abstract: Accurately predicting renewable energy output is crucial for the efficient integration of solar and wind power into modern energy systems. This study develops and evaluates an advanced deep learning model, Channel-Time Patch Time-Series Transformer (CT-PatchTST), to forecast the power output of photovoltaic and wind energy systems using annual offshore wind power, onshore wind power, and solar pow…

    Submitted 18 May, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

  13. arXiv:2501.04606  [pdf, ps, other]

    cs.CV

    Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion

    Authors: Yangfan He, Sida Li, Jianhui Wang, Kun Li, Xinyuan Song, Xinhang Yuan, Keqin Li, Kuan Lu, Menghao Huo, Jingqun Tang, Yi Xin, Jiaqi Chen, Miao Zhang, Xueqian Wang

    Abstract: Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame-independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-ba…

    Submitted 11 June, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

  14. arXiv:2412.20367  [pdf, ps, other]

    cs.SE cs.CL

    Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

    Authors: Junqiao Wang, Zeng Zhang, Yangfan He, Zihao Zhang, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Xin Yi, Zhongwei Wan, Xinhang Yuan, Kuan Lu, Menghao Huo, Tang Jingqun, Guangwu Qian, Keqin Li, Qiuwu Chen, Lewei He

    Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing large language models (LLMs) in code generation and optimization. This survey systematically reviews RL-driven techniques across the code development lifecycle, from compiler-level optimizations and resource allocation strategies to end-to-end code synthesis frameworks. We first examine classical and modern RL algorithms…

    Submitted 11 June, 2025; v1 submitted 29 December, 2024; originally announced December 2024.

  15. arXiv:2407.01531  [pdf, other]

    cs.RO cs.LG

    Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

    Authors: Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, Masayoshi Tomizuka

    Abstract: The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). B…

    Submitted 24 October, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: Published at CoRL 2024

  16. arXiv:2407.00617  [pdf, other]

    cs.LG cs.AI cs.CL cs.GT

    Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

    Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu

    Abstract: Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-th…

    Submitted 2 March, 2025; v1 submitted 30 June, 2024; originally announced July 2024.

  17. arXiv:2406.18591  [pdf, other]

    cs.CV cs.AI cs.LG

    Composition Vision-Language Understanding via Segment and Depth Anything Model

    Authors: Mingxiao Huo, Pengliang Ji, Haotian Lin, Junchen Liu, Yixiao Wang, Yijun Chen

    Abstract: We introduce a pioneering unified library that leverages depth anything, segment anything models to augment neural comprehension in language-vision model zero-shot understanding. This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V, enhancing multimodal tasks such as vision-question-answering (VQA) and composition reasoning. Through t…

    Submitted 7 June, 2024; originally announced June 2024.

  18. arXiv:2404.00237  [pdf, other]

    cs.RO

    Joint Pedestrian Trajectory Prediction through Posterior Sampling

    Authors: Haotian Lin, Yixiao Wang, Mingxiao Huo, Chensheng Peng, Zhiyuan Liu, Masayoshi Tomizuka

    Abstract: Joint pedestrian trajectory prediction has long grappled with the inherent unpredictability of human behaviors. Recent investigations employing variants of conditional diffusion models in trajectory prediction have exhibited notable success. Nevertheless, the heavy dependence on accurate historical data results in their vulnerability to noise disturbances and data incompleteness. To improve the ro…

    Submitted 3 September, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

  19. arXiv:2402.18059  [pdf, other]

    cs.LG cs.CL cs.CR

    Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models

    Authors: Mingjia Huo, Sai Ashish Somayajula, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, Pengtao Xie

    Abstract: Large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing AI-generated and human-written texts. Watermarking is pivotal in this context, which involves embedding hidden markers in texts during the LLM inference phase, which is imperceptible to humans. Achieving both the detectability of inserted watermarks and the se…

    Submitted 6 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: 22 pages, 13 figures, 5 tables

  20. arXiv:2310.03023  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Human-oriented Representation Learning for Robotic Manipulation

    Authors: Mingxiao Huo, Mingyu Ding, Chenfeng Xu, Thomas Tian, Xinghao Zhu, Yao Mu, Lingfeng Sun, Masayoshi Tomizuka, Wei Zhan

    Abstract: Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited fo…

    Submitted 4 October, 2023; originally announced October 2023.

  21. arXiv:2212.03029  [pdf, other]

    cs.CV

    AbHE: All Attention-based Homography Estimation

    Authors: Mingxiao Huo, Zhihao Zhang, Xinyang Ren, Xianqiang Yang

    Abstract: Homography estimation is a basic computer vision task, which aims to obtain the transformation from multi-view images for image alignment. Unsupervised learning homography estimation trains a convolution neural network for feature extraction and transformation matrix regression. While the state-of-the-art homography method is based on convolution neural networks, few work focuses on transformer whi…

    Submitted 5 February, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

  22. arXiv:1906.02867  [pdf, ps, other]

    cs.CR

    A Note on Lower Digits Extraction Polynomial for Bootstrapping

    Authors: Mingjia Huo, Kewen Wu, Qi Ye

    Abstract: Bootstrapping is a crucial but computationally expensive step for realizing Fully Homomorphic Encryption (FHE). Recently, Chen and Han (Eurocrypt 2018) introduced a family of low-degree polynomials to extract the lowest digit with respect to a certain congruence, which helps improve the bootstrapping for both FV and BGV schemes. In this note, we present the following relevant findings about the…

    Submitted 6 June, 2019; originally announced June 2019.

    Comments: 4 pages