Showing 1–50 of 574 results for author: Lin, D

Searching in archive cs.
  1. arXiv:2505.09614  [pdf, ps, other]

    cs.AI cs.CL

    Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

    Authors: Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino

    Abstract: Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit syste…

    Submitted 14 May, 2025; originally announced May 2025.

  2. Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes

    Authors: Xijie Yang, Linning Xu, Lihan Jiang, Dahua Lin, Bo Dai

    Abstract: 3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, su…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: project page: https://xijie-yang.github.io/V3DG/

  3. arXiv:2505.05799  [pdf, ps, other]

    cs.LG cs.AI

    MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design

    Authors: Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

    Abstract: Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE,…

    Submitted 9 May, 2025; originally announced May 2025.

  4. arXiv:2505.00219  [pdf, other]

    q-bio.NC cs.HC

    Real-Time Brain-Computer Interface Control of Walking Exoskeleton with Bilateral Sensory Feedback

    Authors: Jeffrey Lim, Po T. Wang, Won Joon Sohn, Derrick Lin, Shravan Thaploo, Luke Bashford, David Bjanes, Angelica Nguyen, Hui Gong, Michelle Armacost, Susan J. Shaw, Spencer Kellis, Brian Lee, Darrin Lee, Payam Heydari, Richard A. Andersen, Zoran Nenadic, Charles Y. Liu, An H. Do

    Abstract: Invasive brain-computer interface (BCI) technology has demonstrated the possibility of restoring brain-controlled walking in paraplegic spinal cord injury patients. However, current implementations of BCI-controlled walking still have significant drawbacks. In particular, prior systems are unidirectional and lack sensory feedback for insensate patients, have suboptimal reliance on brain signals fr…

    Submitted 30 April, 2025; originally announced May 2025.

    Comments: Main text of pre-print and supplementary information included

  5. arXiv:2504.20532  [pdf, other]

    cs.MM cs.CR cs.SD eess.AS

    TriniMark: A Robust Generative Speech Watermarking Method for Trinity-Level Attribution

    Authors: Yue Li, Weizhi Liu, Dongdong Lin

    Abstract: The emergence of diffusion models has facilitated the generation of speech with reinforced fidelity and naturalness. While deepfake detection technologies have manifested the ability to identify AI-generated content, their efficacy decreases as generative models become increasingly sophisticated. Furthermore, current research in the field has not adequately addressed the necessity for robust water…

    Submitted 29 April, 2025; originally announced April 2025.

  6. arXiv:2504.18448  [pdf, other]

    cs.CV

    NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

    Authors: Haotian Dong, Xin Wang, Di Lin, Yipeng Wu, Qin Chen, Ruonan Liu, Kairui Yang, Ping Li, Qing Guo

    Abstract: High-quality video generation is crucial for many fields, including the film industry and autonomous driving. However, generating videos with spatiotemporal consistencies remains challenging. Current methods typically utilize attention mechanisms or modify noise to achieve consistent videos, neglecting global spatiotemporal information that could help ensure spatial and temporal consistency during…

    Submitted 25 April, 2025; originally announced April 2025.

  7. arXiv:2504.17815  [pdf, other]

    cs.CV

    Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning

    Authors: Mingxuan Cui, Qing Guo, Yuyi Wang, Hongkai Yu, Di Lin, Qin Zou, Ming-Ming Cheng, Xi Li

    Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful and efficient 3D representation for novel view synthesis. This paper extends 3DGS capabilities to inpainting, where masked objects in a scene are replaced with new contents that blend seamlessly with the surroundings. Unlike 2D image inpainting, 3D Gaussian inpainting (3DGI) is challenging in effectively leveraging complementary visual and sem…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 14 pages, 12 figures, ICCV

  8. arXiv:2504.15035  [pdf, other]

    cs.CR cs.AI cs.SD

    SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation

    Authors: Yue Li, Weizhi Liu, Dongdong Lin

    Abstract: The accelerated advancement of speech generative models has given rise to security issues, including model infringement and unauthorized abuse of content. Although existing generative watermarking techniques have proposed corresponding solutions, most methods require substantial computational overhead and training costs. In addition, some methods have limitations in robustness when handling variab…

    Submitted 21 April, 2025; originally announced April 2025.

  9. arXiv:2504.14832  [pdf, other]

    cs.CR cs.AI cs.SD

    Protecting Your Voice: Temporal-aware Robust Watermarking

    Authors: Yue Li, Weizhi Liu, Dongdong Lin

    Abstract: The rapid advancement of generative models has led to the synthesis of real-fake ambiguous voices. To erase the ambiguity, embedding watermarks into the frequency-domain features of synthesized voices has become a common routine. However, the robustness achieved by choosing the frequency domain often comes at the expense of fine-grained voice features, leading to a loss of fidelity. Maximizing the…

    Submitted 20 April, 2025; originally announced April 2025.

  10. arXiv:2504.13175  [pdf, other]

    cs.RO

    Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation

    Authors: Sizhe Yang, Wenye Yu, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, Jiangmiao Pang

    Abstract: Visuomotor policies learned from teleoperated demonstrations face challenges such as lengthy data collection, high costs, and limited data diversity. Existing approaches address these issues by augmenting image observations in RGB space or employing Real-to-Sim-to-Real pipelines based on physical simulators. However, the former is constrained to 2D data augmentation, while the latter suffers from…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Published at Robotics: Science and Systems (RSS) 2025

  11. arXiv:2504.13074  [pdf, other]

    cs.CV

    SkyReels-V2: Infinite-length Film Generative Model

    Authors: Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, Yahui Zhou

    Abstract: Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming fro…

    Submitted 21 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: 31 pages,10 figures

  12. arXiv:2504.12317  [pdf]

    cs.CL

    ChatGPT as Linguistic Equalizer? Quantifying LLM-Driven Lexical Shifts in Academic Writing

    Authors: Dingkang Lin, Naixuan Zhao, Dan Tian, Jiang Li

    Abstract: The advent of ChatGPT has profoundly reshaped scientific research practices, particularly in academic writing, where non-native English-speakers (NNES) historically face linguistic barriers. This study investigates whether ChatGPT mitigates these barriers and fosters equity by analyzing lexical complexity shifts across 2.8 million articles from OpenAlex (2020-2024). Using the Measure of Textual Le…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: 13 pages, 2 figures

  13. arXiv:2504.10700  [pdf, other]

    cs.DC cs.AI

    Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE

    Authors: Jesun Firoz, Franco Pellegrini, Mario Geiger, Darren Hsu, Jenna A. Bilbrey, Han-Yi Chou, Maximilian Stadler, Markus Hoehnerbach, Tingyu Wang, Dejun Lin, Emine Kucukbenli, Henry W. Sprueill, Ilyes Batatia, Sotiris S. Xantheas, MalSoon Lee, Chris Mundy, Gabor Csanyi, Justin S. Smith, Ponnuswamy Sadayappan, Sutanay Choudhury

    Abstract: Chemistry Foundation Models (CFMs) that leverage Graph Neural Networks (GNNs) operating on 3D molecular graph structures are becoming indispensable tools for computational chemists and materials scientists. These models facilitate the understanding of matter and the discovery of new molecules and materials. In contrast to GNNs operating on a large homogeneous graphs, GNNs used by CFMs process a la…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted at The 34th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2025)

  14. arXiv:2504.10479  [pdf, other]

    cs.CV

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, et al. (26 additional authors not shown)

    Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single p…

    Submitted 18 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: Technical Report

  15. arXiv:2504.09428  [pdf]

    cs.SI cs.AI cs.IR

    FROG: Effective Friend Recommendation in Online Games via Modality-aware User Preferences

    Authors: Qiwei Wang, Dandan Lin, Wenqing Lin, Ziming Wu

    Abstract: Due to the convenience of mobile devices, the online games have become an important part for user entertainments in reality, creating a demand for friend recommendation in online games. However, none of existing approaches can effectively incorporate the multi-modal user features (e.g., images and texts) with the structural information in the friendship graph, due to the following limitations: (1)…

    Submitted 26 April, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

    Comments: Accepted in SIGIR 2025

  16. arXiv:2504.07957  [pdf, other]

    cs.CV

    MM-IFEngine: Towards Multimodal Instruction Following

    Authors: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang

    Abstract: The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To addre…

    Submitted 27 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  17. arXiv:2504.07083  [pdf, other]

    cs.CV

    GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

    Authors: Mengchen Zhang, Tong Wu, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin

    Abstract: Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: Traditional approaches rely on ge…

    Submitted 10 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

  18. arXiv:2504.06232  [pdf, other]

    cs.CV

    HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

    Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

    Abstract: Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential…

    Submitted 8 April, 2025; originally announced April 2025.

  19. arXiv:2504.04126  [pdf, other]

    cs.CV cs.AI

    Multi-identity Human Image Animation with Structural Video Diffusion

    Authors: Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Yuwei Guo, Dahua Lin, Tianfan Xue, Bo Dai

    Abstract: Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of hu…

    Submitted 5 April, 2025; originally announced April 2025.

    Comments: 11 pages

  20. arXiv:2504.01822  [pdf, other]

    cs.SE cs.CR

    Track and Trace: Automatically Uncovering Cross-chain Transactions in the Multi-blockchain Ecosystems

    Authors: Dan Lin, Ziye Zheng, Jiajing Wu, Jingjing Yang, Kaixin Lin, Huan Xiao, Bowen Song, Zibin Zheng

    Abstract: Cross-chain technology enables seamless asset transfer and message-passing within decentralized finance (DeFi) ecosystems, facilitating multi-chain coexistence in the current blockchain environment. However, this development also raises security concerns, as malicious actors exploit cross-chain asset flows to conceal the provenance and destination of assets, thereby facilitating illegal activities…

    Submitted 2 April, 2025; originally announced April 2025.

  21. arXiv:2503.21745  [pdf, other]

    cs.CV

    3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models

    Authors: Yuhan Zhang, Mengchen Zhang, Tong Wu, Tengfei Wang, Gordon Wetzstein, Dahua Lin, Ziwei Liu

    Abstract: 3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a…

    Submitted 27 March, 2025; originally announced March 2025.

  22. arXiv:2503.21500  [pdf, other]

    cs.CL

    OpenHuEval: Evaluating Large Language Model on Hungarian Specifics

    Authors: Haote Yang, Xingjian Wei, Jiang Wu, Noémi Ligeti-Nagy, Jiaxing Sun, Yinfan Wang, Zijian Győző Yang, Junyuan Gao, Jingchao Wang, Bowen Jiang, Shasha Wang, Nanjun Yu, Zihao Zhang, Shixin Hong, Hongwei Liu, Wei Li, Songyang Zhang, Dahua Lin, Lijun Wu, Gábor Prószéky, Conghui He

    Abstract: We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative…

    Submitted 27 March, 2025; originally announced March 2025.

  23. arXiv:2503.18484  [pdf, other]

    cs.CV cs.CL

    PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

    Authors: Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He

    Abstract: Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enab…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Equal contribution: Junyuan Gao, Jiahe Song, Jiang Wu; Corresponding author: Conghui He

  24. arXiv:2503.17899  [pdf, other]

    cs.CV

    What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

    Authors: Dongheng Lin, Han Hu, Jianbo Jiao

    Abstract: Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: what time tells us? To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Lear…

    Submitted 22 March, 2025; originally announced March 2025.

  25. arXiv:2503.17752  [pdf, other]

    cs.CV

    HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving

    Authors: R. D. Lin, Pengcheng Weng, Yinqiao Wang, Han Ding, Jinsong Han, Fei Wang

    Abstract: LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlookin…

    Submitted 22 March, 2025; originally announced March 2025.

    Comments: accepted by CVPR 2025

  26. arXiv:2503.17534  [pdf, other]

    cs.LG cs.SE

    MetaSel: A Test Selection Approach for Fine-tuned DNN Models

    Authors: Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand, Dayi Lin

    Abstract: Deep Neural Networks (DNNs) face challenges during deployment due to data distribution shifts. Fine-tuning adapts pre-trained models to new contexts requiring smaller labeled sets. However, testing fine-tuned models under constrained labeling budgets remains a critical challenge. This paper introduces MetaSel, a new approach, tailored for fine-tuned DNN models, to select tests from unlabeled input…

    Submitted 25 March, 2025; v1 submitted 21 March, 2025; originally announced March 2025.

  27. arXiv:2503.16852  [pdf, other]

    cs.CV cs.AI

    Casual Inference via Style Bias Deconfounding for Domain Generalization

    Authors: Jiaxi Li, Di Lin, Hao Chen, Hongying Liu, Liang Wan, Wei Feng

    Abstract: Deep neural networks (DNNs) often struggle with out-of-distribution data, limiting their reliability in diverse realworld applications. To address this issue, domain generalization methods have been developed to learn domain-invariant features from single or multiple training domains, enabling generalization to unseen testing domains. However, existing approaches usually overlook the impact of sty…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: under review

  28. arXiv:2503.15264  [pdf, other]

    cs.CV

    LEGION: Learning to Ground and Explain for Synthetic Image Detection

    Authors: Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, Conghui He

    Abstract: The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated g…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: Project Page: https://opendatalab.github.io/LEGION

  29. arXiv:2503.14478  [pdf, other]

    cs.CV

    Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

    Authors: Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin

    Abstract: Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a m…

    Submitted 19 March, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: Evaluation Code and dataset see https://github.com/open-compass/Creation-MMBench

  30. arXiv:2503.10589  [pdf, other]

    cs.CV

    Long Context Tuning for Video Generation

    Authors: Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, Lu Jiang

    Abstract: Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Project Page: https://guoyww.github.io/projects/long-context-video/

  31. BioSpark: Beyond Analogical Inspiration to LLM-augmented Transfer

    Authors: Hyeonsu Kang, David Chuan-en Lin, Yan-Ying Chen, Matthew K. Hong, Nikolas Martelaro, Aniket Kittur

    Abstract: We present BioSpark, a system for analogical innovation designed to act as a creativity partner in reducing the cognitive effort in finding, mapping, and creatively adapting diverse inspirations. While prior approaches have focused on initial stages of finding inspirations, BioSpark uses LLMs embedded in a familiar, visual, Pinterest-like interface to go beyond inspiration to supporting users in i…

    Submitted 12 March, 2025; originally announced March 2025.

    Journal ref: ACM CHI 2025

  32. arXiv:2503.07680  [pdf, other]

    cs.LG cs.AI

    Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

    Authors: Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Yazhe Niu, Jiahao Hu, Ruihao Gong, Dahua Lin, Ningyi Xu

    Abstract: Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-c…

    Submitted 10 March, 2025; originally announced March 2025.

  33. arXiv:2503.06564  [pdf, other]

    cs.CV

    TR-DQ: Time-Rotation Diffusion Quantization

    Authors: Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, Yan Wang, Haotong Qin, Hao Tang

    Abstract: Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead for its generation process. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-steps variation during sampling. At the same time, most current approaches fail to accoun…

    Submitted 9 March, 2025; originally announced March 2025.

  34. arXiv:2503.04250  [pdf, other]

    cs.CV cs.HC

    An Egocentric Vision-Language Model based Portable Real-time Smart Assistant

    Authors: Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Mingfang Zhang, Lijin Yang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, Limin Wang

    Abstract: We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enha…

    Submitted 6 March, 2025; originally announced March 2025.

  35. arXiv:2503.02846  [pdf, other]

    cs.CL

    Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

    Authors: Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen

    Abstract: Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always come with truthful content in the LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduced noises during training. Therefore, this paper proposes a fine-grain…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: Accepted by ICLR 2025. Code is available at https://github.com/open-compass/ANAH

  36. arXiv:2503.01785  [pdf, other]

    cs.CV

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

    Abstract: Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models,…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: project page: https://github.com/Liuziyu77/Visual-RFT

  37. arXiv:2502.18443  [pdf, other]

    cs.CL

    olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

    Authors: Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini

    Abstract: PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs…

    Submitted 25 February, 2025; originally announced February 2025.

  38. arXiv:2502.13128  [pdf, other]

    cs.SD cs.AI

    SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

    Authors: Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

    Abstract: Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for…

    Submitted 18 February, 2025; originally announced February 2025.

  39. arXiv:2502.13013  [pdf, other]

    cs.RO cs.AI cs.HC

    HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit

    Authors: Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, Jiangmiao Pang

    Abstract: Generalizable humanoid loco-manipulation poses significant challenges, requiring coordinated whole-body control and precise, contact-rich object manipulation. To address this, this paper introduces HOMIE, a semi-autonomous teleoperation system that combines a reinforcement learning policy for body control mapped to a pedal, an isomorphic exoskeleton arm for arm control, and motion-sensing gloves f…

    Submitted 28 April, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

  40. arXiv:2502.11695  [pdf]

    cs.CY

    Information Sharing Among Countries: A Perspective from Country-Specific Websites in Global Brands

    Authors: Amit Pariyar, Yohei Murakami, Donghui Lin, Toru Ishida

    Abstract: Multiple official languages within a country along with languages common with other countries demand content consistency in both shared and unshared languages during information sharing. However, inconsistency due to conflict in content shared and content updates not propagated in languages between countries poses a problem. Towards addressing inconsistency, this research qualitatively studied tra…

    Submitted 17 February, 2025; originally announced February 2025.

  41. arXiv:2502.06781  [pdf, other]

    cs.CL cs.LG

    Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

    Authors: Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen

    Abstract: Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the techniques that are believed certainly to be adopted are only reinforcement learning…

    Submitted 10 February, 2025; originally announced February 2025.

    Comments: We released our code, data, and model on https://github.com/InternLM/OREAL

  42. arXiv:2502.06669  [pdf, other]

    cs.CL cs.AI

    Boosting Self-Efficacy and Performance of Large Language Models via Verbal Efficacy Stimulations

    Authors: Rui Chen, Tailai Peng, Xinran Xie, Dekun Lin, Zhe Cui, Zheng Chen

    Abstract: Significant improvements have been observed in the zero-shot capabilities of the Large Language Models (LLMs). Due to their high sensitivity to input, research has increasingly focused on enhancing LLMs' performance via direct and simple prompt engineering rather than intricate domain adaptation. Studies suggest that LLMs exhibit emotional intelligence, and both positive and negative emotions can…

    Submitted 10 February, 2025; originally announced February 2025.

    Comments: to be published in ICONIP 2024

  43. arXiv:2502.05911  [pdf, other]

    cs.CL

    GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation

    Authors: Runchuan Zhu, Zinco Jiang, Jiang Wu, Zhipeng Ma, Jiahe Song, Fengshuo Bai, Dahua Lin, Lijun Wu, Conghui He

    Abstract: Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: firstly, effectively reject unknown questions to minimize hallucinations; secondly, avoid over-refusal to ensure questions t…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: Equal contribution: Runchuan Zhu, Zinco Jiang, Jiang Wu; Corresponding author: Conghui He

  44. arXiv:2502.05173  [pdf, other]

    cs.CV

    VideoRoPE: What Makes for Good Video Rotary Position Embedding?

    Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

    Abstract: While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully co…

    Submitted 27 April, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

  45. arXiv:2501.19386  [pdf, ps, other]

    stat.ME cs.CV eess.SP

    Multi-Frame Blind Manifold Deconvolution for Rotating Synthetic Aperture Imaging

    Authors: Dao Lin, Jian Zhang, Martin Benning

    Abstract: Rotating synthetic aperture (RSA) imaging system captures images of the target scene at different rotation angles by rotating a rectangular aperture. Deblurring acquired RSA images plays a critical role in reconstructing a latent sharp image underlying the scene. In the past decade, the emergence of blind convolution technology has revolutionised this field by its ability to model complex features…

    Submitted 31 January, 2025; originally announced January 2025.

    Comments: 39 pages, 9 figures

    MSC Class: 62P30

  46. arXiv:2501.18588  [pdf, other]

    cs.HC cs.AI cs.CV cs.MM

    Inkspire: Supporting Design Exploration with Generative AI through Analogical Sketching

    Authors: David Chuan-En Lin, Hyeonsu B. Kang, Nikolas Martelaro, Aniket Kittur, Yan-Ying Chen, Matthew K. Hong

    Abstract: With recent advancements in the capabilities of Text-to-Image (T2I) AI models, product designers have begun experimenting with them in their work. However, T2I models struggle to interpret abstract language and the current user experience of T2I tools can induce design fixation rather than a more iterative, exploratory process. To address these challenges, we developed Inkspire, a sketch-driven to…

    Submitted 30 January, 2025; originally announced January 2025.

    Comments: Accepted to CHI 2025

  47. arXiv:2501.16330  [pdf, other]

    cs.CV cs.AI

    RelightVid: Temporal-Consistent Diffusion Model for Video Relighting

    Authors: Ye Fang, Zeyi Sun, Shangzhan Zhang, Tong Wu, Yinghao Xu, Pan Zhang, Jiaqi Wang, Gordon Wetzstein, Dahua Lin

    Abstract: Diffusion models have demonstrated remarkable success in image generation and editing, with recent advancements enabling albedo-preserving image relighting. However, applying these models to video relighting remains challenging due to the lack of paired video relighting datasets and the high demands for output fidelity and temporal consistency, further complicated by the inherent randomness of dif…

    Submitted 27 January, 2025; originally announced January 2025.

  48. arXiv:2501.14506  [pdf, other]

    cs.CL

    WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

    Authors: Jia Yu, Fei Yuan, Rui Min, Jing Yu, Pei Chu, Jiayang Li, Wei Li, Ruijie Zhang, Zhenxiang Li, Zhifei Ren, Dong Zheng, Wenjian Zhang, Yan Teng, Lingyu Meng, ZhenJiang Jin, Jiantao Qiu, ShaSha Wang, Zhongying Tu, Dahua Lin, Yu Wang, Yu Qiao, Yanfeng Wang, Conghui He

    Abstract: This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, c…

    Submitted 24 January, 2025; originally announced January 2025.

  49. arXiv:2501.14417  [pdf, other]

    cs.DC

    DeepFlow: Serverless Large Language Model Serving at Scale

    Authors: Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Jie Meng, Chao He, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, Yizhou Shan

    Abstract: This paper introduces DeepFlow, a scalable and serverless AI platform designed to efficiently serve large language models (LLMs) at scale in cloud environments. DeepFlow addresses key challenges such as resource allocation, serving efficiency, and cold start latencies through four main design components. First, it uses a simple serverless abstraction called the request-job-task model, which helps…

    Submitted 26 January, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

  50. arXiv:2501.12368  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

    Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang

    Abstract: Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary…

    Submitted 21 January, 2025; originally announced January 2025.

    Comments: Tech Report