Showing 1–50 of 237 results for author: Hu, K

  1. arXiv:2507.01573  [pdf, ps, other]

    cs.CV

    A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation

    Authors: Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao

    Abstract: Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily o…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 20 pages, 14 figures

  2. arXiv:2507.01061  [pdf, ps, other]

    cs.CY cs.AI cs.HC

    Epitome: Pioneering an Experimental Platform for AI-Social Science Integration

    Authors: Jingjing Qu, Kejia Hu, Jun Zhu, Wenhao Li, Teng Wang, Zhiyun Chen, Yulei Ye, Chaochao Lu, Aimin Zhou, Xiangfeng Wang, James Evan

    Abstract: The integration of Large Language Models (LLMs) into social science experiments represents a transformative approach to understanding human-AI interactions and their societal impacts. We introduce Epitome, the world's first open experimental platform dedicated to the deep integration of artificial intelligence and social science. Rooted in theoretical foundations from management, communication stu…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 18 pages, 5 figures

  3. arXiv:2506.07847  [pdf, ps, other]

    cs.CV

    F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation

    Authors: Hengzhi Chen, Liqian Feng, Wenhua Wu, Xiaogang Zhu, Shawn Leo, Kun Hu

    Abstract: Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces computational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks address this trade-off, they suffer from computat…

    Submitted 9 June, 2025; originally announced June 2025.

  4. arXiv:2506.01037  [pdf, ps, other]

    cs.CV

    Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

    Authors: Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, Kai Hu

    Abstract: Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across ad…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 11 pages, 10 figures, accepted by CVPR 2025

    ACM Class: I.4.4; I.2.6

  5. arXiv:2505.22279  [pdf, ps, other]

    cs.CV

    Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss

    Authors: Wenjun Lu, Haodong Chen, Anqi Yi, Yuk Ying Chung, Zhiyong Wang, Kun Hu

    Abstract: Novel view synthesis is a fundamental task in 3D computer vision that aims to reconstruct realistic images from a set of posed input views. However, reconstruction quality degrades significantly under sparse-view conditions due to limited geometric cues. Existing methods, such as Neural Radiance Fields (NeRF) and the more recent 3D Gaussian Splatting (3DGS), often suffer from blurred details and s…

    Submitted 28 May, 2025; originally announced May 2025.

  6. arXiv:2505.20767  [pdf, ps, other]

    cs.CL cs.AI

    CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models

    Authors: Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie

    Abstract: Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking assessment standards, existing benchmarks focus on "factual statements" that rephrase source materials while overlooking "cognitive statements" that involve making inferences from the given context. Consequently, evaluating and detecting the hallucination of cogni…

    Submitted 25 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: ACL 2025

  7. arXiv:2505.18763  [pdf, ps, other]

    cs.LG

    GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

    Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi

    Abstract: Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of la…

    Submitted 27 May, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

  8. arXiv:2505.16149  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification

    Authors: Zirui Pang, Haosheng Tan, Yuhan Pu, Zhijie Deng, Zhouan Shen, Keyu Hu, Jiaheng Wei

    Abstract: Image classification benchmark datasets such as CIFAR, MNIST, and ImageNet serve as critical tools for model evaluation. However, despite the cleaning efforts, these datasets still suffer from pervasive noisy labels and often contain missing labels due to the co-existing image pattern where multiple classes appear in an image sample. This results in misleading model comparisons and unfair evaluati…

    Submitted 21 May, 2025; originally announced May 2025.

  9. arXiv:2505.15670  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

    Authors: Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg

    Abstract: Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent strea…

    Submitted 6 June, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  10. arXiv:2505.15646  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Word Level Timestamp Generation for Automatic Speech Recognition and Translation

    Authors: Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg

    Abstract: We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignme…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  11. arXiv:2505.11175  [pdf, ps, other]

    cs.RO cs.AI

    Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition

    Authors: Bo Yue, Shuqi Guo, Kaiyu Hu, Chujiao Wang, Benyou Wang, Kui Jia, Guiliang Liu

    Abstract: Generative skill acquisition enables embodied agents to actively learn a scalable and evolving repertoire of control skills, crucial for the advancement of large decision models. While prior approaches often rely on supervision signals from generalist agents (e.g., LLMs), their effectiveness in complex 3D environments remains unclear; exhaustive evaluation incurs substantial computational costs, s…

    Submitted 19 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  12. arXiv:2505.01050  [pdf, other]

    cs.CV cs.LG

    Transferable Adversarial Attacks on Black-Box Vision-Language Models

    Authors: Kai Hu, Weichen Yu, Li Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson

    Abstract: Vision Large Language Models (VLLMs) are increasingly deployed to offer advanced capabilities on inputs comprising both text and images. While prior research has shown that adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts, the extent and effectiveness of such vulnerabilities remain underexplored for VLLMs. We present a comprehe…

    Submitted 2 May, 2025; originally announced May 2025.

  13. arXiv:2504.13835  [pdf, other]

    cs.CL cs.AI

    MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

    Authors: Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen

    Abstract: Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this…

    Submitted 18 April, 2025; originally announced April 2025.

  14. arXiv:2504.11763  [pdf, other]

    cs.CV

    Extended Short- and Long-Range Mesh Learning for Fast and Generalized Garment Simulation

    Authors: Aoran Liu, Kun Hu, Clinton Mo, Changyang Li, Zhiyong Wang

    Abstract: 3D garment simulation is a critical component for producing cloth-based graphics. Recent advancements in graph neural networks (GNNs) offer a promising approach for efficient garment simulation. However, GNNs require extensive message-passing to propagate information such as physical forces and maintain contact awareness across the entire garment mesh, which becomes computationally inefficient at…

    Submitted 16 April, 2025; originally announced April 2025.

  15. arXiv:2504.09586  [pdf, other]

    cs.CL

    Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance

    Authors: Zuoli Tang, Junjie Ou, Kaiqin Hu, Chunwei Wu, Zhaoxin Huan, Chilin Fu, Xiaolu Zhang, Jun Zhou, Chenliang Li

    Abstract: Recent years have witnessed significant progress in large language models' (LLMs) reasoning, which is largely due to the chain-of-thought (CoT) approaches, allowing models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related qu…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: Under review

  16. arXiv:2504.08061  [pdf, other]

    cs.CV cs.AI

    STEI-PCN: an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring

    Authors: Kai Hu, Zhidan Zhao, Zhifeng Hao

    Abstract: Traffic data exhibits complex temporal, spatial, and spatial-temporal correlations. Most models use either independent modules to separately extract temporal and spatial correlations or joint modules to synchronously extract them, without considering the spatial-temporal correlations. Moreover, models that consider joint spatial-temporal correlations (temporal, spatial, and spatial-temporal cor…

    Submitted 10 April, 2025; originally announced April 2025.

  17. arXiv:2504.07691  [pdf, other]

    cs.LG cs.CV

    Distilling Knowledge from Heterogeneous Architectures for Semantic Segmentation

    Authors: Yanglin Huang, Kai Hu, Yuan Zhang, Zhineng Chen, Xieping Gao

    Abstract: Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in architectures with different inductive biases, which is crucial for enabling the student to acquire a more precise and comprehensive understanding of the data during…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Accepted to AAAI 2025

  18. arXiv:2503.20734  [pdf, other]

    cs.CV

    SChanger: Change Detection from a Semantic Change and Spatial Consistency Perspective

    Authors: Ziyu Zhou, Keyan Hu, Yutian Fang, Xiaoping Rui

    Abstract: Change detection is a key task in Earth observation applications. Recently, deep learning methods have demonstrated strong performance and widespread application. However, change detection faces data scarcity due to the labor-intensive process of accurately aligning remote sensing images of the same area, which limits the performance of deep learning algorithms. To address the data scarcity issue,…

    Submitted 26 March, 2025; originally announced March 2025.

  19. arXiv:2503.16921  [pdf, other]

    cs.CV cs.AI

    When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO

    Authors: Lingfan Zhang, Chen Liu, Chengming Xu, Kai Hu, Donghao Luo, Chengjie Wang, Yanwei Fu, Yuan Yao

    Abstract: In recent years, the field of image generation has witnessed significant advancements, particularly in fine-tuning methods that align models with universal human preferences. This paper explores the critical role of preference data in the training process of diffusion models, particularly in the context of Diffusion-DPO and its subsequent adaptations. We investigate the complexities surrounding un…

    Submitted 21 March, 2025; originally announced March 2025.

  20. arXiv:2503.15893  [pdf, other]

    cs.CV

    UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis

    Authors: Jiawei Wang, Kai Hu, Qiang Huo

    Abstract: Document structure analysis, aka document layout analysis, is crucial for understanding both the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. Hierarchical Document Structure Analysis (HDSA) specifically aims to restore the hierarchical structure of documents created using authoring software with hierarchical s…

    Submitted 25 March, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: Accepted by Pattern Recognition. arXiv admin note: text overlap with arXiv:2405.11757

  21. arXiv:2503.15887  [pdf, other]

    cs.CV

    DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering

    Authors: Haochen Wang, Kai Hu, Liangcai Gao

    Abstract: Remote work and online courses have become important methods of knowledge dissemination, leading to a large number of document-based instructional videos. Unlike traditional video datasets, these videos mainly feature rich-text images and audio that are densely packed with information closely tied to the visual content, requiring advanced multimodal understanding capabilities. However, this domain…

    Submitted 20 March, 2025; originally announced March 2025.

  22. arXiv:2503.05931  [pdf, other]

    cs.CL eess.AS

    Training and Inference Efficiency of Encoder-Decoder Speech Models

    Authors: Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

    Abstract: Attention encoder-decoder model architecture is the backbone of several recent top performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask the questions of whether we are training these speech models e…

    Submitted 19 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

  23. arXiv:2503.05398  [pdf, other]

    cs.RO cs.CV

    Self-Modeling Robots by Photographing

    Authors: Kejun Hu, Peng Yu, Ning Tan

    Abstract: Self-modeling enables robots to build task-agnostic models of their morphology and kinematics based on data that can be automatically collected, with minimal human intervention and prior information, thereby enhancing machine intelligence. Recent research has highlighted the potential of data-driven technology in modeling the morphology and kinematics of robots. However, existing self-modeling met…

    Submitted 7 March, 2025; originally announced March 2025.

  24. arXiv:2503.05077  [pdf, other]

    cs.RO

    Adaptive-LIO: Enhancing Robustness and Precision through Environmental Adaptation in LiDAR Inertial Odometry

    Authors: Chengwei Zhao, Kun Hu, Jie Xu, Lijun Zhao, Baiwen Han, Kaidi Wu, Maoshan Tian, Shenghai Yuan

    Abstract: The emerging Internet of Things (IoT) applications, such as driverless cars, have a growing demand for high-precision positioning and navigation. LiDAR inertial odometry has become increasingly prevalent in robotics and autonomous driving. However, many current SLAM systems lack sufficient adaptability to various scenarios. Challenges include decreased point cloud accuracy with longer frame…

    Submitted 6 March, 2025; originally announced March 2025.

  25. arXiv:2503.00507  [pdf, other]

    cs.LG cs.IT

    Projection Head is Secretly an Information Bottleneck

    Authors: Zhuo Ouyang, Kaiwen Hu, Qi Zhang, Yifei Wang, Yisen Wang

    Abstract: Recently, contrastive learning has risen to be a promising paradigm for extracting meaningful data representations. Among various special designs, adding a projection head on top of the encoder during training and removing it for downstream tasks has proven to significantly enhance the performance of contrastive learning. However, despite its empirical success, the underlying mechanism of the proj…

    Submitted 3 March, 2025; v1 submitted 1 March, 2025; originally announced March 2025.

  26. arXiv:2502.19680  [pdf, other]

    cs.CV cs.AI

    M-LLM Based Video Frame Selection for Efficient Video Understanding

    Authors: Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, Trishul Chilimbi

    Abstract: Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular Multi-Modal Large Language Model (M-LLM) frameworks usually apply naive uniform sampling to reduce the number of video frames that are fed into an M-LLM, particularly for long context videos. However, it could lose crucial context in certain periods of a video, so that the downstream M-…

    Submitted 26 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  27. arXiv:2502.16496  [pdf, other]

    cs.LG cs.AI cs.MA

    PMAT: Optimizing Action Generation Order in Multi-Agent Reinforcement Learning

    Authors: Kun Hu, Muning Wen, Xihuai Wang, Shao Zhang, Yiwei Shi, Minne Li, Minglong Li, Ying Wen

    Abstract: Multi-agent reinforcement learning (MARL) faces challenges in coordinating agents due to complex interdependencies within multi-agent systems. Most MARL algorithms use the simultaneous decision-making paradigm but ignore the action-level dependencies among agents, which reduces coordination efficiency. In contrast, the sequential decision-making paradigm provides finer-grained supervision for agen…

    Submitted 23 February, 2025; originally announced February 2025.

    Comments: Accepted by AAMAS 2025

  28. arXiv:2502.13163  [pdf, other]

    cs.OS cs.CR cs.SE

    A Survey of Fuzzing Open-Source Operating Systems

    Authors: Kun Hu, Qicai Chen, Zilong Lu, Wenzhuo Zhang, Bihuan Chen, You Lu, Haowen Jiang, Bingkun Sun, Xin Peng, Wenyun Zhao

    Abstract: Vulnerabilities in open-source operating systems (OSs) pose substantial security risks to software systems, making their detection crucial. While fuzzing has been an effective vulnerability detection technique in various domains, OS fuzzing (OSF) faces unique challenges due to OS complexity and multi-layered interaction, and has not been comprehensively reviewed. Therefore, this work systematicall…

    Submitted 20 February, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

    Comments: 45 pages

  29. arXiv:2502.10330  [pdf, other]

    cs.LG

    Exploring the Boundary of Diffusion-based Methods for Solving Constrained Optimization

    Authors: Shutong Ding, Yimiao Zhou, Ke Hu, Xi Yao, Junchi Yan, Xiaoying Tang, Ye Shi

    Abstract: Diffusion models have achieved remarkable success in generative tasks such as image and video synthesis, and in control domains like robotics, owing to their strong generalization capabilities and proficiency in fitting complex multimodal distributions. However, their full potential in solving Continuous Constrained Optimization problems remains largely underexplored. Our work commences by investi…

    Submitted 27 May, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

  30. arXiv:2502.07128  [pdf, other]

    cs.CL cs.AI cs.MM

    Cardiverse: Harnessing LLMs for Novel Card Game Prototyping

    Authors: Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia

    Abstract: The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and deve…

    Submitted 10 February, 2025; originally announced February 2025.

    Comments: 13 pages, 7 figures, 3 tables

  31. arXiv:2502.06424  [pdf, other]

    cs.LG cs.AI

    CS-SHAP: Extending SHAP to Cyclic-Spectral Domain for Better Interpretability of Intelligent Fault Diagnosis

    Authors: Qian Chen, Xingjian Dong, Kui Hu, Kangkang Chen, Zhike Peng, Guang Meng

    Abstract: Neural networks (NNs), with their powerful nonlinear mapping and end-to-end capabilities, are widely applied in mechanical intelligent fault diagnosis (IFD). However, as typical black-box models, they pose challenges in understanding their decision basis and logic, limiting their deployment in high-reliability scenarios. Hence, various methods have been proposed to enhance the interpretability of…

    Submitted 10 February, 2025; originally announced February 2025.

    Comments: 21 pages, 21 figures

  32. arXiv:2501.13826  [pdf, other]

    cs.CV cs.CL

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Authors: Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu

    Abstract: Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimoda…

    Submitted 23 January, 2025; originally announced January 2025.

  33. arXiv:2501.12948  [pdf, other]

    cs.CL cs.AI cs.LG

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, et al. (175 additional authors not shown)

    Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters…

    Submitted 22 January, 2025; originally announced January 2025.

  34. arXiv:2501.10966  [pdf, other]

    cs.CV cs.AI

    DC-PCN: Point Cloud Completion Network with Dual-Codebook Guided Quantization

    Authors: Qiuxia Wu, Haiyang Huang, Kunming Su, Zhiyong Wang, Kun Hu

    Abstract: Point cloud completion aims to reconstruct complete 3D shapes from partial 3D point clouds. With advancements in deep learning techniques, various methods for point cloud completion have been developed. Despite achieving encouraging results, a significant issue remains: these methods often overlook the variability in point clouds sampled from a single 3D object surface. This variability can lead t…

    Submitted 19 January, 2025; originally announced January 2025.

    Comments: AAAI25 Accepted

  35. arXiv:2501.02321  [pdf, other]

    cs.CY

    KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation

    Authors: Yulong Li, Bolin Ren, Ke Hu, Changyuan Liu, Zhengyong Jiang, Kang Dang, Jionglong Su

    Abstract: Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models, relying on video inputs, involve large parameter sizes, making it time-consuming and computationally intensive to b…

    Submitted 13 January, 2025; v1 submitted 4 January, 2025; originally announced January 2025.

    Comments: AAAI 2025

  36. arXiv:2412.19437  [pdf, other]

    cs.CL cs.AI

    DeepSeek-V3 Technical Report

    Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, et al. (175 additional authors not shown)

    Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for loa…

    Submitted 18 February, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

  37. arXiv:2412.18985  [pdf, other]

    cs.AI cs.HC

    TravelAgent: Generative Agents in the Built Environment

    Authors: Ariel Noyman, Kai Hu, Kent Larson

    Abstract: Understanding human behavior in built environments is critical for designing functional, user centered urban spaces. Traditional approaches, such as manual observations, surveys, and simplified simulations, often fail to capture the complexity and dynamics of real world behavior. To address these limitations, we introduce TravelAgent, a novel simulation platform that models pedestrian navigation a…

    Submitted 25 December, 2024; originally announced December 2024.

    Comments: 21 pages, 9 figures

  38. arXiv:2412.14692  [pdf, other]

    cs.CV

    Explicit Relational Reasoning Network for Scene Text Detection

    Authors: Yuchen Su, Zhineng Chen, Yongkun Du, Zhilong Ji, Kai Hu, Jinfeng Bai, Xieping Gao

    Abstract: Connected component (CC) is a proper text shape representation that aligns with human reading intuition. However, CC-based text detection methods have recently faced a developmental bottleneck that their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships witho…

    Submitted 7 February, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted to AAAI 2025

  39. arXiv:2412.10302  [pdf, other]

    cs.CV cs.AI cs.CL

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Authors: Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, et al. (2 additional authors not shown)

    Abstract: We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage Deep…

    Submitted 13 December, 2024; originally announced December 2024.

  40. arXiv:2412.09919  [pdf, ps, other]

    cs.CV cs.AI

    B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

    Authors: Zhuqiang Lu, Zhenfei Yin, Mengwei He, Zhihui Wang, Zicheng Liu, Zhiyong Wang, Kun Hu

    Abstract: Recently, Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. The key of VLLMs is to encode visual content into sequences of visual tokens, enabling VLLMs to simultaneously process both visual and textual content. However, understanding videos, especially long videos, remains a challenge for VLLMs as the number of visual toke…

    Submitted 13 December, 2024; originally announced December 2024.

  41. arXiv:2412.05268  [pdf, other]

    cs.RO cs.CV

    DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from a Single Demo

    Authors: Junzhe Zhu, Yuanchen Ju, Junyi Zhang, Muhan Wang, Zhecheng Yuan, Kaizhe Hu, Huazhe Xu

    Abstract: Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object categories. To this end, we present DenseMatcher, a method capable of computing 3D correspondences between…

    Submitted 6 December, 2024; originally announced December 2024.

    Comments: Project Page: https://tea-lab.github.io/DenseMatcher/

  42. arXiv:2412.01091  [pdf, other]

    cs.CV

    DuoCast: Duo-Probabilistic Meteorology-Aware Model for Extended Precipitation Nowcasting

    Authors: Penghui Wen, Lei Bai, Mengwei He, Patrick Filippi, Feng Zhang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

    Abstract: Recently, extended short-term precipitation nowcasting struggles with decreasing precision because of insufficient consideration of meteorological knowledge, such as weather fronts which significantly influence precipitation intensity, duration, and spatial distribution. Therefore, in this paper, we present DuoCast, a novel dual-probabilistic meteorology-aware model designed to address both broad…

    Submitted 2 December, 2024; v1 submitted 1 December, 2024; originally announced December 2024.

  43. arXiv:2411.11493  [pdf, other]

    cs.DC

    LSRAM: A Lightweight Autoscaling and SLO Resource Allocation Framework for Microservices Based on Gradient Descent

    Authors: Kan Hu, Minxian Xu, Kejiang Ye, Chengzhong Xu

    Abstract: Microservices architecture has become the dominant architecture in the cloud computing paradigm with its advantages of facilitating development, deployment, modularity and scalability. The workflow of microservices architecture is transparent to the users, who are concerned with the quality of service (QoS). Taking Service Level Objective (SLO) as an important indicator of system resource scaling can…

    Submitted 18 November, 2024; originally announced November 2024.

    Comments: 22 pages

    Journal ref: Software: Practice and Experience 2024

  44. arXiv:2411.08063  [pdf]

    physics.soc-ph cond-mat.mtrl-sci cs.AI

    MatPilot: an LLM-enabled AI Materials Scientist under the Framework of Human-Machine Collaboration

    Authors: Ziqi Ni, Yahao Li, Kaijia Hu, Kunyuan Han, Ming Xu, Xingyu Chen, Fengqi Liu, Yicong Ye, Shuxin Bai

    Abstract: The rapid evolution of artificial intelligence, particularly large language models, presents unprecedented opportunities for materials science research. We proposed and developed an AI materials scientist named MatPilot, which has shown encouraging abilities in the discovery of new materials. The core strength of MatPilot is its natural language interactive human-machine collaboration, which augme…

    Submitted 10 November, 2024; originally announced November 2024.

  45. arXiv:2411.06508  [pdf, other]

    cs.LG cs.AI cs.CV cs.IT stat.ML

    Understanding the Role of Equivariance in Self-supervised Learning

    Authors: Yifei Wang, Kaiwen Hu, Sharut Gupta, Ziyu Ye, Yisen Wang, Stefanie Jegelka

    Abstract: Contrastive learning has been a leading paradigm for self-supervised learning, but it is widely observed that it comes at the price of sacrificing useful features (e.g., colors) by being invariant to data augmentations. Given this limitation, there has been a surge of interest in equivariant self-supervised learning (E-SSL) that learns features to be augmentation-aware. However, even for the simples…

    Submitted 10 November, 2024; originally announced November 2024.

    Comments: Accepted at NeurIPS 2024

  46. arXiv:2411.05945  [pdf, other]

    cs.CL cs.AI cs.LG cs.MA eess.AS

    NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts

    Authors: Yen-Ting Lin, Chao-Han Huck Yang, Zhehuai Chen, Piotr Zelasko, Xuesong Yang, Zih-Ching Chen, Krishna C Puvvada, Szu-Wei Fu, Ke Hu, Jun Wei Chiu, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang

    Abstract: Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in pa…

    Submitted 8 November, 2024; originally announced November 2024.

    Comments: The NeKo work was done in June 2024. NeKo LMs will be open-sourced at https://huggingface.co/nvidia under the MIT license

  47. arXiv:2411.04919  [pdf, other]

    cs.RO cs.CV

    Stem-OB: Generalizable Visual Imitation Learning with Stem-Like Convergent Observation through Diffusion Inversion

    Authors: Kaizhe Hu, Zihang Rui, Yao He, Yuyao Liu, Pu Hua, Huazhe Xu

    Abstract: Visual imitation learning methods demonstrate strong performance, yet they lack generalization when faced with visual input perturbations, including variations in lighting and textures, impeding their real-world application. We propose Stem-OB that utilizes pretrained image diffusion models to suppress low-level visual differences while maintaining high-level scene structures. This image inversion…

    Submitted 13 November, 2024; v1 submitted 7 November, 2024; originally announced November 2024.

    Comments: arXiv preprint version, website: https://hukz18.github.io/Stem-Ob/

  48. arXiv:2411.02272  [pdf, other]

    cs.LG cs.AI cs.CL

    Combining Induction and Transduction for Abstract Reasoning

    Authors: Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn, Hao Tang, Michelangelo Naim, Dat Nguyen, Wei-Long Zheng, Zenna Tavares, Yewen Pu, Kevin Ellis

    Abstract: When learning an input-output mapping from very few examples, is it better to first infer a latent function that explains the examples, or is it better to directly predict new test outputs, e.g. using a neural network? We study this question on ARC by training neural models for induction (inferring latent functions) and transduction (directly predicting the test output for a given test input). We…

    Submitted 2 December, 2024; v1 submitted 4 November, 2024; originally announced November 2024.

  49. SFDFusion: An Efficient Spatial-Frequency Domain Fusion Network for Infrared and Visible Image Fusion

    Authors: Kun Hu, Qingle Zhang, Maoxun Yuan, Yitian Zhang

    Abstract: Infrared and visible image fusion aims to utilize the complementary information from two modalities to generate fused images with prominent targets and rich texture details. Most existing algorithms only perform pixel-level or feature-level fusion from different modalities in the spatial domain. They usually overlook the information in the frequency domain, and some of them suffer from inefficienc…

    Submitted 30 October, 2024; originally announced October 2024.

    Comments: Accepted at ECAI 2024

  50. arXiv:2410.17485  [pdf, other]

    cs.CL eess.AS

    VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

    Authors: Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg

    Abstract: Recent studies have augmented large language models (LLMs) with speech capabilities, leading to the development of speech language models (SpeechLMs). Earlier SpeechLMs focused on single-turn speech-based question answering (QA), where user input comprised a speech context and a text question. More recent studies have extended this to multi-turn conversations, though they often require complex, mu…

    Submitted 6 February, 2025; v1 submitted 22 October, 2024; originally announced October 2024.

    Comments: Accepted at NAACL 2025 main conference