Showing 1–50 of 226 results for author: Zheng, F

  1. arXiv:2507.04630  [pdf, ps, other]

    cs.CV

    Learn 3D VQA Better with Active Selection and Reannotation

    Authors: Shengli Zhou, Yang Liu, Feng Zheng

    Abstract: 3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scen…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: Accepted by ACM MM 2025

  2. arXiv:2507.01800  [pdf, ps, other]

    cs.CV cs.MM

    HCNQA: Enhancing 3D VQA with Hierarchical Concentration Narrowing Supervision

    Authors: Shengli Zhou, Jianuo Zhu, Qilin Huang, Fangjing Wang, Yanfu Zhang, Feng Zheng

    Abstract: 3D Visual Question-Answering (3D VQA) is pivotal for models to perceive the physical world and perform spatial reasoning. Answer-centric supervision is a commonly used training method for 3D VQA models. Many models that utilize this strategy have achieved promising results in 3D VQA tasks. However, the answer-centric approach only supervises the final output of models and allows models to develop…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: ICANN 2025

  3. arXiv:2506.23835  [pdf, ps, other]

    cs.CV

    Refine Any Object in Any Scene

    Authors: Ziwei Chen, Ziling Liu, Zitong Huang, Mingqi Gao, Feng Zheng

    Abstract: Viewpoint missing of objects is common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring detailed object und…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 9 pages with 6 figures

  4. arXiv:2506.14806  [pdf, ps, other]

    cs.LG

    Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis

    Authors: Bochen Lyu, Xiaojing Zhang, Fangyi Zheng, He Wang, Zheng Wang, Zhanxing Zhu

    Abstract: This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying the discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap b…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 32 pages, 7 figures

  5. arXiv:2506.14226  [pdf, ps, other]

    cs.SD

    Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification

    Authors: Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: Short-utterance speaker verification presents significant challenges due to the limited information in brief speech segments, which can undermine accuracy and reliability. Recently, zero-shot text-to-speech (ZS-TTS) systems have made considerable progress in preserving speaker identity. In this study, we explore, for the first time, the use of ZS-TTS systems for test-time data augmentation for spe…

    Submitted 17 June, 2025; originally announced June 2025.

  6. arXiv:2506.07570  [pdf, ps, other]

    cs.CV cs.AI

    LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization

    Authors: Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, Feng Zheng

    Abstract: Automatic indoor layout generation has attracted increasing attention due to its potential in interior design, virtual environment construction, and embodied AI. Existing methods fall into two categories: prompt-driven approaches that leverage proprietary LLM services (e.g., GPT APIs) and learning-based methods trained on layout data upon diffusion-based models. Prompt-driven methods often suffer…

    Submitted 9 June, 2025; originally announced June 2025.

  7. arXiv:2505.15431  [pdf, ps, other]

    cs.CL

    Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

    Authors: Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, et al. (230 additional authors not shown)

    Abstract: As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid response…

    Submitted 4 July, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  8. arXiv:2504.18128  [pdf, ps, other]

    cs.CL cs.LG

    Temporal Entailment Pretraining for Clinical Language Models over EHR Data

    Authors: Tatsunori Tanaka, Fi Zheng, Kai Sato, Zhifeng Li, Yuanyun Zhang, Shi Li

    Abstract: Clinical language models have achieved strong performance on downstream tasks by pretraining on domain specific corpora such as discharge summaries and medical notes. However, most approaches treat the electronic health record as a static document, neglecting the temporally-evolving and causally entwined nature of patient trajectories. In this paper, we introduce a novel temporal entailment pretra…

    Submitted 25 April, 2025; originally announced April 2025.

  9. arXiv:2504.13788  [pdf, other]

    cs.CV

    RefComp: A Reference-guided Unified Framework for Unpaired Point Cloud Completion

    Authors: Yixuan Yang, Jinyu Yang, Zixiang Zhao, Victor Sanchez, Feng Zheng

    Abstract: The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clo…

    Submitted 18 April, 2025; originally announced April 2025.

  10. arXiv:2504.13710  [pdf, ps, other]

    cs.CV

    Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

    Authors: Heng Liu, Guanghui Li, Mingqi Gao, Xiantong Zhen, Feng Zheng, Yang Wang

    Abstract: Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy, which extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: 23 pages, 10 figures

  11. arXiv:2504.12636  [pdf, ps, other]

    cs.RO

    A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

    Authors: Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, Yuxuan Kuang, Meng Cao, Feng Zheng, Xiaodan Liang

    Abstract: Robotic manipulation faces critical challenges in understanding spatial affordances--the "where" and "how" of object interactions--essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that foc…

    Submitted 25 June, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  12. arXiv:2503.12944  [pdf, other]

    cs.CV

    GIFT: Generated Indoor video frames for Texture-less point tracking

    Authors: Jianzheng Huang, Xianyu Mo, Ziling Liu, Jinyu Yang, Feng Zheng

    Abstract: Point tracking is becoming a powerful solver for motion estimation and video editing. Compared to classical feature matching, point tracking methods have the key advantage of robustly tracking points under complex camera motion trajectories and over extended periods. However, despite certain improvements in methodologies, current point tracking methods still struggle to track any position in video…

    Submitted 17 March, 2025; originally announced March 2025.

  13. arXiv:2503.00841  [pdf, other]

    cs.AI

    A Law Reasoning Benchmark for LLM with Tree-Organized Structures including Factum Probandum, Evidence and Experiences

    Authors: Jiaxin Shen, Jinan Xu, Huiqi Hu, Luyi Lin, Fei Zheng, Guoyang Ma, Fandong Meng, Jie Zhou, Wenjuan Han

    Abstract: While progress has been made in legal applications, law reasoning, crucial for fair adjudication, remains unexplored. We propose a transparent law reasoning schema enriched with hierarchical factum probandum, evidence, and implicit experience, enabling public scrutiny and preventing bias. Inspired by this schema, we introduce the challenging task, which takes a textual case description and outputs…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: 20 pages, 13 figures

  14. arXiv:2502.08549  [pdf, other]

    cs.CV cs.LG

    Copula-based mixture model identification for subgroup clustering with imaging applications

    Authors: Fei Zheng, Nicolas Duchateau

    Abstract: Model-based clustering techniques have been widely applied to various application areas, while most studies focus on canonical mixtures with unique component distribution form. However, this strict assumption is often hard to satisfy. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions composed by flexib…

    Submitted 12 February, 2025; originally announced February 2025.

  15. arXiv:2502.00421  [pdf, other]

    cs.CL cs.SD eess.AS

    Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language

    Authors: Turi Abu, Ying Shi, Thomas Fang Zheng, Dong Wang

    Abstract: We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and…

    Submitted 1 February, 2025; originally announced February 2025.

    Comments: Accepted for ICASSP2025 (2025 IEEE International Conference on Acoustics, Speech, and Signal Processing)

  16. arXiv:2501.09307  [pdf, other]

    cs.RO

    RoboReflect: A Robotic Reflective Reasoning Framework for Grasping Ambiguous-Condition Objects

    Authors: Zhen Luo, Yixuan Yang, Yanfu Zhang, Feng Zheng

    Abstract: As robotic technology rapidly develops, robots are being employed in an increasing number of fields. However, due to the complexity of deployment environments or the prevalence of ambiguous-condition objects, the practical application of robotics still faces many challenges, leading to frequent errors. Traditional methods and some LLM-based approaches, although improved, still require substantial…

    Submitted 10 March, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

  17. arXiv:2501.02020  [pdf, other]

    cs.CL cs.AI

    Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection

    Authors: Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, Feng Zheng, Liang He

    Abstract: Large Language Models (LLMs) are prone to hallucination with non-factual or unfaithful statements, which undermines the applications in real-world scenarios. Recent research focuses on uncertainty-based hallucination detection, which utilizes the output probability of LLMs for uncertainty calculation and does not rely on external knowledge or frequent sampling from LLMs. Whereas, most approaches m…

    Submitted 5 April, 2025; v1 submitted 2 January, 2025; originally announced January 2025.

  18. arXiv:2501.01042  [pdf, other]

    cs.CV cs.CR cs.LG

    Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

    Authors: Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng

    Abstract: Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models--a common and practical real world scenario--remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that exist…

    Submitted 10 January, 2025; v1 submitted 1 January, 2025; originally announced January 2025.

  19. SoftPatch+: Fully Unsupervised Anomaly Classification and Segmentation

    Authors: Chengjie Wang, Xi Jiang, Bin-Bin Gao, Zhenye Gan, Yong Liu, Feng Zheng, Lizhuang Ma

    Abstract: Although mainstream unsupervised anomaly detection (AD) (including image-level classification and pixel-level segmentation) algorithms perform well in academic datasets, their performance is limited in practical application due to the ideal experimental setting of clean training data. Training with noisy data is an inevitable problem in real-world anomaly detection but is seldom discussed. This pap…

    Submitted 12 January, 2025; v1 submitted 30 December, 2024; originally announced December 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2403.14233; paper has been accepted by Pattern Recognition

  20. arXiv:2412.18459  [pdf, other]

    cs.CV eess.IV

    Underwater Image Restoration via Polymorphic Large Kernel CNNs

    Authors: Xiaojiao Guo, Yihang Dong, Xuhang Chen, Weiwen Chen, Zimeng Li, FuChen Zheng, Chi-Man Pun

    Abstract: Underwater Image Restoration (UIR) remains a challenging task in computer vision due to the complex degradation of images in underwater environments. While recent approaches have leveraged various deep learning techniques, including Transformers and complex, parameter-heavy models to achieve significant improvements in restoration effects, we demonstrate that pure CNN architectures with lightweigh…

    Submitted 24 December, 2024; originally announced December 2024.

    Comments: Accepted by ICASSP2025

  21. arXiv:2412.06212  [pdf, other]

    cs.LG cs.AI

    A Self-guided Multimodal Approach to Enhancing Graph Representation Learning for Alzheimer's Diseases

    Authors: Zhepeng Wang, Runxue Bao, Yawen Wu, Guodong Liu, Lei Yang, Liang Zhan, Feng Zheng, Weiwen Jiang, Yanfu Zhang

    Abstract: Graph neural networks (GNNs) are powerful machine learning models designed to handle irregularly structured data. However, their generic design often proves inadequate for analyzing brain connectomes in Alzheimer's Disease (AD), highlighting the need to incorporate domain knowledge for optimal performance. Infusing AD-related knowledge into GNNs is a complicated task. Existing methods typically re…

    Submitted 9 December, 2024; originally announced December 2024.

  22. arXiv:2412.05789  [pdf, other]

    cs.RO

    InfiniteWorld: A Unified Scalable Simulation Framework for General Visual-Language Robot Interaction

    Authors: Pengzhen Ren, Min Li, Zhen Luo, Xinshuai Song, Ziwei Chen, Weijia Liufu, Yixuan Yang, Hao Zheng, Rongtao Xu, Zitong Huang, Tongsheng Ding, Luyang Xie, Kaidong Zhang, Changfei Fu, Yang Liu, Liang Lin, Feng Zheng, Xiaodan Liang

    Abstract: Realizing scaling laws in embodied AI has become a focus. However, previous work has been scattered across diverse simulation platforms, with assets and models lacking unified interfaces, which has led to inefficiencies in research. To address this, we introduce InfiniteWorld, a unified and scalable simulator for general vision-language robot interaction built on Nvidia Isaac Sim. InfiniteWorld en…

    Submitted 7 December, 2024; originally announced December 2024.

    Comments: 8 pages, 5 figures

  23. arXiv:2412.03200  [pdf, other]

    cs.CV

    Fab-ME: A Vision State-Space and Attention-Enhanced Framework for Fabric Defect Detection

    Authors: Shuai Wang, Huiyan Kong, Baotian Li, Fa Zheng

    Abstract: Effective defect detection is critical for ensuring the quality, functionality, and economic value of textile products. However, existing methods face challenges in achieving high accuracy, real-time performance, and efficient global information extraction. To address these issues, we propose Fab-ME, an advanced framework based on YOLOv8s, specifically designed for the accurate detection of 20 fab…

    Submitted 5 December, 2024; v1 submitted 4 December, 2024; originally announced December 2024.

    Comments: 6 pages, 5 figures

  24. arXiv:2412.02158  [pdf, other]

    cs.CV

    Agri-LLaVA: Knowledge-Infused Large Multimodal Assistant on Agricultural Pests and Diseases

    Authors: Liqiong Wang, Teng Jin, Jinyu Yang, Ales Leonardis, Fangyi Wang, Feng Zheng

    Abstract: In the general domain, large multimodal models (LMMs) have achieved significant advancements, yet challenges persist in applying them to specific fields, especially agriculture. As the backbone of the global economy, agriculture confronts numerous challenges, with pests and diseases being particularly concerning due to their complexity, variability, rapid spread, and high resistance. This paper sp…

    Submitted 4 December, 2024; v1 submitted 2 December, 2024; originally announced December 2024.

  25. arXiv:2411.19772  [pdf, other]

    cs.CV cs.CL cs.LG cs.MM

    LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

    Authors: Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng

    Abstract: Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles…

    Submitted 20 March, 2025; v1 submitted 29 November, 2024; originally announced November 2024.

    Comments: Accepted by CVPR2025

  26. arXiv:2411.06881  [pdf, other]

    cs.LG stat.ML

    WassFFed: Wasserstein Fair Federated Learning

    Authors: Zhongxuan Han, Li Zhang, Chaochao Chen, Xiaolin Zheng, Fei Zheng, Yuyuan Li, Jianwei Yin

    Abstract: Federated Learning (FL) employs a training approach to address scenarios where users' data cannot be shared across clients. Achieving fairness in FL is imperative since training data in FL is inherently geographically distributed among diverse user groups. Existing research on fairness predominantly assumes access to the entire training data, making direct transfer to FL challenging. However, the…

    Submitted 11 November, 2024; originally announced November 2024.

    Comments: Submitted to TKDE

  27. arXiv:2410.17598  [pdf, other]

    cs.CV

    PlantCamo: Plant Camouflage Detection

    Authors: Jinyu Yang, Qingwei Wang, Feng Zheng, Peng Chen, Aleš Leonardis, Deng-Ping Fan

    Abstract: Camouflaged Object Detection (COD) aims to detect objects with camouflaged properties. Although previous studies have focused on natural (animals and insects) and unnatural (artistic and synthetic) camouflage detection, plant camouflage has been neglected. However, plant camouflage plays a vital role in natural camouflage. Therefore, this paper introduces a new challenging problem of Plant Camoufl…

    Submitted 23 October, 2024; originally announced October 2024.

  28. arXiv:2410.13230  [pdf, ps, other]

    cs.IR

    Starbucks-v2: Improved Training for 2D Matryoshka Embeddings

    Authors: Shengyao Zhuang, Shuai Wang, Fabio Zheng, Bevan Koopman, Guido Zuccon

    Abstract: 2D Matryoshka training enables a single embedding model to generate sub-network representations across different layers and embedding dimensions, offering adaptability to diverse computational and task constraints. However, its effectiveness remains well below that of individually trained models of equivalent sizes. To address this, we propose Starbucks, a new training strategy for Matryoshka-styl…

    Submitted 30 May, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: Updated Version of Starbucks, add (1) Generalisation to E5 model (2) Out-of-domain zero-shot effectiveness (3) Propose Depth-wise Starbucks and Hybrid-Starbucks

  29. arXiv:2410.09761  [pdf, other]

    cs.AI cs.IR

    ChartKG: A Knowledge-Graph-Based Representation for Chart Images

    Authors: Zhiguang Zhou, Haoxuan Wang, Zhengqing Zhao, Fengling Zheng, Yongheng Wang, Wei Chen, Yong Wang

    Abstract: Chart images, such as bar charts, pie charts, and line charts, are explosively produced due to the wide usage of data visualizations. Accordingly, knowledge mining from chart images is becoming increasingly important, which can benefit downstream tasks like chart retrieval and knowledge graph completion. However, existing methods for chart knowledge mining mainly focus on converting chart images i…

    Submitted 13 October, 2024; originally announced October 2024.

  30. arXiv:2410.09453  [pdf, other]

    cs.AI cs.CV

    MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

    Authors: Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, Feng Zheng

    Abstract: In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have a high potential to renew the paradigms in practical applications due to their robust language capabilities and generalization abilities. However, despite their impressive problem-solving skills in many domains, MLLMs' ability in industrial anomaly detection has not been systematically studied. To bridge this gap,…

    Submitted 20 February, 2025; v1 submitted 12 October, 2024; originally announced October 2024.

    Comments: Accepted by ICLR 2025. The code and data are available at https://github.com/jam-cc/MMAD

  31. arXiv:2410.08174  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.MM

    Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

    Authors: Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng

    Abstract: Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, whi…

    Submitted 29 June, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: Accepted by ICLR 2025 Spotlights

  32. arXiv:2410.04671  [pdf, other]

    cs.CV

    CAR: Controllable Autoregressive Modeling for Visual Generation

    Authors: Ziyu Yao, Jialin Li, Yifeng Zhou, Yong Liu, Xi Jiang, Chengjie Wang, Feng Zheng, Yuexian Zou, Lei Li

    Abstract: Controllable generation, which enables fine-grained control over generated outputs, has emerged as a critical focus in visual generative models. Currently, there are two primary technical approaches in visual generation: diffusion models and autoregressive models. Diffusion models, as exemplified by ControlNet and T2I-Adapter, offer advanced control mechanisms, whereas autoregressive models, despi…

    Submitted 6 October, 2024; originally announced October 2024.

    Comments: Code available at: https://github.com/MiracleDance/CAR

  33. arXiv:2409.13853  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Unlocking Memorization in Large Language Models with Dynamic Soft Prompting

    Authors: Zhepeng Wang, Runxue Bao, Yawen Wu, Jackson Taylor, Cao Xiao, Feng Zheng, Weiwen Jiang, Shangqian Gao, Yanfu Zhang

    Abstract: Pretrained large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation. However, LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement. Accurate measurement of this memorization is essential to evaluate and mitigate…

    Submitted 20 September, 2024; originally announced September 2024.

  34. arXiv:2409.07793  [pdf, other]

    cs.CV cs.AI

    Lagrange Duality and Compound Multi-Attention Transformer for Semi-Supervised Medical Image Segmentation

    Authors: Fuchen Zheng, Quanjun Li, Weixuan Li, Xuhang Chen, Yihang Dong, Guoheng Huang, Chi-Man Pun, Shoujun Zhou

    Abstract: Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning-based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long-tail problem. Moreover, most previous hybrid CNN-ViT architectures have lim…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: 5 pages, 4 figures, 3 tables

  35. arXiv:2409.07779  [pdf, other]

    cs.CV cs.AI

    AFFSegNet: Adaptive Feature Fusion Segmentation Network for Microtumors and Multi-Organ Segmentation

    Authors: Fuchen Zheng, Xinyi Chen, Xuhang Chen, Haolun Li, Xiaojiao Guo, Weihuang Liu, Chi-Man Pun, Shoujun Zhou

    Abstract: Medical image segmentation, a crucial task in computer vision, facilitates the automated delineation of anatomical structures and pathologies, supporting clinicians in diagnosis, treatment planning, and disease monitoring. Notably, transformers employing shifted window-based self-attention have demonstrated exceptional performance. However, their reliance on local window attention limits the fusio…

    Submitted 10 December, 2024; v1 submitted 12 September, 2024; originally announced September 2024.

    Comments: 8 pages, 4 figures, 3 tables

  36. arXiv:2409.00346  [pdf, other]

    cs.CV

    SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation

    Authors: Fuchen Zheng, Xuhang Chen, Weihuang Liu, Haolun Li, Yingtie Lei, Jiahui He, Chi-Man Pun, Shounjun Zhou

    Abstract: In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture th…

    Submitted 26 March, 2025; v1 submitted 31 August, 2024; originally announced September 2024.

    Comments: Accepted by IEEE BIBM 2024

  37. arXiv:2408.15585  [pdf, other]

    cs.SD eess.AS

    Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models

    Authors: Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: In this paper, Whisper, a large-scale pre-trained model for automatic speech recognition, is proposed to apply to speaker verification. A partial multi-scale feature aggregation (PMFA) approach is proposed based on a subset of Whisper encoder blocks to derive highly discriminative speaker embeddings. Experimental results demonstrate that using the middle to later blocks of the Whisper encoder keeps…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Accepted by Interspeech 2024

  38. arXiv:2408.14089  [pdf, other]

    cs.IT eess.SP

    Mini-Slot-Assisted Short Packet URLLC: Differential or Coherent Detection?

    Authors: Canjian Zheng, Fu-Chun Zheng, Jingjing Luo, Pengcheng Zhu, Xiaohu You, Daquan Feng

    Abstract: One of the primary challenges in short packet ultra-reliable and low-latency communications (URLLC) is to achieve reliable channel estimation and data detection while minimizing the impact on latency performance. Given the small packet size in mini-slot-assisted URLLC, relying solely on pilot-based coherent detection is almost impossible to meet the seemingly contradictory requirements of high cha…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: 14 pages, 8 figures, journal

  39. A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification

    Authors: Xujiang Xing, Mingxing Xu, Thomas Fang Zheng

    Abstract: Automatic Speaker Verification (ASV) suffers from performance degradation in noisy conditions. To address this issue, we propose a novel adversarial learning framework that incorporates noise-disentanglement to establish a noise-independent speaker invariant embedding space. Specifically, the disentanglement module includes two encoders for separating speaker related and irrelevant information, re…

    Submitted 22 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: 5 pages, accepted by Interspeech2024

    Report number: 707-711

    Journal ref: Interspeech2024

  40. arXiv:2408.10899  [pdf, other]

    cs.RO

    All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents

    Authors: Zhiqiang Wang, Hao Zheng, Yunshuang Nie, Wenjun Xu, Qingwei Wang, Hua Ye, Zhe Li, Kaidong Zhang, Xuewen Cheng, Wanxi Dong, Chang Cai, Liang Lin, Feng Zheng, Xiaodan Liang

    Abstract: Embodied AI is transforming how AI systems interact with the physical world, yet existing datasets are inadequate for developing versatile, general-purpose agents. These limitations include a lack of standardized formats, insufficient data diversity, and inadequate data volume. To address these issues, we introduce ARIO (All Robots In One), a new data standard that enhances existing datasets by of…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Project website: https://imaei.github.io/project_pages/ario/

  41. arXiv:2408.03979  [pdf, ps, other]

    cs.SD eess.AS

    Speaker Adaptation for Quantised End-to-End ASR Models

    Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: End-to-end models have shown superior performance for automatic speech recognition (ASR). However, such models are often very large in size and thus challenging to deploy on resource-constrained edge devices. While quantisation can reduce model sizes, it can lead to increased word error rates (WERs). Although improved quantisation methods were proposed to address the issue of performance degradati…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: submitted to ASRU 2023 Workshop

  42. arXiv:2407.15277  [pdf, other]

    cs.LG math.ST stat.ML

    Conformal Predictions under Markovian Data

    Authors: Frédéric Zheng, Alexandre Proutiere

    Abstract: We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is the mixing time of the c…

    Submitted 21 July, 2024; originally announced July 2024.

  43. arXiv:2407.12842  [pdf, other]

    cs.CL cs.AI

    MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production

    Authors: Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng

    Abstract: Sign language understanding has made significant strides; however, there is still no viable solution for generating sign sequences directly from entire spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users. In particular, a sequence diffusion model, utilizing embeddi…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted to ACL 2024 Findings; Project Page: https://hechang25.github.io/MS2SL

  44. arXiv:2407.11422  [pdf, other]

    cs.CV

    Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

    Authors: Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Lu, Feng Zheng

    Abstract: Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. However, they remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. While various mitigation strategies have been proposed, they often neglect a key contributor to hallucinations: lack of fine-grained reasoning supervision during training.…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: To appear at ECCV2024

  45. arXiv:2407.10373  [pdf, other]

    cs.SD cs.AI cs.CV eess.AS

    Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

    Authors: Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng

    Abstract: Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and the task of dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, impeding the utilization of extensive unpaired da…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: ECCV 2024; Project page: https://hechang25.github.io/MVSD

  46. arXiv:2406.19706  [pdf, other]

    cs.SD eess.AS

    SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

    Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 5 pages, accepted by Interspeech 2024. arXiv admin note: substantial text overlap with arXiv:2309.09136

  47. arXiv:2406.18361  [pdf, other]

    cs.CV cs.AI eess.IV

    Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

    Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, Fudan Zheng

    Abstract: Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first laten…

    Submitted 9 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg

  48. arXiv:2406.17005  [pdf, other]

    cs.CV

    PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

    Authors: Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang, Mingqi Gao, Jingnan Luo, et al. (12 additional authors not shown)

    Abstract: The Pixel-level Video Understanding in the Wild (PVUW) Challenge focuses on complex video understanding. In this CVPR 2024 workshop, we add two new tracks: the Complex Video Object Segmentation track, based on the MOSE dataset, and the Motion Expression guided Video Segmentation track, based on the MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024

  49. arXiv:2406.07043  [pdf, other]

    cs.CV

    1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

    Authors: Mingqi Gao, Jingnan Luo, Jinyu Yang, Jungong Han, Feng Zheng

    Abstract: Motion Expression guided Video Segmentation (MeViS), as an emerging task, poses many new challenges to the field of referring video object segmentation (RVOS). In this technical report, we investigated and validated the effectiveness of static-dominant data and frame sampling on this challenging setting. Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeVi…

    Submitted 11 June, 2024; originally announced June 2024.

  50. arXiv:2406.03866  [pdf, other]

    cs.CV

    LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model

    Authors: Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James J. Q. Yu, Victor Sanchez, Feng Zheng

    Abstract: Designing 3D indoor layouts is a crucial task with significant applications in virtual reality, interior design, and automated space planning. Existing methods for 3D layout design either rely on diffusion models, which utilize spatial relationship priors, or heavily leverage the inferential capabilities of proprietary Large Language Models (LLMs), which require extensive prompt engineering and in…

    Submitted 6 June, 2024; originally announced June 2024.