Showing 1–50 of 203 results for author: Tang, D

  1. VibWalk: Mapping Lower-limb Haptic Experiences of Everyday Walking

    Authors: Shih Ying-Lei, Dongxu Tang, Weiming Hu, Sang Ho Yoon, Yitian Shao

    Abstract: Walking is among the most common human activities where the feet can gather rich tactile information from the ground. The dynamic contact between the feet and the ground generates vibration signals that can be sensed by the foot skin. While existing research focuses on foot pressure sensing and lower-limb interactions, methods of decoding tactile information from foot vibrations remain underexplor…

    Submitted 21 April, 2025; v1 submitted 12 April, 2025; originally announced April 2025.

    Comments: 17 pages, 12 figures

  2. arXiv:2504.07866  [pdf, ps, other]

    cs.CL cs.AI

    Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

    Authors: Yichun Yin, Wenyong Huang, Kaikai Song, Yehui Tang, Xueyu Wu, Wei Guo, Peng Guo, Yaoyuan Wang, Xiaojun Meng, Yasheng Wang, Dong Li, Can Chen, Dandan Tu, Yin Li, Fisher Yu, Ruiming Tang, Yunhe Wang, Baojun Wang, Bin Wang, Bo Wang, Boxiao Liu, Changzheng Zhang, Duyu Tang, Fei Mi, Hui Jin , et al. (27 additional authors not shown)

    Abstract: We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has been witnessing unprecedented advances in pushing the scale and capability of LLMs in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize…

    Submitted 11 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

    Comments: fix conflicts of latex packages

  3. arXiv:2504.04801  [pdf, other]

    cs.CV

    OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM

    Authors: Jinhong Wang, Shuo Tong, Jian Liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu

    Abstract: Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Sp…

    Submitted 7 April, 2025; originally announced April 2025.

  4. arXiv:2503.24354  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion

    Authors: Rana Muhammad Shahroz Khan, Dongwen Tang, Pingzhi Li, Kai Wang, Tianlong Chen

    Abstract: Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving (i.e., constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. Ho…

    Submitted 8 April, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

  5. arXiv:2503.22712  [pdf, other]

    cs.SD cs.LG eess.AS

    Coverage-Guaranteed Speech Emotion Recognition via Calibrated Uncertainty-Adaptive Prediction Sets

    Authors: Zijun Jia, Jinsong Yu, Hongyu Long, Diyin Tang

    Abstract: Road rage, often triggered by emotional suppression and sudden outbursts, significantly threatens road safety by causing collisions and aggressive behavior. Speech emotion recognition technologies can mitigate this risk by identifying negative emotions early and issuing timely alerts. However, current SER methods, such as those based on hidden Markov models and long short-term memory networks, pri…

    Submitted 7 May, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

  6. arXiv:2503.15798  [pdf, other]

    cs.LG cs.CL

    Mixture of Lookup Experts

    Authors: Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang

    Abstract: Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, signific…

    Submitted 19 March, 2025; originally announced March 2025.

  7. arXiv:2503.01275  [pdf, other]

    cs.CL

    Enhancing Non-English Capabilities of English-Centric Large Language Models through Deep Supervision Fine-Tuning

    Authors: Wenshuai Huo, Xiaocheng Feng, Yichong Huang, Chengpeng Fu, Baohang Li, Yangfan Ye, Zhirui Zhang, Dandan Tu, Duyu Tang, Yunfei Lu, Hui Wang, Bing Qin

    Abstract: Large language models (LLMs) have demonstrated significant progress in multilingual language understanding and generation. However, due to the imbalance in training data, their capabilities in non-English languages are limited. Recent studies revealed the English-pivot multilingual mechanism of LLMs, where LLMs implicitly convert non-English queries into English ones at the bottom layers and adopt…

    Submitted 5 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: Accepted at AAAI 2025

  8. arXiv:2503.00952  [pdf, other]

    cs.CV

    A Survey on Ordinal Regression: Applications, Advances and Prospects

    Authors: Jinhong Wang, Jintai Chen, Jian Liu, Dongqi Tang, Danny Z. Chen, Jian Wu

    Abstract: Ordinal regression refers to classifying object instances into ordinal categories. Ordinal regression is crucial for applications in various areas like facial age estimation, image aesthetics assessment, and even cancer staging, due to its capability to utilize ordered information effectively. More importantly, it also enhances model interpretation by considering category order, aiding the underst…

    Submitted 2 March, 2025; originally announced March 2025.

  9. arXiv:2502.15443  [pdf, other]

    cs.CL cs.AI

    When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

    Authors: Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

    Abstract: Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework to further compress LLMs after quantization, achieving a compression ratio of about 2.2x. A compression-aware quantization is first proposed to enhance model wei…

    Submitted 21 February, 2025; originally announced February 2025.

  10. arXiv:2502.14477  [pdf, other]

    cs.CL

    Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression

    Authors: Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang

    Abstract: Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficient…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 14 pages, 2 figures

  11. arXiv:2502.11458  [pdf, other]

    cs.LG cs.AI

    Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models

    Authors: Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, Weiming Zhang

    Abstract: The burgeoning computational demands for training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown potential, leveraging FP4 remains challenging due to inherent quantization errors and limited representation capability. Based on the Transformer architecture, we pres…

    Submitted 17 February, 2025; originally announced February 2025.

    Comments: 8 pages, 2 figures

    MSC Class: I.2

  12. arXiv:2501.18062  [pdf]

    cs.LG cs.CL

    FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models

    Authors: Spencer Mateega, Carlos Georgescu, Danny Tang

    Abstract: FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work. Despite recent advances, current LLMs fail to meet the strict accuracy requirements of financial institutions, with models failing approximately 60% of realistic tasks that mimic on-the-job analyses at hedge funds, private equity firms, investment bank…

    Submitted 29 January, 2025; originally announced January 2025.

    Comments: 10 pages, 7 figures

  13. arXiv:2501.11587  [pdf, other]

    cs.LG cs.AI

    Recurrent Diffusion for Large-Scale Parameter Generation

    Authors: Kai Wang, Dongwen Tang, Wangbo Zhao, Konstantin Schürholt, Zhangyang Wang, Yang You

    Abstract: Parameter generation has long struggled to match the scale of today's large vision and language models, curbing its broader utility. In this paper, we introduce Recurrent Diffusion for Large Scale Parameter Generation (RPG), a novel framework that generates full neural network parameters up to hundreds of millions on a single GPU. Our approach first partitions a network's parameters into non-overlapp…

    Submitted 10 February, 2025; v1 submitted 20 January, 2025; originally announced January 2025.

    Comments: Generating 200 million parameters in just minutes

  14. arXiv:2501.11323  [pdf]

    cs.LG eess.SP physics.app-ph stat.ML

    Physics-Informed Machine Learning for Efficient Reconfigurable Intelligent Surface Design

    Authors: Zhen Zhang, Jun Hui Qiu, Jun Wei Zhang, Hui Dong Li, Dong Tang, Qiang Cheng, Wei Lin

    Abstract: Reconfigurable intelligent surface (RIS) is a two-dimensional periodic structure integrated with a large number of reflective elements, which can manipulate electromagnetic waves in a digital way, offering great potential for wireless communication and radar detection applications. However, conventional RIS designs rely heavily on extensive full-wave EM simulations that are extremely time-consumin…

    Submitted 20 January, 2025; originally announced January 2025.

  15. arXiv:2412.18139  [pdf, other]

    cs.CL

    Ensuring Consistency for In-Image Translation

    Authors: Chengpeng Fu, Xiaocheng Feng, Yichong Huang, Wenshuai Huo, Baohang Li, Zhirui Zhang, Yunfei Lu, Dandan Tu, Duyu Tang, Hui Wang, Bing Qin, Ting Liu

    Abstract: The in-image machine translation task involves translating text embedded within images, with the translated results presented in image format. While this task has numerous applications in various scenarios such as film poster translation and everyday scene image translation, existing methods frequently neglect the aspect of consistency throughout this process. We propose the need to uphold two typ…

    Submitted 23 December, 2024; originally announced December 2024.

  16. arXiv:2412.17787  [pdf, other]

    cs.CV cs.CL

    Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective

    Authors: Xinmiao Yu, Xiaocheng Feng, Yun Li, Minghui Liao, Ya-Qi Yu, Xiachong Feng, Weihong Zhong, Ruihan Chen, Mengkang Hu, Jihao Wu, Dandan Tu, Duyu Tang, Bing Qin

    Abstract: Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the ins…

    Submitted 23 December, 2024; originally announced December 2024.

  17. arXiv:2412.12686  [pdf, other]

    cs.CL

    XTransplant: A Probe into the Upper Bound Performance of Multilingual Capability and Culture Adaptability in LLMs via Mutual Cross-lingual Feed-forward Transplantation

    Authors: Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Libo Qin, Yichong Huang, Lei Huang, Weitao Ma, Zhirui Zhang, Yunfei Lu, Xiaohui Yan, Duyu Tang, Dandan Tu, Bing Qin

    Abstract: Current large language models (LLMs) often exhibit imbalances in multilingual capabilities and cultural adaptability, largely due to their English-centric pretraining data. To address this imbalance, we propose a probing method named XTransplant that explores cross-lingual latent interactions via cross-lingual feed-forward transplantation during the inference stage, with the hope of enabling the model…

    Submitted 17 December, 2024; originally announced December 2024.

  18. arXiv:2411.15497  [pdf, other]

    cs.CV

    AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

    Authors: Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, Deyu Meng

    Abstract: Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some…

    Submitted 24 February, 2025; v1 submitted 23 November, 2024; originally announced November 2024.

  19. arXiv:2411.15221  [pdf, other]

    cs.LG cond-mat.mtrl-sci physics.chem-ph

    Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

    Authors: Yoel Zimmermann, Adib Bazgir, Zartashia Afzal, Fariha Agbere, Qianxiang Ai, Nawaf Alampara, Alexander Al-Feghali, Mehrad Ansari, Dmytro Antypov, Amro Aswad, Jiaru Bai, Viktoriia Baibakova, Devi Dutta Biswajeet, Erik Bitzek, Joshua D. Bocarsly, Anna Borisova, Andres M Bran, L. Catherine Brinson, Marcel Moran Calderon, Alessandro Canalicchio, Victor Chen, Yuan Chiang, Defne Circi, Benjamin Charmes, Vikrant Chaudhary , et al. (119 additional authors not shown)

    Abstract: Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) mo…

    Submitted 2 January, 2025; v1 submitted 20 November, 2024; originally announced November 2024.

    Comments: Updating author information, the submission remains largely unchanged. 98 pages total

  20. arXiv:2411.11361  [pdf, other]

    cs.CV

    Scalable Autoregressive Monocular Depth Estimation

    Authors: Jinhong Wang, Jian Liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Danny Chen, Jintai Chen, Jian Wu

    Abstract: This paper shows that the autoregressive model is an effective and scalable monocular depth estimator. Our idea is simple: We tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats depth maps of different resolutions as a set of tokens, and conducts the low-to-high resolution auto…

    Submitted 19 March, 2025; v1 submitted 18 November, 2024; originally announced November 2024.

    Comments: Accepted by CVPR 2025

  21. arXiv:2411.05881  [pdf, other]

    cs.RO

    MIPD: A Multi-sensory Interactive Perception Dataset for Embodied Intelligent Driving

    Authors: Zhiwei Li, Tingzhen Zhang, Meihua Zhou, Dandan Tang, Pengwei Zhang, Wenzhuo Liu, Qiaoning Yang, Tianyu Shen, Kunfeng Wang, Huaping Liu

    Abstract: During the process of driving, humans usually rely on multiple senses to gather information and make decisions. Analogously, in order to achieve embodied intelligence in autonomous driving, it is essential to integrate multidimensional sensory information in order to facilitate interaction with the environment. However, the current multi-modal fusion sensing schemes often neglect these additional…

    Submitted 8 November, 2024; originally announced November 2024.

    Comments: Data, development kit and more details will be available at https://github.com/BUCT-IUSRC/Dataset MIPD

  22. arXiv:2410.16942  [pdf, other]

    cs.CV

    DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

    Authors: Haowei Zhu, Dehua Tang, Ji Liu, Mingjie Lu, Jintu Zheng, Jinzhang Peng, Dong Li, Yu Wang, Fan Jiang, Lu Tian, Spandan Tiwari, Ashish Sirasao, Jun-Hai Yong, Bin Wang, Emad Barsoum

    Abstract: Diffusion models have achieved remarkable progress in the field of image generation due to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been employed to optimize these models, the retraining process necessitates large-scale training datasets and exte…

    Submitted 22 October, 2024; originally announced October 2024.

  23. arXiv:2410.15569  [pdf, other]

    cs.CV

    Online Pseudo-Label Unified Object Detection for Multiple Datasets Training

    Authors: XiaoJun Tang, Jingru Wang, Zeyu Shangguan, Darun Tang, Yuyu Liu

    Abstract: The Unified Object Detection (UOD) task aims to achieve object detection of all merged categories through training on multiple datasets, and is of great significance in comprehensive object detection scenarios. In this paper, we conduct a thorough analysis of the cross-dataset missing-annotation issue, and propose an Online Pseudo-Label Unified Object Detection scheme. Our method uses a periodic…

    Submitted 20 October, 2024; originally announced October 2024.

  24. arXiv:2409.18971  [pdf, other]

    cs.MM cs.AI cs.SD eess.AS

    Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

    Authors: Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, Tao Zhang

    Abstract: In this paper, we present our solutions for emotion recognition in the sub-challenges of Multimodal Emotion Recognition Challenge (MER2024). To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy based on a large language model, where joint training of audio and text is conducted initially. And the joint Audio-Text modal feature will be late-fused with ot…

    Submitted 12 September, 2024; originally announced September 2024.

  25. arXiv:2409.16098  [pdf]

    cs.LG cs.AI cs.CY cs.HC

    The Digital Transformation in Health: How AI Can Improve the Performance of Health Systems

    Authors: África Periáñez, Ana Fernández del Río, Ivan Nazarov, Enric Jané, Moiz Hassan, Aditya Rastogi, Dexian Tang

    Abstract: Mobile health has the potential to revolutionize health care delivery and patient engagement. In this work, we discuss how integrating Artificial Intelligence into digital health applications (focused on supply chain, patient management, and capacity building, among other use cases) can improve the health system and public health performance. We present an Artificial Intelligence and Reinforcement L…

    Submitted 21 November, 2024; v1 submitted 24 September, 2024; originally announced September 2024.

    Comments: This is an original manuscript of an article published by Taylor & Francis in Health Systems & Reform on 22 Oct 2024, available online: https://www.tandfonline.com/doi/10.1080/23288604.2024.2387138

    Journal ref: Health Systems & Reform, 10(2), 2024

  26. arXiv:2409.12470  [pdf, other]

    cs.CV eess.IV

    HSIGene: A Foundation Model For Hyperspectral Image Generation

    Authors: Li Pang, Xiangyong Cao, Datao Tang, Shuang Xu, Xueru Bai, Feng Zhou, Deyu Meng

    Abstract: Hyperspectral image (HSI) plays a vital role in various fields such as agriculture and environmental monitoring. However, due to the expensive acquisition cost, the number of hyperspectral images is limited, degenerating the performance of downstream tasks. Although some recent studies have attempted to employ diffusion models to synthesize HSIs, they still struggle with the scarcity of HSIs, affe…

    Submitted 1 November, 2024; v1 submitted 19 September, 2024; originally announced September 2024.

  27. arXiv:2409.06816  [pdf, other]

    cs.CR

    LLM-Enhanced Software Patch Localization

    Authors: Jinhong Yu, Yi Chen, Di Tang, Xiaozhong Liu, XiaoFeng Wang, Chen Wu, Haixu Tang

    Abstract: Open source software (OSS) is integral to modern product development, and any vulnerability within it potentially compromises numerous products. While developers strive to apply security patches, pinpointing these patches among extensive OSS updates remains a challenge. Security patch localization (SPL) recommendation methods are leading approaches to address this. However, existing SPL models oft…

    Submitted 12 September, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

  28. arXiv:2409.00920  [pdf, other]

    cs.LG cs.AI cs.CL

    ToolACE: Winning the Points of LLM Function Calling

    Authors: Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian , et al. (2 additional authors not shown)

    Abstract: Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic ag…

    Submitted 1 September, 2024; originally announced September 2024.

    Comments: 21 pages, 22 figures

  29. arXiv:2409.00661  [pdf]

    cs.AR

    Research on LLM Acceleration Using the High-Performance RISC-V Processor "Xiangshan" (Nanhu Version) Based on the Open-Source Matrix Instruction Set Extension (Vector Dot Product)

    Authors: Xu-Hao Chen, Si-Peng Hu, Hong-Chao Liu, Bo-Ran Liu, Dan Tang, Di Zhao

    Abstract: Considering the high-performance and low-power requirements of edge AI, this study designs a specialized instruction set processor for edge AI based on the RISC-V instruction set architecture, addressing practical issues in digital signal processing for edge devices. This design enhances the execution efficiency of edge AI and reduces its energy consumption with limited hardware overhead, meeting…

    Submitted 1 September, 2024; originally announced September 2024.

    Comments: 10 pages, in Chinese language, 6 figures

    MSC Class: C.1.3 [Other Architecture Styles]: RISC (Reduced Instruction Set Computing)

  30. arXiv:2408.12099  [pdf, other]

    cs.CV cs.CR

    Query-Efficient Video Adversarial Attack with Stylized Logo

    Authors: Duoxun Tang, Yuxin Cao, Xi Xiao, Derui Wang, Sheng Wen, Tianqing Zhu

    Abstract: Video classification systems based on Deep Neural Networks (DNNs) have demonstrated excellent performance in accurately verifying video content. However, recent studies have shown that DNNs are highly vulnerable to adversarial examples. Therefore, a deep understanding of adversarial attacks can better respond to emergency situations. In order to improve attack performance, many style-transfer-base…

    Submitted 21 August, 2024; originally announced August 2024.

  31. arXiv:2408.11788  [pdf, other]

    cs.AI cs.CL cs.CV cs.SE

    DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework

    Authors: Zhifei Xie, Daniel Tang, Dingwei Tan, Jacques Klein, Tegawendé F. Bissyandé, Saad Ezzini

    Abstract: Current video generation models excel at creating short, realistic clips, but struggle with longer, multi-scene videos. We introduce DreamFactory, an LLM-based framework that tackles this challenge. DreamFactory leverages multi-agent collaboration principles and a Key Frames Iteration Design Method to ensure consistency and style across long videos. It utilizes Chain of Thought (…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: 13 pages, 8 figures

    MSC Class: TsingHua University

  32. arXiv:2408.11286  [pdf, ps, other]

    cs.CV

    Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

    Authors: Mengying Ge, Dongkai Tang, Mingyang Li

    Abstract: Multimodal emotion recognition is a task of great concern. However, traditional data sets are based on fixed labels, resulting in models that often focus on main emotions and ignore detailed emotional changes in complex scenes. This report introduces a solution that uses MLLM technology to generate open-vocabulary emotion labels from a video. The solution includes the use of framework, data gene…

    Submitted 21 August, 2024; v1 submitted 20 August, 2024; originally announced August 2024.

  33. arXiv:2408.08024  [pdf, other]

    cs.LG cs.AI stat.ML

    Adaptive User Journeys in Pharma E-Commerce with Reinforcement Learning: Insights from SwipeRx

    Authors: Ana Fernández del Río, Michael Brennan Leong, Paulo Saraiva, Ivan Nazarov, Aditya Rastogi, Moiz Hassan, Dexian Tang, África Periáñez

    Abstract: This paper introduces a reinforcement learning (RL) platform that enhances end-to-end user journeys in healthcare digital tools through personalization. We explore a case study with SwipeRx, the most popular all-in-one app for pharmacists in Southeast Asia, demonstrating how the platform can be used to personalize and adapt user experiences. Our RL framework is tested through a series of experimen…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: Presented at the Third Workshop on End-to-End Customer Journey Optimization at KDD 2024 (KDD CJ Workshop '24), August 26, Barcelona, Spain

  34. arXiv:2408.07647  [pdf, other]

    cs.LG cs.AI cs.CY physics.data-an

    Adaptive Behavioral AI: Reinforcement Learning to Enhance Pharmacy Services

    Authors: Ana Fernández del Río, Michael Brennan Leong, Paulo Saraiva, Ivan Nazarov, Aditya Rastogi, Moiz Hassan, Dexian Tang, África Periáñez

    Abstract: Pharmacies are critical in healthcare systems, particularly in low- and middle-income countries. Procuring pharmacists with the right behavioral interventions or nudges can enhance their skills, public health awareness, and pharmacy inventory management, ensuring access to essential medicines that ultimately benefit their patients. We introduce a reinforcement learning operational system to delive…

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: Presented at The First Workshop on AI Behavioral Science (AIBS'24) at KDD 2024, August 25, Barcelona, Spain

  35. arXiv:2408.07629  [pdf, other]

    cs.LG cs.AI cs.CY

    Optimizing HIV Patient Engagement with Reinforcement Learning in Resource-Limited Settings

    Authors: África Periáñez, Kathrin Schmitz, Lazola Makhupula, Moiz Hassan, Moeti Moleko, Ana Fernández del Río, Ivan Nazarov, Aditya Rastogi, Dexian Tang

    Abstract: By providing evidence-based clinical decision support, digital tools and electronic health records can revolutionize patient management, especially in resource-poor settings where fewer health workers are available and often need more training. When these tools are integrated with AI, they can offer personalized support and adaptive interventions, effectively connecting community health workers (C…

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: Presented at the 7th epiDAMIK ACM SIGKDD International Workshop on Epidemiology meets Data Mining and Knowledge Discovery, August 26, 2024, Barcelona, Spain

  36. arXiv:2408.04568  [pdf, other]

    cs.CL cs.AI

    Learning Fine-Grained Grounded Citations for Attributed Large Language Models

    Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Yuxuan Gu, Weihong Zhong, Xiachong Feng, Weijiang Yu, Weihua Peng, Duyu Tang, Dandan Tu, Bing Qin

    Abstract: Despite the impressive performance on information-seeking tasks, large language models (LLMs) still struggle with hallucinations. Attributed LLMs, which augment generated text with in-line citations, have shown potential in mitigating hallucinations and improving verifiability. However, current approaches suffer from suboptimal citation quality due to their reliance on in-context learning. Further…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Accepted by ACL 2024 Findings

  37. arXiv:2408.01415  [pdf, other]

    cs.AI cs.LG

    Conditional LoRA Parameter Generation

    Authors: Xiaolong Jin, Kai Wang, Dongwen Tang, Wangbo Zhao, Yukun Zhou, Junshu Tang, Yang You

    Abstract: Generative models have achieved remarkable success in image, video, and text domains. Inspired by this, researchers have explored utilizing generative models to generate neural network parameters. However, these efforts have been limited by the parameter size and the practicality of generating high-performance parameters. In this paper, we propose COND P-DIFF, a novel approach that demonstrates th…

    Submitted 2 August, 2024; originally announced August 2024.

  38. arXiv:2407.12318  [pdf, ps, other]

    cs.GT cs.MA eess.SY math.OC math.ST

    Information Compression in Dynamic Games

    Authors: Dengwang Tang, Vijay Subramanian, Demosthenis Teneketzis

    Abstract: One of the reasons why stochastic dynamic games with an underlying dynamic system are challenging is that strategic players have access to an enormous amount of information, which leads to the use of extremely complex strategies at equilibrium. One approach to resolve this challenge is to simplify players' strategies by identifying appropriate compression of information maps so that the players can m…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: 54 pages, 3 figures

    MSC Class: 90C40; 91A10; 91A15; 91A25; 91A50

  39. arXiv:2407.02392  [pdf, other]

    cs.CV

    TokenPacker: Efficient Visual Projector for Multimodal LLM

    Authors: Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang

    Abstract: The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significa…

    Submitted 28 August, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: 16 pages. Code: https://github.com/CircleRadon/TokenPacker

  40. arXiv:2407.01894  [pdf, other]

    cs.CV cs.HC

    Adaptive Modality Balanced Online Knowledge Distillation for Brain-Eye-Computer based Dim Object Detection

    Authors: Zixing Li, Chao Yan, Zhen Lan, Xiaojia Xiang, Han Zhou, Jun Lai, Dengqing Tang

    Abstract: Advanced cognition can be extracted from the human brain using brain-computer interfaces. Integrating these interfaces with computer vision techniques, which possess efficient feature extraction capabilities, can achieve more robust and accurate detection of dim targets in aerial images. However, existing target detection methods primarily concentrate on homogeneous data, lacking efficient and ver…

    Submitted 8 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: 18 pages, 15 figures

  41. Retrieval-Augmented Conversational Recommendation with Prompt-based Semi-Structured Natural Language State Tracking

    Authors: Sara Kemper, Justin Cui, Kai Dicarlantonio, Kathy Lin, Danjie Tang, Anton Korikov, Scott Sanner

    Abstract: Conversational recommendation (ConvRec) systems must understand rich and diverse natural language (NL) expressions of user preferences and intents, often communicated in an indirect manner (e.g., "I'm watching my weight"). Such complex utterances make retrieving relevant items challenging, especially if only using often incomplete or out-of-date metadata. Fortunately, many domains feature rich ite…

    Submitted 25 May, 2024; originally announced June 2024.

  42. arXiv:2405.16887  [pdf]

    cs.AI cs.MA cs.RO

    A Large Language Model-based multi-agent manufacturing system for intelligent shopfloor

    Authors: Zhen Zhao, Dunbing Tang, Haihua Zhu, Zequn Zhang, Kai Chen, Changchun Liu, Yuchen Ji

    Abstract: As productivity advances, customer demand for multi-variety and small-batch production is increasing, thereby placing higher requirements on manufacturing systems. When production tasks change frequently due to this demand, traditional manufacturing systems often cannot respond promptly. The multi-agent manufacturing system is proposed to address this problem. However, because of…

    Submitted 27 May, 2024; originally announced May 2024.

  43. arXiv:2405.15090  [pdf, other]

    cs.LG stat.ML

    Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget

    Authors: Dengwang Tang, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

    Abstract: In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike the unconstrained best arm identification problem, the optimal solution for the CBMAI problem may be a randomiz…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 7 pages, 5 figures, 1 table

  44. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  45. arXiv:2404.09979  [pdf, other]

    cs.CV eess.IV

    One-Click Upgrade from 2D to 3D: Sandwiched RGB-D Video Compression for Stereoscopic Teleconferencing

    Authors: Yueyu Hu, Onur G. Guleryuz, Philip A. Chou, Danhang Tang, Jonathan Taylor, Rus Maxham, Yao Wang

    Abstract: Stereoscopic video conferencing is still challenging due to the need to compress stereo RGB-D video in real-time. Though hardware implementations of standard video codecs such as H.264 / AVC and HEVC are widely available, they are not designed for stereoscopic videos and suffer from reduced quality and performance. Specific multiview or 3D extensions of these codecs are complex and lack efficient…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024 Workshop (AIS: Vision, Graphics and AI for Streaming https://ai4streaming-workshop.github.io )

  46. arXiv:2404.08817  [pdf, other]

    cs.CL cs.PL cs.SE

    Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance

    Authors: Yewei Song, Cedric Lothritz, Daniel Tang, Tegawendé F. Bissyandé, Jacques Klein

    Abstract: This paper revisits recent code similarity evaluation metrics, particularly focusing on the application of Abstract Syntax Tree (AST) editing distance in diverse programming languages. In particular, we explore the usefulness of these metrics and compare them to traditional sequence similarity metrics. Our experiments showcase the effectiveness of AST editing distance in capturing intricate code s…

    Submitted 3 June, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

    Comments: ACL 2024 Main

  47. arXiv:2403.13667  [pdf, other]

    cs.CV cs.MM

    DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

    Authors: Zixuan Wang, Jia Jia, Shikun Sun, Haozhe Wu, Rong Han, Zhenyu Li, Di Tang, Jiaqing Zhou, Jiebo Luo

    Abstract: Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024

  48. arXiv:2403.12365  [pdf, other]

    cs.CV

    GaussianFlow: Splatting Gaussian Dynamics for 4D Content Creation

    Authors: Quankai Gao, Qiangeng Xu, Zhe Cao, Ben Mildenhall, Wenchao Ma, Le Chen, Danhang Tang, Ulrich Neumann

    Abstract: Creating 4D fields of Gaussian Splatting from images or videos is a challenging task due to its under-constrained nature. While the optimization can draw photometric reference from the input videos or be regulated by generative models, directly supervising Gaussian motions remains underexplored. In this paper, we introduce a novel concept, Gaussian flow, which connects the dynamics of 3D Gaussians…

    Submitted 13 May, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

  49. arXiv:2403.12204  [pdf, other]

    cs.GT cs.MA eess.SY math.OC

    Information Compression in Dynamic Information Disclosure Games

    Authors: Dengwang Tang, Vijay G. Subramanian

    Abstract: We consider a two-player dynamic information design problem between a principal and a receiver -- a game is played between the two agents on top of a Markovian system controlled by the receiver's actions, where the principal obtains and strategically shares some information about the underlying system with the receiver in order to influence their actions. In our setting, both players have long-ter…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: 14 pages, 5 figures

  50. arXiv:2403.11614  [pdf, other]

    cs.CV

    CRS-Diff: Controllable Remote Sensing Image Generation with Diffusion Model

    Authors: Datao Tang, Xiangyong Cao, Xingsong Hou, Zhongyuan Jiang, Junmin Liu, Deyu Meng

    Abstract: The emergence of generative models has revolutionized the field of remote sensing (RS) image generation. Despite generating high-quality images, existing methods are limited in relying mainly on text control conditions, and thus do not always generate images accurately and stably. In this paper, we propose CRS-Diff, a new RS generative framework specifically tailored for RS image generation, lever…

    Submitted 1 September, 2024; v1 submitted 18 March, 2024; originally announced March 2024.