-
Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection
Authors:
Weiwei Duan,
Luping Ji,
Shengjia Chen,
Sicheng Zhu,
Jianghong Huang,
Mao Ye
Abstract:
Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast.Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared f…
▽ More
Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast.Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies ($e.g.$, weakly supervised) are believed to be potential in reducing annotation requirements. To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme, only requires simple target quantity prompts during model training.Specifically, in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy accumulation.Besides, contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature subspace.Moreover, we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small targets.The extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods. Even, its performance could reach over 90\% of state-of-the-art (SOTA) fully-supervised ones.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning
Authors:
Zhaoye Fei,
Li Ji,
Siyin Wang,
Junhao Shi,
Jingjing Gong,
Xipeng Qiu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environm…
▽ More
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. We introduce Embodied Planner-R1, a novel outcome-driven reinforcement learning framework that enables LLMs to develop interactive capabilities through autonomous exploration with minimal supervision. Our framework incorporates three key innovations: (1) Without human annotations, we employ pure reinforcement learning with group rollout, incorporating in-environment interaction through parallel exploration; (2) completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Across two challenging text-based Embodied planning benchmarks, Embodied Planner-R1 achieves impressive completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods by a large margin, and suffers only a -3.66% drop in previously unseen environments, evidencing strong generalization.
△ Less
Submitted 29 June, 2025;
originally announced June 2025.
-
BeltCrack: the First Sequential-image Industrial Conveyor Belt Crack Detection Dataset and Its Baseline with Triple-domain Feature Learning
Authors:
Jianghong Huang,
Luping Ji,
Xin Ma,
Mao Ye
Abstract:
Conveyor belts are important equipment in modern industry, widely applied in production and manufacturing. Their health is much critical to operational efficiency and safety. Cracks are a major threat to belt health. Currently, considering safety, how to intelligently detect belt cracks is catching an increasing attention. To implement the intelligent detection with machine learning, real crack sa…
▽ More
Conveyor belts are important equipment in modern industry, widely applied in production and manufacturing. Their health is much critical to operational efficiency and safety. Cracks are a major threat to belt health. Currently, considering safety, how to intelligently detect belt cracks is catching an increasing attention. To implement the intelligent detection with machine learning, real crack samples are believed to be necessary. However, existing crack datasets primarily focus on pavement scenarios or synthetic data, no real-world industrial belt crack datasets at all. Cracks are a major threat to belt health. Furthermore, to validate usability and effectiveness, we propose a special baseline method with triple-domain ($i.e.$, time-space-frequency) feature hierarchical fusion learning for the two whole-new datasets. Experimental results demonstrate the availability and effectiveness of our dataset. Besides, they also show that our baseline is obviously superior to other similar detection methods. Our datasets and source codes are available at https://github.com/UESTC-nnLab/BeltCrack.
△ Less
Submitted 24 June, 2025; v1 submitted 21 June, 2025;
originally announced June 2025.
-
PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
Authors:
Yizhen Zhang,
Yang Ding,
Shuoshuo Zhang,
Xinchen Zhang,
Haoling Li,
Zhong-zhi Li,
Peijie Wang,
Jie Wu,
Lei Ji,
Yelong Shen,
Yujiu Yang,
Yeyun Gong
Abstract:
Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image c…
▽ More
Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts, yet still struggle to generalize to more complex and real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose a general reinforcement learning approach PeRL tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships to explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling to focus on trajectories that contribute most to learning optimal behaviors to exploit learned policies effectively. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that PeRL trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks, while preserving comparable performance on single-image tasks.
△ Less
Submitted 17 June, 2025;
originally announced June 2025.
-
TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression
Authors:
Zhong-Zhi Li,
Xiao Liang,
Zihao Tang,
Lei Ji,
Peijie Wang,
Haotian Xu,
Xing W,
Haizhen Huang,
Weiwei Deng,
Yeyun Gong,
Zhijiang Guo,
Xiao Liu,
Fei Yin,
Cheng-Lin Liu
Abstract:
Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially during inference with extremely long outputs--has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pip…
▽ More
Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially during inference with extremely long outputs--has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels. Our method significantly reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.
△ Less
Submitted 14 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
Convert Language Model into a Value-based Strategic Planner
Authors:
Xiaoyu Wang,
Yue Zhao,
Qingqing Gu,
Zhonglin Jiang,
Xiaokai Chen,
Yong Chen,
Luo Ji
Abstract:
Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage t…
▽ More
Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Q-learning on LLMs, and propose a framework called straQ*. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to response. Substantial experiments on ESC datasets suggest that straQ* outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.
△ Less
Submitted 17 June, 2025; v1 submitted 11 May, 2025;
originally announced May 2025.
-
Detecting the Root Cause Code Lines in Bug-Fixing Commits by Heterogeneous Graph Learning
Authors:
Liguo Ji,
Chenchen Li,
Shenglin Wang,
Furui Zhan
Abstract:
With the continuous growth in the scale and complexity of software systems, defect remediation has become increasingly difficult and costly. Automated defect prediction tools can proactively identify software changes prone to defects within software projects, thereby enhancing software development efficiency. However, existing work in heterogeneous and complex software projects continues to face c…
▽ More
With the continuous growth in the scale and complexity of software systems, defect remediation has become increasingly difficult and costly. Automated defect prediction tools can proactively identify software changes prone to defects within software projects, thereby enhancing software development efficiency. However, existing work in heterogeneous and complex software projects continues to face challenges, such as struggling with heterogeneous commit structures and ignoring cross-line dependencies in code changes, which ultimately reduce the accuracy of defect identification. To address these challenges, we propose an approach called RC_Detector. RC_Detector comprises three main components: the bug-fixing graph construction component, the code semantic aggregation component, and the cross-line semantic retention component. The bug-fixing graph construction component identifies the code syntax structures and program dependencies within bug-fixing commits and transforms them into heterogeneous graph formats by converting the source code into vector representations. The code semantic aggregation component adapts to heterogeneous data by using heterogeneous attention to learn the hidden semantic representation of target code lines. The cross-line semantic retention component regulates propagated semantic information by using attenuation and reinforcement gates derived from old and new code semantic representations, effectively preserving cross-line semantic relationships. Extensive experiments were conducted to evaluate the performance of our model by collecting data from 87 open-source projects, including 675 bug-fixing commits. The experimental results demonstrate that our model outperforms state-of-the-art approaches, achieving significant improvements of 83.15%,96.83%,78.71%,74.15%,54.14%,91.66%,91.66%, and 34.82% in MFR, respectively, compared with the state-of-the-art approaches.
△ Less
Submitted 13 May, 2025; v1 submitted 2 May, 2025;
originally announced May 2025.
-
CPR: Leveraging LLMs for Topic and Phrase Suggestion to Facilitate Comprehensive Product Reviews
Authors:
Ekta Gujral,
Apurva Sinha,
Lishi Ji,
Bijayani Sanghamitra Mishra
Abstract:
Consumers often heavily rely on online product reviews, analyzing both quantitative ratings and textual descriptions to assess product quality. However, existing research hasn't adequately addressed how to systematically encourage the creation of comprehensive reviews that capture both customers sentiment and detailed product feature analysis. This paper presents CPR, a novel methodology that leve…
▽ More
Consumers often heavily rely on online product reviews, analyzing both quantitative ratings and textual descriptions to assess product quality. However, existing research hasn't adequately addressed how to systematically encourage the creation of comprehensive reviews that capture both customers sentiment and detailed product feature analysis. This paper presents CPR, a novel methodology that leverages the power of Large Language Models (LLMs) and Topic Modeling to guide users in crafting insightful and well-rounded reviews. Our approach employs a three-stage process: first, we present users with product-specific terms for rating; second, we generate targeted phrase suggestions based on these ratings; and third, we integrate user-written text through topic modeling, ensuring all key aspects are addressed. We evaluate CPR using text-to-text LLMs, comparing its performance against real-world customer reviews from Walmart. Our results demonstrate that CPR effectively identifies relevant product terms, even for new products lacking prior reviews, and provides sentiment-aligned phrase suggestions, saving users time and enhancing reviews quality. Quantitative analysis reveals a 12.3% improvement in BLEU score over baseline methods, further supported by manual evaluation of generated phrases. We conclude by discussing potential extensions and future research directions.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
FiSMiness: A Finite State Machine Based Paradigm for Emotional Support Conversations
Authors:
Yue Zhao,
Qingqing Gu,
Xiaoyu Wang,
Teng Chen,
Zhonglin Jiang,
Yong Chen,
Luo Ji
Abstract:
Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage t…
▽ More
Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called FiSMiness. Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker's emotion, support strategy and the final response upon each conversational turn. Substantial experiments on ESC datasets suggest that FiSMiness outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and external-assisted methods, even those with many more parameters.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Structured and sparse partial least squares coherence for multivariate cortico-muscular analysis
Authors:
Jingyao Sun,
Qilu Zhang,
Di Ma,
Tianyu Jia,
Shijie Jia,
Xiaoxue Zhai,
Ruimou Xie,
Ping-Ju Lin,
Zhibin Li,
Yu Pan,
Linhong Ji,
Chong Li
Abstract:
Multivariate cortico-muscular analysis has recently emerged as a promising approach for evaluating the corticospinal neural pathway. However, current multivariate approaches encounter challenges such as high dimensionality and limited sample sizes, thus restricting their further applications. In this paper, we propose a structured and sparse partial least squares coherence algorithm (ssPLSC) to ex…
▽ More
Multivariate cortico-muscular analysis has recently emerged as a promising approach for evaluating the corticospinal neural pathway. However, current multivariate approaches encounter challenges such as high dimensionality and limited sample sizes, thus restricting their further applications. In this paper, we propose a structured and sparse partial least squares coherence algorithm (ssPLSC) to extract shared latent space representations related to cortico-muscular interactions. Our approach leverages an embedded optimization framework by integrating a partial least squares (PLS)-based objective function, a sparsity constraint and a connectivity-based structured constraint, addressing the generalizability, interpretability and spatial structure. To solve the optimization problem, we develop an efficient alternating iterative algorithm within a unified framework and prove its convergence experimentally. Extensive experimental results from one synthetic and several real-world datasets have demonstrated that ssPLSC can achieve competitive or better performance over some representative multivariate cortico-muscular fusion methods, particularly in scenarios characterized by limited sample sizes and high noise levels. This study provides a novel multivariate fusion method for cortico-muscular analysis, offering a transformative tool for the evaluation of corticospinal pathway integrity in neurological disorders.
△ Less
Submitted 14 June, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling
Authors:
Haebin Shin,
Lei Ji,
Xiao Liu,
Yeyun Gong
Abstract:
Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic…
▽ More
Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the loss of teacher model to guide effective student training. We demonstrate its effectiveness in language modeling with 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
PoseTraj: Pose-Aware Trajectory Control in Video Diffusion
Authors:
Longbin Ji,
Lei Zhong,
Pengfei Wei,
Changjian Li
Abstract:
Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. O…
▽ More
Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
Camera Measurement of Blood Oxygen Saturation
Authors:
Jiankai Tang,
Xin Liu,
Daniel McDuff,
Zhang Jiang,
Hongming Hu,
Luxi Zhou,
Nodoka Nagao,
Haruta Suzuki,
Yuki Nagahama,
Wei Li,
Linhong Ji,
Yuanchun Shi,
Izumi Nishidate,
Yuntao Wang
Abstract:
Blood oxygen saturation (SpO2) is a crucial vital sign routinely monitored in medical settings. Traditional methods require dedicated contact sensors, limiting accessibility and comfort. This study presents a deep learning framework for contactless SpO2 measurement using an off-the-shelf camera, addressing challenges related to lighting variations and skin tone diversity. We conducted two large-sc…
▽ More
Blood oxygen saturation (SpO2) is a crucial vital sign routinely monitored in medical settings. Traditional methods require dedicated contact sensors, limiting accessibility and comfort. This study presents a deep learning framework for contactless SpO2 measurement using an off-the-shelf camera, addressing challenges related to lighting variations and skin tone diversity. We conducted two large-scale studies with diverse participants and evaluated our method against traditional signal processing approaches in intra- and inter-dataset scenarios. Our approach demonstrated consistent accuracy across demographic groups, highlighting the feasibility of camera-based SpO2 monitoring as a scalable and non-invasive tool for remote health assessment.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Progressive Collaborative and Semantic Knowledge Fusion for Generative Recommendation
Authors:
Longtao Xiao,
Haozhao Wang,
Cheng Wang,
Linfei Ji,
Yifan Wang,
Jieming Zhu,
Zhenhua Dong,
Rui Zhang,
Ruixuan Li
Abstract:
With the recent surge in interest surrounding generative paradigms, generative recommendation has increasingly attracted the attention of researchers in the recommendation community. This paradigm generally consists of two stages. In the first stage, pretrained semantic embeddings or collaborative ID embeddings are quantized to create item codes, aiming to capture and preserve rich semantic or col…
▽ More
With the recent surge in interest surrounding generative paradigms, generative recommendation has increasingly attracted the attention of researchers in the recommendation community. This paradigm generally consists of two stages. In the first stage, pretrained semantic embeddings or collaborative ID embeddings are quantized to create item codes, aiming to capture and preserve rich semantic or collaborative knowledge within these codes. The second stage involves utilizing these discrete codes to perform an autoregressive sequence generation task. Existing methods often either overlook collaborative or semantic knowledge, or combine the two roughly. In this paper, we observe that naively concatenating representations from semantic and collaborative modality leads to a semantic domination issue, where the resulting representation is overly influenced by semantic information, effectively overshadowing the collaborative representation. Consequently, downstream recommendation tasks fail to fully exploit the knowledge from both modalities, resulting in suboptimal performance. To address this, we propose a progressive collaborative and semantic knowledge fusion model for generative recommendation, named PRORec, which integrates semantic and collaborative knowledge with a unified code through a two-stage framework. Specifically, in the first stage, we propose a cross-modality knowledge alignment task, which integrates semantic knowledge into collaborative embeddings, enhancing their representational capability. In the second stage, we propose an in-modality knowledge distillation task, designed to effectively capture and integrate knowledge from both semantic and collaborative modalities. Extensive experiments on three widely used benchmarks validate the effectiveness of our approach, demonstrating its superiority compared to existing methods.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Bayesian Optimization by Kernel Regression and Density-based Exploration
Authors:
Tansheng Zhu,
Hongyu Zhou,
Ke Jin,
Xusheng Xu,
Qiufan Yuan,
Lijie Ji
Abstract:
Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the high computational complexity of Gaussian processes, which results in a total time complexity that is quartic with respect to the number of iterations. To address this limitation, we propose the Bayesian Optimization by Kernel regression a…
▽ More
Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the high computational complexity of Gaussian processes, which results in a total time complexity that is quartic with respect to the number of iterations. To address this limitation, we propose the Bayesian Optimization by Kernel regression and density-based Exploration (BOKE) algorithm. BOKE uses kernel regression for efficient function approximation, kernel density for exploration, and integrates them into the confidence bound criteria to guide the optimization process, thus reducing computational costs to quadratic. Our theoretical analysis rigorously establishes the global convergence of BOKE and ensures its robustness in noisy settings. Through extensive numerical experiments on both synthetic and real-world optimization tasks, we demonstrate that BOKE not only performs competitively compared to Gaussian process-based methods but also exhibits superior computational efficiency. These results highlight BOKE's effectiveness in resource-constrained environments, providing a practical approach for optimization problems in engineering applications.
△ Less
Submitted 15 March, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement
Authors:
Zhefan Rao,
Liya Ji,
Yazhou Xing,
Runtao Liu,
Zhaoyang Liu,
Jiaxin Xie,
Ziqiao Peng,
Yingqing He,
Qifeng Chen
Abstract:
Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities b…
▽ More
Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.
△ Less
Submitted 25 December, 2024;
originally announced December 2024.
-
Large Language Model Can Be a Foundation for Hidden Rationale-Based Retrieval
Authors:
Luo Ji,
Feixiang Guo,
Teng Chen,
Qingqing Gu,
Xiaoyu Wang,
Ningyuan Xi,
Yihong Wang,
Peng Yu,
Yue Zhao,
Hongyang Lei,
Zhonglin Jiang,
Yong Chen
Abstract:
Despite the recent advancement in Retrieval-Augmented Generation (RAG) systems, most retrieval methodologies are often developed for factual retrieval, which assumes query and positive documents are semantically similar. In this paper, we instead propose and study a more challenging type of retrieval task, called hidden rationale retrieval, in which query and document are not similar but can be in…
▽ More
Despite the recent advancement in Retrieval-Augmented Generation (RAG) systems, most retrieval methodologies are often developed for factual retrieval, which assumes query and positive documents are semantically similar. In this paper, we instead propose and study a more challenging type of retrieval task, called hidden rationale retrieval, in which query and document are not similar but can be inferred by reasoning chains, logic relationships, or empirical experiences. To address such problems, an instruction-tuned Large language model (LLM) with a cross-encoder architecture could be a reasonable choice. To further strengthen pioneering LLM-based retrievers, we design a special instruction that transforms the retrieval task into a generative task by prompting LLM to answer a binary-choice question. The model can be fine-tuned with direct preference optimization (DPO). The framework is also optimized for computational efficiency with no performance degradation. We name this retrieval framework by RaHoRe and verify its zero-shot and fine-tuned performance superiority on Emotional Support Conversation (ESC), compared with previous retrieval works. Our study suggests the potential to employ LLM as a foundation for a wider scope of retrieval tasks. Our codes, models, and datasets are available on https://github.com/flyfree5/LaHoRe.
△ Less
Submitted 9 April, 2025; v1 submitted 21 December, 2024;
originally announced December 2024.
-
Mining In-distribution Attributes in Outliers for Out-of-distribution Detection
Authors:
Yutian Lei,
Luping Ji,
Pei Liu
Abstract:
Out-of-distribution (OOD) detection is indispensable for deploying reliable machine learning systems in real-world scenarios. Recent works, using auxiliary outliers in training, have shown good potential. However, they seldom concern the intrinsic correlations between in-distribution (ID) and OOD data. In this work, we discover an obvious correlation that OOD data usually possesses significant ID…
▽ More
Out-of-distribution (OOD) detection is indispensable for deploying reliable machine learning systems in real-world scenarios. Recent works, using auxiliary outliers in training, have shown good potential. However, they seldom concern the intrinsic correlations between in-distribution (ID) and OOD data. In this work, we discover an obvious correlation that OOD data usually possesses significant ID attributes. These attributes should be factored into the training process, rather than blindly suppressed as in previous approaches. Based on this insight, we propose a structured multi-view-based out-of-distribution detection learning (MVOL) framework, which facilitates rational handling of the intrinsic in-distribution attributes in outliers. We provide theoretical insights on the effectiveness of MVOL for OOD detection. Extensive experiments demonstrate the superiority of our framework to others. MVOL effectively utilizes both auxiliary OOD datasets and even wild datasets with noisy in-distribution data. Code is available at https://github.com/UESTC-nnLab/MVOL.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation
Authors:
Xiaoyu Wang,
Ningyuan Xi,
Teng Chen,
Qingqing Gu,
Yue Zhao,
Xiaokai Chen,
Zhonglin Jiang,
Yong Chen,
Luo Ji
Abstract:
Large Language Models (LLM) are usually fine-tuned to participate in dyadic or two-party dialogues, which can not adapt well to multi-party dialogues (MPD), which hinders their applications in such scenarios including multi-personal meetings, discussions and daily communication. Previous LLM-based researches mainly focus on the multi-agent framework, while their base LLMs are still pairwisely fine…
▽ More
Large Language Models (LLM) are usually fine-tuned to participate in dyadic or two-party dialogues, which can not adapt well to multi-party dialogues (MPD), which hinders their applications in such scenarios including multi-personal meetings, discussions and daily communication. Previous LLM-based researches mainly focus on the multi-agent framework, while their base LLMs are still pairwisely fine-tuned. In this work, we design a multi-party fine-tuning framework (MuPaS) for LLMs on the multi-party dialogue datasets, and prove such a straightforward framework can let the LLM align with the multi-party conversation style efficiently and effectively. We also design two training strategies which can convert MuPaS into the MPD simulator. Substantial experiments show that MuPaS can achieve state-of-the-art multi-party response, higher accuracy of the-next-speaker prediction, higher human and automatic evaluated utterance qualities, and can even generate reasonably with out-of-distribution scene, topic and role descriptions. The MuPaS framework bridges the LLM training with more complicated multi-party applications, such as conversation generation, virtual rehearsal or meta-universe.
△ Less
Submitted 11 June, 2025; v1 submitted 6 December, 2024;
originally announced December 2024.
-
Generative Prompt Internalization
Authors:
Haebin Shin,
Lei Ji,
Yeyun Gong,
Sungdong Kim,
Eunbi Choi,
Minjoon Seo
Abstract:
Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Prompt Internalization (GenPI), a lightweight method that employs a joint training approach. GenPI not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along…
▽ More
Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Prompt Internalization (GenPI), a lightweight method that employs a joint training approach. GenPI not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Prompt Internalization enables high performance and efficient inference without the need for explicit prompts.
△ Less
Submitted 24 March, 2025; v1 submitted 24 November, 2024;
originally announced November 2024.
-
Explore the Reasoning Capability of LLMs in the Chess Testbed
Authors:
Shu Wang,
Lei Ji,
Renxi Wang,
Wenxiao Zhao,
Haokun Liu,
Yifan Hou,
Ying Nian Wu
Abstract:
Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term…
▽ More
Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term strategic play with short-term tactical play along with language explanation, we propose improving the reasoning capability of large language models in chess by integrating annotated strategy and tactic. Specifically, we collect a dataset named MATE, which consists of 1 million chess positions with candidate moves annotated by chess experts for strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models in the task of selecting better chess moves. Our experiments show that our models perform better than GPT, Claude, and Gemini models. We find that language explanations can enhance the reasoning capability of large language models.
△ Less
Submitted 28 February, 2025; v1 submitted 10 November, 2024;
originally announced November 2024.
-
LGU-SLAM: Learnable Gaussian Uncertainty Matching with Deformable Correlation Sampling for Deep Visual SLAM
Authors:
Yucheng Huang,
Luping Ji,
Hudong Liu,
Mao Ye
Abstract:
Deep visual Simultaneous Localization and Mapping (SLAM) techniques, e.g., DROID, have made significant advancements by leveraging deep visual odometry on dense flow fields. In general, they heavily rely on global visual similarity matching. However, the ambiguous similarity interference in uncertain regions could often lead to excessive noise in correspondences, ultimately misleading SLAM in geom…
▽ More
Deep visual Simultaneous Localization and Mapping (SLAM) techniques, e.g., DROID, have made significant advancements by leveraging deep visual odometry on dense flow fields. In general, they heavily rely on global visual similarity matching. However, the ambiguous similarity interference in uncertain regions could often lead to excessive noise in correspondences, ultimately misleading SLAM in geometric modeling. To address this issue, we propose a Learnable Gaussian Uncertainty (LGU) matching. It mainly focuses on precise correspondence construction. In our scheme, a learnable 2D Gaussian uncertainty model is designed to associate matching-frame pairs. It could generate input-dependent Gaussian distributions for each correspondence map. Additionally, a multi-scale deformable correlation sampling strategy is devised to adaptively fine-tune the sampling of each direction by a priori look-up ranges, enabling reliable correlation construction. Furthermore, a KAN-bias GRU component is adopted to improve a temporal iterative enhancement for accomplishing sophisticated spatio-temporal modeling with limited parameters. The extensive experiments on real-world and synthetic datasets are conducted to validate the effectiveness and superiority of our method.
△ Less
Submitted 30 October, 2024;
originally announced October 2024.
-
Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification
Authors:
Jiaxiang Gou,
Luping Ji,
Pei Liu,
Mao Ye
Abstract:
Whole Slide Image (WSI) classification has very significant applications in clinical pathology, e.g., tumor identification and cancer diagnosis. Currently, most research attention is focused on Multiple Instance Learning (MIL) using static datasets. One of the most obvious weaknesses of these methods is that they cannot efficiently preserve and utilize previously learned knowledge. With any new da…
▽ More
Whole Slide Image (WSI) classification has very significant applications in clinical pathology, e.g., tumor identification and cancer diagnosis. Currently, most research attention is focused on Multiple Instance Learning (MIL) using static datasets. One of the most obvious weaknesses of these methods is that they cannot efficiently preserve and utilize previously learned knowledge. With any new data arriving, classification models are required to be re-trained on both previous and current new data. To overcome this shortcoming and break through traditional vision modality, this paper proposes the first Vision-Language-based framework with Queryable Prototype Multiple Instance Learning (QPMIL-VL) specially designed for incremental WSI classification. This framework mainly consists of two information processing branches: one is for generating bag-level features by prototype-guided aggregation of instance features, while the other is for enhancing class features through a combination of class ensemble, tunable vector and class similarity loss. The experiments on four public WSI datasets demonstrate that our QPMIL-VL framework is effective for incremental WSI classification and often significantly outperforms other compared methods, achieving state-of-the-art (SOTA) performance. Our source code is publicly available at https://github.com/can-can-ya/QPMIL-VL.
△ Less
Submitted 25 December, 2024; v1 submitted 14 October, 2024;
originally announced October 2024.
-
ToolGen: Unified Tool Retrieval and Calling via Generation
Authors:
Renxi Wang,
Xudong Han,
Lei Ji,
Shu Wang,
Timothy Baldwin,
Haonan Li
Abstract:
As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is constrained by context length and requires separate, often inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that integrates tool knowled…
▽ More
As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is constrained by context length and requires separate, often inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that integrates tool knowledge directly into the LLM's parameters by representing each tool as a unique token. This enables the LLM to generate tool calls and arguments as part of its next token prediction capabilities, seamlessly blending tool invocation with language generation. Our framework allows the LLM to access and utilize a vast amount of tools with no additional retrieval step, significantly enhancing both performance and scalability. Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains. By fundamentally transforming tool retrieval into a generative process, ToolGen paves the way for more versatile, efficient, and autonomous AI systems. ToolGen enables end-to-end tool learning and opens opportunities for integration with other advanced techniques such as chain-of-thought and reinforcement learning, thereby expanding the practical capabilities of LLMs.
△ Less
Submitted 29 March, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning
Authors:
Ningyuan Xi,
Xiaoyu Wang,
Yetao Wu,
Teng Chen,
Qingqing Gu,
Yue Zhao,
Jinxian Qu,
Zhonglin Jiang,
Yong Chen,
Luo Ji
Abstract:
Large Language Model can reasonably understand and generate human expressions but may lack of thorough thinking and reasoning mechanisms. Recently there have been several studies which enhance the thinking ability of language models but most of them are not data-driven or training-based. In this paper, we are motivated by the cognitive mechanism in the natural world, and design a novel model archi…
▽ More
Large Language Model can reasonably understand and generate human expressions but may lack of thorough thinking and reasoning mechanisms. Recently there have been several studies which enhance the thinking ability of language models but most of them are not data-driven or training-based. In this paper, we are motivated by the cognitive mechanism in the natural world, and design a novel model architecture called TaS which allows it to first consider the thoughts and then express the response based upon the query. We design several pipelines to annotate or generate the thought contents from prompt-response samples, then add language heads in a middle layer which behaves as the thinking layer. We train the language model by the thoughts-augmented data and successfully let the thinking layer automatically generate reasonable thoughts and finally output more reasonable responses. Both qualitative examples and quantitative results validate the effectiveness and performance of TaS. Our code is available at https://anonymous.4open.science/r/TadE.
△ Less
Submitted 25 April, 2025; v1 submitted 18 September, 2024;
originally announced September 2024.
-
Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology
Authors:
Pei Liu,
Luping Ji,
Jiaxiang Gou,
Bo Fu,
Mao Ye
Abstract:
Histopathology Whole-Slide Images (WSIs) provide an important tool to assess cancer prognosis in computational pathology (CPATH). While existing survival analysis (SA) approaches have made exciting progress, they are generally limited to adopting highly-expressive network architectures and only coarse-grained patient-level labels to learn visual prognostic representations from gigapixel WSIs. Such…
▽ More
Histopathology Whole-Slide Images (WSIs) provide an important tool to assess cancer prognosis in computational pathology (CPATH). While existing survival analysis (SA) approaches have made exciting progress, they are generally limited to adopting highly-expressive network architectures and only coarse-grained patient-level labels to learn visual prognostic representations from gigapixel WSIs. Such learning paradigm suffers from critical performance bottlenecks, when facing present scarce training data and standard multi-instance learning (MIL) framework in CPATH. To overcome it, this paper, for the first time, proposes a new Vision-Language-based SA (VLSA) paradigm. Concretely, (1) VLSA is driven by pathology VL foundation models. It no longer relies on high-capability networks and shows the advantage of data efficiency. (2) In vision-end, VLSA encodes textual prognostic prior and then employs it as auxiliary signals to guide the aggregating of visual prognostic features at instance level, thereby compensating for the weak supervision in MIL. Moreover, given the characteristics of SA, we propose i) ordinal survival prompt learning to transform continuous survival labels into textual prompts; and ii) ordinal incidence function as prediction target to make SA compatible with VL-based prediction. Notably, VLSA's predictions can be interpreted intuitively by our Shapley values-based method. The extensive experiments on five datasets confirm the effectiveness of our scheme. Our VLSA could pave a new way for SA in CPATH by offering weakly-supervised MIL an effective means to learn valuable prognostic clues from gigapixel WSIs. Our source code is available at https://github.com/liupei101/VLSA.
△ Less
Submitted 11 February, 2025; v1 submitted 14 September, 2024;
originally announced September 2024.
-
Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation
Authors:
Luo Ji,
Gao Liu,
Mingyang Yin,
Hongxia Yang,
Jingren Zhou
Abstract:
Modern listwise recommendation systems need to consider both long-term user perceptions and short-term interest shifts. Reinforcement learning can be applied on recommendation to study such a problem but is also subject to large search space, sparse user feedback and long interactive latency. Motivated by recent progress in hierarchical reinforcement learning, we propose a novel framework called m…
▽ More
Modern listwise recommendation systems need to consider both long-term user perceptions and short-term interest shifts. Reinforcement learning can be applied on recommendation to study such a problem but is also subject to large search space, sparse user feedback and long interactive latency. Motivated by recent progress in hierarchical reinforcement learning, we propose a novel framework called mccHRL to provide different levels of temporal abstraction on listwise recommendation. Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy by modeling the process as a sequential decision-making problem. We argue that such framework has a well-defined decomposition of the outra-session context and the intra-session context, which are encoded by the high-level and low-level agents, respectively. To verify this argument, we implement both a simulator-based environment and an industrial dataset-based experiment. Results observe significant performance improvement by our method, compared with several well-known baselines. Data and codes have been made public.
△ Less
Submitted 11 September, 2024;
originally announced September 2024.
-
Online Decision MetaMorphFormer: A Casual Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence
Authors:
Luo Ji,
Runji Lin
Abstract:
Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge is adaptive to multiple tasks and universal environments. Despite there being increasing efforts in the field of Reinforcement Learning (RL) with the aid of transformers, most of them might be limited by the offline training pipeline, which prohibits exploration and generali…
▽ More
Interactive artificial intelligence in the motion control field is an interesting topic, especially when universal knowledge is adaptive to multiple tasks and universal environments. Despite there being increasing efforts in the field of Reinforcement Learning (RL) with the aid of transformers, most of them might be limited by the offline training pipeline, which prohibits exploration and generalization abilities. To address this limitation, we propose the framework of Online Decision MetaMorphFormer (ODM) which aims to achieve self-awareness, environment recognition, and action planning through a unified model architecture. Motivated by cognitive and behavioral psychology, an ODM agent is able to learn from others, recognize the world, and practice itself based on its own experience. ODM can also be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets. Through the use of pre-trained datasets, ODM can quickly warm up and learn the necessary knowledge to perform the desired task, while the target environment continues to reinforce the universal policy. Extensive online experiments as well as few-shot and zero-shot environmental tests are used to verify ODM's performance and generalization ability. The results of our study contribute to the study of general artificial intelligence in embodied and cognitive fields. Code, results, and video examples can be found on the website \url{https://rlodm.github.io/odm/}.
△ Less
Submitted 11 September, 2024;
originally announced September 2024.
-
A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
Authors:
Ningyuan Xi,
Yetao Wu,
Kun Fan,
Teng Chen,
Qingqing Gu,
Peng Yu,
Jinxian Qu,
Chenxi Liu,
Zhonglin Jiang,
Yong Chen,
Luo Ji
Abstract:
Large Language Models (LLM) often needs to be Continual Pre-Trained (CPT) to obtain the unfamiliar language skill or adapt into new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study which bridge the gap between the optimal mixture ratio and the actual mode…
▽ More
Large Language Models (LLM) often needs to be Continual Pre-Trained (CPT) to obtain the unfamiliar language skill or adapt into new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study which bridge the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size which directly indicate the optimal experimental set up. By thorough choice of hyper-parameter, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark, but also some specific domains including math, coding and emotional intelligence. We deploy the final 70B version of LLM on an real-life chat system which obtain satisfying performance.
△ Less
Submitted 10 September, 2024;
originally announced September 2024.
-
LaMsS: When Large Language Models Meet Self-Skepticism
Authors:
Yetao Wu,
Yihong Wang,
Teng Chen,
Ningyuan Xi,
Qingqing Gu,
Hongyang Lei,
Luo Ji
Abstract:
Hallucination is a major challenge for large language models (LLMs), preventing their further application in some fields. The skeptical thinking of humankind could be useful for LLMs to self-cognition, self-reflection and alleviate their hallucinations. Inspired by this consideration, we propose a novel approach called LaMsS, which combines the semantic understanding capability of LLMs with self-s…
▽ More
Hallucination is a major challenge for large language models (LLMs), preventing their further application in some fields. The skeptical thinking of humankind could be useful for LLMs to self-cognition, self-reflection and alleviate their hallucinations. Inspired by this consideration, we propose a novel approach called LaMsS, which combines the semantic understanding capability of LLMs with self-skepticism. By introducing a series of skepticism tokens and augmenting them into the vocabulary, we conduct both pertaining and finetuning, which allow the LLM to decode each normal token followed by a skeptical token, representing different skepticism levels. By calculating the response skepticism given a query, one can define a new self-aware LLM which is only willing to answer with relative lower skepticism level than the threshold. By examining the accuracy, AUC and AP of willingly answering questions, we demonstrate that LaMsS achieves better performance than baselines on both multi-choice questions and open-domain question-answering benchmarks, and can generalize to multi-task and out-of-domain settings. Our study sheds some lights on the self-skepticism modeling on further artificial intelligence. Project code and model checkpoints can be found in https://anonymous.4open.science/r/SM-1E76.
△ Less
Submitted 25 April, 2025; v1 submitted 10 September, 2024;
originally announced September 2024.
-
M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture
Authors:
Hongyang Lei,
Xiaolong Cheng,
Qi Qin,
Dan Wang,
Kun Fan,
Huazhen Huang,
Qingqing Gu,
Yetao Wu,
Zhonglin Jiang,
Yong Chen,
Luo Ji
Abstract:
Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to incorporate with the backbone of pretrained language model, but might result in modality collapse. To alleviate such issues, we leverage the Joint-Embedding Predictive Architecture (JEPA) on the multimodal tasks, which converts the input embedding into the output embedding space by a…
▽ More
Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to incorporate with the backbone of pretrained language model, but might result in modality collapse. To alleviate such issues, we leverage the Joint-Embedding Predictive Architecture (JEPA) on the multimodal tasks, which converts the input embedding into the output embedding space by a predictor and then conducts the cross-modal alignment on the latent space. We implement this predictor by a Multi-Gate Mixture of Experts (MMoE) and name the framework as M3-JEPA, accordingly. The gating function disentangles the modality-specific and shared information and derives information-theoretic optimality. The framework is implemented with both contrastive and regularization loss, and solved by alternative gradient descent (AGD) between different multimodal tasks. By thoroughly designed experiments, we show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference. Our observation suggests that M3-JEPA might become a new basis to self-supervised learning in the open world.
△ Less
Submitted 18 June, 2025; v1 submitted 9 September, 2024;
originally announced September 2024.
-
Bayesian CART models for aggregate claim modeling
Authors:
Yaojun Zhang,
Lanpeng Ji,
Georgios Aivaliotis,
Charles C. Taylor
Abstract:
This paper proposes three types of Bayesian CART (or BCART) models for aggregate claim amount, namely, frequency-severity models, sequential models and joint models. We propose a general framework for the BCART models applicable to data with multivariate responses, which is particularly useful for the joint BCART models with a bivariate response: the number of claims and aggregate claim amount. To…
▽ More
This paper proposes three types of Bayesian CART (or BCART) models for aggregate claim amount, namely, frequency-severity models, sequential models and joint models. We propose a general framework for the BCART models applicable to data with multivariate responses, which is particularly useful for the joint BCART models with a bivariate response: the number of claims and aggregate claim amount. To facilitate frequency-severity modeling, we investigate BCART models for the right-skewed and heavy-tailed claim severity data by using various distributions. We discover that the Weibull distribution is superior to gamma and lognormal distributions, due to its ability to capture different tail characteristics in tree models. Additionally, we find that sequential BCART models and joint BCART models, which incorporate dependence between the number of claims and average severity, are beneficial and thus preferable to the frequency-severity BCART models in which independence is assumed. The effectiveness of these models' performance is illustrated by carefully designed simulations and real insurance data.
△ Less
Submitted 3 September, 2024;
originally announced September 2024.
-
MGFs: Masked Gaussian Fields for Meshing Building based on Multi-View Images
Authors:
Tengfei Wang,
Zongqian Zhan,
Rui Xia,
Linxia Ji,
Xin Wang
Abstract:
Over the last few decades, image-based building surface reconstruction has garnered substantial research interest and has been applied across various fields, such as heritage preservation, architectural planning, etc. Compared to the traditional photogrammetric and NeRF-based solutions, recently, Gaussian fields-based methods have exhibited significant potential in generating surface meshes due to…
▽ More
Over the last few decades, image-based building surface reconstruction has garnered substantial research interest and has been applied across various fields, such as heritage preservation, architectural planning, etc. Compared to the traditional photogrammetric and NeRF-based solutions, recently, Gaussian fields-based methods have exhibited significant potential in generating surface meshes due to their time-efficient training and detailed 3D information preservation. However, most gaussian fields-based methods are trained with all image pixels, encompassing building and nonbuilding areas, which results in a significant noise for building meshes and degeneration in time efficiency. This paper proposes a novel framework, Masked Gaussian Fields (MGFs), designed to generate accurate surface reconstruction for building in a time-efficient way. The framework first applies EfficientSAM and COLMAP to generate multi-level masks of building and the corresponding masked point clouds. Subsequently, the masked gaussian fields are trained by integrating two innovative losses: a multi-level perceptual masked loss focused on constructing building regions and a boundary loss aimed at enhancing the details of the boundaries between different masks. Finally, we improve the tetrahedral surface mesh extraction method based on the masked gaussian spheres. Comprehensive experiments on UAV images demonstrate that, compared to the traditional method and several NeRF-based and Gaussian-based SOTA solutions, our approach significantly improves both the accuracy and efficiency of building surface reconstruction. Notably, as a byproduct, there is an additional gain in the novel view synthesis of building.
△ Less
Submitted 6 August, 2024;
originally announced August 2024.
-
Deformable Feature Alignment and Refinement for Moving Infrared Dim-small Target Detection
Authors:
Dengyan Luo,
Yanping Xiang,
Hu Wang,
Luping Ji,
Shuai Li,
Mao Ye
Abstract:
The detection of moving infrared dim-small targets has been a challenging and prevalent research topic. The current state-of-the-art methods are mainly based on ConvLSTM to aggregate information from adjacent frames to facilitate the detection of the current frame. However, these methods implicitly utilize motion information only in the training stage and fail to explicitly explore motion compensa…
▽ More
The detection of moving infrared dim-small targets has been a challenging and prevalent research topic. The current state-of-the-art methods are mainly based on ConvLSTM to aggregate information from adjacent frames to facilitate the detection of the current frame. However, these methods implicitly utilize motion information only in the training stage and fail to explicitly explore motion compensation, resulting in poor performance in the case of a video sequence including large motion. In this paper, we propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution to explicitly use motion context in both the training and inference stages. Specifically, a Temporal Deformable Alignment (TDA) module based on the designed Dilated Convolution Attention Fusion (DCAF) block is developed to explicitly align the adjacent frames with the current frame at the feature level. Then, the feature refinement module adaptively fuses the aligned features and further aggregates useful spatio-temporal information by means of the proposed Attention-guided Deformable Fusion (AGDF) block. In addition, to improve the alignment of adjacent frames with the current frame, we extend the traditional loss function by introducing a new motion compensation loss. Extensive experimental results demonstrate that the proposed DFAR method achieves the state-of-the-art performance on two benchmark datasets including DAUB and IRDST.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Triple-domain Feature Learning with Frequency-aware Memory Enhancement for Moving Infrared Small Target Detection
Authors:
Weiwei Duan,
Luping Ji,
Shengjia Chen,
Sicheng Zhu,
Mao Ye
Abstract:
As a sub-field of object detection, moving infrared small target detection presents significant challenges due to tiny target sizes and low contrast against backgrounds. Currently-existing methods primarily rely on the features extracted only from spatio-temporal domain. Frequency domain has hardly been concerned yet, although it has been widely applied in image processing. To extend feature sourc…
▽ More
As a sub-field of object detection, moving infrared small target detection presents significant challenges due to tiny target sizes and low contrast against backgrounds. Currently-existing methods primarily rely on the features extracted only from spatio-temporal domain. Frequency domain has hardly been concerned yet, although it has been widely applied in image processing. To extend feature source domains and enhance feature representation, we propose a new Triple-domain Strategy (Tridos) with the frequency-aware memory enhancement on spatio-temporal domain for infrared small target detection. In this scheme, it effectively detaches and enhances frequency features by a local-global frequency-aware module with Fourier transform. Inspired by human visual system, our memory enhancement is designed to capture the spatial relations of infrared targets among video frames. Furthermore, it encodes temporal dynamics motion features via differential learning and residual enhancing. Additionally, we further design a residual compensation to reconcile possible cross-domain feature mismatches. To our best knowledge, proposed Tridos is the first work to explore infrared target feature learning comprehensively in spatio-temporal-frequency domains. The extensive experiments on three datasets (i.e., DAUB, ITSDT-15K and IRDST) validate that our triple-domain infrared feature learning scheme could often be obviously superior to state-of-the-art ones. Source codes are available at https://github.com/UESTC-nnLab/Tridos.
△ Less
Submitted 5 September, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features
Authors:
Lichuan Ji,
Yingqi Lin,
Zhenhua Huang,
Yan Han,
Xiaogang Xu,
Jiafei Wu,
Chong Wang,
Zhe Liu
Abstract:
The development of AI-Generated Content (AIGC) has empowered the creation of remarkably realistic AI-generated videos, such as those involving Sora. However, the widespread adoption of these models raises concerns regarding potential misuse, including face video scams and copyright disputes. Addressing these concerns requires the development of robust tools capable of accurately determining video…
▽ More
The development of AI-Generated Content (AIGC) has empowered the creation of remarkably realistic AI-generated videos, such as those involving Sora. However, the widespread adoption of these models raises concerns regarding potential misuse, including face video scams and copyright disputes. Addressing these concerns requires the development of robust tools capable of accurately determining video authenticity. The main challenges lie in the dataset and neural classifier for training. Current datasets lack a varied and comprehensive repository of real and generated content for effective discrimination. In this paper, we first introduce an extensive video dataset designed specifically for AI-Generated Video Detection (GenVidDet). It includes over 2.66 M instances of both real and generated videos, varying in categories, frames per second, resolutions, and lengths. The comprehensiveness of GenVidDet enables the training of a generalizable video detector. We also present the Dual-Branch 3D Transformer (DuB3D), an innovative and effective method for distinguishing between real and generated videos, enhanced by incorporating motion information alongside visual appearance. DuB3D utilizes a dual-branch architecture that adaptively leverages and fuses raw spatio-temporal data and optical flow. We systematically explore the critical factors affecting detection performance, achieving the optimal configuration for DuB3D. Trained on GenVidDet, DuB3D can distinguish between real and generated video content with 96.77% accuracy, and strong generalization capability even for unseen types.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
Authors:
Zeyu Wang,
Yuanchun Shi,
Yuntao Wang,
Yuchen Yao,
Kun Yan,
Yuhan Wang,
Lei Ji,
Xuhai Xu,
Chun Yu
Abstract:
Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual fie…
▽ More
Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we revealed the ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates the gaze data with the in-situ querying context. Then we implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness by achieving both higher objective score and subjective score, compared to a baseline without gaze data. We further conducted interviews and provided insights for future gaze-facilitated information querying systems.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
Weakly-Supervised Residual Evidential Learning for Multi-Instance Uncertainty Estimation
Authors:
Pei Liu,
Luping Ji
Abstract:
Uncertainty estimation (UE), as an effective means of quantifying predictive uncertainty, is crucial for safe and reliable decision-making, especially in high-risk scenarios. Existing UE schemes usually assume that there are completely-labeled samples to support fully-supervised learning. In practice, however, many UE tasks often have no sufficiently-labeled data to use, such as the Multiple Insta…
▽ More
Uncertainty estimation (UE), as an effective means of quantifying predictive uncertainty, is crucial for safe and reliable decision-making, especially in high-risk scenarios. Existing UE schemes usually assume that there are completely-labeled samples to support fully-supervised learning. In practice, however, many UE tasks often have no sufficiently-labeled data to use, such as the Multiple Instance Learning (MIL) with only weak instance annotations. To bridge this gap, this paper, for the first time, addresses the weakly-supervised issue of Multi-Instance UE (MIUE) and proposes a new baseline scheme, Multi-Instance Residual Evidential Learning (MIREL). Particularly, at the fine-grained instance UE with only weak supervision, we derive a multi-instance residual operator through the Fundamental Theorem of Symmetric Functions. On this operator derivation, we further propose MIREL to jointly model the high-order predictive distribution at bag and instance levels for MIUE. Extensive experiments empirically demonstrate that our MIREL not only could often make existing MIL networks perform better in MIUE, but also could surpass representative UE methods by large margins, especially in instance-level UE tasks. Our source code is available at https://github.com/liupei101/MIREL.
△ Less
Submitted 9 May, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
Exploring Diffusion Time-steps for Unsupervised Representation Learning
Authors:
Zhongqi Yue,
Jiankun Wang,
Qianru Sun,
Lei Ji,
Eric I-Chao Chang,
Hanwang Zhang
Abstract:
Representation learning is all about discovering the hidden modular attributes that generate the data faithfully. We explore the potential of Denoising Diffusion Probabilistic Model (DM) in unsupervised learning of the modular attributes. We build a theoretical framework that connects the diffusion time-steps and the hidden attributes, which serves as an effective inductive bias for unsupervised l…
▽ More
Representation learning is all about discovering the hidden modular attributes that generate the data faithfully. We explore the potential of Denoising Diffusion Probabilistic Model (DM) in unsupervised learning of the modular attributes. We build a theoretical framework that connects the diffusion time-steps and the hidden attributes, which serves as an effective inductive bias for unsupervised learning. Specifically, the forward diffusion process incrementally adds Gaussian noise to samples at each time-step, which essentially collapses different samples into similar ones by losing attributes, e.g., fine-grained attributes such as texture are lost with less noise added (i.e., early time-steps), while coarse-grained ones such as shape are lost by adding more noise (i.e., late time-steps). To disentangle the modular attributes, at each time-step t, we learn a t-specific feature to compensate for the newly lost attribute, and the set of all 1,...,t-specific features, corresponding to the cumulative set of lost attributes, are trained to make up for the reconstruction error of a pre-trained DM at time-step t. On CelebA, FFHQ, and Bedroom datasets, the learned feature significantly improves attribute classification and enables faithful counterfactual generation, e.g., interpolating only one specified attribute between two images, validating the disentanglement quality. Codes are in https://github.com/yue-zhongqi/diti.
△ Less
Submitted 21 January, 2024;
originally announced January 2024.
-
Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Authors:
Kun Yan,
Lei Ji,
Zeyu Wang,
Yuntao Wang,
Nan Duan,
Shuai Ma
Abstract:
In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs). However, existing VLMs face challenges in handling real-world applications with complex scenes and multiple objects, as well as aligning their focus with the diverse attention patterns of human users. In this paper…
▽ More
In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs). However, existing VLMs face challenges in handling real-world applications with complex scenes and multiple objects, as well as aligning their focus with the diverse attention patterns of human users. In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications. First, we collect hundreds of minutes of gaze data to demonstrate that we can mimic human gaze modalities using localized narratives. We then design an automatic data annotation pipeline utilizing GPT-4 to generate the VOILA-COCO dataset. Additionally, we innovate the Voila Perceiver modules to integrate gaze information into VLMs while preserving their pretrained knowledge. We evaluate Voila-A using a hold-out validation set and a newly collected VOILA-GAZE Testset, which features real-life scenarios captured with a gaze-tracking device. Our experimental results demonstrate that Voila-A significantly outperforms several baseline models. By aligning model attention with human gaze patterns, Voila-A paves the way for more intuitive, user-centric VLMs and fosters engaging human-AI interaction across a wide range of applications.
△ Less
Submitted 22 December, 2023;
originally announced January 2024.
-
An Adaptive Framework of Geographical Group-Specific Network on O2O Recommendation
Authors:
Luo Ji,
Jiayu Mao,
Hailong Shi,
Qian Li,
Yunfei Chu,
Hongxia Yang
Abstract:
Online to offline recommendation strongly correlates with the user and service's spatiotemporal information, therefore calling for a higher degree of model personalization. The traditional methodology is based on a uniform model structure trained by collected centralized data, which is unlikely to capture all user patterns over different geographical areas or time periods. To tackle this challenge…
▽ More
Online to offline recommendation strongly correlates with the user and service's spatiotemporal information, therefore calling for a higher degree of model personalization. The traditional methodology is based on a uniform model structure trained by collected centralized data, which is unlikely to capture all user patterns over different geographical areas or time periods. To tackle this challenge, we propose a geographical group-specific modeling method called GeoGrouse, which simultaneously studies the common knowledge as well as group-specific knowledge of user preferences. An automatic grouping paradigm is employed and verified based on users' geographical grouping indicators. Offline and online experiments are conducted to verify the effectiveness of our approach, and substantial business improvement is achieved.
△ Less
Submitted 28 December, 2023;
originally announced December 2023.
-
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
Authors:
Difei Gao,
Lei Ji,
Zechen Bai,
Mingyu Ouyang,
Peiran Li,
Dongxing Mao,
Qinchen Wu,
Weichen Zhang,
Peiyi Wang,
Xiangwu Guo,
Hengxu Wang,
Luowei Zhou,
Mike Zheng Shou
Abstract:
Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Model (LLM) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper…
▽ More
Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Model (LLM) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely-used software applications, such as, After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied Agent framework, which incorporates a sophisticated GUI parser driven by an LLM-agent and an enhanced reasoning mechanism adept at handling lengthy procedural tasks. Our experimental results reveal that our GUI Parser and Reasoning mechanism outshine existing methods in performance. Nevertheless, the potential remains substantial, with the best model attaining only a 46% success rate on our benchmark. We conclude with a thorough analysis of the current methods' limitations, setting the stage for future breakthroughs in this domain.
△ Less
Submitted 1 January, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
Authors:
Seongsu Bae,
Daeun Kyung,
Jaehee Ryu,
Eunbyeol Cho,
Gyubok Lee,
Sunjun Kweon,
Jungwoo Oh,
Lei Ji,
Eric I-Chao Chang,
Tackeun Kim,
Edward Choi
Abstract:
Electronic Health Records (EHRs), which contain patients' medical histories in various multi-modal formats, often overlook the potential for joint reasoning across imaging and table modalities underexplored in current EHR Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel multi-modal question answering dataset combining structured EHRs and chest X-ray images. To develop o…
▽ More
Electronic Health Records (EHRs), which contain patients' medical histories in various multi-modal formats, often overlook the potential for joint reasoning across imaging and table modalities underexplored in current EHR Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel multi-modal question answering dataset combining structured EHRs and chest X-ray images. To develop our dataset, we first construct two uni-modal resources: 1) The MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we successfully construct a multi-modal EHR QA dataset that necessitates both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions within EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. This pioneering endeavor enhances engagement with multi-modal EHR sources and we believe that our dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research. EHRXQA is available at https://github.com/baeseongsu/ehrxqa.
△ Less
Submitted 25 December, 2023; v1 submitted 28 October, 2023;
originally announced October 2023.
-
Construction of optimal flag codes by MRD codes
Authors:
Shuangqing Liu,
Shuhui Yu,
Lijun Ji
Abstract:
Flag codes have received a lot of attention due to its application in random network coding. In 2021, Alonso-González et al. constructed optimal $(n,\mathcal{A})$-Optimum distance flag codes(ODFC) for $\mathcal {A}\subseteq \{1,2,\ldots,k,n-k,\ldots,n-1\}$ with $k\in \mathcal A$ and $k\mid n$. In this paper, we introduce a new construction of $(n,\mathcal A)_q$-ODFCs by maximum rank-metric codes,…
▽ More
Flag codes have received a lot of attention due to its application in random network coding. In 2021, Alonso-González et al. constructed optimal $(n,\mathcal{A})$-Optimum distance flag codes(ODFC) for $\mathcal {A}\subseteq \{1,2,\ldots,k,n-k,\ldots,n-1\}$ with $k\in \mathcal A$ and $k\mid n$. In this paper, we introduce a new construction of $(n,\mathcal A)_q$-ODFCs by maximum rank-metric codes, and prove that there is an $(n,\mathcal{A})$-ODFC of size $\frac{q^n-q^{k+r}}{q^k-1}+1$ for any $\mathcal{A}\subseteq\{1,2,\ldots,k,n-k,\ldots,n-1\}$ with $\mathcal A\cap \{k,n-k\}\neq\emptyset$, where $r\equiv n\pmod k$ and $0\leq r<k$. Furthermore, when $k>\frac{q^r-1}{q-1}$, this $(n,\mathcal A)_q$-ODFC is optimal. Specially, when $r=0$, Alonso-González et al.'s result is also obtained. We also gives a characterization of almost optimum distance flag codes, and construct a family of optimal almost optimum flag distance codes.
△ Less
Submitted 11 October, 2024; v1 submitted 17 October, 2023;
originally announced October 2023.
-
Qwen Technical Report
Authors:
Jinze Bai,
Shuai Bai,
Yunfei Chu,
Zeyu Cui,
Kai Dang,
Xiaodong Deng,
Yang Fan,
Wenbin Ge,
Yu Han,
Fei Huang,
Binyuan Hui,
Luo Ji,
Mei Li,
Junyang Lin,
Runji Lin,
Dayiheng Liu,
Gao Liu,
Chengqiang Lu,
Keming Lu,
Jianxin Ma,
Rui Men,
Xingzhang Ren,
Xuancheng Ren,
Chuanqi Tan,
Sinan Tan
, et al. (23 additional authors not shown)
Abstract:
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Q…
▽ More
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.
△ Less
Submitted 28 September, 2023;
originally announced September 2023.
-
Design of Recognition and Evaluation System for Table Tennis Players' Motor Skills Based on Artificial Intelligence
Authors:
Zhuo-yong Shi,
Ye-tao Jia,
Ke-xin Zhang,
Ding-han Wang,
Long-meng Ji,
Yong Wu
Abstract:
With the rapid development of electronic science and technology, the research on wearable devices is constantly updated, but for now, it is not comprehensive for wearable devices to recognize and analyze the movement of specific sports. Based on this, this paper improves wearable devices of table tennis sport, and realizes the pattern recognition and evaluation of table tennis players' motor skill…
▽ More
With the rapid development of electronic science and technology, the research on wearable devices is constantly updated, but for now, it is not comprehensive for wearable devices to recognize and analyze the movement of specific sports. Based on this, this paper improves wearable devices of table tennis sport, and realizes the pattern recognition and evaluation of table tennis players' motor skills through artificial intelligence. Firstly, a device is designed to collect the movement information of table tennis players and the actual movement data is processed. Secondly, a sliding window is made to divide the collected motion data into a characteristic database of six table tennis benchmark movements. Thirdly, motion features were constructed based on feature engineering, and motor skills were identified for different models after dimensionality reduction. Finally, the hierarchical evaluation system of motor skills is established with the loss functions of different evaluation indexes. The results show that in the recognition of table tennis players' motor skills, the feature-based BP neural network proposed in this paper has higher recognition accuracy and stronger generalization ability than the traditional convolutional neural network.
△ Less
Submitted 4 September, 2023;
originally announced September 2023.
-
C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model
Authors:
Longbin Ji,
Pengfei Wei,
Yi Ren,
Jinglin Liu,
Chen Zhang,
Xiang Yin
Abstract:
Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, particularly in generating high-fidelity and comprehensive gestures. Additionally, these methods lack effective control over speaker identity and temporal editing of the generated gestures. Focusing on capturing temporal…
▽ More
Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, particularly in generating high-fidelity and comprehensive gestures. Additionally, these methods lack effective control over speaker identity and temporal editing of the generated gestures. Focusing on capturing temporal latent information and applying practical controlling, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key features to C2G2, namely a speaker-specific decoder to generate speaker-related real-length skeletons and a repainting strategy for flexible gesture generation/editing. Extensive experiments on benchmark gesture datasets verify the effectiveness of our proposed C2G2 compared with several state-of-the-art baselines. The link of the project demo page can be found at https://c2g2-gesture.github.io/c2_gesture
△ Less
Submitted 29 August, 2023;
originally announced August 2023.
-
Anomaly Detection in Automated Fibre Placement: Learning with Data Limitations
Authors:
Assef Ghamisi,
Todd Charter,
Li Ji,
Maxime Rivard,
Gil Lund,
Homayoun Najjaran
Abstract:
Conventional defect detection systems in Automated Fibre Placement (AFP) typically rely on end-to-end supervised learning, necessitating a substantial number of labelled defective samples for effective training. However, the scarcity of such labelled data poses a challenge. To overcome this limitation, we present a comprehensive framework for defect detection and localization in Automated Fibre Pl…
▽ More
Conventional defect detection systems in Automated Fibre Placement (AFP) typically rely on end-to-end supervised learning, necessitating a substantial number of labelled defective samples for effective training. However, the scarcity of such labelled data poses a challenge. To overcome this limitation, we present a comprehensive framework for defect detection and localization in Automated Fibre Placement. Our approach combines unsupervised deep learning and classical computer vision algorithms, eliminating the need for labelled data or manufacturing defect samples. It efficiently detects various surface issues while requiring fewer images of composite parts for training. Our framework employs an innovative sample extraction method leveraging AFP's inherent symmetry to expand the dataset. By inputting a depth map of the fibre layup surface, we extract local samples aligned with each composite strip (tow). These samples are processed through an autoencoder, trained on normal samples for precise reconstructions, highlighting anomalies through reconstruction errors. Aggregated values form an anomaly map for insightful visualization. The framework employs blob detection on this map to locate manufacturing defects. The experimental findings reveal that despite training the autoencoder with a limited number of images, our proposed method exhibits satisfactory detection accuracy and accurately identifies defect locations. Our framework demonstrates comparable performance to existing methods, while also offering the advantage of detecting all types of anomalies without relying on an extensive labelled dataset of defects.
△ Less
Submitted 14 August, 2023; v1 submitted 15 July, 2023;
originally announced July 2023.
-
KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization
Authors:
Gangwoo Kim,
Hajung Kim,
Lei Ji,
Seongsu Bae,
Chanhwi Kim,
Mujeen Sung,
Hyunjae Kim,
Kun Yan,
Eric Chang,
Jaewoo Kang
Abstract:
In this paper, we introduce CheXOFA, a new pre-trained vision-language model (VLM) for the chest X-ray domain. Our model is initially pre-trained on various multimodal datasets within the general domain before being transferred to the chest X-ray domain. Following a prominent VLM, we unify various domain-specific tasks into a simple sequence-to-sequence schema. It enables the model to effectively…
▽ More
In this paper, we introduce CheXOFA, a new pre-trained vision-language model (VLM) for the chest X-ray domain. Our model is initially pre-trained on various multimodal datasets within the general domain before being transferred to the chest X-ray domain. Following a prominent VLM, we unify various domain-specific tasks into a simple sequence-to-sequence schema. It enables the model to effectively learn the required knowledge and skills from limited resources in the domain. Demonstrating superior performance on the benchmark datasets provided by the BioNLP shared task, our model benefits from its training across multiple tasks and domains. With subtle techniques including ensemble and factual calibration, our system achieves first place on the RadSum23 leaderboard for the hidden test set.
△ Less
Submitted 10 July, 2023;
originally announced July 2023.
-
Pseudo-Bag Mixup Augmentation for Multiple Instance Learning-Based Whole Slide Image Classification
Authors:
Pei Liu,
Luping Ji,
Xinyu Zhang,
Feng Ye
Abstract:
Given the special situation of modeling gigapixel images, multiple instance learning (MIL) has become one of the most important frameworks for Whole Slide Image (WSI) classification. In current practice, most MIL networks often face two unavoidable problems in training: i) insufficient WSI data and ii) the sample memorization inclination inherent in neural networks. These problems may hinder MIL m…
▽ More
Given the special situation of modeling gigapixel images, multiple instance learning (MIL) has become one of the most important frameworks for Whole Slide Image (WSI) classification. In current practice, most MIL networks often face two unavoidable problems in training: i) insufficient WSI data and ii) the sample memorization inclination inherent in neural networks. These problems may hinder MIL models from adequate and efficient training, suppressing the continuous performance promotion of classification models on WSIs. Inspired by the basic idea of Mixup, this paper proposes a new Pseudo-bag Mixup (PseMix) data augmentation scheme to improve the training of MIL models. This scheme generalizes the Mixup strategy for general images to special WSIs via pseudo-bags so as to be applied in MIL-based WSI classification. Cooperated by pseudo-bags, our PseMix fulfills the critical size alignment and semantic alignment in Mixup strategy. Moreover, it is designed as an efficient and decoupled method, neither involving time-consuming operations nor relying on MIL model predictions. Comparative experiments and ablation studies are specially designed to evaluate the effectiveness and advantages of our PseMix. Experimental results show that PseMix could often assist state-of-the-art MIL networks to refresh their classification performance on WSIs. Besides, it could also boost the generalization performance of MIL models in special test scenarios, and promote their robustness to patch occlusion and label noise. Our source code is available at https://github.com/liupei101/PseMix.
△ Less
Submitted 2 November, 2023; v1 submitted 28 June, 2023;
originally announced June 2023.