-
LanTu: Dynamics-Enhanced Deep Learning for Eddy-Resolving Ocean Forecasting
Authors:
Qingyu Zheng,
Qi Shao,
Guijun Han,
Wei Li,
Hong Li,
Xuan Wang
Abstract:
Mesoscale eddies dominate the spatiotemporal multiscale variability of the ocean, and their impact on the energy cascade of the global ocean cannot be ignored. Eddy-resolving ocean forecasting is providing more reliable protection for fisheries and navigational safety, but also presents significant scientific challenges and high computational costs for traditional numerical models. Artificial inte…
▽ More
Mesoscale eddies dominate the spatiotemporal multiscale variability of the ocean, and their impact on the energy cascade of the global ocean cannot be ignored. Eddy-resolving ocean forecasting is providing more reliable protection for fisheries and navigational safety, but also presents significant scientific challenges and high computational costs for traditional numerical models. Artificial intelligence (AI)-based weather and ocean forecasting systems are becoming powerful tools that balance forecast performance with computational efficiency. However, the complex multiscale features in the ocean dynamical system make AI models still face many challenges in mesoscale eddy forecasting (especially regional modelling). Here, we develop LanTu, a regional eddy-resolving ocean forecasting system based on dynamics-enhanced deep learning. We incorporate cross-scale interactions into LanTu and construct multiscale physical constraint for optimising LanTu guided by knowledge of eddy dynamics in order to improve the forecasting skill of LanTu for mesoscale evolution. The results show that LanTu outperforms the existing advanced operational numerical ocean forecasting system (NOFS) and AI-based ocean forecasting system (AI-OFS) in temperature, salinity, sea level anomaly and current prediction, with a lead time of more than 10 days. Our study highlights that dynamics-enhanced deep learning (LanTu) can be a powerful paradigm for eddy-resolving ocean forecasting.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition
Authors:
Feng Liu,
Ziwang Fu,
Yunlong Wang,
Qijian Zheng
Abstract:
The fusion technique is the key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the othe…
▽ More
The fusion technique is the key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the other during cross-modal interaction, and the features that can reinforce a modality may contain only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, for the redundant features, we make one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with another modality. To better capture the complementary information between the modalities, we obtain the fused weight vector by splicing and use the weight vector to achieve feature reinforcement of the modalities. We apply TCAFN to the RAVDESS and IEMOCAP datasets. For fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement compared to other methods and reaches the state-of-the-art. All code and models could be accessed from https://github.com/shuzihuaiyu/TACFN.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
Low-bit Model Quantization for Deep Neural Networks: A Survey
Authors:
Kai Liu,
Qian Zheng,
Kaiwen Tao,
Zhiteng Li,
Haotong Qin,
Wenbo Li,
Yong Guo,
Xianglong Liu,
Linghe Kong,
Guihai Chen,
Yulun Zhang,
Xiaokang Yang
Abstract:
With unprecedented rapid development, deep neural networks (DNNs) have deeply influenced almost all fields. However, their heavy computation costs and model sizes are usually unacceptable in real-world deployment. Model quantization, an effective weight-lighting technique, has become an indispensable procedure in the whole deployment pipeline. The essence of quantization acceleration is the conver…
▽ More
With unprecedented rapid development, deep neural networks (DNNs) have deeply influenced almost all fields. However, their heavy computation costs and model sizes are usually unacceptable in real-world deployment. Model quantization, an effective weight-lighting technique, has become an indispensable procedure in the whole deployment pipeline. The essence of quantization acceleration is the conversion from continuous floating-point numbers to discrete integer ones, which significantly speeds up the memory I/O and calculation, i.e., addition and multiplication. However, performance degradation also comes with the conversion because of the loss of precision. Therefore, it has become increasingly popular and critical to investigate how to perform the conversion and how to compensate for the information loss. This article surveys the recent five-year progress towards low-bit quantization on DNNs. We discuss and compare the state-of-the-art quantization methods and classify them into 8 main categories and 24 sub-categories according to their core techniques. Furthermore, we shed light on the potential research opportunities in the field of model quantization. A curated list of model quantization is provided at https://github.com/Kai-Liu001/Awesome-Model-Quantization.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Authors:
Siyan Zhao,
Devaansh Gupta,
Qinqing Zheng,
Aditya Grover
Abstract:
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefits from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large la…
▽ More
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefits from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Single-Pass Document Scanning for Question Answering
Authors:
Weili Cao,
Jianyou Wang,
Youze Zheng,
Longtian Bao,
Qirui Zheng,
Taylor Berg-Kirkpatrick,
Ramamohan Paturi,
Leon Bergen
Abstract:
Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which se…
▽ More
Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at https://github.com/MambaRetriever/MambaRetriever
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding
Authors:
Qi Wu,
Quanlong Zheng,
Yanhao Zhang,
Junlin Xie,
Jinguo Luo,
Kuo Wang,
Peng Liu,
Qingsong Xie,
Ru Zhen,
Haonan Lu,
Zhenyu Yang
Abstract:
With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder the accurate assessment of models' comprehensive video understanding capabilities. To ta…
▽ More
With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder the accurate assessment of models' comprehensive video understanding capabilities. To tackle this challenge, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. This benchmark contributes three key features:
Extended video duration: Spanning videos from brief 3-second clips to comprehensive 1.5-hour recordings, thereby bridging the temporal gaps found in current benchmarks. Comprehensive assessment tasks: Beyond traditional perceptual and reasoning tasks, we have introduced modules for countercommonsense comprehension and trajectory state tracking. These additions test the models' deep understanding capabilities beyond mere prior knowledge. Enriched video data: To keep pace with the rapid evolution of current AI agents, we have expanded first-person streaming video datasets. This expansion allows for the exploration of multimodal models' performance in understanding streaming videos from a first-person perspective. Extensive results from H2VU reveal that existing multimodal large language models (MLLMs) possess substantial potential for improvement in our newly proposed evaluation tasks. We expect that H2VU will facilitate advancements in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
Authors:
Haoyu Zhao,
Zhongang Qi,
Cong Wang,
Qingping Zheng,
Guansong Lu,
Fei Chen,
Hang Xu,
Zuxuan Wu
Abstract:
Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations, most models rely on U-Net, which underperforms compared to the MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framewo…
▽ More
Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations, most models rely on U-Net, which underperforms compared to the MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framework that not only explores different pose-guided control structures in MM-DiT, but also reemphasizes the crucial role of text in this task. Specifically, we employ a Shared VAE encoder for both reference images and driving pose videos, eliminating the need for an additional pose encoder and simplifying the overall framework. To incorporate pose features into the full attention blocks, we propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. The encoded features are directly added to the visual input, preserving the spatiotemporal consistency of the backbone while effectively introducing pose control into MM-DiT. Furthermore, within the full attention mechanism, we align textual and visual features to enhance controllability. By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion. Experimental results verify the superiority of DynamiCtrl on benchmark datasets, demonstrating its strong identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. The project page is available at https://gulucaptain.github.io/DynamiCtrl/.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
A Simple yet Effective Layout Token in Large Language Models for Document Understanding
Authors:
Zhaoqing Zhu,
Chuwei Luo,
Zirui Shao,
Feiyu Gao,
Hangdi Xing,
Qi Zheng,
Ji Zhang
Abstract:
Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to…
▽ More
Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to represent layout information. Due to the constraint on max position IDs, assigning them to layout information reduces those available for text content, reducing the capacity for the model to learn from the text during training, while also introducing a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and MLLMs of similar scales on multi-page document understanding tasks, as well as most single-page tasks.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
Authors:
Juntao Dai,
Taiye Chen,
Yaodong Yang,
Qian Zheng,
Gang Pan
Abstract:
Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when t…
▽ More
Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. However, current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define behavior policy as the next token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
Machine learning algorithms to predict stroke in China based on causal inference of time series analysis
Authors:
Qizhi Zheng,
Ayang Zhao,
Xinzhu Wang,
Yanhong Bai,
Zikun Wang,
Xiuying Wang,
Xianzhang Zeng,
Guanghui Dong
Abstract:
Participants: This study employed a combination of Vector Autoregression (VAR) model and Graph Neural Networks (GNN) to systematically construct dynamic causal inference. Multiple classic classification algorithms were compared, including Random Forest, Logistic Regression, XGBoost, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gradient Boosting, and Multi Layer Perceptron (MLP). The SMO…
▽ More
Participants: This study employed a combination of Vector Autoregression (VAR) model and Graph Neural Networks (GNN) to systematically construct dynamic causal inference. Multiple classic classification algorithms were compared, including Random Forest, Logistic Regression, XGBoost, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gradient Boosting, and Multi Layer Perceptron (MLP). The SMOTE algorithm was used to undersample a small number of samples and employed Stratified K-fold Cross Validation. Results: This study included a total of 11,789 participants, including 6,334 females (53.73%) and 5,455 males (46.27%), with an average age of 65 years. Introduction of dynamic causal inference features has significantly improved the performance of almost all models. The area under the ROC curve of each model ranged from 0.78 to 0.83, indicating significant difference (P < 0.01). Among all the models, the Gradient Boosting model demonstrated the highest performance and stability. Model explanation and feature importance analysis generated model interpretation that illustrated significant contributors associated with risks of stroke. Conclusions and Relevance: This study proposes a stroke risk prediction method that combines dynamic causal inference with machine learning models, significantly improving prediction accuracy and revealing key health factors that affect stroke. The research results indicate that dynamic causal inference features have important value in predicting stroke risk, especially in capturing the impact of changes in health status over time on stroke risk. By further optimizing the model and introducing more variables, this study provides theoretical basis and practical guidance for future stroke prevention and intervention strategies.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
CCRSat: A Collaborative Computation Reuse Framework for Satellite Edge Computing Networks
Authors:
Ye Zhang,
Zhishu Shen,
Dawen Jiang,
Xiangrui Liu,
Qiushi Zheng,
Jiong Jin
Abstract:
In satellite computing applications, such as remote sensing, tasks often involve similar or identical input data, leading to the same processing results. Computation reuse is an emerging paradigm that leverages the execution results of previous tasks to enhance the utilization of computational resources. While this paradigm has been extensively studied in terrestrial networks with abundant computi…
▽ More
In satellite computing applications, such as remote sensing, tasks often involve similar or identical input data, leading to the same processing results. Computation reuse is an emerging paradigm that leverages the execution results of previous tasks to enhance the utilization of computational resources. While this paradigm has been extensively studied in terrestrial networks with abundant computing and caching resources, such as named data networking (NDN), it is essential to develop a framework appropriate for resource-constrained satellite networks, which are expected to have longer task completion times. In this paper, we propose CCRSat, a collaborative computation reuse framework for satellite edge computing networks. CCRSat initially implements local computation reuse on an independent satellite, utilizing a satellite reuse state (SRS) to assess the efficiency of computation reuse. Additionally, an inter-satellite computation reuse algorithm is introduced, which utilizes the collaborative sharing of similarity in previously processed data among multiple satellites. The evaluation results tested on real-world datasets demonstrate that, compared to comparative scenarios, our proposed CCRSat can significantly reduce task completion time by up to 62.1% and computational resource consumption by up to 28.8%.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
HGO-YOLO: Advancing Anomaly Behavior Detection with Hierarchical Features and Lightweight Optimized Detection
Authors:
Qizhi Zheng,
Zhongze Luo,
Meiyan Guo,
Xinzhu Wang,
Renqimuge Wu,
Qiu Meng,
Guanghui Dong
Abstract:
Accurate and real-time object detection is crucial for anomaly behavior detection, especially in scenarios constrained by hardware limitations, where balancing accuracy and speed is essential for enhancing detection performance. This study proposes a model called HGO-YOLO, which integrates the HGNetv2 architecture into YOLOv8. This combination expands the receptive field and captures a wider range…
▽ More
Accurate and real-time object detection is crucial for anomaly behavior detection, especially in scenarios constrained by hardware limitations, where balancing accuracy and speed is essential for enhancing detection performance. This study proposes a model called HGO-YOLO, which integrates the HGNetv2 architecture into YOLOv8. This combination expands the receptive field and captures a wider range of features while simplifying model complexity through GhostConv. We introduced a lightweight detection head, OptiConvDetect, which utilizes parameter sharing to construct the detection head effectively. Evaluation results show that the proposed algorithm achieves a [email protected] of 87.4% and a recall rate of 81.1%, with a model size of only 4.6 MB and a frame rate of 56 FPS on the CPU. HGO-YOLO not only improves accuracy by 3.0% but also reduces computational load by 51.69% (from 8.9 GFLOPs to 4.3 GFLOPs), while increasing the frame rate by a factor of 1.7. Additionally, real-time tests were conducted on Raspberry Pi4 and NVIDIA platforms. These results indicate that the HGO-YOLO model demonstrates superior performance in anomaly behavior detection.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
Rule-Based Conflict-Free Decision Framework in Swarm Confrontation
Authors:
Zhaoqi Dong,
Zhinan Wang,
Quanqi Zheng,
Bin Xu,
Lei Chen,
Jinhu Lv
Abstract:
Traditional rule-based decision-making methods with interpretable advantage, such as finite state machine, suffer from the jitter or deadlock(JoD) problems in extremely dynamic scenarios. To realize agent swarm confrontation, decision conflicts causing many JoD problems are a key issue to be solved. Here, we propose a novel decision-making framework that integrates probabilistic finite state machi…
▽ More
Traditional rule-based decision-making methods with interpretable advantage, such as finite state machine, suffer from the jitter or deadlock(JoD) problems in extremely dynamic scenarios. To realize agent swarm confrontation, decision conflicts causing many JoD problems are a key issue to be solved. Here, we propose a novel decision-making framework that integrates probabilistic finite state machine, deep convolutional networks, and reinforcement learning to implement interpretable intelligence into agents. Our framework overcomes state machine instability and JoD problems, ensuring reliable and adaptable decisions in swarm confrontation. The proposed approach demonstrates effective performance via enhanced human-like cooperation and competitive strategies in the rigorous evaluation of real experiments, outperforming other methods.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
VACT: A Video Automatic Causal Testing System and a Benchmark
Authors:
Haotong Yang,
Qingyuan Zheng,
Yunjian Gao,
Yongkun Yang,
Yangbo He,
Zhouchen Lin,
Muhan Zhang
Abstract:
With the rapid advancement of text-conditioned Video Generation Models (VGMs), the quality of generated videos has significantly improved, bringing these models closer to functioning as ``*world simulators*'' and making real-world-level video generation more accessible and cost-effective. However, the generated videos often contain factual inaccuracies and lack understanding of fundamental physica…
▽ More
With the rapid advancement of text-conditioned Video Generation Models (VGMs), the quality of generated videos has significantly improved, bringing these models closer to functioning as ``*world simulators*'' and making real-world-level video generation more accessible and cost-effective. However, the generated videos often contain factual inaccuracies and lack understanding of fundamental physical laws. While some previous studies have highlighted this issue in limited domains through manual analysis, a comprehensive solution has not yet been established, primarily due to the absence of a generalized, automated approach for modeling and assessing the causal reasoning of these models across diverse scenarios. To address this gap, we propose VACT: an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. By combining causal analysis techniques with a carefully designed large language model assistant, our system can assess the causal behavior of models in various contexts without human annotation, which offers strong generalization and scalability. Additionally, we introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs. As a demonstration, we use our framework to benchmark several prevailing VGMs, offering insight into their causal reasoning capabilities. Our work lays the foundation for systematically addressing the causal understanding deficiencies in VGMs and contributes to advancing their reliability and real-world applicability.
△ Less
Submitted 19 April, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
End-to-End HOI Reconstruction Transformer with Graph-based Encoding
Authors:
Zhenrong Wang,
Qi Zheng,
Sihan Ma,
Maosheng Ye,
Yibing Zhan,
Dongjiang Li
Abstract:
With the diversification of human-object interaction (HOI) applications and the success of capturing human meshes, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, such a way leads to a natural conflict between 3D mesh reconstruction, which emphasizes global structu…
▽ More
With the diversification of human-object interaction (HOI) applications and the success of capturing human meshes, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, such a way leads to a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on BEHAVE and InterCap datasets. Particularly on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Authors:
Xinsheng Wang,
Mingqi Jiang,
Ziyang Ma,
Ziyu Zhang,
Songxiang Liu,
Linqin Li,
Zheng Liang,
Qixi Zheng,
Rui Wang,
Xiaoqin Feng,
Weizhen Bian,
Zhen Ye,
Sitong Cheng,
Ruibin Yuan,
Zhixian Zhao,
Xinfa Zhu,
Jiahao Pan,
Liumeng Xue,
Pengcheng Zhu,
Yunlin Chen,
Zhifei Li,
Xie Chen,
Lei Xie,
Yike Guo,
Wei Xue
Abstract:
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a sin…
▽ More
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing
Authors:
Xiang Xiang,
Zhuo Xu,
Yao Deng,
Qinhao Zhou,
Yifan Liang,
Ke Chen,
Qingfang Zheng,
Yaowei Wang,
Xilin Chen,
Wen Gao
Abstract:
In open-world remote sensing, deployed models must continuously adapt to a steady influx of new data, which often exhibits various shifts compared to what the model encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update themselves. These challenges give rise to a variety of open-wo…
▽ More
In open-world remote sensing, deployed models must continuously adapt to a steady influx of new data, which often exhibits various shifts compared to what the model encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update themselves. These challenges give rise to a variety of open-world tasks. However, existing open-world remote sensing studies typically train and test within a single dataset to simulate open-world conditions. Currently, there is a lack of large-scale benchmarks capable of evaluating multiple open-world tasks. In this paper, we introduce OpenEarthSensing, a large-scale fine-grained benchmark for open-world remote sensing. OpenEarthSensing includes 189 scene and objects categories, covering the vast majority of potential semantic shifts that may occur in the real world. Additionally, OpenEarthSensing encompasses five data domains with significant covariate shifts, including two RGB satellite domians, one RGB aerial domian, one MS RGB domian, and one infrared domian. The various domains provide a more comprehensive testbed for evaluating the generalization performance of open-world models. We conduct the baseline evaluation of current mainstream open-world tasks and methods on OpenEarthSensing, demonstrating that it serves as a challenging benchmark for open-world remote sensing.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
M^3Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging
Authors:
Jinghao Feng,
Qiaoyu Zheng,
Chaoyi Wu,
Ziheng Zhao,
Ya Zhang,
Yanfeng Wang,
Weidi Xie
Abstract:
Agentic AI systems have gained significant attention for their ability to autonomously perform complex tasks. However, their reliance on well-prepared tools limits their applicability in the medical domain, which requires to train specialized models. In this paper, we make three contributions: (i) We present M3Builder, a novel multi-agent system designed to automate machine learning (ML) in medica…
▽ More
Agentic AI systems have gained significant attention for their ability to autonomously perform complex tasks. However, their reliance on well-prepared tools limits their applicability in the medical domain, which requires to train specialized models. In this paper, we make three contributions: (i) We present M3Builder, a novel multi-agent system designed to automate machine learning (ML) in medical imaging. At its core, M3Builder employs four specialized agents that collaborate to tackle complex, multi-step medical ML workflows, from automated data processing and environment configuration to self-contained auto debugging and model training. These agents operate within a medical imaging ML workspace, a structured environment designed to provide agents with free-text descriptions of datasets, training codes, and interaction tools, enabling seamless communication and task execution. (ii) To evaluate progress in automated medical imaging ML, we propose M3Bench, a benchmark comprising four general tasks on 14 training datasets, across five anatomies and three imaging modalities, covering both 2D and 3D data. (iii) We experiment with seven state-of-the-art large language models serving as agent cores for our system, such as Claude series, GPT-4o, and DeepSeek-V3. Compared to existing ML agentic designs, M3Builder shows superior performance on completing ML tasks in medical imaging, achieving a 94.29% success rate using Claude-3.7-Sonnet as the agent core, showing huge potential towards fully automated machine learning in medical imaging.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Application of machine learning algorithm in temperature field reconstruction
Authors:
Qianyu He,
Huaiwei Sun,
Yubo Li,
Zhiwen You,
Qiming Zheng,
Yinghan Huang,
Sipeng Zhu,
Fengyu Wang
Abstract:
This study focuses on the stratification patterns and dynamic evolution of reservoir water temperatures, aiming to estimate and reconstruct the temperature field using limited and noisy local measurement data. Due to complex measurement environments and technical limitations, obtaining complete temperature information for reservoirs is highly challenging. Therefore, accurately reconstructing the t…
▽ More
This study focuses on the stratification patterns and dynamic evolution of reservoir water temperatures, aiming to estimate and reconstruct the temperature field using limited and noisy local measurement data. Due to complex measurement environments and technical limitations, obtaining complete temperature information for reservoirs is highly challenging. Therefore, accurately reconstructing the temperature field from a small number of local data points has become a critical scientific issue. To address this, the study employs Proper Orthogonal Decomposition (POD) and sparse representation methods to reconstruct the temperature field based on temperature data from a limited number of local measurement points. The results indicate that satisfactory reconstruction can be achieved when the number of POD basis functions is set to 2 and the number of measurement points is 10. Under different water intake depths, the reconstruction errors of both POD and sparse representation methods remain stable at around 0.15, fully validating the effectiveness of these methods in reconstructing the temperature field based on limited local temperature data. Additionally, the study further explores the distribution characteristics of reconstruction errors for POD and sparse representation methods under different water level intervals, analyzing the optimal measurement point layout scheme and potential limitations of the reconstruction methods in this case. This research not only effectively reduces measurement costs and computational resource consumption but also provides a new technical approach for reservoir temperature analysis, holding significant theoretical and practical importance.
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
FedHC: A Hierarchical Clustered Federated Learning Framework for Satellite Networks
Authors:
Zhuocheng Liu,
Zhishu Shen,
Pan Zhou,
Qiushi Zheng,
Jiong Jin
Abstract:
With the proliferation of data-driven services, the volume of data that needs to be processed by satellite networks has significantly increased. Federated learning (FL) is well-suited for big data processing in distributed, resource-constrained satellite environments. However, ensuring its convergence performance while minimizing processing time and energy consumption remains a challenge. To this…
▽ More
With the proliferation of data-driven services, the volume of data that needs to be processed by satellite networks has significantly increased. Federated learning (FL) is well-suited for big data processing in distributed, resource-constrained satellite environments. However, ensuring its convergence performance while minimizing processing time and energy consumption remains a challenge. To this end, we propose a hierarchical clustered federated learning framework, FedHC. This framework employs a satellite-clustered parameter server (PS) selection algorithm at the cluster aggregation stage, grouping nearby satellites into distinct clusters and designating a cluster center as the PS to accelerate model aggregation. Several communicable cluster PS satellites are then selected through ground stations to aggregate global parameters, facilitating the FL process. Moreover, a meta-learning-driven satellite re-clustering algorithm is introduced to enhance adaptability to dynamic satellite cluster changes. The extensive experiments on satellite networks testbed demonstrate that FedHC can significantly reduce processing time (up to 3x) and energy consumption (up to 2x) compared to other comparative methods while maintaining model accuracy.
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
SeWA: Selective Weight Average via Probabilistic Masking
Authors:
Peng Wang,
Shengchao Hu,
Zerui Tao,
Guoxia Wang,
Dianhai Yu,
Li Shen,
Quan Zheng,
Dacheng Tao
Abstract:
Weight averaging has become a standard technique for enhancing model performance. However, methods such as Stochastic Weight Averaging (SWA) and Latest Weight Averaging (LAWA) often require manually designed procedures to sample from the training trajectory, and the results depend heavily on hyperparameter tuning. To minimize human effort, this paper proposes a simple yet efficient algorithm calle…
▽ More
Weight averaging has become a standard technique for enhancing model performance. However, methods such as Stochastic Weight Averaging (SWA) and Latest Weight Averaging (LAWA) often require manually designed procedures to sample from the training trajectory, and the results depend heavily on hyperparameter tuning. To minimize human effort, this paper proposes a simple yet efficient algorithm called Selective Weight Averaging (SeWA), which adaptively selects checkpoints during the final stages of training for averaging. Based on SeWA, we show that only a few points are needed to achieve better generalization and faster convergence. Theoretically, solving the discrete subset selection problem is inherently challenging. To address this, we transform it into a continuous probabilistic optimization framework and employ the Gumbel-Softmax estimator to learn the non-differentiable mask for each checkpoint. Further, we theoretically derive the SeWA's stability-based generalization bounds, which are sharper than that of SGD under both convex and non-convex assumptions. Finally, solid extended experiments in various domains, including behavior cloning, image classification, and text classification, further validate the effectiveness of our approach.
△ Less
Submitted 14 February, 2025;
originally announced February 2025.
-
MindAligner: Explicit Brain Functional Alignment for Cross-Subject Visual Decoding from Limited fMRI Data
Authors:
Yuqin Dai,
Zhouheng Yao,
Chunfeng Song,
Qihao Zheng,
Weijian Mai,
Kunyu Peng,
Shuai Lu,
Wanli Ouyang,
Jian Yang,
Jiamin Wu
Abstract:
Brain decoding aims to reconstruct visual perception of human subject from fMRI signals, which is crucial for understanding brain's perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by limited availability of fMRI data. To address…
▽ More
Brain decoding aims to reconstruct visual perception of human subject from fMRI signals, which is crucial for understanding brain's perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by limited availability of fMRI data. To address these challenges, we propose MindAligner, an explicit functional alignment framework for cross-subject brain decoding from limited fMRI data. The proposed MindAligner enjoys several merits. First, we learn a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject to one of the known subjects, enabling seamless use of pre-trained decoding models. Second, to facilitate reliable BTM learning, a Brain Functional Alignment module is proposed to perform soft cross-subject brain alignment under different visual stimuli with a multi-level brain alignment loss, uncovering fine-grained functional correspondences with high interpretability. Experiments indicate that MindAligner not only outperforms existing methods in visual decoding under data-limited conditions, but also provides valuable neuroscience insights in cross-subject functional analysis. The code will be made publicly available.
△ Less
Submitted 7 February, 2025;
originally announced February 2025.
-
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
Authors:
DiJia Su,
Hanlin Zhu,
Yingchen Xu,
Jiantao Jiao,
Yuandong Tian,
Qinqing Zheng
Abstract:
Large Language Models (LLMs) excel at reasoning and planning when trained on chainof-thought (CoT) data, where the step-by-step thought process is explicitly outlined by text tokens. However, this results in lengthy inputs where many words support textual coherence rather than core reasoning information, and processing these inputs consumes substantial computation resources. In this work, we propo…
▽ More
Large Language Models (LLMs) excel at reasoning and planning when trained on chainof-thought (CoT) data, where the step-by-step thought process is explicitly outlined by text tokens. However, this results in lengthy inputs where many words support textual coherence rather than core reasoning information, and processing these inputs consumes substantial computation resources. In this work, we propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens generated by VQ-VAE, significantly reducing the length of reasoning traces. We explore the use of latent trace abstractions in two scenarios: 1) training the model from scratch for the Keys-Finding Maze problem, 2) fine-tuning LLMs on this hybrid data with an extended vocabulary including unseen latent tokens, for both logical and mathematical reasoning problems. To facilitate effective learning, we introduce a simple training procedure that randomly mixes latent and text tokens, which enables fast adaptation to new latent tokens. Our approach consistently outperforms the baselines methods in various benchmarks.
△ Less
Submitted 5 February, 2025;
originally announced February 2025.
-
Sub-Sequential Physics-Informed Learning with State Space Model
Authors:
Chenhui Xu,
Dancheng Liu,
Yuting Hu,
Jiajie Li,
Ruiyang Qin,
Qingxiao Zheng,
Jinjun Xiong
Abstract:
Physics-Informed Neural Networks (PINNs) are a kind of deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between PDE's continuity and PINN's discrete samp…
▽ More
Physics-Informed Neural Networks (PINNs) are a kind of deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between PDE's continuity and PINN's discrete sampling. We reveal that the State Space Model (SSM) can be a continuous-discrete articulation allowing initial condition propagation, and that simplicity bias can be eliminated by aligning a sequence of moderate granularity. Accordingly, we propose PINNMamba, a novel framework that introduces sub-sequence modeling with SSM. Experimental results show that PINNMamba can reduce errors by up to 86.3\% compared with state-of-the-art architecture. Our code is available at https://github.com/miniHuiHui/PINNMamba.
△ Less
Submitted 31 January, 2025;
originally announced February 2025.
-
An Atomic Skill Library Construction Method for Data-Efficient Embodied Manipulation
Authors:
Dongjiang Li,
Bo Peng,
Chang Li,
Ning Qiao,
Qi Zheng,
Lei Sun,
Yusen Qin,
Bangguo Li,
Yifeng Luan,
Bo Wu,
Yibing Zhan,
Mingang Sun,
Tong Xu,
Lusong Li,
Hui Shen,
Xiaodong He
Abstract:
Embodied manipulation is a fundamental ability in the realm of embodied artificial intelligence. Although current embodied manipulation models show certain generalizations in specific settings, they struggle in new environments and tasks due to the complexity and diversity of real-world scenarios. The traditional end-to-end data collection and training manner leads to significant data demands. Dec…
▽ More
Embodied manipulation is a fundamental ability in the realm of embodied artificial intelligence. Although current embodied manipulation models show certain generalizations in specific settings, they struggle in new environments and tasks due to the complexity and diversity of real-world scenarios. The traditional end-to-end data collection and training manner leads to significant data demands. Decomposing end-to-end tasks into atomic skills helps reduce data requirements and improves the task success rate. However, existing methods are limited by predefined skill sets that cannot be dynamically updated. To address the issue, we introduce a three-wheeled data-driven method to build an atomic skill library. We divide tasks into subtasks using the Vision-Language-Planning (VLP). Then, atomic skill definitions are formed by abstracting the subtasks. Finally, an atomic skill library is constructed via data collection and Vision-Language-Action (VLA) fine-tuning. As the atomic skill library expands dynamically with the three-wheel update strategy, the range of tasks it can cover grows naturally. In this way, our method shifts focus from end-to-end tasks to atomic skills, significantly reducing data costs while maintaining high performance and enabling efficient adaptation to new tasks. Extensive experiments in real-world settings demonstrate the effectiveness and efficiency of our approach.
△ Less
Submitted 5 February, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Enhancing accuracy of uncertainty estimation in appearance-based gaze tracking with probabilistic evaluation and calibration
Authors:
Qiaojie Zheng,
Jiucai Zhang,
Xiaoli Zhang
Abstract:
Accurately knowing uncertainties in appearance-based gaze tracking is critical for ensuring reliable downstream applications. Due to the lack of individual uncertainty labels, current uncertainty-aware approaches adopt probabilistic models to acquire uncertainties by following distributions in the training dataset. Without regulations, this approach lets the uncertainty model build biases and over…
▽ More
Accurately knowing uncertainties in appearance-based gaze tracking is critical for ensuring reliable downstream applications. Due to the lack of individual uncertainty labels, current uncertainty-aware approaches adopt probabilistic models to acquire uncertainties by following distributions in the training dataset. Without regulations, this approach lets the uncertainty model build biases and overfits the training data, leading to poor performance when deployed. We first presented a strict proper evaluation metric from the probabilistic perspective based on comparing the coverage probability between prediction and observation to provide quantitative evaluation for better assessment on the inferred uncertainties. We then proposed a correction strategy based on probability calibration to mitigate biases in the estimated uncertainties of the trained models. Finally, we demonstrated the effectiveness of the correction strategy with experiments performed on two popular gaze estimation datasets with distinctive image characteristics caused by data collection settings.
△ Less
Submitted 17 March, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Adaptive Homophily Clustering: Structure Homophily Graph Learning with Adaptive Filter for Hyperspectral Image
Authors:
Yao Ding,
Weijie Kang,
Aitao Yang,
Zhili Zhang,
Junyang Zhao,
Jie Feng,
Danfeng Hong,
Qinhe Zheng
Abstract:
Hyperspectral image (HSI) clustering has been a fundamental but challenging task with zero training labels. Currently, some deep graph clustering methods have been successfully explored for HSI due to their outstanding performance in effective spatial structural information encoding. Nevertheless, insufficient structural information utilization, poor feature presentation ability, and weak graph up…
▽ More
Hyperspectral image (HSI) clustering has been a fundamental but challenging task with zero training labels. Currently, some deep graph clustering methods have been successfully explored for HSI due to their outstanding performance in effective spatial structural information encoding. Nevertheless, insufficient structural information utilization, poor feature presentation ability, and weak graph update capability limit their performance. Thus, in this paper, a homophily structure graph learning with an adaptive filter clustering method (AHSGC) for HSI is proposed. Specifically, homogeneous region generation is first developed for HSI processing and constructing the original graph. Afterward, an adaptive filter graph encoder is designed to adaptively capture the high and low frequency features on the graph for subsequence processing. Then, a graph embedding clustering self-training decoder is developed with KL Divergence, with which the pseudo-label is generated for network training. Meanwhile, homophily-enhanced structure learning is introduced to update the graph according to the clustering task, in which the orient correlation estimation is adopted to estimate the node connection, and graph edge sparsification is designed to adjust the edges in the graph dynamically. Finally, a joint network optimization is introduced to achieve network self-training and update the graph. The K-means is adopted to express the latent features. Extensive experiments and repeated comparative analysis have verified that our AHSGC contains high clustering accuracy, low computational complexity, and strong robustness. The code source will be available at https://github.com/DY-HYX.
△ Less
Submitted 7 January, 2025; v1 submitted 2 January, 2025;
originally announced January 2025.
-
A Robust Federated Learning Framework for Undependable Devices at Scale
Authors:
Shilong Wang,
Jianchun Liu,
Hongli Xu,
Chunming Qiao,
Huarong Deng,
Qiuye Zheng,
Jiantao Gong
Abstract:
In a federated learning (FL) system, many devices, such as smartphones, are often undependable (e.g., frequently disconnected from WiFi) during training. Existing FL frameworks always assume a dependable environment and exclude undependable devices from training, leading to poor model performance and resource wastage. In this paper, we propose FLUDE to effectively deal with undependable environmen…
▽ More
In a federated learning (FL) system, many devices, such as smartphones, are often undependable (e.g., frequently disconnected from WiFi) during training. Existing FL frameworks always assume a dependable environment and exclude undependable devices from training, leading to poor model performance and resource wastage. In this paper, we propose FLUDE to effectively deal with undependable environments. First, FLUDE assesses the dependability of devices based on the probability distribution of their historical behaviors (e.g., the likelihood of successfully completing training). Based on this assessment, FLUDE adaptively selects devices with high dependability for training. To mitigate resource wastage during the training phase, FLUDE maintains a model cache on each device, aiming to preserve the latest training state for later use in case local training on an undependable device is interrupted. Moreover, FLUDE proposes a staleness-aware strategy to judiciously distribute the global model to a subset of devices, thus significantly reducing resource wastage while maintaining model performance. We have implemented FLUDE on two physical platforms with 120 smartphones and NVIDIA Jetson devices. Extensive experimental results demonstrate that FLUDE can effectively improve model performance and resource efficiency of FL training in undependable environments.
△ Less
Submitted 27 December, 2024;
originally announced December 2024.
-
Darkit: A User-Friendly Software Toolkit for Spiking Large Language Model
Authors:
Xin Du,
Shifan Ye,
Qian Zheng,
Yangfan Hu,
Rui Yan,
Shunyu Qi,
Shuyang Chen,
Huajin Tang,
Gang Pan,
Shuiguang Deng
Abstract:
Large language models (LLMs) have been widely applied in various practical applications, typically comprising billions of parameters, with inference processes requiring substantial energy and computational resources. In contrast, the human brain, employing bio-plausible spiking mechanisms, can accomplish the same tasks while significantly reducing energy consumption, even with a similar number of…
▽ More
Large language models (LLMs) have been widely applied in various practical applications, typically comprising billions of parameters, with inference processes requiring substantial energy and computational resources. In contrast, the human brain, employing bio-plausible spiking mechanisms, can accomplish the same tasks while significantly reducing energy consumption, even with a similar number of parameters. Based on this, several pioneering researchers have proposed and implemented various large language models that leverage spiking neural networks. They have demonstrated the feasibility of these models, validated their performance, and open-sourced their frameworks and partial source code. To accelerate the adoption of brain-inspired large language models and facilitate secondary development for researchers, we are releasing a software toolkit named DarwinKit (Darkit). The toolkit is designed specifically for learners, researchers, and developers working on spiking large models, offering a suite of highly user-friendly features that greatly simplify the learning, deployment, and development processes.
△ Less
Submitted 20 December, 2024;
originally announced December 2024.
-
ST-ReP: Learning Predictive Representations Efficiently for Spatial-Temporal Forecasting
Authors:
Qi Zheng,
Zihao Yao,
Yaying Zhang
Abstract:
Spatial-temporal forecasting is crucial and widely applicable in various domains such as traffic, energy, and climate. Benefiting from the abundance of unlabeled spatial-temporal data, self-supervised methods are increasingly adapted to learn spatial-temporal representations. However, it encounters three key challenges: 1) the difficulty in selecting reliable negative pairs due to the homogeneity…
▽ More
Spatial-temporal forecasting is crucial and widely applicable in various domains such as traffic, energy, and climate. Benefiting from the abundance of unlabeled spatial-temporal data, self-supervised methods are increasingly adapted to learn spatial-temporal representations. However, it encounters three key challenges: 1) the difficulty in selecting reliable negative pairs due to the homogeneity of variables, hindering contrastive learning methods; 2) overlooking spatial correlations across variables over time; 3) limitations of efficiency and scalability in existing self-supervised learning methods. To tackle these, we propose a lightweight representation-learning model ST-ReP, integrating current value reconstruction and future value prediction into the pre-training framework for spatial-temporal forecasting. And we design a new spatial-temporal encoder to model fine-grained relationships. Moreover, multi-time scale analysis is incorporated into the self-supervised loss to enhance predictive capability. Experimental results across diverse domains demonstrate that the proposed model surpasses pre-training-based baselines, showcasing its ability to learn compact and semantically enriched representations while exhibiting superior scalability.
△ Less
Submitted 19 December, 2024;
originally announced December 2024.
-
CLDG: Contrastive Learning on Dynamic Graphs
Authors:
Yiming Xu,
Bin Shi,
Teng Ma,
Bo Dong,
Haoyi Zhou,
Qinghua Zheng
Abstract:
The graph with complex annotations is the most potent data type, whose constantly evolving motivates further exploration of the unsupervised dynamic graph representation. One of the representative paradigms is graph contrastive learning. It constructs self-supervised signals by maximizing the mutual information between the statistic graph's augmentation views. However, the semantics and labels may…
▽ More
The graph with complex annotations is the most potent data type, whose constantly evolving motivates further exploration of the unsupervised dynamic graph representation. One of the representative paradigms is graph contrastive learning. It constructs self-supervised signals by maximizing the mutual information between the statistic graph's augmentation views. However, the semantics and labels may change within the augmentation process, causing a significant performance drop in downstream tasks. This drawback becomes greatly magnified on dynamic graphs. To address this problem, we designed a simple yet effective framework named CLDG. Firstly, we elaborate that dynamic graphs have temporal translation invariance at different levels. Then, we proposed a sampling layer to extract the temporally-persistent signals. It will encourage the node to maintain consistent local and global representations, i.e., temporal translation invariance under the timespan views. The extensive experiments demonstrate the effectiveness and efficiency of the method on seven datasets by outperforming eight unsupervised state-of-the-art baselines and showing competitiveness against four semi-supervised methods. Compared with the existing dynamic graph method, the number of model parameters and training time is reduced by an average of 2,001.86 times and 130.31 times on seven datasets, respectively.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
Generating Unseen Nonlinear Evolution in Sea Surface Temperature Using a Deep Learning-Based Latent Space Data Assimilation Framework
Authors:
Qingyu Zheng,
Guijun Han,
Wei Li,
Lige Cao,
Gongfu Zhou,
Haowen Wu,
Qi Shao,
Ru Wang,
Xiaobo Wu,
Xudong Cui,
Hong Li,
Xuan Wang
Abstract:
Advances in data assimilation (DA) methods have greatly improved the accuracy of Earth system predictions. To fuse multi-source data and reconstruct the nonlinear evolution missing from observations, geoscientists are developing future-oriented DA methods. In this paper, we redesign a purely data-driven latent space DA framework (DeepDA) that employs a generative artificial intelligence model to c…
▽ More
Advances in data assimilation (DA) methods have greatly improved the accuracy of Earth system predictions. To fuse multi-source data and reconstruct the nonlinear evolution missing from observations, geoscientists are developing future-oriented DA methods. In this paper, we redesign a purely data-driven latent space DA framework (DeepDA) that employs a generative artificial intelligence model to capture the nonlinear evolution in sea surface temperature. Under variational constraints, DeepDA embedded with nonlinear features can effectively fuse heterogeneous data. The results show that DeepDA remains highly stable in capturing and generating nonlinear evolutions even when a large amount of observational information is missing. It can be found that when only 10% of the observation information is available, the error increase of DeepDA does not exceed 40%. Furthermore, DeepDA has been shown to be robust in the fusion of real observations and ensemble simulations. In particular, this paper provides a mechanism analysis of the nonlinear evolution generated by DeepDA from the perspective of physical patterns, which reveals the inherent explainability of our DL model in capturing multi-scale ocean signals.
△ Less
Submitted 17 December, 2024;
originally announced December 2024.
-
Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation
Authors:
Juntao Dai,
Yaodong Yang,
Qian Zheng,
Gang Pan
Abstract:
A key aspect of Safe Reinforcement Learning (Safe RL) involves estimating the constraint condition for the next policy, which is crucial for guiding the optimization of safe policy updates. However, the existing Advantage-based Estimation (ABE) method relies on the infinite-horizon discounted advantage function. This dependence leads to catastrophic errors in finite-horizon scenarios with non-disc…
▽ More
A key aspect of Safe Reinforcement Learning (Safe RL) involves estimating the constraint condition for the next policy, which is crucial for guiding the optimization of safe policy updates. However, the existing Advantage-based Estimation (ABE) method relies on the infinite-horizon discounted advantage function. This dependence leads to catastrophic errors in finite-horizon scenarios with non-discounted constraints, resulting in safety-violation updates. In response, we propose the first estimation method for finite-horizon non-discounted constraints in deep Safe RL, termed Gradient-based Estimation (GBE), which relies on the analytic gradient derived along trajectories. Our theoretical and empirical analyses demonstrate that GBE can effectively estimate constraint changes over a finite horizon. Constructing a surrogate optimization problem with GBE, we developed a novel Safe RL algorithm called Constrained Gradient-based Policy Optimization (CGPO). CGPO identifies feasible optimal policies by iteratively resolving sub-problems within trust regions. Our empirical results reveal that CGPO, unlike baseline algorithms, successfully estimates the constraint functions of subsequent policies, thereby ensuring the efficiency and feasibility of each update.
△ Less
Submitted 15 December, 2024;
originally announced December 2024.
-
How Well Can Modern LLMs Act as Agent Cores in Radiology Environments?
Authors:
Qiaoyu Zheng,
Chaoyi Wu,
Pengcheng Qiu,
Lisong Dai,
Ya Zhang,
Yanfeng Wang,
Weidi Xie
Abstract:
We introduce RadA-BenchPlat, an evaluation platform that benchmarks the performance of large language models (LLMs) act as agent cores in radiology environments using 2,200 radiologist-verified synthetic patient records covering six anatomical regions, five imaging modalities, and 2,200 disease scenarios, resulting in 24,200 question-answer pairs that simulate diverse clinical situations. The plat…
▽ More
We introduce RadA-BenchPlat, an evaluation platform that benchmarks the performance of large language models (LLMs) act as agent cores in radiology environments using 2,200 radiologist-verified synthetic patient records covering six anatomical regions, five imaging modalities, and 2,200 disease scenarios, resulting in 24,200 question-answer pairs that simulate diverse clinical situations. The platform also defines ten categories of tools for agent-driven task solving and evaluates seven leading LLMs, revealing that while models like Claude-3.7-Sonnet can achieve a 67.1% task completion rate in routine settings, they still struggle with complex task understanding and tool coordination, limiting their capacity to serve as the central core of automated radiology systems. By incorporating four advanced prompt engineering strategies--where prompt-backpropagation and multi-agent collaboration contributed 16.8% and 30.7% improvements, respectively--the performance for complex tasks was enhanced by 48.2% overall. Furthermore, automated tool building was explored to improve robustness, achieving a 65.4% success rate, thereby offering promising insights for the future integration of fully automated radiology applications into clinical practice. All of our code and data are openly available at https://github.com/MAGIC-AI4Med/RadABench.
△ Less
Submitted 7 April, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Unicorn: Unified Neural Image Compression with One Number Reconstruction
Authors:
Qi Zheng,
Haozhi Wang,
Zihao Liu,
Jiaming Liu,
Peiye Liu,
Zhijian Hao,
Yanheng Lu,
Dimin Niu,
Jinjia Zhou,
Minge Jing,
Yibo Fan
Abstract:
Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former is encountering impasses of either leveling off bitrate reduction at a cost of tremendous complexity while the latter suffers from excessiv…
▽ More
Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former is encountering impasses of either leveling off bitrate reduction at a cost of tremendous complexity while the latter suffers from excessive smoothing quality as well as lengthy decoder models. In this paper, we propose an innovative paradigm, which we dub \textbf{Unicorn} (\textbf{U}nified \textbf{N}eural \textbf{I}mage \textbf{C}ompression with \textbf{O}ne \textbf{N}number \textbf{R}econstruction). By conceptualizing the images as index-image pairs and learning the inherent distribution of pairs in a subtle neural network model, Unicorn can reconstruct a visually pleasing image from a randomly generated noise with only one index number. The neural model serves as the unified decoder of images while the noises and indexes corresponds to explicit representations. As a proof of concept, we propose an effective and efficient prototype of Unicorn based on latent diffusion models with tailored model designs. Quantitive and qualitative experimental results demonstrate that our prototype achieves significant bitrates reduction compared with EIC and IIC algorithms. More impressively, benefitting from the unified decoder, our compression ratio escalates as the quantity of images increases. We envision that more advanced model designs will endow Unicorn with greater potential in image compression. We will release our codes in \url{https://github.com/uniqzheng/Unicorn-Laduree}.
△ Less
Submitted 11 December, 2024;
originally announced December 2024.
-
Video Quality Assessment: A Comprehensive Survey
Authors:
Qi Zheng,
Yibo Fan,
Leilei Huang,
Tianyu Zhu,
Jiaming Liu,
Zhijian Hao,
Shuo Xing,
Chia-Ju Chen,
Xiongkuo Min,
Alan C. Bovik,
Zhengzhong Tu
Abstract:
Video quality assessment (VQA) is an important processing task, aiming at predicting the quality of videos in a manner highly consistent with human judgments of perceived quality. Traditional VQA models based on natural image and/or video statistics, which are inspired both by models of projected images of the real world and by dual models of the human visual system, deliver only limited predictio…
▽ More
Video quality assessment (VQA) is an important processing task, aiming at predicting the quality of videos in a manner highly consistent with human judgments of perceived quality. Traditional VQA models based on natural image and/or video statistics, which are inspired both by models of projected images of the real world and by dual models of the human visual system, deliver only limited prediction performances on real-world user-generated content (UGC), as exemplified in recent large-scale VQA databases containing large numbers of diverse video contents crawled from the web. Fortunately, recent advances in deep neural networks and Large Multimodality Models (LMMs) have enabled significant progress in solving this problem, yielding better results than prior handcrafted models. Numerous deep learning-based VQA models have been developed, with progress in this direction driven by the creation of content-diverse, large-scale human-labeled databases that supply ground truth psychometric video quality data. Here, we present a comprehensive survey of recent progress in the development of VQA algorithms and the benchmarking studies and databases that make them possible. We also analyze open research directions on study design and VQA algorithm architectures. Github link: https://github.com/taco-group/Video-Quality-Assessment-A-Comprehensive-Survey.
△ Less
Submitted 11 December, 2024; v1 submitted 4 December, 2024;
originally announced December 2024.
-
M3-CVC: Controllable Video Compression with Multimodal Generative Models
Authors:
Rui Wan,
Qi Zheng,
Yibo Fan
Abstract:
Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For ea…
▽ More
Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For each keyframe and its corresponding video clip, a dialogue-based large multimodal model (LMM) approach extracts hierarchical spatiotemporal details, enabling both inter-frame and intra-frame representations for improved video fidelity while enhancing encoding interpretability. M3-CVC further employs a conditional diffusion-based, text-guided keyframe compression method, achieving high fidelity in frame reconstruction. During decoding, textual descriptions derived from LMMs guide the diffusion process to restore the original video's content accurately. Experimental results demonstrate that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios, particularly in preserving semantic and perceptual fidelity.
△ Less
Submitted 25 December, 2024; v1 submitted 24 November, 2024;
originally announced November 2024.
-
Neuro-3D: Towards 3D Visual Decoding from EEG Signals
Authors:
Zhanqiang Guo,
Jiamin Wu,
Yonghao Song,
Jiahui Bu,
Weijian Mai,
Qihao Zheng,
Wanli Ouyang,
Chunfeng Song
Abstract:
Human's perception of the visual world is shaped by the stereo processing of 3D information. Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of ne…
▽ More
Human's perception of the visual world is shaped by the stereo processing of 3D information. Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of neural dynamics enriched with complex visual cues. To provide the essential benchmark, we first present EEG-3D, a pioneering dataset featuring multimodal analysis data and extensive EEG recordings from 12 subjects viewing 72 categories of 3D objects rendered in both videos and images. Furthermore, we propose Neuro-3D, a 3D visual decoding framework based on EEG signals. This framework adaptively integrates EEG features derived from static and dynamic stimuli to learn complementary and robust neural representations, which are subsequently utilized to recover both the shape and color of 3D objects through the proposed diffusion-based colored point cloud decoder. To the best of our knowledge, we are the first to explore EEG-based 3D visual decoding. Experiments indicate that Neuro-3D not only reconstructs colored 3D objects with high fidelity, but also learns effective neural representations that enable insightful brain region analysis. The dataset and associated code will be made publicly available.
△ Less
Submitted 21 November, 2024; v1 submitted 19 November, 2024;
originally announced November 2024.
-
Collaborative UAVs Multi-task Video Processing Optimization Based on Enhanced Distributed Actor-Critic Networks
Authors:
Ziqi Rong,
Qiushi Zheng,
Zhishu Shen,
Xiaolong Li,
Tiehua Zhang,
Zheng Lei,
Jiong Jin
Abstract:
With the rapid advancement of the Internet of Things (IoT) and Artificial Intelligence (AI), intelligent information services are being increasingly integrated across various sectors, including healthcare, industry, and transportation. Traditional solutions rely on centralized cloud processing, which encounters considerable challenges in fulfilling the Quality of Service (QoS) requirements of Comp…
▽ More
With the rapid advancement of the Internet of Things (IoT) and Artificial Intelligence (AI), intelligent information services are being increasingly integrated across various sectors, including healthcare, industry, and transportation. Traditional solutions rely on centralized cloud processing, which encounters considerable challenges in fulfilling the Quality of Service (QoS) requirements of Computer Vision (CV) tasks generated in the resource-constrained infrastructure-less environments. In this paper, we introduce a distributed framework called CoUAV-Pro for multi-task video processing powered by Unmanned Aerial Vehicles (UAVs). This framework empowers multiple UAVs to meet the service demands of various computer vision (CV) tasks in infrastructure-less environments, thereby eliminating the need for centralized processing. Specifically, we develop a novel task allocation algorithm that leverages enhanced distributed actor-critic networks within CoUAV-Pro, aiming to optimize task processing efficiency while contending with constraints associated with UAV's energy, computational, and communication resources. Comprehensive experiments demonstrate that our proposed solution achieves satisfactory performance levels against those of centralized methods across key metrics including task acquisition rates, task latency, and energy consumption.
△ Less
Submitted 16 November, 2024;
originally announced November 2024.
-
Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding
Authors:
Zirui Shao,
Chuwei Luo,
Zhaoqing Zhu,
Hangdi Xing,
Zhi Yu,
Qi Zheng,
Jiajun Bu
Abstract:
Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand in recent years. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, current MLLMs often face conflicts between perception and cognition. Taking a document VQA…
▽ More
Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand in recent years. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands." Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 68.6% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. This method first ensures task-specific consistency and then connects the cognitive and perceptual knowledge. Our method significantly reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks in most scenarios.
△ Less
Submitted 12 November, 2024;
originally announced November 2024.
-
A Sharded Blockchain-Based Secure Federated Learning Framework for LEO Satellite Networks
Authors:
Wenbo Wu,
Cheng Tan,
Kangcheng Yang,
Zhishu Shen,
Qiushi Zheng,
Jiong Jin
Abstract:
Low Earth Orbit (LEO) satellite networks are increasingly essential for space-based artificial intelligence (AI) applications. However, as commercial use expands, LEO satellite networks face heightened cyberattack risks, especially through satellite-to-satellite communication links, which are more vulnerable than ground-based connections. As the number of operational satellites continues to grow,…
▽ More
Low Earth Orbit (LEO) satellite networks are increasingly essential for space-based artificial intelligence (AI) applications. However, as commercial use expands, LEO satellite networks face heightened cyberattack risks, especially through satellite-to-satellite communication links, which are more vulnerable than ground-based connections. As the number of operational satellites continues to grow, addressing these security challenges becomes increasingly critical. Traditional approaches, which focus on sending models to ground stations for validation, often overlook the limited communication windows available to LEO satellites, leaving critical security risks unaddressed. To tackle these challenges, we propose a sharded blockchain-based federated learning framework for LEO networks, called SBFL-LEO. This framework improves the reliability of inter-satellite communications using blockchain technology and assigns specific roles to each satellite. Miner satellites leverage cosine similarity (CS) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to identify malicious models and monitor each other to detect inaccurate aggregated models. Security analysis and experimental results demonstrate that our approach outperforms baseline methods in both model accuracy and energy efficiency, significantly enhancing system robustness against attacks.
△ Less
Submitted 9 November, 2024;
originally announced November 2024.
-
Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models
Authors:
Jianqun Zhou,
Yuanlei Zheng,
Wei Chen,
Qianqian Zheng,
Hui Su,
Wei Zhang,
Rui Meng,
Xiaoyu Shen
Abstract:
Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances, most of them still relies on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these p…
▽ More
Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances, most of them still relies on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance, which neglects the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics -- Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE) to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size enhance performance, most models still fall short of instruction compliance.
△ Less
Submitted 5 March, 2025; v1 submitted 31 October, 2024;
originally announced October 2024.
-
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
Authors:
Qinqing Zheng,
Mikael Henaff,
Amy Zhang,
Aditya Grover,
Brandon Amos
Abstract:
Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: t…
▽ More
Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose \oni, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent's collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple unified process, solely using the agent's gathered experience, without requiring external datasets. We make our code available at \url{https://github.com/facebookresearch/oni}.
△ Less
Submitted 17 December, 2024; v1 submitted 30 October, 2024;
originally announced October 2024.
-
Application of an ANN and LSTM-based Ensemble Model for Stock Market Prediction
Authors:
Fang Liu,
Shaobo Guo,
Qianwen Xing,
Xinye Sha,
Ying Chen,
Yuhui Jin,
Qi Zheng,
Chang Yu
Abstract:
Stock trading has always been a key economic indicator in modern society and a primary source of profit for financial giants such as investment banks, quantitative trading firms, and hedge funds. Discovering the underlying patterns within the seemingly volatile yet intrinsically structured economic activities has become a central focus of research for many companies. Our study leverages widely-use…
▽ More
Stock trading has always been a key economic indicator in modern society and a primary source of profit for financial giants such as investment banks, quantitative trading firms, and hedge funds. Discovering the underlying patterns within the seemingly volatile yet intrinsically structured economic activities has become a central focus of research for many companies. Our study leverages widely-used modern financial forecasting algorithms, including LSTM, ANN, CNN, and BiLSTM. We begin by comparing the predictive performance of these well-known algorithms on our stock market data, utilizing metrics such as R2, MAE, MSE, RMSE for detailed evaluation. Based on the performance of these models, we then aim to combine their strengths while mitigating their weaknesses, striving to construct a powerful hybrid model that overcomes the performance limitations of individual models.Through rigorous experimentation and exploration, we ultimately developed an LSTM+ANN model that breaks through prior performance bottlenecks, achieving promising and exciting results.
△ Less
Submitted 13 November, 2024; v1 submitted 26 October, 2024;
originally announced October 2024.
-
SeisGPT: A Physics-Informed Data-Driven Large Model for Real-Time Seismic Response Prediction
Authors:
Shiqiao Meng,
Ying Zhou,
Qinghua Zheng,
Bingxu Liao,
Mushi Chang,
Tianshu Zhang,
Abderrahim Djerrad
Abstract:
Accurately predicting the dynamic responses of building structures under seismic loads is essential for ensuring structural safety and minimizing potential damage. This critical aspect of structural analysis allows engineers to evaluate how structures perform under various loading conditions, facilitating informed design and safety decisions. Traditional methods, which rely on complex finite eleme…
▽ More
Accurately predicting the dynamic responses of building structures under seismic loads is essential for ensuring structural safety and minimizing potential damage. This critical aspect of structural analysis allows engineers to evaluate how structures perform under various loading conditions, facilitating informed design and safety decisions. Traditional methods, which rely on complex finite element models often struggle with balancing computational efficiency and accuracy. To address this challenge, we introduce SeisGPT, a data-driven, large physics-informed model that leverages deep neural networks based on the Generative Pre-trained Transformer (GPT) architecture. SeisGPT is designed to predict, in real-time the dynamic behavior of building structures under seismic forces. Trained on a diverse corpus of seismic data and structural engineering principles, it instantly generates predictive responses, including displacement, acceleration, and inter-story drift, with high accuracy and computational efficiency. Its adaptability across various building typologies and seismic intensities makes this framework a valuable tool for designing robust structures and assessing seismic risk. Through comprehensive validation, this approach exhibits superior performance, offering engineers and researchers a powerful tool for assessing seismic response and informing resilient design strategies. This innovative framework represents a significant advancement in seismic engineering practice, with potential applications in mitigating seismic hazards and enhancing structural resilience.
△ Less
Submitted 26 October, 2024;
originally announced October 2024.
-
A Robust and Efficient Visual-Inertial Initialization with Probabilistic Normal Epipolar Constraint
Authors:
Changshi Mu,
Daquan Feng,
Qi Zheng,
Yuan Zhuang
Abstract:
Accurate and robust initialization is essential for Visual-Inertial Odometry (VIO), as poor initialization can severely degrade pose accuracy. During initialization, it is crucial to estimate parameters such as accelerometer bias, gyroscope bias, initial velocity, gravity, etc. Most existing VIO initialization methods adopt Structure from Motion (SfM) to solve for gyroscope bias. However, SfM is n…
▽ More
Accurate and robust initialization is essential for Visual-Inertial Odometry (VIO), as poor initialization can severely degrade pose accuracy. During initialization, it is crucial to estimate parameters such as accelerometer bias, gyroscope bias, initial velocity, gravity, etc. Most existing VIO initialization methods adopt Structure from Motion (SfM) to solve for gyroscope bias. However, SfM is not stable and efficient enough in fast-motion or degenerate scenes. To overcome these limitations, we extended the rotation-translation-decoupled framework by adding new uncertainty parameters and optimization modules. First, we adopt a gyroscope bias estimator that incorporates probabilistic normal epipolar constraints. Second, we fuse IMU and visual measurements to solve for velocity, gravity, and scale efficiently. Finally, we design an additional refinement module that effectively reduces gravity and scale errors. Extensive EuRoC dataset tests show that our method reduces gyroscope bias and rotation errors by 16\% and 4\% on average, and gravity error by 29\% on average. On the TUM dataset, our method reduces the gravity error and scale error by 14.2\% and 5.7\% on average respectively. The source code is available at https://github.com/MUCS714/DRT-PNEC.git
△ Less
Submitted 18 February, 2025; v1 submitted 25 October, 2024;
originally announced October 2024.
-
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
Authors:
DiJia Su,
Sainbayar Sukhbaatar,
Michael Rabbat,
Yuandong Tian,
Qinqing Zheng
Abstract:
In human cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Recent studies have shown that incorporating System 2 process into Transformers including large language models (LLMs), significantly enhances their reasoning capabilities. Nevertheless, models that purely resemble System 2 thinking require substantia…
▽ More
In human cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Recent studies have shown that incorporating System 2 process into Transformers including large language models (LLMs), significantly enhances their reasoning capabilities. Nevertheless, models that purely resemble System 2 thinking require substantially higher computational costs and are much slower to respond. To address this challenge, we present Dualformer, a single Transformer model that seamlessly integrates both the fast and slow reasoning modes. Dualformer is obtained by training on data with randomized reasoning traces, where different parts of the traces are dropped during training. The dropping strategies are specifically tailored according to the trace structure, analogous to analyzing our thinking process and creating shortcuts with patterns. At inference time, our model can be configured to output only the solutions (fast mode) or both the reasoning chain and the final solution (slow mode), or automatically decide which mode to engage (auto mode). In all cases, Dualformer outperforms the corresponding baseline models in both performance and computational efficiency: (1) in slow mode, Dualformer optimally solves unseen 30 x 30 maze navigation tasks 97.6% of the time, surpassing the Searchformer (trained on data with complete reasoning traces) baseline performance of 93.3%, while only using 45.5% fewer reasoning steps; (2) in fast mode, Dualformer completes those tasks with an 80% optimal rate, significantly outperforming the Solution-Only model (trained on solution-only data), which has an optimal rate of only 30%. For math problems, our techniques have also achieved improved performance with LLM fine-tuning, showing its generalization beyond task-specific models.
△ Less
Submitted 10 April, 2025; v1 submitted 13 October, 2024;
originally announced October 2024.
-
Spiking GS: Towards High-Accuracy and Low-Cost Surface Reconstruction via Spiking Neuron-based Gaussian Splatting
Authors:
Weixing Zhang,
Zongrui Li,
De Ma,
Huajin Tang,
Xudong Jiang,
Qian Zheng,
Gang Pan
Abstract:
3D Gaussian Splatting is capable of reconstructing 3D scenes in minutes. Despite recent advances in improving surface reconstruction accuracy, the reconstructed results still exhibit bias and suffer from inefficiency in storage and training. This paper provides a different observation on the cause of the inefficiency and the reconstruction bias, which is attributed to the integration of the low-op…
▽ More
3D Gaussian Splatting is capable of reconstructing 3D scenes in minutes. Despite recent advances in improving surface reconstruction accuracy, the reconstructed results still exhibit bias and suffer from inefficiency in storage and training. This paper provides a different observation on the cause of the inefficiency and the reconstruction bias, which is attributed to the integration of the low-opacity parts (LOPs) of the generated Gaussians. We show that LOPs consist of Gaussians with overall low-opacity (LOGs) and the low-opacity tails (LOTs) of Gaussians. We propose Spiking GS to reduce such two types of LOPs by integrating spiking neurons into the Gaussian Splatting pipeline. Specifically, we introduce global and local full-precision integrate-and-fire spiking neurons to the opacity and representation function of flattened 3D Gaussians, respectively. Furthermore, we enhance the density control strategy with spiking neurons' thresholds and a new criterion on the scale of Gaussians. Our method can represent more accurate reconstructed surfaces at a lower cost. The supplementary material and code are available at https://github.com/zju-bmi-lab/SpikingGS.
△ Less
Submitted 3 December, 2024; v1 submitted 8 October, 2024;
originally announced October 2024.
-
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models
Authors:
Cong Guo,
Feng Cheng,
Zhixu Du,
James Kiessling,
Jonathan Ku,
Shiyu Li,
Ziru Li,
Mingyuan Ma,
Tergel Molom-Ochir,
Benjamin Morris,
Haoxuan Shan,
Jingwei Sun,
Yitu Wang,
Chiyue Wei,
Xueying Wu,
Yuhao Wu,
Hao Frank Yang,
Jingyang Zhang,
Junyao Zhang,
Qilin Zheng,
Guanglei Zhou,
Hai,
Li,
Yiran Chen
Abstract:
The rapid development of large language models (LLMs) has significantly transformed the field of artificial intelligence, demonstrating remarkable capabilities in natural language processing and moving towards multi-modal functionality. These models are increasingly integrated into diverse applications, impacting both research and industry. However, their development and deployment present substan…
▽ More
The rapid development of large language models (LLMs) has significantly transformed the field of artificial intelligence, demonstrating remarkable capabilities in natural language processing and moving towards multi-modal functionality. These models are increasingly integrated into diverse applications, impacting both research and industry. However, their development and deployment present substantial challenges, including the need for extensive computational resources, high energy consumption, and complex software optimizations. Unlike traditional deep learning systems, LLMs require unique optimization strategies for training and inference, focusing on system-level efficiency. This paper surveys hardware and software co-design approaches specifically tailored to address the unique characteristics and constraints of large language models. This survey analyzes the challenges and impacts of LLMs on hardware and algorithm research, exploring algorithm optimization, hardware design, and system-level innovations. It aims to provide a comprehensive understanding of the trade-offs and considerations in LLM-centric computing systems, guiding future advancements in AI. Finally, we summarize the existing efforts in this space and outline future directions toward realizing production-grade co-design methodologies for the next generation of large language models and AI systems.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Authors:
Yifei Xing,
Xiangyuan Lan,
Ruiping Wang,
Dongmei Jiang,
Wenjun Huang,
Qingfang Zheng,
Yaowei Wang
Abstract:
Mamba-based architectures have shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLM) are insufficient in extracting visual features, leading to imbalanced cross-modal alignment between visual and textural latents, negatively impacting performance on mu…
▽ More
Mamba-based architectures have shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLM) are insufficient in extracting visual features, leading to imbalanced cross-modal alignment between visual and textural latents, negatively impacting performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-model alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code will be provided.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.