-
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
Authors:
Zhiyu Li,
Shichao Song,
Hanyu Wang,
Simin Niu,
Ding Chen,
Jiawei Yang,
Chenyang Xi,
Huayi Lai,
Jihao Zhao,
Yezhaohui Wang,
Junpeng Ren,
Zehao Lin,
Jiahao Huo,
Tianyi Chen,
Kai Chen,
Kehang Li,
Zhiqiang Yin,
Qingchen Yu,
Bo Tang,
Hongkang Yang,
Zhi-Qin John Xu,
Feiyu Xiong
Abstract:
Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.
Submitted 28 May, 2025;
originally announced May 2025.
-
MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems
Authors:
Kai Chen,
Taihang Zhen,
Hewei Wang,
Kailai Liu,
Xinfeng Li,
Jing Huo,
Tianpei Yang,
Jinfeng Xu,
Wei Dong,
Yang Gao
Abstract:
As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi-agent configurations, is paramount. In this paper we introduce MedSentry, a benchmark comprising 5 000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end-to-end attack-defense evaluation pipeline to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from 'dark-personality' agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For instance, SharedPool's open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit greater resilience thanks to inherent redundancy and isolation. To mitigate these risks, we propose a personality-scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring system safety to near-baseline levels. MedSentry thus furnishes both a rigorous evaluation framework and practical defense strategies that guide the design of safer LLM-based multi-agent systems in medical domains.
Submitted 27 May, 2025;
originally announced May 2025.
-
Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
Authors:
Haoming Huang,
Yibo Yan,
Jiahao Huo,
Xin Zou,
Xinfeng Li,
Kun Wang,
Xuming Hu
Abstract:
Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing. By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the internal workings of attention heads, tracing how competing knowledge pathways contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit's effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation.
Submitted 20 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving
Authors:
Muleilan Pei,
Jiayao Shan,
Peiliang Li,
Jieqi Shi,
Jing Huo,
Yang Gao,
Shaojie Shen
Abstract:
Online scene perception and topology reasoning are critical for autonomous vehicles to understand their driving environments, particularly for mapless driving systems that endeavor to reduce reliance on costly High-Definition (HD) maps. However, recent advances in online scene understanding still face limitations, especially in long-range or occluded scenarios, due to the inherent constraints of onboard sensors. To address this challenge, we propose a Standard-Definition (SD) Map Enhanced scene Perception and Topology reasoning (SEPT) framework, which explores how to effectively incorporate the SD map as prior knowledge into existing perception and reasoning pipelines. Specifically, we introduce a novel hybrid feature fusion strategy that combines SD maps with Bird's-Eye-View (BEV) features, considering both rasterized and vectorized representations, while mitigating potential misalignment between SD maps and BEV feature spaces. Additionally, we leverage the SD map characteristics to design an auxiliary intersection-aware keypoint detection task, which further enhances the overall scene understanding performance. Experimental results on the large-scale OpenLane-V2 dataset demonstrate that by effectively integrating SD map priors, our framework significantly improves both scene perception and topology reasoning, outperforming existing methods by a substantial margin.
Submitted 18 May, 2025;
originally announced May 2025.
-
Black-box Adversaries from Latent Space: Unnoticeable Attacks on Human Pose and Shape Estimation
Authors:
Zhiying Li,
Guanggang Geng,
Yeying Jin,
Zhizhi Guo,
Bruce Gu,
Jidong Huo,
Zhaoxin Fan,
Wenjun Wu
Abstract:
Expressive human pose and shape (EHPS) estimation is vital for digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention to potential security vulnerabilities. Current adversarial attacks on EHPS models often require white-box access (e.g., model details or gradients) or generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address these limitations, we propose a novel Unnoticeable Black-Box Attack (UBA) against EHPS models. UBA leverages the latent-space representations of natural images to generate an optimal adversarial noise pattern and iteratively refine its attack potency along an optimized direction in digital space. Crucially, this process relies solely on querying the model's output, requiring no internal knowledge of the EHPS architecture, while guiding the noise optimization toward greater stealth and effectiveness. Extensive experiments and visual analyses demonstrate the superiority of UBA. Notably, UBA increases the pose estimation errors of EHPS models by 17.27%-58.21% on average, revealing critical vulnerabilities. These findings underscore the urgent need to address and mitigate security risks associated with digital human generation systems.
Submitted 17 May, 2025;
originally announced May 2025.
-
T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models
Authors:
Xuyang Guo,
Jiayan Huo,
Zhenmei Shi,
Zhao Song,
Jiahao Zhang,
Jiale Zhao
Abstract:
Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.
Submitted 8 May, 2025;
originally announced May 2025.
-
DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation
Authors:
Zixuan Chen,
Junhui Yin,
Yangtao Chen,
Jing Huo,
Pinzhuo Tian,
Jieqi Shi,
Yiwen Hou,
Yinchuan Li,
Yang Gao
Abstract:
Generalizing language-conditioned multi-task imitation learning (IL) models to novel long-horizon 3D manipulation tasks remains a significant challenge. To address this, we propose DeCo (Task Decomposition and Skill Composition), a model-agnostic framework compatible with various multi-task IL models, designed to enhance their zero-shot generalization to novel, compositional, long-horizon 3D manipulation tasks. DeCo first decomposes IL demonstrations into a set of modular atomic tasks based on the physical interaction between the gripper and objects, and constructs an atomic training dataset that enables models to learn a diverse set of reusable atomic skills during imitation learning. At inference time, DeCo leverages a vision-language model (VLM) to parse high-level instructions for novel long-horizon tasks, retrieve the relevant atomic skills, and dynamically schedule their execution; a spatially-aware skill-chaining module then ensures smooth, collision-free transitions between sequential skills. We evaluate DeCo in simulation using DeCoBench, a benchmark specifically designed to assess zero-shot generalization of multi-task IL models in compositional long-horizon 3D manipulation. Across three representative multi-task IL models (RVT-2, 3DDA, and ARP), DeCo achieves success rate improvements of 66.67%, 21.53%, and 57.92%, respectively, on 12 novel compositional tasks. Moreover, in real-world experiments, a DeCo-enhanced model trained on only 6 atomic tasks successfully completes 9 novel long-horizon tasks, yielding an average success rate improvement of 53.33% over the base multi-task IL model. Video demonstrations are available at: https://deco226.github.io.
Submitted 1 May, 2025;
originally announced May 2025.
-
T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
Authors:
Xuyang Guo,
Jiayan Huo,
Zhenmei Shi,
Zhao Song,
Jiahao Zhang,
Jiale Zhao
Abstract:
Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
Submitted 1 May, 2025;
originally announced May 2025.
-
Multi-Agent Deep Reinforcement Learning for Multiple Anesthetics Collaborative Control
Authors:
Huijie Li,
Yide Yu,
Si Shi,
Anmin Hu,
Jian Huo,
Wei Lin,
Chaoran Wu,
Wuman Luo
Abstract:
Automated control of personalized multiple anesthetics in clinical Total Intravenous Anesthesia (TIVA) is crucial yet challenging. Current systems, including target-controlled infusion (TCI) and closed-loop systems, either rely on relatively static pharmacokinetic/pharmacodynamic (PK/PD) models or focus on single anesthetic control, limiting personalization and collaborative control. To address these issues, we propose a novel framework, Value Decomposition Multi-Agent Deep Reinforcement Learning (VD-MADRL). VD-MADRL optimizes the collaboration between two anesthetics, propofol (Agent I) and remifentanil (Agent II), and uses a Markov Game (MG) to identify optimal actions among heterogeneous agents. We employ various value function decomposition methods to resolve the credit allocation problem and enhance collaborative control. We also introduce a multivariate environment model based on random forest (RF) for anesthesia state simulation. Additionally, a data resampling and alignment technique ensures synchronized trajectory data. Our experiments on general and thoracic surgery datasets show that VD-MADRL performs better than human experience. It improves dose precision and keeps anesthesia states stable, providing great clinical value.
Submitted 7 April, 2025;
originally announced April 2025.
-
Can You Count to Nine? A Human Evaluation Benchmark for Counting Limits in Modern Text-to-Video Models
Authors:
Xuyang Guo,
Zekai Huang,
Jiayan Huo,
Yingyu Liang,
Zhenmei Shi,
Zhao Song,
Jiahao Zhang
Abstract:
Generative models have driven significant progress in a variety of AI tasks, including text-to-video generation, where models like Video LDM and Stable Video Diffusion can produce realistic, movie-level videos from textual instructions. Despite these advances, current text-to-video models still face fundamental challenges in reliably following human commands, particularly in adhering to simple numerical constraints. In this work, we present T2VCountBench, a specialized benchmark aiming at evaluating the counting capability of SOTA text-to-video models as of 2025. Our benchmark employs rigorous human evaluations to measure the number of generated objects and covers a diverse range of generators, covering both open-source and commercial models. Extensive experiments reveal that all existing models struggle with basic numerical tasks, almost always failing to generate videos with an object count of 9 or fewer. Furthermore, our comprehensive ablation studies explore how factors like video style, temporal dynamics, and multilingual inputs may influence counting performance. We also explore prompt refinement techniques and demonstrate that decomposing the task into smaller subtasks does not easily alleviate these limitations. Our findings highlight important challenges in current text-to-video generation and provide insights for future research aimed at improving adherence to basic numerical constraints.
Submitted 5 April, 2025;
originally announced April 2025.
-
DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
Authors:
Chong Li,
Jingyang Huo,
Weikang Gong,
Yanwei Fu,
Xiangyang Xue,
Jianfeng Feng
Abstract:
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.
Submitted 1 April, 2025;
originally announced April 2025.
-
Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation
Authors:
Hongye Cao,
Fan Feng,
Jing Huo,
Shangdong Yang,
Meng Fang,
Tianpei Yang,
Yang Gao
Abstract:
Model-based offline Reinforcement Learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models, rolling out conservative estimates to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot access the environment to gather new data. To address these challenges, we introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed horizon rollout by employing adversarial data augmentation to execute alternating sampling with ensemble models to enrich training data. Specifically, this adversarial process dynamically selects ensemble models against policy for biased sampling, mitigating the optimistic estimation of fixed models, thus robustly expanding the training data for policy optimization. Moreover, a differential factor is integrated into the adversarial process for regularization, ensuring error minimization in extrapolations. This data-augmented optimization adapts to diverse offline tasks without rollout horizon tuning, showing remarkable applicability. Extensive experiments on the D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.
Submitted 26 March, 2025;
originally announced March 2025.
-
MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection
Authors:
Yibo Yan,
Shen Wang,
Jiahao Huo,
Philip S. Yu,
Xuming Hu,
Qingsong Wen
Abstract:
Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of identifying and categorizing student errors in multimodal mathematical contexts. Therefore, we introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. Our approach decomposes error detection into three phases, each handled by a specialized agent: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of mathematical content by explicitly modeling relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. Besides, MathAgent has been successfully deployed in an educational platform that has served over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.
Submitted 20 May, 2025; v1 submitted 23 March, 2025;
originally announced March 2025.
-
Robust Dataset Distillation by Matching Adversarial Trajectories
Authors:
Wei Lai,
Tianyu Ding,
ren dongdong,
Lei Wang,
Jing Huo,
Yang Gao,
Wenbin Li
Abstract:
Dataset distillation synthesizes compact datasets that enable models to achieve performance comparable to training on the original large-scale datasets. However, existing distillation methods overlook the robustness of the model, resulting in models that are vulnerable to adversarial attacks when trained on distilled data. To address this limitation, we introduce the task of "robust dataset distillation", a novel paradigm that embeds adversarial robustness into the synthetic datasets during the distillation process. We propose Matching Adversarial Trajectories (MAT), a method that integrates adversarial training into trajectory-based dataset distillation. MAT incorporates adversarial samples during trajectory generation to obtain robust training trajectories, which are then used to guide the distillation process. As experimentally demonstrated, even through natural training on our distilled dataset, models can achieve enhanced adversarial robustness while maintaining competitive accuracy compared to existing distillation methods. Our work highlights robust dataset distillation as a new and important research direction and provides a strong baseline for future research to bridge the gap between efficient training and adversarial robustness.
Submitted 15 March, 2025;
originally announced March 2025.
-
Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help
Authors:
Yuefan Cao,
Xuyang Guo,
Jiayan Huo,
Yingyu Liang,
Zhenmei Shi,
Zhao Song,
Jiahao Zhang,
Zhen Zhuang
Abstract:
Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability.
Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.
Submitted 9 March, 2025;
originally announced March 2025.
-
New insight into the Rapid Burster by Insight-HXMT
Authors:
Y. P. Chen,
S. Zhang,
S. N. Zhang,
L. Ji,
L. D. Kong,
P. J. Wang,
L. Tao,
M. Y. Ge,
C. Z. Liu,
F. J. Lu,
J. L. Qu,
T. P. Li,
Y. P. Xu,
X. L. Cao,
Y. Chen,
Q. C. Bu,
C. Cai,
Z. Chang,
G. Chen,
L. Chen,
T. X. Chen,
W. W. Cui,
Y. Y. Du,
G. H. Gao,
H. Gao
, et al. (70 additional authors not shown)
Abstract:
We report timing and spectral analyses of the type II X-ray bursts from the Rapid Burster (MXB 1730-335) observed by Insight-HXMT and Swift/XRT. By stacking the long-duration bursts, we find for the first time that the hard X-rays lag the soft X-rays by 3 seconds. However, such a lag is not visible for the short-duration bursts, probably because of the poor statistics. For all bursts the energy spectrum is found to be non-thermal, thanks to the broad band coverage of Insight-HXMT. These findings provide new insights into type II bursts and require a temporarily appearing corona as a possible interpretation.
Submitted 21 February, 2025;
originally announced February 2025.
-
Interleaved Block-based Learned Image Compression with Feature Enhancement and Quantization Error Compensation
Authors:
Shiqi Jiang,
Hui Yuan,
Shuai Li,
Raouf Hamzaoui,
Xu Wang,
Junyan Huo
Abstract:
In recent years, learned image compression (LIC) methods have achieved significant performance improvements. However, obtaining a more compact latent representation and reducing the impact of quantization errors remain key challenges in the field of LIC. To address these challenges, we propose a feature extraction module, a feature refinement module, and a feature enhancement module. Our feature extraction module shuffles the pixels in the image, splits the resulting image into sub-images, and extracts coarse features from the sub-images. Our feature refinement module stacks the coarse features and uses an attention refinement block composed of concatenated three-dimensional convolution residual blocks to learn more compact latent features by exploiting correlations across channels, within sub-images (intra-sub-image correlations), and across sub-images (inter-sub-image correlations). Our feature enhancement module reduces information loss in the decoded features following quantization. We also propose a quantization error compensation module that mitigates the quantization mismatch between training and testing. Our four modules can be readily integrated into state-of-the-art LIC methods. Experiments show that combining our modules with Tiny-LIC outperforms existing LIC methods and image compression standards in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) on the Kodak dataset and the CLIC dataset.
Submitted 20 February, 2025;
originally announced February 2025.
-
EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
Authors:
Jiamin Su,
Yibo Yan,
Fangteng Fu,
Han Zhang,
Jingheng Ye,
Xiang Liu,
Jiahao Huo,
Huiyu Zhou,
Xuming Hu
Abstract:
Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research.
Submitted 20 May, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
Authors:
Hongye Cao,
Yanming Wang,
Sijia Jing,
Ziyue Peng,
Zhixin Bai,
Zhe Cao,
Meng Fang,
Fan Feng,
Boyan Wang,
Jiaheng Liu,
Tianpei Yang,
Jing Huo,
Yang Gao,
Fanyu Meng,
Xi Yang,
Chao Deng,
Junlan Feng
Abstract:
With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess the safety. Additionally, these benchmarks have not taken into account the LLM's capability of identifying and handling unsafe information in detail. To address these issues, we propose a fine-grained benchmark SafeDialBench for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework of LLMs, measuring capabilities in detecting and handling unsafe information and in maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
Submitted 17 February, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models
Authors:
Jiahao Huo,
Yibo Yan,
Xu Zheng,
Yuanhuiyi Lyu,
Xin Zou,
Zhihua Wei,
Xuming Hu
Abstract:
Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to reformulate the task of multimodal MU in the era of MLLMs, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we develop a novel geometry-constrained gradient ascent method MMUnlearner. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that fine-tune MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code can be found in [this URL](https://github.com/Z1zs/MMUnlearner).
Submitted 27 May, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
Causal Information Prioritization for Efficient Reinforcement Learning
Authors:
Hongye Cao,
Fan Feng,
Tianpei Yang,
Jing Huo,
Yang Gao
Abstract:
Current Reinforcement Learning (RL) methods often suffer from sample-inefficiency, resulting from blind exploration strategies that neglect causal relationships among states, actions, and rewards. Although recent causal approaches aim to address this problem, they lack grounded modeling of reward-guided causal understanding of states and actions for goal-orientation, thus impairing learning efficiency. To tackle this issue, we propose a novel method named Causal Information Prioritization (CIP) that improves sample efficiency by leveraging factored MDPs to infer causal relationships between different dimensions of states and actions with respect to rewards, enabling the prioritization of causal information. Specifically, CIP identifies and leverages causal relationships between states and rewards to execute counterfactual data augmentation to prioritize high-impact state features under the causal understanding of the environments. Moreover, CIP integrates a causality-aware empowerment learning objective, which significantly enhances the agent's execution of reward-guided actions for more efficient exploration in complex environments. To fully assess the effectiveness of CIP, we conduct extensive experiments across 39 tasks in 5 diverse continuous control environments, encompassing both locomotion and manipulation skills learning with pixel-based and sparse reward settings. Experimental results demonstrate that CIP consistently outperforms existing RL methods across a wide range of scenarios.
Submitted 14 February, 2025;
originally announced February 2025.
-
Towards Empowerment Gain through Causal Structure Learning in Model-Based RL
Authors:
Hongye Cao,
Fan Feng,
Meng Fang,
Shaokang Dong,
Tianpei Yang,
Jing Huo,
Yang Gao
Abstract:
In Model-Based Reinforcement Learning (MBRL), incorporating causal structures into dynamics models provides agents with a structured understanding of the environments, enabling efficient decision-making. Empowerment as an intrinsic motivation enhances the ability of agents to actively control their environments by maximizing the mutual information between future states and actions. We posit that empowerment coupled with causal understanding can improve controllability, while enhanced empowerment gain can further facilitate causal reasoning in MBRL. To improve learning efficiency and controllability, we propose a novel framework, Empowerment through Causal Learning (ECL), where an agent with the awareness of causal dynamics models achieves empowerment-driven exploration and optimizes its causal structure for task learning. Specifically, ECL operates by first training a causal dynamics model of the environment based on collected data. We then maximize empowerment under the causal structure for exploration, simultaneously using data gathered through exploration to update the causal dynamics model so that it is more controllable than a dense dynamics model without causal structure. In downstream task learning, an intrinsic curiosity reward is included to balance the causality, mitigating overfitting. Importantly, ECL is method-agnostic and is capable of integrating various causal discovery methods. We evaluate ECL combined with 3 causal discovery methods across 6 environments including pixel-based tasks, demonstrating its superior performance compared to other causal MBRL methods, in terms of causal discovery, sample efficiency, and asymptotic performance.
Submitted 14 February, 2025;
originally announced February 2025.
-
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Authors:
Yibo Yan,
Shen Wang,
Jiahao Huo,
Jingheng Ye,
Zhendong Chu,
Xuming Hu,
Philip S. Yu,
Carla Gomes,
Bart Selman,
Qingsong Wen
Abstract:
Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities, and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving MLLM's full potential. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).
Submitted 4 February, 2025;
originally announced February 2025.
-
The critical role of entropy in glass transition kinetics
Authors:
Lijian Song,
Meng Gao,
Juntao Huo,
Li-Min Wang,
Yuanzheng Yue,
Jun-Qiang Wang
Abstract:
Glass transition is a reversible transition that occurs in most amorphous materials. However, the nature of glass transition remains far from being clarified. A key to understanding the glass transition is to clarify what determines the glass transition temperature (Tg) and liquid fragility (m). Here the glass transition thermodynamics for 150 different glass-forming systems are studied statistically. It is found that the activation characters in the energy landscape are crucial to precisely portray the glass transition and, in particular, both the activation free energy (G*) and the activation entropy (S*) play critical roles. G* determines Tg, Tg=G*/290+25.5, while S* determines m, m=S*/Rln10+15, where R is the gas constant. Based on the Boltzmann definition of entropy, the fragility is an indication of the number of the degeneracy of the evolution paths. This explains why the nano-confined, low-dimension or high-pressured glasses exhibit stronger characteristics, which has been a puzzling phenomenon for a long time.
Submitted 20 January, 2025;
originally announced January 2025.
-
Exploring the two-body strong decay properties of the possible $Λ_cK^{*}$ and $Σ_cK^{(*)}$ molecules
Authors:
Jin-yu Huo,
Rui Chen
Abstract:
In this work, we apply the effective Lagrangian approach to investigate the two-body strong decay behaviors of the possible $Λ_c K^*$ and $Σ_c K^{(*)}$ molecules, as predicted in our previous study [Phys. Rev. D 108, 054011 (2023)]. Our results indicate that the decay width for the coupled $Σ_c K / Λ_c K^* / Σ_c K^*$ molecule with $I(J^P) = 1/2(1/2^-)$ is on the order of several MeV, with the $D_s N$ channel being dominant. For the coupled $Λ_c K^* / Σ_c K^*$ molecule with $1/2(1/2^-, 3/2^-)$, the decay widths are on the order of tens of MeV, with the dominant channels being $Σ_c K$ and $Σ_c^* K$, respectively. For the $Σ_c K^*$ molecules with $1/2(1/2^-)$, the decay width can reach one hundred MeV, with $Σ_c K$ and $Λ_c K$ being the dominant decay channels. The decay widths for the $Σ_c K^*$ molecules with $1/2(3/2^-)$ and $3/2(1/2^-)$ are on the order of tens of MeV, with the dominant decay modes being $Σ_c^* K$ and $Σ_c K$, respectively. The branching ratios for all the discussed channels show little dependence on the binding energies.
Submitted 19 January, 2025;
originally announced January 2025.
-
An Intermediate-mass Black Hole Lurking in A Galactic Halo Caught Alive during Outburst
Authors:
C. -C. Jin,
D. -Y. Li,
N. Jiang,
L. -X. Dai,
H. -Q. Cheng,
J. -Z. Zhu,
C. -W. Yang,
A. Rau,
P. Baldini,
T. -G. Wang,
H. -Y. Zhou,
W. Yuan,
C. Zhang,
X. -W. Shu,
R. -F. Shen,
Y. -L. Wang,
S. -X. Wen,
Q. -Y. Wu,
Y. -B. Wang,
L. L. Thomsen,
Z. -J. Zhang,
W. -J. Zhang,
A. Coleiro,
R. Eyles-Ferris,
X. Fang
, et al. (116 additional authors not shown)
Abstract:
Stellar-mass and supermassive black holes abound in the Universe, whereas intermediate-mass black holes (IMBHs) of ~10^2-10^5 solar masses in between are largely missing observationally, with only a few cases found. Here we report the real-time discovery of a long-duration X-ray transient, EP240222a, accompanied by an optical flare with prominent H and He emission lines revealed by prompt follow-up observations. Its observed properties evidence an IMBH located unambiguously in the halo of a nearby galaxy and flaring by tidally disrupting a star -- the only confirmed off-nucleus IMBH-tidal disruption event so far. This work demonstrates the potential of sensitive time-domain X-ray surveys, complemented by timely multi-wavelength follow-ups, in probing IMBHs, their environments, demographics, origins and connections to stellar-mass and supermassive black holes.
Submitted 16 January, 2025;
originally announced January 2025.
-
RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation
Authors:
Zixuan Chen,
Jing Huo,
Yangtao Chen,
Yang Gao
Abstract:
Efficient control in long-horizon robotic manipulation is challenging due to complex representation and policy learning requirements. Model-based visual reinforcement learning (RL) has shown great potential in addressing these challenges but still faces notable limitations, particularly in handling sparse rewards and complex visual features in long-horizon environments. To address these limitations, we propose the Recognize-Sense-Plan-Act (RSPA) pipeline for long-horizon tasks and further introduce RoboHorizon, an LLM-assisted multi-view world model tailored for long-horizon robotic manipulation. In RoboHorizon, pre-trained LLMs generate dense reward structures for multi-stage sub-tasks based on task language instructions, enabling robots to better recognize long-horizon tasks. Keyframe discovery is then integrated into the multi-view masked autoencoder (MAE) architecture to enhance the robot's ability to sense critical task sequences, strengthening its multi-stage perception of long-horizon processes. Leveraging these dense rewards and multi-view representations, a robotic world model is constructed to efficiently plan long-horizon tasks, enabling the robot to reliably act through RL algorithms. Experiments on two representative benchmarks, RLBench and FurnitureBench, show that RoboHorizon outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% improvement in task success rates on RLBench's 4 short-horizon tasks and a 29.23% improvement on 6 long-horizon tasks from RLBench and 3 furniture assembly tasks from FurnitureBench.
Submitted 24 January, 2025; v1 submitted 11 January, 2025;
originally announced January 2025.
-
Fast Gradient Computation for RoPE Attention in Almost Linear Time
Authors:
Yifang Chen,
Jiayan Huo,
Xiaoyu Li,
Yingyu Liang,
Zhenmei Shi,
Zhao Song
Abstract:
The Rotary Position Embedding (RoPE) mechanism has become a powerful enhancement to the Transformer architecture, which enables models to capture token relationships when encoding positional information. However, the RoPE mechanisms make the computations of attention mechanisms more complicated, which makes efficient algorithms challenging. Earlier research introduced almost linear time, i.e., $n^{1+o(1)}$ where $n$ is the number of input tokens, algorithms for the forward computation under specific parameter settings. However, achieving a subquadratic time algorithm for other parameter regimes remains impossible unless the widely accepted Strong Exponential Time Hypothesis (SETH) is disproven. In this work, we develop the first almost linear time algorithm for backward computations in the RoPE-based attention under bounded entries. Our approach builds on recent advancements in fast RoPE attention computations, utilizing a novel combination of the polynomial method and the Fast Fourier Transform. Furthermore, we show that with lower bounds derived from the SETH, the bounded entry condition is necessary for subquadratic performance.
Submitted 31 December, 2024; v1 submitted 23 December, 2024;
originally announced December 2024.
-
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
Authors:
Yunkai Dang,
Kaichen Huang,
Jiahao Huo,
Yibo Yan,
Sirui Huang,
Dongrui Liu,
Mengxi Gao,
Jie Zhang,
Chen Qian,
Kun Wang,
Yong Liu,
Jing Shao,
Hui Xiong,
Xuming Hu
Abstract:
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey on the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.
Submitted 2 December, 2024;
originally announced December 2024.
-
Effects of time aggregation, product aggregation, and seasonality in measuring bullwhip ratio
Authors:
Hau Mike Ma,
Jiazhen Huo,
Yongrui Duan
Abstract:
The bullwhip study has received a lot of attention in the literature, but with conflicting results, especially in the context of data aggregation. In this paper, we investigate three widely studied factors in bullwhip measurement: time aggregation, product aggregation, and seasonality. In time aggregation, we decompose the variance into two components: the expectation of the subset variances and the variance of subset expectations, thus decomposing the bullwhip ratio into four components to explore the underlying mechanism of time aggregation. In product aggregation, the bullwhip ratio is analyzed in the context of products with either uncorrelated or correlated demands and orders. Seasonality is also examined to study its effect on the bullwhip ratio. Our key findings are: (a) Time aggregation can increase, decrease, or maintain the bullwhip ratio in different scenarios. (b) Aggregated bullwhip ratio of uncorrelated products is a weighted average of bullwhip ratios from individual products, with corresponding demand variance as the weights. However, aggregated bullwhip ratio of correlated products could break the boundaries. (c) Seasonality can be considered as a standalone product with a bullwhip ratio of one, which can drive the overall bullwhip ratio closer to one.
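Finding (b) can be checked with a small numerical sketch. It assumes the common definition of the bullwhip ratio as the variance of orders divided by the variance of demand, and that both demands and orders are uncorrelated across products, so the variances of the aggregated series are sums of the per-product variances. The function names and the sample variances below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def bullwhip_ratio(order_var: float, demand_var: float) -> float:
    """Bullwhip ratio of one product: Var(orders) / Var(demand) (assumed definition)."""
    return order_var / demand_var


def aggregated_bullwhip(order_vars: Sequence[float], demand_vars: Sequence[float]) -> float:
    """Bullwhip ratio of the aggregate, assuming uncorrelated products.

    With zero covariances, the variance of a sum equals the sum of variances.
    """
    return sum(order_vars) / sum(demand_vars)


def weighted_average_bullwhip(order_vars: Sequence[float], demand_vars: Sequence[float]) -> float:
    """Average of per-product ratios, weighted by each product's share of demand variance."""
    total_demand_var = sum(demand_vars)
    return sum(
        (d / total_demand_var) * bullwhip_ratio(o, d)
        for o, d in zip(order_vars, demand_vars)
    )


# Hypothetical per-product variances, for illustration only.
demand_vars = [4.0, 1.0, 5.0]
order_vars = [8.0, 3.0, 5.0]

agg = aggregated_bullwhip(order_vars, demand_vars)
wavg = weighted_average_bullwhip(order_vars, demand_vars)
print(agg, wavg)
```

In this sketch the per-product ratios are 2, 3, and 1, and the aggregate always lies between the smallest and largest of them. Once covariance terms are nonzero the sums above no longer hold, which is consistent with the abstract's remark that correlated products can fall outside those bounds.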
Submitted 1 December, 2024;
originally announced December 2024.
-
Distractor-free Generalizable 3D Gaussian Splatting
Authors:
Yanqi Bao,
Jing Liao,
Jing Huo,
Yang Gao
Abstract:
We present DGGS, a novel framework addressing the previously unexplored challenge of Distractor-free Generalizable 3D Gaussian Splatting (3DGS). It accomplishes two key objectives: fortifying generalizable 3DGS against distractor-laden data during both training and inference phases, while successfully extending cross-scene adaptation capabilities to conventional distractor-free approaches. To achieve these objectives, DGGS introduces a scene-agnostic reference-based mask prediction and refinement methodology during the training phase, coupled with a training view selection strategy, effectively improving distractor prediction accuracy and training stability. Moreover, to address distractor-induced voids and artifacts during the inference stage, we propose a two-stage inference framework for better reference selection based on the predicted distractor masks, complemented by a distractor pruning module to eliminate residual distractor effects. Extensive generalization experiments demonstrate DGGS's advantages under distractor-laden conditions. Additionally, experimental results show that our scene-agnostic mask inference achieves accuracy comparable to scene-specific trained methods. Homepage: https://github.com/bbbbby-99/DGGS.
Submitted 26 November, 2024;
originally announced November 2024.
-
SAM-I2I: Unleash the Power of Segment Anything Model for Medical Image Translation
Authors:
Jiayu Huo,
Sebastien Ourselin,
Rachel Sparks
Abstract:
Medical image translation is crucial for reducing the need for redundant and expensive multi-modal imaging in the clinical field. However, current approaches based on Convolutional Neural Networks (CNNs) and Transformers often fail to capture fine-grained semantic features, resulting in suboptimal image quality. To address this challenge, we propose SAM-I2I, a novel image-to-image translation framework based on the Segment Anything Model 2 (SAM2). SAM-I2I utilizes a pre-trained image encoder to extract multiscale semantic features from the source image and a decoder, based on the mask unit attention module, to synthesize target modality images. Our experiments on multi-contrast MRI datasets demonstrate that SAM-I2I outperforms state-of-the-art methods, offering more efficient and accurate medical image translation.
Submitted 12 November, 2024;
originally announced November 2024.
-
Einstein Probe discovery of EP240408a: a peculiar X-ray transient with an intermediate timescale
Authors:
Wenda Zhang,
Weimin Yuan,
Zhixing Ling,
Yong Chen,
Nanda Rea,
Arne Rau,
Zhiming Cai,
Huaqing Cheng,
Francesco Coti Zelati,
Lixin Dai,
Jingwei Hu,
Shumei Jia,
Chichuan Jin,
Dongyue Li,
Paul O'Brien,
Rongfeng Shen,
Xinwen Shu,
Shengli Sun,
Xiaojin Sun,
Xiaofeng Wang,
Lei Yang,
Bing Zhang,
Chen Zhang,
Shuang-Nan Zhang,
Yonghe Zhang,
et al. (115 additional authors not shown)
Abstract:
We report the discovery of a peculiar X-ray transient, EP240408a, by Einstein Probe (EP) and follow-up studies made with EP, Swift, NICER, GROND, ATCA and other ground-based multi-wavelength telescopes. The new transient was first detected with the Wide-field X-ray Telescope (WXT) on board EP on April 8th, 2024, manifesting as an intense yet brief X-ray flare lasting 12 seconds. The flare reached a peak flux of $3.9\times10^{-9}$ erg cm$^{-2}$ s$^{-1}$ in 0.5-4 keV, about 300 times brighter than the underlying X-ray emission detected throughout the observation. Rapid and more precise follow-up observations by EP/FXT, Swift and NICER confirmed the finding of this new transient. Its X-ray spectrum is non-thermal in 0.5-10 keV, with a power-law photon index varying within 1.8-2.5. The X-ray light curve shows a plateau lasting for about 4 days, followed by a steep decay until the source became undetectable about 10 days after the initial detection. Based on its temporal properties and constraints from previous EP observations, an unusual timescale in the range of 7-23 days is found for EP240408a, which is intermediate between the commonly found fast and long-term transients. No counterparts have been found in optical and near-infrared, with the earliest observation at 17 hours after the initial X-ray detection, suggestive of intrinsically weak emission in these bands. We demonstrate that the remarkable properties of EP240408a are inconsistent with any of the transient types known so far, by comparison with, in particular, jetted tidal disruption events, gamma-ray bursts, X-ray binaries and fast blue optical transients. The nature of EP240408a thus remains an enigma. We suggest that EP240408a may represent a new type of transient with an intermediate timescale of the order of about 10 days. The detection and follow-up of more such objects are essential for revealing their origin.
Submitted 28 October, 2024;
originally announced October 2024.
-
Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment
Authors:
Mingzhi Wang,
Chengdong Ma,
Qizhi Chen,
Linjian Meng,
Yang Han,
Jiancong Xiao,
Zhaowei Zhang,
Jing Huo,
Weijie J. Su,
Yaodong Yang
Abstract:
Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM) performance but also overcomes the limitations of traditional Bradley-Terry (BT) model assumptions by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game. However, existing methods either guarantee only average-iterate convergence, incurring high storage and inference costs, or converge to the NE of a regularized game, failing to accurately reflect true human preferences. In this paper, we introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the NE of the original game, effectively overcoming the limitations of existing methods. Building upon Magnetic Mirror Descent (MMD), MPO attains a linear convergence rate, making it particularly suitable for fine-tuning LLMs. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment.
Submitted 19 April, 2025; v1 submitted 22 October, 2024;
originally announced October 2024.
-
MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models
Authors:
Kaichen Huang,
Jiahao Huo,
Yibo Yan,
Kun Wang,
Yutao Yue,
Xuming Hu
Abstract:
In recent years, multimodal large language models (MLLMs) have significantly advanced, integrating more modalities into diverse applications. However, the lack of explainability remains a major barrier to their use in scenarios requiring decision transparency. Current neuron-level explanation paradigms mainly focus on knowledge localization or language- and domain-specific analyses, leaving the exploration of multimodality largely unaddressed. To tackle these challenges, we propose MINER, a transferable framework for mining modality-specific neurons (MSNs) in MLLMs, which comprises four stages: (1) modality separation, (2) importance score calculation, (3) importance score aggregation, and (4) modality-specific neuron selection. Extensive experiments across six benchmarks and two representative MLLMs show that (I) deactivating only 2% of MSNs significantly reduces MLLM performance (0.56 to 0.24 for Qwen2-VL, 0.69 to 0.31 for Qwen2-Audio), (II) different modalities mainly converge in the lower layers, (III) MSNs influence how key information from various modalities converges to the last token, and (IV) there are two intriguing phenomena worth further investigation, i.e., semantic probing and semantic telomeres. The source code is available at this URL.
Submitted 7 October, 2024;
originally announced October 2024.
-
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Authors:
Yibo Yan,
Shen Wang,
Jiahao Huo,
Hang Li,
Boyan Li,
Jiamin Su,
Xiong Gao,
Yi-Fan Zhang,
Tianlong Xu,
Zhendong Chu,
Aoxiao Zhong,
Kun Wang,
Hui Xiong,
Philip S. Yu,
Xuming Hu,
Qingsong Wen
Abstract:
As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, which would enhance reasoning capability in complicated settings. To fill this gap, we formally define a new task, multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that significant challenges remain, as the best-performing model, GPT-4o, is still around 10% behind human evaluation. The dataset will be available upon acceptance.
Submitted 8 October, 2024; v1 submitted 6 October, 2024;
originally announced October 2024.
-
Extragalactic fast X-ray transient from a weak relativistic jet associated with a Type Ic-BL supernova
Authors:
H. Sun,
W. -X. Li,
L. -D. Liu,
H. Gao,
X. -F. Wang,
W. Yuan,
B. Zhang,
A. V. Filippenko,
D. Xu,
T. An,
S. Ai,
T. G. Brink,
Y. Liu,
Y. -Q. Liu,
C. -Y. Wang,
Q. -Y. Wu,
X. -F. Wu,
Y. Yang,
B. -B. Zhang,
W. -K. Zheng,
T. Ahumada,
Z. -G. Dai,
J. Delaunay,
N. Elias-Rosa,
S. Benetti,
et al. (140 additional authors not shown)
Abstract:
Massive stars end their life as core-collapse supernovae, amongst which some extremes are Type Ic broad-lined supernovae associated with long-duration gamma-ray bursts (LGRBs) having powerful relativistic jets. Their less-extreme brethren make unsuccessful jets that are choked inside the stars, appearing as X-ray flashes or low-luminosity GRBs. On the other hand, there exists a population of extragalactic fast X-ray transients (EFXTs) with timescales ranging from seconds to thousands of seconds, whose origins remain obscure. Known sources that contribute to the observed EFXT population include the softer analogs of LGRBs, shock breakouts of supernovae, or unsuccessful jets. Here, we report the discovery of the bright X-ray transient EP240414a detected by the Einstein Probe (EP), which is associated with the Type Ic supernova SN 2024gsa at a redshift of 0.401. The X-ray emission evolution is characterised by a very soft energy spectrum peaking at < 1.3 keV, which makes it distinct from known LGRBs, X-ray flashes, or low-luminosity GRBs. Follow-up observations at optical and radio bands revealed the existence of a weak relativistic jet that interacts with an extended shell surrounding the progenitor star. Located on the outskirts of a massive galaxy, this event reveals a new population of explosions of Wolf-Rayet stars characterised by a less powerful engine that drives a successful but weak jet, possibly owing to a progenitor star with a smaller core angular momentum than in traditional LGRB progenitors.
Submitted 3 October, 2024;
originally announced October 2024.
-
GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation
Authors:
Yangtao Chen,
Zixuan Chen,
Junhui Yin,
Jing Huo,
Pinzhuo Tian,
Jieqi Shi,
Yang Gao
Abstract:
Robots' ability to follow language instructions and execute diverse 3D manipulation tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in understanding novel tasks, thereby mitigating this issue. However, these methods lack a task-specific learning process, which is essential for an accurate understanding of 3D environments, often leading to execution failures. In this paper, we introduce GravMAD, a sub-goal-driven, language-conditioned action diffusion framework that combines the strengths of imitation learning and foundation models. Our approach breaks tasks into sub-goals based on language instructions, allowing auxiliary guidance during both training and inference. During training, we introduce Sub-goal Keypose Discovery to identify key sub-goals from demonstrations. Inference differs from training, as there are no demonstrations available, so we use pre-trained foundation models to bridge the gap and identify sub-goals for the current task. In both phases, GravMaps are generated from sub-goals, providing GravMAD with more flexible 3D spatial guidance compared to fixed 3D positions. Empirical evaluations on RLBench show that GravMAD significantly outperforms state-of-the-art methods, with a 28.63% improvement on novel tasks and a 13.36% gain on tasks encountered during training. Evaluations on real-world robotic tasks further show that GravMAD can reason about real-world tasks, associate them with relevant visual information, and generalize to novel tasks. These results demonstrate GravMAD's strong multi-task learning and generalization in 3D manipulation. Video demonstrations are available at: https://gravmad.github.io.
Submitted 16 March, 2025; v1 submitted 30 September, 2024;
originally announced September 2024.
-
Making Large Vision Language Models to be Good Few-shot Learners
Authors:
Fan Liu,
Wenwen Cai,
Jian Huo,
Chuanyi Zhang,
Delong Chen,
Jun Zhou
Abstract:
Few-shot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC tasks. In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases. To tackle these challenges, we adopt a meta-learning strategy to teach models to "learn to learn". By constructing a rich set of meta-tasks for instruction fine-tuning, we enhance LVLMs' ability to extract information from few-shot support data for classification. Additionally, we further boost LVLMs' few-shot learning capabilities through label augmentation and candidate selection in the fine-tuning and inference stages, respectively. Label augmentation is implemented via a character perturbation strategy to ensure the model focuses on support information. Candidate selection leverages attribute descriptions to filter out unreliable candidates and simplify the task. Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy has been proven beneficial for training-free LVLMs.
Submitted 20 August, 2024;
originally announced August 2024.
-
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities
Authors:
Yanqi Bao,
Tianyu Ding,
Jing Huo,
Yaoli Liu,
Yuxin Li,
Wenbin Li,
Yang Gao,
Jiebo Luo
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussians through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing them based on their focuses or motivations. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks, proposing potential research opportunities.
Submitted 17 December, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
Exploring the mass spectrum and the electromagnetic properties of the possible $Ξ_{cc}K^{(*)}$ and $Ξ_{cc}\bar{K}^{(*)}$ molecules
Authors:
Li-Cheng Sheng,
Jin-Yu Huo,
Rui Chen,
Fu-Lai Wang,
Xiang Liu
Abstract:
Using the one-boson-exchange model, we investigate the interactions between the doubly charmed baryon $Ξ_{cc}(3621)$ and the $S$-wave (anti-)kaon, accounting for the $S$-$D$ wave mixing and coupled-channel effects. We find the coupled $Ξ_{cc}K/Ξ_{cc}K^*$ state with $I(J^P)=0(1/2^-)$, the $Ξ_{cc}K^*$ state with $0(1/2^-)$, the $Ξ_{cc}\bar{K}$ state with $0(1/2^-)$, and the $Ξ_{cc}\bar{K}^*$ states with $0(1/2^-,3/2^-)$ can be recommended as good doubly charmed molecular candidates with strangeness $|S|=1$. We further examine their M1 radiative decay behaviors and magnetic moments within the constituent quark model framework. This information can enhance our understanding of their inner structures, including the distribution of electric charge and the orientation of the constituent quarks' spins.
Submitted 29 September, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
Self-supervised Brain Lesion Generation for Effective Data Augmentation of Medical Images
Authors:
Jiayu Huo,
Sebastien Ourselin,
Rachel Sparks
Abstract:
Accurate brain lesion delineation is important for planning neurosurgical treatment. Automatic brain lesion segmentation methods based on convolutional neural networks have demonstrated remarkable performance. However, neural network performance is constrained by the lack of large-scale well-annotated training datasets. In this manuscript, we propose a comprehensive framework to efficiently generate new samples for training a brain lesion segmentation model. We first train a lesion generator, based on an adversarial autoencoder, in a self-supervised manner. Next, we utilize a novel image composition algorithm, Soft Poisson Blending, to seamlessly combine synthetic lesions and brain images to obtain training samples. Finally, to effectively train the brain lesion segmentation model with augmented images we introduce a new prototype consistence regularization to align real and synthetic features. Our framework is validated by extensive experiments on two public brain lesion segmentation datasets: ATLAS v2.0 and Shift MS. Our method outperforms existing brain image data augmentation schemes. For instance, our method improves the Dice from 50.36% to 60.23% compared to the U-Net with conventional data augmentation techniques for the ATLAS v2.0 dataset.
Submitted 18 August, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
Authors:
Jiahao Huo,
Yibo Yan,
Boren Hu,
Yutao Yue,
Xuming Hu
Abstract:
Projecting visual features into word embedding space has become a significant fusion strategy adopted by Multimodal Large Language Models (MLLMs). However, its internal mechanisms have yet to be explored. Inspired by multilingual research, we identify domain-specific neurons in multimodal large language models. Specifically, we investigate the distribution of domain-specific neurons and the mechanism of how MLLMs process features from diverse domains. Furthermore, we propose a three-stage mechanism for language model modules in MLLMs when handling projected image features, and verify this hypothesis using logit lens. Extensive experiments indicate that while current MLLMs exhibit Visual Question Answering (VQA) capability, they may not fully utilize domain-specific information. Properly manipulating domain-specific neurons changes accuracy by at most 10%, shedding light on the development of cross-domain, all-encompassing MLLMs in the future. The source code is available at https://github.com/Z1zs/MMNeuron.
Submitted 1 October, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
The expressway network design problem for multiple urban subregions based on the macroscopic fundamental diagram
Authors:
Yunran Di,
Weihua Zhang,
Haotian Shi,
Heng Ding,
Jinbiao Huo,
Bin Ran
Abstract:
As urbanization advances, cities are expanding, leading to a more decentralized urban structure and longer average commuting durations. The construction of an urban expressway system emerges as a critical strategy to tackle this challenge. However, the traditional link-level network design method faces modeling and solution challenges when dealing with the large-scale expressway network design problem (ENDP). To address the challenges, this paper proposes an expressway network design method for multiple urban subregions based on the macroscopic fundamental diagram (MFD). Initially, a mixed road network traffic model that describes traffic dynamics of multiple subregions and candidate expressways is developed by integrating the MFD and the cell transmission model (CTM). Then, treating urban subregions and candidate expressways as route nodes in the mixed road network, a route choice model is established based on stochastic user equilibrium. Finally, a decision model for ENDP is proposed to minimize vehicle travel time under the construction budget constraint. The impact of financial investment and traffic demand on expressway network design schemes in the case study is explored separately. The simulation results indicate that during the initial stages of expressway planning, the construction of new expressways can significantly alleviate traffic congestion. However, as the expressway network expands further, the effectiveness of improving traffic conditions through new expressway construction gradually diminishes if traffic demand does not continue to increase. Additionally, variations in traffic demand between subregions result in different construction schemes, emphasizing the importance of adjusting budget allocations based on specific traffic demands.
Submitted 12 June, 2024;
originally announced June 2024.
-
Zero-Shot Video Editing through Adaptive Sliding Score Distillation
Authors:
Lianghan Zhu,
Yanqi Bao,
Jing Huo,
Jing Wu,
Yu-Kun Lai,
Wenbin Li,
Yang Gao
Abstract:
The rapidly evolving field of Text-to-Video generation (T2V) has catalyzed renewed interest in controllable video editing research. While the application of editing prompts to guide diffusion model denoising has gained prominence, mirroring advancements in image editing, this noise-based inference process inherently compromises the original video's integrity, resulting in unintended over-editing and temporal discontinuities. To address these challenges, this study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content. Specifically, distinguishing it from image-based score distillation, we propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors. Combined with our proposed Image-based Joint Guidance mechanism, it has the ability to mitigate the inherent instability of the T2V model and single-step sampling. Additionally, we design a Weighted Attention Fusion module to further preserve the key features of the original video and avoid over-editing. Extensive experiments demonstrate that these strategies effectively address existing challenges, achieving superior performance compared to current state-of-the-art methods.
Submitted 6 September, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.
-
Lepton flavor violating decays $Z\rightarrow l^{\pm}_{i}l^{\mp}_{j}$ in the B-L Supersymmetric Standard Model
Authors:
Jia-Peng Huo,
Xing-Xing Dong,
Jiao Ma,
Shu-Min Zhao,
Cai Guo,
Hai-Bin Zhang,
Jin-Lei Yang,
Tai-Fu Feng
Abstract:
Lepton flavor violation (LFV) represents a clear new physics (NP) signal beyond the standard model (SM). In this paper, we study LFV decays $Z\rightarrow l^{\pm}_{i}l^{\mp}_{j}$ in the B-L Supersymmetric Standard Model (B-LSSM). We calculate these processes separately in the mass eigenstate basis and the electroweak interaction basis, and the latter adopts the mass insertion approximation (MIA) method. The MIA clearly shows the effect of parameters on the LFV decays $Z\rightarrow l^{\pm}_{i}l^{\mp}_{j}$ at the analytic level, which provides a new way for us to analyze the LFV processes. At the same time, the corresponding constraints from the LFV decays $l^{-}_{j} \rightarrow l^{-}_{i} γ$ and $(g-2)_μ$ are considered to analyze the numerical results.
Submitted 5 June, 2024;
originally announced June 2024.
-
Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model
Authors:
Zheng Gu,
Shiyuan Yang,
Jing Liao,
Jing Huo,
Yang Gao
Abstract:
Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
Submitted 16 May, 2024;
originally announced May 2024.
-
Predicting possible molecular states of nucleons with $Ξ_c$, $Ξ_c^{*}$, and $Ξ_c^{\prime}$
Authors:
Jin-Yu Huo,
Li-Cheng Sheng,
Rui Chen,
Xiang Liu
Abstract:
In the framework of a one-boson-exchange model, we carry out a comprehensive investigation of the $Ξ_cN/Λ_cΣ/Ξ_c^{\prime}N/Σ_cΛ/Ξ_c^*N/Σ_c^*Λ/Σ_cΣ/Σ_c^*Σ$ interactions. We consider the $S$-$D$-wave mixing effects and the coupled-channel effects to derive the relevant effective potentials. Our results predict several possible charm-strange deuteronlike $Ξ_c^{(\prime,*)}N$ hexaquarks: the $Ξ_c^{\prime}N$ molecules with $I(J^P)=0(0^+)$, $0(1^+)$, $1(1^+)$, the $Ξ_c^*N$ molecules with $0(1^+)$, $0(2^+)$, $1(2^+)$, and the coupled $Ξ_cN/Ξ_c^{\prime}N/Ξ_c^*N/Σ_cΣ/Σ_c^*Σ$ molecule with $0(1^+)$. We expect future experiments to search for the predicted $Ξ_c^{(\prime,*)}N$ bound states.
Submitted 29 September, 2024; v1 submitted 27 April, 2024;
originally announced April 2024.
-
Soft X-ray prompt emission from a high-redshift gamma-ray burst EP240315a
Authors:
Y. Liu,
H. Sun,
D. Xu,
D. S. Svinkin,
J. Delaunay,
N. R. Tanvir,
H. Gao,
C. Zhang,
Y. Chen,
X. -F. Wu,
B. Zhang,
W. Yuan,
J. An,
G. Bruni,
D. D. Frederiks,
G. Ghirlanda,
J. -W. Hu,
A. Li,
C. -K. Li,
J. -D. Li,
D. B. Malesani,
L. Piro,
G. Raman,
R. Ricci,
E. Troja
, et al. (170 additional authors not shown)
Abstract:
Long gamma-ray bursts (GRBs) are believed to originate from core collapse of massive stars. High-redshift GRBs can probe the star formation and reionization history of the early universe, but their detection remains rare. Here we report the detection of a GRB triggered in the 0.5--4 keV band by the Wide-field X-ray Telescope (WXT) on board the Einstein Probe (EP) mission, designated as EP240315a, whose bright peak was also detected by the Swift Burst Alert Telescope and Konus-Wind through off-line analyses. At a redshift of $z=4.859$, EP240315a showed a much longer and more complicated light curve in the soft X-ray band than in gamma-rays. Benefiting from a large field-of-view ($\sim$3600 deg$^2$) and a high sensitivity, EP-WXT captured the earlier engine activation and extended late engine activity through a continuous detection. With a peak X-ray flux at the faint end of previously known high-$z$ GRBs, the detection of EP240315a demonstrates the great potential for EP to study the early universe via GRBs.
Submitted 25 April, 2024;
originally announced April 2024.
-
Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
Authors:
Ruoxi Cheng,
Haoxuan Ma,
Shuirong Cao,
Jiaqi Li,
Aihua Pei,
Zhiqiang Wang,
Pengliang Ji,
Haoyu Wang,
Jiaqi Huo
Abstract:
Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation that replaces human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM like GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs on BBQ and our datasets demonstrate the effectiveness of our approach in bias mitigation. Our source code and datasets are available at https://anonymous.4open.science/r/RLDF-E344.
Submitted 16 August, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.