-
MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting
Authors:
Mengqiu Xu,
Kaixin Chen,
Heng Guo,
Yixiang Huang,
Ming Wu,
Zhenwei Shi,
Chuang Zhang,
Jun Guo
Abstract:
Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at https://github.com/kaka0910/MFogHub.
Submitted 15 May, 2025;
originally announced May 2025.
-
FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation
Authors:
Jun Guo,
Xiaojian Ma,
Yikai Wang,
Min Yang,
Huaping Liu,
Qing Li
Abstract:
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as an explicit motion representation. FlowDreamer first predicts 3D scene flow from the past frame and action conditions with a U-Net; a diffusion model then predicts the future frame from that scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer outperforms other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.
Submitted 15 May, 2025;
originally announced May 2025.
-
Generative Molecular Design with Steerable and Granular Synthesizability Control
Authors:
Jeff Guo,
Víctor Sabanza-Gil,
Zlatko Jončev,
Jeremy S. Luterbacher,
Philippe Schwaller
Abstract:
Synthesizability in small molecule generative design remains a bottleneck. Existing works that do consider synthesizability can output predicted synthesis routes for generated molecules. However, minimal attention has been paid to addressing the ease of synthesis and enabling flexibility to incorporate desired reaction constraints. In this work, we propose a small molecule generative design framework that enables steerable and granular synthesizability control. Generated molecules satisfy arbitrary multi-parameter optimization objectives with predicted synthesis routes containing pre-defined allowed reactions, while optionally avoiding others. One can also enforce that all reactions belong to a pre-defined set. We show the capability to mix-and-match these reaction constraints across the most common medicinal chemistry transformations. Next, we show how our framework can be used to valorize industrial byproducts towards de novo optimized molecules. Going further, we demonstrate how granular control over synthesizability constraints can loosely mimic virtual screening of ultra-large make-on-demand libraries. Using only a single GPU, we generate and dock 15k molecules to identify promising candidates in Freedom 4.0, which comprises 142B make-on-demand molecules (assessing only 0.00001% of the library). Generated molecules satisfying the reaction constraints have a >90% exact match rate. Lastly, we benchmark our framework against recent synthesizability-constrained generative models and demonstrate the highest sample efficiency even when imposing the additional constraint that all molecules must be synthesizable from a single reaction type. The main theme is that a pre-trained generalist molecular generative model can be incentivized to generate property-optimized small molecules under challenging synthesizability constraints through reinforcement learning.
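The library fraction quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not code from the paper; the function name `percent_of_library` is hypothetical, and "15k" and "142B" are read as 15,000 and 142,000,000,000.

```python
def percent_of_library(n_assessed: int, library_size: int) -> float:
    """Return the share of a library that was assessed, in percent."""
    return 100.0 * n_assessed / library_size


# Figures taken from the abstract: 15k docked molecules, 142B-molecule library.
pct = percent_of_library(15_000, 142_000_000_000)

# Rounded to five decimal places this agrees with the quoted 0.00001%.
print(f"{pct:.5f}% of the library")
```

The unrounded value is roughly 1.06e-5 percent, so the abstract's figure is a rounding of about one molecule in ten million.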
Submitted 13 May, 2025;
originally announced May 2025.
-
AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography
Authors:
Jiewen Yang,
Taoran Huang,
Shangwei Ding,
Xiaowei Xu,
Qinhua Zhao,
Yong Jiang,
Jiarong Guo,
Bin Pu,
Jiexuan Zheng,
Caojin Zhang,
Hongwen Fei,
Xiaomeng Li
Abstract:
Echocardiographers can detect pulmonary hypertension using Doppler echocardiography; however, accurately assessing its progression often proves challenging. Right heart catheterization (RHC), the gold standard for precise evaluation, is invasive and unsuitable for routine use, limiting its practicality for timely diagnosis and monitoring of pulmonary hypertension progression. Here, we propose MePH, a multi-view, multi-modal vision-language model to accurately assess pulmonary hypertension progression using non-invasive echocardiography. We constructed a large dataset comprising paired standardized echocardiogram videos, spectral images and RHC data, covering 1,237 patient cases from 12 medical centers. For the first time, MePH precisely models the correlation between non-invasive multi-view, multi-modal echocardiography and the pressure and resistance obtained via RHC. We show that MePH significantly outperforms echocardiographers' assessments using echocardiography, reducing the mean absolute error in estimating mean pulmonary arterial pressure (mPAP) and pulmonary vascular resistance (PVR) by 49.73% and 43.81%, respectively. In eight independent external hospitals, MePH achieved a mean absolute error of 3.147 for PVR assessment. Furthermore, MePH achieved an area under the curve of 0.921, surpassing echocardiographers (area under the curve of 0.842) in accurately predicting the severity of pulmonary hypertension, whether mild or severe. A prospective study demonstrated that MePH can predict treatment efficacy for patients. Our work provides pulmonary hypertension patients with a non-invasive and timely method for monitoring disease progression, improving the accuracy and efficiency of pulmonary hypertension management while enabling earlier interventions and more personalized treatment decisions.
Submitted 12 May, 2025;
originally announced May 2025.
-
LLM-Augmented Chemical Synthesis and Design Decision Programs
Authors:
Haorui Wang,
Jeff Guo,
Lingkai Kong,
Rampi Ramprasad,
Philippe Schwaller,
Yuanqi Du,
Chao Zhang
Abstract:
Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.
Submitted 11 May, 2025;
originally announced May 2025.
-
System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection
Authors:
Jiawei Guo,
Haipeng Cai
Abstract:
Large language models (LLMs) have gained widespread adoption across diverse applications due to their impressive generative capabilities. Their plug-and-play nature enables both developers and end users to interact with these models through simple prompts. However, as LLMs become more integrated into various systems in diverse domains, concerns around their security are growing. Existing studies mainly focus on threats arising from user prompts (e.g., prompt injection attacks) and model output (e.g., model inversion attacks), while the security of system prompts remains largely overlooked. This work bridges this critical gap. We introduce system prompt poisoning, a new attack vector against LLMs that, unlike traditional user prompt injection, poisons system prompts and hence persistently impacts all subsequent user interactions and model responses. We systematically investigate four practical attack strategies in various poisoning scenarios. Through demonstration on both generative and reasoning LLMs, we show that system prompt poisoning is highly feasible without requiring jailbreak techniques, and effective across a wide range of tasks, including those in mathematics, coding, logical reasoning, and natural language processing. Importantly, our findings reveal that the attack remains effective even when user prompts employ advanced prompting techniques like chain-of-thought (CoT). We also show that such techniques, including CoT and retrieval-augmented generation (RAG), which have proven effective for improving LLM performance in a wide range of tasks, are significantly weakened by system prompt poisoning.
Submitted 9 May, 2025;
originally announced May 2025.
-
Opening the Scope of Openness in AI
Authors:
Tamara Paris,
AJung Moon,
Jin Guo
Abstract:
The concept of openness in AI has so far been heavily inspired by the definition and community practice of open source software. This positions openness in AI as having positive connotations; it introduces assumptions of certain advantages, such as collaborative innovation and transparency. However, the practices and benefits of open source software are not fully transferable to AI, which has its own challenges. Framing a notion of openness tailored to AI is crucial to addressing its growing societal implications, risks, and capabilities. We argue that considering the fundamental scope of openness in different disciplines will broaden discussions, introduce important perspectives, and reflect on what openness in AI should mean. Toward this goal, we qualitatively analyze 98 concepts of openness discovered from topic modeling, through which we develop a taxonomy of openness. Using this taxonomy as an instrument, we situate the current discussion on AI openness, identify gaps and highlight links with other disciplines. Our work contributes to the recent efforts in framing openness in AI by reflecting principles and practices of openness beyond open source software and calls for a more holistic view of openness in terms of actions, system properties, and ethical objectives.
Submitted 9 May, 2025;
originally announced May 2025.
-
Long-Term Individual Causal Effect Estimation via Identifiable Latent Representation Learning
Authors:
Ruichu Cai,
Junjie Wan,
Weilin Chen,
Zeqin Yang,
Zijian Li,
Peng Zhen,
Jiecheng Guo
Abstract:
Estimating long-term causal effects by combining long-term observational and short-term experimental data is a crucial but challenging problem in many real-world scenarios. Existing methods propose several idealized assumptions, e.g., the latent unconfoundedness assumption or the additive equi-confounding bias assumption, to address the latent confounder problem raised by the observational data. However, in real-world applications these assumptions are typically violated, which limits their practical effectiveness. In this paper, we tackle the problem of estimating long-term individual causal effects without the aforementioned assumptions. Specifically, we propose to utilize the natural heterogeneity of data, such as data from multiple sources, to identify latent confounders, thereby largely avoiding reliance on idealized assumptions. Practically, we devise a latent representation learning-based estimator of long-term causal effects. Theoretically, we establish the identifiability of latent confounders, with which we further achieve long-term effect identification. Extensive experimental studies, conducted on multiple synthetic and semi-synthetic datasets, demonstrate the effectiveness of our proposed method.
Submitted 8 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction
Authors:
Xinran Yang,
Donghao Ji,
Yuanqi Li,
Jie Guo,
Yanwen Guo,
Junyuan Xie
Abstract:
Neural rendering techniques have made substantial progress in generating photo-realistic 3D scenes. The latest 3D Gaussian Splatting technique has achieved high-quality novel view synthesis as well as fast rendering speed. However, 3D Gaussians lack proficiency in defining accurate 3D geometric structures despite their explicit primitive representations. This is because the Gaussians' attributes are, by their anisotropic nature, primarily tailored and fine-tuned for rendering diverse 2D images. To pave the way for efficient 3D reconstruction, we present Spherical Gaussians, a simple and effective representation for 3D geometric boundaries, from which we can directly reconstruct 3D feature curves from a set of calibrated multi-view images. Spherical Gaussians are optimized from grid initialization with a view-based rendering loss, where a 2D edge map is rendered at a specific view and then compared to the ground-truth edge map extracted from the corresponding image, without the need for any 3D guidance or supervision. Given that Spherical Gaussians serve as an intermediate, robust edge representation, we further introduce a novel optimization-based algorithm called SGCR to directly extract accurate parametric curves from aligned Spherical Gaussians. We demonstrate that SGCR outperforms existing state-of-the-art methods in 3D edge reconstruction while enjoying great efficiency.
Submitted 7 May, 2025;
originally announced May 2025.
-
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Authors:
Hao Sun,
Zile Qiao,
Jiayan Guo,
Xuanbo Fan,
Yingyan Hou,
Yong Jiang,
Pengjun Xie,
Fei Huang,
Yan Zhang
Abstract:
Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
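The curriculum-based rollout idea, in which generated documents are degraded more often as training proceeds, can be pictured with a small scheduling sketch. Everything here is an assumption made for illustration: the linear ramp, the names `noise_probability` and `pick_document_kind`, and the default probabilities are hypothetical and are not taken from the ZeroSearch paper or its code.

```python
import random


def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.5) -> float:
    """Probability of serving a noisy document at a given training step.

    Assumption: a simple linear ramp from p_start to p_end; the paper's
    actual schedule may differ.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return p_start + (p_end - p_start) * frac


def pick_document_kind(step: int, total_steps: int, rng: random.Random) -> str:
    """Sample whether the simulated retriever returns a noisy or relevant document."""
    if rng.random() < noise_probability(step, total_steps):
        return "noisy"
    return "relevant"


rng = random.Random(0)
# Early in training nearly all documents are relevant; later, more are noisy.
early = [pick_document_kind(0, 100, rng) for _ in range(5)]
late = [pick_document_kind(100, 100, rng) for _ in range(5)]
print(early, late)
```

With this kind of schedule the policy first sees clean retrieval results and is exposed to harder, noisier ones only gradually, which is the behavior the abstract describes.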
Submitted 7 May, 2025;
originally announced May 2025.
-
RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT
Authors:
Chuyu Zhao,
Hao Huang,
Jiashuo Guo,
Ziyu Shen,
Zhongwei Zhou,
Jie Liu,
Zekuan Yu
Abstract:
Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is scarce. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training, and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group, dual-student semi-supervised framework. Each group contains two student models guided by a shared teacher network. By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. Specifically, RAIL introduces two instructive mechanisms. The Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both the ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas. In the unsupervised phase, the Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training. This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels. Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation. Our code will be available at https://github.com/Tournesol-Saturday/RAIL.
Submitted 6 May, 2025;
originally announced May 2025.
-
VISLIX: An XAI Framework for Validating Vision Models with Slice Discovery and Analysis
Authors:
Xinyuan Yan,
Xiwei Xuan,
Jorge Piazentin Ono,
Jiajing Guo,
Vikram Mohanty,
Shekar Arvind Kumar,
Liang Gou,
Bei Wang,
Liu Ren
Abstract:
Real-world machine learning models require rigorous evaluation before deployment, especially in safety-critical domains like autonomous driving and surveillance. The evaluation of machine learning models often focuses on data slices, which are subsets of the data that share a set of characteristics. Data slice finding automatically identifies conditions or data subgroups where models underperform, aiding developers in mitigating performance issues. Despite its popularity and effectiveness, data slicing for vision model validation faces several challenges. First, data slicing often needs additional image metadata or visual concepts, and falls short in certain computer vision tasks, such as object detection. Second, understanding data slices is a labor-intensive and mentally demanding process that heavily relies on the expert's domain knowledge. Third, data slicing lacks a human-in-the-loop solution that allows experts to form hypotheses and test them interactively. To overcome these limitations and better support the machine learning operations lifecycle, we introduce VISLIX, a novel visual analytics framework that employs state-of-the-art foundation models to help domain experts analyze slices in computer vision models. Our approach does not require image metadata or visual concepts, automatically generates natural language insights, and allows users to test data slice hypotheses interactively. We evaluate VISLIX with an expert study and three use cases that demonstrate the effectiveness of our tool in providing comprehensive insights for validating object detection models.
Submitted 5 May, 2025;
originally announced May 2025.
-
DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction
Authors:
Yiqun Lin,
Hualiang Wang,
Jixiang Chen,
Jiewen Yang,
Jiarong Guo,
Xiaomeng Li
Abstract:
Cone-beam computed tomography (CBCT) is a critical 3D imaging technology in the medical field, but the high radiation exposure required for high-quality imaging raises significant concerns, particularly for vulnerable populations. Sparse-view reconstruction reduces radiation by using fewer X-ray projections while maintaining image quality, yet existing methods face challenges such as high computational demands and poor generalizability to different datasets. To overcome these limitations, we propose DeepSparse, the first foundation model for sparse-view CBCT reconstruction, featuring DiCE (Dual-Dimensional Cross-Scale Embedding), a novel network that integrates multi-view 2D features and multi-scale 3D features. Additionally, we introduce the HyViP (Hybrid View Sampling Pretraining) framework, which pretrains the model on large datasets with both sparse-view and dense-view projections, and a two-step finetuning strategy to adapt and refine the model for new datasets. Extensive experiments and ablation studies demonstrate that our proposed DeepSparse achieves superior reconstruction quality compared to state-of-the-art methods, paving the way for safer and more efficient CBCT imaging.
Submitted 5 May, 2025;
originally announced May 2025.
-
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Authors:
Xinjie Zhang,
Jintao Guo,
Shanshan Zhao,
Minghao Fu,
Lunhao Duan,
Guo-Hua Wang,
Qing-Guo Chen,
Zhao Xu,
Weihua Luo,
Kaifu Zhang
Abstract:
Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
Submitted 7 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
An Empirical Study of Qwen3 Quantization
Authors:
Xingyu Zheng,
Yuye Li,
Haoran Chu,
Yue Feng,
Xudong Ma,
Jie Luo,
Jinyang Guo,
Haotong Qin,
Michele Magno,
Xianglong Liu
Abstract:
The Qwen series has emerged as a leading family of open-source Large Language Models (LLMs), demonstrating remarkable capabilities in natural language understanding tasks. With the recent release of Qwen3, which exhibits superior performance across diverse benchmarks, there is growing interest in deploying these models efficiently in resource-constrained environments. Low-bit quantization presents a promising solution, yet its impact on Qwen3's performance remains underexplored. This study conducts a systematic evaluation of Qwen3's robustness under various quantization settings, aiming to uncover both opportunities and challenges in compressing this state-of-the-art model. We rigorously assess 5 existing classic post-training quantization techniques applied to Qwen3, spanning bit-widths from 1 to 8 bits, and evaluate their effectiveness across multiple datasets. Our findings reveal that while Qwen3 maintains competitive performance at moderate bit-widths, it experiences notable degradation in linguistic tasks under ultra-low precision, underscoring the persistent hurdles in LLM compression. These results emphasize the need for further research to mitigate performance loss in extreme quantization scenarios. We anticipate that this empirical analysis will provide actionable insights for advancing quantization methods tailored to Qwen3 and future LLMs, ultimately enhancing their practicality without compromising accuracy. Our project is released on https://github.com/Efficient-ML/Qwen3-Quantization and https://huggingface.co/collections/Efficient-ML/qwen3-quantization-68164450decb1c868788cb2b.
Submitted 4 May, 2025;
originally announced May 2025.
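To make the notion of low-bit post-training quantization in the abstract above concrete, here is a minimal sketch of symmetric round-to-nearest quantization of a weight vector. It is an illustration only and is not claimed to be one of the five techniques the paper evaluates; the function names and the toy weights are hypothetical, and the 1-bit case is left out because it needs a separate binarization scheme.

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization of a list of floats.

    Hypothetical helper for illustration; returns integer codes and the scale.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1              # largest positive code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest code and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy weights (made up for this example).
weights = [0.31, -1.20, 0.05, 0.77, -0.46, 1.05]
errors = {}
for bits in (8, 4, 2):
    codes, scale = quantize_rtn(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    print(bits, "bits -> mean abs error", round(errors[bits], 4))
```

On this toy vector the reconstruction error grows as the bit-width shrinks, which is the basic trade-off behind the degradation at ultra-low precision that the abstract reports.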
-
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach
Authors:
Shouyang Dong,
Yuanbo Wen,
Jun Bi,
Di Huang,
Jiaming Guo,
Jianxing Xu,
Ruibai Xu,
Xinkai Song,
Yifan Hao,
Xuehai Zhou,
Tianshi Chen,
Qi Guo,
Yunji Chen
Abstract:
Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques either demand tremendous manual effort or produce functionally incorrect code, leaving "Write Once, Run Anywhere" for tensor programs an open question.
We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is leveraging the powerful code generation ability of LLM to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets with a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs at the accuracy of 95% on average, and the performance of translated programs achieves up to 2.0x over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.
Submitted 4 May, 2025;
originally announced May 2025.
-
Bayesian learning of the optimal action-value function in a Markov decision process
Authors:
Jiaqi Guo,
Chon Wai Ho,
Sumeetpal S. Singh
Abstract:
The Markov Decision Process (MDP) is a popular framework for sequential decision-making problems, and uncertainty quantification is an essential component of it to learn optimal decision-making strategies. In particular, a Bayesian framework is used to maintain beliefs about the optimal decisions and the unknown ingredients of the model, which are also to be learned from the data, such as the rewards and state dynamics. However, many existing Bayesian approaches for learning the optimal decision-making strategy are based on unrealistic modelling assumptions and utilise approximate inference techniques. This raises doubts whether the benefits of Bayesian uncertainty quantification are fully realised or can be relied upon.
We focus on infinite-horizon and undiscounted MDPs, with finite state and action spaces, and a terminal state. We provide a full Bayesian framework, from modelling to inference to decision-making. For modelling, we introduce a likelihood function with minimal assumptions for learning the optimal action-value function based on Bellman's optimality equations, analyse its properties, and clarify connections to existing works. For deterministic rewards, the likelihood is degenerate and we introduce artificial observation noise to relax it, in a controlled manner, to facilitate more efficient Monte Carlo-based inference. For inference, we propose an adaptive sequential Monte Carlo algorithm to both sample from and adjust the sequence of relaxed posterior distributions. For decision-making, we choose actions using samples from the posterior distribution over the optimal strategies. While commonly done, we provide new insight that clearly shows that it is a generalisation of Thompson sampling from multi-arm bandit problems. Finally, we evaluate our framework on the Deep Sea benchmark problem and demonstrate the exploration benefits of posterior sampling in MDPs.
Submitted 3 May, 2025;
originally announced May 2025.
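The abstract above describes posterior-sampling action selection as a generalisation of Thompson sampling from multi-armed bandits. As a point of reference, the following is a minimal sketch of the classical bandit case only (Bernoulli rewards with Beta posteriors). It is not the paper's adaptive sequential Monte Carlo algorithm; the function name, arm probabilities, and seed are hypothetical choices for illustration.

```python
import random


def thompson_bernoulli(true_probs, n_rounds, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.

    Hypothetical illustration; returns how often each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible success probability per arm from its posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled values.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate Beta-Bernoulli posterior update.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls


# Three arms with made-up success probabilities; the last arm is best.
pulls = thompson_bernoulli([0.2, 0.5, 0.8], n_rounds=2000)
print(pulls)
```

Sampling from the posterior and then acting greedily on the sample is what drives exploration: arms with uncertain estimates occasionally produce high samples and get tried again.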
-
Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities
Authors:
Zhiwei Hao,
Jianyuan Guo,
Li Shen,
Yong Luo,
Han Hu,
Guoxia Wang,
Dianhai Yu,
Yonggang Wen,
Dacheng Tao
Abstract:
Large language models (LLMs) have achieved impressive performance across various domains. However, the substantial hardware resources required for their training present a significant barrier to efficiency and scalability. To mitigate this challenge, low-precision training techniques have been widely adopted, leading to notable advancements in training efficiency. Despite these gains, low-precision training involves several components (such as weights, activations, and gradients), each of which can be represented in different numerical formats. The resulting diversity has created a fragmented landscape in low-precision training research, making it difficult for researchers to gain a unified overview of the field. This survey provides a comprehensive review of existing low-precision training methods. To systematically organize these approaches, we categorize them into three primary groups based on their underlying numerical formats, which is a key factor influencing hardware compatibility, computational efficiency, and ease of reference for readers. The categories are: (1) fixed-point and integer-based methods, (2) floating-point-based methods, and (3) customized format-based methods. Additionally, we discuss quantization-aware training approaches, which share key similarities with low-precision training during forward propagation. Finally, we highlight several promising research directions to advance this field. A collection of papers discussed in this survey is provided in https://github.com/Hao840/Awesome-Low-Precision-Training.
Submitted 2 May, 2025;
originally announced May 2025.
-
Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models
Authors:
Xuhui Jiang,
Shengjie Ma,
Chengjin Xu,
Cehao Yang,
Liyu Zhang,
Jian Guo
Abstract:
Large Language Models (LLMs) have achieved remarkable success but remain data-inefficient, especially when learning from small, specialized corpora with limited and proprietary data. Existing synthetic data generation methods for continue pre-training focus on intra-document content and overlook cross-document knowledge associations, limiting content diversity and depth. We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations, and employing a graph walk strategy for knowledge-associated sampling. This enhances synthetic data diversity and coherence, enabling models to learn complex knowledge structures and handle rare knowledge. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthetic, enhancing reasoning processes and discriminative power. Experiments show that SoG outperforms the state-of-the-art (SOTA) method in a multi-hop document Q&A dataset while performing comparably to the SOTA method on the reading comprehension task datasets, which also underscores the better generalization capability of SoG. Our work advances synthetic data generation and provides practical solutions for efficient knowledge acquisition in LLMs, especially in domains with limited data availability.
Submitted 1 May, 2025;
originally announced May 2025.
-
CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
Authors:
Jianyu Wu,
Yizhou Wang,
Xiangyu Yue,
Xinzhu Ma,
Jingyang Guo,
Dongzhan Zhou,
Wanli Ouyang,
Shixiang Tang
Abstract:
While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and the dataset perspectives. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves Chamfer by +4.01 on image-conditioned CAD generation on mmABC. The dataset, code and pretrained network shall be released.
Submitted 29 April, 2025;
originally announced April 2025.
-
The Estimation of Continual Causal Effect for Dataset Shifting Streams
Authors:
Baining Chen,
Yiming Zhang,
Yuqiao Han,
Ruyue Zhang,
Ruihuan Du,
Zhishuo Zhou,
Zhengdan Zhu,
Xun Liu,
Jiecheng Guo
Abstract:
Causal effect estimation has been widely used in marketing optimization. The framework of an uplift model followed by a constrained optimization algorithm is popular in practice. To enhance performance in the online environment, the framework needs to be improved to address the complexities caused by temporal dataset shift. This paper focuses on capturing the dataset shift from user behavior and domain distribution changing over time. We propose an Incremental Causal Effect with Proxy Knowledge Distillation (ICE-PKD) framework to tackle this challenge. The ICE-PKD framework includes two components: (i) a multi-treatment uplift network that eliminates confounding bias using counterfactual regression; (ii) an incremental training strategy that adapts to the temporal dataset shift by updating with the latest data and protects generalization via replay-based knowledge distillation. We also revisit the uplift modeling metrics and introduce a novel metric for more precise online evaluation in multiple treatment scenarios. Extensive experiments on both simulated and online datasets show that the proposed framework achieves better performance. The ICE-PKD framework has been deployed in the marketing system of Huaxiaozhu, a ride-hailing platform in China.
Submitted 29 April, 2025;
originally announced April 2025.
-
GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation
Authors:
Jingfeng Guo,
Jinnan Chen,
Weikai Chen,
Zhenyu Sun,
Lanjiong Li,
Baozhu Zhao,
Lingting Zhu,
Xin Wang,
Qi Liu
Abstract:
This work presents GarmentX, a novel framework for generating diverse, high-fidelity, and wearable 3D garments from a single input image. Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, an overly unconstrained approach that often leads to severe self-intersections and physically implausible garment structures. In contrast, GarmentX introduces a structured and editable parametric representation compatible with GarmentCode, ensuring that the decoded sewing patterns always form valid, simulation-ready 3D garments while allowing for intuitive modifications of garment shape and style. To achieve this, we employ a masked autoregressive model that sequentially predicts garment parameters, leveraging autoregressive modeling for structured generation while mitigating inconsistencies in direct pattern prediction. Additionally, we introduce GarmentX dataset, a large-scale dataset of 378,682 garment parameter-image pairs, constructed through an automatic data generation pipeline that synthesizes diverse and high-quality garment images conditioned on parametric garment representations. Through integrating our method with GarmentX dataset, we achieve state-of-the-art performance in geometric fidelity and input image alignment, significantly outperforming prior approaches. We will release GarmentX dataset upon publication.
Submitted 29 April, 2025;
originally announced April 2025.
-
DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes
Authors:
Junlin Guo,
James R. Zimmer-Dauphinee,
Jordan M. Nieusma,
Siqi Lu,
Quan Liu,
Ruining Deng,
Can Cui,
Jialin Yue,
Yizhe Lin,
Tianyuan Yao,
Juming Xiong,
Junchao Zhu,
Chongyu Qu,
Yuechen Yang,
Mitchell Wilkes,
Xiao Wang,
Parker VanValkenburgh,
Steven A. Wernke,
Yuankai Huo
Abstract:
By mapping sites at large scales using remotely sensed data, archaeologists can generate unique insights into long-term demographic trends, inter-regional social networks, and past adaptations to climate change. Remote sensing surveys complement field-based approaches, and their reach can be especially great when combined with deep learning and computer vision techniques. However, conventional supervised deep learning methods face challenges in annotating fine-grained archaeological features at scale. While recent vision foundation models have shown remarkable success in learning large-scale remote sensing data with minimal annotations, most off-the-shelf solutions are designed for RGB images rather than multi-spectral satellite imagery, such as the 8-band data used in our study. In this paper, we introduce DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, specifically tailored for Andean archaeology. DeepAndes incorporates a customized DINOv2 self-supervised learning algorithm optimized for 8-band multi-spectral imagery, marking the first foundation model designed explicitly for the Andes region. We evaluate its image understanding performance through imbalanced image classification, image instance retrieval, and pixel-level semantic segmentation tasks. Our experiments show that DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few-shot learning scenarios, significantly outperforming models trained from scratch or pre-trained on smaller datasets. This underscores the effectiveness of large-scale self-supervised pre-training in archaeological remote sensing. Codes will be available on https://github.com/geopacha/DeepAndes.
Submitted 28 April, 2025;
originally announced April 2025.
-
Long-Distance Field Demonstration of Imaging-Free Drone Identification in Intracity Environments
Authors:
Junran Guo,
Tonglin Mu,
Keyuan Li,
Jianing Li,
Ziyang Luo,
Ye Chen,
Xiaodong Fan,
Jinquan Huang,
Minjie Liu,
Jinbei Zhang,
Ruoyang Qi,
Naiting Gu,
Shihai Sun
Abstract:
Detecting small objects, such as drones, over long distances presents a significant challenge with broad implications for security, surveillance, environmental monitoring, and autonomous systems. Traditional imaging-based methods rely on high-resolution image acquisition, but are often constrained by range, power consumption, and cost. In contrast, data-driven single-photon-single-pixel light detection and ranging (D²SP²-LiDAR) provides an imaging-free alternative, directly enabling target identification while reducing system complexity and cost. However, its detection range has been limited to a few hundred meters. Here, we introduce a novel integration of residual neural networks (ResNet) with D²SP²-LiDAR, incorporating a refined observation model to extend the detection range to 5 km in an intracity environment while enabling high-accuracy identification of drone poses and types. Experimental results demonstrate that our approach not only outperforms conventional imaging-based recognition systems, but also achieves 94.93% pose identification accuracy and 97.99% type classification accuracy, even under weak signal conditions with long distances and low signal-to-noise ratios (SNRs). These findings highlight the potential of imaging-free methods for robust long-range detection of small targets in real-world scenarios.
Submitted 26 April, 2025;
originally announced April 2025.
-
Bernstein Bounds for Caustics
Authors:
Zhimin Fan,
Chen Wang,
Yiming Wang,
Boxuan Li,
Yuxuan Guo,
Ling-Qi Yan,
Yanwen Guo,
Jie Guo
Abstract:
Systematically simulating specular light transport requires an exhaustive search for primitive tuples containing admissible paths. Given the extreme inefficiency of enumerating all combinations, we propose to significantly reduce the search domain by sampling such tuples. The challenge is to design proper sampling probabilities that keep the noise level controllable. Our key insight is that by bounding the range of irradiance contributed by each primitive tuple at a given position, we can sample a subset of primitive tuples with potentially high contributions. Although low-contribution tuples are assigned a negligible probability, the overall variance remains low. Therefore, we derive vertex position and irradiance bounds for each primitive tuple, introducing a bounding property of rational functions on the Bernstein basis. When formulating position and irradiance expressions into rational functions, we handle non-rational components through remainder variables to maintain validity. Finally, we carefully design the sampling probabilities by optimizing the upper bound of the variance, expressed only using the position and irradiance bound. The proposed primitive sampling is intrinsically unbiased. It can be seamlessly combined with various unbiased and biased root-finding techniques within a local primitive domain. Extensive evaluations show that our method enables fast and reliable rendering of complex caustic effects.
Submitted 27 April, 2025;
originally announced April 2025.
-
TransparentGS: Fast Inverse Rendering of Transparent Objects with Gaussians
Authors:
Letian Huang,
Dongwei Ye,
Jialin Dan,
Chengzhi Tao,
Huiwen Liu,
Kun Zhou,
Bo Ren,
Yuanqi Li,
Yanwen Guo,
Jie Guo
Abstract:
The emergence of neural and Gaussian-based radiance field methods has led to considerable advancements in novel view synthesis and 3D object reconstruction. Nonetheless, specular reflection and refraction continue to pose significant challenges due to the instability and incorrect overfitting of radiance fields to high-frequency light variations. Currently, even 3D Gaussian Splatting (3D-GS), as a powerful and efficient tool, falls short in recovering transparent objects with nearby contents due to the existence of apparent secondary ray effects. To address this issue, we propose TransparentGS, a fast inverse rendering pipeline for transparent objects based on 3D-GS. The main contributions are three-fold. Firstly, an efficient representation of transparent objects, transparent Gaussian primitives, is designed to enable specular refraction through a deferred refraction strategy. Secondly, we leverage Gaussian light field probes (GaussProbe) to encode both ambient light and nearby contents in a unified framework. Thirdly, a depth-based iterative probes query (IterQuery) algorithm is proposed to reduce the parallax errors in our probe-based framework. Experiments demonstrate the speed and accuracy of our approach in recovering transparent objects from complex environments, as well as several applications in computer graphics and vision.
Submitted 1 May, 2025; v1 submitted 25 April, 2025;
originally announced April 2025.
-
QuantBench: Benchmarking AI Methods for Quantitative Investment
Authors:
Saizhuo Wang,
Hao Kong,
Jiadong Guo,
Fengrui Hua,
Yiyan Qi,
Wanyun Zhou,
Jiahao Zheng,
Xinyu Wang,
Lionel M. Ni,
Jian Guo
Abstract:
The field of artificial intelligence (AI) in quantitative investment has seen significant advancements, yet it lacks a standardized benchmark aligned with industry practices. This gap hinders research progress and limits the practical application of academic innovations. We present QuantBench, an industrial-grade benchmark platform designed to address this critical need. QuantBench offers three key strengths: (1) standardization that aligns with quantitative investment industry practices, (2) flexibility to integrate various AI algorithms, and (3) full-pipeline coverage of the entire quantitative investment process. Our empirical studies using QuantBench reveal some critical research directions, including the need for continual learning to address distribution shifts, improved methods for modeling relational financial data, and more robust approaches to mitigate overfitting in low signal-to-noise environments. By providing a common ground for evaluation and fostering collaboration between researchers and practitioners, QuantBench aims to accelerate progress in AI for quantitative investment, similar to the impact of benchmark platforms in computer vision and natural language processing.
Submitted 24 April, 2025;
originally announced April 2025.
-
An Empirical Study of Evaluating Long-form Question Answering
Authors:
Ning Xian,
Yixing Fan,
Ruqing Zhang,
Maarten de Rijke,
Jiafeng Guo
Abstract:
Long-form question answering (LFQA) aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram matching, while the reliability of large language model-based evaluations for long-form answers remains relatively unexplored. We address this gap by conducting an in-depth study of long-form answer evaluation with the following research questions: (i) To what extent do existing automatic evaluation metrics serve as a substitute for human evaluations? (ii) What are the limitations of existing evaluation metrics compared to human evaluations? (iii) How can the effectiveness and robustness of existing evaluation methods be improved? We collect 5,236 factoid and non-factoid long-form answers generated by different large language models and conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. Subsequently, we investigate the performance of automatic evaluation metrics by evaluating these answers and analyzing the consistency between these metrics and human evaluations. We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics. However, fine-grained evaluation helps mitigate this issue on some metrics. Our findings have important implications for the use of large language models for evaluating long-form question answering. All code and datasets are available at https://github.com/bugtig6351/lfqa_evaluation.
Submitted 25 April, 2025;
originally announced April 2025.
-
Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation
Authors:
Zihan Cheng,
Jintao Guo,
Jian Zhang,
Lei Qi,
Luping Zhou,
Yinghuan Shi,
Yang Gao
Abstract:
To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity with input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by the success, in the paper, we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model's generalizability under domain shift issues. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model's learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To our best knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, which exceeds previous SOTA of 88.61%. The code is available at https://github.com/orange-czh/Mamba-Sea.
Submitted 24 April, 2025;
originally announced April 2025.
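The abstract above reports segmentation quality as a Dice coefficient. For readers unfamiliar with the metric, here is a minimal sketch of Dice on flat binary masks, Dice = 2|A ∩ B| / (|A| + |B|). The function name and toy masks are hypothetical, returning 1.0 for two empty masks is one common convention rather than a universal rule, and the paper's exact evaluation protocol (for example per-volume averaging) is not specified in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two equal-length flat binary masks.

    Hypothetical helper for illustration; truthy entries count as foreground.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Toy masks: 3 predicted and 3 true foreground pixels, 2 of them shared.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
dice = dice_coefficient(pred, target)
print(round(dice, 4))  # 2 * 2 / (3 + 3) = 0.6667
```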
-
"Ohhh, He's the Boss!": Unpacking Power Dynamics Among Developers, Designers, and End-Users in FLOSS Usability
Authors:
Jazlyn Hellman,
Itai Epstein,
Jinghui Cheng,
Jin L. C. Guo
Abstract:
Addressing usability in free, libre, and open-source software (FLOSS) is a challenging issue, particularly due to a long-existing "by developer, for developer" mentality. Engaging designers and end-users to work with developers can help improve its usability, but unequal power dynamics among those stakeholder roles must be mitigated. To explore how the power of different FLOSS stakeholders manifests and can be mediated during collaboration, we conducted eight design workshops with different combinations of key FLOSS stakeholders (i.e., developers, designers, and end-users). Leveraging existing theories on Dimensions of Power, we revealed how participants navigate existing role-based power structures through resource utilization, knowledge gap management, and experience referencing. We also observed that participants exhibited diverse behaviors confirming and challenging the status quo of FLOSS usability. Overall, our results contribute to a comprehensive understanding of the power dynamics among FLOSS stakeholders, providing valuable insights into ways to balance their power to improve FLOSS usability. Our work also serves as an exemplar of using design workshops as a research method to study power dynamics during collaboration that are usually hidden in the field.
Submitted 21 April, 2025;
originally announced April 2025.
-
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Authors:
David Ma,
Yuanxing Zhang,
Jincheng Ren,
Jarvis Guo,
Yifan Yao,
Zhenlin Wei,
Zhenzhu Yang,
Zhongyuan Peng,
Boyu Feng,
Jun Ma,
Xiao Gu,
Zhoufutu Wen,
King Zhu,
Yancheng He,
Meng Cao,
Shiwen Ni,
Jiaheng Liu,
Wenhao Huang,
Ge Zhang,
Xiaojie Jin
Abstract:
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
Submitted 21 April, 2025;
originally announced April 2025.
-
Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation
Authors:
Yunpu Zhao,
Rui Zhang,
Junbin Xiao,
Ruibo Hou,
Jiaming Guo,
Zihao Zhang,
Yifan Hao,
Yunji Chen
Abstract:
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration, resulting in misalignment between their verbalized confidence and response correctness. This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information. In this work, we propose a novel Confidence Calibration through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for VLMs in response to object-centric queries. We first introduce a perturbed dataset where Gaussian noise is applied to the key object regions to simulate visual uncertainty at different confidence levels, establishing an explicit mapping between visual ambiguity and confidence levels. We further enhance calibration through a two-stage training process combining supervised fine-tuning on the perturbed dataset with subsequent preference optimization. Extensive experiments on popular benchmarks demonstrate that our method significantly improves the alignment between verbalized confidence and response correctness while maintaining or enhancing overall task performance. These results highlight the potential of semantic perturbation as a practical tool for improving the reliability and interpretability of VLMs.
Submitted 21 April, 2025;
originally announced April 2025.
-
Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends
Authors:
Jiaxin GUO,
Xiaoyu Chen,
Zhiqiang Rao,
Jinlong Yang,
Zongyao Li,
Hengchao Shang,
Daimeng Wei,
Hao Yang
Abstract:
With the rapid development of deep learning technologies, the field of machine translation has witnessed significant progress, especially with the advent of large language models (LLMs) that have greatly propelled the advancement of document-level translation. However, accurately evaluating the quality of document-level translation remains an urgent issue. This paper first introduces the development status of document-level translation and the importance of evaluation, highlighting the crucial role of automatic evaluation metrics in reflecting translation quality and guiding the improvement of translation systems. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics, including evaluation methods with and without reference texts, as well as traditional metrics, model-based metrics, and LLM-based metrics. Subsequently, the paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method. Finally, the paper looks ahead to future trends in evaluation methods, including the development of more user-friendly document-level evaluation methods and more robust LLM-as-a-judge methods, and proposes possible research directions, such as reducing the dependency on sentence-level information, introducing multi-level and multi-granular evaluation approaches, and training models specifically for machine translation evaluation. This study aims to provide a comprehensive analysis of automatic evaluation for document-level translation and offer insights into future developments.
Submitted 20 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results
Authors:
Zheng Chen,
Jingkai Wang,
Kai Liu,
Jue Gong,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yulun Zhang,
Jianxing Zhang,
Jinlong Wu,
Jun Wang,
Zheng Xie,
Hakjae Jeon,
Suejin Han,
Hyung-Ju Chun,
Hyunhee Park,
Zhicun Yin,
Junjie Chen,
Ming Liu,
Xiaoming Li,
Chao Zhou,
Wangmeng Zuo,
Weixia Zhang,
Dingquan Li,
Kede Ma
, et al. (29 additional authors not shown)
Abstract:
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
Submitted 20 April, 2025;
originally announced April 2025.
-
Search is All You Need for Few-shot Anomaly Detection
Authors:
Qishan Wang,
Jia Guo,
Shuyong Gao,
Haofen Wang,
Li Xiong,
Junjie Hu,
Hanqi Guo,
Wenqiang Zhang
Abstract:
Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies - support augmentation to enhance feature matching adaptability and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD's exceptional performance. Using only 1 normal image as support, our method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8% respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at https://github.com/Qiqigeww/VisionAD.
Submitted 8 May, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
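The nearest-neighbor search idea named in the abstract above can be illustrated with a minimal sketch. This is not the VisionAD implementation: the function names, the two-dimensional toy features, and the use of the maximum patch distance as the image-level score are all assumptions made for illustration, and a real system would use features from a vision foundation model.

```python
import math

def nearest_neighbor_score(query, memory_bank):
    """Distance from one query feature to its closest stored normal feature."""
    return min(math.dist(query, normal) for normal in memory_bank)

def image_anomaly_score(query_features, memory_bank):
    """Image-level score (assumed here): the largest patch-level distance."""
    return max(nearest_neighbor_score(q, memory_bank) for q in query_features)

# Hypothetical toy features standing in for patch embeddings of normal images.
memory_bank = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

normal_image = [(0.1, 0.0), (0.9, 0.1)]      # every patch lies near the bank
anomalous_image = [(0.1, 0.0), (3.0, 4.0)]   # one patch lies far from the bank

normal_score = image_anomaly_score(normal_image, memory_bank)
anomalous_score = image_anomaly_score(anomalous_image, memory_bank)
```

No training step is involved: the memory bank is just a stored set of features, which is what makes this family of methods training-free.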
-
Data driven approach towards more efficient Newton-Raphson power flow calculation for distribution grids
Authors:
Shengyuan Yan,
Farzad Vazinram,
Zeynab Kaseb,
Lindsay Spoor,
Jochen Stiasny,
Betul Mamudi,
Amirhossein Heydarian Ardakani,
Ugochukwu Orji,
Pedro P. Vergara,
Yu Xiang,
Jerry Guo
Abstract:
Power flow (PF) calculations are fundamental to power system analysis to ensure stable and reliable grid operation. The Newton-Raphson (NR) method is commonly used for PF analysis due to its rapid convergence when initialized properly. However, as power grids operate closer to their capacity limits, ill-conditioned cases and convergence issues pose significant challenges. This work, therefore, addresses these challenges by proposing strategies to improve NR initialization, hence minimizing iterations and avoiding divergence. We explore three approaches: (i) an analytical method that estimates the basin of attraction using mathematical bounds on voltages, (ii) two data-driven models leveraging supervised learning or physics-informed neural networks (PINNs) to predict optimal initial guesses, and (iii) a reinforcement learning (RL) approach that incrementally adjusts voltages to accelerate convergence. These methods are tested on benchmark systems. This research is particularly relevant for modern power systems, where high penetration of renewables and decentralized generation requires robust and scalable PF solutions. In experiments, all three proposed methods demonstrate a strong ability to provide an initial guess that lets the Newton-Raphson method converge in fewer steps. The findings provide a pathway for more efficient real-time grid operations, which, in turn, support the transition toward smarter and more resilient electricity networks.
Submitted 15 April, 2025;
originally announced April 2025.
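The dependence of Newton-Raphson on its initial guess, which motivates the abstract above, can be seen on a toy scalar problem. This sketch is an assumption-laden illustration, not a power flow solver: the cubic f(x) = x^3 - 2x + 2 is a standard textbook example chosen because the iteration cycles between 0 and 1 when started at 0, and the helper name `newton_raphson` is hypothetical.

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    """Return (estimate, iterations used, converged flag)."""
    x = x0
    for i in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x, i, True
        x = x - fx / df(x)  # standard Newton update
    return x, max_iter, False

def f(x):
    return x**3 - 2 * x + 2

def df(x):
    return 3 * x**2 - 2

# Poor initial guess: the iterates alternate 0 -> 1 -> 0 -> ... and never settle.
bad_root, bad_iters, bad_ok = newton_raphson(f, df, x0=0.0)

# Better initial guess: the iterates converge to the real root in a few steps.
good_root, good_iters, good_ok = newton_raphson(f, df, x0=-2.0)
```

The same mechanism is why a predicted starting point inside the basin of attraction can reduce iteration counts or avoid divergence altogether.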
-
AdapCsiNet: Environment-Adaptive CSI Feedback via Scene Graph-Aided Deep Learning
Authors:
Jiayi Liu,
Jiajia Guo,
Yiming Cui,
Chao-Kai Wen,
Shi Jin
Abstract:
Accurate channel state information (CSI) is critical for realizing the full potential of multiple-antenna wireless communication systems. While deep learning (DL)-based CSI feedback methods have shown promise in reducing feedback overhead, their generalization capability across varying propagation environments remains limited due to their data-driven nature. Existing solutions based on online training improve adaptability but impose significant overhead in terms of data collection and computational resources. In this work, we propose AdapCsiNet, an environment-adaptive DL-based CSI feedback framework that eliminates the need for online training. By integrating environmental information -- represented as a scene graph -- into a hypernetwork-guided CSI reconstruction process, AdapCsiNet dynamically adapts to diverse channel conditions. A two-step training strategy is introduced to ensure baseline reconstruction performance and effective environment-aware adaptation. Simulation results demonstrate that AdapCsiNet achieves up to 46.4% improvement in CSI reconstruction accuracy and matches the performance of online learning methods without incurring additional runtime overhead.
Submitted 14 April, 2025;
originally announced April 2025.
-
Unleashing Expert Opinion from Social Media for Stock Prediction
Authors:
Wanyun Zhou,
Saizhuo Wang,
Xiang Li,
Yiyan Qi,
Jian Guo,
Xiaowen Chu
Abstract:
While the stock prediction task traditionally relies on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, our approach, like all other expert identification methods, faces a common challenge of signal sparsity, with expert signals covering only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. These combined signals significantly outperform representative baseline models in all quant-related metrics including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. We hope this work inspires further research into leveraging social media data for enhancing quantitative investment strategies. The code is available at https://github.com/wanyunzh/DualGAT.
Submitted 14 April, 2025;
originally announced April 2025.
-
A Secure Communication Protocol for Remote Keyless Entry System with Adaptive Adjustment of Transmission Parameters
Authors:
Jingjing Guo,
Bo Tang,
Jiayuan Xu,
Qingyi Li,
Yuyuan Qin,
Xinghua Li
Abstract:
Remote Keyless Entry (RKE) systems have become a standard feature in modern vehicles, yet their unidirectional fixed-frequency radio communication renders them vulnerable to replay attacks, impersonation attacks, cryptanalysis, and intentional interference. Existing cryptographic authentication methods enhance security but often fail to address real-world constraints such as computational efficiency and radio interference. To mitigate these threats, we designed the Adaptive Frequency-Hopping Algorithm and the Adaptive TXP and PHY Mode Control Algorithm that can dynamically optimize channel selection, transmission power, and PHY modes based on real-time channel quality assessment. To enhance the security and reliability of RKE systems, we propose the Lightweight Vehicle-Key Authentication Protocol. In addition, a prototype of the proposed scheme was implemented to verify its effectiveness in mitigating interference and preventing unauthorized access. Experimental results show that our scheme significantly enhances communication security and reliability while maintaining low computational overhead. Under mild interference conditions, the packet delivery rate (PDR) of the adaptive scheme increases from 93% to 99.23%, and under strong interference, it improves from 85% to 99.01%. Additionally, the scheme effectively prevents replay and impersonation attacks, ensuring secure vehicle access control by dynamically optimizing communication parameters to maintain stable and reliable transmission.
Submitted 13 April, 2025;
originally announced April 2025.
-
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Authors:
Weixiang Zhao,
Jiahe Guo,
Yulin Hu,
Yang Deng,
An Zhang,
Xingyu Sui,
Xinyang Han,
Yanyan Zhao,
Bing Qin,
Tat-Seng Chua,
Ting Liu
Abstract:
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
Submitted 13 April, 2025;
originally announced April 2025.
-
SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
Authors:
Peixian Ma,
Xialie Zhuang,
Chengjin Xu,
Xuhui Jiang,
Ran Chen,
Jian Guo
Abstract:
Natural Language to SQL (NL2SQL) enables intuitive interactions with databases by transforming natural language queries into structured SQL statements. Despite recent advancements in enhancing human-computer interaction within database applications, significant challenges persist, particularly regarding the inference performance in complex scenarios involving multi-table joins and nested queries. Current methodologies primarily utilize supervised fine-tuning (SFT) to train the NL2SQL model, which may limit adaptability and interpretability in new environments (e.g., finance and healthcare). In order to enhance the reasoning performance of the NL2SQL model in the above complex situations, we introduce SQL-R1, a novel NL2SQL reasoning model trained with reinforcement learning (RL) algorithms. We design a specialized RL-based reward function tailored for NL2SQL tasks and discuss the impact of cold start on the effectiveness of intensive training. In addition, we achieve competitive accuracy using only a tiny amount of synthetic NL2SQL data for augmented training and further explore data engineering for RL. In existing experiments, SQL-R1 achieves execution accuracy of 88.6% and 66.6% on the Spider and BIRD benchmarks, respectively, using only the 7B base model.
Submitted 12 May, 2025; v1 submitted 11 April, 2025;
originally announced April 2025.
-
MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
Authors:
Junliang Guo,
Yang Ye,
Tianyu He,
Haoyu Wu,
Yushu Jiang,
Tim Pearce,
Jiang Bian
Abstract:
World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer respectively, we construct the model input by interleaving the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. In inference, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models of different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action-following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, outperforming SoTA open-sourced diffusion-based world models significantly. The code and model have been released.
Submitted 11 April, 2025;
originally announced April 2025.
-
CTSR: Cartesian tensor-based sparse regression for data-driven discovery of high-dimensional invariant governing equations
Authors:
Boqian Zhang,
Juanmian Lei,
Guoyou Sun,
Shuaibing Ding,
Jian Guo
Abstract:
Accurate and concise governing equations are crucial for understanding system dynamics. Recently, data-driven methods such as sparse regression have been employed to automatically uncover governing equations from data, representing a significant shift from traditional first-principles modeling. However, most existing methods focus on scalar equations, limiting their applicability to simple, low-dimensional scenarios, and failing to ensure rotation and reflection invariance without incurring significant computational cost or requiring additional prior knowledge. This paper proposes a Cartesian tensor-based sparse regression (CTSR) technique to accurately and efficiently uncover complex, high-dimensional governing equations while ensuring invariance. Evaluations on two two-dimensional (2D) and two three-dimensional (3D) test cases demonstrate that the proposed method achieves superior accuracy and efficiency compared to the conventional technique.
Submitted 10 April, 2025;
originally announced April 2025.
-
Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models
Authors:
Ling Team,
Caizhi Tang,
Chilin Fu,
Chunwei Wu,
Jia Guo,
Jianwen Wang,
Jingyu Hu,
Liang Jiang,
Meng Li,
Peng Jiao,
Pingping Liu,
Shaomian Zheng,
Shiwei Liang,
Shuaicheng Li,
Yalin Zhang,
Yingting Wu,
Yongkang Liu,
Zhenyu Huang
Abstract:
This technical report presents Ring-Lite-Distill, a lightweight reasoning model derived from our open-source Mixture-of-Experts (MoE) Large Language Model (LLM) Ling-Lite. This study demonstrates that through meticulous high-quality data curation and ingenious training paradigms, the compact MoE model Ling-Lite can be further trained to achieve exceptional reasoning capabilities, while maintaining its parameter-efficient architecture with only 2.75 billion activated parameters, establishing an efficient lightweight reasoning architecture. In particular, in constructing this model, we have not merely focused on enhancing advanced reasoning capabilities, exemplified by high-difficulty mathematical problem solving, but rather aimed to develop a reasoning model with more comprehensive competency coverage. Our approach ensures coverage across reasoning tasks of varying difficulty levels while preserving generic capabilities, such as instruction following, tool use, and knowledge retention. We show that Ring-Lite-Distill's reasoning ability reaches a level comparable to DeepSeek-R1-Distill-Qwen-7B, while its general capabilities significantly surpass those of DeepSeek-R1-Distill-Qwen-7B. The models are accessible at https://huggingface.co/inclusionAI
Submitted 10 April, 2025; v1 submitted 9 April, 2025;
originally announced April 2025.
-
ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities
Authors:
Dingkun Yan,
Xinrui Wang,
Yusuke Iwasawa,
Yutaka Matsuo,
Suguru Saito,
Jiaxian Guo
Abstract:
Reference-based sketch colorization methods have garnered significant attention due to their potential applications in the animation production industry. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in spatial artifacts and significant degradation in overall colorization quality, limiting potential applications of current methods for general purposes. To address this limitation, we conduct an in-depth analysis of the carrier, defined as the latent representation facilitating information transfer from reference to sketch. Based on this analysis, we propose a novel workflow that dynamically adapts the carrier to optimize distinct aspects of colorization. Specifically, for spatially misaligned artifacts, we introduce a split cross-attention mechanism with spatial masks, enabling region-specific reference injection within the diffusion process. To mitigate semantic neglect of sketches, we employ dedicated background and style encoders to transfer detailed reference information in the latent feature space, achieving enhanced spatial control and richer detail synthesis. Furthermore, we propose character-mask merging and background bleaching as preprocessing steps to improve foreground-background integration and background generation. Extensive qualitative and quantitative evaluations, including a user study, demonstrate the superior performance of our proposed method compared to existing approaches. An ablation study further validates the efficacy of each proposed component.
Submitted 9 April, 2025;
originally announced April 2025.
-
CRYSIM: Prediction of Symmetric Structures of Large Crystals with GPU-based Ising Machines
Authors:
Chen Liang,
Diptesh Das,
Jiang Guo,
Ryo Tamura,
Zetian Mao,
Koji Tsuda
Abstract:
Solving black-box optimization problems with Ising machines is increasingly common in materials science. However, their application to crystal structure prediction (CSP) is still ineffective due to symmetry-agnostic encoding of atomic coordinates. We introduce CRYSIM, an algorithm that encodes the space group, the combination of Wyckoff positions, and the coordinates of independent atomic sites as separate variables. This encoding reduces the search space substantially by exploiting the symmetry in space groups. When CRYSIM was interfaced to Fixstars Amplify, a GPU-based Ising machine, its prediction performance was competitive with CALYPSO and Bayesian optimization for crystals containing more than 150 atoms in a unit cell. Although it is not realistic to interface CRYSIM to current small-scale quantum devices, it has the potential to become the standard CSP algorithm in the coming quantum age.
Submitted 9 April, 2025;
originally announced April 2025.
-
Bridging Queries and Tables through Entities in Table Retrieval
Authors:
Da Li,
Keping Bi,
Jiafeng Guo,
Xueqi Cheng
Abstract:
Table retrieval is essential for accessing information stored in structured tabular formats; however, it remains less explored than text retrieval. The content of the table primarily consists of phrases and words, which include a large number of entities, such as time, locations, persons, and organizations. Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval. In this work, we explore how to leverage entities in tables to improve retrieval performance. First, we investigate the important role of entities in table retrieval from a statistical perspective and propose an entity-enhanced training framework. Subsequently, we use the type of entities to highlight entities instead of introducing an external knowledge base. Moreover, we design an interaction paradigm based on entity representations. Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that our proposed framework is both simple and effective in enhancing existing retrievers. We also conduct extensive analyses to confirm the efficacy of different components. Overall, our work provides a promising direction for elevating table retrieval, enlightening future research in this area.
Submitted 8 April, 2025;
originally announced April 2025.
-
GTS-LUM: Reshaping User Behavior Modeling with LLMs in Telecommunications Industry
Authors:
Liu Shi,
Tianwu Zhou,
Wei Xu,
Li Liu,
Zhexin Cui,
Shaoyi Liang,
Haoxing Niu,
Yichong Tian,
Jianwei Guo
Abstract:
As telecommunication service providers shift their focus to analyzing user behavior for package design and marketing interventions, a critical challenge lies in developing a unified, end-to-end framework capable of modeling long-term and periodic user behavior sequences with diverse time granularities, multi-modal data inputs, and heterogeneous labels. This paper introduces GTS-LUM, a novel user behavior model that redefines modeling paradigms in telecommunication settings. GTS-LUM adopts a (multi-modal) encoder-adapter-LLM decoder architecture, enhanced with several telecom-specific innovations. Specifically, the model incorporates an advanced timestamp processing method to handle varying time granularities. It also supports multi-modal data inputs -- including structured tables and behavior co-occurrence graphs -- and aligns these with semantic information extracted by a tokenizer using a Q-former structure. Additionally, GTS-LUM integrates a front-placed target-aware mechanism to highlight historical behaviors most relevant to the target. Extensive experiments on an industrial dataset validate the effectiveness of this end-to-end framework and also demonstrate that GTS-LUM outperforms LLM4Rec approaches, which are popular in recommendation systems, offering an effective and generalizable solution for user behavior modeling in telecommunications.
Submitted 8 April, 2025;
originally announced April 2025.
-
Leveraging Robust Optimization for LLM Alignment under Distribution Shifts
Authors:
Mingye Zhu,
Yi Liu,
Junbo Guo,
Quan Wang,
Yongdong Zhang,
Zhendong Mao
Abstract:
Large language models (LLMs) increasingly rely on preference alignment methods to steer outputs toward human values, yet these methods are often constrained by the scarcity of high-quality human-annotated data. To tackle this, recent approaches have turned to synthetic data generated by LLMs as a scalable alternative. However, synthetic data can introduce distribution shifts, compromising the nuanced human preferences that are essential for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment in the presence of such shifts. Our approach first estimates the likelihood ratios between the target and training distributions using a learned classifier, and then minimizes the worst-case loss over data regions that reflect the target human-preferred distribution. By explicitly prioritizing the target distribution during optimization, our method mitigates the adverse effects of distributional variation and enhances the generation of responses that faithfully reflect human values.
Submitted 8 April, 2025;
originally announced April 2025.
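Editorial illustration (not from the paper): the abstract above describes two steps, estimating likelihood ratios with a learned classifier and then minimizing a worst-case loss over data regions. The sketch below shows the generic classifier-based density-ratio trick on 1-D toy data, followed by a ratio-weighted worst-region loss. All function names, the toy data, the logistic model, and the region-wise maximum are assumptions made for illustration; the paper's actual estimator and objective may differ.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) on 1-D inputs by gradient descent.

    Label 1 marks samples from the target distribution, label 0 marks
    samples from the training distribution (hypothetical toy setup).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def likelihood_ratio(x, w, b, n_train, n_target):
    """Estimate p_target(x) / p_train(x) from the classifier's odds.

    By Bayes' rule the ratio equals the odds P(target|x) / P(train|x)
    multiplied by the sample-size correction n_train / n_target.
    """
    odds = math.exp(w * x + b)  # sigmoid(z) / (1 - sigmoid(z)) = exp(z)
    return odds * (n_train / n_target)

def worst_region_loss(losses, ratios, regions):
    """Return the largest ratio-weighted mean loss over the given regions.

    This is a simple group-wise stand-in for a worst-case objective,
    not the paper's formulation.
    """
    totals = {}
    for loss, ratio, region in zip(losses, ratios, regions):
        num, den = totals.get(region, (0.0, 0.0))
        totals[region] = (num + ratio * loss, den + ratio)
    return max(num / den for num, den in totals.values())

# Toy data: training samples centred near 0, target samples near 2.
train_xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
target_xs = [1.0, 1.5, 2.0, 2.5, 3.0]
w, b = fit_logistic(train_xs + target_xs, [0] * 5 + [1] * 5)
ratios = [likelihood_ratio(x, w, b, 5, 5) for x in train_xs]
```

In this toy setup, training points that look more like the target samples receive larger ratios, so they contribute more to the weighted loss of their region.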
-
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Authors:
Yunlong Tang,
Jing Bi,
Chao Huang,
Susan Liang,
Daiki Shimada,
Hang Hua,
Yunzhong Xiao,
Yizhi Song,
Pinxin Liu,
Mingqian Feng,
Junjia Guo,
Zhuo Liu,
Luchuan Song,
Ali Vosoughi,
Jinxi He,
Liu He,
Zeliang Zhang,
Jiebo Luo,
Chenliang Xu
Abstract:
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning that enables detailed descriptions of user-selected objects through time. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. CAT-V supports flexible user interactions through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Our approach addresses limitations of existing video captioning methods, which either produce overly abstract descriptions or lack object-level precision, enabling fine-grained, object-specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at https://github.com/yunlong10/CAT-V
Submitted 8 April, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.