-
Reinforcement Learning-based Fault-Tolerant Control for Quadrotor with Online Transformer Adaptation
Authors:
Dohyun Kim,
Jayden Dongwoo Lee,
Hyochoong Bang,
Jungho Bae
Abstract:
Multirotors play a significant role in diverse field robotics applications but remain highly susceptible to actuator failures, leading to rapid instability and compromised mission reliability. While various fault-tolerant control (FTC) strategies using reinforcement learning (RL) have been widely explored, most previous approaches require prior knowledge of the multirotor model or struggle to adap…
▽ More
Multirotors play a significant role in diverse field robotics applications but remain highly susceptible to actuator failures, leading to rapid instability and compromised mission reliability. While various fault-tolerant control (FTC) strategies using reinforcement learning (RL) have been widely explored, most previous approaches require prior knowledge of the multirotor model or struggle to adapt to new configurations. To address these limitations, we propose a novel hybrid RL-based FTC framework integrated with a transformer-based online adaptation module. Our framework leverages a transformer architecture to infer latent representations in real time, enabling adaptation to previously unseen system models without retraining. We evaluate our method in a PyBullet simulation under loss-of-effectiveness actuator faults, achieving a 95% success rate and a positional root mean square error (RMSE) of 0.129 m, outperforming existing adaptation methods with 86% success and an RMSE of 0.153 m. Further evaluations on quadrotors with varying configurations confirm the robustness of our framework across untrained dynamics. These results demonstrate the potential of our framework to enhance the adaptability and reliability of multirotors, enabling efficient fault management in dynamic and uncertain environments. Website is available at http://00dhkim.me/paper/rl-ftc
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Guiding Data Collection via Factored Scaling Curves
Authors:
Lihan Zha,
Apurva Badithela,
Michael Zhang,
Justin Lidard,
Jeremy Bao,
Emily Zhou,
David Snyder,
Allen Z. Ren,
Dhruv Shah,
Anirudha Majumdar
Abstract:
Generalist imitation learning policies trained on large datasets show great promise for solving diverse manipulation tasks. However, to ensure generalization to different conditions, policies need to be trained with data collected across a large set of environmental factor variations (e.g., camera pose, table height, distractors) $-$ a prohibitively expensive undertaking, if done exhaustively. We…
▽ More
Generalist imitation learning policies trained on large datasets show great promise for solving diverse manipulation tasks. However, to ensure generalization to different conditions, policies need to be trained with data collected across a large set of environmental factor variations (e.g., camera pose, table height, distractors) $-$ a prohibitively expensive undertaking, if done exhaustively. We introduce a principled method for deciding what data to collect and how much to collect for each factor by constructing factored scaling curves (FSC), which quantify how policy performance varies as data scales along individual or paired factors. These curves enable targeted data acquisition for the most influential factor combinations within a given budget. We evaluate the proposed method through extensive simulated and real-world experiments, across both training-from-scratch and fine-tuning settings, and show that it boosts success rates in real-world tasks in new environments by up to 26% over existing data-collection strategies. We further demonstrate how factored scaling curves can effectively guide data collection using an offline metric, without requiring real-world evaluation at scale.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting
Authors:
Youngsik Yun,
Jeongmin Bae,
Hyunseung Son,
Seoha Kim,
Hahyun Lee,
Gun Bang,
Youngjung Uh
Abstract:
Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable…
▽ More
Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable artifacts in static regions. This paper identifies that errors such as noise in real-world recordings affect temporal inconsistency in online reconstruction. We propose a method that enhances temporal consistency in online reconstruction from observations with temporal inconsistency which is inevitable in cameras. We show that our method restores the ideal observation by subtracting the learned error. We demonstrate that applying our method to various baselines significantly enhances both temporal consistency and rendering quality across datasets. Code, video results, and checkpoints are available at https://bbangsik13.github.io/OR2.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
A Generalized Meta Federated Learning Framework with Theoretical Convergence Guarantees
Authors:
Mohammad Vahid Jamali,
Hamid Saber,
Jung Hyun Bae
Abstract:
Meta federated learning (FL) is a personalized variant of FL, where multiple agents collaborate on training an initial shared model without exchanging raw data samples. The initial model should be trained in a way that current or new agents can easily adapt it to their local datasets after one or a few fine-tuning steps, thus improving the model personalization. Conventional meta FL approaches min…
▽ More
Meta federated learning (FL) is a personalized variant of FL, where multiple agents collaborate on training an initial shared model without exchanging raw data samples. The initial model should be trained in a way that current or new agents can easily adapt it to their local datasets after one or a few fine-tuning steps, thus improving the model personalization. Conventional meta FL approaches minimize the average loss of agents on the local models obtained after one step of fine-tuning. In practice, agents may need to apply several fine-tuning steps to adapt the global model to their local data, especially under highly heterogeneous data distributions across agents. To this end, we present a generalized framework for the meta FL by minimizing the average loss of agents on their local model after any arbitrary number $ν$ of fine-tuning steps. For this generalized framework, we present a variant of the well-known federated averaging (FedAvg) algorithm and conduct a comprehensive theoretical convergence analysis to characterize the convergence speed as well as behavior of the meta loss functions in both the exact and approximated cases. Our experiments on real-world datasets demonstrate superior accuracy and faster convergence for the proposed scheme compared to conventional approaches.
△ Less
Submitted 12 May, 2025; v1 submitted 30 April, 2025;
originally announced April 2025.
-
GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
Authors:
Kaichen Zhang,
Yuzhong Hong,
Junwei Bao,
Hongfei Jiang,
Yang Song,
Dingqian Hong,
Hui Xiong
Abstract:
Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their pract…
▽ More
Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption. To address this challenge, we present Group Variance Policy Optimization (GVPO). GVPO incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. The method provides intuitive physical interpretations: its gradient mirrors the mean squared error between the central distance of implicit rewards and that of actual rewards. GVPO offers two key advantages: (1) it guarantees a unique optimal solution, exactly the KL-constrained reward maximization objective, (2) it supports flexible sampling distributions that avoids on-policy and importance sampling limitations. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
Fast Autoregressive Models for Continuous Latent Generation
Authors:
Tiankai Hang,
Jianmin Bao,
Fangyun Wei,
Dong Chen
Abstract:
Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the h…
▽ More
Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the high computational cost of the iterative denoising process. To address this, we propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR seamlessly integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves $2.3\times$ faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the critical gap between quality and scalability in visual autoregressive modeling.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
Authors:
Cailin Zhuang,
Yaoqi Hu,
Xuanyang Zhang,
Wei Cheng,
Jiacheng Bao,
Shengqi Liu,
Yiying Yang,
Xianfang Zeng,
Gang Yu,
Ming Li
Abstract:
3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual…
▽ More
3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on NeRF synthetic dataset (objects) and tandt db (scenes) datasets, StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Motion Synthesis with Sparse and Flexible Keyjoint Control
Authors:
Inwoo Hwang,
Jinseok Bae,
Donggeun Lim,
Young Min Kim
Abstract:
Creating expressive character animations is labor-intensive, requiring intricate manual adjustment of animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive co…
▽ More
Creating expressive character animations is labor-intensive, requiring intricate manual adjustment of animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive control in diverse scenarios, we propose a practical controllable motions synthesis framework that respects sparse and flexible keyjoint signals. Our approach employs a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion based on the completed keyjoint trajectories. The low-dimensional keyjoint movements can easily adapt to various control signal types, such as end-effector position for diverse goal-driven motion synthesis, or incorporate functional constraints on a subset of keyjoints. Additionally, we introduce a time-agnostic control formulation, eliminating the need for frame-specific timing annotations and enhancing control flexibility. Then, the shared second stage can synthesize a natural whole-body motion that precisely satisfies the task requirement from dense keyjoint movements. We demonstrate the effectiveness of sparse and flexible keyjoint control through comprehensive experiments on diverse datasets and scenarios.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
Less is More: Improving Motion Diffusion Models with Sparse Keyframes
Authors:
Jinseok Bae,
Inwoo Hwang,
Young Yoon Lee,
Ziyu Guo,
Joseph Liu,
Yizhak Ben-Shabat,
Young Min Kim,
Mubbasir Kapadia
Abstract:
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intr…
▽ More
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks.
△ Less
Submitted 12 May, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
Robust Detection of Extremely Thin Lines Using 0.2mm Piano Wire
Authors:
Jisoo Hong,
Youngjin Jung,
Jihwan Bae,
Seungho Song,
Sung-Woo Kang
Abstract:
This study developed an algorithm capable of detecting a reference line (a 0.2 mm thick piano wire) to accurately determine the position of an automated installation robot within an elevator shaft. A total of 3,245 images were collected from the experimental tower of H Company, the leading elevator manufacturer in South Korea, and the detection performance was evaluated using four experimental app…
▽ More
This study developed an algorithm capable of detecting a reference line (a 0.2 mm thick piano wire) to accurately determine the position of an automated installation robot within an elevator shaft. A total of 3,245 images were collected from the experimental tower of H Company, the leading elevator manufacturer in South Korea, and the detection performance was evaluated using four experimental approaches (GCH, GSCH, GECH, FCH). During the initial image processing stage, Gaussian blurring, sharpening filter, embossing filter, and Fourier Transform were applied, followed by Canny Edge Detection and Hough Transform. Notably, the method was developed to accurately extract the reference line by averaging the x-coordinates of the lines detected through the Hough Transform. This approach enabled the detection of the 0.2 mm thick piano wire with high accuracy, even in the presence of noise and other interfering factors (e.g., concrete cracks inside the elevator shaft or safety bars for filming equipment). The experimental results showed that Experiment 4 (FCH), which utilized Fourier Transform in the preprocessing stage, achieved the highest detection rate for the LtoL, LtoR, and RtoL datasets. Experiment 2(GSCH), which applied Gaussian blurring and a sharpening filter, demonstrated superior detection performance on the RtoR dataset. This study proposes a reference line detection algorithm that enables precise position calculation and control of automated robots in elevator shaft installation. Moreover, the developed method shows potential for applicability even in confined working spaces. Future work aims to develop a line detection algorithm equipped with machine learning-based hyperparameter tuning capabilities.
△ Less
Submitted 4 March, 2025;
originally announced March 2025.
-
Versatile Physics-based Character Control with Hybrid Latent Representation
Authors:
Jinseok Bae,
Jungdam Won,
Donggeun Lim,
Inwoo Hwang,
Young Min Kim
Abstract:
We present a versatile latent representation that enables physically simulated character to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to…
▽ More
We present a versatile latent representation that enables physically simulated character to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to build a versatile motion prior that can be adapted to a wide range of challenging control tasks. Specifically, we build a discrete latent model to capture distinctive posterior distribution without collapse, and simultaneously augment the sampled vector with the continuous residuals to generate high-quality, smooth motion without jittering. We further incorporate Residual Vector Quantization, which not only maximizes the capacity of the discrete motion prior, but also efficiently abstracts the action space during the task learning phase. We demonstrate that our agent can produce diverse yet smooth motions simply by traversing the learned motion prior through unconditional motion generation. Furthermore, our model robustly satisfies sparse goal conditions with highly expressive natural motions, including head-mounted device tracking and motion in-betweening at irregular intervals, which could not be achieved with existing latent representations.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
A Time Series Multitask Framework Integrating a Large Language Model, Pre-Trained Time Series Model, and Knowledge Graph
Authors:
Shule Hao,
Junpeng Bao,
Chuncheng Lu
Abstract:
Time series analysis is crucial in fields like finance, transportation, and industry. However, traditional models often focus solely on temporal features, limiting their ability to capture underlying information. This paper proposes a novel time series multitask framework, called LTM, which integrates temporal features with textual descriptions to enhance analytical and predictive capabilities. LT…
▽ More
Time series analysis is crucial in fields like finance, transportation, and industry. However, traditional models often focus solely on temporal features, limiting their ability to capture underlying information. This paper proposes a novel time series multitask framework, called LTM, which integrates temporal features with textual descriptions to enhance analytical and predictive capabilities. LTM combines pre-trained time series model, large language model (LLM), and knowledge graph to tackle time series tasks, including forecasting, imputation, and anomaly detection. LTM achieves improved performance with a few trainable parameters. It is very efficient and practical. LTM encodes time series data into patches and enriches user-provided prompts using knowledge graphs to generate enhanced prompts. A novel feature fusion method embeds prompts into each patch encoding, which is processed by a frozen LLM, followed by a feature enhancement module and a time decoder module. During fine-tuning stage, cosine similarity between prompts and temporal patches is integrated into the loss function to boost performance. Experiments on benchmark datasets show that LTM significantly outperforms existing methods. It provides a robust and versatile solution for time series tasks.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables
Authors:
Gunho Park,
Hyeokjun Kwon,
Jiwoo Kim,
Jeongin Bae,
Baeseong Park,
Dongsoo Lee,
Youngjoo Lee
Abstract:
Weight-only quantization has emerged as a promising solution to the deployment challenges of large language models (LLMs). However, it necessitates FP-INT operations, which make implementation on general-purpose hardware like GPUs difficult. In this paper, we propose FIGLUT, an efficient look-up table (LUT)-based GEMM accelerator architecture. Instead of performing traditional arithmetic operation…
▽ More
Weight-only quantization has emerged as a promising solution to the deployment challenges of large language models (LLMs). However, it necessitates FP-INT operations, which make implementation on general-purpose hardware like GPUs difficult. In this paper, we propose FIGLUT, an efficient look-up table (LUT)-based GEMM accelerator architecture. Instead of performing traditional arithmetic operations, FIGLUT retrieves precomputed values from an LUT based on weight patterns, significantly reducing the computational complexity. We also introduce a novel LUT design that addresses the limitations of conventional memory architectures. To further improve LUT-based operations, we propose a half-size LUT combined with a dedicated decoding and multiplexing unit. FIGLUT efficiently supports different bit precisions and quantization methods using a single fixed hardware configuration. For the same 3-bit weight precision, FIGLUT demonstrates 59% higher TOPS/W and 20% lower perplexity than state-of-the-art accelerator designs. When targeting the same perplexity, FIGLUT achieves 98% higher TOPS/W by performing 2.4-bit operations.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
Bridging Synthetic-to-Real Gaps: Frequency-Aware Perturbation and Selection for Single-shot Multi-Parametric Mapping Reconstruction
Authors:
Linyu Fan,
Che Wang,
Ming Ye,
Qizhi Yang,
Zejun Wu,
Xinghao Ding,
Yue Huang,
Jianfeng Bao,
Shuhui Cai,
Congbo Cai
Abstract:
Data-centric artificial intelligence (AI) has remarkably advanced medical imaging, with emerging methods using synthetic data to address data scarcity while introducing synthetic-to-real gaps. Unsupervised domain adaptation (UDA) shows promise in ground truth-scarce tasks, but its application in reconstruction remains underexplored. Although multiple overlapping-echo detachment (MOLED) achieves ul…
▽ More
Data-centric artificial intelligence (AI) has remarkably advanced medical imaging, with emerging methods using synthetic data to address data scarcity while introducing synthetic-to-real gaps. Unsupervised domain adaptation (UDA) shows promise in ground truth-scarce tasks, but its application in reconstruction remains underexplored. Although multiple overlapping-echo detachment (MOLED) achieves ultra-fast multi-parametric reconstruction, extending its application to various clinical scenarios, the quality suffers from deficiency in mitigating the domain gap, difficulty in maintaining structural integrity, and inadequacy in ensuring mapping accuracy. To resolve these issues, we proposed frequency-aware perturbation and selection (FPS), comprising Wasserstein distance-modulated frequency-aware perturbation (WDFP) and hierarchical frequency-aware selection network (HFSNet), which integrates frequency-aware adaptive selection (FAS), compact FAS (cFAS) and feature-aware architecture integration (FAI). Specifically, perturbation activates domain-invariant feature learning within uncertainty, while selection refines optimal solutions within perturbation, establishing a robust and closed-loop learning pathway. Extensive experiments on synthetic data, along with diverse real clinical cases from 5 healthy volunteers, 94 ischemic stroke patients, and 46 meningioma patients, demonstrate the superiority and clinical applicability of FPS. Furthermore, FPS is applied to diffusion tensor imaging (DTI), underscoring its versatility and potential for broader medical applications. The code is available at https://github.com/flyannie/FPS.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
Targeted Distillation for Sentiment Analysis
Authors:
Yice Zhang,
Guangyu Xie,
Jingjie Lin,
Jianzhu Bao,
Qianlong Wang,
Xi Zeng,
Ruifeng Xu
Abstract:
This paper presents a compact model that achieves strong sentiment analysis capabilities through targeted distillation from advanced large language models (LLMs). Our methodology decouples the distillation target into two key components: sentiment-related knowledge and task alignment. To transfer these components, we propose a two-stage distillation framework. The first stage, knowledge-driven dis…
▽ More
This paper presents a compact model that achieves strong sentiment analysis capabilities through targeted distillation from advanced large language models (LLMs). Our methodology decouples the distillation target into two key components: sentiment-related knowledge and task alignment. To transfer these components, we propose a two-stage distillation framework. The first stage, knowledge-driven distillation (\textsc{KnowDist}), transfers sentiment-related knowledge to enhance fundamental sentiment analysis capabilities. The second stage, in-context learning distillation (\textsc{ICLDist}), transfers task-specific prompt-following abilities to optimize task alignment. For evaluation, we introduce \textsc{SentiBench}, a comprehensive sentiment analysis benchmark comprising 3 task categories across 12 datasets. Experiments on this benchmark demonstrate that our model effectively balances model size and performance, showing strong competitiveness compared to existing small-scale LLMs.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Authors:
Microsoft,
:,
Abdelrahman Abouelenin,
Atabak Ashfaq,
Adam Atkinson,
Hany Awadalla,
Nguyen Bach,
Jianmin Bao,
Alon Benhaim,
Martin Cai,
Vishrav Chaudhary,
Congcong Chen,
Dong Chen,
Dongdong Chen,
Junkun Chen,
Weizhu Chen,
Yen-Chun Chen,
Yi-ling Chen,
Qi Dai,
Xiyang Dai,
Ruchao Fan,
Mei Gao,
Min Gao,
Amit Garg,
Abhishek Goswami
, et al. (51 additional authors not shown)
Abstract:
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement…
▽ More
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
△ Less
Submitted 7 March, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
Authors:
Zhendong Wang,
Jianmin Bao,
Shuyang Gu,
Dong Chen,
Wengang Zhou,
Houqiang Li
Abstract:
In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of…
▽ More
In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Quantum algorithms and lower bounds for eccentricity, radius, and diameter in undirected graphs
Authors:
Adam Wesołowski,
Jinge Bao
Abstract:
The problems of computing eccentricity, radius, and diameter are fundamental to graph theory. These parameters are intrinsically defined based on the distance metric of the graph. In this work, we propose quantum algorithms for the diameter and radius of undirected, weighted graphs in the adjacency list model. The algorithms output diameter and radius with the corresponding paths in…
▽ More
The problems of computing eccentricity, radius, and diameter are fundamental to graph theory. These parameters are intrinsically defined based on the distance metric of the graph. In this work, we propose quantum algorithms for the diameter and radius of undirected, weighted graphs in the adjacency list model. The algorithms output diameter and radius with the corresponding paths in $\widetilde{O}(n\sqrt{m})$ time. Additionally, for the diameter, we present a quantum algorithm that approximates the diameter within a $2/3$ ratio in $\widetilde{O}(\sqrt{m}n^{3/4})$ time. We also establish quantum query lower bounds of $Ω(\sqrt{nm})$ for all the aforementioned problems through a reduction from the minima finding problem.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
Authors:
Yifan Pu,
Yiming Zhao,
Zhicong Tang,
Ruihong Yin,
Haoxing Ye,
Yuhui Yuan,
Dong Chen,
Jianmin Bao,
Sirui Zhang,
Yanbin Wang,
Lin Liang,
Lijuan Wang,
Ji Li,
Xiu Li,
Zhouhui Lian,
Gao Huang,
Baining Guo
Abstract:
Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Insp…
▽ More
Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory suggests that knowledge is organized in frameworks (schemas) that enable people to interpret and learn from new information by linking it to prior knowledge.}, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, which is in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
△ Less
Submitted 25 February, 2025;
originally announced February 2025.
-
Path Planning using Instruction-Guided Probabilistic Roadmaps
Authors:
Jiaqi Bao,
Ryo Yonetani
Abstract:
This work presents a novel data-driven path planning algorithm named Instruction-Guided Probabilistic Roadmap (IG-PRM). Despite the recent development and widespread use of mobile robot navigation, the safe and effective travels of mobile robots still require significant engineering effort to take into account the constraints of robots and their tasks. With IG-PRM, we aim to address this problem b…
▽ More
This work presents a novel data-driven path planning algorithm named Instruction-Guided Probabilistic Roadmap (IG-PRM). Despite the recent development and widespread use of mobile robot navigation, the safe and effective travels of mobile robots still require significant engineering effort to take into account the constraints of robots and their tasks. With IG-PRM, we aim to address this problem by allowing robot operators to specify such constraints through natural language instructions, such as ``aim for wider paths'' or ``mind small gaps''. The key idea is to convert such instructions into embedding vectors using large-language models (LLMs) and use the vectors as a condition to predict instruction-guided cost maps from occupancy maps. By constructing a roadmap based on the predicted costs, we can find instruction-guided paths via the standard shortest path search. Experimental results demonstrate the effectiveness of our approach on both synthetic and real-world indoor navigation environments.
△ Less
Submitted 23 February, 2025;
originally announced February 2025.
-
Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge
Authors:
Heegyu Kim,
Taeyang Jeon,
Seungtaek Choi,
Ji Hoon Hong,
Dong Won Jeon,
Ga-Yeon Baek,
Gyeong-Won Kwak,
Dong-Hee Lee,
Jisu Bae,
Chihoon Lee,
Yunseo Kim,
Seon-Jin Choi,
Jin-Seong Park,
Sung Beom Cho,
Hyunsouk Cho
Abstract:
Materials synthesis is vital for innovations such as energy storage, catalysis, electronics, and biomedical devices. Yet, the process relies heavily on empirical, trial-and-error methods guided by expert intuition. Our work aims to support the materials science community by providing a practical, data-driven resource. We have curated a comprehensive dataset of 17K expert-verified synthesis recipes…
▽ More
Materials synthesis is vital for innovations such as energy storage, catalysis, electronics, and biomedical devices. Yet, the process relies heavily on empirical, trial-and-error methods guided by expert intuition. Our work aims to support the materials science community by providing a practical, data-driven resource. We have curated a comprehensive dataset of 17K expert-verified synthesis recipes from open-access literature, which forms the basis of our newly developed benchmark, AlchemyBench. AlchemyBench offers an end-to-end framework that supports research in large language models applied to synthesis prediction. It encompasses key tasks, including raw materials and equipment prediction, synthesis procedure generation, and characterization outcome forecasting. We propose an LLM-as-a-Judge framework that leverages large language models for automated evaluation, demonstrating strong statistical agreement with expert assessments. Overall, our contributions offer a supportive foundation for exploring the capabilities of LLMs in predicting and guiding materials synthesis, ultimately paving the way for more efficient experimental design and accelerated innovation in materials science.
△ Less
Submitted 19 March, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition
Authors:
Priya Kasimbeg,
Frank Schneider,
Runa Eschenhagen,
Juhan Bae,
Chandramouli Shama Sastry,
Mark Saroufim,
Boyuan Feng,
Less Wright,
Edward Z. Yang,
Zachary Nado,
Sourabh Medapati,
Philipp Hennig,
Michael Rabbat,
George E. Dahl
Abstract:
The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions ar…
▽ More
The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Diffusion Models without Classifier-free Guidance
Authors:
Zhicong Tang,
Jianmin Bao,
Dong Chen,
Baining Guo
Abstract:
This paper presents Model-guidance (MG), a novel objective for training diffusion model that addresses and removes of the commonly used Classifier-free guidance (CFG). Our innovative approach transcends the standard modeling of solely data distribution to incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is easy yet effective, making…
▽ More
This paper presents Model-guidance (MG), a novel objective for training diffusion model that addresses and removes of the commonly used Classifier-free guidance (CFG). Our innovative approach transcends the standard modeling of solely data distribution to incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is easy yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieve exceptional quality that parallel and even surpass concurrent diffusion models with CFG. Extensive experiments demonstrate the effectiveness, efficiency, scalability on different models and datasets. Finally, we establish state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
FinMamba: Market-Aware Graph Enhanced Multi-Level Mamba for Stock Movement Prediction
Authors:
Yifan Hu,
Peiyuan Liu,
Yuante Li,
Dawei Cheng,
Naiqi Li,
Tao Dai,
Jigang Bao,
Shu-Tao Xia
Abstract:
Recently, combining stock features with inter-stock correlations has become a common and effective approach for stock movement prediction. However, financial data presents significant challenges due to its low signal-to-noise ratio and the dynamic complexity of the market, which give rise to two key limitations in existing methods. First, the relationships between stocks are highly influenced by m…
▽ More
Recently, combining stock features with inter-stock correlations has become a common and effective approach for stock movement prediction. However, financial data presents significant challenges due to its low signal-to-noise ratio and the dynamic complexity of the market, which give rise to two key limitations in existing methods. First, the relationships between stocks are highly influenced by multifaceted factors including macroeconomic market dynamics, and current models fail to adaptively capture these evolving interactions under specific market conditions. Second, for the accuracy and timeliness required by real-world trading, existing financial data mining methods struggle to extract beneficial pattern-oriented dependencies from long historical data while maintaining high efficiency and low memory consumption. To address the limitations, we propose FinMamba, a Mamba-GNN-based framework for market-aware and multi-level hybrid stock movement prediction. Specifically, we devise a dynamic graph to learn the changing representations of inter-stock relationships by integrating a pruning module that adapts to market trends. Afterward, with a selective mechanism, the multi-level Mamba discards irrelevant information and resets states to skillfully recall historical patterns across multiple time scales with linear time costs, which are then jointly optimized for reliable prediction. Extensive experiments on U.S. and Chinese stock markets demonstrate the effectiveness of our proposed FinMamba, achieving state-of-the-art prediction accuracy and trading profitability, while maintaining low computational complexity. The code is available at https://github.com/TROUBADOUR000/FinMamba.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Spectral-factorized Positive-definite Curvature Learning for NN Training
Authors:
Wu Lin,
Felix Dangel,
Runa Eschenhagen,
Juhan Bae,
Richard E. Turner,
Roger B. Grosse
Abstract:
Many training methods, such as Adam(W) and Shampoo, learn a positive-definite curvature matrix and apply an inverse root before preconditioning. Recently, non-diagonal training methods, such as Shampoo, have gained significant attention; however, they remain computationally inefficient and are limited to specific types of curvature information due to the costly matrix root computation via matrix d…
▽ More
Many training methods, such as Adam(W) and Shampoo, learn a positive-definite curvature matrix and apply an inverse root before preconditioning. Recently, non-diagonal training methods, such as Shampoo, have gained significant attention; however, they remain computationally inefficient and are limited to specific types of curvature information due to the costly matrix root computation via matrix decomposition. To address this, we propose a Riemannian optimization approach that dynamically adapts spectral-factorized positive-definite curvature estimates, enabling the efficient application of arbitrary matrix roots and generic curvature learning. We demonstrate the efficacy and versatility of our approach in positive-definite matrix optimization and covariance adaptation for gradient-free optimization, as well as its efficiency in curvature learning for neural net training.
△ Less
Submitted 28 March, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement
Authors:
Jae-Sung Bae,
Anastasia Kuznetsova,
Dinesh Manocha,
John Hershey,
Trausti Kristjansson,
Minje Kim
Abstract:
This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP 2025. Collecting high-quality personalized data is challenging due to privacy concerns and technical difficulties in recording audio from the test scene. To add…
▽ More
This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP 2025. Collecting high-quality personalized data is challenging due to privacy concerns and technical difficulties in recording audio from the test scene. To address these issues, synthetic data generation using generative models has gained significant attention. In this challenge, participants are tasked first with building zero-shot TTS systems to augment personalized data. Subsequently, PSE systems are asked to be trained with this augmented personalized dataset. Through this challenge, we aim to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance. We also provide baseline experiments using open-source zero-shot TTS models to encourage participation and benchmark advancements. Our baseline code implementation and checkpoints are available online.
△ Less
Submitted 22 January, 2025;
originally announced January 2025.
-
Training Data Attribution (TDA): Examining Its Adoption & Use Cases
Authors:
Deric Cheng,
Juhan Bae,
Justin Bullock,
David Kristofferson
Abstract:
This report investigates Training Data Attribution (TDA) and its potential importance to and tractability for reducing extreme risks from AI. First, we discuss the plausibility and amount of effort it would take to bring existing TDA research efforts from their current state, to an efficient and accurate tool for TDA inference that can be run on frontier-scale LLMs. Next, we discuss the numerous r…
▽ More
This report investigates Training Data Attribution (TDA) and its potential importance to and tractability for reducing extreme risks from AI. First, we discuss the plausibility and amount of effort it would take to bring existing TDA research efforts from their current state, to an efficient and accurate tool for TDA inference that can be run on frontier-scale LLMs. Next, we discuss the numerous research benefits AI labs will expect to see from using such TDA tooling. Then, we discuss a key outstanding bottleneck that would limit such TDA tooling from being accessible publicly: AI labs' willingness to disclose their training data. We suggest ways AI labs may work around these limitations, and discuss the willingness of governments to mandate such access. Assuming that AI labs willingly provide access to TDA inference, we then discuss what high-level societal benefits you might see. We list and discuss a series of policies and systems that may be enabled by TDA. Finally, we present an evaluation of TDA's potential impact on mitigating large-scale risks from AI systems.
△ Less
Submitted 22 January, 2025;
originally announced January 2025.
-
Quantum Advantage in Private Multiple Hypothesis Testing
Authors:
Seung-Hyun Nam,
Hyun-Young Park,
Joonwoo Bae,
Si-Hyeon Lee
Abstract:
For multiple hypothesis testing based on classical data samples, we demonstrate a quantum advantage in the optimal privacy-utility trade-off (PUT), where the privacy and utility measures are set to (quantum) local differential privacy and the pairwise-minimum Chernoff information, respectively. To show the quantum advantage, we consider some class of hypotheses that we coin smoothed point masses.…
▽ More
For multiple hypothesis testing based on classical data samples, we demonstrate a quantum advantage in the optimal privacy-utility trade-off (PUT), where the privacy and utility measures are set to (quantum) local differential privacy and the pairwise-minimum Chernoff information, respectively. To show the quantum advantage, we consider some class of hypotheses that we coin smoothed point masses. For such hypotheses, we derive an upper bound of the optimal PUT achieved by classical mechanisms, which is tight for some cases, and propose a certain quantum mechanism which achieves a better PUT than the upper bound. The proposed quantum mechanism consists of a classical-quantum channel whose outputs are pure states corresponding to a symmetric informationally complete positive operator-valued measure (SIC-POVM), and a depolarizing channel.
△ Less
Submitted 12 February, 2025; v1 submitted 17 January, 2025;
originally announced January 2025.
-
SmartEraser: Remove Anything from Images using Masked-Region Guidance
Authors:
Longtao Jiang,
Zhendong Wang,
Jianmin Bao,
Wengang Zhou,
Dongdong Chen,
Lei Shi,
Dong Chen,
Houqiang Li
Abstract:
Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Maske…
▽ More
Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.
△ Less
Submitted 29 March, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement
Authors:
Z. Zhang,
B. Liu,
J. Bao,
L. Chen,
S. Zhu,
J. Yu
Abstract:
Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propos…
▽ More
Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propose a novel yet generic test-time controllable generation method that aims at natural text prompts and complex conditions. Specifically, we decouple spatial conditions into semantic and geometric conditions and then enforce their consistency during the image-generation process individually. As for the former, we target bridging the gap between the semantic condition and text prompts, as well as the gap between such condition and the attention map from diffusion models. To achieve this, we propose to first complete the prompt w.r.t. semantic condition, and then remove the negative impact of distracting prompt words by measuring their statistics in attention maps as well as distances in word space w.r.t. this condition. To further cope with the complex geometric conditions, we introduce a geometric transform module, in which Region-of-Interests will be identified in attention maps and further used to translate category-wise latents w.r.t. geometric condition. More importantly, we propose a diffusion-based latents-refill method to explicitly remove the impact of latents at the RoI, reducing the artifacts on generated images. Experiments on Coco-stuff dataset showcase 30$\%$ relative boost compared to SOTA training-free methods on layout consistency evaluation metrics.
△ Less
Submitted 2 January, 2025;
originally announced January 2025.
-
Distilling Fine-grained Sentiment Understanding from Large Language Models
Authors:
Yice Zhang,
Guangyu Xie,
Hongling Xu,
Kaiheng Hou,
Jianzhu Bao,
Qianlong Wang,
Shiwei Chen,
Ruifeng Xu
Abstract:
Fine-grained sentiment analysis (FSA) aims to extract and summarize user opinions from vast opinionated text. Recent studies demonstrate that large language models (LLMs) possess exceptional sentiment understanding capabilities. However, directly deploying LLMs for FSA applications incurs high inference costs. Therefore, this paper investigates the distillation of fine-grained sentiment understand…
▽ More
Fine-grained sentiment analysis (FSA) aims to extract and summarize user opinions from vast opinionated text. Recent studies demonstrate that large language models (LLMs) possess exceptional sentiment understanding capabilities. However, directly deploying LLMs for FSA applications incurs high inference costs. Therefore, this paper investigates the distillation of fine-grained sentiment understanding from LLMs into small language models (SLMs). We prompt LLMs to examine and interpret the sentiments of given reviews and then utilize the generated content to pretrain SLMs. Additionally, we develop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. Extensive experiments on this benchmark reveal that: (1) distillation significantly enhances the performance of SLMs in FSA tasks, achieving a 6.00\% improvement in $F_1$-score, and the distilled model can outperform Llama-2-7b with only 220M parameters; (2) distillation equips SLMs with excellent zero-shot sentiment classification capabilities, enabling them to match or even exceed their teacher models. These results suggest that distillation from LLMs is a highly promising direction for FSA. We will release our code, data, and pretrained model weights at https://github.com/HITSZ-HLT/FSA-Distillation.
△ Less
Submitted 30 December, 2024; v1 submitted 24 December, 2024;
originally announced December 2024.
-
Color Enhancement for V-PCC Compressed Point Cloud via 2D Attribute Map Optimization
Authors:
Jingwei Bao,
Yu Liu,
Zeliang Li,
Shuyuan Zhu,
Siu-Kei Au Yeung
Abstract:
Video-based point cloud compression (V-PCC) converts the dynamic point cloud data into video sequences using traditional video codecs for efficient encoding. However, this lossy compression scheme introduces artifacts that degrade the color attributes of the data. This paper introduces a framework designed to enhance the color quality in the V-PCC compressed point clouds. We propose the lightweigh…
▽ More
Video-based point cloud compression (V-PCC) converts the dynamic point cloud data into video sequences using traditional video codecs for efficient encoding. However, this lossy compression scheme introduces artifacts that degrade the color attributes of the data. This paper introduces a framework designed to enhance the color quality in the V-PCC compressed point clouds. We propose the lightweight de-compression Unet (LDC-Unet), a 2D neural network, to optimize the projection maps generated during V-PCC encoding. The optimized 2D maps will then be back-projected to the 3D space to enhance the corresponding point cloud attributes. Additionally, we introduce a transfer learning strategy and develop a customized natural image dataset for the initial training. The model was then fine-tuned using the projection maps of the compressed point clouds. The whole strategy effectively addresses the scarcity of point cloud training data. Our experiments, conducted on the public 8i voxelized full bodies long sequences (8iVSLF) dataset, demonstrate the effectiveness of our proposed method in improving the color quality.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model
Authors:
Yuzhong Hong,
Hanshan Zhang,
Junwei Bao,
Hongfei Jiang,
Yang Song
Abstract:
Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify…
▽ More
Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently,the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To approximate the MLE in practice, we propose a contrastive loss named Energy Preference Alignment (EPA), wherein each positive sample is contrasted against one or more strong negatives as well as many free weak negatives. Theoretical properties of our EBM enable the approximation error of EPA to almost surely vanish when a sufficient number of negatives are used. Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby showing the superiority of our EBM.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
Plug-and-Play Tri-Branch Invertible Block for Image Rescaling
Authors:
Jingwei Bao,
Jinhua Hao,
Pengcheng Xu,
Ming Sun,
Chao Zhou,
Shuyuan Zhu
Abstract:
High-resolution (HR) images are commonly downscaled to low-resolution (LR) to reduce bandwidth, followed by upscaling to restore their original details. Recent advancements in image rescaling algorithms have employed invertible neural networks (INNs) to create a unified framework for downscaling and upscaling, ensuring a one-to-one mapping between LR and HR images. Traditional methods, utilizing d…
▽ More
High-resolution (HR) images are commonly downscaled to low-resolution (LR) to reduce bandwidth, followed by upscaling to restore their original details. Recent advancements in image rescaling algorithms have employed invertible neural networks (INNs) to create a unified framework for downscaling and upscaling, ensuring a one-to-one mapping between LR and HR images. Traditional methods, utilizing dual-branch based vanilla invertible blocks, process high-frequency and low-frequency information separately, often relying on specific distributions to model high-frequency components. However, processing the low-frequency component directly in the RGB domain introduces channel redundancy, limiting the efficiency of image reconstruction. To address these challenges, we propose a plug-and-play tri-branch invertible block (T-InvBlocks) that decomposes the low-frequency branch into luminance (Y) and chrominance (CbCr) components, reducing redundancy and enhancing feature processing. Additionally, we adopt an all-zero mapping strategy for high-frequency components during upscaling, focusing essential rescaling information within the LR image. Our T-InvBlocks can be seamlessly integrated into existing rescaling models, improving performance in both general rescaling tasks and scenarios involving lossy compression. Extensive experiments confirm that our method advances the state of the art in HR image reconstruction.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models
Authors:
Yuchen Fan,
Yuzhong Hong,
Qiushi Wang,
Junwei Bao,
Hongfei Jiang,
Yang Song
Abstract:
Alignment, endowing a pre-trained Large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling typically with a cross-entropy objective, requiring a large amount of high-quality instruction-response pairs. However, the quality of widely used SFT datasets can…
▽ More
Alignment, endowing a pre-trained Large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling typically with a cross-entropy objective, requiring a large amount of high-quality instruction-response pairs. However, the quality of widely used SFT datasets can not be guaranteed due to the high cost and intensive labor for the creation and maintenance in practice. To overcome the limitations associated with the quality of SFT datasets, we introduce a novel \textbf{p}reference-\textbf{o}riented supervised \textbf{f}ine-\textbf{t}uning approach, namely PoFT. The intuition is to boost SFT by imposing a particular preference: \textit{favoring the target model over aligned LLMs on the same SFT data.} This preference encourages the target model to predict a higher likelihood than that predicted by the aligned LLMs, incorporating assessment information on data quality (i.e., predicted likelihood by the aligned LLMs) into the training process. Extensive experiments are conducted, and the results validate the effectiveness of the proposed method. PoFT achieves stable and consistent improvements over the SFT baselines across different training datasets and base models. Moreover, we prove that PoFT can be integrated with existing SFT data filtering methods to achieve better performance, and further improved by following preference optimization procedures, such as DPO.
△ Less
Submitted 17 December, 2024;
originally announced December 2024.
-
BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation
Authors:
Qiushi Wang,
Yuchen Fan,
Junwei Bao,
Hongfei Jiang,
Yang Song
Abstract:
In recent years, Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) have significantly enhanced the adaptability of large-scale pre-trained models. Weight-Decomposed Low-Rank Adaptation (DoRA) improves upon LoRA by separating the magnitude and direction components of the weight matrix, leading to superior performance. However, DoRA's improvements are limited to the vert…
▽ More
In recent years, Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) have significantly enhanced the adaptability of large-scale pre-trained models. Weight-Decomposed Low-Rank Adaptation (DoRA) improves upon LoRA by separating the magnitude and direction components of the weight matrix, leading to superior performance. However, DoRA's improvements are limited to the vertical dimension, resulting in an asymmetrical pattern between horizontal and vertical dimensions. This paper introduces BoRA, an innovative extension of LoRA and DoRA, characterized by symmetrical properties across horizontal and vertical dimensions. Our approach optimizes the weight matrix symmetrically by adjusting both column-wise and row-wise magnitudes. Extensive experiments demonstrate that BoRA surpasses state-of-the-art PEFT methods, including LoRA and DoRA, achieving superior results across various benchmarks.
△ Less
Submitted 9 December, 2024;
originally announced December 2024.
-
MageBench: Bridging Large Multimodal Models to Agents
Authors:
Miaosen Zhang,
Qi Dai,
Yifan Yang,
Jianmin Bao,
Dongdong Chen,
Kai Qiu,
Chong Luo,
Xin Geng,
Baining Guo
Abstract:
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in language part, where the chain-of-thought is entirely composed of text.We consider the scenario where visual signals are continuously updated and required along th…
▽ More
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in language part, where the chain-of-thought is entirely composed of text.We consider the scenario where visual signals are continuously updated and required along the decision making process. Such vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, while being rarely evaluated. In this paper, we introduce MageBench, a reasoning capability oriented multimodal agent benchmark that, while having light-weight environments, poses significant reasoning challenges and holds substantial practical value. This benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly validates the agent's knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models are better than random acting, and all of them are far inferior to human-level. More specifically, we found current models severely lack the ability to modify their planning based on visual feedback, as well as visual imagination, interleaved image-text long context handling, and other abilities. We hope that our work will provide optimization directions for LMM from the perspective of being an agent. We release our code and data at https://github.com/microsoft/MageBench.
△ Less
Submitted 5 December, 2024;
originally announced December 2024.
-
Structure-Aware Stylized Image Synthesis for Robust Medical Image Segmentation
Authors:
Jie Bao,
Zhixin Zhou,
Wen Jung Li,
Rui Luo
Abstract:
Accurate medical image segmentation is essential for effective diagnosis and treatment planning but is often challenged by domain shifts caused by variations in imaging devices, acquisition conditions, and patient-specific attributes. Traditional domain generalization methods typically require inclusion of parts of the test domain within the training set, which is not always feasible in clinical s…
▽ More
Accurate medical image segmentation is essential for effective diagnosis and treatment planning but is often challenged by domain shifts caused by variations in imaging devices, acquisition conditions, and patient-specific attributes. Traditional domain generalization methods typically require inclusion of parts of the test domain within the training set, which is not always feasible in clinical settings with limited diverse data. Additionally, although diffusion models have demonstrated strong capabilities in image generation and style transfer, they often fail to preserve the critical structural information necessary for precise medical analysis. To address these issues, we propose a novel medical image segmentation method that combines diffusion models and Structure-Preserving Network for structure-aware one-shot image stylization. Our approach effectively mitigates domain shifts by transforming images from various sources into a consistent style while maintaining the location, size, and shape of lesions. This ensures robust and accurate segmentation even when the target domain is absent from the training data. Experimental evaluations on colonoscopy polyp segmentation and skin lesion segmentation datasets show that our method enhances the robustness and accuracy of segmentation models, achieving superior performance metrics compared to baseline models without style transfer. This structure-aware stylization framework offers a practical solution for improving medical image segmentation across diverse domains, facilitating more reliable clinical diagnoses.
△ Less
Submitted 5 December, 2024;
originally announced December 2024.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Authors:
Qixiu Li,
Yaobo Liang,
Zeyu Wang,
Lin Luo,
Xi Chen,
Mozheng Liao,
Fangyun Wei,
Yu Deng,
Sicheng Xu,
Yizhong Zhang,
Xiaofan Wang,
Bei Liu,
Jianlong Fu,
Jianmin Bao,
Dong Chen,
Yuanchun Shi,
Jiaolong Yang,
Baining Guo
Abstract:
The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low tasks succes…
▽ More
The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low tasks success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLM. Unlike previous works that directly repurpose VLM for action prediction by simple action quantization, we propose a omponentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrates the strong performance enhancement with diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and real work shows that our model not only significantly surpasses existing VLAs in task performance and but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA which has similar model size (7B) with ours by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rates in simulation. Code and models can be found on our project page (https://cogact.github.io/).
△ Less
Submitted 29 November, 2024;
originally announced November 2024.
-
MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers
Authors:
Jongseong Bae,
Susang Kim,
Minsu Cho,
Ha Young Kim
Abstract:
Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN…
▽ More
Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN integrates three differently normalized features via batch, layer, and instance normalization using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters, and it incorporates stage specificity by configuring various receptive fields for the token mixer at each stage, efficiently capturing ranges of visual patterns. We propose a novel ViT model, multi-vision transformer (MVFormer), adopting the MVN and MVTM in the MetaFormer block, the generalized ViT scheme. Our MVFormer outperforms state-of-the-art convolution-based ViTs on image classification, object detection, and instance and semantic segmentation with the same or lower parameters and MACs. Particularly, MVFormer variants, MVFormer-T, S, and B achieve 83.4%, 84.3%, and 84.6% top-1 accuracy, respectively, on ImageNet-1K benchmark.
△ Less
Submitted 28 November, 2024;
originally announced November 2024.
-
MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement
Authors:
Xiwei Deng,
Xianchun He,
Jiangfeng Bao,
Yudan Zhou,
Shuhui Cai,
Congbo Cai,
Zhong Chen
Abstract:
CT report generation (CTRG) aims to automatically generate diagnostic reports for 3D volumes, relieving clinicians' workload and improving patient care. Despite clinical value, existing works fail to effectively incorporate diagnostic information from multiple anatomical views and lack related clinical expertise essential for accurate and reliable diagnosis. To resolve these limitations, we propos…
▽ More
CT report generation (CTRG) aims to automatically generate diagnostic reports for 3D volumes, relieving clinicians' workload and improving patient care. Despite clinical value, existing works fail to effectively incorporate diagnostic information from multiple anatomical views and lack related clinical expertise essential for accurate and reliable diagnosis. To resolve these limitations, we propose a novel Multi-view perception Knowledge-enhanced Transformer (MvKeTR) to mimic the diagnostic workflow of clinicians. Just as radiologists first examine CT scans from multiple planes, a Multi-View Perception Aggregator (MVPA) with view-aware attention effectively synthesizes diagnostic information from multiple anatomical views. Then, inspired by how radiologists further refer to relevant clinical records to guide diagnostic decision-making, a Cross-Modal Knowledge Enhancer (CMKE) retrieves the most similar reports based on the query volume to incorporate domain knowledge into the diagnosis procedure. Furthermore, instead of traditional MLPs, we employ Kolmogorov-Arnold Networks (KANs) with learnable nonlinear activation functions as the fundamental building blocks of both modules to better capture intricate diagnostic patterns in CT interpretation. Extensive experiments on the public CTRG-Chest-548K dataset demonstrate that our method outpaces prior state-of-the-art (SOTA) models across almost all metrics. The code will be made publicly available.
△ Less
Submitted 6 January, 2025; v1 submitted 27 November, 2024;
originally announced November 2024.
-
DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model
Authors:
JiHwan Moon,
Jihoon Park,
Jungeun Kim,
Jongseong Bae,
Hyeongwoo Jeon,
Ha Young Kim
Abstract:
Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverag…
▽ More
Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverages a diffusion model, enabling diverse translations while preserving sign language semantics. DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of input video. To enhance visual conditioning, we design Guidance Fusion Module, which fully utilizes the multi-level spatiotemporal information of the visual features. We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap. As a result, DiffSLT and DiffSLT-P significantly improve diversity over previous gloss-free SLT methods and achieve state-of-the-art performance on two SLT datasets, thereby markedly improving translation quality.
△ Less
Submitted 26 November, 2024;
originally announced November 2024.
-
4D Scaffold Gaussian Splatting for Memory Efficient Dynamic Scene Reconstruction
Authors:
Woong Oh Cho,
In Cho,
Seoha Kim,
Jeongmin Bae,
Youngjung Uh,
Seon Joo Kim
Abstract:
Existing 4D Gaussian methods for dynamic scene reconstruction offer high visual fidelity and fast rendering. However, these methods suffer from excessive memory and storage demands, which limits their practical deployment. This paper proposes a 4D anchor-based framework that retains visual quality and rendering speed of 4D Gaussians while significantly reducing storage costs. Our method extends 3D…
▽ More
Existing 4D Gaussian methods for dynamic scene reconstruction offer high visual fidelity and fast rendering. However, these methods suffer from excessive memory and storage demands, which limits their practical deployment. This paper proposes a 4D anchor-based framework that retains visual quality and rendering speed of 4D Gaussians while significantly reducing storage costs. Our method extends 3D scaffolding to 4D space, and leverages sparse 4D grid-aligned anchors with compressed feature vectors. Each anchor models a set of neural 4D Gaussians, each of which represent a local spatiotemporal region. In addition, we introduce a temporal coverage-aware anchor growing strategy to effectively assign additional anchors to under-reconstructed dynamic regions. Our method adjusts the accumulated gradients based on Gaussians' temporal coverage, improving reconstruction quality in dynamic regions. To reduce the number of anchors, we further present enhanced formulations of neural 4D Gaussians. These include the neural velocity, and the temporal opacity derived from a generalized Gaussian distribution. Experimental results demonstrate that our method achieves state-of-the-art visual quality and 97.8% storage reduction over 4DGS.
△ Less
Submitted 25 November, 2024;
originally announced November 2024.
-
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Authors:
Jungeun Kim,
Hyeongwoo Jeon,
Jongseong Bae,
Ha Young Kim
Abstract:
Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Si…
▽ More
Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.
△ Less
Submitted 25 November, 2024;
originally announced November 2024.
-
Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion
Authors:
Jongseong Bae,
Junwoo Ha,
Ha Young Kim
Abstract:
Camera-based Semantic Scene Completion (SSC) is gaining attentions in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of the geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both des…
▽ More
Camera-based Semantic Scene Completion (SSC) is gaining attentions in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of the geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, where each axis employing a near-to-far cascade masking that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29, and mIoUs of 17.40 and 20.14 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks.
△ Less
Submitted 25 March, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents
Authors:
Rui Tian,
Qi Dai,
Jianmin Bao,
Kai Qiu,
Yifan Yang,
Chong Luo,
Zuxuan Wu,
Yu-Gang Jiang
Abstract:
Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents based on a content image. Towards this go…
▽ More
Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents based on a content image. Towards this goal, we design an image-conditioned VAE to encode a video to an extremely compressed motion latent space. This magic Reducio charm enables 64x reduction of latents compared to a common 2D VAE, without sacrificing the quality. Training diffusion models on such a compact representation easily allows for generating 1K resolution videos. We then adopt a two-stage video generation paradigm, which performs text-to-image and text-image-to-video sequentially. Extensive experiments show that our Reducio-DiT achieves strong performance in evaluation, though trained with limited GPU resources. More importantly, our method significantly boost the efficiency of video LDMs both in training and inference. We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024*1024 video clip within 15.5 seconds on a single A100 GPU. Code released at https://github.com/microsoft/Reducio-VAE .
△ Less
Submitted 26 November, 2024; v1 submitted 20 November, 2024;
originally announced November 2024.
-
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Authors:
Laura Ruis,
Maximilian Mozes,
Juhan Bae,
Siddhartha Rao Kamalakara,
Dwarak Talupuru,
Acyr Locatelli,
Robert Kirk,
Tim Rocktäschel,
Edward Grefenstette,
Max Bartolo
Abstract:
The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume o…
▽ More
The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation: train-test set separation. To overcome this, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, a document often has a similar influence across different reasoning questions within the same task, indicating the presence of procedural knowledge. We further find that the answers to factual questions often show up in the most influential data. However, for reasoning questions the answers usually do not show up as highly influential, nor do the answers to the intermediate reasoning steps. When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning.
△ Less
Submitted 6 March, 2025; v1 submitted 19 November, 2024;
originally announced November 2024.
-
Game-Theoretic Defenses for Robust Conformal Prediction Against Adversarial Attacks in Medical Imaging
Authors:
Rui Luo,
Jie Bao,
Zhixin Zhou,
Chuangyin Dang
Abstract:
Adversarial attacks pose significant threats to the reliability and safety of deep learning models, especially in critical domains such as medical imaging. This paper introduces a novel framework that integrates conformal prediction with game-theoretic defensive strategies to enhance model robustness against both known and unknown adversarial perturbations. We address three primary research questi…
▽ More
Adversarial attacks pose significant threats to the reliability and safety of deep learning models, especially in critical domains such as medical imaging. This paper introduces a novel framework that integrates conformal prediction with game-theoretic defensive strategies to enhance model robustness against both known and unknown adversarial perturbations. We address three primary research questions: constructing valid and efficient conformal prediction sets under known attacks (RQ1), ensuring coverage under unknown attacks through conservative thresholding (RQ2), and determining optimal defensive strategies within a zero-sum game framework (RQ3). Our methodology involves training specialized defensive models against specific attack types and employing maximum and minimum classifiers to aggregate defenses effectively. Extensive experiments conducted on the MedMNIST datasets, including PathMNIST, OrganAMNIST, and TissueMNIST, demonstrate that our approach maintains high coverage guarantees while minimizing prediction set sizes. The game-theoretic analysis reveals that the optimal defensive strategy often converges to a singular robust model, outperforming uniform and simple strategies across all evaluated datasets. This work advances the state-of-the-art in uncertainty quantification and adversarial robustness, providing a reliable mechanism for deploying deep learning models in adversarial environments.
△ Less
Submitted 3 March, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
Soft Finger Grasp Force and Contact State Estimation from Tactile Sensors
Authors:
Hun Jang,
Joonbum Bae,
Kevin Haninger
Abstract:
Soft robotic fingers can improve adaptability in grasping and manipulation, compensating for geometric variation in object or environmental contact, but today lack force capacity and fine dexterity. Integrated tactile sensors can provide grasp and task information which can improve dexterity,but should ideally not require object-specific training. The total force vector exerted by a finger provide…
▽ More
Soft robotic fingers can improve adaptability in grasping and manipulation, compensating for geometric variation in object or environmental contact, but today lack force capacity and fine dexterity. Integrated tactile sensors can provide grasp and task information which can improve dexterity,but should ideally not require object-specific training. The total force vector exerted by a finger provides general information to the internal grasp forces (e.g. for grasp stability) and, when summed over fingers, an estimate of the external force acting on the grasped object (e.g. for task-level control). In this study, we investigate the efficacy of estimating finger force from integrated soft sensors and use it to estimate contact states. We use a neural network for force regression, collecting labelled data with a force/torque sensor and a range of test objects. Subsequently, we apply this model in a plug-in task scenario and demonstrate its validity in estimating contact states.
△ Less
Submitted 25 October, 2024;
originally announced October 2024.
-
Exploring how deep learning decodes anomalous diffusion via Grad-CAM
Authors:
Jaeyong Bae,
Yongjoo Baek,
Hawoong Jeong
Abstract:
While deep learning has been successfully applied to the data-driven classification of anomalous diffusion mechanisms, how the algorithm achieves the feat still remains a mystery. In this study, we use a well-known technique aimed at achieving explainable AI, namely the Gradient-weighted Class Activation Map (Grad-CAM), to investigate how deep learning (implemented by ResNets) recognizes the disti…
▽ More
While deep learning has been successfully applied to the data-driven classification of anomalous diffusion mechanisms, how the algorithm achieves the feat still remains a mystery. In this study, we use a well-known technique aimed at achieving explainable AI, namely the Gradient-weighted Class Activation Map (Grad-CAM), to investigate how deep learning (implemented by ResNets) recognizes the distinctive features of a particular anomalous diffusion model from the raw trajectory data. Our results show that Grad-CAM reveals the portions of the trajectory that hold crucial information about the underlying mechanism of anomalous diffusion, which can be utilized to enhance the robustness of the trained classifier against the measurement noise. Moreover, we observe that deep learning distills unique statistical characteristics of different diffusion mechanisms at various spatiotemporal scales, with larger-scale (smaller-scale) features identified at higher (lower) layers.
△ Less
Submitted 21 October, 2024;
originally announced October 2024.