Showing 1–50 of 380 results for author: Bae, J

  1. arXiv:2505.08223  [pdf, ps, other]

    cs.RO cs.AI

    Reinforcement Learning-based Fault-Tolerant Control for Quadrotor with Online Transformer Adaptation

    Authors: Dohyun Kim, Jayden Dongwoo Lee, Hyochoong Bang, Jungho Bae

    Abstract: Multirotors play a significant role in diverse field robotics applications but remain highly susceptible to actuator failures, leading to rapid instability and compromised mission reliability. While various fault-tolerant control (FTC) strategies using reinforcement learning (RL) have been widely explored, most previous approaches require prior knowledge of the multirotor model or struggle to adap…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Accepted at the 2025 IEEE International Conference on Robotics & Automation (ICRA) Workshop: Robots in the Wild

  2. arXiv:2505.07728  [pdf, other]

    cs.RO cs.AI cs.LG

    Guiding Data Collection via Factored Scaling Curves

    Authors: Lihan Zha, Apurva Badithela, Michael Zhang, Justin Lidard, Jeremy Bao, Emily Zhou, David Snyder, Allen Z. Ren, Dhruv Shah, Anirudha Majumdar

    Abstract: Generalist imitation learning policies trained on large datasets show great promise for solving diverse manipulation tasks. However, to ensure generalization to different conditions, policies need to be trained with data collected across a large set of environmental factor variations (e.g., camera pose, table height, distractors) $-$ a prohibitively expensive undertaking, if done exhaustively. We…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Project website: https://factored-data-scaling.github.io

  3. Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting

    Authors: Youngsik Yun, Jeongmin Bae, Hyunseung Son, Seoha Kim, Hahyun Lee, Gun Bang, Youngjung Uh

    Abstract: Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: SIGGRAPH 2025, Project page: https://bbangsik13.github.io/OR2

  4. arXiv:2504.21327  [pdf, ps, other]

    cs.LG

    A Generalized Meta Federated Learning Framework with Theoretical Convergence Guarantees

    Authors: Mohammad Vahid Jamali, Hamid Saber, Jung Hyun Bae

    Abstract: Meta federated learning (FL) is a personalized variant of FL, where multiple agents collaborate on training an initial shared model without exchanging raw data samples. The initial model should be trained in a way that current or new agents can easily adapt it to their local datasets after one or a few fine-tuning steps, thus improving the model personalization. Conventional meta FL approaches min…

    Submitted 12 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

  5. arXiv:2504.19599  [pdf, ps, other]

    cs.AI cs.LG

    GVPO: Group Variance Policy Optimization for Large Language Model Post-Training

    Authors: Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, Hui Xiong

    Abstract: Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their pract…

    Submitted 28 April, 2025; originally announced April 2025.

  6. arXiv:2504.18391  [pdf, other]

    cs.CV cs.LG

    Fast Autoregressive Models for Continuous Latent Generation

    Authors: Tiankai Hang, Jianmin Bao, Fangyun Wei, Dong Chen

    Abstract: Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the h…

    Submitted 24 April, 2025; originally announced April 2025.

  7. arXiv:2504.15281  [pdf, other]

    cs.CV

    StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

    Authors: Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li

    Abstract: 3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 16 pages; Project page: https://styleme3d.github.io/

  8. arXiv:2503.15557  [pdf, other]

    cs.GR cs.CV cs.RO

    Motion Synthesis with Sparse and Flexible Keyjoint Control

    Authors: Inwoo Hwang, Jinseok Bae, Donggeun Lim, Young Min Kim

    Abstract: Creating expressive character animations is labor-intensive, requiring intricate manual adjustment of animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive co…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 11 pages, Project Page: http://inwoohwang.me/SFControl

  9. arXiv:2503.13859  [pdf, other]

    cs.CV

    Less is More: Improving Motion Diffusion Models with Sparse Keyframes

    Authors: Jinseok Bae, Inwoo Hwang, Young Yoon Lee, Ziyu Guo, Joseph Liu, Yizhak Ben-Shabat, Young Min Kim, Mubbasir Kapadia

    Abstract: Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intr…

    Submitted 12 May, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  10. arXiv:2503.13473  [pdf]

    eess.SP cs.AI cs.CV cs.RO

    Robust Detection of Extremely Thin Lines Using 0.2mm Piano Wire

    Authors: Jisoo Hong, Youngjin Jung, Jihwan Bae, Seungho Song, Sung-Woo Kang

    Abstract: This study developed an algorithm capable of detecting a reference line (a 0.2 mm thick piano wire) to accurately determine the position of an automated installation robot within an elevator shaft. A total of 3,245 images were collected from the experimental tower of H Company, the leading elevator manufacturer in South Korea, and the detection performance was evaluated using four experimental app…

    Submitted 4 March, 2025; originally announced March 2025.

  11. arXiv:2503.12814  [pdf, other]

    cs.GR cs.AI cs.RO

    Versatile Physics-based Character Control with Hybrid Latent Representation

    Authors: Jinseok Bae, Jungdam Won, Donggeun Lim, Inwoo Hwang, Young Min Kim

    Abstract: We present a versatile latent representation that enables a physically simulated character to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ a rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to…

    Submitted 17 March, 2025; originally announced March 2025.

  12. arXiv:2503.07682  [pdf, other]

    cs.LG cs.AI

    A Time Series Multitask Framework Integrating a Large Language Model, Pre-Trained Time Series Model, and Knowledge Graph

    Authors: Shule Hao, Junpeng Bao, Chuncheng Lu

    Abstract: Time series analysis is crucial in fields like finance, transportation, and industry. However, traditional models often focus solely on temporal features, limiting their ability to capture underlying information. This paper proposes a novel time series multitask framework, called LTM, which integrates temporal features with textual descriptions to enhance analytical and predictive capabilities. LT…

    Submitted 10 March, 2025; originally announced March 2025.

  13. arXiv:2503.06862  [pdf, other]

    cs.AR

    FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables

    Authors: Gunho Park, Hyeokjun Kwon, Jiwoo Kim, Jeongin Bae, Baeseong Park, Dongsoo Lee, Youngjoo Lee

    Abstract: Weight-only quantization has emerged as a promising solution to the deployment challenges of large language models (LLMs). However, it necessitates FP-INT operations, which make implementation on general-purpose hardware like GPUs difficult. In this paper, we propose FIGLUT, an efficient look-up table (LUT)-based GEMM accelerator architecture. Instead of performing traditional arithmetic operation…

    Submitted 9 March, 2025; originally announced March 2025.

    Comments: HPCA 2025

  14. arXiv:2503.03475  [pdf, other]

    eess.IV cs.CV

    Bridging Synthetic-to-Real Gaps: Frequency-Aware Perturbation and Selection for Single-shot Multi-Parametric Mapping Reconstruction

    Authors: Linyu Fan, Che Wang, Ming Ye, Qizhi Yang, Zejun Wu, Xinghao Ding, Yue Huang, Jianfeng Bao, Shuhui Cai, Congbo Cai

    Abstract: Data-centric artificial intelligence (AI) has remarkably advanced medical imaging, with emerging methods using synthetic data to address data scarcity while introducing synthetic-to-real gaps. Unsupervised domain adaptation (UDA) shows promise in ground truth-scarce tasks, but its application in reconstruction remains underexplored. Although multiple overlapping-echo detachment (MOLED) achieves ul…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: This work will be submitted to the IEEE for possible publication

  15. arXiv:2503.03225  [pdf, other]

    cs.CL

    Targeted Distillation for Sentiment Analysis

    Authors: Yice Zhang, Guangyu Xie, Jingjie Lin, Jianzhu Bao, Qianlong Wang, Xi Zeng, Ruifeng Xu

    Abstract: This paper presents a compact model that achieves strong sentiment analysis capabilities through targeted distillation from advanced large language models (LLMs). Our methodology decouples the distillation target into two key components: sentiment-related knowledge and task alignment. To transfer these components, we propose a two-stage distillation framework. The first stage, knowledge-driven dis…

    Submitted 5 March, 2025; originally announced March 2025.

  16. arXiv:2503.01743  [pdf, other]

    cs.CL cs.AI cs.LG

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Authors: Microsoft: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, et al. (51 additional authors not shown)

    Abstract: We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement…

    Submitted 7 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: 39 pages

  17. arXiv:2503.01645  [pdf, other]

    cs.CV

    DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models

    Authors: Zhendong Wang, Jianmin Bao, Shuyang Gu, Dong Chen, Wengang Zhou, Houqiang Li

    Abstract: In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  18. arXiv:2502.20148  [pdf, ps, other]

    quant-ph cs.DS

    Quantum algorithms and lower bounds for eccentricity, radius, and diameter in undirected graphs

    Authors: Adam Wesołowski, Jinge Bao

    Abstract: The problems of computing eccentricity, radius, and diameter are fundamental to graph theory. These parameters are intrinsically defined based on the distance metric of the graph. In this work, we propose quantum algorithms for the diameter and radius of undirected, weighted graphs in the adjacency list model. The algorithms output diameter and radius with the corresponding paths in…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: 15 pages, 2 figures

  19. arXiv:2502.18364  [pdf, other]

    cs.CV

    ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

    Authors: Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, Lin Liang, Lijuan Wang, Ji Li, Xiu Li, Zhouhui Lian, Gao Huang, Baining Guo

    Abstract: Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Insp…

    Submitted 25 February, 2025; originally announced February 2025.

    Comments: Project page: https://art-msra.github.io/

  20. arXiv:2502.16515  [pdf, other]

    cs.RO

    Path Planning using Instruction-Guided Probabilistic Roadmaps

    Authors: Jiaqi Bao, Ryo Yonetani

    Abstract: This work presents a novel data-driven path planning algorithm named Instruction-Guided Probabilistic Roadmap (IG-PRM). Despite the recent development and widespread use of mobile robot navigation, the safe and effective travels of mobile robots still require significant engineering effort to take into account the constraints of robots and their tasks. With IG-PRM, we aim to address this problem b…

    Submitted 23 February, 2025; originally announced February 2025.

    Comments: ICRA 2025

  21. arXiv:2502.16457  [pdf, other]

    cs.CL

    Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge

    Authors: Heegyu Kim, Taeyang Jeon, Seungtaek Choi, Ji Hoon Hong, Dong Won Jeon, Ga-Yeon Baek, Gyeong-Won Kwak, Dong-Hee Lee, Jisu Bae, Chihoon Lee, Yunseo Kim, Seon-Jin Choi, Jin-Seong Park, Sung Beom Cho, Hyunsouk Cho

    Abstract: Materials synthesis is vital for innovations such as energy storage, catalysis, electronics, and biomedical devices. Yet, the process relies heavily on empirical, trial-and-error methods guided by expert intuition. Our work aims to support the materials science community by providing a practical, data-driven resource. We have curated a comprehensive dataset of 17K expert-verified synthesis recipes…

    Submitted 19 March, 2025; v1 submitted 23 February, 2025; originally announced February 2025.

    Comments: under review

  22. arXiv:2502.15015  [pdf, other]

    cs.LG stat.ML

    Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition

    Authors: Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Z. Yang, Zachary Nado, Sourabh Medapati, Philipp Hennig, Michael Rabbat, George E. Dahl

    Abstract: The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions ar…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: ICLR 2025; 23 pages, 5 figures, 8 tables

  23. arXiv:2502.12154  [pdf, other]

    cs.CV cs.AI cs.LG

    Diffusion Models without Classifier-free Guidance

    Authors: Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo

    Abstract: This paper presents Model-guidance (MG), a novel objective for training diffusion models that addresses and removes the commonly used Classifier-free guidance (CFG). Our innovative approach transcends the standard modeling of solely data distribution to incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is easy yet effective, making…

    Submitted 17 February, 2025; originally announced February 2025.

  24. arXiv:2502.06707  [pdf, other]

    cs.CE

    FinMamba: Market-Aware Graph Enhanced Multi-Level Mamba for Stock Movement Prediction

    Authors: Yifan Hu, Peiyuan Liu, Yuante Li, Dawei Cheng, Naiqi Li, Tao Dai, Jigang Bao, Shu-Tao Xia

    Abstract: Recently, combining stock features with inter-stock correlations has become a common and effective approach for stock movement prediction. However, financial data presents significant challenges due to its low signal-to-noise ratio and the dynamic complexity of the market, which give rise to two key limitations in existing methods. First, the relationships between stocks are highly influenced by m…

    Submitted 10 February, 2025; originally announced February 2025.

  25. arXiv:2502.06268  [pdf, other]

    stat.ML cs.LG

    Spectral-factorized Positive-definite Curvature Learning for NN Training

    Authors: Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Roger B. Grosse

    Abstract: Many training methods, such as Adam(W) and Shampoo, learn a positive-definite curvature matrix and apply an inverse root before preconditioning. Recently, non-diagonal training methods, such as Shampoo, have gained significant attention; however, they remain computationally inefficient and are limited to specific types of curvature information due to the costly matrix root computation via matrix d…

    Submitted 28 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

    Comments: fixed some typos in the appendix

  26. arXiv:2501.13372  [pdf, other]

    eess.AS cs.AI

    Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement

    Authors: Jae-Sung Bae, Anastasia Kuznetsova, Dinesh Manocha, John Hershey, Trausti Kristjansson, Minje Kim

    Abstract: This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP 2025. Collecting high-quality personalized data is challenging due to privacy concerns and technical difficulties in recording audio from the test scene. To add…

    Submitted 22 January, 2025; originally announced January 2025.

    Comments: Accepted to ICASSP 2025 Satellite Workshop: Generative Data Augmentation for Real-World Signal Processing Applications

  27. arXiv:2501.12642  [pdf]

    cs.CY

    Training Data Attribution (TDA): Examining Its Adoption & Use Cases

    Authors: Deric Cheng, Juhan Bae, Justin Bullock, David Kristofferson

    Abstract: This report investigates Training Data Attribution (TDA) and its potential importance to and tractability for reducing extreme risks from AI. First, we discuss the plausibility and amount of effort it would take to bring existing TDA research efforts from their current state, to an efficient and accurate tool for TDA inference that can be run on frontier-scale LLMs. Next, we discuss the numerous r…

    Submitted 22 January, 2025; originally announced January 2025.

  28. arXiv:2501.10152  [pdf, ps, other]

    quant-ph cs.IT

    Quantum Advantage in Private Multiple Hypothesis Testing

    Authors: Seung-Hyun Nam, Hyun-Young Park, Joonwoo Bae, Si-Hyeon Lee

    Abstract: For multiple hypothesis testing based on classical data samples, we demonstrate a quantum advantage in the optimal privacy-utility trade-off (PUT), where the privacy and utility measures are set to (quantum) local differential privacy and the pairwise-minimum Chernoff information, respectively. To show the quantum advantage, we consider some class of hypotheses that we coin smoothed point masses.…

    Submitted 12 February, 2025; v1 submitted 17 January, 2025; originally announced January 2025.

    Comments: 12 pages, 1 figure. More references for Q(L)DP were added in v2

  29. arXiv:2501.08279  [pdf, other]

    cs.CV

    SmartEraser: Remove Anything from Images using Masked-Region Guidance

    Authors: Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, Houqiang Li

    Abstract: Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Maske…

    Submitted 29 March, 2025; v1 submitted 14 January, 2025; originally announced January 2025.

    Comments: Project at: https://longtaojiang.github.io/smarteraser.github.io/

    Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

  30. arXiv:2501.01368  [pdf, other]

    cs.CV

    Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement

    Authors: Z. Zhang, B. Liu, J. Bao, L. Chen, S. Zhu, J. Yu

    Abstract: Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propos…

    Submitted 2 January, 2025; originally announced January 2025.

  31. arXiv:2412.18552  [pdf, other]

    cs.CL

    Distilling Fine-grained Sentiment Understanding from Large Language Models

    Authors: Yice Zhang, Guangyu Xie, Hongling Xu, Kaiheng Hou, Jianzhu Bao, Qianlong Wang, Shiwei Chen, Ruifeng Xu

    Abstract: Fine-grained sentiment analysis (FSA) aims to extract and summarize user opinions from vast opinionated text. Recent studies demonstrate that large language models (LLMs) possess exceptional sentiment understanding capabilities. However, directly deploying LLMs for FSA applications incurs high inference costs. Therefore, this paper investigates the distillation of fine-grained sentiment understand…

    Submitted 30 December, 2024; v1 submitted 24 December, 2024; originally announced December 2024.

  32. arXiv:2412.14449  [pdf, other]

    cs.CV eess.IV

    Color Enhancement for V-PCC Compressed Point Cloud via 2D Attribute Map Optimization

    Authors: Jingwei Bao, Yu Liu, Zeliang Li, Shuyuan Zhu, Siu-Kei Au Yeung

    Abstract: Video-based point cloud compression (V-PCC) converts the dynamic point cloud data into video sequences using traditional video codecs for efficient encoding. However, this lossy compression scheme introduces artifacts that degrade the color attributes of the data. This paper introduces a framework designed to enhance the color quality in the V-PCC compressed point clouds. We propose the lightweigh…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: IEEE VCIP 2024

  33. arXiv:2412.13862  [pdf, other]

    cs.LG cs.CL

    Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

    Authors: Yuzhong Hong, Hanshan Zhang, Junwei Bao, Hongfei Jiang, Yang Song

    Abstract: Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify…

    Submitted 18 December, 2024; originally announced December 2024.

  34. arXiv:2412.13508  [pdf, other]

    eess.IV cs.CV

    Plug-and-Play Tri-Branch Invertible Block for Image Rescaling

    Authors: Jingwei Bao, Jinhua Hao, Pengcheng Xu, Ming Sun, Chao Zhou, Shuyuan Zhu

    Abstract: High-resolution (HR) images are commonly downscaled to low-resolution (LR) to reduce bandwidth, followed by upscaling to restore their original details. Recent advancements in image rescaling algorithms have employed invertible neural networks (INNs) to create a unified framework for downscaling and upscaling, ensuring a one-to-one mapping between LR and HR images. Traditional methods, utilizing d…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025. Code is available at https://github.com/Jingwei-Bao/T-InvBlocks

  35. arXiv:2412.12865  [pdf, other]

    cs.CL

    Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

    Authors: Yuchen Fan, Yuzhong Hong, Qiushi Wang, Junwei Bao, Hongfei Jiang, Yang Song

    Abstract: Alignment, endowing a pre-trained Large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling typically with a cross-entropy objective, requiring a large amount of high-quality instruction-response pairs. However, the quality of widely used SFT datasets can…

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: AAAI2025, 12 pages, 9 figures

  36. arXiv:2412.06441  [pdf, other]

    cs.CL

    BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation

    Authors: Qiushi Wang, Yuchen Fan, Junwei Bao, Hongfei Jiang, Yang Song

    Abstract: In recent years, Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) have significantly enhanced the adaptability of large-scale pre-trained models. Weight-Decomposed Low-Rank Adaptation (DoRA) improves upon LoRA by separating the magnitude and direction components of the weight matrix, leading to superior performance. However, DoRA's improvements are limited to the vert…

    Submitted 9 December, 2024; originally announced December 2024.

  37. arXiv:2412.04531  [pdf, other]

    cs.CV cs.AI cs.LG

    MageBench: Bridging Large Multimodal Models to Agents

    Authors: Miaosen Zhang, Qi Dai, Yifan Yang, Jianmin Bao, Dongdong Chen, Kai Qiu, Chong Luo, Xin Geng, Baining Guo

    Abstract: LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in the language part, where the chain-of-thought is entirely composed of text. We consider the scenario where visual signals are continuously updated and required along th…

    Submitted 5 December, 2024; originally announced December 2024.

    Comments: 37 pages, 32 figures, github link: https://github.com/microsoft/MageBench

  38. arXiv:2412.04296  [pdf, other]

    eess.IV cs.CV cs.LG

    Structure-Aware Stylized Image Synthesis for Robust Medical Image Segmentation

    Authors: Jie Bao, Zhixin Zhou, Wen Jung Li, Rui Luo

    Abstract: Accurate medical image segmentation is essential for effective diagnosis and treatment planning but is often challenged by domain shifts caused by variations in imaging devices, acquisition conditions, and patient-specific attributes. Traditional domain generalization methods typically require inclusion of parts of the test domain within the training set, which is not always feasible in clinical s…

    Submitted 5 December, 2024; originally announced December 2024.

  39. arXiv:2411.19650  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Authors: Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, Baining Guo

    Abstract: The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language-Models (VLM) have demonstrated promising generalizability, their task performance is still unsatisfactory as indicated by the low tasks succes…

    Submitted 29 November, 2024; originally announced November 2024.

    Comments: Project Webpage: https://cogact.github.io/

  40. arXiv:2411.18995  [pdf, other]

    cs.CV

    MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

    Authors: Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim

    Abstract: Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN…

    Submitted 28 November, 2024; originally announced November 2024.

  41. arXiv:2411.18309  [pdf, other]

    cs.CV cs.AI

    MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement

    Authors: Xiwei Deng, Xianchun He, Jiangfeng Bao, Yudan Zhou, Shuhui Cai, Congbo Cai, Zhong Chen

    Abstract: CT report generation (CTRG) aims to automatically generate diagnostic reports for 3D volumes, relieving clinicians' workload and improving patient care. Despite clinical value, existing works fail to effectively incorporate diagnostic information from multiple anatomical views and lack related clinical expertise essential for accurate and reliable diagnosis. To resolve these limitations, we propos…

    Submitted 6 January, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

    Comments: 11 pages, 10 figures

  42. arXiv:2411.17248  [pdf, other]

    cs.CV

    DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model

    Authors: JiHwan Moon, Jihoon Park, Jungeun Kim, Jongseong Bae, Hyeongwoo Jeon, Ha Young Kim

    Abstract: Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverag…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Project page: https://diffslt.github.io/

  43. arXiv:2411.17044  [pdf, other]

    cs.CV cs.GR

    4D Scaffold Gaussian Splatting for Memory Efficient Dynamic Scene Reconstruction

    Authors: Woong Oh Cho, In Cho, Seoha Kim, Jeongmin Bae, Youngjung Uh, Seon Joo Kim

    Abstract: Existing 4D Gaussian methods for dynamic scene reconstruction offer high visual fidelity and fast rendering. However, these methods suffer from excessive memory and storage demands, which limits their practical deployment. This paper proposes a 4D anchor-based framework that retains visual quality and rendering speed of 4D Gaussians while significantly reducing storage costs. Our method extends 3D…

    Submitted 25 November, 2024; originally announced November 2024.

  44. arXiv:2411.16789  [pdf, other]

    cs.CV cs.CL

    Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

    Authors: Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim

    Abstract: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Si…

    Submitted 25 November, 2024; originally announced November 2024.

  45. arXiv:2411.16129  [pdf, other]

    cs.CV

    Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion

    Authors: Jongseong Bae, Junwoo Ha, Ha Young Kim

    Abstract: Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, properties such as perspective and occlusion lead to the underestimation of the geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both des…

    Submitted 25 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: Accepted to CVPR 2025

  46. arXiv:2411.13552  [pdf, other]

    cs.CV

    REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents

    Authors: Rui Tian, Qi Dai, Jianmin Bao, Kai Qiu, Yifan Yang, Chong Luo, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents based on a content image. Towards this go…

    Submitted 26 November, 2024; v1 submitted 20 November, 2024; originally announced November 2024.

    Comments: Code available at https://github.com/microsoft/Reducio-VAE

  47. arXiv:2411.12580  [pdf, other]

    cs.CL cs.LG

    Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

    Authors: Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo

    Abstract: The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume o…

    Submitted 6 March, 2025; v1 submitted 19 November, 2024; originally announced November 2024.

    Comments: Published at ICLR 2025

  48. arXiv:2411.04376  [pdf, other]

    cs.LG cs.CR eess.IV

    Game-Theoretic Defenses for Robust Conformal Prediction Against Adversarial Attacks in Medical Imaging

    Authors: Rui Luo, Jie Bao, Zhixin Zhou, Chuangyin Dang

    Abstract: Adversarial attacks pose significant threats to the reliability and safety of deep learning models, especially in critical domains such as medical imaging. This paper introduces a novel framework that integrates conformal prediction with game-theoretic defensive strategies to enhance model robustness against both known and unknown adversarial perturbations. We address three primary research questi…

    Submitted 3 March, 2025; v1 submitted 6 November, 2024; originally announced November 2024.

  49. arXiv:2410.19684  [pdf, other]

    cs.RO

    Soft Finger Grasp Force and Contact State Estimation from Tactile Sensors

    Authors: Hun Jang, Joonbum Bae, Kevin Haninger

    Abstract: Soft robotic fingers can improve adaptability in grasping and manipulation, compensating for geometric variation in object or environmental contact, but today lack force capacity and fine dexterity. Integrated tactile sensors can provide grasp and task information which can improve dexterity, but should ideally not require object-specific training. The total force vector exerted by a finger provide…

    Submitted 25 October, 2024; originally announced October 2024.

  50. arXiv:2410.16345  [pdf, other]

    cs.LG physics.data-an

    Exploring how deep learning decodes anomalous diffusion via Grad-CAM

    Authors: Jaeyong Bae, Yongjoo Baek, Hawoong Jeong

    Abstract: While deep learning has been successfully applied to the data-driven classification of anomalous diffusion mechanisms, how the algorithm achieves the feat still remains a mystery. In this study, we use a well-known technique aimed at achieving explainable AI, namely the Gradient-weighted Class Activation Map (Grad-CAM), to investigate how deep learning (implemented by ResNets) recognizes the disti…

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: 14 pages, 12 figures