-
ROSAQ: Rotation-based Saliency-Aware Weight Quantization for Efficiently Compressing Large Language Models
Authors:
Junho Yoon,
Geom Lee,
Donghyeon Jeon,
Inho Kang,
Seung-Hoon Na
Abstract:
Quantization has been widely studied as an effective technique for reducing the memory requirement of large language models (LLMs), potentially improving latency as well. Utilizing the rotational invariance of transformers, we propose rotation-based saliency-aware weight quantization (ROSAQ), which identifies salient channels in the projection feature space, not in the original feature space, where the projected "principal" dimensions are naturally considered as "salient" features. The proposed ROSAQ consists of 1) PCA-based projection, which first performs principal component analysis (PCA) on a calibration set and transforms via the PCA projection, 2) salient channel identification, which selects dimensions corresponding to the K-largest eigenvalues as salient channels, and 3) saliency-aware quantization with mixed precision, which uses FP16 for salient dimensions and INT3/4 for other dimensions. Experimental results show that ROSAQ improves over the baseline saliency-aware quantization on the original feature space and over other existing quantization methods. With kernel fusion, ROSAQ achieves about 2.3x speedup over the FP16 implementation in generating 256 tokens with a batch size of 64.
Submitted 16 June, 2025;
originally announced June 2025.
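The three steps named in the ROSAQ abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`pca_rotation`, `fake_quantize`, `rosaq_like_quantize`), the per-row symmetric round-to-nearest quantizer, and the simulated (dequantized) storage are all assumptions made for illustration. The sketch relies only on the identity `x @ W.T == (x @ Q) @ (W @ Q).T` for an orthogonal `Q`, which is the rotational invariance the abstract refers to.

```python
import numpy as np

def pca_rotation(X):
    """Return (Q, eigvals): eigenvectors of the calibration covariance as
    columns of Q, sorted by descending eigenvalue. Q is orthogonal."""
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], eigvals[order]

def fake_quantize(W, bits=4):
    """Symmetric round-to-nearest quantization with one scale per output row.
    Returns dequantized floats (a simulation, not packed integers)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def rosaq_like_quantize(W, X, k, bits=4):
    """Hypothetical helper: rotate W (out_features x in_features) into the PCA
    basis of calibration inputs X, keep the k leading dimensions in FP16,
    and quantize the remaining dimensions to `bits` bits."""
    Q, _ = pca_rotation(X)
    W_rot = W @ Q
    W_mixed = W_rot.copy()
    W_mixed[:, :k] = W_rot[:, :k].astype(np.float16)   # salient channels
    W_mixed[:, k:] = fake_quantize(W_rot[:, k:], bits)  # remaining channels
    return W_mixed, Q
```

At inference time the rotated input `x @ Q` would be multiplied by `W_mixed.T`; without quantization this reproduces `x @ W.T` exactly up to floating-point error.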
-
Bidirectional Biometric Authentication Using Transciphering and (T)FHE
Authors:
Joon Soo Yoo,
Tae Min Ahn,
Ji Won Yoon
Abstract:
Biometric authentication systems pose privacy risks, as leaked templates such as iris or fingerprints can lead to security breaches. Fully Homomorphic Encryption (FHE) enables secure encrypted evaluation, but its deployment is hindered by large ciphertexts, high key overhead, and limited trust models. We propose the Bidirectional Transciphering Framework (BTF), combining FHE, transciphering, and a non-colluding trusted party to enable efficient and privacy-preserving biometric authentication. The key architectural innovation is the introduction of a trusted party that assists in evaluation and key management, along with a double encryption mechanism to preserve the FHE trust model, where client data remains private. BTF addresses three core deployment challenges: reducing the size of returned FHE ciphertexts, preventing clients from falsely reporting successful authentication, and enabling scalable, centralized FHE key management. We implement BTF using TFHE and the Trivium cipher, and evaluate it on iris-based biometric data. Our results show up to a 121$\times$ reduction in transmission size compared to standard FHE models, demonstrating practical scalability and deployment potential.
Submitted 15 June, 2025;
originally announced June 2025.
-
Versatile and Fast Location-Based Private Information Retrieval with Fully Homomorphic Encryption over the Torus
Authors:
Joon Soo Yoo,
Taeho Kim,
Ji Won Yoon
Abstract:
Location-based services often require users to share sensitive locational data, raising privacy concerns due to potential misuse or exploitation by untrusted servers. In response, we present VeLoPIR, a versatile location-based private information retrieval (PIR) system designed to preserve user privacy while enabling efficient and scalable query processing. VeLoPIR introduces three operational modes (interval validation, coordinate validation, and identifier matching) that support a broad range of real-world applications, including information and emergency alerts. To enhance performance, VeLoPIR incorporates multi-level algorithmic optimizations with parallel structures, achieving significant scalability across both CPU and GPU platforms. We also provide formal security and privacy proofs, confirming the system's robustness under standard cryptographic assumptions. Extensive experiments on real-world datasets demonstrate that VeLoPIR achieves up to 11.55 times speed-up over a prior baseline. The implementation of VeLoPIR is publicly available at https://github.com/PrivStatBool/VeLoPIR.
Submitted 15 June, 2025;
originally announced June 2025.
-
Debiasing Online Preference Learning via Preference Feature Preservation
Authors:
Dongyoung Kim,
Jinsung Yoon,
Jinwoo Shin,
Jaehyung Kim
Abstract:
Recent preference learning frameworks for large language models (LLMs) simplify human preferences with binary pairwise comparisons and scalar rewards. This simplification could bias LLMs' responses toward the most commonly preferred features, and the bias would be exacerbated over the iterations of online preference learning. To address these challenges, we propose a novel framework coined PFP (Preference Feature Preservation). The key idea of PFP is maintaining the distribution of human preference features and utilizing such rich signals throughout the online preference learning process. Specifically, PFP first extracts preference features from offline pairwise human preference data and trains a feature classifier. Then, using the trained classifier and distribution-preserving optimization, PFP maps appropriate preference features to each new input instruction during online learning. Lastly, PFP trains the LLM using an existing preference learning method, incorporating the preference feature into system prompts and enabling the LLM to explicitly handle various human preferences. Our experiments demonstrate that PFP successfully mitigates the bias in preference features during online learning, and hence achieves superior performance compared to previous preference learning methods on standard benchmarks for evaluating LLM alignment.
Submitted 6 June, 2025;
originally announced June 2025.
-
Fast Monte Carlo Tree Diffusion: 100x Speedup via Parallel Sparse Planning
Authors:
Jaesik Yoon,
Hyeonseo Cho,
Yoshua Bengio,
Sungjin Ahn
Abstract:
Diffusion models have recently emerged as a powerful approach for trajectory planning. However, their inherently non-sequential nature limits their effectiveness in long-horizon reasoning tasks at test time. The recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search, achieving state-of-the-art performance on complex planning problems. Despite its strengths, our analysis shows that MCTD incurs substantial computational overhead due to the sequential nature of tree search and the cost of iterative denoising. To address this, we propose Fast-MCTD, a more efficient variant that preserves the strengths of MCTD while significantly improving its speed and scalability. Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening. Experiments show that Fast-MCTD achieves up to 100x speedup over standard MCTD while maintaining or improving planning performance. Remarkably, it even outperforms Diffuser in inference speed on some tasks, despite Diffuser requiring no search and yielding weaker solutions. These results position Fast-MCTD as a practical and scalable solution for diffusion-based inference-time reasoning.
Submitted 11 June, 2025;
originally announced June 2025.
-
GPS Spoofing Attacks on AI-based Navigation Systems with Obstacle Avoidance in UAV
Authors:
Ji Hyuk Jung,
Mi Yeon Hong,
Ji Won Yoon
Abstract:
Recently, approaches using Deep Reinforcement Learning (DRL) have been proposed for UAV navigation in complex and unknown environments. However, despite extensive research and attention, systematic studies of their security have not yet been conducted. Therefore, in this paper, we study security vulnerabilities in DRL-based navigation systems, focusing on GPS spoofing attacks against the system. Many recent basic DRL-based navigation systems fundamentally share an efficient structure. This paper presents an attack model that operates through GPS spoofing, briefly modeling the range of spoofing attacks against the EKF sensor fusion of the PX4 autopilot, and combines this with the DRL-based system to design attack scenarios that are closer to reality. Finally, this paper experimentally demonstrates that attacks are possible both in the basic DRL system and in attack models combining the DRL system with the PX4 autopilot system.
Submitted 10 June, 2025;
originally announced June 2025.
-
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
Authors:
Sangwon Jang,
Taekyung Ki,
Jaehyeong Jo,
Jaehong Yoon,
Soo Ye Kim,
Zhe Lin,
Sung Ju Hwang
Abstract:
Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance method for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, and is compatible with any video model. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.
Submitted 8 June, 2025;
originally announced June 2025.
-
Active Test-time Vision-Language Navigation
Authors:
Heeju Ko,
Sungjune Kim,
Gyeongrok Oh,
Jeongyoon Yoon,
Honglak Lee,
Sujin Jang,
Seungryong Kim,
Sangpil Kim
Abstract:
Vision-Language Navigation (VLN) policies trained on offline datasets often exhibit degraded task performance when deployed in unfamiliar navigation environments at test time, where agents are typically evaluated without access to external interaction or feedback. Entropy minimization has emerged as a practical solution for reducing prediction uncertainty at test time; however, it can suffer from accumulated errors, as agents may become overconfident in incorrect actions without sufficient contextual grounding. To tackle these challenges, we introduce ATENA (Active TEst-time Navigation Agent), a test-time active learning framework that enables a practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. Here, we propose mixture entropy optimization, where entropy is obtained from a combination of the action and pseudo-expert distributions (the latter being a hypothetical action distribution that assumes the agent's selected action to be optimal), controlling both prediction confidence and action preference. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions. As a result, the agent stays actively engaged throughout all iterations, leading to well-grounded and adaptive decision-making. Extensive evaluations on challenging VLN benchmarks (REVERIE, R2R, and R2R-CE) demonstrate that ATENA successfully overcomes distributional shifts at test time, outperforming the compared baseline methods across various settings.
Submitted 6 June, 2025;
originally announced June 2025.
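One way to read the "mixture entropy" described in the ATENA abstract is sketched below. The abstract does not specify how the action and pseudo-expert distributions are combined, so the convex combination with weight `lam`, the one-hot pseudo-expert placed on the highest-probability action, and the function names are all assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def mixture_entropy(action_probs, lam=0.5):
    """Entropy of lam * action distribution + (1 - lam) * pseudo-expert.

    The pseudo-expert is assumed to be one-hot on the agent's selected
    action, taken here to be the argmax of the action distribution.
    """
    p = np.asarray(action_probs, dtype=float)
    pseudo_expert = np.zeros_like(p)
    pseudo_expert[np.argmax(p)] = 1.0
    mix = lam * p + (1.0 - lam) * pseudo_expert
    return entropy(mix)
```

Under these assumptions, `lam=1.0` recovers the plain action entropy, and smaller `lam` values pull the mixture toward the selected action, which lowers the entropy. Minimizing this quantity in successful episodes and maximizing it in failed ones would correspond to the abstract's description of raising or lowering certainty.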
-
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
Authors:
Emmanouil Zaranis,
António Farinhas,
Saul Santos,
Beatriz Canaverde,
Miguel Moura Ramos,
Aditya K Surikuchi,
André Viveiros,
Baohao Liao,
Elena Bueno-Benito,
Nithin Sivakumaran,
Pavlo Vasylenko,
Shoubin Yu,
Sonal Sannigrahi,
Wafaa Mohammed,
Ben Peters,
Danae Sánchez Villegas,
Elias Stengel-Eskin,
Giuseppe Attanasio,
Jaehong Yoon,
Stella Frank,
Alessandro Suglia,
Chrysoula Zerva,
Desmond Elliott,
Mariella Dimiccoli,
Mohit Bansal
, et al. (6 additional authors not shown)
Abstract:
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, ``needle-in-a-haystack'' details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs -- one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information -- an ability current VLMs lack.
Submitted 6 June, 2025;
originally announced June 2025.
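The binary claim evaluation protocol in the MF$^2$ abstract credits a pair only when both the fact and the fib are judged correctly. A minimal sketch of that scoring rule follows; the function name `pair_accuracy` and the input format (one `(fact_judged_true, fib_judged_true)` tuple per pair) are assumptions, not the benchmark's actual interface.

```python
def pair_accuracy(judgments):
    """Fraction of claim pairs where the model accepts the fact AND rejects the fib.

    judgments: iterable of (fact_judged_true, fib_judged_true) booleans,
    one tuple per claim pair.
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("at least one claim pair is required")
    correct = sum(1 for fact_ok, fib_ok in judgments if fact_ok and not fib_ok)
    return correct / len(judgments)
```

A model that answers "true" to every claim gets every fact right but scores 0 under this rule, which is how pairing removes the benefit of a constant answer.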
-
Adaptive Cucker-Smale Networks: Limiting Laplacian Time-Varying Dynamics
Authors:
Christian Kuehn,
Jaeyoung Yoon
Abstract:
Differences in opinion can be seen as distances between individuals, and such differences do not always vanish over time. In this paper, we propose a modeling framework that captures the formation of opinion clusters, based on extensions of the Cucker-Smale and Hegselmann-Krause models to a combined adaptive (or co-evolutionary) network. Reducing our model to a singular limit of fast adaptation, we mathematically analyze the asymptotic behavior of the resulting Laplacian dynamics over various classes of temporal graphs and use these results to explain the behavior of the original proposed adaptive model for fast adaptation. In particular, our approach provides a general methodology for analyzing linear consensus models over time-varying networks that naturally arise as singular limits in many adaptive network models.
Submitted 6 June, 2025;
originally announced June 2025.
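The "linear consensus models over time-varying networks" mentioned in the abstract have the generic form x'(t) = -L(t) x(t), where L(t) is a graph Laplacian. The sketch below integrates that generic equation with explicit Euler steps over a periodically switching undirected graph. It is a textbook illustration, not the paper's adaptive model: the switching schedule, unit edge weights, step size, and function names are all assumptions.

```python
import numpy as np

def laplacian(n, edges):
    """Unweighted, undirected graph Laplacian for n nodes."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

def simulate_consensus(x0, graphs, steps, dt=0.1):
    """Explicit Euler integration of x' = -L(t) x, cycling through `graphs`.

    graphs: list of edge lists; the graph used at step t is graphs[t % len(graphs)].
    dt should satisfy dt * (max node degree) < 1 so each step is an averaging step.
    """
    x = np.asarray(x0, dtype=float).copy()
    laps = [laplacian(x.size, edges) for edges in graphs]
    for t in range(steps):
        x = x - dt * (laps[t % len(laps)] @ x)
    return x
```

For example, with five agents and two alternating edge sets whose union is a path, neither graph is connected on its own, yet the opinions still converge to the initial mean because the union is connected. When the union is disconnected, the same code produces separate clusters instead.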
-
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
Authors:
Daeun Lee,
Jaehong Yoon,
Jaemin Cho,
Mohit Bansal
Abstract:
Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) over various video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and learned skills over multiple video domains.
Submitted 3 June, 2025;
originally announced June 2025.
-
PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings
Authors:
Junseo Kim,
Jongwook Han,
Dongmin Choi,
Jongwook Yoon,
Eun-Ju Lee,
Yohan Jo
Abstract:
Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
Submitted 31 May, 2025;
originally announced June 2025.
-
Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?
Authors:
Jiwan Chung,
Janghan Yoon,
Junhyeong Park,
Sangeyl Lee,
Joowon Yang,
Sooyeon Park,
Youngjae Yu
Abstract:
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.
Submitted 30 May, 2025;
originally announced May 2025.
-
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
Authors:
Zun Wang,
Jaemin Cho,
Jialu Li,
Han Lin,
Jaehong Yoon,
Yue Zhang,
Mohit Bansal
Abstract:
Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions into pretrained VDMs, with less than 1% of backbone model parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, fewer training steps, and less data, without requiring the modifications to the diffusion model backbone that are typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
Submitted 27 May, 2025;
originally announced May 2025.
-
Proximity engineering and interferometric quantification of a non-volatile anomalous phase-shift in zero-field polarity-reversible Josephson diodes
Authors:
Kun-Rok Jeon,
Jae-Keun Kim,
Jiho Yoon,
Jae-Chun Jeon,
Hyeon Han,
Audrey Cottet,
Takis Kontos,
Stuart S. P. Parkin
Abstract:
The recent realization of zero-field polarity-reversible supercurrent rectification in proximity-magnetized Rashba(-type) Pt Josephson junctions (JJs) promises practical applications in superconducting logic circuits and cryogenic memories. Here, by substituting the Pt Josephson barrier for either 5d or 4d element proximity layer with different (para-)magnetic susceptibility, spin-orbit coupling and electronic band structure, we identify the proximity role of the Josephson barrier in determining the zero-field diode properties. Ta (W) JJs reveal the zero-field diode efficiency of ~17 (~5)% at 2 K, slightly (much) smaller than that of the Pt JJs. Notably, the zero-field diode polarity of Ta and W JJs turns out to be opposite to the Pt JJs. Our results, along with a large zero-field diode efficiency found in highly magnetic-susceptible Pd JJs and a non-volatile anomalous phase-shift φ_0 probed by superconducting quantum interferometry, demonstrate the φ_0-tuning of zero-field diode performance via proximity engineering of interface magnetic ordering and Rashba effect.
Submitted 26 May, 2025;
originally announced May 2025.
-
Hybrid Latent Reasoning via Reinforcement Learning
Authors:
Zhenrui Yue,
Bowen Jin,
Huimin Zeng,
Honglei Zhuang,
Zhen Qin,
Jinsung Yoon,
Lanyu Shang,
Jiawei Han,
Dong Wang
Abstract:
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
Submitted 23 May, 2025;
originally announced May 2025.
-
An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
Authors:
Bowen Jin,
Jinsung Yoon,
Priyanka Kargupta,
Sercan O. Arik,
Jiawei Han
Abstract:
Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors -- such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process -- require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at https://github.com/PeterGriffinJin/Search-R1.
Submitted 21 May, 2025;
originally announced May 2025.
-
Adaptive Cyclic Diffusion for Inference Scaling
Authors:
Gyubin Lee,
Truong Nhat Nguyen Bao,
Jaesik Yoon,
Dongwoo Lee,
Minsu Kim,
Yoshua Bengio,
Sungjin Ahn
Abstract:
Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to adaptively allocate computation based on instance difficulty or task-specific demands. We introduce the challenge of adaptive inference-time scaling -- dynamically adjusting computational effort during inference -- and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.
Submitted 25 May, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation
Authors:
Yubin Kim,
Taehan Kim,
Wonjune Kang,
Eugene Park,
Joonsik Yoon,
Dongjae Lee,
Xin Liu,
Daniel McDuff,
Hyeonhoon Lee,
Cynthia Breazeal,
Hae Won Park
Abstract:
Vocal health plays a crucial role in people's lives, significantly impacting their communicative abilities and interactions. However, despite the global prevalence of voice disorders, many lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) to address these challenges through vocal health diagnosis. We leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, cross-lingual performance analysis, and modality ablation studies. VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based method offers a scalable solution for broader adoption of health diagnostics, while underscoring the importance of ethical and technical validation.
Submitted 26 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Authors:
Ryan Hoque,
Peide Huang,
David J. Yoon,
Mouli Sivapurapu,
Jian Zhang
Abstract:
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models.
Submitted 16 May, 2025;
originally announced May 2025.
-
3D-Fixup: Advancing Photo Editing with 3D Priors
Authors:
Yen-Chi Cheng,
Krishna Kumar Singh,
Jae Shin Yoon,
Alex Schwing,
Liangyan Gui,
Matheus Gadelha,
Paul Guerrero,
Nanxuan Zhao
Abstract:
Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity-coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/
Submitted 15 May, 2025;
originally announced May 2025.
-
IntrinsicEdit: Precise generative image manipulation in intrinsic space
Authors:
Linjie Lyu,
Valentin Deschaintre,
Yannick Hold-Geoffroy,
Miloš Hašan,
Jae Shin Yoon,
Thomas Leimkühler,
Christian Theobalt,
Iliyan Georgiev
Abstract:
Generative diffusion models have advanced image editing with high-quality results and intuitive interfaces such as prompts and semantic drawing. However, these interfaces lack precise control, and the associated methods typically specialize in a single editing task. We introduce a versatile, generative workflow that operates in an intrinsic-image latent space, enabling semantic, local manipulation with pixel precision for a range of editing operations. Building atop the RGB-X diffusion framework, we address key challenges of identity preservation and intrinsic-channel entanglement. By incorporating exact diffusion inversion and disentangled channel manipulation, we enable precise, efficient editing with automatic resolution of global illumination effects -- all without additional data collection or model fine-tuning. We demonstrate state-of-the-art performance across a variety of tasks on complex images, including color and texture adjustments, object insertion and removal, global relighting, and their combinations.
Submitted 15 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Size dependence of the properties of synthetic-antiferromagnet-based stochastic magnetic tunnel junctions for probabilistic computing
Authors:
Takuma Kinoshita,
Ju-Young Yoon,
Nuno Caçoilo,
Ryota Mochizuki,
Haruna Kaneko,
Shun Kanai,
Hideo Ohno,
Shunsuke Fukami
Abstract:
Stochastic magnetic tunnel junctions (s-MTJs) are core components for spintronics-based probabilistic computing (p-computing), a promising candidate for energy-efficient unconventional computing. To achieve reliable performance under practical conditions, the use of a synthetic antiferromagnetic (SAF) free-layer configuration was proposed due to its enhanced tolerance to magnetic field perturbations. For engineering the SAF s-MTJs, we systematically investigate the properties of the SAF s-MTJs as a function of the junction size. We observe that decreasing junction size leads to shorter relaxation times, enhanced magnetic field robustness, and enhanced insensitivity to bias voltage. These findings provide key insights toward high-performance p-computers with reliable operation.
Submitted 8 May, 2025;
originally announced May 2025.
-
G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness
Authors:
Jaehyun Jeon,
Jang Han Yoon,
Min Soo Kim,
Sumin Shim,
Yejin Choi,
Hanbin Kim,
Youngjae Yu
Abstract:
Evaluating user interface (UI) design effectiveness extends beyond aesthetics to influencing user behavior, a principle central to Design Persuasiveness. A/B testing is the predominant method for determining which UI variations drive higher user engagement, but it is costly and time-consuming. While recent Vision-Language Models (VLMs) can process automated UI analysis, current approaches focus on isolated design attributes rather than comparative persuasiveness -- the key factor in optimizing user interactions. To address this, we introduce WiserUI-Bench, a benchmark designed for the Pairwise UI Design Persuasiveness Assessment task, featuring 300 real-world UI image pairs labeled with A/B test results and expert rationales. Additionally, we propose G-FOCUS, a novel inference-time reasoning strategy that enhances VLM-based persuasiveness assessment by reducing position bias and improving evaluation accuracy. Experimental results show that G-FOCUS surpasses existing inference strategies in consistency and accuracy for pairwise UI evaluation. Through promoting VLM-driven evaluation of UI persuasiveness, our work offers an approach to complement A/B testing, propelling progress in scalable UI preference modeling and design optimization. Code and data will be released publicly.
Submitted 9 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
Retrieval Augmented Time Series Forecasting
Authors:
Sungwon Han,
Seungeon Lee,
Meeyoung Cha,
Sercan O Arik,
Jinsung Yoon
Abstract:
Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model's learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model's capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on ten benchmark datasets show that RAFT consistently outperforms contemporary baselines with an average win ratio of 86%.
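The retrieval step described in this abstract can be illustrated with a minimal sketch. The function name `retrieve_and_forecast`, the squared-distance similarity, and the plain averaging of retrieved futures are assumptions made for illustration; the abstract does not specify RAFT's similarity measure or how retrieved values are combined with the inputs, and the learned forecasting model is omitted here.

```python
def retrieve_and_forecast(history, query, horizon, top_k=2):
    """Average the futures of the top_k history windows most similar to query.

    Hypothetical helper: squared Euclidean distance and simple averaging are
    illustrative stand-ins, not the method reported for RAFT.
    """
    window = len(query)
    candidates = []
    # Slide over the history, pairing each past window with the values after it.
    for start in range(len(history) - window - horizon + 1):
        past = history[start:start + window]
        future = history[start + window:start + window + horizon]
        distance = sum((p - q) ** 2 for p, q in zip(past, query))
        candidates.append((distance, future))
    if not candidates:
        raise ValueError("history is too short for the given query and horizon")
    # Keep the closest matches (smallest distance first).
    candidates.sort(key=lambda c: c[0])
    best = candidates[:top_k]
    # Combine the retrieved futures step by step.
    return [sum(f[i] for _, f in best) / len(best) for i in range(horizon)]


history = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
forecast = retrieve_and_forecast(history, query=[1, 2], horizon=2)
```

In this toy periodic series, every window matching `[1, 2]` is followed by `[3, 4]`, so the retrieved futures agree with each other.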
Submitted 7 May, 2025;
originally announced May 2025.
-
FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning
Authors:
Ju Yeon Kang,
Ji Won Yoon,
Semin Kim,
Min Hyun Han,
Nam Soo Kim
Abstract:
Recently, fake audio detection has gained significant attention, as advancements in speech synthesis and voice conversion have increased the vulnerability of automatic speaker verification (ASV) systems to spoofing attacks. A key challenge in this task is generalizing models to detect unseen, out-of-distribution (OOD) attacks. Although existing approaches have shown promising results, they inherently suffer from overconfidence issues due to the usage of softmax for classification, which can produce unreliable predictions when encountering unpredictable spoofing attempts. To deal with this limitation, we propose a novel framework called fake audio detection with evidential learning (FADEL). By modeling class probabilities with a Dirichlet distribution, FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios. Experimental results on the ASVspoof2019 Logical Access (LA) and ASVspoof2021 LA datasets indicate that the proposed method significantly improves the performance of baseline models. Furthermore, we demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
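As a rough illustration of how a Dirichlet distribution over class probabilities yields an uncertainty estimate, the sketch below uses the common evidential-learning parameterization (concentration = evidence + 1, uncertainty = number of classes divided by total concentration). This parameterization and the name `dirichlet_prediction` are assumptions; the abstract does not state the exact formulation FADEL uses.

```python
def dirichlet_prediction(evidence):
    """Return (expected class probabilities, uncertainty) from class evidence.

    Assumed formulation: alpha_k = evidence_k + 1, p_k = alpha_k / S,
    u = K / S, where S is the sum of alpha_k over the K classes.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]        # Dirichlet concentration parameters
    strength = sum(alpha)                      # total evidence plus one per class
    probs = [a / strength for a in alpha]      # mean of the Dirichlet distribution
    uncertainty = num_classes / strength       # equals 1.0 when there is no evidence
    return probs, uncertainty


# No evidence for either class (bona fide vs. spoofed): maximal uncertainty.
probs_unknown, u_unknown = dirichlet_prediction([0.0, 0.0])
# Strong evidence for the first class: confident prediction, low uncertainty.
probs_confident, u_confident = dirichlet_prediction([8.0, 0.0])
```

Unlike a softmax output, which always sums to a confident-looking distribution, this formulation lets the model report that it has little evidence for any class.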
Submitted 22 April, 2025;
originally announced April 2025.
-
Quantum sensing with arbitrary frequency resolution via correlation measurements
Authors:
Jungbae Yoon,
Keyuan Zhong,
Guoqing Wang,
Boning Li,
Donghun Lee,
Paola Cappellaro
Abstract:
Achieving high-frequency spectral resolution with quantum sensors, while crucial in fields ranging from physical to biological sciences, is challenging due to their finite coherence time. Here, we introduce a novel protocol that achieves this goal by measuring phase correlations of AC magnetic fields using ensembles of NV centers. Our method extends the sensing dynamic range to frequencies higher than the system's Rabi frequency while achieving arbitrary frequency resolution, limited only by the target field coherence time. Moreover, our approach operates more robustly with respect to the magnetic field's amplitude. Thanks to this robustness, our protocol allows the application of more $π$-pulses in pulse sequences such as CPMG, enabling the decoupling of a broader range of frequency noise. The higher harmonics generated in this process continue to act as a part of the signal, ultimately improving the frequency resolution. This method paves the way for achieving arbitrary frequency resolution with improved performance, making it highly versatile for quantum sensing applications across diverse scientific fields.
Submitted 16 April, 2025;
originally announced April 2025.
-
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
Authors:
Jialu Li,
Shoubin Yu,
Han Lin,
Jaemin Cho,
Jaehong Yoon,
Mohit Bansal
Abstract:
Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
Submitted 11 April, 2025;
originally announced April 2025.
-
Euler-Lagrange study of Microbubble-Laden Turbulent Flow over Superhydrophobic surfaces
Authors:
Byeong-Cheon Kim,
Kyoungsik Chang,
Sang-Wook Lee,
Jaiyoung Ryu,
Minjae Kim,
Jaemoon Yoon
Abstract:
For slow-speed ships, underwater vehicles, and pipe transportation systems, viscous resistance accounts for a large proportion of the total energy losses. As such, various technologies have been developed to reduce viscous resistance and enhance energy efficiency in these applications. Air injection and surface treatment are two representative drag reduction techniques. Additionally, efforts to combine multiple drag-reduction techniques have been the subject of extensive research. In this study, the synergistic effects of integrating microbubble injection and superhydrophobic surface (SHS) drag reduction approaches were analyzed. A 2-way coupling Euler-Lagrange approach was used alongside direct numerical simulation, based on the spectral element method, to investigate the synergistic effects of applying two separate drag reduction methods. Three types of SHS were investigated in our simulations: post type, transverse ridge type, and ridge type. The drag reduction performances and flow characteristics of the various configurations, with and without microbubble injection, were compared in a turbulent horizontal channel flow with $Re_τ=180$. The results of these tests showed that combining post-type SHS with microbubbles was the most effective, producing a synergistic drag reduction effect. However, combining microbubble injection with ridge-type SHS increased drag relative to ridge-type SHS alone, showing the importance of carefully selecting wall type for the best possible performance.
Submitted 9 April, 2025;
originally announced April 2025.
-
Sums of squares of integers except for a fixed one
Authors:
Wonjun Chae,
Yun-seong Ji,
Kisuk Kim,
Kyoungmin Kim,
Byeong-Kweon Oh,
Jongheun Yoon
Abstract:
In this article, we study a sum of squares of integers except for a fixed one. For any nonnegative integer $n$, we find the minimum number of squares of integers except for $n$ whose sums represent all positive integers that are represented by a sum of squares except for it. This problem could be considered as a generalization of Dubouis's result for the case when $n=0$.
Submitted 4 April, 2025;
originally announced April 2025.
-
Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization
Authors:
Junying Wang,
Jingyuan Liu,
Xin Sun,
Krishna Kumar Singh,
Zhixin Shu,
He Zhang,
Jimei Yang,
Nanxuan Zhao,
Tuanfeng Y. Wang,
Simon S. Chen,
Ulrich Neumann,
Jae Shin Yoon
Abstract:
This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, restricting existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in the coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns the lighting cycle consistency from many real-world videos without any ground truth. At inference time, our temporal lighting module is combined with the diffusion models through the spatio-temporal feature blending algorithms without extra training; and we apply a new guided refinement as a post-processing step to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows strong generalizability and lighting temporal coherence, outperforming existing image-based human relighting and harmonization methods.
Submitted 3 April, 2025;
originally announced April 2025.
-
TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection
Authors:
Yoon Gyo Jung,
Jaewoo Park,
Jaeho Yoon,
Kuan-Chuan Peng,
Wonchul Kim,
Andrew Beng Jin Teoh,
Octavia Camps
Abstract:
We aim to solve unsupervised anomaly detection in a practical challenging environment where the normal dataset is both contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from tail-versus-noise trade-off where if a model is robust against pixel noise, then its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both well captures tail class information and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.
Submitted 3 April, 2025;
originally announced April 2025.
-
Advancing Human-Machine Teaming: Concepts, Challenges, and Applications
Authors:
Dian Chen,
Han Jun Yoon,
Zelin Wan,
Nithin Alluru,
Sang Won Lee,
Richard He,
Terrence J. Moore,
Frederica F. Nelson,
Sunghyun Yoon,
Hyuk Lim,
Dan Dongseong Kim,
Jin-Hee Cho
Abstract:
Human-Machine Teaming (HMT) is revolutionizing collaboration across domains such as defense, healthcare, and autonomous systems by integrating AI-driven decision-making, trust calibration, and adaptive teaming. This survey presents a comprehensive taxonomy of HMT, analyzing theoretical models, including reinforcement learning, instance-based learning, and interdependence theory, alongside interdisciplinary methodologies. Unlike prior reviews, we examine team cognition, ethical AI, multi-modal interactions, and real-world evaluation frameworks. Key challenges include explainability, role allocation, and scalable benchmarking. We propose future research in cross-domain adaptation, trust-aware AI, and standardized testbeds. By bridging computational and social sciences, this work lays a foundation for resilient, ethical, and scalable HMT systems.
Submitted 6 May, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
Cube: A Roblox View of 3D Intelligence
Authors:
Foundation AI Team,
Kiran Bhat,
Nishchaie Khanna,
Karun Channa,
Tinghui Zhou,
Yiheng Zhu,
Xiaoxia Sun,
Charles Shang,
Anirudh Sudarshan,
Maurice Chu,
Daiqing Li,
Kangle Deng,
Jean-Philippe Fauconnier,
Tijmen Verhulsdonck,
Maneesh Agrawala,
Kayvon Fatahalian,
Alexander Weiss,
Christian Reiser,
Ravi Kiran Chirravuri,
Ravali Kandur,
Alejandro Pelaez,
Akash Garg,
Michael Palleschi,
Jessica Wang,
Skylar Litz
, et al. (22 additional authors not shown)
Abstract:
Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation…
▽ More
Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.
Submitted 14 April, 2025; v1 submitted 19 March, 2025;
originally announced March 2025.
-
Humanoid Policy ~ Human Policy
Authors:
Ri-Zhao Qiu,
Shiqi Yang,
Xuxin Cheng,
Chaitanya Chawla,
Jialong Li,
Tairan He,
Ge Yan,
David J. Yoon,
Ryan Hoque,
Lars Paulsen,
Ge Yang,
Jian Zhang,
Sha Yi,
Guanya Shi,
Xiaolong Wang
Abstract:
Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency. Code and data: https://human-as-robot.github.io/
Submitted 24 March, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
Elemental abundances of 44 very metal-poor stars determined from Subaru/IRD near-infrared spectra
Authors:
Wako Aoki,
Timothy C. Beers,
Satoshi Honda,
Tadafumi Matsuno,
Vinicius M. Placco,
Jinmi Yoon,
Masayuki Kuzuhara,
Hiroki Harakawa,
Teruyuki Hirano,
Takayuki Kotani,
Takashi Kurokawa,
Jun Nishikawa,
Masashi Omiya,
Motohide Tamura,
Sebastien Vievard
Abstract:
Abundances of five elements, Na, Mg, Al, Si, and Sr, are investigated for 44 very metal-poor stars (-4.0 < [Fe/H] < -1.5) in the Galactic halo system based on a Local Thermodynamic Equilibrium (LTE) analysis of high-resolution near-infrared spectra obtained with the Infrared Doppler instrument (IRD) on the Subaru Telescope. Mg and Si abundances are determined for all 44 stars. The Si abundances are determined from up to 29 lines, which provide reliable abundance ratios compared to previous results from a few optical lines. The Mg and Si of these stars are over-abundant, relative to iron, and are well-explained by chemical-evolution models. No significant scatter is found in the abundance ratios of both elements with respect to iron, except for a few outliers. The small scatter of the abundance ratios of these elements provides constraints on the variations of stellar and supernova yields at very low metallicity. Al abundances are determined for 27 stars from near-infrared lines (e.g., 1312 nm), which are expected to be less affected by non-LTE (NLTE) effects than optical resonance lines. The average of the [Al/Fe] ratios is close to the solar value, and no dependence on metallicity is found over -3.0 < [Fe/H] < -2.0. Na abundances are determined for 12 stars; they exhibit solar abundance ratios and no dependence on metallicity. The Sr abundances determined from the Sr II triplet are significantly higher than those from the optical resonance lines obtained by previous studies for our sample. This discrepancy shows a clear dependence on temperature and surface gravity, supporting models that predict large NLTE effects on the near-infrared lines for metal-poor red giants.
Submitted 14 March, 2025;
originally announced March 2025.
-
DeepSeek-Inspired Exploration of RL-based LLMs and Synergy with Wireless Networks: A Survey
Authors:
Yu Qiao,
Phuong-Nam Tran,
Ji Su Yoon,
Loc X. Nguyen,
Eui-Nam Huh,
Dusit Niyato,
Choong Seon Hong
Abstract:
Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have gained significant attention for their exceptional capabilities in natural language processing and multimodal data understanding. Meanwhile, the rapid expansion of information services has driven the growing need for intelligent, efficient, and adaptable wireless networks. Wireless networks require the empowerment of RL-based LLMs while these models also benefit from wireless networks to broaden their application scenarios. Specifically, RL-based LLMs can enhance wireless communication systems through intelligent resource allocation, adaptive network optimization, and real-time decision-making. Conversely, wireless networks provide a vital infrastructure for the efficient training, deployment, and distributed inference of RL-based LLMs, especially in decentralized and edge computing environments. This mutual empowerment highlights the need for a deeper exploration of the interplay between these two domains. We first review recent advancements in wireless communications, highlighting the associated challenges and potential solutions. We then discuss the progress of RL-based LLMs, focusing on key technologies for LLM training, challenges, and potential solutions. Subsequently, we explore the mutual empowerment between these two fields, highlighting key motivations, open challenges, and potential solutions. Finally, we provide insights into future directions, applications, and their societal impact to further explore this intersection, paving the way for next-generation intelligent communication systems. Overall, this survey provides a comprehensive overview of the relationship between RL-based LLMs and wireless networks, offering a vision where these domains empower each other to drive innovations.
Submitted 16 April, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Authors:
Bowen Jin,
Hansi Zeng,
Zhenrui Yue,
Jinsung Yoon,
Sercan Arik,
Dong Wang,
Hamed Zamani,
Jiawei Han
Abstract:
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
Submitted 8 April, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Authors:
Yubin Kim,
Hyewon Jeong,
Shan Chen,
Shuyue Stella Li,
Mingyu Lu,
Kumail Alhamoud,
Jimin Mun,
Cristina Grau,
Minseok Jung,
Rodrigo Gameiro,
Lizhou Fan,
Eugene Park,
Tristan Lin,
Joonsik Yoon,
Wonjin Yoon,
Maarten Sap,
Yulia Tsvetkov,
Paul Liang,
Xuhai Xu,
Xin Liu,
Daniel McDuff,
Hyeonhoon Lee,
Hae Won Park,
Samir Tulebaev,
Cynthia Breazeal
Abstract:
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need not only for technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical hallucination.
Submitted 25 February, 2025;
originally announced March 2025.
-
Balancing Act: Trading Off Doppler Odometry and Map Registration for Efficient Lidar Localization
Authors:
Katya M. Papais,
Daniil Lisus,
David J. Yoon,
Andrew Lambert,
Keith Y. K. Leung,
Timothy D. Barfoot
Abstract:
Most autonomous vehicles rely on accurate and efficient localization, which is achieved by comparing live sensor data to a preexisting map, to navigate their environment. Balancing the accuracy of localization with computational efficiency remains a significant challenge, as high-accuracy methods often come with higher computational costs. In this paper, we present two ways of improving lidar localization efficiency and study their impact on performance. First, we integrate a lightweight Doppler-based odometry method into a topometric localization pipeline and compare its performance against an iterative closest point (ICP)-based method. We highlight the trade-offs between these approaches: the Doppler estimator offers faster, lightweight updates, while ICP provides higher accuracy at the cost of increased computational load. Second, by controlling the frequency of localization updates and leveraging odometry estimates between them, we demonstrate that accurate localization can be maintained while optimizing for computational efficiency using either odometry method. Our experimental results show that localizing every 10 lidar frames strikes a favourable balance, achieving a localization accuracy below 0.05 meters in translation and below 0.1 degrees in orientation while reducing computational effort by over 30% in an ICP-based pipeline. We quantify the trade-off of accuracy to computational effort using over 100 kilometers of real-world driving data in different on-road environments.
Submitted 3 March, 2025;
originally announced March 2025.
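The second idea in the abstract above, localizing against the map only every N lidar frames and relying on odometry in between, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: `localize_with_interval`, `toy_localize`, the 2D translation-only pose, and the biased odometry increments are all hypothetical stand-ins.

```python
def localize_with_interval(odometry_deltas, localize, interval=10):
    """Dead-reckon with odometry, correcting against the map every `interval` frames.

    odometry_deltas: list of (dx, dy) increments, one per lidar frame (hypothetical format).
    localize: callable frame_index -> (x, y), standing in for a map-registration step.
    """
    pose = localize(0)          # start from a map-based estimate
    poses = [pose]
    map_calls = 1
    for frame, (dx, dy) in enumerate(odometry_deltas, start=1):
        if frame % interval == 0:
            pose = localize(frame)              # expensive map registration
            map_calls += 1
        else:
            pose = (pose[0] + dx, pose[1] + dy)  # cheap odometry update
        poses.append(pose)
    return poses, map_calls


def toy_localize(frame):
    # Toy ground truth: the vehicle moves 1 m per frame along x.
    return (float(frame), 0.0)


# Odometry that overestimates each step by 10%, so drift builds up between corrections.
deltas = [(1.1, 0.0)] * 30
poses, map_calls = localize_with_interval(deltas, toy_localize, interval=10)
```

In this toy run the drift grows until each map-based correction resets it, which is the accuracy-versus-compute trade-off the abstract quantifies. A real system would fuse the odometry prior with the registration result rather than overwrite the pose.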
-
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
Authors:
Yi-Lin Sung,
Prateek Yadav,
Jialu Li,
Jaehong Yoon,
Mohit Bansal
Abstract:
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g. which have large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate outliers (those with exceptionally large magnitude), (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
Submitted 3 March, 2025;
originally announced March 2025.
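The rotation step mentioned in the abstract above rests on a standard fact: applying an orthogonal matrix to a linear layer's inputs and its transpose to the weights leaves the layer output unchanged, while tending to spread a single outlier channel across many channels. The sketch below illustrates only that invariance with NumPy. The random orthogonal matrix from a QR decomposition, the toy shapes, and the names `random_orthogonal` and `rotate_layer` are assumptions for illustration, not the construction used in the paper.

```python
import numpy as np


def random_orthogonal(dim, seed=0):
    """Return a random orthogonal matrix via QR (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def rotate_layer(x, w, q):
    """Rotate activations and weights so that (x q)(q^T w) equals x w."""
    return x @ q, q.T @ w


rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))   # toy activations: 4 tokens, 8 channels
x[:, 3] *= 50.0                   # inject one outlier channel
w = rng.standard_normal((8, 5))   # toy weight matrix

q = random_orthogonal(8)
x_rot, w_rot = rotate_layer(x, w, q)

# The layer output is preserved up to floating-point error.
output_preserved = np.allclose(x @ w, x_rot @ w_rot)
```

Because the output is preserved, quantization can be applied to the rotated weights without changing what the full-precision model computes.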
-
A Single Scale Doesn't Fit All: Adaptive Motion Scaling for Efficient and Precise Teleoperation
Authors:
Jeonghyeon Yoon,
Sanghyeok Park,
Hyojae Park,
Cholin Kim,
Sihyeoung Park,
Minho Hwang
Abstract:
Teleoperation is increasingly employed in environments where direct human access is difficult, such as hazardous exploration or surgical fields. However, if the motion scale factor (MSF) intended to compensate for workspace-size differences is set inappropriately, repeated clutching operations and reduced precision can significantly raise cognitive load. This paper presents a shared controller that dynamically applies the MSF based on the user's intended motion scale. Inspired by human motor skills, the leader arm trajectory is divided into coarse (fast, large-range) and fine (precise, small-range) movements, with three features extracted to train a fuzzy C-means (FCM) clustering model that probabilistically classifies the user's motion scale. Scaling the robot's motion accordingly reduces unnecessary repetition for large-scale movements and enables more precise control for fine operations. Incorporating recent trajectory data into model updates and offering user feedback for adjusting the MSF range and response speed allows mutual adaptation between user and system. In peg transfer experiments, compared to using a fixed single scale, the proposed approach demonstrated improved task efficiency (the number of clutching operations and task completion time decreased by 38.46% and 11.96%, respectively), while NASA-TLX scores confirmed a meaningful reduction (58.01%) in cognitive load. This outcome suggests that a user-intent-based motion scale adjustment can effectively enhance both efficiency and precision in teleoperation.
Submitted 3 March, 2025;
originally announced March 2025.
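The soft classification described in the abstract above can be illustrated with the standard fuzzy C-means membership formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance from a feature vector to cluster center i. The sketch below is a rough illustration only: the abstract does not name the three features, so the feature values, the two cluster centers, the MSF range, and the linear blending in `blended_scale` are all hypothetical.

```python
import numpy as np


def fcm_memberships(x, centers, m=2.0, eps=1e-12):
    """Standard fuzzy C-means membership of point x in each cluster."""
    d = np.linalg.norm(centers - x, axis=1) + eps       # distance to each center
    ratio = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                       # memberships sum to 1


def blended_scale(memberships, msf_fine=0.2, msf_coarse=1.0):
    """Hypothetical rule: blend the MSF bounds by cluster membership."""
    return float(memberships[0] * msf_fine + memberships[1] * msf_coarse)


# Hypothetical cluster centers in a 3-feature space: row 0 = fine, row 1 = coarse.
centers = np.array([[0.1, 0.1, 0.1],
                    [1.0, 1.0, 1.0]])
features = np.array([0.9, 0.8, 1.0])   # a fast, large-range motion sample

u = fcm_memberships(features, centers)
msf = blended_scale(u)
```

Here the sample lies close to the coarse center, so the coarse membership dominates and the blended scale sits near the upper end of the assumed range.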
-
Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy
Authors:
Felix Dobslaw,
Robert Feldt,
Juyeon Yoon,
Shin Yoo
Abstract:
Large Language Models (LLMs) and Multi-Agent LLMs (MALLMs) introduce non-determinism unlike traditional or machine learning software, requiring new approaches to verifying correctness beyond simple output comparisons or statistical accuracy over test datasets.
This paper presents a taxonomy for LLM test case design, informed by the research literature, our experience, and open-source tools that represent the state of practice. We identify key variation points that impact test correctness and highlight open challenges that the research, industry, and open-source communities must address as LLMs become integral to software systems.
Our taxonomy defines four facets of LLM test case design, addressing ambiguity in both inputs and outputs while establishing best practices. It distinguishes variability in goals, the system under test, and inputs, and introduces two key oracle types: atomic and aggregated. Our mapping indicates that current tools insufficiently account for these variability points, highlighting the need for closer collaboration between academia and practitioners to improve the reliability and reproducibility of LLM testing.
Submitted 1 March, 2025;
originally announced March 2025.
-
Rapid low-temperature synthesis of graphene-coated SiC substrates for remote and van der Waals epitaxy
Authors:
Se H. Kim,
Hanjoo Lee,
Dong Gwan Kim,
Donghan Kim,
Seugki Kim,
Hyunho Yang,
Yunsu Jang,
Jangho Yoon,
Hyunsoo Kim,
Seoyong Ha,
ByoungTak Lee,
Jung-Hee Lee,
Roy Byung Kyu Chung,
Hongsik Park,
Sungkyu Kim,
Tae Hoon Lee,
Hyun S. Kum
Abstract:
Non-conventional epitaxial techniques, such as van der Waals epitaxy (vdWE) and remote epitaxy, have attracted substantial attention in the semiconductor research community for their capability to repeatedly produce high-quality free-standing films from a single mother wafer. Successful implementation of these epitaxial techniques depends on creating a robust, uniform two-dimensional (2D) material surface. The conventional method for fabricating graphene on silicon carbide (SiC) is high-temperature graphitization. However, the extremely high temperature required for silicon sublimation (typically above 1500 °C) causes step-bunching of the SiC surface, forming non-uniform multilayer graphene stripes and an unfavorable surface morphology for epitaxial growth. Here, we developed a wafer-scale graphitization technique that allows fast synthesis of single-crystalline graphene at ultra-low temperatures by metal-assisted graphitization (MAG). We found annealing conditions that enable SiC dissociation while avoiding silicide formation, producing uniform single-crystalline graphene while maintaining the surface morphology of the substrate. The graphene thickness can be controlled by varying the metal thickness or annealing temperature, enabling remote epitaxy or vdWE. We successfully produced freestanding single-crystalline III-N (AlN, GaN) films on graphene/SiC via the 2D material-based layer transfer technique. Our results show that low-temperature graphene synthesis via MAG offers a promising route to producing large-scale ultra-wide bandgap free-standing crystalline membranes.
Submitted 20 May, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
-
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
Authors:
Yue Huang,
Chujie Gao,
Siyuan Wu,
Haoran Wang,
Xiangqi Wang,
Yujun Zhou,
Yanbo Wang,
Jiayi Ye,
Jiawen Shi,
Qihui Zhang,
Yuan Li,
Han Bao,
Zhaoyi Liu,
Tianrui Guan,
Dongping Chen,
Ruoxi Chen,
Kehan Guo,
Andy Zou,
Bryan Hooi Kuen-Yew,
Caiming Xiong,
Elias Stengel-Eskin,
Hongyang Zhang,
Hongzhi Yin,
Huan Zhang,
Huaxiu Yao, et al. (41 additional authors not shown)
Abstract:
Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components--metadata curation, test case generation, and contextual variation--to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation.
Submitted 11 May, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Monte Carlo Tree Diffusion for System 2 Planning
Authors:
Jaesik Yoon,
Hyeonseo Cho,
Doojin Baek,
Yoshua Bengio,
Sungjin Ahn
Abstract:
Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS), whose performance naturally improves with inference-time computation scaling, standard diffusion-based planners offer only limited avenues for scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS, such as controlling exploration-exploitation trade-offs, within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as inference-time computation increases.
Submitted 10 June, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
"It's Great Because It's Ran By Us": Empowering Teen Volunteer Discord Moderators to Design Healthy and Engaging Youth-Led Online Communities
Authors:
Jina Yoon,
Amy X. Zhang,
Joseph Seering
Abstract:
Online communities can offer many benefits for youth including peer learning, cultural expression, and skill development. However, most HCI research on youth-focused online communities has centered communities developed by adults for youth rather than by the youth themselves. In this work, we interviewed 11 teenagers (ages 13-17) who moderate online Discord communities created by youth, for youth. Participants were identified by Discord platform staff as leaders of well-moderated servers through an intensive exam and application-based process. We also interviewed 2 young adults who volunteered as mentors of some of our teen participants. We present our findings about the benefits, motivations, and risks of teen-led online communities, as well as the role of external stakeholders of these youth spaces. We contextualize our work within the broader teen online safety landscape to provide recommendations to better support, encourage, and protect teen moderators and their online communities. This empirical work contributes one of the first studies to date with teen Discord moderators and aims to empower safe youth-led online communities.
Submitted 10 February, 2025;
originally announced February 2025.
-
LLM Alignment as Retriever Optimization: An Information Retrieval Perspective
Authors:
Bowen Jin,
Jinsung Yoon,
Zhen Qin,
Ziqi Wang,
Wei Xiong,
Yu Meng,
Jiawei Han,
Sercan O. Arik
Abstract:
Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR's retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO's effectiveness with 38.9 % and 13.7 % averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.
Submitted 9 June, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
Authors:
Jiefeng Chen,
Jie Ren,
Xinyun Chen,
Chengrun Yang,
Ruoxi Sun,
Jinsung Yoon,
Sercan Ö Arık
Abstract:
Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing parallel scaling methods, such as repeated sampling or reward model scoring, often suffer from premature convergence and high costs due to task-specific reward model training, while sequential methods like SELF-REFINE cannot effectively leverage increased compute. This paper introduces Self-Enhanced Test-Time Scaling (SETS), a new approach that overcomes these limitations by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This innovative design facilitates efficient and scalable test-time computation for enhanced performance on complex tasks. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
Submitted 23 May, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
Higher-order chiral scalar from boundary reduction of 3d higher-spin gravity
Authors:
Calvin Yi-Ren Chen,
Euihun Joung,
Karapet Mkrtchyan,
Junggi Yoon
Abstract:
We use a recently proposed covariant procedure to reduce the Chern-Simons action of three-dimensional higher-spin gravity to the boundary, resulting in a Lorentz covariant action for higher-order chiral scalars. After gauge-fixing, we obtain a higher-derivative action generalizing the $s=1$ Floreanini-Jackiw and $s=2$ Alekseev-Shatashvili actions to arbitrary spin $s$. For simplicity, we treat the case of general spin at the linearized level, while the full non-linear asymptotic boundary conditions are presented in component form for the $SL(3,\mathbb R)$ case. Finally, we extend the spin-3 linearized analysis to a background with non-trivial higher-spin charge and show that it has a richer structure of zero modes.
Submitted 14 April, 2025; v1 submitted 27 January, 2025;
originally announced January 2025.