Showing 1–6 of 6 results for author: Baek, D D

Searching in archive cs.
  1. arXiv:2504.18530

    cs.AI cs.CY cs.LG

    Scaling Laws For Scalable Oversight

    Authors: Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark

    Abstract: Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system bein…

    Submitted 9 May, 2025; v1 submitted 25 April, 2025; originally announced April 2025.

    Comments: 32 pages, 18 figures; The first three authors contributed equally

  2. arXiv:2503.03730

    cs.LG

    Towards Understanding Distilled Reasoning Models: A Representational Approach

    Authors: David D. Baek, Max Tegmark

    Abstract: In this paper, we investigate how model distillation impacts the development of reasoning features in large language models (LLMs). To explore this, we train a crosscoder on Qwen-series models and their fine-tuned variants. Our results suggest that the crosscoder learns features corresponding to various types of reasoning, including self-reflection and computation verification. Moreover, we observ…

    Submitted 24 March, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

    Comments: 13 pages, 9 figures

    Journal ref: ICLR 2025 Workshop on Building Trust in Language Models and Applications

  3. arXiv:2502.01628

    cs.LG

    Harmonic Loss Trains Interpretable AI Models

    Authors: David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark

    Abstract: In this paper, we introduce **harmonic loss** as an alternative to the standard cross-entropy loss for training neural networks and large language models (LLMs). Harmonic loss enables improved interpretability and faster convergence, owing to its scale invariance and finite convergence point by design, which can be interpreted as a class center. We first validate the performance of harmonic models…

    Submitted 3 February, 2025; originally announced February 2025.

    Comments: 12 pages, 7 figures; The first two authors contributed equally

  4. arXiv:2410.19750

    q-bio.NC cs.AI cs.LG

    The Geometry of Concepts: Sparse Autoencoder Feature Structure

    Authors: Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark

    Abstract: Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-ki…

    Submitted 30 March, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: 16 pages, 12 figures

    Journal ref: Entropy 2025, 27(4), 344

  5. arXiv:2410.08255

    cs.LG cs.AI

    Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning

    Authors: David D. Baek, Yuxiao Li, Max Tegmark

    Abstract: Motivated by interpretability and reliability, we investigate how neural networks represent knowledge during graph learning. We find hints of universality, where equivalent representations are learned across a range of model sizes (from $10^2$ to $10^9$ parameters) and contexts (MLP toy models, LLM in-context learning and LLM training). We show that these attractor representations optimize general…

    Submitted 10 October, 2024; originally announced October 2024.

    Comments: 14 pages, 13 figures

  6. GenEFT: Understanding Statics and Dynamics of Model Generalization via Effective Theory

    Authors: David D. Baek, Ziming Liu, Max Tegmark

    Abstract: We present GenEFT: an effective theory framework for shedding light on the statics and dynamics of neural network generalization, and illustrate it with graph learning examples. We first investigate the generalization phase transition as data size increases, comparing experimental results with information-theory-based approximations. We find generalization in a Goldilocks zone where the decoder is…

    Submitted 20 March, 2025; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: 12 pages, 6 figures

    Journal ref: Phys. Rev. E 111, 035307 (2025)