Showing 1–1 of 1 results for author: Da Silva, P Q

  1. arXiv:2504.04635  [pdf, other]

    cs.CL

    Steering off Course: Reliability Challenges in Steering Language Models

    Authors: Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, Sachin Kumar

    Abstract: Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, a…

    Submitted 6 April, 2025; originally announced April 2025.