Showing 1–5 of 5 results for author: Tamirisa, R

  1. arXiv:2502.08640

    cs.LG cs.AI cs.CL cs.CV cs.CY

    Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

    Authors: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks

    Abstract: As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, l…

    Submitted 19 February, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

    Comments: Website: https://www.emergent-values.ai

  2. arXiv:2408.00761

    cs.LG cs.AI cs.CL

    Tamper-Resistant Safeguards for Open-Weight LLMs

    Authors: Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika

    Abstract: Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps…

    Submitted 10 February, 2025; v1 submitted 1 August, 2024; originally announced August 2024.

    Comments: Website: https://www.tamper-resistant-safeguards.com

  3. arXiv:2404.02478

    cs.LG cs.AI

    FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning

    Authors: Rishub Tamirisa, Chulin Xie, Wenxuan Bao, Andy Zhou, Ron Arel, Aviv Shamsian

    Abstract: Standard federated learning approaches suffer when client data distributions have sufficient heterogeneity. Recent methods addressed the client data heterogeneity issue via personalized federated learning (PFL) - a class of FL algorithms aiming to personalize learned global knowledge to better suit the clients' local data distributions. Existing PFL methods usually decouple global updates in deep…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Published in CVPR 2024

  4. arXiv:2403.03218

    cs.LG cs.AI cs.CL cs.CY

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, et al. (32 additional authors not shown)

    Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing furthe…

    Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: See the project page at https://wmdp.ai

  5. arXiv:2306.13264

    cs.LG cs.AI

    FedSelect: Customized Selection of Parameters for Fine-Tuning during Personalized Federated Learning

    Authors: Rishub Tamirisa, John Won, Chengjun Lu, Ron Arel, Andy Zhou

    Abstract: Recent advancements in federated learning (FL) seek to increase client-level performance by fine-tuning client parameters on local data or personalizing architectures for the local task. Existing methods for such personalization either prune a global model or fine-tune a global model on a local client distribution. However, these existing methods either personalize at the expense of retaining impo…

    Submitted 8 June, 2024; v1 submitted 22 June, 2023; originally announced June 2023.

    Journal ref: International Workshop on Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities in Conjunction with ICML 2023