
Showing 1–50 of 257 results for author: Salakhutdinov, R

  1. arXiv:2506.07822  [pdf, ps, other]

    cs.LG cs.AI

    Accelerating Diffusion Models in Offline RL via Reward-Aware Consistency Trajectory Distillation

    Authors: Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider

    Abstract: Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While the consistency model offers a potential solution, its applications to decision-making often struggle with suboptimal demonstrations or rely on complex concurrent training of multiple networks. In this work, we propose a novel approach to consistency distillat…

    Submitted 9 June, 2025; originally announced June 2025.

  2. arXiv:2505.21444  [pdf, other]

    cs.LG

    Can Large Reasoning Models Self-Train?

    Authors: Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette

    Abstract: Scaling the performance of large language models (LLMs) increasingly depends on methods that reduce reliance on human supervision. Reinforcement learning from automated verification offers an alternative, but it incurs scalability limitations due to dependency upon human-designed verifiers. Self-training, where the model's own judgment provides the supervisory signal, presents a compelling directi…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Project website: https://self-rewarding-llm-training.github.io/

  3. arXiv:2503.09780  [pdf, other]

    cs.AI

    AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

    Authors: Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, Kamalika Chaudhuri

    Abstract: Autonomous AI agents that can follow instructions and perform complex multi-step tasks have tremendous potential to boost human productivity. However, to perform many of these tasks, the agents need access to personal information from their users, raising the question of whether they are capable of using it appropriately. In this work, we introduce a new benchmark AgentDAM that measures if AI web-…

    Submitted 16 May, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: project page: https://github.com/facebookresearch/ai-agent-privacy

  4. arXiv:2503.07572  [pdf, other]

    cs.LG cs.AI cs.CL

    Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

    Authors: Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar

    Abstract: Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formal…

    Submitted 10 March, 2025; originally announced March 2025.

  5. arXiv:2502.17543  [pdf, other]

    cs.LG cs.AI cs.CL

    Training a Generally Curious Agent

    Authors: Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, Ruslan Salakhutdinov

    Abstract: Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on…

    Submitted 26 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: ICML 2025. Project Website: https://paprika-llm.github.io

  6. arXiv:2502.17432  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    FACTR: Force-Attending Curriculum Training for Contact-Rich Policy Learning

    Authors: Jason Jingzhou Liu, Yulong Li, Kenneth Shaw, Tony Tao, Ruslan Salakhutdinov, Deepak Pathak

    Abstract: Many contact-rich tasks humans perform, such as box pickup or rolling dough, rely on force feedback for reliable execution. However, this force information, which is readily available in most robot arms, is not commonly used in teleoperation and policy learning. Consequently, robot behavior is often limited to quasi-static kinematic tasks that do not require intricate force-feedback. In this paper…

    Submitted 24 April, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: Video results, codebases, and instructions: https://jasonjzliu.com/factr/

  7. arXiv:2502.06776  [pdf, other]

    cs.LG cs.AI

    InSTA: Towards Internet-Scale Training For Agents

    Authors: Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov

    Abstract: The predominant approach for training web navigation agents is to gather human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data is an inefficient resource. We develop a pipeline to facilitate internet-scale training for agents without laborious human annotations. In the first stage, an LLM annotates 150k sites with agentic tasks. In the…

    Submitted 22 May, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

    Comments: Improved results, zero-shot transfer to Web Voyager

  8. arXiv:2502.06130  [pdf, other]

    cs.CV cs.CL

    Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

    Authors: Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie

    Abstract: While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditione…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: Accepted by ICLR 2025. Project page: https://zhangce01.github.io/DeGF/

  9. arXiv:2502.04576  [pdf, other]

    cs.LG cs.CL

    Self-Regulation and Requesting Interventions

    Authors: So Yeon Min, Yue Wu, Jimin Sun, Max Kaufmann, Fahim Tajwar, Yonatan Bisk, Ruslan Salakhutdinov

    Abstract: Human intelligence involves metacognitive abilities like self-regulation, recognizing limitations, and seeking assistance only when needed. While LLM Agents excel in many domains, they often lack this awareness. Overconfident agents risk catastrophic failures, while those that seek help excessively hinder efficiency. A key challenge in enabling agents with a limited intervention budget $C$ is to d…

    Submitted 6 February, 2025; originally announced February 2025.

  10. arXiv:2501.13241  [pdf, other]

    cs.LG

    State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

    Authors: Xintong Duan, Yutong He, Fahim Tajwar, Wen-Tse Chen, Ruslan Salakhutdinov, Jeff Schneider

    Abstract: Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-sh…

    Submitted 22 January, 2025; originally announced January 2025.

  11. arXiv:2412.05467  [pdf, other]

    cs.LG cs.AI cs.SE

    The BrowserGym Ecosystem for Web Agent Research

    Authors: Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste

    Abstract: The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) i…

    Submitted 28 February, 2025; v1 submitted 6 December, 2024; originally announced December 2024.

  12. arXiv:2412.00557  [pdf, other]

    cs.CV cs.AI cs.LG

    Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

    Authors: Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov

    Abstract: Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-i…

    Submitted 30 November, 2024; originally announced December 2024.

  13. arXiv:2410.22332  [pdf, other]

    cs.RO cs.CV cs.LG

    Local Policies Enable Zero-shot Long-horizon Manipulation

    Authors: Murtaza Dalal, Min Liu, Walter Talbott, Chen Chen, Deepak Pathak, Jian Zhang, Ruslan Salakhutdinov

    Abstract: Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering,…

    Submitted 9 March, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

    Comments: ICRA 2025 accepted paper. Main Paper 7 pages, 3 tables, 3 figures. Appendix 6 pages, 2 figures, 6 tables

  14. arXiv:2410.15153  [pdf, other]

    cs.CL

    Evaluating Deep Unlearning in Large Language Models

    Authors: Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, Kamalika Chaudhuri

    Abstract: Machine unlearning is a key requirement of many data protection regulations such as GDPR. Prior work on unlearning has mostly considered superficial unlearning tasks where a single or a few related pieces of information are required to be removed. However, the task of unlearning a fact is much more challenging in recent large language models (LLMs), because the facts in LLMs can be deduced from ea…

    Submitted 9 November, 2024; v1 submitted 19 October, 2024; originally announced October 2024.

  15. arXiv:2409.18313  [pdf, other]

    cs.RO cs.AI cs.LG

    Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

    Authors: Quanting Xie, So Yeon Min, Pengliang Ji, Yue Yang, Tianyi Zhang, Kedi Xu, Aarav Bajaj, Ruslan Salakhutdinov, Matthew Johnson-Roberson, Yonatan Bisk

    Abstract: There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval augmented generation (RAG) has become the workhorse of large-scale non-parametric knowledge; however, existing techniques do not directly transfer to the embodied domain, which is multimodal, where data is highly correlated, and percept…

    Submitted 20 January, 2025; v1 submitted 26 September, 2024; originally announced September 2024.

    Comments: Web: https://quanting-xie.github.io/Embodied-RAG-web/

  16. arXiv:2409.05864  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Neural MP: A Generalist Neural Motion Planner

    Authors: Murtaza Dalal, Jiahui Yang, Russell Mendonca, Youssef Khaky, Ruslan Salakhutdinov, Deepak Pathak

    Abstract: The current paradigm for motion planning generates solutions from scratch for every new problem, which consumes significant amounts of time and computational resources. For complex, cluttered scenes, motion planning approaches can often take minutes to produce a solution, while humans are able to accurately and safely reach any goal in seconds by leveraging their prior experience. We seek to do th…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: Website at mihdalal.github.io/neuralmotionplanner. Main paper: 7 pages, 4 figures, 2 tables. Appendix: 9 pages, 5 figures, 6 tables

  17. arXiv:2407.12061  [pdf, other]

    cs.HC cs.AI cs.RO

    Situated Instruction Following

    Authors: So Yeon Min, Xavi Puig, Devendra Singh Chaplot, Tsung-Yen Yang, Akshara Rai, Priyam Parashar, Ruslan Salakhutdinov, Yonatan Bisk, Roozbeh Mottaghi

    Abstract: Language is never spoken in a vacuum. It is expressed, comprehended, and contextualized within the holistic backdrop of the speaker's history, actions, and environment. Since humans are used to communicating efficiently with situated language, the practicality of robotic assistants hinges on their ability to understand and act upon implicit and situated instructions. In traditional instruction foll…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: European Conference on Computer Vision 2024 (ECCV 2024)

  18. arXiv:2407.09801  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.MM

    IoT-LM: Large Multisensory Language Models for the Internet of Things

    Authors: Shentong Mo, Russ Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang

    Abstract: The Internet of Things (IoT) network integrating billions of smart physical devices embedded with sensors, software, and communication technologies is a critical and rapidly expanding component of our modern world. The IoT ecosystem provides a rich source of real-world modalities such as motion, thermal, geolocation, imaging, depth, sensors, and audio to recognize the states of humans and physical…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.06217

  19. arXiv:2407.03418  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    HEMM: Holistic Evaluation of Multimodal Foundation Models

    Authors: Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, Louis-Philippe Morency

    Abstract: Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation o…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Code available at https://github.com/pliang279/HEMM

  20. arXiv:2407.01476  [pdf, other]

    cs.AI cs.CL cs.LG

    Tree Search for Language Model Agents

    Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov

    Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards…

    Submitted 12 October, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: 12 pages. Models and code available at https://jykoh.com/search-agents

  21. arXiv:2406.12814  [pdf, other]

    cs.LG cs.CL cs.CR cs.CV

    Dissecting Adversarial Robustness of Multimodal LM Agents

    Authors: Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

    Abstract: As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components taking actions, which existing LM safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation scripts in a…

    Submitted 4 February, 2025; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: ICLR 2025. Also oral at NeurIPS 2024 Open-World Agents Workshop

  22. arXiv:2406.07506  [pdf, other]

    cs.CV cs.AI cs.LG

    Understanding Visual Concepts Across Models

    Authors: Brandon Trabucco, Max Gurinas, Kyle Doherty, Ruslan Salakhutdinov

    Abstract: Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (i.e. <orange-cat> = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and fin…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Official code at: https://github.com/visual-words/visual-words

  23. arXiv:2405.03702  [pdf, other]

    cs.CV cs.LG

    Leafy Spurge Dataset: Real-world Weed Classification Within Aerial Drone Imagery

    Authors: Kyle Doherty, Max Gurinas, Erik Samsoe, Charles Casper, Beau Larkin, Philip Ramsey, Brandon Trabucco, Ruslan Salakhutdinov

    Abstract: Invasive plant species are detrimental to the ecology of both agricultural and wildland areas. Euphorbia esula, or leafy spurge, is one such plant that has spread through much of North America from Eastern Europe. When paired with contemporary computer vision systems, unmanned aerial vehicles, or drones, offer the means to track expansion of problem plants, such as leafy spurge, and improve chance…

    Submitted 8 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: Official Dataset Technical Report. Used in DA-Fusion (arXiv:2302.07944)

  24. arXiv:2405.01534  [pdf, other]

    cs.LG cs.AI cs.CV cs.RO

    Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

    Authors: Murtaza Dalal, Tarun Chiruvolu, Devendra Chaplot, Ruslan Salakhutdinov

    Abstract: Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furtherm…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Published at ICLR 2024. Website at https://mihdalal.github.io/planseqlearn/ 9 pages, 3 figures, 3 tables; 14 pages appendix (7 additional figures)

  25. arXiv:2404.18928  [pdf, other]

    cs.CV cs.AI cs.CL cs.GR cs.LG

    Stylus: Automatic Adapter Selection for Diffusion Models

    Authors: Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica

    Abstract: Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prom…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Project Website: https://stylus-diffusion.github.io

  26. arXiv:2404.11483  [pdf, other]

    cs.AI cs.LG

    AgentKit: Structured LLM Reasoning with Dynamic Graphs

    Authors: Yue Wu, Yewen Fan, So Yeon Min, Shrimai Prabhumoye, Stephen McAleer, Yonatan Bisk, Ruslan Salakhutdinov, Yuanzhi Li, Tom Mitchell

    Abstract: We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts. The basic building block in AgentKit is a node, containing a natural language prompt for a specific subtask. The user then puts together chains of nodes, like stacking LEGO pieces. Th…

    Submitted 24 July, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

  27. arXiv:2403.19103  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

    Authors: Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter

    Abstract: Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work…

    Submitted 27 April, 2025; v1 submitted 27 March, 2024; originally announced March 2024.

  28. arXiv:2403.04082  [pdf, other]

    cs.LG stat.ML

    Inference via Interpolation: Contrastive Representations Provably Enable Planning and Inference

    Authors: Benjamin Eysenbach, Vivek Myers, Ruslan Salakhutdinov, Sergey Levine

    Abstract: Given time series data, how can we answer questions like "what will happen in the future?" and "how did we get here?" These sorts of probabilistic inference questions are challenging when observations are high-dimensional. In this paper, we show how these questions can have compact, closed form solutions in terms of learned representations. The key idea is to apply a variant of contrastive learnin…

    Submitted 21 May, 2025; v1 submitted 6 March, 2024; originally announced March 2024.

    Comments: Code: https://github.com/vivekmyers/contrastive_planning

    Journal ref: Neural information processing systems (2024)

  29. arXiv:2403.01382  [pdf, other]

    cs.CL

    Automatic Question-Answer Generation for Long-Tail Knowledge

    Authors: Rohan Kumar, Youngmin Kim, Sunitha Ravi, Haitian Sun, Christos Faloutsos, Ruslan Salakhutdinov, Minji Yoon

    Abstract: Pretrained Large Language Models (LLMs) have gained significant attention for addressing open-domain Question Answering (QA). While they exhibit high accuracy in answering questions related to common knowledge, LLMs encounter difficulties in learning about uncommon long-tail knowledge (tail entities). Since manually constructing QA datasets demands substantial human resources, the types of existin…

    Submitted 2 March, 2024; originally announced March 2024.

    Comments: Accepted at KDD 2023 KnowledgeNLP

  30. arXiv:2402.17553  [pdf, other]

    cs.AI cs.CL cs.CV cs.HC

    OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

    Authors: Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov

    Abstract: For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They coul…

    Submitted 21 July, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

  31. arXiv:2401.13649  [pdf, other]

    cs.LG cs.CL cs.CV

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Authors: Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

    Abstract: Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augmen…

    Submitted 5 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted to ACL 2024. 24 pages. Project page: https://jykoh.com/vwa

  32. arXiv:2311.16424  [pdf, other]

    cs.LG cs.AI cs.CV

    Manifold Preserving Guided Diffusion

    Authors: Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon

    Abstract: Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad…

    Submitted 27 November, 2023; originally announced November 2023.

  33. arXiv:2311.09580  [pdf, other]

    cs.CL

    MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

    Authors: Haofei Yu, Zhengyang Qi, Lawrence Jang, Ruslan Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang

    Abstract: Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humo…

    Submitted 25 September, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

  34. arXiv:2311.06217  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.MM

    MultiIoT: Benchmarking Machine Learning for the Internet of Things

    Authors: Shentong Mo, Louis-Philippe Morency, Russ Salakhutdinov, Paul Pu Liang

    Abstract: The next generation of machine learning systems must be adept at perceiving and interacting with the physical world through a diverse array of sensory channels. Commonly referred to as the 'Internet of Things (IoT)' ecosystem, sensory data from motion, thermal, geolocation, depth, wireless signals, video, and audio are increasingly used to model the states of physical environments and the humans i…

    Submitted 4 July, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

  35. arXiv:2310.20141  [pdf, other]

    cs.LG cs.AI

    Contrastive Difference Predictive Coding

    Authors: Chongyi Zheng, Ruslan Salakhutdinov, Benjamin Eysenbach

    Abstract: Predicting and reasoning about the future lie at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time series data, learning representations that encode long-term dependencies usua…

    Submitted 25 February, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: ICLR 2024. Website (https://chongyi-zheng.github.io/td_infonce) and code (https://github.com/chongyi-zheng/td_infonce)

  36. arXiv:2310.07478  [pdf, other]

    cs.AI

    Multimodal Graph Learning for Generative Tasks

    Authors: Minji Yoon, Jing Yu Koh, Bryan Hooi, Ruslan Salakhutdinov

    Abstract: Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize: for example, from plain text to image-caption pairs. Most multimodal learning algorithms focus on modeling simple one-to-one pairs of data from two modalities, such as image-caption pairs, or audio-text pairs. However, in most real-world settings, entities of different modalit…

    Submitted 12 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

  37. arXiv:2310.04373  [pdf, other]

    cs.LG cs.AI

    Confronting Reward Model Overoptimization with Constrained RLHF

    Authors: Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen McAleer

    Abstract: Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriat…

    Submitted 10 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

  38. arXiv:2308.08661  [pdf, other]

    cs.CL cs.AI

    Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions

    Authors: Haitian Sun, William W. Cohen, Ruslan Salakhutdinov

    Abstract: Many open-domain questions are under-specified and thus have multiple possible answers, each of which is correct under a different interpretation of the question. Answering such ambiguous questions is challenging, as it requires retrieving and then reasoning about diverse information from multiple passages. We present a new state-of-the-art for answering ambiguous questions that exploits a databas…

    Submitted 16 August, 2023; originally announced August 2023.

  39. arXiv:2307.13101  [pdf, other]

    cs.LG cs.AI cs.RO

    Contrastive Example-Based Control

    Authors: Kyle Hatch, Benjamin Eysenbach, Rafael Rafailov, Tianhe Yu, Ruslan Salakhutdinov, Sergey Levine, Chelsea Finn

    Abstract: While many real-world problems might benefit from reinforcement learning, these problems rarely fit into the MDP mold: interacting with the environment is often expensive and specifying reward functions is challenging. Motivated by these challenges, prior work has developed data-driven approaches that learn entirely from samples from the transition dynamics and examples of high-return states.…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: This is an updated version of a manuscript that originally appeared at L4DC 2023. The project website is here https://sites.google.com/view/laeo-rl

    Journal ref: Proceedings of The 5th Annual Learning for Dynamics and Control Conference, PMLR 211:155-169, 2023

  40. arXiv:2307.12968  [pdf, other]

    cs.LG cs.AI

    A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning

    Authors: Benjamin Eysenbach, Matthieu Geist, Sergey Levine, Ruslan Salakhutdinov

    Abstract: As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: Accepted to ICML 2023. Video (https://www.youtube.com/watch?v=1xlixIHZ0R4) and code (https://github.com/ben-eysenbach/ac-connection)

  41. arXiv:2306.16413  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.MM

    MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning

    Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, Ruslan Salakhutdinov

    Abstract: Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datase…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: JMLR Open Source Software 2023, Code available at https://github.com/pliang279/MultiBench

  42. arXiv:2306.14636  [pdf, other]

    cs.CV

    Localized Text-to-Image Generation for Free via Cross Attention Control

    Authors: Yutong He, Ruslan Salakhutdinov, J. Zico Kolter

    Abstract: Despite the tremendous success in text-to-image generative models, localized text-to-image generation (that is, generating objects or features at specific locations in an image while maintaining a consistent overall generation) still requires either explicit training or substantial additional inference time. In this work, we show that localized generation can be achieved by simply controlling cros…

    Submitted 26 June, 2023; originally announced June 2023.

  43. arXiv:2306.05268  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.MM

    Factorized Contrastive Learning: Going Beyond Multi-view Redundancy

    Authors: Paul Pu Liang, Zihao Deng, Martin Ma, James Zou, Louis-Philippe Morency, Ruslan Salakhutdinov

    Abstract: In a wide range of multimodal tasks, contrastive learning has become a particularly appealing approach since it can successfully learn representations from abundant unlabeled data with only pairing information (e.g., image-caption or video-audio pairs). Underpinning these approaches is the assumption of multi-view redundancy - that shared information between modalities is necessary and sufficient…

    Submitted 30 October, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023. Code available at: https://github.com/pliang279/FactorCL

  44. arXiv:2306.04539  [pdf, other]

    cs.LG cs.CL cs.CV cs.IT stat.ML

    Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications

    Authors: Paul Pu Liang, Chun Kai Ling, Yun Cheng, Alex Obolenskiy, Yudong Liu, Rohan Pandey, Alex Wilf, Louis-Philippe Morency, Ruslan Salakhutdinov

    Abstract: In many machine learning systems that jointly learn from multiple modalities, a core research question is to understand the nature of multimodal interactions: how modalities combine to provide new task-relevant information that was not present in either alone. We study this challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data and naturally co-occurri…

    Submitted 13 June, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: ICLR 2024, Code available at: https://github.com/pliang279/PID

  45. arXiv:2306.04125  [pdf, other]

    cs.LG cs.CL cs.HC

    Multimodal Fusion Interactions: A Study of Human and Automatic Quantification

    Authors: Paul Pu Liang, Yun Cheng, Ruslan Salakhutdinov, Louis-Philippe Morency

    Abstract: In order to perform multimodal fusion of heterogeneous signals, we need to understand their interactions: how each modality individually provides information useful for a task and how this information changes in the presence of other modalities. In this paper, we perform a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different a…

    Submitted 30 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: International Conference on Multimodal Interaction (ICMI '23), Code available at: https://github.com/pliang279/PID. arXiv admin note: text overlap with arXiv:2302.12247

  46. arXiv:2306.03346  [pdf, ps, other]

    cs.LG cs.AI

    Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data

    Authors: Chongyi Zheng, Benjamin Eysenbach, Homer Walke, Patrick Yin, Kuan Fang, Ruslan Salakhutdinov, Sergey Levine

    Abstract: Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that the reinforcement le…

    Submitted 10 June, 2025; v1 submitted 5 June, 2023; originally announced June 2023.

    Comments: ICLR 2024 Spotlight (< 5%). Website (https://chongyi-zheng.github.io/stable_contrastive_rl) and code (https://github.com/chongyi-zheng/stable_contrastive_rl)

  47. arXiv:2305.17216  [pdf, other]

    cs.CL cs.CV cs.LG

    Generating Images with Multimodal Language Models

    Authors: Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

    Abstract: We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to…

    Submitted 13 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023. Project page: http://jykoh.com/gill

  48. arXiv:2305.16309  [pdf, other]

    cs.RO cs.CV cs.LG

    Imitating Task and Motion Planning with Visuomotor Transformers

    Authors: Murtaza Dalal, Ajay Mandlekar, Caelan Garrett, Ankur Handa, Ruslan Salakhutdinov, Dieter Fox

    Abstract: Imitation learning is a powerful tool for training robot manipulation policies, allowing them to learn from expert demonstrations without manual programming or trial-and-error. However, common methods of data collection, such as human supervision, scale poorly, as they are time-consuming and labor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously generate large-scale dataset…

    Submitted 17 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Conference on Robot Learning (CoRL) 2023. 8 pages, 5 figures, 2 tables; 11 pages appendix (10 additional figures)

  49. arXiv:2305.15486  [pdf, other]

    cs.AI cs.LG

    SPRING: Studying the Paper and Reasoning to Play Games

    Authors: Yue Wu, Shrimai Prabhumoye, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Tom Mitchell, Yuanzhi Li

    Abstract: Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's original aca…

    Submitted 11 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  50. arXiv:2305.02412  [pdf, other]

    cs.CL cs.AI cs.LG

    Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents

    Authors: Yue Wu, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Yuanzhi Li, Tom Mitchell, Shrimai Prabhumoye

    Abstract: Pre-trained large language models (LLMs) capture procedural knowledge about the world. Recent work has leveraged LLMs' ability to generate abstract plans to simplify challenging control tasks, either by action scoring or by action modeling (fine-tuning). However, the transformer architecture inherits several constraints that make it difficult for the LLM to directly serve as the agent: e.g. limited…

    Submitted 7 May, 2023; v1 submitted 3 May, 2023; originally announced May 2023.