
Showing 1–50 of 90 results for author: Bisk, Y

  1. arXiv:2505.19662  [pdf, ps, other]

    cs.AI cs.CV

    FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

    Authors: Atsunori Moteki, Shoichi Masui, Fan Yang, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Jun Takahashi, Shan Jiang

    Abstract: This paper proposes FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, they are required to monitor and report safety and health incidents, as well as manufacturing-related incidents, that may occur in real-world work environments. Existing agentic AI benchmarks have been limited to evaluating web tasks and are insufficien…

    Submitted 30 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 6 pages, 2 figures, 4 tables

  2. arXiv:2504.17674  [pdf, other]

    cs.CL cs.LG

    Energy Considerations of Large Language Model Inference and Efficiency Optimizations

    Authors: Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luccioni, Emma Strubell

    Abstract: As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optim…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 16 pages

  3. arXiv:2504.11336  [pdf, other]

    cs.LG cs.AI cs.CL

    Looking beyond the next token

    Authors: Abitha Thankaraj, Yiding Jiang, J. Zico Kolter, Yonatan Bisk

    Abstract: The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans' natural writing and reasoning process, where goals are typically known before the exact argument or phrasings. While this mismatch has been well studied in the literature, the working assumption has been that architectural changes are needed to…

    Submitted 23 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

  4. arXiv:2504.02259  [pdf, other]

    cs.CV

    Re-thinking Temporal Search for Long-Form Video Understanding

    Authors: Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, Manling Li

    Abstract: Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and address a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are twofold: First, we frame temporal search as a Long Video Haystack problem: findi…

    Submitted 6 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025; A real-world long video needle-in-haystack benchmark; long-video QA with human ref frames

  5. arXiv:2502.04576  [pdf, other]

    cs.LG cs.CL

    Self-Regulation and Requesting Interventions

    Authors: So Yeon Min, Yue Wu, Jimin Sun, Max Kaufmann, Fahim Tajwar, Yonatan Bisk, Ruslan Salakhutdinov

    Abstract: Human intelligence involves metacognitive abilities like self-regulation, recognizing limitations, and seeking assistance only when needed. While LLM Agents excel in many domains, they often lack this awareness. Overconfident agents risk catastrophic failures, while those that seek help excessively hinder efficiency. A key challenge is enabling agents with a limited intervention budget $C$ is to d…

    Submitted 6 February, 2025; originally announced February 2025.

  6. arXiv:2502.00197  [pdf, other]

    cs.LG stat.ML

    Model Successor Functions

    Authors: Yingshan Chang, Yonatan Bisk

    Abstract: The notion of generalization has moved away from the classical one defined in statistical learning theory towards an emphasis on out-of-domain generalization (OODG). Recently, there is a growing focus on inductive generalization, where a progression of difficulty implicitly governs the direction of domain shifts. In inductive generalization, it is often assumed that the training data lie in the ea…

    Submitted 31 January, 2025; originally announced February 2025.

  7. arXiv:2412.12175  [pdf, other]

    cs.LG cs.AI cs.CL

    Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning

    Authors: Melanie Sclar, Jane Yu, Maryam Fazel-Zarandi, Yulia Tsvetkov, Yonatan Bisk, Yejin Choi, Asli Celikyilmaz

    Abstract: Do large language models (LLMs) have theory of mind? A plethora of papers and benchmarks have been introduced to evaluate if current models have been able to develop this key ability of social intelligence. However, all rely on limited datasets with simple patterns that can potentially lead to problematic blind spots in evaluation and an overestimation of model capabilities. We introduce ExploreTo…

    Submitted 12 December, 2024; originally announced December 2024.

  8. arXiv:2411.13055  [pdf, other]

    cs.LG cs.DC

    Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

    Authors: Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

    Abstract: Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestratio…

    Submitted 12 April, 2025; v1 submitted 20 November, 2024; originally announced November 2024.

  9. arXiv:2411.04448  [pdf, other]

    cs.CL

    Gradient Localization Improves Lifelong Pretraining of Language Models

    Authors: Jared Fernandez, Yonatan Bisk, Emma Strubell

    Abstract: Large Language Models (LLMs) trained on web-scale text corpora have been shown to capture world knowledge in their parameters. However, the mechanism by which language models store different types of knowledge is poorly understood. In this work, we examine two types of knowledge relating to temporally sensitive entities and demonstrate that each type is localized to different sets of parameters wi…

    Submitted 7 November, 2024; originally announced November 2024.

    Comments: EMNLP Findings 2024

  10. arXiv:2410.18932  [pdf, other]

    cs.RO cs.AI cs.CV

    ANAVI: Audio Noise Awareness using Visuals of Indoor environments for NAVIgation

    Authors: Vidhi Jain, Rishi Veerapaneni, Yonatan Bisk

    Abstract: We propose Audio Noise Awareness using Visuals of Indoors for NAVIgation for quieter robot path planning. While humans are naturally aware of the noise they make and its impact on those around them, robots currently lack this awareness. A key challenge in achieving audio awareness for robots is estimating how loud will the robot's actions be at a listener's location? Since sound depends upon the g…

    Submitted 24 October, 2024; originally announced October 2024.

    Comments: 8th Conference on Robot Learning (CoRL) 2024

  11. arXiv:2409.18313  [pdf, other]

    cs.RO cs.AI cs.LG

    Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

    Authors: Quanting Xie, So Yeon Min, Pengliang Ji, Yue Yang, Tianyi Zhang, Kedi Xu, Aarav Bajaj, Ruslan Salakhutdinov, Matthew Johnson-Roberson, Yonatan Bisk

    Abstract: There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval augmented generation (RAG) has become the workhorse of large-scale non-parametric knowledge; however, existing techniques do not directly transfer to the embodied domain, which is multimodal, where data is highly correlated, and percept…

    Submitted 20 January, 2025; v1 submitted 26 September, 2024; originally announced September 2024.

    Comments: Web: https://quanting-xie.github.io/Embodied-RAG-web/

  12. arXiv:2409.10683  [pdf, other]

    cs.RO cs.AI cs.CV

    MotIF: Motion Instruction Fine-tuning

    Authors: Minyoung Hwang, Joey Hejna, Dorsa Sadigh, Yonatan Bisk

    Abstract: While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf visio…

    Submitted 16 September, 2024; originally announced September 2024.

  13. arXiv:2407.12061  [pdf, other]

    cs.HC cs.AI cs.RO

    Situated Instruction Following

    Authors: So Yeon Min, Xavi Puig, Devendra Singh Chaplot, Tsung-Yen Yang, Akshara Rai, Priyam Parashar, Ruslan Salakhutdinov, Yonatan Bisk, Roozbeh Mottaghi

    Abstract: Language is never spoken in a vacuum. It is expressed, comprehended, and contextualized within the holistic backdrop of the speaker's history, actions, and environment. Since humans are used to communicating efficiently with situated language, the practicality of robotic assistants hinges on their ability to understand and act upon implicit and situated instructions. In traditional instruction foll…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: European Conference on Computer Vision 2024 (ECCV 2024)

  14. arXiv:2407.08876  [pdf, other]

    cs.CV cs.RO

    DegustaBot: Zero-Shot Visual Preference Estimation for Personalized Multi-Object Rearrangement

    Authors: Benjamin A. Newman, Pranay Gupta, Kris Kitani, Yonatan Bisk, Henny Admoni, Chris Paxton

    Abstract: De gustibus non est disputandum ("there is no accounting for others' tastes") is a common Latin maxim describing how many solutions in life are determined by people's personal preferences. Many household tasks, in particular, can only be considered fully successful when they account for personal preferences such as the visual aesthetic of the scene. For example, setting a table could be optimized…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 19 pages, 10 figures

  15. arXiv:2407.06939  [pdf, other]

    cs.RO cs.CV

    Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

    Authors: Sriram Yenamandra, Arun Ramachandran, Mukul Khanna, Karmesh Yadav, Jay Vakil, Andrew Melnik, Michael Büttner, Leon Harz, Lyon Brown, Gora Chand Nandi, Arjun PS, Gaurav Kumar Yadav, Rahul Kala, Robert Haschke, Yang Luo, Jinxin Zhu, Yansen Han, Bingyi Lu, Xuan Gu, Qinyuan Liu, Yaping Zhao, Qiting Ye, Chenxiao Dou, Yansong Chua, Volodymyr Kuzma, et al. (20 additional authors not shown)

    Abstract: In order to develop robots that can effectively serve as versatile and capable home assistants, it is crucial for them to reliably perceive and interact with a wide variety of objects across diverse environments. To this end, we proposed Open Vocabulary Mobile Manipulation as a key benchmark task for robotics: finding any object in a novel environment and placing it on any receptacle surface withi…

    Submitted 9 July, 2024; originally announced July 2024.

  16. arXiv:2407.00369  [pdf, other]

    cs.CL

    How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models

    Authors: Jaeyoung Lee, Ximing Lu, Jack Hessel, Faeze Brahman, Youngjae Yu, Yonatan Bisk, Yejin Choi, Saadia Gabriel

    Abstract: Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Large language or multimodal model based verification has been proposed to scale up online policing mechanisms for mitigating spread of false and harmful content. While these can potentially reduce burden on human fact-check…

    Submitted 29 June, 2024; originally announced July 2024.

  17. arXiv:2406.19228  [pdf, other]

    cs.CL cs.AI cs.LG

    Tools Fail: Detecting Silent Errors in Faulty Tools

    Authors: Jimin Sun, So Yeon Min, Yingshan Chang, Yonatan Bisk

    Abstract: Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, a…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 18 pages, 12 figures

  18. arXiv:2406.05191  [pdf, other]

    cs.CV

    DiffusionPID: Interpreting Diffusion via Partial Information Decomposition

    Authors: Rushikesh Zawar, Shaurya Dewan, Prakanshul Saxena, Yingshan Chang, Andrew Luo, Yonatan Bisk

    Abstract: Text-to-image diffusion models have made significant progress in generating naturalistic images from textual inputs, and demonstrate the capacity to learn and represent complex visual-semantic relationships. While these diffusion models have achieved remarkable success, the underlying mechanisms driving their performance are not yet fully accounted for, with many unanswered questions surrounding w…

    Submitted 14 November, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Journal ref: Thirty-Eighth Annual Conference on Neural Information Processing Systems (2024)

  19. arXiv:2405.20131  [pdf, other]

    cs.LG cs.CL

    Language Models Need Inductive Biases to Count Inductively

    Authors: Yingshan Chang, Yonatan Bisk

    Abstract: Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of countin…

    Submitted 25 October, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  20. arXiv:2404.11483  [pdf, other]

    cs.AI cs.LG

    AgentKit: Structured LLM Reasoning with Dynamic Graphs

    Authors: Yue Wu, Yewen Fan, So Yeon Min, Shrimai Prabhumoye, Stephen McAleer, Yonatan Bisk, Ruslan Salakhutdinov, Yuanzhi Li, Tom Mitchell

    Abstract: We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts. The basic building block in AgentKit is a node, containing a natural language prompt for a specific subtask. The user then puts together chains of nodes, like stacking LEGO pieces. Th…

    Submitted 24 July, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

  21. arXiv:2404.01258  [pdf, other]

    cs.CV cs.AI

    Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

    Authors: Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang

    Abstract: Preference modeling techniques, such as direct preference optimization (DPO), have been shown to be effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large…

    Submitted 2 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  22. arXiv:2404.01158  [pdf, other]

    cs.CL cs.RO

    Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community

    Authors: Casey Kennington, Malihe Alikhani, Heather Pon-Barry, Katherine Atwell, Yonatan Bisk, Daniel Fried, Felix Gervits, Zhao Han, Mert Inan, Michael Johnston, Raj Korpan, Diane Litman, Matthew Marge, Cynthia Matuszek, Ross Mead, Shiwali Mohan, Raymond Mooney, Natalie Parde, Jivko Sinapov, Angela Stewart, Matthew Stone, Stefanie Tellex, Tom Williams

    Abstract: The ability to interact with machines using natural human language is becoming not just commonplace, but expected. The next step is not just text interfaces, but speech interfaces and not just with computers, but with all machines including robots. In this paper, we chronicle the recent history of this growing field of spoken dialogue with robots and offer the community three proposals, the first…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: NSF Report on the "Dialogue with Robots" Workshop held in Pittsburgh, PA, April 2023

  23. arXiv:2403.16394  [pdf, other]

    cs.LG cs.CV

    Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation

    Authors: Yingshan Chang, Yasi Zhang, Zhiyuan Fang, Yingnian Wu, Yonatan Bisk, Feng Gao

    Abstract: The literature on text-to-image generation is plagued by issues of faithfully composing entities with relations. But there lacks a formal understanding of how entity-relation compositions can be effectively learned. Moreover, the underlying phenomenon space that meaningfully reflects the problem structure is not well-defined, leading to an arms race for larger quantities of data in the hope that g…

    Submitted 25 October, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

  24. arXiv:2403.12943  [pdf, other]

    cs.RO cs.AI

    Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

    Authors: Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

    Abstract: Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person's intent and perform the inferred task despite differences in the embodiments and environments. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstr…

    Submitted 27 August, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Robotics: Science & Systems (RSS) 2024. https://vid2robot.github.io/

  25. arXiv:2403.10534  [pdf, other]

    cs.CV cs.AI

    VISREAS: Complex Visual Reasoning with Unanswerable Questions

    Authors: Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, Eric Nyberg

    Abstract: Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should address the discrepancies in the query and convey them to the users rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, t…

    Submitted 22 February, 2024; originally announced March 2024.

    Comments: 18 pages, 14 figures, 5 tables

  26. arXiv:2403.08715  [pdf, other]

    cs.CL

    SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language Agents

    Authors: Ruiyi Wang, Haofei Yu, Wenxin Zhang, Zhengyang Qi, Maarten Sap, Graham Neubig, Yonatan Bisk, Hao Zhu

    Abstract: Humans learn social skills through both imitation and social interaction. This social learning process is largely understudied by existing research on building language agents. Motivated by this gap, we propose an interactive learning method, SOTOPIA-$π$, improving the social intelligence of language agents. This method leverages behavior cloning and self-reinforcement training on filtered social…

    Submitted 25 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  27. arXiv:2312.10807  [pdf, other]

    cs.RO

    Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

    Authors: Hongkuan Zhou, Xiangtong Yao, Oier Mees, Yuan Meng, Ted Xiao, Yonatan Bisk, Jean Oh, Edward Johns, Mohit Shridhar, Dhruv Shah, Jesse Thomason, Kai Huang, Joyce Chai, Zhenshan Bing, Alois Knoll

    Abstract: Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robotic ac…

    Submitted 17 February, 2025; v1 submitted 17 December, 2023; originally announced December 2023.

    Comments: 37 pages, 15 figures, 4 tables, 354 citations

  28. arXiv:2312.08782  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis

    Authors: Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, Shibo Zhao, Shayegan Omidshafiei, Dong-Ki Kim, Ali-akbar Agha-mohammadi, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Chen Wang, Zsolt Kira, Fei Xia, Yonatan Bisk

    Abstract: Building general-purpose robots that operate seamlessly in any environment, with any object, and utilizing various skills to complete diverse tasks has been a long-standing goal in Artificial Intelligence. However, as a community, we have been constraining most robotic systems by designing them for specific tasks, training them on specific datasets, and deploying them within specific environments.…

    Submitted 1 October, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

  29. arXiv:2310.11667  [pdf, other]

    cs.AI cs.CL cs.LG

    SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

    Authors: Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap

    Abstract: Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide va…

    Submitted 22 March, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Preprint, 43 pages. The first two authors contribute equally

  30. arXiv:2310.08864  [pdf, other]

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (269 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…

    Submitted 14 May, 2025; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  31. arXiv:2309.10103  [pdf, other]

    cs.RO cs.AI

    Reasoning about the Unseen for Efficient Outdoor Object Navigation

    Authors: Quanting Xie, Tianyi Zhang, Kedi Xu, Matthew Johnson-Roberson, Yonatan Bisk

    Abstract: Robots should exist anywhere humans do: indoors, outdoors, and even unmapped environments. In contrast, the focus of recent advancements in Object Goal Navigation (OGN) has targeted navigating in indoor environments by leveraging spatial and semantic cues that do not generalize outdoors. While these contributions provide valuable insights into indoor scenarios, the broader spectrum of real-world ro…

    Submitted 1 October, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: 6 pages, 7 figures

  32. arXiv:2309.08508  [pdf, other]

    cs.RO

    MOSAIC: Learning Unified Multi-Sensory Object Property Representations for Robot Learning via Interactive Perception

    Authors: Gyan Tatiya, Jonathan Francis, Ho-Hsiang Wu, Yonatan Bisk, Jivko Sinapov

    Abstract: A holistic understanding of object properties across diverse sensory modalities (e.g., visual, audio, and haptic) is essential for tasks ranging from object categorization to complex manipulation. Drawing inspiration from cognitive science studies that emphasize the significance of multi-sensory integration in human perception, we introduce MOSAIC (Multimodal Object property learning with Self-Att…

    Submitted 22 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted to the 2024 IEEE International Conference on Robotics and Automation (ICRA), May 13 to 17, 2024; Yokohama, Japan

  33. arXiv:2307.13854  [pdf, other]

    cs.AI cs.CL cs.LG

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

    Abstract: With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, w…

    Submitted 16 April, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/

  34. arXiv:2307.13850  [pdf, other]

    cs.LG cs.AI cs.CV cs.RO

    MAEA: Multimodal Attribution for Embodied AI

    Authors: Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Yonatan Bisk

    Abstract: Understanding multimodal perception for embodied AI is an open question because such inputs may contain highly complementary as well as redundant information for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous action inputs across different…

    Submitted 25 July, 2023; originally announced July 2023.

  35. arXiv:2306.17842  [pdf, other]

    cs.CV cs.CL cs.MM

    SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

    Authors: Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang

    Abstract: In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details n…

    Submitted 28 October, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 spotlight

  36. arXiv:2306.11565  [pdf, other]

    cs.RO cs.AI cs.CV

    HomeRobot: Open-Vocabulary Mobile Manipulation

    Authors: Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav, Austin Wang, Mukul Khanna, Theophile Gervet, Tsung-Yen Yang, Vidhi Jain, Alexander William Clegg, John Turner, Zsolt Kira, Manolis Savva, Angel Chang, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi, Yonatan Bisk, Chris Paxton

    Abstract: HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it invol…

    Submitted 10 January, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: 37 pages, 22 figures, 8 tables

  37. arXiv:2305.15486  [pdf, other]

    cs.AI cs.LG

    SPRING: Studying the Paper and Reasoning to Play Games

    Authors: Yue Wu, Shrimai Prabhumoye, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Tom Mitchell, Yuanzhi Li

    Abstract: Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's original aca…

    Submitted 11 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  38. arXiv:2305.02412  [pdf, other]

    cs.CL cs.AI cs.LG

    Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents

    Authors: Yue Wu, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Yuanzhi Li, Tom Mitchell, Shrimai Prabhumoye

    Abstract: Pre-trained large language models (LLMs) capture procedural knowledge about the world. Recent work has leveraged LLM's ability to generate abstract plans to simplify challenging control tasks, either by action scoring, or action modeling (fine-tuning). However, the transformer architecture inherits several constraints that make it difficult for the LLM to directly serve as the agent: e.g. limited…

    Submitted 7 May, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

  39. arXiv:2304.11235  [pdf, other]

    cs.RO cs.AI

    Spatial-Language Attention Policies for Efficient Robot Learning

    Authors: Priyam Parashar, Vidhi Jain, Xiaohan Zhang, Jay Vakil, Sam Powers, Yonatan Bisk, Chris Paxton

    Abstract: Despite great strides in language-guided manipulation, existing work has been constrained to table-top settings. Table-tops allow for perfect and consistent camera angles, properties that do not hold in mobile manipulation. Task plans that involve moving around the environment must be robust to egocentric views and changes in the plane and angle of grasp. A further challenge is ensuring this i…

    Submitted 7 November, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

  40. arXiv:2303.01502  [pdf, other

    cs.CL cs.AI

    Computational Language Acquisition with Theory of Mind

    Authors: Andy Liu, Hao Zhu, Emmy Liu, Yonatan Bisk, Graham Neubig

    Abstract: Unlike current state-of-the-art language models, young children actively acquire language through interactions with their surrounding environment and caretakers. One mechanism that has been argued to be critical to language learning is the ability to infer the mental states of other agents in social environments, coined Theory of Mind (ToM) by Premack & Woodruff (1978). Drawing inspiration from th…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: 9 pages, 3 figures. To be published in the 11th International Conference on Learning Representations, ICLR 2023, Conference Track Proceedings

  41. arXiv:2302.06117  [pdf, other

    cs.LG

    The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment

    Authors: Jared Fernandez, Jacob Kahn, Clara Na, Yonatan Bisk, Emma Strubell

    Abstract: Increased focus on the computational efficiency of NLP systems has motivated the design of efficient model architectures and improvements to underlying hardware accelerators. However, the resulting increases in computational throughput and reductions in floating point operations have not directly translated to improvements in wall-clock inference latency. We demonstrate that these discrepancies ca…

    Submitted 22 December, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

    Comments: EMNLP 2023

  42. arXiv:2212.05923  [pdf, other

    cs.RO cs.LG

    Self-Supervised Object Goal Navigation with In-Situ Finetuning

    Authors: So Yeon Min, Yao-Hung Hubert Tsai, Wei Ding, Ali Farhadi, Ruslan Salakhutdinov, Yonatan Bisk, Jian Zhang

    Abstract: A household robot should be able to navigate to target objects without requiring users to first annotate everything in their home. Most current approaches to object navigation do not test on real robots and rely solely on reconstructed scans of houses and their expensively labeled semantic 3D meshes. In this work, our goal is to build an agent that builds self-supervised models of the world via ex…

    Submitted 1 April, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

  43. arXiv:2211.05392  [pdf, other

    cs.CL

    EvEntS ReaLM: Event Reasoning of Entity States via Language Models

    Authors: Evangelia Spiliopoulou, Artidoro Pagnoni, Yonatan Bisk, Eduard Hovy

    Abstract: This paper investigates models of event implications, specifically how well models predict entity state changes, by targeting their understanding of physical attributes. Nominally, large language models (LLMs) have been exposed to procedural knowledge about how objects interact, yet our benchmarking shows they fail to reason about the world. Conversely, we also demonstrate that existing approaches…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: EMNLP 2022

  44. arXiv:2210.06849  [pdf, other

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  45. arXiv:2210.04443  [pdf, other

    cs.LG cs.AI cs.CL

    Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue

    Authors: So Yeon Min, Hao Zhu, Ruslan Salakhutdinov, Yonatan Bisk

    Abstract: Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange. The recent introduction of benchmarks (Padmakumar et al., 2022) raises the question of how best to train and evaluate models for this multi-turn, multi-agent, long-horizon task. This paper contributes to that conversation by arguing that imitation learning (IL) and r…

    Submitted 11 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: To Appear in the Proceedings of EMNLP 2022

  46. arXiv:2207.02442  [pdf, other

    cs.RO cs.AI cs.LG

    Transformers are Adaptable Task Planners

    Authors: Vidhi Jain, Yixin Lin, Eric Undersander, Yonatan Bisk, Akshara Rai

    Abstract: Every home is different, and every person likes things done in their particular way. Therefore, home robots of the future need to both reason about the sequential nature of day-to-day tasks and generalize to users' preferences. To this end, we propose a Transformer Task Planner (TTP) that learns high-level actions from demonstrations by leveraging object attribute-based representations. TTP can be…

    Submitted 6 July, 2022; originally announced July 2022.

    Comments: https://anonymous.4open.science/r/temporal_task_planner-Paper148/

  47. arXiv:2205.11686  [pdf, other

    cs.CL cs.CV

    On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

    Authors: Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasović

    Abstract: Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e. conditioning on both text and images? Are multimodal models simply visually adapted language models, or…

    Submitted 22 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: v2: EMNLP Findings 2022 accepted paper camera-ready version. 9 pages main, 2 pages appendix

  48. arXiv:2205.09256  [pdf, other

    cs.CV cs.MM

    Training Vision-Language Transformers from Captions

    Authors: Liangke Gui, Yingshan Chang, Qiuyuan Huang, Subhojit Som, Alex Hauptmann, Jianfeng Gao, Yonatan Bisk

    Abstract: Vision-Language Transformers can be learned without low-level human labels (e.g. class labels, bounding boxes, etc.). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Visi…

    Submitted 14 June, 2023; v1 submitted 18 May, 2022; originally announced May 2022.

  49. arXiv:2203.03022  [pdf, ps, other

    cs.SD cs.AI cs.LG eess.AS stat.ML

    HEAR: Holistic Evaluation of Audio Representations

    Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

    Abstract: What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, in…

    Submitted 29 May, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

    Comments: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

  50. arXiv:2112.09544  [pdf

    cs.CY

    It's Time to Do Something: Mitigating the Negative Impacts of Computing Through a Change to the Peer Review Process

    Authors: Brent Hecht, Lauren Wilcox, Jeffrey P. Bigham, Johannes Schöning, Ehsan Hoque, Jason Ernst, Yonatan Bisk, Luigi De Russis, Lana Yarosh, Bushra Anjum, Danish Contractor, Cathy Wu

    Abstract: The computing research community needs to work much harder to address the downsides of our innovations. Between the erosion of privacy, threats to democracy, and automation's effect on employment (among many other issues), we can no longer simply assume that our research will have a net positive impact on the world. While bending the arc of computing innovation towards societal benefit may at firs…

    Submitted 17 December, 2021; originally announced December 2021.

    Comments: First published on the ACM Future of Computing Academy blog on March 29, 2018. This is the archival version