
Showing 1–50 of 95 results for author: Ryoo, S

  1. arXiv:2506.02298  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

    Authors: Thai Hoang, Kung-Hsiang Huang, Shirley Kokane, Jianguo Zhang, Zuxin Liu, Ming Zhu, Jake Grigsby, Tian Lan, Michael S Ryoo, Chien-Sheng Wu, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles

    Abstract: Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: LAM Simulator framework for agentic data generation

  2. arXiv:2505.07817  [pdf, ps, other]

    cs.RO cs.CV

    Pixel Motion as Universal Representation for Robot Control

    Authors: Kanchana Ranasinghe, Xiang Li, Cristina Mata, Jongwoo Park, Michael S Ryoo

    Abstract: We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion -- a universal, interpretable, and motion-centric representation -- can be extracted…

    Submitted 12 May, 2025; originally announced May 2025.

  3. arXiv:2503.00720  [pdf, other]

    math.DS math-ph math.CA

    Quantitative relaxation dynamics from generic initial configurations in the inertial Kuramoto model

    Authors: Hangjun Cho, Jiu-Gang Dong, Seung-Yeal Ha, Seung-Yeon Ryoo

    Abstract: We study the relaxation dynamics of the inertial Kuramoto model toward a phase-locked state from a generic initial phase configuration. For this, we propose a sufficient framework in terms of initial data and system parameters for asymptotic phase-locking. It can be roughly stated as a set of conditions such as a positive initial order parameter, a coupling strength sufficiently larger than initial…

    Submitted 1 March, 2025; originally announced March 2025.

    Comments: 94 pages

    MSC Class: 34D05; 34D06; 34C15; 82C22

  4. arXiv:2503.00560  [pdf, ps, other]

    math.DG math.GR math.MG

    Asymptotics of Riemannian Lie groups with nilpotency step 2

    Authors: Enrico Le Donne, Luca Nalon, Sebastiano Nicolussi Golo, Seung-Yeon Ryoo

    Abstract: We derive estimates comparing asymptotic Riemannian or sub-Riemannian metrics in step-2 nilpotent Lie groups. Given a sub-Riemannian metric, we construct a Carnot metric whose square remains at a bounded distance from the square of the original metric. As a consequence, we obtain a refined estimate of the error term in the asymptotic expansion of the volume of (sub-)Riemannian metric balls. To ach…

    Submitted 1 March, 2025; originally announced March 2025.

    MSC Class: 20F65; 22E25; 53C17; 53C23; 53C60

  5. arXiv:2501.16289  [pdf, other]

    cs.CV

    Multi-view Structural Convolution Network for Domain-Invariant Point Cloud Recognition of Autonomous Vehicles

    Authors: Younggun Kim, Beomsik Cho, Seonghoon Ryoo, Soomok Lee

    Abstract: Point cloud representation has recently become a research hotspot in the field of computer vision and has been utilized for autonomous vehicles. However, adapting deep learning networks for point cloud data recognition is challenging due to the variability in datasets and sensor technologies. This variability underscores the necessity for adaptive techniques to maintain accuracy under different co…

    Submitted 30 April, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

    Comments: 36 pages, 6 figures

  6. arXiv:2412.18596  [pdf, other]

    cs.CV

    LatentCRF: Continuous CRF for Efficient Latent Diffusion

    Authors: Kanchana Ranasinghe, Sadeep Jayasumana, Andreas Veit, Ayan Chakrabarti, Daniel Glasner, Michael S Ryoo, Srikumar Ramalingam, Sanjiv Kumar

    Abstract: Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images; however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some…

    Submitted 24 December, 2024; originally announced December 2024.

  7. arXiv:2411.14688  [pdf, other]

    cs.CV cs.CL cs.LG

    Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

    Authors: AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova

    Abstract: Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames. Our model uses a novel autoregressive factorized decoding a…

    Submitted 21 November, 2024; originally announced November 2024.

  8. arXiv:2411.02397  [pdf, other]

    cs.CV

    Adaptive Caching for Faster Video Generation with Diffusion Transformers

    Authors: Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, Tian Xie

    Abstract: Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a train…

    Submitted 7 November, 2024; v1 submitted 4 November, 2024; originally announced November 2024.

    Comments: Project-page is available at https://adacache-dit.github.io

  9. arXiv:2410.16267  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

    Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Jongwoo Park, Kanchana Ranasinghe, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles

    Abstract: We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use much f…

    Submitted 9 June, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

  10. arXiv:2408.08872  [pdf, other]

    cs.CV cs.AI cs.CL

    xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

    Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, et al. (2 additional authors not shown)

    Abstract: This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tas…

    Submitted 28 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  11. arXiv:2407.04833  [pdf, other]

    cs.CV cs.AI

    3D Adaptive Structural Convolution Network for Domain-Invariant Point Cloud Recognition

    Authors: Younggun Kim, Beomsik Cho, Seonghoon Ryoo, Soomok Lee

    Abstract: Adapting deep learning networks for point cloud data recognition in self-driving vehicles faces challenges due to the variability in datasets and sensor technologies, emphasizing the need for adaptive techniques to maintain accuracy across different conditions. In this paper, we introduce the 3D Adaptive Structural Convolution Network (3D-ASCN), a cutting-edge framework for 3D point cloud recognit…

    Submitted 21 October, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: 11 pages, 3 figures

    ACM Class: I.2.10; I.5.1

    Journal ref: ACCV 2024 (Asian Conference on Computer Vision)

  12. arXiv:2406.20095  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

    Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

    Abstract: Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot…

    Submitted 30 January, 2025; v1 submitted 28 June, 2024; originally announced June 2024.

    Comments: ICLR 2025

  13. arXiv:2406.09396  [pdf, other]

    cs.CV

    Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

    Authors: Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryu, Donghyun Kim, Michael S. Ryoo

    Abstract: Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large langua…

    Submitted 20 March, 2025; v1 submitted 13 June, 2024; originally announced June 2024.

  14. arXiv:2404.07449  [pdf, other]

    cs.CV

    Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

    Authors: Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

    Abstract: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these…

    Submitted 10 April, 2024; originally announced April 2024.

  15. arXiv:2403.16998  [pdf, other]

    cs.CV

    Understanding Long Videos with Multimodal Language Models

    Authors: Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance. We discover that LLM-based approaches can yield surprisingly good accuracy on long-video tasks with limited video in…

    Submitted 23 February, 2025; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: 17 pages (main paper), 7 pages appendix. ICLR 2025 conference paper

  16. arXiv:2403.14622  [pdf, other]

    cs.CV

    Language Repository for Long Video Understanding

    Authors: Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo

    Abstract: Language has become a prominent modality in computer vision with the rise of LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs, that maintains concise a…

    Submitted 20 December, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  17. arXiv:2312.03817  [pdf, other]

    cs.CV

    Diffusion Illusions: Hiding Images in Plain Sight

    Authors: Ryan Burgert, Xiang Li, Abe Leite, Kanchana Ranasinghe, Michael S. Ryoo

    Abstract: We explore the problem of computationally generating special 'prime' images that produce optical illusions when physically arranged and viewed in a certain way. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these illusions. Specifically, we both adapt the existing '…

    Submitted 6 December, 2023; originally announced December 2023.

  18. arXiv:2311.05698  [pdf, other]

    cs.CV

    Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

    Authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

    Abstract: One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volu…

    Submitted 3 April, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  19. arXiv:2310.20704  [pdf, other]

    cs.CV cs.AI

    Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders

    Authors: Srijan Das, Tanmay Jain, Dominick Reilly, Pranav Balaji, Soumyajit Karmakar, Shyam Marjit, Xiang Li, Abhijit Das, Michael S. Ryoo

    Abstract: Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which can make it difficult to train them with limited data. To address this challenge, prior studies suggest training ViTs with self-supervised learning (SSL) and fine-tuning sequentially. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervis…

    Submitted 27 December, 2023; v1 submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted to WACV 2024

  20. arXiv:2309.00696  [pdf, other]

    cs.CV

    AAN: Attributes-Aware Network for Temporal Action Detection

    Authors: Rui Dai, Srijan Das, Michael S. Ryoo, Francois Bremond

    Abstract: The challenge of long-term video understanding remains constrained by the efficient extraction of object semantics and the modelling of their relationships for downstream tasks. Although the CLIP visual features exhibit discriminative properties for various vision tasks, particularly in object encoding, they are suboptimal for long-term video understanding. To address this issue, we present the At…

    Submitted 1 September, 2023; originally announced September 2023.

  21. arXiv:2307.01849  [pdf, other]

    cs.RO cs.CV cs.LG

    Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

    Authors: Xiang Li, Varun Belagali, Jinghuan Shang, Michael S. Ryoo

    Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states.…

    Submitted 11 January, 2024; v1 submitted 4 July, 2023; originally announced July 2023.

    Comments: 15 pages, 13 figures. Code, pretrained checkpoints, and datasets are available at https://github.com/LostXine/crossway_diffusion. Video demo is at https://youtu.be/9deKHueZBuk

  22. arXiv:2306.04021  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Energy-Based Models for Cross-Modal Localization using Convolutional Transformers

    Authors: Alan Wu, Michael S. Ryoo

    Abstract: We present a novel framework using Energy-Based Models (EBMs) for localizing a ground vehicle mounted with a range sensor against satellite imagery in the absence of GPS. Lidar sensors have become ubiquitous on autonomous vehicles for describing their surrounding environment. Map priors are typically built using the same sensor modality for localization purposes. However, these map building endeavor…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: ICRA 2023

  23. arXiv:2306.00975  [pdf, other]

    cs.LG cs.CV cs.RO

    Active Vision Reinforcement Learning under Limited Visual Observability

    Authors: Jinghuan Shang, Michael S. Ryoo

    Abstract: In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where an embodied agent simultaneously learns an action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as motor policy and the latter as sensory policy. For example, humans solve real world tasks by hand manipulation (motor policy) togethe…

    Submitted 5 November, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023. Project page at https://elicassion.github.io/sugarl/sugarl.html. Code at https://github.com/elicassion/sugarl. Environment library at https://github.com/elicassion/active-gym

  24. arXiv:2304.02560  [pdf, other]

    cs.CV

    VicTR: Video-conditioned Text Representations for Activity Recognition

    Authors: Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

    Abstract: Vision-Language models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely…

    Submitted 29 March, 2024; v1 submitted 5 April, 2023; originally announced April 2023.

    Comments: To appear at CVPR 2024

  25. arXiv:2211.13224  [pdf, other]

    cs.CV cs.CL cs.LG

    Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

    Authors: Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo

    Abstract: Recently, text-to-image diffusion models have shown remarkable capabilities in creating realistic images from natural language prompts. However, few works have explored using these models for semantic localization or grounding. In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases witho…

    Submitted 21 June, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: 19 pages; contains appendix

  26. arXiv:2211.09119  [pdf, other]

    cs.LG cs.CV cs.RO

    Token Turing Machines

    Authors: Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab

    Abstract: We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the p…

    Submitted 13 April, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: CVPR 2023 camera-ready copy

    Journal ref: CVPR 2023

  27. arXiv:2210.15943  [pdf, other]

    cs.CV

    Grafting Vision Transformers

    Authors: Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo

    Abstract: Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better perfor…

    Submitted 3 April, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

  28. arXiv:2209.09874  [pdf, other]

    cs.RO cs.AI cs.CV

    Open-vocabulary Queryable Scene Representations for Real World Planning

    Authors: Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao, Keerthana Gopalakrishnan, Michael S. Ryoo, Austin Stone, Daniel Kappler

    Abstract: Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation to address this problem. NLMap serves as a framework to gather and integrate conte…

    Submitted 15 October, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: v2, added references to concurrent work and acknowledgments

  29. arXiv:2208.00934  [pdf, other]

    cs.CV

    Video Question Answering with Iterative Video-Text Co-Tokenization

    Authors: AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

    Abstract: Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization…

    Submitted 1 August, 2022; originally announced August 2022.

    Comments: ECCV 2022

  30. arXiv:2207.11305  [pdf, ps, other]

    math.MG math.CA math.FA math.GR

    Vertical versus horizontal inequalities on simply connected nilpotent Lie groups and groups of polynomial growth

    Authors: Seung-Yeon Ryoo

    Abstract: We establish "vertical versus horizontal inequalities" for functions from nonabelian simply connected nilpotent Lie groups and not virtually abelian finitely generated groups of polynomial growth into uniformly convex Banach spaces using the vector-valued Littlewood–Paley–Stein theory approach of Lafforgue and Naor (2012). This is a quantitative nonembeddability statement that shows that any L…

    Submitted 22 July, 2022; originally announced July 2022.

    Comments: 56 pages

    MSC Class: 30L15 (Primary) 26B05; 46B85 (Secondary)

  31. arXiv:2207.00579  [pdf, other]

    cs.CV cs.LG

    Video + CLIP Baseline for Ego4D Long-term Action Anticipation

    Authors: Srijan Das, Michael S. Ryoo

    Abstract: In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pre-trained paired image-text model (CLIP) and a video encoder (the SlowFast network). The CLIP embedding provides fine-grained understanding of objects relevant for an action whereas the SlowFast network is responsible for modeling temporal information…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: Secured second position in the Ego4D Challenge for Long-Term Action Anticipation track at CVPR 2022

  32. arXiv:2206.11895  [pdf, other]

    cs.CV cs.LG cs.RO

    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

    Authors: Jinghuan Shang, Srijan Das, Michael S. Ryoo

    Abstract: Humans are remarkably flexible in understanding viewpoint changes due to the visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers,…

    Submitted 12 January, 2023; v1 submitted 23 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022. Our code is at https://github.com/elicassion/3DTRL. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html. v3 and v4 contain minor updates to figures and visualizations

  33. arXiv:2206.05266  [pdf, other]

    cs.LG cs.CV cs.RO

    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?

    Authors: Xiang Li, Jinghuan Shang, Srijan Das, Michael S. Ryoo

    Abstract: We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive amount of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful…

    Submitted 13 January, 2023; v1 submitted 10 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022. Code for ELo-SACv3 is at https://github.com/LostXine/elo-sac and code for ELo-Rainbow is at https://github.com/LostXine/elo-rainbow

  34. arXiv:2112.03906  [pdf, other]

    cs.CV

    Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

    Authors: Srijan Das, Michael S. Ryoo

    Abstract: Contrastive representation learning of videos highly relies on the availability of millions of unlabelled videos. This is practical for videos available on the web, but acquiring videos at such a large scale for real-world applications is very expensive and laborious. Therefore, in this paper we focus on designing video augmentation for self-supervised learning. We first analyze the best strategy to mi…

    Submitted 27 July, 2023; v1 submitted 7 December, 2021; originally announced December 2021.

    Comments: Accepted at MVA 2023

  35. arXiv:2112.03905  [pdf, other]

    cs.CV

    ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints

    Authors: Srijan Das, Michael S. Ryoo

    Abstract: Learning self-supervised video representation predominantly focuses on discriminating instances generated from simple data augmentation schemes. However, the learned representation often fails to generalize over unseen camera viewpoints. To this end, we propose ViewCLR, which learns self-supervised video representation invariant to camera viewpoint changes. We introduce a view-generator that can be…

    Submitted 7 December, 2021; originally announced December 2021.

    Comments: 13 pages. Code and models will be updated soon

  36. arXiv:2112.03902  [pdf, other]

    cs.CV

    MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

    Authors: Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond

    Abstract: Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. The temporal relation is complex in those datasets, including challenges like composite action, and co-occurring action. For detecting actions in those complex videos, efficiently capturing both short-term and long-term temporal information in the video is critical. To this end, we…

    Submitted 29 March, 2022; v1 submitted 7 December, 2021; originally announced December 2021.

    Comments: Accepted in CVPR 2022

  37. arXiv:2111.13677  [pdf, other]

    cs.CV

    SWAT: Spatial Structure Within and Among Tokens

    Authors: Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: Modeling visual data as tokens (i.e., image patches) using attention mechanisms, feed-forward networks or convolutions has been highly effective in recent years. Such methods usually have a common pipeline: a tokenization method, followed by a set of layers/blocks for information mixing, both within and among tokens. When image patches are converted into tokens, they are often flattened, discardin…

    Submitted 20 November, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

    Comments: Accepted to be published at IJCAI23

  38. arXiv:2111.13675  [pdf, other]

    cs.CV

    Weakly-guided Self-supervised Pretraining for Temporal Activity Detection

    Authors: Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua

    Abstract: Temporal Activity Detection aims to predict activity classes per frame, in contrast to video-level predictions in Activity Classification (i.e., Activity Recognition). Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited. Thus, commonly, previous work on temporal activity detection resorts to fine-tuning a classification model pretrained o…

    Submitted 4 February, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

    Comments: Published as a conference paper at AAAI 2023

  39. StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning

    Authors: Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo

    Abstract: Reinforcement Learning (RL) can be considered as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like…

    Submitted 3 January, 2023; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Accepted to ECCV 2022. Our code is available at https://github.com/elicassion/StARformer

  40. arXiv:2110.04367  [pdf, other]

    cs.LG stat.ML

    Hybrid Random Features

    Authors: Krzysztof Choromanski, Haoxian Chen, Han Lin, Yuanzhe Ma, Arijit Sehanobish, Deepali Jain, Michael S Ryoo, Jake Varley, Andy Zeng, Valerii Likhosherstov, Dmitry Kalashnikov, Vikas Sindhwani, Adrian Weller

    Abstract: We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide the most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the…

    Submitted 30 January, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Published as a conference paper at ICLR 2022

  41. arXiv:2109.14761  [pdf, ps, other]

    math.DS math.CA

    Asymptotic formation and orbital stability of phase-locked states in Kuramoto--Lohe type synchronization models on Lie groups

    Authors: Seung-Yeon Ryoo

    Abstract: Some mathematical models of synchronization, such as the Kuramoto model (1975) and its generalizations pioneered by Lohe (2009), are formulated as ordinary differential equations describing populations of particles on Lie groups with locally attractive interactions. We suggest a model of synchronization on Lie groups and present a framework to understand the formation of phase-locked states and th…

    Submitted 4 January, 2025; v1 submitted 29 September, 2021; originally announced September 2021.

    Comments: 19 pages, to appear in Commun. Math. Sci

    MSC Class: 34D06 (Primary) 34H15; 82C22 (Secondary)

  42. arXiv:2109.01066  [pdf, other]

    cs.CV

    4D-Net for Learned Multi-Modal Alignment

    Authors: AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova

    Abstract: We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines…

    Submitted 2 September, 2021; originally announced September 2021.

    Comments: ICCV 2021

  43. arXiv:2108.01069  [pdf, other]

    cs.RO cs.CV cs.LG

    Self-Supervised Disentangled Representation Learning for Third-Person Imitation Learning

    Authors: Jinghuan Shang, Michael S. Ryoo

    Abstract: Humans learn to imitate by observing others. However, robot imitation learning generally requires expert demonstrations in the first-person view (FPV). Collecting such FPV videos for every robot could be very expensive. Third-person imitation learning (TPIL) is the concept of learning action policies by observing other agents in a third-person view (TPV), similar to what humans do. This ultimately…

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: Preprint. 8 pages. Accepted at IROS 2021

  44. arXiv:2106.14733  [pdf, other]

    cs.CV

    Unsupervised Discovery of Actions in Instructional Videos

    Authors: AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

    Abstract: In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which can, for example, automatically 'read' the steps from an instructional video and execute them. However,…

    Submitted 28 June, 2021; originally announced June 2021.

    Comments: Full paper

  45. arXiv:2106.11297  [pdf, other]

    cs.CV cs.LG

    TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

    Authors: Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

    Abstract: In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual…

    Submitted 3 April, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

    Comments: This is the full version of the paper, extending its conference paper at NeurIPS 2021. Version 1.1 of the code is released

    Journal ref: NeurIPS 2021

  46. arXiv:2106.03738  [pdf, other]

    cs.CV

    Unsupervised Action Segmentation for Instructional Videos

    Authors: AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

    Abstract: In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos based on a sequential stochastic autoregressive model for temporal segmentation of videos. This…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: 4 page abstract for LUV workshop

  47. arXiv:2103.16516  [pdf, other]

    cs.CV

    Recognizing Actions in Videos from Unseen Viewpoints

    Authors: AJ Piergiovanni, Michael S. Ryoo

    Abstract: Standard methods for video recognition use large CNNs designed to capture spatio-temporal data. However, training these models requires a large amount of labeled training data, containing a wide variety of actions, scenes, settings and camera viewpoints. In this paper, we show that current convolutional neural network models are unable to recognize actions from camera viewpoints not present in the…

    Submitted 30 March, 2021; originally announced March 2021.

    Journal ref: CVPR 2021

  48. arXiv:2103.14633  [pdf, other]

    cs.RO cs.CV cs.LG cs.NE

    Visionary: Vision architecture discovery for robot learning

    Authors: Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo

    Abstract: We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs. Our approach automatically designs architectures while training on the task, discovering novel ways of combining and attending image feature representations with actions as well as features from previous layer…

    Submitted 26 March, 2021; originally announced March 2021.

    Journal ref: ICRA 2021

  49. arXiv:2103.01302  [pdf, other]

    cs.CV

    Coarse-Fine Networks for Temporal Activity Detection in Videos

    Authors: Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: In this paper, we introduce Coarse-Fine Networks, a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion. Traditional video models process inputs at one (or a few) fixed temporal resolutions without any dynamic frame selection. However, we argue that processing multiple temporal resolutions of the input a…

    Submitted 1 April, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: To appear at CVPR 2021

  50. arXiv:2011.07092  [pdf, other]

    cs.CV

    Reducing Inference Latency with Concurrent Architectures for Image Recognition

    Authors: Ramyad Hadidi, Jiashen Cao, Michael S. Ryoo, Hyesoon Kim

    Abstract: The high computational demand of modern deep learning architectures makes achieving low inference latency challenging. Current approaches to decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with a higher concurrency (i.e., simultaneous execution of one in…

    Submitted 13 November, 2020; originally announced November 2020.