-
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback
Authors:
Thai Hoang,
Kung-Hsiang Huang,
Shirley Kokane,
Jianguo Zhang,
Zuxin Liu,
Ming Zhu,
Jake Grigsby,
Tian Lan,
Michael S Ryoo,
Chien-Sheng Wu,
Shelby Heinecke,
Huan Wang,
Silvio Savarese,
Caiming Xiong,
Juan Carlos Niebles
Abstract:
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting its efficiency and effectiveness in speeding up the development of AI agents.
Submitted 2 June, 2025;
originally announced June 2025.
-
Pixel Motion as Universal Representation for Robot Control
Authors:
Kanchana Ranasinghe,
Xiang Li,
Cristina Mata,
Jongwoo Park,
Michael S Ryoo
Abstract:
We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion (a universal, interpretable, and motion-centric representation) can be extracted from videos in a self-supervised manner, enabling diffusion model training on web-scale video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals. This hierarchical decoupling enables flexible, scalable, and generalizable robot control under both unsupervised and supervised settings, bridging the gap between language, motion, and action. Check out https://kahnchana.github.io/LangToMo for visualizations.
Submitted 12 May, 2025;
originally announced May 2025.
-
Quantitative relaxation dynamics from generic initial configurations in the inertial Kuramoto model
Authors:
Hangjun Cho,
Jiu-Gang Dong,
Seung-Yeal Ha,
Seung-Yeon Ryoo
Abstract:
We study the relaxation dynamics of the inertial Kuramoto model toward a phase-locked state from a generic initial phase configuration. For this, we propose a sufficient framework in terms of initial data and system parameters for asymptotic phase-locking. It can be roughly stated as a set of conditions such as a positive initial order parameter and a coupling strength sufficiently larger than the initial frequency diameter and the intrinsic frequency diameter, but less than the inverse of inertia. Under the proposed framework, a generic initial configuration undergoes three dynamic stages (initial layer, condensation and relaxation stages) before it reaches a phase-locked state asymptotically. The first stage is the initial layer stage in analogy with fluid mechanics, during which the effect of the initial natural frequency distribution is dominant, compared to that of the sinusoidal coupling between oscillators. The second stage is the condensation stage, during which the order parameter increases, and at the end of which a majority cluster is contained in a sufficiently small arc. Finally, the third stage is the persistence and relaxation stage, during which the majority cluster remains stable (persistence) and the total configuration relaxes toward a phase-locked state asymptotically (relaxation). The intricate proof involves several key tools such as the quasi-monotonicity of the order parameter (for the condensation stage), a nonlinear Grönwall inequality on the diameter of the majority cluster (for the persistence stage), and a variant of the classical Łojasiewicz gradient theorem (for the relaxation stage).
Submitted 1 March, 2025;
originally announced March 2025.
-
Asymptotics of Riemannian Lie groups with nilpotency step 2
Authors:
Enrico Le Donne,
Luca Nalon,
Sebastiano Nicolussi Golo,
Seung-Yeon Ryoo
Abstract:
We derive estimates comparing asymptotic Riemannian or sub-Riemannian metrics in step-2 nilpotent Lie groups. Given a sub-Riemannian metric, we construct a Carnot metric whose square remains at a bounded distance from the square of the original metric. As a consequence, we obtain a refined estimate of the error term in the asymptotic expansion of the volume of (sub-)Riemannian metric balls. To achieve this, we develop a novel technique to perturb sub-Riemannian geodesics, allowing us to modify their endpoints in a prescribed vertical direction.
Submitted 1 March, 2025;
originally announced March 2025.
-
Multi-view Structural Convolution Network for Domain-Invariant Point Cloud Recognition of Autonomous Vehicles
Authors:
Younggun Kim,
Beomsik Cho,
Seonghoon Ryoo,
Soomok Lee
Abstract:
Point cloud representation has recently become a research hotspot in the field of computer vision and has been utilized for autonomous vehicles. However, adapting deep learning networks for point cloud data recognition is challenging due to the variability in datasets and sensor technologies. This variability underscores the necessity for adaptive techniques to maintain accuracy under different conditions. In this paper, we present the Multi-View Structural Convolution Network (MSCN) designed for domain-invariant point cloud recognition. MSCN comprises Structural Convolution Layers (SCL) that extract local context geometric features from point clouds and Structural Aggregation Layers (SAL) that extract and aggregate both local and overall context features from point clouds. Additionally, our MSCN enhances feature representation robustness by training with unseen domain point clouds derived from source domain point clouds. This method acquires domain-invariant features and exhibits robust, consistent performance across various point cloud datasets, ensuring compatibility with diverse sensor configurations without the need for parameter adjustments. This highlights MSCN's potential to significantly improve reliability and domain-invariant feature learning in different environments. Our code is available at https://github.com/MLMLab/MSCN.
Submitted 30 April, 2025; v1 submitted 27 January, 2025;
originally announced January 2025.
-
LatentCRF: Continuous CRF for Efficient Latent Diffusion
Authors:
Kanchana Ranasinghe,
Sadeep Jayasumana,
Andreas Veit,
Ayan Chakrabarti,
Daniel Glasner,
Michael S Ryoo,
Srikumar Ramalingam,
Sanjiv Kumar
Abstract:
Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images; however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some of the computationally intensive LDM inference iterations with our lightweight LatentCRF, we achieve a superior balance between quality, speed and diversity. We increase inference efficiency by 33% with no loss in image quality or diversity compared to the full LDM. LatentCRF is an easy add-on, which does not require modifying the LDM.
Submitted 24 December, 2024;
originally announced December 2024.
-
Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
Authors:
AJ Piergiovanni,
Dahun Kim,
Michael S. Ryoo,
Isaac Noble,
Anelia Angelova
Abstract:
Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputs localized descriptions, and efficiently leverages the context from the previous video segments. This allows the model to output frequent, detailed captions to more comprehensively describe the video, according to its actual local content, rather than mimic the training data. We also propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute. The annotations produced are much more comprehensive and frequent, and can further be utilized in automatic video tagging and in large-scale video data harvesting.
Submitted 21 November, 2024;
originally announced November 2024.
-
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Authors:
Kumara Kahatapitiya,
Haozhe Liu,
Sen He,
Ding Liu,
Menglin Jia,
Chenyang Zhang,
Michael S. Ryoo,
Tian Xie
Abstract:
Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.
Submitted 7 November, 2024; v1 submitted 4 November, 2024;
originally announced November 2024.
-
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Authors:
Michael S. Ryoo,
Honglu Zhou,
Shrikant Kendre,
Can Qin,
Le Xue,
Manli Shu,
Jongwoo Park,
Kanchana Ranasinghe,
Silvio Savarese,
Ran Xu,
Caiming Xiong,
Juan Carlos Niebles
Abstract:
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
Submitted 9 June, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Authors:
Le Xue,
Manli Shu,
Anas Awadalla,
Jun Wang,
An Yan,
Senthil Purushwalkam,
Honglu Zhou,
Viraj Prabhu,
Yutong Dai,
Michael S Ryoo,
Shrikant Kendre,
Jieyu Zhang,
Can Qin,
Shu Zhang,
Chia-Chih Chen,
Ning Yu,
Juntao Tan,
Tulika Manoj Awalgaonkar,
Shelby Heinecke,
Huan Wang,
Yejin Choi,
Ludwig Schmidt,
Zeyuan Chen,
Silvio Savarese,
Juan Carlos Niebles
, et al. (2 additional authors not shown)
Abstract:
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
Submitted 28 August, 2024; v1 submitted 16 August, 2024;
originally announced August 2024.
-
3D Adaptive Structural Convolution Network for Domain-Invariant Point Cloud Recognition
Authors:
Younggun Kim,
Beomsik Cho,
Seonghoon Ryoo,
Soomok Lee
Abstract:
Adapting deep learning networks for point cloud data recognition in self-driving vehicles faces challenges due to the variability in datasets and sensor technologies, emphasizing the need for adaptive techniques to maintain accuracy across different conditions. In this paper, we introduce the 3D Adaptive Structural Convolution Network (3D-ASCN), a cutting-edge framework for 3D point cloud recognition. It combines 3D convolution kernels, a structural tree structure, and adaptive neighborhood sampling for effective geometric feature extraction. This method obtains domain-invariant features and demonstrates robust, adaptable performance on a variety of point cloud datasets, ensuring compatibility across diverse sensor configurations without the need for parameter adjustments. This highlights its potential to significantly enhance the reliability and efficiency of self-driving vehicle technology.
Submitted 21 October, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Authors:
Xiang Li,
Cristina Mata,
Jongwoo Park,
Kumara Kahatapitiya,
Yoo Sung Jang,
Jinghuan Shang,
Kanchana Ranasinghe,
Ryan Burgert,
Mu Cai,
Yong Jae Lee,
Michael S. Ryoo
Abstract:
Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
Submitted 30 January, 2025; v1 submitted 28 June, 2024;
originally announced June 2024.
-
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Authors:
Jongwoo Park,
Kanchana Ranasinghe,
Kumara Kahatapitiya,
Wonjeong Ryu,
Donghyun Kim,
Michael S. Ryoo
Abstract:
Long-form videos that span wide temporal intervals are highly information-redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is inefficient and mostly redundant. Questioning these decision choices, we explore optimal strategies for key-frame selection that can significantly reduce these redundancies, namely Hierarchical Keyframe Selector. Our proposed framework, LVNet, achieves state-of-the-art performance at a comparable caption scale across three benchmark LVQA datasets: EgoSchema, NExT-QA, and IntentQA, while also demonstrating strong performance on videos up to an hour long in VideoMME. The code can be found at https://github.com/jongwoopark7978/LVNet.
Submitted 20 March, 2025; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Authors:
Kanchana Ranasinghe,
Satya Narayan Shukla,
Omid Poursaeed,
Michael S. Ryoo,
Tsung-Yu Lin
Abstract:
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
Submitted 10 April, 2024;
originally announced April 2024.
-
Understanding Long Videos with Multimodal Language Models
Authors:
Kanchana Ranasinghe,
Xiang Li,
Kumara Kahatapitiya,
Michael S. Ryoo
Abstract:
Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance. We discover that LLM-based approaches can yield surprisingly good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics domain tasks also establishes its generality. Code: https://github.com/kahnchana/mvu
Submitted 23 February, 2025; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Language Repository for Long Video Understanding
Authors:
Kumara Kahatapitiya,
Kanchana Ranasinghe,
Jongwoo Park,
Michael S. Ryoo
Abstract:
Language has become a prominent modality in computer vision with the rise of LLMs. Although LLMs support long context lengths, their effectiveness in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation. Our repository is updated iteratively based on multi-scale video chunks. We introduce write and read operations that focus on pruning redundancies in text, and extracting information at various temporal scales. The proposed framework is evaluated on zero-shot visual question-answering benchmarks including EgoSchema, NExT-QA, IntentQA and NExT-GQA, showing state-of-the-art performance at its scale. Our code is available at https://github.com/kkahatapitiya/LangRepo.
Submitted 20 December, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
Diffusion Illusions: Hiding Images in Plain Sight
Authors:
Ryan Burgert,
Xiang Li,
Abe Leite,
Kanchana Ranasinghe,
Michael S. Ryoo
Abstract:
We explore the problem of computationally generating special 'prime' images that produce optical illusions when physically arranged and viewed in a certain way. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these illusions. Specifically, we both adapt the existing 'score distillation loss' and propose a new 'dream target loss' to optimize a group of differentially parametrized prime images, using a frozen text-to-image diffusion model. We study three types of illusions, each where the prime images are arranged in different ways and optimized using the aforementioned losses such that images derived from them align with user-chosen text prompts or images. We conduct comprehensive experiments on these illusions and verify the effectiveness of our proposed method qualitatively and quantitatively. Additionally, we showcase the successful physical fabrication of our illusions, as they are all designed to work in the real world. Our code and examples are publicly available at our interactive project website: https://diffusionillusions.com
Submitted 6 December, 2023;
originally announced December 2023.
-
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
Authors:
AJ Piergiovanni,
Isaac Noble,
Dahun Kim,
Michael S. Ryoo,
Victor Gomes,
Anelia Angelova
Abstract:
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder.
Here, we decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet.
Our approach achieves state-of-the-art results on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
Submitted 3 April, 2024; v1 submitted 9 November, 2023;
originally announced November 2023.
-
Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders
Authors:
Srijan Das,
Tanmay Jain,
Dominick Reilly,
Pranav Balaji,
Soumyajit Karmakar,
Shyam Marjit,
Xiang Li,
Abhijit Das,
Michael S. Ryoo
Abstract:
Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which can make it difficult to train them with limited data. To address this challenge, prior studies suggest training ViTs with self-supervised learning (SSL) and fine-tuning sequentially. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervised Auxiliary Task (SSAT) is surprisingly beneficial when the amount of training data is limited. We explore the appropriate SSL tasks that can be optimized alongside the primary task, the training schemes for these tasks, and the data scale at which they can be most effective. Our findings reveal that SSAT is a powerful technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks, achieving better performance than typical ViT pre-training with SSL followed by fine-tuning. Our experiments, conducted on 10 datasets, demonstrate that SSAT significantly improves ViT performance while reducing carbon footprint. We also confirm the effectiveness of SSAT in the video domain for deepfake detection, showcasing its generalizability. Our code is available at https://github.com/dominickrei/Limited-data-vits.
Submitted 27 December, 2023; v1 submitted 31 October, 2023;
originally announced October 2023.
-
AAN: Attributes-Aware Network for Temporal Action Detection
Authors:
Rui Dai,
Srijan Das,
Michael S. Ryoo,
Francois Bremond
Abstract:
Long-term video understanding remains constrained by the difficulty of efficiently extracting object semantics and modelling their relationships for downstream tasks. Although CLIP visual features exhibit discriminative properties for various vision tasks, particularly in object encoding, they are suboptimal for long-term video understanding. To address this issue, we present the Attributes-Aware Network (AAN), which consists of two key components: the Attributes Extractor and a Graph Reasoning block. These components facilitate the extraction of object-centric attributes and the modelling of their relationships within the video. By leveraging CLIP features, AAN outperforms state-of-the-art approaches on two popular action detection datasets: Charades and Toyota Smarthome Untrimmed.
Submitted 1 September, 2023;
originally announced September 2023.
-
Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
Authors:
Xiang Li,
Varun Belagali,
Jinghuan Shang,
Michael S. Ryoo
Abstract:
Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.
Submitted 11 January, 2024; v1 submitted 4 July, 2023;
originally announced July 2023.
-
Energy-Based Models for Cross-Modal Localization using Convolutional Transformers
Authors:
Alan Wu,
Michael S. Ryoo
Abstract:
We present a novel framework using Energy-Based Models (EBMs) for localizing a ground vehicle mounted with a range sensor against satellite imagery in the absence of GPS. Lidar sensors have become ubiquitous on autonomous vehicles for describing their surrounding environments. Map priors are typically built using the same sensor modality for localization purposes. However, these map-building endeavors using range sensors are often expensive and time-consuming. As an alternative, we use satellite images as map priors, which are widely available, easily accessible, and provide comprehensive coverage. We propose a method using convolutional transformers that performs accurate metric-level localization in a cross-modal manner, which is challenging due to the drastic difference in appearance between the sparse range sensor readings and the rich satellite imagery. We train our model end-to-end and demonstrate that our approach achieves higher accuracy than the state-of-the-art on KITTI, Pandaset, and a custom dataset.
Submitted 6 June, 2023;
originally announced June 2023.
-
Active Vision Reinforcement Learning under Limited Visual Observability
Authors:
Jinghuan Shang,
Michael S. Ryoo
Abstract:
In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where an embodied agent simultaneously learns an action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as the motor policy and the latter as the sensory policy. For example, humans solve real-world tasks by hand manipulation (motor policy) together with eye movements (sensory policy). ActiveVision-RL poses challenges in coordinating the two policies given their mutual influence. We propose SUGARL, Sensorimotor Understanding Guided Active Reinforcement Learning, a framework that models motor and sensory policies separately, but jointly learns them using an intrinsic sensorimotor reward. This learnable reward, assigned by a sensorimotor reward module, incentivizes the sensory policy to select observations that are optimal for inferring its own motor action, inspired by the sensorimotor stage of humans. Through a series of experiments, we show the effectiveness of our method across a range of observability conditions and its adaptability to existing RL algorithms. The sensory policies learned through our method are observed to exhibit effective active vision strategies.
Submitted 5 November, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
VicTR: Video-conditioned Text Representations for Activity Recognition
Authors:
Kumara Kahatapitiya,
Anurag Arnab,
Arsha Nagrani,
Michael S. Ryoo
Abstract:
Vision-Language models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary, that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more-flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.
Submitted 29 March, 2024; v1 submitted 5 April, 2023;
originally announced April 2023.
-
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
Authors:
Ryan Burgert,
Kanchana Ranasinghe,
Xiang Li,
Michael S. Ryoo
Abstract:
Recently, text-to-image diffusion models have shown remarkable capabilities in creating realistic images from natural language prompts. However, few works have explored using these models for semantic localization or grounding. In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases without segmentation-specific re-training. We introduce an inference-time optimization process capable of generating segmentation masks conditioned on natural language prompts. Our proposal, Peekaboo, is a first-of-its-kind zero-shot, open-vocabulary, unsupervised semantic grounding technique leveraging diffusion models without any training. We evaluate Peekaboo on the Pascal VOC dataset for unsupervised semantic segmentation and the RefCOCO dataset for referring segmentation, showing promising, competitive results. We also demonstrate how Peekaboo can be used to generate images with transparency, even though the underlying diffusion model was only trained on RGB images, which to our knowledge we are the first to attempt. Please see our project page, including our code: https://ryanndagreat.github.io/peekaboo
Submitted 21 June, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Token Turing Machines
Authors:
Michael S. Ryoo,
Keerthana Gopalakrishnan,
Kumara Kahatapitiya,
Ted Xiao,
Kanishka Rao,
Austin Stone,
Yao Lu,
Julian Ibarz,
Anurag Arnab
Abstract:
We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. We show that TTM outperforms other alternatives, such as other Transformer models designed for long sequences and recurrent neural networks, on two real-world sequential visual understanding tasks: online temporal activity detection from videos and vision-based robot action policy learning.
Code is publicly available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing
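The bounded per-step cost described above can be illustrated with a minimal sketch of a fixed-size token memory. Everything here is an assumption made for illustration and is not taken from the authors' code: the function names (`summarize`, `step`), the softmax-weighted averaging used to reduce tokens, the random scoring vectors standing in for learned parameters, and the identity placeholder where a Transformer processing unit would sit.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(tokens, scorers):
    """Reduce any number of tokens to len(scorers) tokens.

    Each output token is a softmax-weighted average of the input tokens,
    weighted by the dot product of each token with one scoring vector.
    """
    dim = len(tokens[0])
    summary = []
    for scorer in scorers:
        weights = softmax([sum(w * x for w, x in zip(scorer, tok)) for tok in tokens])
        summary.append([sum(a * tok[i] for a, tok in zip(weights, tokens)) for i in range(dim)])
    return summary


def step(memory, frame_tokens, read_scorers, write_scorers):
    """One step: read from memory + new input, process, write back to memory."""
    read = summarize(memory + frame_tokens, read_scorers)
    output = read  # hypothetical placeholder for the Transformer processing unit
    new_memory = summarize(memory + frame_tokens + output, write_scorers)
    return new_memory, output


rng = random.Random(0)
DIM, MEM, READ, PER_FRAME = 4, 8, 3, 16  # illustrative sizes, not from the paper


def rand_vecs(n):
    """n random DIM-dimensional vectors (stand-ins for tokens or parameters)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


read_scorers, write_scorers = rand_vecs(READ), rand_vecs(MEM)
memory = [[0.0] * DIM for _ in range(MEM)]
for _ in range(50):
    # Each step touches only MEM + PER_FRAME (+ READ) tokens, never the full history.
    memory, output = step(memory, rand_vecs(PER_FRAME), read_scorers, write_scorers)
```

However many frames are processed, `memory` stays at `MEM` tokens, so the work per step does not grow with the length of the sequence.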
Submitted 13 April, 2023; v1 submitted 16 November, 2022;
originally announced November 2022.
-
Grafting Vision Transformers
Authors:
Jongwoo Park,
Kumara Kahatapitiya,
Donghyun Kim,
Shivchander Sudalairaj,
Quanfu Fan,
Michael S. Ryoo
Abstract:
Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike. It has the flexibility of branching out at arbitrary depths and shares most of the parameters and computations of the backbone. GrafT shows consistent gains over various well-known models, including both hybrid and pure Transformer types, both homogeneous and pyramid structures, and various self-attention methods. In particular, it largely benefits mobile-size models by providing high-level semantics. On the ImageNet-1k dataset, GrafT delivers +3.9%, +1.4%, and +1.9% top-1 accuracy improvement to DeiT-T, Swin-T, and MobileViT-XXS, respectively. Our code and models will be made available.
Submitted 3 April, 2023; v1 submitted 28 October, 2022;
originally announced October 2022.
-
Open-vocabulary Queryable Scene Representations for Real World Planning
Authors:
Boyuan Chen,
Fei Xia,
Brian Ichter,
Kanishka Rao,
Keerthana Gopalakrishnan,
Michael S. Ryoo,
Austin Stone,
Daniel Kappler
Abstract:
Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation to address this problem. NLMap serves as a framework to gather and integrate contextual information into LLM planners, allowing them to see and query available objects in the scene before generating a context-conditioned plan. NLMap first establishes a natural language queryable scene representation with Visual Language models (VLMs). An LLM-based object proposal module parses instructions and proposes involved objects to query the scene representation for object availability and location. An LLM planner then plans with such information about the scene. NLMap allows robots to operate without a fixed list of objects or executable options, enabling real robot operation unachievable by previous methods. Project website: https://nlmap-saycan.github.io
Submitted 15 October, 2022; v1 submitted 20 September, 2022;
originally announced September 2022.
-
Video Question Answering with Iterative Video-Text Co-Tokenization
Authors:
AJ Piergiovanni,
Kairo Morton,
Weicheng Kuo,
Michael S. Ryoo,
Anelia Angelova
Abstract:
Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA, outperforming the previous state-of-the-art by large margins. At the same time, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
Submitted 1 August, 2022;
originally announced August 2022.
-
Vertical versus horizontal inequalities on simply connected nilpotent Lie groups and groups of polynomial growth
Authors:
Seung-Yeon Ryoo
Abstract:
We establish ``vertical versus horizontal inequalities'' for functions from nonabelian simply connected nilpotent Lie groups and not virtually abelian finitely generated groups of polynomial growth into uniformly convex Banach spaces using the vector-valued Littlewood--Paley--Stein theory approach of Lafforgue and Naor (2012). This is a quantitative nonembeddability statement that shows that any Lipschitz mapping from the aforementioned groups into a uniformly convex space must quantitatively collapse along certain subgroups. As a consequence, a ball of radius $r\ge 2$ in the aforementioned groups must incur bilipschitz distortion at least a constant multiple of $(\log r)^{1/q}$ into a $q(\ge 2)$-uniformly convex Banach space. This bound is sharp for the $L^p$ ($1<p<\infty$) spaces.
In the special case of mappings of Carnot groups into the $L^p$ ($1<p<\infty$) spaces, we prove that the quantitative collapse occurs on a larger subgroup that is the commutator subgroup; this is in line with the qualitative Pansu--Semmes nonembeddability argument given by Cheeger and Kleiner (2006) and Lee and Naor (2006). We prove this by establishing a version of the classical Dorronsoro theorem on Carnot groups. Previously, in the setting of Heisenberg groups, Fässler and Orponen (2019) established a one-sided Dorronsoro theorem with a restriction $0<\alpha<2$ on the range of exponents $\alpha$ of the Laplacian; this restriction does not appear in the commutative setting and is caused by their use of horizontal polynomials as approximants. We identify the correct class of approximant polynomials and prove the two-sided Dorronsoro theorem with the full range $0<\alpha<\infty$ of exponents in the general setting of Carnot groups, thus strengthening and extending the work of Fässler and Orponen.
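The distortion bound stated in this abstract can be written compactly as below. The symbols are assumptions introduced here for illustration, since the abstract does not fix notation: $G$ is one of the groups considered, $B_r \subset G$ is a ball of radius $r$, $X$ is a $q$-uniformly convex Banach space with $q \ge 2$, $c_X(\cdot)$ denotes bilipschitz distortion into $X$, and $C > 0$ is a constant whose dependence is not specified in the abstract.

```latex
c_X(B_r) \;\ge\; C \,(\log r)^{1/q}, \qquad r \ge 2 .
```

The abstract further states that this lower bound is sharp when $X = L^p$ with $1 < p < \infty$.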
Submitted 22 July, 2022;
originally announced July 2022.
-
Video + CLIP Baseline for Ego4D Long-term Action Anticipation
Authors:
Srijan Das,
Michael S. Ryoo
Abstract:
In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pre-trained paired image-text model, CLIP, and a video encoder, the SlowFast network. The CLIP embedding provides fine-grained understanding of objects relevant for an action, whereas the SlowFast network is responsible for modeling temporal information within a video clip of a few frames. We show that the features obtained from both encoders are complementary to each other, thus outperforming the baseline on Ego4D for the task of long-term action anticipation. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.
Submitted 1 July, 2022;
originally announced July 2022.
-
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space
Authors:
Jinghuan Shang,
Srijan Das,
Michael S. Ryoo
Abstract:
Humans are remarkably flexible in understanding viewpoint changes due to the visual cortex supporting the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens, trained in an unsupervised fashion. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our code is available at https://github.com/elicassion/3DTRL.
Submitted 12 January, 2023; v1 submitted 23 June, 2022;
originally announced June 2022.
-
Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?
Authors:
Xiang Li,
Jinghuan Shang,
Srijan Das,
Michael S. Ryoo
Abstract:
We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct extensive experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over baselines that only take advantage of image augmentation when the same amount of data and augmentation is used. We further perform evolutionary searches to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. After evaluating these approaches together in multiple different environments including a real-world robot environment, we confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we conduct an ablation study on multiple factors and demonstrate the properties of representations learned with different approaches.
Submitted 13 January, 2023; v1 submitted 10 June, 2022;
originally announced June 2022.
-
Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
Authors:
Srijan Das,
Michael S. Ryoo
Abstract:
Contrastive representation learning of videos relies heavily on the availability of millions of unlabelled videos. This is practical for videos available on the web, but acquiring videos at such a scale for real-world applications is very expensive and laborious.
Therefore, in this paper we focus on designing video augmentation for self-supervised learning. We first analyze the best strategy to mix videos to create a new augmented video sample. Then, the question remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy STC-mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video representations. We conduct thorough experiments for two downstream tasks, action recognition and video retrieval, on two small-scale video datasets, UCF101 and HMDB51. We also demonstrate the effectiveness of our STC-mix on the NTU dataset, where domain knowledge is limited.
We show that the performance of our STC-mix on both downstream tasks is on par with other self-supervised approaches while requiring less training data.
Submitted 27 July, 2023; v1 submitted 7 December, 2021;
originally announced December 2021.
-
ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints
Authors:
Srijan Das,
Michael S. Ryoo
Abstract:
Learning self-supervised video representations predominantly focuses on discriminating instances generated from simple data augmentation schemes. However, the learned representation often fails to generalize over unseen camera viewpoints. To this end, we propose ViewCLR, which learns self-supervised video representations invariant to camera viewpoint changes. We introduce a view-generator, which can be considered a learnable augmentation for any self-supervised pretext task, to generate a latent viewpoint representation of a video. ViewCLR maximizes the similarity between the latent viewpoint representation and its representation from the original viewpoint, enabling the learned video encoder to generalize over unseen camera viewpoints. Experiments on cross-view benchmark datasets, including the NTU RGB+D dataset, show that ViewCLR stands as a state-of-the-art viewpoint-invariant self-supervised method.
Submitted 7 December, 2021;
originally announced December 2021.
-
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Authors:
Rui Dai,
Srijan Das,
Kumara Kahatapitiya,
Michael S. Ryoo,
Francois Bremond
Abstract:
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. The temporal relations in those datasets are complex, including challenges such as composite actions and co-occurring actions. For detecting actions in those complex videos, efficiently capturing both short-term and long-term temporal information in the video is critical. To this end, we propose a novel ConvTransformer network for action detection. This network comprises three main components: (1) a Temporal Encoder module that extensively explores global and local temporal relations at multiple temporal resolutions; (2) a Temporal Scale Mixer module that effectively fuses the multi-scale features into a unified feature representation; and (3) a Classification module that learns the instance center-relative position and predicts the frame-level classification scores. Extensive experiments on multiple datasets, including Charades, TSU and MultiTHUMOS, confirm the effectiveness of our proposed method. Our network outperforms the state-of-the-art methods on all three datasets.
Submitted 29 March, 2022; v1 submitted 7 December, 2021;
originally announced December 2021.
-
SWAT: Spatial Structure Within and Among Tokens
Authors:
Kumara Kahatapitiya,
Michael S. Ryoo
Abstract:
Modeling visual data as tokens (i.e., image patches) using attention mechanisms, feed-forward networks or convolutions has been highly effective in recent years. Such methods usually have a common pipeline: a tokenization method, followed by a set of layers/blocks for information mixing, both within and among tokens. When image patches are converted into tokens, they are often flattened, discarding the spatial structure within each patch. As a result, any processing that follows (e.g., multi-head self-attention) may fail to recover and/or benefit from such information. In this paper, we argue that models can have significant gains when spatial structure is preserved during tokenization and is explicitly used during the mixing stage. We propose two key contributions: (1) Structure-aware Tokenization and (2) Structure-aware Mixing, both of which can be combined with existing models with minimal effort. We introduce a family of models (SWAT), showing improvements over the likes of DeiT, MLP-Mixer and Swin Transformer, across multiple benchmarks including ImageNet classification and ADE20K segmentation. Our code is available at https://github.com/kkahatapitiya/SWAT.
Submitted 20 November, 2023; v1 submitted 26 November, 2021;
originally announced November 2021.
-
Weakly-guided Self-supervised Pretraining for Temporal Activity Detection
Authors:
Kumara Kahatapitiya,
Zhou Ren,
Haoxiang Li,
Zhenyu Wu,
Michael S. Ryoo,
Gang Hua
Abstract:
Temporal Activity Detection aims to predict activity classes per frame, in contrast to video-level predictions in Activity Classification (i.e., Activity Recognition). Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited. Thus, commonly, previous work on temporal activity detection resorts to fine-tuning a classification model pretrained on large-scale classification datasets (e.g., Kinetics-400). However, such pretrained models are not ideal for downstream detection, due to the disparity between the pretraining and the downstream fine-tuning tasks. In this work, we propose a novel 'weakly-guided self-supervised' pretraining method for detection. We leverage weak labels (classification) to introduce a self-supervised pretext task (detection) by generating frame-level pseudo labels, multi-action frames, and action segments. Simply put, we design a detection task similar to downstream, on large-scale classification data, without extra annotations. We show that the models pretrained with the proposed weakly-guided self-supervised detection task outperform prior work on multiple challenging activity detection benchmarks, including Charades and MultiTHUMOS. Our extensive ablations further provide insights on when and how to use the proposed models for activity detection. Code is available at https://github.com/kkahatapitiya/SSDet.
Submitted 4 February, 2023; v1 submitted 26 November, 2021;
originally announced November 2021.
-
StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning
Authors:
Jinghuan Shang,
Kumara Kahatapitiya,
Xiang Li,
Michael S. Ryoo
Abstract:
Reinforcement Learning (RL) can be considered as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. Our approach first extracts StAR-representations by self-attending image state patches, action, and reward tokens within a short temporal window. These are then combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experiments show that StARformer outperforms the state-of-the-art Transformer-based method on image-based Atari and DeepMind Control Suite benchmarks, in both offline-RL and imitation learning settings. StARformer is also more compliant with longer sequences of inputs. Our code is available at https://github.com/elicassion/StARformer.
Submitted 3 January, 2023; v1 submitted 12 October, 2021;
originally announced October 2021.
-
Hybrid Random Features
Authors:
Krzysztof Choromanski,
Haoxian Chen,
Han Lin,
Yuanzhe Ma,
Arijit Sehanobish,
Deepali Jain,
Michael S Ryoo,
Jake Varley,
Andy Zeng,
Valerii Likhosherstov,
Dmitry Kalashnikov,
Vikas Sindhwani,
Adrian Weller
Abstract:
We propose a new class of random feature methods for linearizing softmax and Gaussian kernels, called hybrid random features (HRFs), that automatically adapt the quality of kernel estimation to provide the most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees: unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct an exhaustive empirical evaluation of HRFs, ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure, to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating their quality in a wide spectrum of machine learning problems.
Submitted 30 January, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
-
Asymptotic formation and orbital stability of phase-locked states in Kuramoto--Lohe type synchronization models on Lie groups
Authors:
Seung-Yeon Ryoo
Abstract:
Some mathematical models of synchronization, such as the Kuramoto model (1975) and its generalizations pioneered by Lohe (2009), are formulated as ordinary differential equations describing populations of particles on Lie groups with locally attractive interactions. We suggest a model of synchronization on Lie groups and present a framework to understand the formation of phase-locked states and their orbital stability. This is a sequel to a previous joint work with Ha and Ko (2017).
Submitted 4 January, 2025; v1 submitted 29 September, 2021;
originally announced September 2021.
-
4D-Net for Learned Multi-Modal Alignment
Authors:
AJ Piergiovanni,
Vincent Casser,
Michael S. Ryoo,
Anelia Angelova
Abstract:
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines on the Waymo Open Dataset. 4D-Net is better able to use motion cues and dense image information to detect distant objects more successfully.
Submitted 2 September, 2021;
originally announced September 2021.
-
Self-Supervised Disentangled Representation Learning for Third-Person Imitation Learning
Authors:
Jinghuan Shang,
Michael S. Ryoo
Abstract:
Humans learn to imitate by observing others. However, robot imitation learning generally requires expert demonstrations in the first-person view (FPV). Collecting such FPV videos for every robot could be very expensive. Third-person imitation learning (TPIL) is the concept of learning action policies by observing other agents in a third-person view (TPV), similar to what humans do. This ultimately allows utilizing human and robot demonstration videos in TPV from many different data sources for policy learning. In this paper, we present a TPIL approach for robot tasks with egomotion. Although many robot tasks with ground/aerial mobility involve actions with camera egomotion, the study of TPIL for such tasks has been limited. Here, FPV and TPV observations are visually very different; FPV shows egomotion while the agent appearance is only observable in TPV. To enable better state learning for TPIL, we propose our disentangled representation learning method. We use a dual auto-encoder structure plus a representation permutation loss and a time-contrastive loss to ensure the state and viewpoint representations are well disentangled. Our experiments show the effectiveness of our approach.
Submitted 2 August, 2021;
originally announced August 2021.
-
Unsupervised Discovery of Actions in Instructional Videos
Authors:
AJ Piergiovanni,
Anelia Angelova,
Michael S. Ryoo,
Irfan Essa
Abstract:
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which can, for example, automatically 'read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and which provides automatic and unsupervised self-labeling for videos. Our approach outperforms the state-of-the-art unsupervised methods by large margins. We will open source the code.
Submitted 28 June, 2021;
originally announced June 2021.
-
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Authors:
Michael S. Ryoo,
AJ Piergiovanni,
Anurag Arnab,
Mostafa Dehghani,
Anelia Angelova
Abstract:
In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we achieve competitive results with a significantly reduced amount of compute. We obtain results comparable to the state of the art on ImageNet while being computationally more efficient. We also confirm the effectiveness of the approach on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD.
The code is available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner
Submitted 3 April, 2022; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Unsupervised Action Segmentation for Instructional Videos
Authors:
AJ Piergiovanni,
Anelia Angelova,
Michael S. Ryoo,
Irfan Essa
Abstract:
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos based on a sequential stochastic autoregressive model for temporal segmentation of videos. The model learns to represent and discover the sequential relationship between different atomic actions of the task, and provides automatic and unsupervised self-labeling.
Submitted 7 June, 2021;
originally announced June 2021.
-
Recognizing Actions in Videos from Unseen Viewpoints
Authors:
AJ Piergiovanni,
Michael S. Ryoo
Abstract:
Standard methods for video recognition use large CNNs designed to capture spatio-temporal data. However, training these models requires a large amount of labeled training data, containing a wide variety of actions, scenes, settings and camera viewpoints. In this paper, we show that current convolutional neural network models are unable to recognize actions from camera viewpoints not present in their training data (i.e., unseen view action recognition). To address this, we develop approaches based on 3D representations and introduce a new geometric convolutional layer that can learn viewpoint-invariant representations. Further, we introduce a new, challenging dataset for unseen view recognition and show the approaches' ability to learn viewpoint-invariant representations.
Submitted 30 March, 2021;
originally announced March 2021.
-
Visionary: Vision architecture discovery for robot learning
Authors:
Iretiayo Akinola,
Anelia Angelova,
Yao Lu,
Yevgen Chebotar,
Dmitry Kalashnikov,
Jacob Varley,
Julian Ibarz,
Michael S. Ryoo
Abstract:
We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs. Our approach automatically designs architectures while training on the task, discovering novel ways of combining and attending to image feature representations with actions as well as features from previous layers. The resulting architectures demonstrate better task success rates, in some cases by a large margin, compared to a recent high-performing baseline. Our real robot experiments also confirm that it improves grasping performance by 6%. This is the first approach to demonstrate a successful neural architecture search and attention connectivity search for a real-robot task.
Submitted 26 March, 2021;
originally announced March 2021.
-
Coarse-Fine Networks for Temporal Activity Detection in Videos
Authors:
Kumara Kahatapitiya,
Michael S. Ryoo
Abstract:
In this paper, we introduce Coarse-Fine Networks, a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion. Traditional video models process inputs at one (or few) fixed temporal resolution without any dynamic frame selection. However, we argue that processing multiple temporal resolutions of the input, and doing so dynamically by learning to estimate the importance of each frame, can largely improve video representations, especially in the domain of temporal activity localization. To this end, we propose (1) Grid Pool, a learned temporal downsampling layer to extract coarse features, and (2) Multi-stage Fusion, a spatio-temporal attention mechanism to fuse a fine-grained context with the coarse features. We show that our method outperforms the state of the art for action detection in public datasets including Charades, with a significantly reduced compute and memory footprint. The code is available at https://github.com/kkahatapitiya/Coarse-Fine-Networks
Submitted 1 April, 2021; v1 submitted 1 March, 2021;
originally announced March 2021.
-
Reducing Inference Latency with Concurrent Architectures for Image Recognition
Authors:
Ramyad Hadidi,
Jiashen Cao,
Michael S. Ryoo,
Hyesoon Kim
Abstract:
Satisfying the high computation demand of modern deep learning architectures while achieving low inference latency is challenging. Current approaches to decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with a higher concurrency (i.e., simultaneous execution of one inference among devices). Such single-chain dependencies are so widespread that they even implicitly bias recent neural architecture search (NAS) studies. In this visionary paper, we draw attention to an entirely new space of NAS that relaxes the single-chain dependency to provide higher concurrency and distribution opportunities. To quantitatively compare these architectures, we propose a score that encapsulates crucial metrics such as communication, concurrency, and load balancing. Additionally, we propose a new generator and transformation block that consistently deliver superior architectures compared to current state-of-the-art methods. Finally, our preliminary results show that these new architectures reduce the inference latency and deserve more attention.
Submitted 13 November, 2020;
originally announced November 2020.