An Empirical Study of Autoregressive Pre-training from Videos

Authors: Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik

Abstract: We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/

Submitted 9 January, 2025; originally announced January 2025.

arXiv:2411.08034 [pdf, other]

Scaling Properties of Diffusion Models for Perceptual Tasks

Authors: Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

Abstract: In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io .

Submitted 16 November, 2024; v1 submitted 12 November, 2024; originally announced November 2024.

arXiv:2405.10716 [pdf]

Scanning Acoustic Microscopy for Quantifying Two-phase Transfer in Operando Alkaline Water Electrolyzer

Authors: Zehua Dou, Hannes Rox, Zyzi Ramos, Robert Baumann, Rachappa Ravishankar, Peter Czurratis, Xuegeng Yang, Andrés Fabian Lasagni, Kerstin Eckert, Juergen Czarske, David Weik

Abstract: Improved understandings of two-phase transport in electrochemical gas-evolving systems are increasingly demanded, while high-performance imaging techniques using simplified instrumentations are not readily available. This work presents volumetric scanning acoustic microscopy (SAM) imaging for quantifying the dynamics of gas bubbles and electrolyte in porous Nickel electrodes with different wettability and structures during alkaline water electrolysis (AWE). We realize high-resolution 3D imaging at 10's um level using high frequency spherically focused ultrasound. The high resolution allowed us to clearly visualize the spatial distributions of produced bubbles in the porous electrodes over time. Moreover, we are able to quantify the residual gas volume in an electrode and its coverage due to bubble evolution, which dominate its transport overpotential. Taking these advantages, we elucidate the impacts of electrodes' wettability and structures on their electrolysis performance, on a regular laboratory base. The obtained knowledge provides us important optimization guidelines of AWE designs and operating schemes.

Submitted 17 May, 2024; originally announced May 2024.

Comments: Research article on an emerging field. 33 pages, 6 figures, 61 references, 10 supplementary figures available. Journal submission in progress

arXiv:2404.10151 [pdf, other]

Distributing Context-Aware Shared Memory Data Structures: A Case Study on Singly-Linked Lists

Authors: Raaghav Ravishankar, Sandeep Kulkarni, Sathya Peri, Gokarna Sharma

Abstract: In this paper, we study the partitioning of a context-aware shared memory data structure so that it can be implemented as a distributed data structure running on multiple machines. By context-aware data structures, we mean that the result of an operation not only depends upon the value of the shared data but also upon the previous operations performed by the same client. While there is substantial work on designing distributed data structures, designing distributed context-aware data structures has not received much attention. We focus on singly-linked lists as a case study of the context-aware data structure. We start with a shared memory context-aware lock-free singly-linked list and show how it can be transformed into a distributed lock-free context-aware singly-linked list. The main challenge in such a transformation is to preserve properties of client-visible operations of the underlying data structure. We present two protocols that preserve these properties of client-visible operations of the linked list. In the first protocol, the distribution is done in the background as a low priority task, while in the second protocol the client-visible operations help the task of distribution without affecting client latency. In both protocols, the client-visible operations remain lock-free. Also, our transformation approach does not utilize any hardware primitives (except a compare-and-swap operation on a single word). We note that our transformation is generic and can be used for other lock-free context-aware data structures that can be constructed from singly-linked lists.

Submitted 24 May, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.09138 [pdf, other]

From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation

Authors: Artur Kiulian, Anton Polishko, Mykola Khandoga, Oryna Chubych, Jack Connor, Raghav Ravishankar, Adarsh Shirawalmath

Abstract: In the rapidly advancing field of AI and NLP, generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology. Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve their linguistic proficiency and benchmarking them against other existing models capable of processing Ukrainian language. This endeavor not only aims to mitigate language bias in technology but also promotes inclusivity in the digital realm. Our transparent and reproducible approach encourages further NLP research and development. Additionally, we present the Ukrainian Knowledge and Instruction Dataset (UKID) to aid future efforts in language model fine-tuning. Our research not only advances the field of NLP but also highlights the importance of linguistic diversity in AI, which is crucial for cultural preservation, education, and expanding AI's global utility. Ultimately, we advocate for a future where technology is inclusive, enabling AI to communicate effectively across all languages, especially those currently underrepresented.

Submitted 14 April, 2024; originally announced April 2024.

arXiv:2402.07912 [pdf, other]

Spatial Computing: Concept, Applications, Challenges and Future Directions

Authors: Gokul Yenduri, Ramalingam M, Praveen Kumar Reddy Maddikunta, Thippa Reddy Gadekallu, Rutvij H Jhaveri, Ajay Bandi, Junxin Chen, Wei Wang, Adarsh Arunkumar Shirawalmath, Raghav Ravishankar, Weizheng Wang

Abstract: Spatial computing is a technological advancement that facilitates the seamless integration of devices into the physical environment, resulting in a more natural and intuitive digital world user experience. Spatial computing has the potential to become a significant advancement in the field of computing. From GPS and location-based services to healthcare, spatial computing technologies have influenced and improved our interactions with the digital world. The use of spatial computing in creating interactive digital environments has become increasingly popular and effective. This is explained by its increasing significance among researchers and industrial organisations, which motivated us to conduct this review. This review provides a detailed overview of spatial computing, including its enabling technologies and its impact on various applications. Projects related to spatial computing are also discussed. In this review, we also explored the potential challenges and limitations of spatial computing. Furthermore, we discuss potential solutions and future directions. Overall, this paper aims to provide a comprehensive understanding of spatial computing, its enabling technologies, their impact on various applications, emerging challenges, and potential solutions.

Submitted 30 January, 2024; originally announced February 2024.

Comments: Submitted to peer review

arXiv:2211.13856 [pdf]

WSSL: Weighted Self-supervised Learning Framework For Image-inpainting

Authors: Shubham Gupta, Rahul Kunigal Ravishankar, Madhoolika Gangaraju, Poojasree Dwarkanath, Natarajan Subramanyam

Abstract: Image inpainting is the process of regenerating lost parts of the image. Supervised algorithm-based methods have shown excellent results but have two significant drawbacks. They do not perform well when tested with unseen data. They fail to capture the global context of the image, resulting in a visually unappealing result. We propose a novel self-supervised learning framework for image-inpainting: Weighted Self-Supervised Learning (WSSL) to tackle these problems. We designed WSSL to learn features from multiple weighted pretext tasks. These features are then utilized for the downstream task, image-inpainting. To improve the performance of our framework and produce more visually appealing images, we also present a novel loss function for image inpainting. The loss function takes advantage of both reconstruction loss and perceptual loss functions to regenerate the image. Our experimentation shows WSSL outperforms previous methods, and our loss function helps produce better results.

Submitted 24 August, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: 9 Pages, document submitted for publication at CGVCVIP 2022 - ISBN 978-989-8704-42-9

Showing 1–7 of 7 results for author: Ravishankar, R