-
OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation
Authors:
Raktim Gautam Goswami,
Prashanth Krishnamurthy,
Yann LeCun,
Farshad Khorrami
Abstract:
Visual imitation learning enables robotic agents to acquire skills by observing expert demonstration videos. In the one-shot setting, the agent generates a policy after observing a single expert demonstration without additional fine-tuning. Existing approaches typically train and evaluate on the same set of tasks, varying only object configurations, and struggle to generalize to unseen tasks with…
▽ More
Visual imitation learning enables robotic agents to acquire skills by observing expert demonstration videos. In the one-shot setting, the agent generates a policy after observing a single expert demonstration without additional fine-tuning. Existing approaches typically train and evaluate on the same set of tasks, varying only object configurations, and struggle to generalize to unseen tasks with different semantic or structural requirements. While some recent methods attempt to address this, they exhibit low success rates on hard test tasks that, despite being visually similar to some training tasks, differ in context and require distinct responses. Additionally, most existing methods lack an explicit model of environment dynamics, limiting their ability to reason about future states. To address these limitations, we propose a novel framework for one-shot visual imitation learning via world-model-guided trajectory generation. Given an expert demonstration video and the agent's initial observation, our method leverages a learned world model to predict a sequence of latent states and actions. This latent trajectory is then decoded into physical waypoints that guide the agent's execution. Our method is evaluated on two simulated benchmarks and three real-world robotic platforms, where it consistently outperforms prior approaches, with over 30% improvement in some cases.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
A Chain-Driven, Sandwich-Legged Quadruped Robot: Design and Experimental Analysis
Authors:
Aman Singh,
Bhavya Giri Goswami,
Ketan Nehete,
Shishir N. Y. Kolathaya
Abstract:
This paper introduces a chain-driven, sandwich-legged, mid-size quadruped robot designed as an accessible research platform. The design prioritizes enhanced locomotion capabilities, improved reliability and safety of the actuation system, and simplified, cost-effective manufacturing processes. Locomotion performance is optimized through a sandwiched leg design and a dual-motor configuration, reduc…
▽ More
This paper introduces a chain-driven, sandwich-legged, mid-size quadruped robot designed as an accessible research platform. The design prioritizes enhanced locomotion capabilities, improved reliability and safety of the actuation system, and simplified, cost-effective manufacturing processes. Locomotion performance is optimized through a sandwiched leg design and a dual-motor configuration, reducing leg inertia for agile movements. Reliability and safety are achieved by integrating robust cable strain reliefs, efficient heat sinks for motor thermal management, and mechanical limits to restrict leg motion. Simplified design considerations include a quasi-direct drive (QDD) actuator and the adoption of low-cost fabrication techniques, such as laser cutting and 3D printing, to minimize cost and ensure rapid prototyping. The robot weighs approximately 25 kg and is developed at a cost under \$8000, making it a scalable and affordable solution for robotics research. Experimental validations demonstrate the platform's capability to execute trot and crawl gaits on flat terrain and slopes, highlighting its potential as a versatile and reliable quadruped research platform.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
Authors:
Raktim Gautam Goswami,
Prashanth Krishnamurthy,
Yann LeCun,
Farshad Khorrami
Abstract:
Vision-based pose estimation of articulated robots with unknown joint angles has applications in collaborative robotics and human-robot interaction tasks. Current frameworks use neural network encoders to extract image features and downstream layers to predict joint angles and robot pose. While images of robots inherently contain rich information about the robot's physical structures, existing met…
▽ More
Vision-based pose estimation of articulated robots with unknown joint angles has applications in collaborative robotics and human-robot interaction tasks. Current frameworks use neural network encoders to extract image features and downstream layers to predict joint angles and robot pose. While images of robots inherently contain rich information about the robot's physical structures, existing methods often fail to leverage it fully; therefore, limiting performance under occlusions and truncations. To address this, we introduce RoboPEPP, a method that fuses information about the robot's physical model into the encoder using a masking-based self-supervised embedding-predictive architecture. Specifically, we mask the robot's joints and pre-train an encoder-predictor model to infer the joints' embeddings from surrounding unmasked regions, enhancing the encoder's understanding of the robot's physical model. The pre-trained encoder-predictor pair, along with joint angle and keypoint prediction networks, is then fine-tuned for pose and joint angle estimation. Random masking of input during fine-tuning and keypoint filtering during evaluation further improves robustness. Our method, evaluated on several datasets, achieves the best results in robot pose and joint angle estimation while being the least sensitive to occlusions and requiring the lowest execution time.
△ Less
Submitted 2 May, 2025; v1 submitted 26 November, 2024;
originally announced November 2024.
-
OrionNav: Online Planning for Robot Autonomy with Context-Aware LLM and Open-Vocabulary Semantic Scene Graphs
Authors:
Venkata Naren Devarakonda,
Raktim Gautam Goswami,
Ali Umut Kaypak,
Naman Patel,
Rooholla Khorrambakht,
Prashanth Krishnamurthy,
Farshad Khorrami
Abstract:
Enabling robots to autonomously navigate unknown, complex, dynamic environments and perform diverse tasks remains a fundamental challenge in developing robust autonomous physical agents. These agents must effectively perceive their surroundings while leveraging world knowledge for decision-making. Although recent approaches utilize vision-language and large language models for scene understanding…
▽ More
Enabling robots to autonomously navigate unknown, complex, dynamic environments and perform diverse tasks remains a fundamental challenge in developing robust autonomous physical agents. These agents must effectively perceive their surroundings while leveraging world knowledge for decision-making. Although recent approaches utilize vision-language and large language models for scene understanding and planning, they often rely on offline processing, offboard compute, make simplifying assumptions about the environment and perception, limiting real-world applicability. We present a novel framework for real-time onboard autonomous navigation in unknown environments that change over time by integrating multi-level abstraction in both perception and planning pipelines. Our system fuses data from multiple onboard sensors for localization and mapping and integrates it with open-vocabulary semantics to generate hierarchical scene graphs from continuously updated semantic object map. The LLM-based planner uses these graphs to create multi-step plans that guide low-level controllers in executing navigation tasks specified in natural language. The system's real-time operation enables the LLM to adjust its plans based on updates to the scene graph and task execution status, ensuring continuous adaptation to new situations or when the current plan cannot accomplish the task, a key advantage over static or rule-based systems. We demonstrate our system's efficacy on a quadruped navigating dynamic environments, showcasing its adaptability and robustness in diverse scenarios.
△ Less
Submitted 22 October, 2024; v1 submitted 8 October, 2024;
originally announced October 2024.
-
FlashMix: Fast Map-Free LiDAR Localization via Feature Mixing and Contrastive-Constrained Accelerated Training
Authors:
Raktim Gautam Goswami,
Naman Patel,
Prashanth Krishnamurthy,
Farshad Khorrami
Abstract:
Map-free LiDAR localization systems accurately localize within known environments by predicting sensor position and orientation directly from raw point clouds, eliminating the need for large maps and descriptors. However, their long training times hinder rapid adaptation to new environments. To address this, we propose FlashMix, which uses a frozen, scene-agnostic backbone to extract local point d…
▽ More
Map-free LiDAR localization systems accurately localize within known environments by predicting sensor position and orientation directly from raw point clouds, eliminating the need for large maps and descriptors. However, their long training times hinder rapid adaptation to new environments. To address this, we propose FlashMix, which uses a frozen, scene-agnostic backbone to extract local point descriptors, aggregated with an MLP mixer to predict sensor pose. A buffer of local descriptors is used to accelerate training by orders of magnitude, combined with metric learning or contrastive loss regularization of aggregated descriptors to improve performance and convergence. We evaluate FlashMix on various LiDAR localization benchmarks, examining different regularizations and aggregators, demonstrating its effectiveness for rapid and accurate LiDAR localization in real-world scenarios. The code is available at https://github.com/raktimgg/FlashMix.
△ Less
Submitted 27 September, 2024;
originally announced October 2024.
-
SALSA: Swift Adaptive Lightweight Self-Attention for Enhanced LiDAR Place Recognition
Authors:
Raktim Gautam Goswami,
Naman Patel,
Prashanth Krishnamurthy,
Farshad Khorrami
Abstract:
Large-scale LiDAR mappings and localization leverage place recognition techniques to mitigate odometry drifts, ensuring accurate mapping. These techniques utilize scene representations from LiDAR point clouds to identify previously visited sites within a database. Local descriptors, assigned to each point within a point cloud, are aggregated to form a scene representation for the point cloud. Thes…
▽ More
Large-scale LiDAR mappings and localization leverage place recognition techniques to mitigate odometry drifts, ensuring accurate mapping. These techniques utilize scene representations from LiDAR point clouds to identify previously visited sites within a database. Local descriptors, assigned to each point within a point cloud, are aggregated to form a scene representation for the point cloud. These descriptors are also used to re-rank the retrieved point clouds based on geometric fitness scores. We propose SALSA, a novel, lightweight, and efficient framework for LiDAR place recognition. It consists of a Sphereformer backbone that uses radial window attention to enable information aggregation for sparse distant points, an adaptive self-attention layer to pool local descriptors into tokens, and a multi-layer-perceptron Mixer layer for aggregating the tokens to generate a scene descriptor. The proposed framework outperforms existing methods on various LiDAR place recognition datasets in terms of both retrieval and metric localization while operating in real-time.
△ Less
Submitted 30 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
A Collision Cone Approach for Control Barrier Functions
Authors:
Manan Tayal,
Bhavya Giri Goswami,
Karthik Rajgopal,
Rajpal Singh,
Tejas Rao,
Jishnu Keshavan,
Pushpak Jagtap,
Shishir Kolathaya
Abstract:
This work presents a unified approach for collision avoidance using Collision-Cone Control Barrier Functions (CBFs) in both ground (UGV) and aerial (UAV) unmanned vehicles. We propose a novel CBF formulation inspired by collision cones, to ensure safety by constraining the relative velocity between the vehicle and the obstacle to always point away from each other. The efficacy of this approach is…
▽ More
This work presents a unified approach for collision avoidance using Collision-Cone Control Barrier Functions (CBFs) in both ground (UGV) and aerial (UAV) unmanned vehicles. We propose a novel CBF formulation inspired by collision cones, to ensure safety by constraining the relative velocity between the vehicle and the obstacle to always point away from each other. The efficacy of this approach is demonstrated through simulations and hardware implementations on the TurtleBot, Stoch-Jeep, and Crazyflie 2.1 quadrotor robot, showcasing its effectiveness in avoiding collisions with dynamic obstacles in both ground and aerial settings. The real-time controller is developed using CBF Quadratic Programs (CBF-QPs). Comparative analysis with the state-of-the-art CBFs highlights the less conservative nature of the proposed approach. Overall, this research contributes to a novel control formation that can give a guarantee for collision avoidance in unmanned vehicles by modifying the control inputs from existing path-planning controllers.
△ Less
Submitted 11 March, 2024;
originally announced March 2024.
-
Collision Cone Control Barrier Functions: Experimental Validation on UGVs for Kinematic Obstacle Avoidance
Authors:
Bhavya Giri Goswami,
Manan Tayal,
Karthik Rajgopal,
Pushpak Jagtap,
Shishir Kolathaya
Abstract:
Autonomy advances have enabled robots in diverse environments and close human interaction, necessitating controllers with formal safety guarantees. This paper introduces an experimental platform designed for the validation and demonstration of a novel class of Control Barrier Functions (CBFs) tailored for Unmanned Ground Vehicles (UGVs) to proactively prevent collisions with kinematic obstacles by…
▽ More
Autonomy advances have enabled robots in diverse environments and close human interaction, necessitating controllers with formal safety guarantees. This paper introduces an experimental platform designed for the validation and demonstration of a novel class of Control Barrier Functions (CBFs) tailored for Unmanned Ground Vehicles (UGVs) to proactively prevent collisions with kinematic obstacles by integrating the concept of collision cones. While existing CBF formulations excel with static obstacles, extensions to torque/acceleration-controlled unicycle and bicycle models have seen limited success. Conventional CBF applications in nonholonomic UGV models have demonstrated control conservatism, particularly in scenarios where steering/thrust control was deemed infeasible. Drawing inspiration from collision cones in path planning, we present a pioneering CBF formulation ensuring theoretical safety guarantees for both unicycle and bicycle models. The core premise revolves around aligning the obstacle's velocity away from the vehicle, establishing a constraint to perpetually avoid vectors directed towards it. This control methodology is rigorously validated through simulations and experimental verification on the Copernicus mobile robot (Unicycle Model) and FOCAS-Car (Bicycle Model).
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Control Barrier Functions in UGVs for Kinematic Obstacle Avoidance: A Collision Cone Approach
Authors:
Phani Thontepu,
Bhavya Giri Goswami,
Manan Tayal,
Neelaksh Singh,
Shyamsundar P I,
Shyam Sundar M G,
Suresh Sundaram,
Vaibhav Katewa,
Shishir Kolathaya
Abstract:
In this paper, we propose a new class of Control Barrier Functions (CBFs) for Unmanned Ground Vehicles (UGVs) that help avoid collisions with kinematic (non-zero velocity) obstacles. While the current forms of CBFs have been successful in guaranteeing safety/collision avoidance with static obstacles, extensions for the dynamic case have seen limited success. Moreover, with the UGV models like the…
▽ More
In this paper, we propose a new class of Control Barrier Functions (CBFs) for Unmanned Ground Vehicles (UGVs) that help avoid collisions with kinematic (non-zero velocity) obstacles. While the current forms of CBFs have been successful in guaranteeing safety/collision avoidance with static obstacles, extensions for the dynamic case have seen limited success. Moreover, with the UGV models like the unicycle or the bicycle, applications of existing CBFs have been conservative in terms of control, i.e., steering/thrust control has not been possible under certain scenarios. Drawing inspiration from the classical use of collision cones for obstacle avoidance in trajectory planning, we introduce its novel CBF formulation with theoretical guarantees on safety for both the unicycle and bicycle models. The main idea is to ensure that the velocity of the obstacle w.r.t. the vehicle is always pointing away from the vehicle. Accordingly, we construct a constraint that ensures that the velocity vector always avoids a cone of vectors pointing at the vehicle. The efficacy of this new control methodology is later verified by Pybullet simulations on TurtleBot3 and F1Tenth.
△ Less
Submitted 16 October, 2023; v1 submitted 23 September, 2022;
originally announced September 2022.
-
Design, Manufacturing, and Controls of a Prismatic Quadruped Robot: PRISMA
Authors:
Team Robocon,
IIT Roorkee,
:,
Bhavya Giri Goswami,
Aman Verma,
Gautam Jha,
Vandan Gajjar,
Vedant Neekhra,
Utkarsh Deepak,
Aayush Singh Chauhan
Abstract:
Most of the quadrupeds developed are highly actuated, and their control is hence quite cumbersome. They need advanced electronics equipment to solve convoluted inverse kinematic equations continuously. In addition, they demand special and costly sensors to autonomously navigate through the environment as traditional distance sensors usually fail because of the continuous perturbation due to the mo…
▽ More
Most of the quadrupeds developed are highly actuated, and their control is hence quite cumbersome. They need advanced electronics equipment to solve convoluted inverse kinematic equations continuously. In addition, they demand special and costly sensors to autonomously navigate through the environment as traditional distance sensors usually fail because of the continuous perturbation due to the motion of the robot. Another challenge is maintaining the continuous dynamic stability of the robot while walking, which requires complicated and state-of-the-art control algorithms. This paper presents a thorough description of the hardware design and control architecture of our in-house prismatic joint quadruped robot called the PRISMA. We aim to forge a robust and kinematically stable quadruped robot that can use elementary control algorithms and utilize conventional sensors to navigate an unknown environment. We discuss the benefits and limitations of the robot in terms of its motion, different foot trajectories, manufacturability, and controls.
△ Less
Submitted 26 December, 2021;
originally announced December 2021.
-
Phase Aware Speech Enhancement using Realisation of Complex-valued LSTM
Authors:
Raktim Gautam Goswami,
Sivaganesh Andhavarapu,
K Sri Rama Murty
Abstract:
Most of the deep learning based speech enhancement (SE) methods rely on estimating the magnitude spectrum of the clean speech signal from the observed noisy speech signal, either by magnitude spectral masking or regression. These methods reuse the noisy phase while synthesizing the time-domain waveform from the estimated magnitude spectrum. However, there have been recent works highlighting the im…
▽ More
Most of the deep learning based speech enhancement (SE) methods rely on estimating the magnitude spectrum of the clean speech signal from the observed noisy speech signal, either by magnitude spectral masking or regression. These methods reuse the noisy phase while synthesizing the time-domain waveform from the estimated magnitude spectrum. However, there have been recent works highlighting the importance of phase in SE. There was an attempt to estimate the complex ratio mask taking phase into account using complex-valued feed-forward neural network (FFNN). But FFNNs cannot capture the sequential information essential for phase estimation. In this work, we propose a realisation of complex-valued long short-term memory (RCLSTM) network to estimate the complex ratio mask (CRM) using sequential information along time. The proposed RCLSTM is designed to process the complex-valued sequences using complex arithmetic, and hence it preserves the dependencies between the real and imaginary parts of CRM and thereby the phase. The proposed method is evaluated on the noisy speech mixtures formed from the Voice-Bank corpus and DEMAND database. When compared to real value based masking methods, the proposed RCLSTM improves over them in several objective measures including perceptual evaluation of speech quality (PESQ), in which it improves by over 4.3%
△ Less
Submitted 27 October, 2020;
originally announced October 2020.
-
Unravelling Robustness of Deep Learning based Face Recognition Against Adversarial Attacks
Authors:
Gaurav Goswami,
Nalini Ratha,
Akshay Agarwal,
Richa Singh,
Mayank Vatsa
Abstract:
Deep neural network (DNN) architecture based models have high expressive power and learning capacity. However, they are essentially a black box method since it is not easy to mathematically formulate the functions that are learned within its many layers of representation. Realizing this, many researchers have started to design methods to exploit the drawbacks of deep learning based algorithms ques…
▽ More
Deep neural network (DNN) architecture based models have high expressive power and learning capacity. However, they are essentially a black box method since it is not easy to mathematically formulate the functions that are learned within its many layers of representation. Realizing this, many researchers have started to design methods to exploit the drawbacks of deep learning based algorithms questioning their robustness and exposing their singularities. In this paper, we attempt to unravel three aspects related to the robustness of DNNs for face recognition: (i) assessing the impact of deep architectures for face recognition in terms of vulnerabilities to attacks inspired by commonly observed distortions in the real world that are well handled by shallow learning methods along with learning based adversaries; (ii) detecting the singularities by characterizing abnormal filter response behavior in the hidden layers of deep networks; and (iii) making corrections to the processing pipeline to alleviate the problem. Our experimental evaluation using multiple open-source DNN-based face recognition networks, including OpenFace and VGG-Face, and two publicly available databases (MEDS and PaSC) demonstrates that the performance of deep learning based face recognition algorithms can suffer greatly in the presence of such distortions. The proposed method is also compared with existing detection algorithms and the results show that it is able to detect the attacks with very high accuracy by suitably designing a classifier using the response of the hidden layers in the network. Finally, we present several effective countermeasures to mitigate the impact of adversarial attacks and improve the overall robustness of DNN-based face recognition.
△ Less
Submitted 22 February, 2018;
originally announced March 2018.