Showing 1–3 of 3 results for author: Gavryushin, A

  1. arXiv:2504.06084  [pdf, other]

    cs.RO cs.CV

    MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos

    Authors: Alexey Gavryushin, Xi Wang, Robert J. S. Malate, Chenyu Yang, Xiangyi Jia, Shubh Goel, Davide Liconti, René Zurbrügg, Robert K. Katzschmann, Marc Pollefeys

    Abstract: Large-scale egocentric video datasets capture diverse human activities across a wide range of scenarios, offering rich and detailed insights into how humans interact with objects, especially those that require fine-grained dexterous control. Such complex, dexterous skills with precise controls are crucial for many robotic manipulation tasks, yet are often insufficiently addressed by traditional da…

    Submitted 8 April, 2025; originally announced April 2025.

  2. arXiv:2503.22869  [pdf, ps, other]

    cs.CV

    SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories

    Authors: Alexey Gavryushin, Alexandros Delitzas, Luc Van Gool, Marc Pollefeys, Kaichun Mo, Xi Wang

    Abstract: When humans grasp an object, they naturally form trajectories in their minds to manipulate it for specific tasks. Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems in learning to operate effectively within the physical world. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interact…

    Submitted 29 May, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

  3. arXiv:2301.09209  [pdf, other]

    cs.CV cs.CL

    Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

    Authors: Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo, Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang

    Abstract: We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vi…

    Submitted 10 March, 2024; v1 submitted 22 January, 2023; originally announced January 2023.