Showing 1–7 of 7 results for author: Mavroudi, E

  1. arXiv:2504.13180  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

    Authors: Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, et al. (4 additional authors not shown)

    Abstract: Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the…

    Submitted 23 July, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Technical Report

  2. arXiv:2504.05451  [pdf, other]

    cs.CV

    Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation

    Authors: Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

    Abstract: Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe view-occlusions. We first define a geometry-based metric t…

    Submitted 7 April, 2025; originally announced April 2025.

  3. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 25 September, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: Expanded manuscript (compared to arXiv v1 from Nov 2023 and CVPR 2024 paper from June 2024) for more comprehensive dataset and benchmark presentation, plus new results on v2 data release

  4. arXiv:2306.03802  [pdf, other]

    cs.CV cs.AI

    Learning to Ground Instructional Articles in Videos through Narrations

    Authors: Effrosyni Mavroudi, Triantafyllos Afouras, Lorenzo Torresani

    Abstract: In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: 17 pages, 4 figures and 10 tables

  5. arXiv:2302.08063  [pdf, other]

    cs.CV

    MINOTAUR: Multi-task Video Grounding From Multimodal Queries

    Authors: Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

    Abstract: Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i…

    Submitted 17 March, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

    Comments: 22 pages, 8 figures and 13 tables

  6. arXiv:1905.07385  [pdf, other]

    cs.CV

    Representation Learning on Visual-Symbolic Graphs for Video Understanding

    Authors: Effrosyni Mavroudi, Benjamín Béjar Haro, René Vidal

    Abstract: Events in natural videos typically arise from spatio-temporal interactions between actors and objects and involve multiple co-occurring activities and object classes. To capture this rich visual and semantic context, we propose using two graphs: (1) an attributed spatio-temporal visual graph whose nodes correspond to actors and objects and whose edges encode different types of interactions, and (2…

    Submitted 30 September, 2020; v1 submitted 17 May, 2019; originally announced May 2019.

    Comments: ECCV 2020

  7. arXiv:1801.09571  [pdf, other]

    cs.CV

    End-to-End Fine-Grained Action Segmentation and Recognition Using Conditional Random Field Models and Discriminative Sparse Coding

    Authors: Effrosyni Mavroudi, Divya Bhaskara, Shahin Sefati, Haider Ali, René Vidal

    Abstract: Fine-grained action segmentation and recognition is an important yet challenging task. Given a long, untrimmed sequence of kinematic data, the task is to classify the action at each time frame and segment the time series into the correct sequence of actions. In this paper, we propose a novel framework that combines a temporal Conditional Random Field (CRF) model with a powerful frame-level represe…

    Submitted 29 January, 2018; originally announced January 2018.

    Comments: Camera ready version accepted at WACV 2018