Showing 1–10 of 10 results for author: Damavandi, B

  1. arXiv:2504.13180

    cs.CV cs.AI cs.LG

    PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

    Authors: Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, et al. (4 additional authors not shown)

    Abstract: Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Technical report

  2. arXiv:2409.06107

    cs.CL cs.AI

    Doppelgänger's Watch: A Split Objective Approach to Large Language Models

    Authors: Shervin Ghasemlou, Ashish Katiyar, Aparajita Saraf, Seungwhan Moon, Mangesh Pujari, Pinar Donmez, Babak Damavandi, Anuj Kumar

    Abstract: In this paper, we investigate the problem of "generation supervision" in large language models, and present a novel bicameral architecture to separate supervision signals from their core capability, helpfulness. Doppelgänger, a new module parallel to the underlying language model, supervises the generation of each token, and learns to concurrently predict the supervision score(s) of the sequences…

    Submitted 9 September, 2024; originally announced September 2024.

  3. arXiv:2403.04735

    cs.CV

    SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM

    Authors: Jielin Qiu, Andrea Madotto, Zhaojiang Lin, Paul A. Crook, Yifan Ethan Xu, Xin Luna Dong, Christos Faloutsos, Lei Li, Babak Damavandi, Seungwhan Moon

    Abstract: Vision-extended LLMs have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric V…

    Submitted 7 March, 2024; originally announced March 2024.

  4. arXiv:2309.16058

    cs.LG cs.CL cs.CV

    AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

    Authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, Anuj Kumar

    Abstract: We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a…

    Submitted 27 September, 2023; originally announced September 2023.

  5. arXiv:2211.08462

    cs.CL

    Navigating Connected Memories with a Task-oriented Dialog System

    Authors: Seungwhan Moon, Satwik Kottur, Alborz Geramifard, Babak Damavandi

    Abstract: Recent years have seen an increasing trend in the volume of personal media captured by users, thanks to the advent of smartphones and smart glasses, resulting in large media collections. Despite conversation being an intuitive human-computer interface, current efforts focus mostly on single-shot natural language based media retrieval to aid users query their media and re-live their memories. This…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: 13 pages, 3 tables, 9 figures

  6. arXiv:2211.03940

    cs.CL

    Tell Your Story: Task-Oriented Dialogs for Interactive Content Creation

    Authors: Satwik Kottur, Seungwhan Moon, Aram H. Markosyan, Hardik Shah, Babak Damavandi, Alborz Geramifard

    Abstract: People capture photos and videos to relive and share memories of personal significance. Recently, media montages (stories) have become a popular mode of sharing these memories due to their intuitive and powerful storytelling capabilities. However, creating such montages usually involves a lot of manual searches, clicks, and selections that are time-consuming and cumbersome, adversely affecting use…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: 8 pages, 6 figures, 2 tables

  7. arXiv:2210.14395

    cs.CV cs.CL cs.LG

    IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors from Egocentric Videos and Text

    Authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Alireza Dirafzoon, Aparajita Saraf, Amy Bearman, Babak Damavandi

    Abstract: We present IMU2CLIP, a novel pre-training approach to align Inertial Measurement Unit (IMU) motion sensor recordings with video and text, by projecting them into the joint representation space of Contrastive Language-Image Pre-training (CLIP). The proposed approach allows IMU2CLIP to translate human motions (as measured by IMU sensors) into their corresponding textual descriptions and videos -- wh…

    Submitted 25 October, 2022; originally announced October 2022.

  8. arXiv:2105.05964

    cs.CV

    Connecting What to Say With Where to Look by Modeling Human Attention Traces

    Authors: Zihang Meng, Licheng Yu, Ning Zhang, Tamara Berg, Babak Damavandi, Vikas Singh, Amy Bearman

    Abstract: We introduce a unified framework to jointly model images, text, and human attention traces. Our work is built on top of the recent Localized Narratives annotation framework [30], where each word of a given caption is paired with a mouse trace segment. We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding), and (2) predict a caption and a trace given onl…

    Submitted 12 May, 2021; originally announced May 2021.

  9. arXiv:2104.08667

    cs.CL cs.AI

    SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations

    Authors: Satwik Kottur, Seungwhan Moon, Alborz Geramifard, Babak Damavandi

    Abstract: Next generation task-oriented dialog systems need to understand conversational contexts with their perceived surroundings, to effectively help users in the real-world multimodal environment. Existing task-oriented dialog datasets aimed towards virtual assistance fall short and do not situate the dialog in the user's multimodal context. To overcome, we present a new dataset for Situated and Interac…

    Submitted 20 October, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: 10 pages, 7 figures, 5 tables

  10. arXiv:1606.07470

    cs.CL stat.ML

    NN-grams: Unifying neural network and n-gram language models for Speech Recognition

    Authors: Babak Damavandi, Shankar Kumar, Noam Shazeer, Antoine Bruguier

    Abstract: We present NN-grams, a novel, hybrid language model integrating n-grams and neural networks (NN) for speech recognition. The model takes as input both word histories as well as n-gram counts. Thus, it combines the memorization capacity and scalability of an n-gram model with the generalization ability of neural networks. We report experiments where the model is trained on 26B words. NN-grams are e…

    Submitted 23 June, 2016; originally announced June 2016.

    Comments: To be published in the proceedings of INTERSPEECH 2016