-
Moment Sampling in Video LLMs for Long-Form Video QA
Authors:
Mustafa Chasmai,
Gauri Jagatap,
Gouthaman KV,
Grant Van Horn,
Subhransu Maji,
Andrea Fanelli
Abstract:
Recent advancements in video large language models (Video LLMs) have significantly advanced the field of video question answering (VideoQA). While existing methods perform well on short videos, they often struggle with long-range reasoning in longer videos. To scale Video LLMs for longer video content, frame sub-sampling (selecting frames at regular intervals) is commonly used. However, this approach is suboptimal, often leading to the loss of crucial frames or the inclusion of redundant information from multiple similar frames. Missing key frames impairs the model's ability to answer questions accurately, while redundant frames lead the model to focus on irrelevant video segments and increase computational resource consumption. In this paper, we investigate the use of a general-purpose text-to-video moment retrieval model to guide the frame sampling process. We propose "moment sampling", a novel, model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. Specifically, we employ a lightweight moment retrieval model to prioritize frame selection. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. Through extensive experiments on four long-form VideoQA datasets, using four state-of-the-art Video LLMs, we demonstrate the effectiveness of the proposed approach.
Submitted 17 June, 2025;
originally announced July 2025.
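The contrast the abstract draws between regular-interval sub-sampling and question-guided frame selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the per-frame relevance scores are assumed to come from some moment retrieval model, and the function names `uniform_sample` and `relevance_sample` are hypothetical.

```python
def uniform_sample(num_frames, k):
    """Baseline: pick k frame indices at regular intervals."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    # Take the middle frame of each of the k equal-length segments.
    return [int(i * step + step / 2) for i in range(k)]


def relevance_sample(scores, k):
    """Pick the k highest-scoring frames, returned in temporal order.

    `scores` is a hypothetical list of per-frame relevance values for a
    given question (an assumption; the abstract does not specify the form).
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy usage: frames 1 and 3 are the most relevant to the question.
toy_scores = [0.1, 0.9, 0.2, 0.8, 0.05]
print(uniform_sample(len(toy_scores), 2))
print(relevance_sample(toy_scores, 2))
```

Returning the selected indices in temporal order keeps the frame sequence chronological for the downstream Video LLM, which is a design choice of this sketch rather than something stated in the abstract.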
-
Can deepfakes be created by novice users?
Authors:
Pulak Mehta,
Gauri Jagatap,
Kevin Gallagher,
Brian Timmerman,
Progga Deb,
Siddharth Garg,
Rachel Greenstadt,
Brendan Dolan-Gavitt
Abstract:
Recent advancements in machine learning and computer vision have led to the proliferation of Deepfakes. As technology democratizes over time, there is an increasing fear that novice users can create Deepfakes to discredit others and undermine public discourse. In this paper, we conduct user studies to understand whether participants with advanced computer skills and varying levels of computer science expertise can create Deepfakes of a person saying a target statement using limited media files. We conduct two studies; in the first study (n = 39), participants try creating a target Deepfake in a constrained time frame using any tool they desire. In the second study (n = 29), participants use pre-specified deep learning-based tools to create the same Deepfake. We find that in the first study, 23.1% of the participants successfully created complete Deepfakes with audio and video, whereas in the second study, 58.6% of the participants were successful in stitching target speech to the target video. We further use Deepfake detection software tools, as well as human examiner-based analysis, to classify the successfully generated Deepfake outputs as fake, suspicious, or real. The software detector classified 80% of the Deepfakes as fake, whereas the human examiners classified 100% of the videos as fake. We conclude that creating Deepfakes is a simple enough task for a novice user given adequate tools and time; however, the resulting Deepfakes are not sufficiently real-looking and are unable to completely fool either detection software or human examiners.
Submitted 27 April, 2023;
originally announced April 2023.
-
Adversarial Token Attacks on Vision Transformers
Authors:
Ameya Joshi,
Gauri Jagatap,
Chinmay Hegde
Abstract:
Vision transformers rely on a patch-token-based self-attention mechanism, in contrast to convolutional networks. We investigate fundamental differences between these two families of models by designing a block-sparsity-based adversarial token attack. We probe and analyze transformer as well as convolutional models with token attacks of varying patch sizes. We infer that transformer models are more sensitive to token attacks than convolutional models, with ResNets outperforming Transformer models by up to $\sim30\%$ in robust accuracy for single token attacks.
Submitted 8 October, 2021;
originally announced October 2021.
-
Provable Compressed Sensing with Generative Priors via Langevin Dynamics
Authors:
Thanh V. Nguyen,
Gauri Jagatap,
Chinmay Hegde
Abstract:
Deep generative models have emerged as a powerful class of priors for signals in various inverse problems such as compressed sensing, phase retrieval and super-resolution. Here, we assume an unknown signal to lie in the range of some pre-trained generative model. A popular approach for signal recovery is via gradient descent in the low-dimensional latent space. While gradient descent has achieved good empirical performance, its theoretical behavior is not well understood. In this paper, we introduce the use of stochastic gradient Langevin dynamics (SGLD) for compressed sensing with a generative prior. Under mild assumptions on the generative model, we prove the convergence of SGLD to the true signal. We also demonstrate empirical performance competitive with standard gradient descent.
Submitted 24 February, 2021;
originally announced February 2021.
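For orientation, a generic Langevin update in the latent space can be written as below. The notation is assumed rather than taken from the abstract: $G$ denotes the pre-trained generative model, $A$ the measurement matrix, $y$ the measurements, $\eta$ a step size and $\beta$ an inverse-temperature parameter. The paper's exact loss and constants may differ.

```latex
f(z) = \| y - A\,G(z) \|_2^2, \qquad
z_{t+1} = z_t - \eta \,\nabla_z f(z_t) + \sqrt{\frac{2\eta}{\beta}}\;\xi_t,
\qquad \xi_t \sim \mathcal{N}(0, I).
```

Dropping the noise term $\xi_t$ recovers plain gradient descent in the latent space, the baseline the abstract compares against.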
-
Adversarially Robust Learning via Entropic Regularization
Authors:
Gauri Jagatap,
Ameya Joshi,
Animesh Basak Chowdhury,
Siddharth Garg,
Chinmay Hegde
Abstract:
In this paper we propose a new family of algorithms, ATENT, for training adversarially robust deep neural networks. We formulate a new loss function that is equipped with an additional entropic regularization. Our loss function considers the contribution of adversarial samples that are drawn from a specially designed distribution in the data space that assigns high probability to points with high loss and in the immediate neighborhood of training samples. Our proposed algorithms optimize this loss to seek adversarially robust valleys of the loss landscape. Our approach achieves competitive (or better) performance in terms of robust classification accuracy as compared to several state-of-the-art robust learning approaches on benchmark datasets such as MNIST and CIFAR-10.
Submitted 19 February, 2021; v1 submitted 27 August, 2020;
originally announced August 2020.
-
Algorithmic Guarantees for Inverse Imaging with Untrained Network Priors
Authors:
Gauri Jagatap,
Chinmay Hegde
Abstract:
Deep neural networks have recently been introduced as image priors for problems such as denoising, super-resolution and inpainting, with promising performance gains over hand-crafted image priors such as sparsity and low-rank. Unlike learned generative priors, they do not require any training over large datasets. However, few theoretical guarantees exist in the scope of using untrained neural network priors for inverse imaging problems. We explore new applications and theory for untrained neural network priors. Specifically, we consider the problem of solving linear inverse problems, such as compressive sensing, as well as non-linear problems, such as compressive phase retrieval. We model images to lie in the range of an untrained deep generative network with a fixed seed. We further present a projected gradient descent scheme that can be used for both compressive sensing and phase retrieval and provide rigorous theoretical guarantees for its convergence. We also show both theoretically and empirically that with deep network priors, one can achieve better compression rates for the same image quality compared to hand-crafted priors.
Submitted 27 March, 2020; v1 submitted 20 June, 2019;
originally announced June 2019.
-
Learning ReLU Networks via Alternating Minimization
Authors:
Gauri Jagatap,
Chinmay Hegde
Abstract:
We propose and analyze a new family of algorithms for training neural networks with ReLU activations. Our algorithms are based on the technique of alternating minimization: estimating the activation patterns of each ReLU for all given samples, interleaved with weight updates via a least-squares step. The main focus of our paper is 1-hidden layer networks with $k$ hidden neurons and ReLU activation. We show that under standard distributional assumptions on the $d$-dimensional input data, our algorithm provably recovers the ground-truth parameters in a linearly convergent fashion. This holds as long as the weights are sufficiently well initialized; furthermore, our method requires only $n=\widetilde{O}(dk^2)$ samples. We also analyze the special case of 1-hidden layer networks with skipped connections, commonly used in ResNet-type architectures, and propose a novel initialization strategy for the same. For ReLU-based ResNet-type networks, we provide the first linear convergence guarantee with an end-to-end algorithm. We also extend this framework to deeper networks and empirically demonstrate its convergence to a global minimum.
Submitted 10 October, 2018; v1 submitted 20 June, 2018;
originally announced June 2018.
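The alternation the abstract describes (estimate activation patterns, then update weights by least squares) can be sketched for the simplest possible case. This is an illustration under strong assumptions, not the paper's algorithm: it uses a single ReLU neuron ($k=1$) with 2-dimensional inputs, noiseless labels, and a hand-picked initialization; the function name `relu_altmin_2d` and the toy data are hypothetical.

```python
def relu_altmin_2d(xs, ys, w_init, iters=10):
    """Alternating minimization for y = max(0, w . x) with 2-D inputs."""
    w0, w1 = w_init
    for _ in range(iters):
        # Step 1: estimate the activation pattern under the current weights.
        active = [(x, y) for x, y in zip(xs, ys) if w0 * x[0] + w1 * x[1] > 0]
        # Step 2: least squares on the samples estimated as active,
        # solved through the 2x2 normal equations (Cramer's rule).
        a00 = sum(x[0] * x[0] for x, _ in active)
        a01 = sum(x[0] * x[1] for x, _ in active)
        a11 = sum(x[1] * x[1] for x, _ in active)
        b0 = sum(y * x[0] for x, y in active)
        b1 = sum(y * x[1] for x, y in active)
        det = a00 * a11 - a01 * a01
        if abs(det) < 1e-12:
            break  # degenerate system; keep the current estimate
        w0, w1 = (b0 * a11 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det
    return (w0, w1)


# Toy data generated by the "true" weights (1, -2), with no noise.
w_true = (1.0, -2.0)
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 0.0),
      (0.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (2.0, -1.0)]
ys = [max(0.0, w_true[0] * x[0] + w_true[1] * x[1]) for x in xs]

w_hat = relu_altmin_2d(xs, ys, w_init=(0.8, -1.5))
print(w_hat)
```

In this toy setting the initial guess already induces the correct activation pattern on every sample, so the least-squares step recovers the true weights immediately. The abstract's requirement that the weights be sufficiently well initialized plays the analogous role in the general case.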
-
Sample-Efficient Algorithms for Recovering Structured Signals from Magnitude-Only Measurements
Authors:
Gauri Jagatap,
Chinmay Hegde
Abstract:
We consider the problem of recovering a signal $\mathbf{x}^* \in \mathbf{R}^n$ from magnitude-only measurements $y_i = |\left\langle\mathbf{a}_i,\mathbf{x}^*\right\rangle|$ for $i \in [m]$. Also called phase retrieval, this is a fundamental challenge in bio- and astronomical imaging and in speech processing. The problem above is ill-posed; additional assumptions on the signal and/or the measurements are necessary. In this paper we first study the case where the signal $\mathbf{x}^*$ is $s$-sparse. We develop a novel algorithm that we call Compressive Phase Retrieval with Alternating Minimization, or CoPRAM. Our algorithm is simple; it combines the classical alternating minimization approach for phase retrieval with the CoSaMP algorithm for sparse recovery. Despite its simplicity, we prove that CoPRAM achieves a sample complexity of $O(s^2\log n)$ with Gaussian measurements $\mathbf{a}_i$, matching the best known existing results; moreover, it demonstrates linear convergence in theory and practice. Additionally, it requires no extra tuning parameters other than signal sparsity $s$ and is robust to noise. When the sorted coefficients of the sparse signal exhibit a power law decay, we show that CoPRAM achieves a sample complexity of $O(s\log n)$, which is close to the information-theoretic limit. We also consider the case where the signal $\mathbf{x}^*$ arises from structured sparsity models. We specifically examine the case of block-sparse signals with uniform block size of $b$ and block sparsity $k=s/b$. For this problem, we design a recovery algorithm, Block CoPRAM, that further reduces the sample complexity to $O(ks\log n)$. For sufficiently large block lengths of $b=\Theta(s)$, this bound equates to $O(s\log n)$. To our knowledge, this constitutes the first end-to-end algorithm for phase retrieval where the Gaussian sample complexity has a sub-quadratic dependence on the signal sparsity level.
Submitted 26 November, 2017; v1 submitted 18 May, 2017;
originally announced May 2017.
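The step from $O(ks\log n)$ to $O(s\log n)$ in the abstract follows by substituting the block sparsity $k = s/b$; the short derivation below uses only the quantities defined in the abstract.

```latex
k s \log n \;=\; \frac{s}{b}\, s \log n \;=\; \frac{s^2}{b} \log n,
\qquad
b = \Theta(s) \;\Longrightarrow\; \frac{s^2}{b} = \Theta(s)
\;\Longrightarrow\; O(k s \log n) = O(s \log n).
```

For comparison, with block size $b = 1$ (no block structure) the same expression reduces to $s^2 \log n$, matching the $O(s^2\log n)$ bound stated for CoPRAM.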