-
FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos
Authors:
Kavitha Viswanathan,
Vrinda Goel,
Shlesh Gholap,
Devayan Ghosh,
Madhav Gupta,
Dhruvi Ganatra,
Sanket Potdar,
Amit Sethi
Abstract:
Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising nearly 1,463 LR clips (180 x 320, 20-60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels.
FANVID defines two tasks: (1) face matching -- detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition -- extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources to ensure that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU > 0.5, prioritizing identity correctness for faces and character-level accuracy for text.
A baseline method with pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and challenge of the tasks. FANVID's selection of faces and plates balances diversity with recognition challenge. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
Submitted 8 June, 2025;
originally announced June 2025.
-
Towards a comprehensive taxonomy of online abusive language informed by machine learning
Authors:
Samaneh Hosseini Moghaddam,
Kelly Lyons,
Cheryl Regehr,
Vivek Goel,
Kaitlyn Regehr
Abstract:
The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.
Submitted 24 April, 2025;
originally announced April 2025.
-
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
Authors:
Yunuo Chen,
Junli Cao,
Anil Kag,
Vidit Goel,
Sergei Korolev,
Chenfanfu Jiang,
Sergey Tulyakov,
Jian Ren
Abstract:
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, e.g., nonphysical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues like object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos. These videos involve complex interactions of solids, where 3D information is essential for perceiving deformation and contact. Furthermore, our model improves the overall quality of video generation by promoting the 3D consistency of moving objects and reducing abrupt changes in shape and motion.
Submitted 5 February, 2025;
originally announced February 2025.
-
Wonderland: Navigating 3D Scenes from a Single Image
Authors:
Hanwen Liang,
Junli Cao,
Vidit Goel,
Guocheng Qian,
Sergei Korolev,
Demetri Terzopoulos,
Konstantinos N. Plataniotis,
Sergey Tulyakov,
Jian Ren
Abstract:
How can one efficiently generate high-quality, wide-scope 3D scenes from arbitrary single images? Existing methods suffer several drawbacks, such as requiring multi-view data, time-consuming per-scene optimization, distorted geometry in occluded areas, and low visual quality in backgrounds. Our novel 3D scene reconstruction pipeline overcomes these limitations to tackle the aforesaid challenge. Specifically, we introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that encode multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets affirm that our model significantly outperforms existing single-view 3D scene generation methods, especially with out-of-domain images. Thus, we demonstrate for the first time that a 3D reconstruction model can effectively be built upon the latent space of a diffusion model in order to realize efficient 3D scene generation.
Submitted 26 April, 2025; v1 submitted 16 December, 2024;
originally announced December 2024.
-
Cross-Domain Evaluation of Few-Shot Classification Models: Natural Images vs. Histopathological Images
Authors:
Ardhendu Sekhar,
Aditya Bhattacharya,
Vinayak Goyal,
Vrinda Goel,
Aditya Bhangale,
Ravi Kant Gupta,
Amit Sethi
Abstract:
In this study, we investigate the performance of few-shot classification models across different domains, specifically natural images and histopathological images. We first train several few-shot classification models on natural images and evaluate their performance on histopathological images. Subsequently, we train the same models on histopathological images and compare their performance. We incorporated four histopathology datasets and one natural images dataset and assessed performance across 5-way 1-shot, 5-way 5-shot, and 5-way 10-shot scenarios using a selection of state-of-the-art classification techniques. Our experimental results reveal insights into the transferability and generalization capabilities of few-shot classification models between diverse image domains. We analyze the strengths and limitations of these models in adapting to new domains and provide recommendations for optimizing their performance in cross-domain scenarios. This research contributes to advancing our understanding of few-shot learning in the context of image classification across diverse domains.
Submitted 11 October, 2024;
originally announced October 2024.
-
HER2 and FISH Status Prediction in Breast Biopsy H&E-Stained Images Using Deep Learning
Authors:
Ardhendu Sekhar,
Vrinda Goel,
Garima Jain,
Abhijeet Patil,
Ravi Kant Gupta,
Tripti Bameta,
Swapnil Rane,
Amit Sethi
Abstract:
The current standard for detecting human epidermal growth factor receptor 2 (HER2) status in breast cancer patients relies on HER2 amplification, identified through fluorescence in situ hybridization (FISH) or immunohistochemistry (IHC). However, hematoxylin and eosin (H&E) tumor stains are more widely available, and accurately predicting HER2 status using H&E could reduce costs and expedite treatment selection. Deep learning algorithms for H&E have shown effectiveness in predicting various cancer features and clinical outcomes, including moderate success in HER2 status prediction. In this work, we employed a customized weak supervision classification technique combined with MoCo-v2 contrastive learning to predict HER2 status. We trained our pipeline on 182 publicly available H&E Whole Slide Images (WSIs) from The Cancer Genome Atlas (TCGA), for which annotations by the pathology team at Yale School of Medicine are publicly available. Our pipeline achieved an Area Under the Curve (AUC) of 0.85 across four different test folds. Additionally, we tested our model on 44 H&E slides from the TCGA-BRCA dataset, which had an HER2 score of 2+ and included corresponding HER2 status and FISH test results. These cases are considered equivocal for IHC, requiring an expensive FISH test on their IHC slides for disambiguation. Our pipeline demonstrated an AUC of 0.81 on these challenging H&E slides. Reducing the need for FISH tests can have significant implications for cancer treatment equity in underserved populations.
Submitted 26 September, 2024; v1 submitted 25 August, 2024;
originally announced August 2024.
-
Lightweight Predictive 3D Gaussian Splats
Authors:
Junli Cao,
Vidit Goel,
Chaoyang Wang,
Anil Kag,
Ju Hu,
Sergei Korolev,
Chenfanfu Jiang,
Sergey Tulyakov,
Jian Ren
Abstract:
Recent approaches representing 3D objects and scenes using Gaussian splats show increased rendering speed across a variety of platforms and devices. While rendering such representations is indeed extremely efficient, storing and transmitting them is often prohibitively expensive. To represent large-scale scenes, one often needs to store millions of 3D Gaussians, occupying gigabytes of disk space. This poses a very practical limitation, prohibiting widespread adoption. Several solutions have been proposed to strike a balance between disk size and rendering quality, but they noticeably reduce the visual quality. In this work, we propose a new representation that dramatically reduces the hard drive footprint while featuring similar or improved quality when compared to the standard 3D Gaussian splats. When compared to other compact solutions, ours offers higher quality renderings with significantly reduced storage, being able to efficiently run on a mobile device in real-time. Our key observation is that nearby points in the scene can share similar representations. Hence, only a small ratio of 3D points needs to be stored. We introduce an approach to identify such points, which are called parent points. The discarded points, called children points, along with their attributes can be efficiently predicted by tiny MLPs.
Submitted 27 June, 2024;
originally announced June 2024.
-
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Authors:
Moreno D'Incà,
Elia Peruzzo,
Massimiliano Mancini,
Dejia Xu,
Vidit Goel,
Xingqian Xu,
Zhangyang Wang,
Humphrey Shi,
Nicu Sebe
Abstract:
Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployments, it is necessary to deeply investigate their safety and fairness to not disseminate and perpetuate any kind of biases. However, existing works focus on detecting closed sets of biases defined a priori, limiting the studies to well-known concepts. In this paper, we tackle the challenge of open-set bias detection in text-to-image generative models presenting OpenBias, a new pipeline that identifies and quantifies the severity of biases agnostically, without access to any precompiled set. OpenBias has three stages. In the first phase, we leverage a Large Language Model (LLM) to propose biases given a set of captions. Secondly, the target generative model produces images using the same set of captions. Lastly, a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. We study the behavior of Stable Diffusion 1.5, 2, and XL emphasizing new biases, never investigated before. Via quantitative experiments, we demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement.
Submitted 5 August, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
VASE: Object-Centric Appearance and Shape Manipulation of Real Videos
Authors:
Elia Peruzzo,
Vidit Goel,
Dejia Xu,
Xingqian Xu,
Yifan Jiang,
Zhangyang Wang,
Humphrey Shi,
Nicu Sebe
Abstract:
Recently, several works tackled the video editing task fostered by the success of large-scale text-to-image generative models. However, most of these methods holistically edit the frame using the text, exploiting the prior given by foundation diffusion models and focusing on improving the temporal consistency across frames. In this work, we introduce a framework that is object-centric and is designed to control both the object's appearance and, notably, to execute precise and explicit structural modifications on the object. We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control. We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities. Further details, code and examples are available on our project page: https://helia95.github.io/vase-website/
Submitted 4 January, 2024;
originally announced January 2024.
-
Video Instance Matting
Authors:
Jiachen Li,
Roberto Henschel,
Vidit Goel,
Marianna Ohanyan,
Shant Navasardyan,
Humphrey Shi
Abstract:
Conventional video matting outputs one alpha matte for all instances appearing in a video frame so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, results are unsatisfactory for matting applications, especially due to applied binarization. To remedy this deficiency, we propose Video Instance Matting (VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performances on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality (VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.
Submitted 8 November, 2023; v1 submitted 7 November, 2023;
originally announced November 2023.
-
Optimization of Tritium Breeding Ratio in a DT and DD Submersion Tokamak Fusion Reactor
Authors:
Vikram Goel,
Soha Aslam,
Sejal Dua
Abstract:
The mass of stars is enough to confine a plasma to fuse light atoms, but this is not possible to engineer on Earth. Fortunately, nuclear engineering can rely on the magnetic confinement of a plasma using superconducting coils so long as the Tritium Breeding Ratio (TBR) is optimized. This paper will investigate some of the materials which can increase the rate at which Tritium is produced within the breeding blanket layer of Submersion Tokamak reactors, a design that uses magnetic confinement of a plasma in the shape of a torus to execute nuclear fusion. Using the Paramak Python module to model several geometries and OpenMC to run a simulation, it can be observed how neutron multipliers, enrichment, and the neutron energy spectrum affect TBR. This experiment will mainly observe different material choices that have been considered and their TBR based on their cross sections, dose rate, thermal properties and safety. By altering the neutron energy spectrum to account for DD and DT plasma, the difference in these compounds' Tritium breeding efficacy is noted. Neutron energy spectra are an important factor in optimising the TBR levels as the neutrons generated by the fusion reactions in the plasma interact with the breeder material in the blanket and produce tritium through the reaction with Lithium. Since Tritium is a rare isotope of hydrogen that is used as fuel in fusion reactions and has a short half-life, it is essential to produce tritium within the fusion reactor itself. Without the tritium breeding capability, it would not be feasible to generate energy via fusion. A TBR greater than unity indicates that the reactor can generate more tritium than it consumes, ensuring self-sufficiency in the tritium inventory. Since Tritium is the most reliable and efficient fuel for these reactors, optimising the TBR is of paramount importance in the long road to commercialization of nuclear fusion.
Submitted 29 September, 2023;
originally announced October 2023.
-
Interactive Neural Painting
Authors:
Elia Peruzzo,
Willi Menapace,
Vidit Goel,
Federica Arrigoni,
Hao Tang,
Xingqian Xu,
Arman Chopikyan,
Nikita Orlov,
Yuxiao Hu,
Humphrey Shi,
Nicu Sebe,
Elisa Ricci
Abstract:
In the last few years, Neural Painting (NP) techniques became capable of producing extremely realistic artworks. This paper advances the state of the art in this emerging research domain by proposing the first approach for Interactive NP. Considering a setting where a user looks at a scene and tries to reproduce it on a painting, our objective is to develop a computational framework to assist the user's creativity by suggesting the next strokes to paint, which can possibly be used to complete the artwork. To accomplish such a task, we propose I-Paint, a novel method based on a conditional transformer Variational AutoEncoder (VAE) architecture with a two-stage decoder. To evaluate the proposed approach and stimulate research in this area, we also introduce two novel datasets. Our experiments show that our approach provides good stroke suggestions and compares favorably to the state of the art. Additional details, code and examples are available at https://helia95.github.io/inp-website.
Submitted 31 July, 2023;
originally announced July 2023.
-
PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor
Authors:
Vidit Goel,
Elia Peruzzo,
Yifan Jiang,
Dejia Xu,
Xingqian Xu,
Nicu Sebe,
Trevor Darrell,
Zhangyang Wang,
Humphrey Shi
Abstract:
Generative image editing has recently witnessed extremely fast-paced growth. Some works use high-level conditioning such as text, while others use low-level conditioning. Nevertheless, most of them lack fine-grained control over the properties of the different objects present in the image, i.e. object-level image editing. In this work, we tackle the task by perceiving the images as an amalgamation of various objects and aim to control the properties of each object in a fine-grained manner. Out of these properties, we identify structure and appearance as the most intuitive to understand and useful for editing purposes. We propose PAIR Diffusion, a generic framework that can enable a diffusion model to control the structure and appearance properties of each object in the image. We show that having control over the properties of each object in an image leads to comprehensive editing capabilities. Our framework allows for various object-level editing operations on real images such as reference image-based appearance editing, free-form shape editing, adding objects, and variations. Thanks to our design, we do not require any inversion step. Additionally, we propose multimodal classifier-free guidance which enables editing images using both reference images and text when using our approach with foundational diffusion models. We validate the above claims by extensively evaluating our framework on both unconditional and foundational diffusion models. Please refer to https://vidit98.github.io/publication/conference-paper/pair_diff.html for code and model release.
Submitted 8 April, 2024; v1 submitted 30 March, 2023;
originally announced March 2023.
-
GPT-4 Technical Report
Authors:
OpenAI,
Josh Achiam,
Steven Adler,
Sandhini Agarwal,
Lama Ahmad,
Ilge Akkaya,
Florencia Leoni Aleman,
Diogo Almeida,
Janko Altenschmidt,
Sam Altman,
Shyamal Anadkat,
Red Avila,
Igor Babuschkin,
Suchir Balaji,
Valerie Balcom,
Paul Baltescu,
Haiming Bao,
Mohammad Bavarian,
Jeff Belgum,
Irwan Bello,
Jake Berdine,
Gabriel Bernadett-Shapiro,
Christopher Berner,
Lenny Bogdonoff,
Oleg Boiko
, et al. (256 additional authors not shown)
Abstract:
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
Submitted 4 March, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Hatemongers ride on echo chambers to escalate hate speech diffusion
Authors:
Vasu Goel,
Dhruv Sahnan,
Subhabrata Dutta,
Anil Bandhakavi,
Tanmoy Chakraborty
Abstract:
Recent years have witnessed a swelling rise of hateful and abusive content over online social networks. While detection and moderation of hate speech have been the early go-to countermeasures, the solution requires a deeper exploration of the dynamics of hate generation and propagation. We analyze more than 32 million posts from over 6.8 million users across three popular online social networks to investigate the interrelations between hateful behavior, information dissemination, and polarised organization mediated by echo chambers. We find that hatemongers play a more crucial role in governing the spread of information compared to singled-out hateful content. This observation holds for both the growth of information cascades as well as the conglomeration of hateful actors. Dissection of the core-wise distribution of these networks points towards the fact that hateful users acquire a more well-connected position in the social network and often flock together to build up information cascades. We observe that this cohesion is far from mere organized behavior; instead, in these networks, hatemongers dominate the echo chambers -- groups of users actively align themselves to specific ideological positions. The observed dominance of hateful users to inflate information cascades is primarily via user interactions amplified within these echo chambers. We conclude our study with a cautionary note that popularity-based recommendation of content is susceptible to be exploited by hatemongers given their potential to escalate content popularity via echo-chambered interactions.
Submitted 5 February, 2023;
originally announced February 2023.
-
VMFormer: End-to-End Video Matting with Transformer
Authors:
Jiachen Li,
Vidit Goel,
Marianna Ohanyan,
Shant Navasardyan,
Yunchao Wei,
Humphrey Shi
Abstract:
Video matting aims to predict the alpha mattes for each frame from a given input video sequence. Recent solutions to video matting have been dominated by deep convolutional neural networks (CNN) for the past few years, which have become the de-facto standard for both academia and industry. However, they have inbuilt inductive bias of locality and do not capture global characteristics of an image due to the CNN-based architectures. They also lack long-range temporal modeling considering computational costs when dealing with feature maps of multiple frames. In this paper, we propose VMFormer: a transformer-based end-to-end method for video matting. It makes predictions on alpha mattes of each frame from learnable queries given a video input sequence. Specifically, it leverages self-attention layers to build global integration of feature sequences with short-range temporal modeling on successive frames. We further apply queries to learn global representations through cross-attention in the transformer decoder with long-range temporal modeling upon all queries. In the prediction stage, both queries and corresponding feature maps are used to make the final prediction of alpha matte. Experiments show that VMFormer outperforms previous CNN-based video matting methods on the composited benchmarks. To our best knowledge, it is the first end-to-end video matting solution built upon a full vision transformer with predictions on the learnable queries. The project is open-sourced at https://chrisjuniorli.github.io/project/VMFormer/
Submitted 30 November, 2022; v1 submitted 26 August, 2022;
originally announced August 2022.
-
VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution
Authors:
Zeyuan Chen,
Yinbo Chen,
Jingwen Liu,
Xingqian Xu,
Vidit Goel,
Zhangyang Wang,
Humphrey Shi,
Xiaolong Wang
Abstract:
Videos typically record streaming, continuous visual data as discrete consecutive frames. Since storing high-fidelity video is expensive, most videos are stored at a relatively low resolution and frame rate. Recent works on Space-Time Video Super-Resolution (STVSR) incorporate temporal interpolation and spatial super-resolution in a unified framework. However, most of them only support a fixed up-sampling scale, which limits their flexibility and applications. In this work, instead of following the discrete representations, we propose Video Implicit Neural Representation (VideoINR), and we show its applications for STVSR. The learned implicit neural representation can be decoded to videos of arbitrary spatial resolution and frame rate. We show that VideoINR achieves competitive performance with state-of-the-art STVSR methods on common up-sampling scales and significantly outperforms prior works on continuous and out-of-training-distribution scales. Our project page is at http://zeyuan-chen.com/VideoINR/ .
Submitted 9 June, 2022;
originally announced June 2022.
-
K-12BERT: BERT for K-12 education
Authors:
Vasu Goel,
Dhruv Sahnan,
Venktesh V,
Gaurav Sharma,
Deep Dwivedi,
Mukesh Mohania
Abstract:
Online education platforms are powered by various NLP pipelines, which utilize models like BERT to aid in content curation. Since the inception of pre-trained language models like BERT, there have also been many efforts toward adapting these pre-trained models to specific domains. However, to the best of our knowledge, there has not been a model specifically adapted for the education domain (particularly K-12) across subjects. In this work, we propose to train a language model on a corpus of data curated by us across multiple subjects from various sources for K-12 education. We also evaluate our model, K-12BERT, on downstream tasks like hierarchical taxonomy tagging.
Submitted 24 May, 2022;
originally announced May 2022.
-
DiVA: A Scalable, Interactive and Customizable Visual Analytics Platform for Information Diffusion on Large Networks
Authors:
Dhruv Sahnan,
Vasu Goel,
Sarah Masud,
Chhavi Jain,
Vikram Goyal,
Tanmoy Chakraborty
Abstract:
With the increasing reach of digital platforms in our lives, researchers have taken a keen interest in studying different facets of social interactions that seem to be evolving rapidly. Analysing the spread of information (aka diffusion) has brought forth multiple research areas such as modelling user engagement, determining emerging topics, forecasting virality of online posts and predicting information cascades. Despite such ever-increasing interest, there remains a lack of easy-to-use interfaces for large-scale visualisation of diffusion models. In this paper, we introduce DiVA -- Diffusion Visualisation and Analysis, a tool that provides a scalable web interface and extendable APIs to analyse various diffusion trends on networks. DiVA uniquely offers support for simultaneous comparison of two competing diffusion models and even the comparison with the ground-truth results, both of which help develop a coherent understanding of real-world scenarios. Along with performing an exhaustive feature comparison and system evaluation of DiVA against publicly-available web interfaces for information diffusion, we conducted a user study to understand the strengths and limitations of DiVA. We noticed that evaluators had a seamless user experience, especially when analysing diffusion on large networks.
Submitted 21 August, 2022; v1 submitted 12 December, 2021;
originally announced December 2021.
-
MSN: Efficient Online Mask Selection Network for Video Instance Segmentation
Authors:
Vidit Goel,
Jiachen Li,
Shubhika Garg,
Harsh Maheshwari,
Humphrey Shi
Abstract:
In this work we present a novel solution for Video Instance Segmentation (VIS), that is, automatically generating instance-level segmentation masks along with object classes and tracking them in a video. Our method improves the masks from segmentation and propagation branches in an online manner using the Mask Selection Network (MSN), hence limiting the noise accumulation during mask tracking. We propose an effective design of MSN using a patch-based convolutional neural network. The network is able to distinguish very subtle differences between the masks and accurately choose the better mask out of the associated masks. Further, we make use of temporal consistency and process the video sequences in both forward and reverse order as a post-processing step to recover lost objects. The proposed method can be used to adapt any video object segmentation method for the task of VIS. Our method achieves a score of 49.1 mAP on the 2021 YouTube-VIS Challenge and was ranked third among more than 30 global teams. Our code will be available at https://github.com/SHI-Labs/Mask-Selection-Networks.
Submitted 19 June, 2021;
originally announced June 2021.
-
Can Adversarial Weight Perturbations Inject Neural Backdoors?
Authors:
Siddhant Garg,
Adarsh Kumar,
Vibhor Goel,
Yingyu Liang
Abstract:
Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent times. Thus far, the concept of an "adversarial perturbation" has exclusively been used with reference to the input space, referring to a small, imperceptible change which can cause an ML model to err. In this work we extend the idea of "adversarial perturbations" to the space of model weights, specifically to inject backdoors in trained DNNs, which exposes a security risk of using publicly available trained models. Here, injecting a backdoor refers to obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original model predictions on a non-triggered input. From the perspective of an adversary, we characterize these adversarial perturbations to be constrained within an $\ell_{\infty}$ norm around the original model weights. We introduce adversarial perturbations in the model weights using a composite loss on the predictions of the original model and the desired trigger through projected gradient descent. We empirically show that these adversarial weight perturbations exist universally across several computer vision and natural language processing tasks. Our results show that backdoors can be successfully injected with a very small average relative change in model weight values for several applications.
Submitted 21 September, 2020; v1 submitted 4 August, 2020;
originally announced August 2020.
-
IQ-VQA: Intelligent Visual Question Answering
Authors:
Vatsal Goel,
Mohit Chandak,
Ashish Anand,
Prithwijit Guha
Abstract:
Even though there has been tremendous progress in the field of Visual Question Answering, models today still tend to be inconsistent and brittle. To this end, we propose a model-independent cyclic framework which increases consistency and robustness of any VQA architecture. We train our models to answer the original question, generate an implication based on the answer and then also learn to answer the generated implication correctly. As a part of the cyclic framework, we propose a novel implication generator which can generate implied questions from any question-answer pair. As a baseline for future works on consistency, we provide a new human annotated VQA-Implications dataset. The dataset consists of ~30k questions containing implications of 3 types - Logical Equivalence, Necessary Condition and Mutual Exclusion - made from the VQA v2.0 validation dataset. We show that our framework improves consistency of VQA models by ~15% on the rule-based dataset, ~7% on VQA-Implications dataset and robustness by ~2%, without degrading their performance. In addition, we also quantitatively show improvement in attention maps which highlights better multi-modal understanding of vision and language.
Submitted 8 July, 2020;
originally announced July 2020.
-
IROS 2019 Lifelong Robotic Vision Challenge -- Lifelong Object Recognition Report
Authors:
Qi She,
Fan Feng,
Qi Liu,
Rosa H. M. Chan,
Xinyue Hao,
Chuanlin Lan,
Qihan Yang,
Vincenzo Lomonaco,
German I. Parisi,
Heechul Bae,
Eoin Brophy,
Baoquan Chen,
Gabriele Graffieti,
Vidit Goel,
Hyonyoung Han,
Sathursan Kanagarajah,
Somesh Kumar,
Siew-Kei Lam,
Tin Lun Lam,
Liang Ma,
Davide Maltoni,
Lorenzo Pellegrini,
Duvindu Piyasena,
Shiliang Pu,
Debdoot Sheet
, et al. (11 additional authors not shown)
Abstract:
This report summarizes the IROS 2019 Lifelong Robotic Vision Competition (Lifelong Object Recognition Challenge) with methods and results from the top 8 finalists (out of over 150 teams). The competition dataset (L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) - Object Recognition (OpenLORIS-object) is designed for driving lifelong/continual learning research and application in the robotic vision domain, with everyday objects in home, office, campus, and mall scenarios. The dataset explicitly quantifies the variants of illumination, object occlusion, object size, camera-object distance/angles, and clutter information. Rules are designed to quantify the learning capability of the robotic vision system when faced with the objects appearing in the dynamic environments in the contest. Individual reports, dataset information, rules, and released source code can be found at the project homepage: https://lifelong-robotic-vision.github.io/competition/
Submitted 26 April, 2020;
originally announced April 2020.
-
Unsupervised Video Object Segmentation for Deep Reinforcement Learning
Authors:
Vik Goel,
Jameson Weng,
Pascal Poupart
Abstract:
We present a new technique for deep reinforcement learning that automatically detects moving objects and uses the relevant information for action selection. The detection of moving objects is done in an unsupervised way by exploiting structure from motion. Instead of directly learning a policy from raw images, the agent first learns to detect and segment moving objects by exploiting flow information in video sequences. The learned representation is then used to focus the policy of the agent on the moving objects. Over time, the agent identifies which objects are critical for decision making and gradually builds a policy based on relevant moving objects. This approach, which we call Motion-Oriented REinforcement Learning (MOREL), is demonstrated on a suite of Atari games where the ability to detect moving objects reduces the amount of interaction needed with the environment to obtain a good policy. Furthermore, the resulting policy is more interpretable than policies that directly map images to actions or values with a black box neural network. We can gain insight into the policy by inspecting the segmentation and motion of each object detected by the agent. This allows practitioners to confirm whether a policy is making decisions based on sensible information.
Submitted 20 May, 2018;
originally announced May 2018.
-
Pointlike sets for varieties determined by groups
Authors:
Samuel J. v. Gool,
B. Steinberg
Abstract:
For a variety of finite groups $\mathbf H$, let $\overline{\mathbf H}$ denote the variety of finite semigroups all of whose subgroups lie in $\mathbf H$. We give a characterization of the subsets of a finite semigroup that are pointlike with respect to $\overline{\mathbf H}$. Our characterization is effective whenever $\mathbf H$ has a decidable membership problem. In particular, the separation problem for $\overline{\mathbf H}$-languages is decidable for any decidable variety of finite groups $\mathbf H$. This generalizes Henckell's theorem on decidability of aperiodic pointlikes.
Submitted 14 January, 2018;
originally announced January 2018.
-
Embedding-Based Speaker Adaptive Training of Deep Neural Networks
Authors:
Xiaodong Cui,
Vaibhava Goel,
George Saon
Abstract:
An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling. In this approach, speaker embedding vectors, which are a constant given a particular speaker, are mapped through a control network to layer-dependent element-wise affine transformations to canonicalize the internal feature representations at the output of hidden layers of a main network. The control network for generating the speaker-dependent mappings is jointly estimated with the main network for the overall speaker adaptive acoustic modeling. Experiments on large vocabulary continuous speech recognition (LVCSR) tasks show that the proposed SAT scheme can yield superior performance over the widely-used speaker-aware training using i-vectors with speaker-adapted input features.
Submitted 17 October, 2017;
originally announced October 2017.
-
Merge decompositions, two-sided Krohn-Rhodes, and aperiodic pointlikes
Authors:
Samuel J. v. Gool,
Benjamin Steinberg
Abstract:
This paper provides short proofs of two fundamental theorems of finite semigroup theory whose previous proofs were significantly longer, namely the two-sided Krohn-Rhodes decomposition theorem and Henckell's aperiodic pointlike theorem, using a new algebraic technique that we call the merge decomposition. A prototypical application of this technique decomposes a semigroup $T$ into a two-sided semidirect product whose components are built from two subsemigroups $T_1,T_2$, which together generate $T$, and the subsemigroup generated by their setwise product $T_1T_2$. In this sense we decompose $T$ by merging the subsemigroups $T_1$ and $T_2$. More generally, our technique merges semigroup homomorphisms from free semigroups.
Submitted 27 August, 2017;
originally announced August 2017.
-
McGan: Mean and Covariance Feature Matching GAN
Authors:
Youssef Mroueh,
Tom Sercu,
Vaibhava Goel
Abstract:
We introduce new families of Integral Probability Metrics (IPM) for training Generative Adversarial Networks (GAN). Our IPMs are based on matching statistics of distributions embedded in a finite dimensional feature space. Mean and covariance feature matching IPMs allow for stable training of GANs, which we will call McGan. McGan minimizes a meaningful loss between distributions.
Submitted 8 June, 2017; v1 submitted 27 February, 2017;
originally announced February 2017.
-
ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
Authors:
Helge Holzmann,
Vinay Goel,
Avishek Anand
Abstract:
Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability.
Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.
Submitted 3 February, 2017;
originally announced February 2017.
-
Self-critical Sequence Training for Image Captioning
Authors:
Steven J. Rennie,
Etienne Marcheret,
Youssef Mroueh,
Jarret Ross,
Vaibhava Goel
Abstract:
Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a "baseline" to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) are avoided, while at the same time harmonizing the model with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test-time is highly effective. Our results on the MSCOCO evaluation server establish a new state-of-the-art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7.
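Editor's illustration (not part of the abstract): a minimal sketch of the self-critical idea described above, assuming the test-time inference procedure is greedy decoding and that rewards (for example CIDEr scores) are computed elsewhere. The function names `scst_advantage` and `scst_loss` are hypothetical and do not come from the authors' code.

```python
def scst_advantage(sample_reward, greedy_reward):
    """Reward of a sampled caption relative to the model's own greedy caption.

    The greedy (test-time) reward plays the role of the baseline, so no
    separate baseline or critic has to be estimated.
    """
    return sample_reward - greedy_reward


def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """REINFORCE-style surrogate loss for one sampled caption.

    sample_log_probs: per-token log-probabilities of the sampled caption.
    Minimizing this loss raises the probability of samples that beat the
    greedy caption and lowers it for samples that do worse.
    """
    advantage = scst_advantage(sample_reward, greedy_reward)
    return -advantage * sum(sample_log_probs)


# Hypothetical numbers: the sampled caption scores higher than the greedy one.
loss = scst_loss([-0.5, -1.5], sample_reward=1.2, greedy_reward=1.0)
```

In a real training loop the advantage would be treated as a constant (no gradient flows through the rewards), and the log-probabilities would come from the captioning model.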
Submitted 15 November, 2017; v1 submitted 1 December, 2016;
originally announced December 2016.
-
Dense Prediction on Sequences with Time-Dilated Convolutions for Speech Recognition
Authors:
Tom Sercu,
Vaibhava Goel
Abstract:
In computer vision, pixelwise dense prediction is the task of predicting a label for each pixel in the image. Convolutional neural networks achieve good performance on this task, while being computationally efficient. In this paper we carry these ideas over to the problem of assigning a sequence of labels to a set of speech frames, a task commonly known as framewise classification. We show that the dense prediction view of framewise classification offers several advantages and insights, including computational efficiency and the ability to apply batch normalization. When doing dense prediction we pay specific attention to strided pooling in time and introduce an asymmetric dilated convolution, called time-dilated convolution, that allows for efficient and elegant implementation of pooling in time. We show results using time-dilated convolutions in a very deep VGG-style CNN with batch normalization on the Hub5 Switchboard-2000 benchmark task. With a big n-gram language model, we achieve 7.7% WER, which is the best single-model, single-pass performance reported so far.
Submitted 14 December, 2016; v1 submitted 28 November, 2016;
originally announced November 2016.
-
Co-Occuring Directions Sketching for Approximate Matrix Multiply
Authors:
Youssef Mroueh,
Etienne Marcheret,
Vaibhava Goel
Abstract:
We introduce co-occurring directions sketching, a deterministic algorithm for approximate matrix product (AMM), in the streaming model. We show that co-occurring directions achieves a better error bound for AMM than other randomized and deterministic approaches for AMM. Co-occurring directions gives a $1 + ε$-approximation of the optimal low rank approximation of a matrix product. Empirically our algorithm outperforms competing methods for AMM, for a small sketch size. We empirically validate our theoretical findings and algorithms.
Submitted 24 October, 2016;
originally announced October 2016.
-
Pro-aperiodic monoids via saturated models
Authors:
Samuel J. v. Gool,
Benjamin Steinberg
Abstract:
We apply Stone duality and model theory to study the structure theory of free pro-aperiodic monoids. Stone duality implies that elements of the free pro-aperiodic monoid may be viewed as elementary equivalence classes of pseudofinite words. Model theory provides us with saturated words in each such class, i.e., words in which all possible factorizations are realized. We give several applications of this new approach, including a solution to the word problem for $ω$-terms that avoids using McCammond's normal forms, as well as new proofs and extensions of other structural results concerning free pro-aperiodic monoids.
Submitted 28 August, 2017; v1 submitted 25 September, 2016;
originally announced September 2016.
-
Advances in Very Deep Convolutional Neural Networks for LVCSR
Authors:
Tom Sercu,
Vaibhava Goel
Abstract:
Very deep CNNs with small 3x3 kernels have recently been shown to achieve very strong performance as acoustic models in hybrid NN-HMM speech recognition systems. In this paper we investigate how to efficiently scale these models to larger datasets. Specifically, we address the design choice of pooling and padding along the time dimension which renders convolutional evaluation of sequences highly inefficient. We propose a new CNN design without timepadding and without timepooling, which is slightly suboptimal for accuracy, but has two significant advantages: it enables sequence training and deployment by allowing efficient convolutional evaluation of full utterances, and it allows for batch normalization to be straightforwardly adopted to CNNs on sequence data. Through batch normalization, we recover the lost performance from removing the time-pooling, while keeping the benefit of efficient convolutional evaluation. We demonstrate the performance of our models both on larger scale data than before, and after sequence training. Our very deep CNN model sequence trained on the 2000h switchboard dataset obtains 9.4 word error rate on the Hub5 test-set, matching with a single model the performance of the 2015 IBM system combination, which was the previous best published result.
Submitted 24 June, 2016; v1 submitted 6 April, 2016;
originally announced April 2016.
-
Asymmetrically Weighted CCA And Hierarchical Kernel Sentence Embedding For Image & Text Retrieval
Authors:
Youssef Mroueh,
Etienne Marcheret,
Vaibhava Goel
Abstract:
Joint modeling of language and vision has been drawing increasing interest. A multimodal data representation allowing for bidirectional retrieval of images by sentences and vice versa is a key aspect. In this paper we present three contributions in canonical correlation analysis (CCA) based multimodal retrieval. Firstly, we show that an asymmetric weighting of the canonical weights, while achieving a cross view mapping from the search to the query space, improves the retrieval performance. Secondly, we devise a computationally efficient model selection, crucial to generalization and stability, in the framework of the Björk Golub algorithm for regularized CCA via spectral filtering. Finally, we introduce a Hierarchical Kernel Sentence Embedding (HKSE) that approximates Kernel CCA for a special similarity kernel between distributions of words embedded in a vector space. State-of-the-art results are obtained on MSCOCO and Flickr benchmarks when these three techniques are used in conjunction.
Submitted 5 December, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.
-
Random Maxout Features
Authors:
Youssef Mroueh,
Steven Rennie,
Vaibhava Goel
Abstract:
In this paper, we propose and study random maxout features, which are constructed by first projecting the input data onto sets of randomly generated vectors with Gaussian elements, and then outputting the maximum projection value for each set. We show that the resulting random feature map, when used in conjunction with linear models, allows for the locally linear estimation of the function of interest in classification tasks, and for the locally linear embedding of points when used for dimensionality reduction or data visualization. We derive generalization bounds for learning that assess the error in approximating locally linear functions by linear functions in the maxout feature space, and empirically evaluate the efficacy of the approach on the MNIST and TIMIT classification tasks.
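Editor's illustration (not part of the abstract): a minimal pure-Python sketch of the construction described above, assuming standard-normal projection vectors and no extra scaling or bias terms. The function names, the seed parameter, and the example sizes are hypothetical and are not taken from the paper.

```python
import random


def make_maxout_projections(input_dim, num_features, set_size, seed=0):
    """Draw `num_features` sets, each holding `set_size` Gaussian vectors."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(input_dim)] for _ in range(set_size)]
        for _ in range(num_features)
    ]


def maxout_features(x, projections):
    """Map x to one feature per set: the largest projection within that set."""
    return [
        max(sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in vector_set)
        for vector_set in projections
    ]


# Hypothetical sizes: 3-dimensional input, 5 output features, 4 vectors per set.
projections = make_maxout_projections(input_dim=3, num_features=5, set_size=4)
features = maxout_features([1.0, -2.0, 0.5], projections)
```

The resulting feature vector would then be fed to a linear model, which is the setting the abstract analyzes.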
Submitted 12 June, 2015; v1 submitted 11 June, 2015;
originally announced June 2015.
-
Deep Multimodal Learning for Audio-Visual Speech Recognition
Authors:
Youssef Mroueh,
Etienne Marcheret,
Vaibhava Goel
Abstract:
In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of $41\%$ under clean conditions on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of $35.83\%$, demonstrating the tremendous value of the visual channel in phone classification even in audio with a high signal-to-noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class-specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of $34.03\%$.
Submitted 22 January, 2015;
originally announced January 2015.