Search | arXiv e-print repository

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Authors: Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang

Abstract: Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement from the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion in the backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2 by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.

Submitted 18 August, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

Comments: Published in ICCV 2023

arXiv:2306.02680 [pdf, other]

BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Authors: Ahana Deb, Sayan Nag, Ayan Mahapatra, Soumitri Chattopadhyay, Aritra Marik, Pijush Kanti Gayen, Shankha Sanyal, Archi Banerjee, Samir Karmakar

Abstract: Spoken languages often utilise intonation, rhythm, intensity, and structure, to communicate intention, which can be interpreted differently depending on the rhythm of speech of their utterance. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, demonstrating their ability to learn powerful representations from multilingual datasets, have performed well in speech tasks and are ideal to model specific tasks in low resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, by using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts ($\underline{\textbf{Be}}$ngali speech acts recognition using Multimodal $\underline{\textbf{At}}$tention Fu$\underline{\textbf{s}}$ion) significantly outperforms both the unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data. Project page: https://soumitri2001.github.io/BeAts

Submitted 5 June, 2023; originally announced June 2023.

Comments: Accepted at INTERSPEECH 2023

arXiv:2304.00733 [pdf, other]

Unbiased Scene Graph Generation in Videos

Authors: Sayak Nag, Kyle Min, Subarna Tripathi, Amit K. Roy Chowdhury

Abstract: The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods highlighting its superiority in generating more unbiased scene graphs.

Submitted 29 June, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: Published in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023

arXiv:2303.14863 [pdf, other]

DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

Authors: Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang

Abstract: We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD in short. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video. This presents a generative modeling perspective, against previous discriminative learning manners. This capability is achieved by first diffusing the ground-truth proposals to random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in the Transformer decoder (e.g., DETR) by introducing a temporal location query design with faster convergence in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that our DiffTAD achieves top performance compared to previous art alternatives. The code will be made available at https://github.com/sauradip/DiffusionTAD.

Submitted 14 July, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

Comments: ICCV 2023; Code available at https://github.com/sauradip/DiffusionTAD

arXiv:2303.09695 [pdf, other]

PersonalTailor: Personalizing 2D Pattern Design from 3D Garment Point Clouds

Authors: Sauradip Nag, Anran Qi, Xiatian Zhu, Ariel Shamir

Abstract: Garment pattern design aims to convert a 3D garment to the corresponding 2D panels and their sewing structure. Existing methods rely either on template fitting with heuristics and prior assumptions, or on model learning with complicated shape parameterization. Importantly, neither approach allows for personalization of the output garment, which is in increasing demand today. To fill this demand, we introduce PersonalTailor: a personalized 2D pattern design method, where the user can input specific constraints or demands (in language or sketch) for personal 2D panel fabrication from 3D point clouds. PersonalTailor first learns multi-modal panel embeddings based on unsupervised cross-modal association and attentive fusion. It then predicts binary panel masks individually using a transformer encoder-decoder framework. Extensive experiments show that our PersonalTailor excels on both personalized and standard pattern fabrication tasks.

Submitted 11 August, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

Comments: Technical Report

arXiv:2303.05556 [pdf, other]

An Evaluation of Non-Contrastive Self-Supervised Learning for Federated Medical Image Analysis

Authors: Soumitri Chattopadhyay, Soham Ganguly, Sreejit Chaudhury, Sayan Nag, Samiran Chattopadhyay

Abstract: Privacy and annotation bottlenecks are two major issues that profoundly affect the practicality of machine learning-based medical image analysis. Although significant progress has been made in these areas, these issues are not yet fully resolved. In this paper, we seek to tackle these concerns head-on and systematically explore the applicability of non-contrastive self-supervised learning (SSL) algorithms under federated learning (FL) simulations for medical image analysis. We conduct thorough experimentation of recently proposed state-of-the-art non-contrastive frameworks under standard FL setups. With the SoTA Contrastive Learning algorithm, SimCLR as our comparative baseline, we benchmark the performances of our 4 chosen non-contrastive algorithms under non-i.i.d. data conditions and with a varying number of clients. We present a holistic evaluation of these techniques on 6 standardized medical imaging datasets. We further analyse different trends inferred from the findings of our research, with the aim to find directions for further research based on ours. To the best of our knowledge, ours is the first to perform such a thorough analysis of federated self-supervised learning for medical imaging. All of our source code will be made public upon acceptance of the paper.

Submitted 9 March, 2023; originally announced March 2023.

arXiv:2303.02245 [pdf, other]

Exploring Self-Supervised Representation Learning For Low-Resource Medical Image Analysis

Authors: Soumitri Chattopadhyay, Soham Ganguly, Sreejit Chaudhury, Sayan Nag, Samiran Chattopadhyay

Abstract: The success of self-supervised learning (SSL) has mostly been attributed to the availability of unlabeled yet large-scale datasets. However, in a specialized domain such as medical imaging which is a lot different from natural images, the assumption of data availability is unrealistic and impractical, as the data itself is scanty and found in small databases, collected for specific prognosis tasks. To this end, we seek to investigate the applicability of self-supervised learning algorithms on small-scale medical imaging datasets. In particular, we evaluate $4$ state-of-the-art SSL methods on three publicly accessible \emph{small} medical imaging datasets. Our investigation reveals that in-domain low-resource SSL pre-training can yield competitive performance to transfer learning from large-scale datasets (such as ImageNet). Furthermore, we extensively analyse our empirical findings to provide valuable insights that can motivate for further research towards circumventing the need for pre-training on a large image corpus. To the best of our knowledge, this is the first attempt to holistically explore self-supervision on low-resource medical datasets.

Submitted 28 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

Comments: Accepted at IEEE ICIP 2023

arXiv:2302.09108 [pdf, other]

doi 10.1109/ISCAS46773.2023.10181988

ViTA: A Vision Transformer Inference Accelerator for Edge Applications

Authors: Shashank Nag, Gourav Datta, Souvik Kundu, Nitin Chandrachoodan, Peter A. Beerel

Abstract: Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks due to their ability to capture the global relation between features which leads to superior performance. However, they are compute-heavy and difficult to deploy in resource-constrained edge devices. Existing hardware accelerators, including those for the closely-related BERT transformer models, do not target highly resource-constrained environments. In this paper, we address this gap and propose ViTA - a configurable hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices and avoiding repeated off-chip memory accesses. We employ a head-level pipeline and inter-layer MLP optimizations, and can support several commonly used vision transformer models with changes solely in our control logic. We achieve nearly 90% hardware utilization efficiency on most vision transformer models, report a power of 0.88W when synthesised with a clock of 150 MHz, and get reasonable frame rates - all of which makes ViTA suitable for edge applications.

Submitted 17 February, 2023; originally announced February 2023.

Comments: Accepted at ISCAS 2023

Journal ref: 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 2023, pp. 1-5

arXiv:2211.14924 [pdf, other]

Post-Processing Temporal Action Detection

Authors: Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang

Abstract: Existing Temporal Action Detection (TAD) methods typically take a pre-processing step in converting an input varying-length video into a fixed-length snippet representation sequence, before temporal boundary estimation and action classification. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. In essence, this is due to a temporal quantization error introduced during the resolution downsampling and recovery. This could negatively impact the TAD performance, but is largely ignored by existing methods. To address this problem, in this work we introduce a novel model-agnostic post-processing method without model redesign and retraining. Specifically, we model the start and end points of action instances with a Gaussian distribution for enabling temporal boundary inference at a sub-snippet level. We further introduce an efficient Taylor-expansion based approximation, dubbed as Gaussian Approximated Post-processing (GAP). Extensive experiments demonstrate that our GAP can consistently improve a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2% to 0.7% in average mAP) and THUMOS (+0.2% to 0.5% in average mAP) benchmarks. Such performance gains are already significant and highly comparable to those achieved by novel model designs. Also, GAP can be integrated with model training for further performance gain. Importantly, GAP enables lower temporal resolutions for more efficient inference, facilitating low-resource applications. The code will be available at https://github.com/sauradip/GAP

Submitted 3 March, 2023; v1 submitted 27 November, 2022; originally announced November 2022.

Comments: CVPR 2023; Code available at https://github.com/sauradip/GAP

arXiv:2211.14905 [pdf, other]

Multi-Modal Few-Shot Temporal Action Detection

Authors: Sauradip Nag, Mengmeng Xu, Xiatian Zhu, Juan-Manuel Perez-Rua, Bernard Ghanem, Yi-Zhe Song, Tao Xiang

Abstract: Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves the state-of-the-art performance on MS-COCO dataset. The code will be available in https://github.com/sauradip/MUPPET

Submitted 27 March, 2023; v1 submitted 27 November, 2022; originally announced November 2022.

Comments: Technical Report

arXiv:2210.15075 [pdf, other]

IDEAL: Improved DEnse locAL Contrastive Learning for Semi-Supervised Medical Image Segmentation

Authors: Hritam Basak, Soumitri Chattopadhyay, Rohit Kundu, Sayan Nag, Rammohan Mallipeddi

Abstract: Due to the scarcity of labeled data, Contrastive Self-Supervised Learning (SSL) frameworks have lately shown great potential in several medical image analysis tasks. However, the existing contrastive mechanisms are sub-optimal for dense pixel-level segmentation tasks due to their inability to mine local features. To this end, we extend the concept of metric learning to the segmentation task, using a dense (dis)similarity learning for pre-training a deep encoder network, and employing a semi-supervised paradigm to fine-tune for the downstream task. Specifically, we propose a simple convolutional projection head for obtaining dense pixel-level features, and a new contrastive loss to utilize these dense projections thereby improving the local representations. A bidirectional consistency regularization mechanism involving two-stream model training is devised for the downstream task. Upon comparison, our IDEAL method outperforms the SoTA methods by fair margins on cardiac MRI segmentation. Code available: https://github.com/hritam-98/IDEAL-ICASSP23

Submitted 2 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: Paper accepted for publication at IEEE ICASSP 2023

arXiv:2210.04135 [pdf, other]

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Authors: Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik Shah, Yann LeCun, Rama Chellappa

Abstract: Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.

Submitted 29 October, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

Comments: Published in TMLR 2023

arXiv:2209.08905 [pdf, ps, other]

Shape evolution in the rapidly rotating $^{140}$Gd nucleus

Authors: H. Pai, S. Rajbanshi, Somnath Nag, Sajad Ali, R. Palit, G. Mukherjee, F. S. Babra, R. Banik, Soumik Bhattacharya, S. Biswas, S. Chakraborty, R. Donthi, S. Jadhav, Md. S. R. Laskar, B. S. Naidu, S. Nandi, A. Goswami

Abstract: The ground state band of $^{140}$Gd has been investigated following its population in the $^{112}$Sn($^{35}$Cl,~$α$p2n)$^{140}$Gd reaction at 195 MeV of beam energy using a large array of Compton suppressed HPGe clovers as the detection setup. Apart from other spectroscopic measurements, level lifetimes of the states have been extracted using the Doppler Shift Attenuation Method. The extracted quadrupole moment, along with the pairing independent cranked Nilsson-Strutinsky model calculations for the quadrupole band, reveals that the nucleus preferably attains triaxiality with $γ$ = -30$^\circ$. The calculation, though, shows a slight possibility of rotation around the longest possible principal axis at high spin $\sim$ 30$\hbar$, which is beyond the scope of the present experiment.

Submitted 19 September, 2022; originally announced September 2022.

arXiv:2208.00955 [pdf, other]

Large-Scale Product Retrieval with Weakly Supervised Representation Learning

Authors: Xiao Han, Kam Woh Ng, Sauradip Nag, Zhiyu Qu

Abstract: Large-scale weakly supervised product retrieval is a practically useful yet computationally challenging problem. This paper introduces a novel solution for the eBay Visual Search Challenge (eProduct) held at the Ninth Workshop on Fine-Grained Visual Categorisation (FGVC9) of CVPR 2022. This competition presents two challenges: (a) E-commerce is a drastically fine-grained domain including many products with subtle visual differences; (b) A lack of target instance-level labels for model training, with only coarse category labels and product titles available. To overcome these obstacles, we formulate a strong solution by a set of dedicated designs: (a) Instead of using text training data directly, we mine thousands of pseudo-attributes from product titles and use them as the ground truths for multi-label classification. (b) We incorporate several strong backbones with advanced training recipes for more discriminative representation learning. (c) We further introduce a number of post-processing techniques including whitening, re-ranking and model ensemble for retrieval enhancement. By achieving 71.53% MAR, our solution "Involution King" achieves the second position on the leaderboard.

Submitted 1 August, 2022; originally announced August 2022.

Comments: FGVC9 CVPR2022

arXiv:2207.08184 [pdf, other]

Zero-Shot Temporal Action Detection via Vision-Language Prompting

Authors: Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang

Abstract: Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging with significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.

Submitted 17 July, 2022; originally announced July 2022.

Comments: ECCV 2022; Code available at https://github.com/sauradip/STALE

arXiv:2207.07059 [pdf, other]

Semi-Supervised Temporal Action Detection with Proposal-Free Masking

Authors: Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang

Abstract: Existing temporal action detection (TAD) methods rely on a large amount of training data with segment-level annotations. Collecting and annotating such a training set is thus highly expensive and unscalable. Semi-supervised TAD (SS-TAD) alleviates this problem by leveraging unlabeled videos freely available at scale. However, SS-TAD is also a much more challenging problem than supervised TAD, and consequently much under-studied. Prior SS-TAD methods directly combine an existing proposal-based TAD method and an SSL method. Due to their sequential localization (e.g., proposal generation) and classification design, they are prone to proposal error propagation. To overcome this limitation, in this work we propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT) with a parallel localization (mask generation) and classification architecture. Such a novel design effectively eliminates the dependence between localization and classification by cutting off the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for prediction refinement, and a new pretext task for self-supervised model pre-training. Extensive experiments on two standard benchmarks show that our SPOT outperforms state-of-the-art alternatives, often by a large margin. The PyTorch implementation of SPOT is available at https://github.com/sauradip/SPOT

Submitted 14 July, 2022; originally announced July 2022.

Comments: ECCV 2022; Code available at https://github.com/sauradip/SPOT

arXiv:2207.06580 [pdf, other]

Proposal-Free Temporal Action Detection via Global Segmentation Mask Learning

Authors: Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang

Abstract: Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs due to proposal generation and/or per-proposal action instance evaluation and the resultant high computational cost. In this work, for the first time, we propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS). Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length. The TAGS model differs significantly from the conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is ~20x faster to train and ~1.6x more efficient for inference. Our PyTorch implementation of TAGS is available at https://github.com/sauradip/TAGS

Submitted 19 August, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

Comments: ECCV 2022; Code available at https://github.com/sauradip/TAGS

arXiv:2207.06277 [pdf, other]

doi 10.1080/2150704X.2022.2097031

ACLNet: An Attention and Clustering-based Cloud Segmentation Network

Authors: Dhruv Makwana, Subhrajit Nag, Onkar Susladkar, Gayatri Deshmukh, Sai Chandra Teja R, Sparsh Mittal, C Krishna Mohan

Abstract: We propose a novel deep learning model named ACLNet, for cloud segmentation from ground images. ACLNet uses both a deep neural network and a machine learning (ML) algorithm to extract complementary features. Specifically, it uses EfficientNet-B0 as the backbone, "à trous spatial pyramid pooling" (ASPP) to learn at multiple receptive fields, and "global attention module" (GAM) to extract fine-grained details from the image. ACLNet also uses k-means clustering to extract cloud boundaries more precisely. ACLNet is effective for both daytime and nighttime images. It provides a lower error rate, higher recall and higher F1-score than state-of-the-art cloud segmentation models. The source code of ACLNet is available here: https://github.com/ckmvigil/ACLNet.

Submitted 13 July, 2022; originally announced July 2022.

Comments: 11 pages, 3 figures, 5 tables, Published in remote sensing letters

Journal ref: volume 13, pages 865-875, year 2022

arXiv:2207.06001 [pdf, other]

Studying the age of onset and detection of Chronic Myeloid Leukemia using a three-stage stochastic model

Authors: Suryadeepto Nag, Ananda Shikhara Bhat, Siddhartha P. Chakrabarty

Abstract: Chronic Myeloid Leukemia (CML) is a biphasic malignant clonal disorder that progresses, first with a chronic phase, where the cells have enhanced proliferation only, and then to a blast phase, where the cells have the ability of self-renewal. It is well-recognized that the Philadelphia chromosome (which contains the BCR-ABL fusion gene) is the "hallmark of CML". However, empirical studies have shown that the mere presence of BCR-ABL may not be a sufficient condition for the development of CML, and further modifications related to tumor suppressors may be necessary. Accordingly, we develop a three-mutation stochastic model of CML progression, with the three stages corresponding to the non-malignant cells with BCR-ABL presence, the malignant cells in the chronic phase and the malignant cells in the blast phase. We demonstrate that the model predictions agree with age incidence data from the United States. Finally, we develop a framework for the retrospective estimation of the time of onset of malignancy, from the time of detection of the cancer.

Submitted 13 July, 2022; originally announced July 2022.

arXiv:2207.00960 [pdf, other]

doi 10.1016/j.compind.2022.103720

WaferSegClassNet -- A Light-weight Network for Classification and Segmentation of Semiconductor Wafer Defects

Authors: Subhrajit Nag, Dhruv Makwana, Sai Chandra Teja R, Sparsh Mittal, C Krishna Mohan

Abstract: As the integration density and design intricacy of semiconductor wafers increase, the magnitude and complexity of defects in them are also on the rise. Since the manual inspection of wafer defects is costly, an automated artificial intelligence (AI) based computer-vision approach is highly desired. The previous works on defect analysis have several limitations, such as low accuracy and the need for separate models for classification and segmentation. For analyzing mixed-type defects, some previous works require separately training one model for each defect type, which is non-scalable. In this paper, we present WaferSegClassNet (WSCN), a novel network based on encoder-decoder architecture. WSCN performs simultaneous classification and segmentation of both single and mixed-type wafer defects. WSCN uses a "shared encoder" for classification and segmentation, which allows training WSCN end-to-end. We use N-pair contrastive loss to first pretrain the encoder and then use BCE-Dice loss for segmentation and categorical cross-entropy loss for classification. Use of N-pair contrastive loss helps in better embedding representation in the latent dimension of wafer maps. WSCN has a model size of only 0.51MB and performs only 0.2M FLOPS. Thus, it is much lighter than other state-of-the-art models. Also, it requires only 150 epochs for convergence, compared to 4,000 epochs needed by a previous work. We evaluate our model on the MixedWM38 dataset, which has 38,015 images. WSCN achieves an average classification accuracy of 98.2% and a dice coefficient of 0.9999. We are the first to show segmentation results on the MixedWM38 dataset. The source code can be obtained from https://github.com/ckmvigil/WaferSegClassNet.

Submitted 3 July, 2022; originally announced July 2022.

Comments: 11 pages, 2 figures, 7 tables, Published in Computers in Industry

Journal ref: Volume 142, 2022, 103720, ISSN 0166-3615,
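
The Dice coefficient reported in the entry above is the standard overlap measure 2|A∩B| / (|A| + |B|) between a predicted mask A and a ground-truth mask B. A minimal sketch in plain Python follows. The function name, the argument names, and the assumption that masks are flat 0/1 sequences are illustrative only; this is not the WSCN evaluation code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

A value of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground pixels.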

arXiv:2207.00506 [pdf, other]

How Far Can I Go ? : A Self-Supervised Approach for Deterministic Video Depth Forecasting

Authors: Sauradip Nag, Nisarg Shah, Anran Qi, Raghavendra Ramachandra

Abstract: In this paper we present a novel self-supervised method to anticipate the depth estimate for a future, unobserved real-world urban scene. This work is the first to explore self-supervised learning for estimation of monocular depth of future unobserved frames of a video. Existing works rely on a large number of annotated samples to generate the probabilistic prediction of depth for unseen frames. However, this is unrealistic because it requires a large amount of annotated depth samples of video. In addition, the probabilistic nature of the case, where one past can have multiple future outcomes, often leads to incorrect depth estimates. Unlike previous methods, we model the depth estimation of the unobserved frame as a view-synthesis problem, which treats the depth estimate of the unseen video frame as an auxiliary task while synthesizing back the views using learned pose. This approach is not only cost effective (we do not use any ground truth depth for training, hence practical) but also deterministic (a sequence of past frames maps to an immediate future). To address this task we first develop a novel depth forecasting network DeFNet which estimates depth of unobserved future by forecasting latent features. Second, we develop a channel-attention based pose estimation network that estimates the pose of the unobserved frame. Using this learned pose, the estimated depth map is reconstructed back into the image domain, thus forming a self-supervised solution. Our proposed approach shows significant improvements in the Abs Rel metric compared to state-of-the-art alternatives on both short and mid-term forecasting settings, benchmarked on KITTI and Cityscapes. Code is available at https://github.com/sauradip/depthForecasting

Submitted 8 July, 2022; v1 submitted 1 July, 2022; originally announced July 2022.

Comments: Accepted in ML4AD Workshop, NeurIPS 2021

arXiv:2111.07042 [pdf]

Agile Satellite Planning for Multi-Payload Observations for Earth Science

Authors: Rich Levinson, Sreeja Nag, Vinay Ravindra

Abstract: We present planning challenges, methods and preliminary results for a new model-based paradigm for earth observing systems in adaptive remote sensing. Our heuristically guided constraint optimization planner produces coordinated plans for multiple satellites, each with multiple instruments (payloads). The satellites are agile, meaning they can quickly maneuver to change viewing angles in response to rapidly changing phenomena. The planner operates in a closed-loop context, updating the plan as it receives regular sensor data and updated predictions. We describe the planner's search space and search procedure, and present preliminary experiment results. Contributions include initial identification of the planner's search space, constraints, heuristics, and performance metrics applied to a soil moisture monitoring scenario using spaceborne radars.

Submitted 13 November, 2021; originally announced November 2021.

Journal ref: International Workshop on Planning & Scheduling for Space (IWPSS) 2021

arXiv:2110.10552 [pdf, other]

Few-Shot Temporal Action Localization with Query Adaptive Transformer

Authors: Sauradip Nag, Xiatian Zhu, Tao Xiang

Abstract: Existing temporal action localization (TAL) works rely on a large number of training videos with exhaustive segment-level annotation, preventing them from scaling to new classes. As a solution to this problem, few-shot TAL (FS-TAL) aims to adapt a model to a new class represented by as few as a single video. Existing FS-TAL methods assume trimmed training videos for new classes. However, this setting is not only unnatural, since actions are typically captured in untrimmed videos, but also ignores background video segments containing vital contextual cues for foreground action segmentation. In this work, we first propose a new FS-TAL setting by proposing to use untrimmed training videos. Further, a novel FS-TAL model is proposed which maximizes the knowledge transfer from training classes whilst enabling the model to be dynamically adapted to both the new class and each video of that class simultaneously. This is achieved by introducing a query adaptive Transformer in the model. Extensive experiments on two action localization benchmarks demonstrate that our method can outperform all the state-of-the-art alternatives significantly in both single-domain and cross-domain scenarios. The source code can be found at https://github.com/sauradip/fewshotQAT

Submitted 20 October, 2021; originally announced October 2021.

Comments: BMVC 2021

arXiv:2109.04572 [pdf, other]

Deciphering Environmental Air Pollution with Large Scale City Data

Authors: Mayukh Bhattacharyya, Sayan Nag, Udita Ghosh

Abstract: Air pollution poses a serious threat to sustainable environmental conditions in the 21st century. Its importance in determining the health and living standards in urban settings is only expected to increase with time. Various factors ranging from artificial emissions to natural phenomena are known to be primary causal agents or influencers behind rising air pollution levels. However, the lack of large scale data involving the major artificial and natural factors has hindered the research on the causes and relations governing the variability of the different air pollutants. Through this work, we introduce a large scale city-wise dataset for exploring the relationships among these agents over a long period of time. We also introduce a transformer based model, cosSquareFormer, for the problem of pollutant level estimation and forecasting. Our model outperforms most of the benchmark models for this task. We also analyze and explore the dataset through our model and other methodologies to bring out important inferences which enable us to understand the dynamics of the causal agents at a deeper level. Through our paper, we seek to provide a solid set of foundations for further research into this domain, which will demand our critical attention in the near future.

Submitted 15 June, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

Comments: Accepted as an Oral Spotlight Paper at International Joint Conference on Artificial Intelligence (IJCAI) 2022

arXiv:2108.09598 [pdf, other]

SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function

Authors: Sayan Nag, Mayukh Bhattacharyya

Abstract: Activation functions play a pivotal role in determining the training dynamics and neural network performance. The widely adopted activation function ReLU, despite being simple and effective, has a few disadvantages including the Dying ReLU problem. In order to tackle such problems, we propose a novel activation function called Serf which is self-regularized and nonmonotonic in nature. Like Mish, Serf also belongs to the Swish family of functions. Based on several experiments on computer vision (image classification and object detection) and natural language processing (machine translation, sentiment classification and multimodal entailment) tasks with different state-of-the-art architectures, it is observed that Serf vastly outperforms ReLU (baseline) and other activation functions including both Swish and Mish, with a markedly bigger margin on deeper architectures. Ablation studies further demonstrate that Serf based architectures perform better than those of Swish and Mish in varying scenarios, validating the effectiveness and compatibility of Serf with varying depth, complexity, optimizers, learning rates, batch sizes, initializers and dropout rates. Finally, we investigate the mathematical relation between Swish and Serf, thereby showing the impact of the preconditioner function ingrained in the first derivative of Serf, which provides a regularization effect making gradients smoother and optimization faster.

Submitted 24 August, 2021; v1 submitted 21 August, 2021; originally announced August 2021.
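
As an illustration of the Serf activation named in the entry above, a minimal sketch follows. It assumes the commonly cited form f(x) = x · erf(ln(1 + e^x)), which matches the "log-Softplus ERror" wording of the title; the abstract itself does not state the formula, and the function names here are illustrative rather than taken from the authors' code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: ln(1 + e^x)."""
    # max(x, 0) + ln(1 + e^{-|x|}) avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def serf(x: float) -> float:
    """Assumed Serf form: x * erf(softplus(x))."""
    return x * math.erf(softplus(x))


if __name__ == "__main__":
    # Print a few sample points on either side of zero.
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(value, serf(value))
```

Under this form the function is zero at the origin, approaches the identity for large positive inputs, and dips below zero before returning toward zero for large negative inputs, which is the nonmonotonic behaviour the abstract describes.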

arXiv:2108.00340 [pdf, other]

Reconstruction guided Meta-learning for Few Shot Open Set Recognition

Authors: Sayak Nag, Dripta S. Raychaudhuri, Sujoy Paul, Amit K. Roy-Chowdhury

Abstract: In many applications, we are constrained to learn classifiers from very limited data (few-shot classification). The task becomes even more challenging if it is also required to identify samples from unknown categories (open-set classification). Learning a good abstraction for a class with very few samples is extremely difficult, especially under open-set settings. As a result, open-set recognition has received minimal attention in the few-shot setting. However, it is a critical task in many applications like environmental monitoring, where the number of labeled examples for each class is limited. Existing few-shot open-set recognition (FSOSR) methods rely on thresholding schemes, with some considering uniform probability for open-class samples. However, this approach is often inaccurate, especially for fine-grained categorization, and makes them highly sensitive to the choice of a threshold. To address these concerns, we propose Reconstructing Exemplar-based Few-shot Open-set ClaSsifier (ReFOCS). By using a novel exemplar reconstruction-based meta-learning strategy, ReFOCS streamlines FSOSR, eliminating the need for a carefully tuned threshold by learning to be self-aware of the openness of a sample. The exemplars act as class representatives and can be either provided in the training dataset or estimated in the feature domain. By testing on a wide variety of datasets, we show ReFOCS to outperform multiple state-of-the-art methods.

Submitted 30 September, 2023; v1 submitted 31 July, 2021; originally announced August 2021.

Comments: Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

arXiv:2107.06518 [pdf, other]

Single Event Transition Risk: A Measure for Long Term Carbon Exposure

Authors: Suryadeepto Nag, Siddhartha P. Chakrabarty, Sankarshan Basu

Abstract: Although there is a growing consensus that a low-carbon transition will be necessary to mitigate accelerated climate change, the magnitude of transition risk for investors is difficult to measure exactly. Investors are therefore constrained by the unavailability of suitable measures to quantify the magnitude of the risk and are forced to use the likes of absolute emissions data or ESG scores in order to manage their portfolios. In this article, we define the Single Event Transition Risk (SETR) and illustrate how it can be used to approximate the magnitude of the total exposure of the price of a share to low-carbon transition. We also discuss potential applications of the single event framework and the SETR as a risk measure, and discuss future directions on how this can be extended to a system with multiple transition events.

Submitted 25 May, 2022; v1 submitted 14 July, 2021; originally announced July 2021.

arXiv:2105.12247 [pdf, other]

GraphVICRegHSIC: Towards improved self-supervised representation learning for graphs with a hyrbid loss function

Authors: Sayan Nag

Abstract: Self-supervised learning and pre-training strategies have developed over the last few years, especially for Convolutional Neural Networks (CNNs). Recently, application of such methods can also be noticed for Graph Neural Networks (GNNs). In this paper, we have used a graph based self-supervised learning strategy with different loss functions (Barlow Twins [Zbontar et al., 2021], HSIC [Tsai et al., 2021], VICReg [Bardes et al., 2021]) which have shown promising results when applied with CNNs previously. We have also proposed a hybrid loss function combining the advantages of VICReg and HSIC and called it VICRegHSIC. The performance of these aforementioned methods has been compared when applied to 7 different datasets such as MUTAG, PROTEINS, IMDB-Binary, etc. Experiments showed that our hybrid loss function performed better than the remaining ones in 4 out of 7 cases. Moreover, the impact of different batch sizes, projector dimensions and data augmentation strategies has also been explored.

Submitted 26 November, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

Comments: Paper Accepted in the Weakly Supervised Representation Learning Workshop, IJCAI 2021 (IJCAI2021-WSRL)

arXiv:2105.02687 [pdf, ps, other]

Anisotropic Multiverse with Varying $c$, $G$ and Study of Thermodynamics

Authors: Ujjal Debnath, Soumak Nag

Abstract: We assume the anisotropic model of the Universe in the framework of varying speed of light $c$ and varying gravitational constant $G$ theories and study different types of singularities. For the singularity models, we write the scale factors in terms of cosmic time and find some conditions for possible singularities. For future singularities, we assume the forms of varying speed of light and varying gravitational constant. For regularizing the big bang singularity, we assume two forms of scale factors: sine model and tangent model. For both models, we examine the validity of the null energy condition and the strong energy condition. Starting from the first law of thermodynamics, we study the thermodynamic behaviours of $n$ number of Universes (i.e., Multiverse) for (i) varying $c$, (ii) varying $G$ and (iii) both varying $c$ and $G$ models. We find the total entropies for all the cases in the anisotropic Multiverse model. We also find the nature of the Multiverse if total entropy is constant.

Submitted 5 May, 2021; originally announced May 2021.

Comments: 8 pages

arXiv:2105.00643 [pdf, other]

Modeling the dynamics of COVID-19 transmission in India: Social Distancing, Regional Spread and Healthcare Capacity

Authors: Suryadeepto Nag, Siddhartha P. Chakrabarty

Abstract: In the new paradigm of health-centric governance, policy makers are in constant need of appropriate metrics and estimates in order to determine the best policies in a non-arbitrary fashion. Thus, in this paper, a compartmentalized model for the transmission of COVID-19 is developed to facilitate policy making. A socially distanced compartment is added to the model and its utility in quantifying the magnitude of voluntary social distancing is illustrated. Modifications are made to incorporate inter-region migration, and suitable metrics are proposed to quantify the impact of migration on the rise of cases. The healthcare capacity is modeled and a method is developed to study the consequences of the saturation of the healthcare system. The model and related measures are used to study the nature of the transmission and spread of COVID-19 in India, and appropriate insights are drawn.

Submitted 19 April, 2022; v1 submitted 3 May, 2021; originally announced May 2021.

arXiv:2104.04636 [pdf, ps, other]

Continuous-Time Higher Order Markov Chains: Formulation and Parameter Estimation

Authors: Suryadeepto Nag

Abstract: Stochastic processes find applications in modelling systems in a variety of disciplines. A large number of stochastic models considered are Markovian in nature. It is often observed that higher order Markov processes can model the data better. However, most higher order Markov models are discrete. Here, we propose a novel continuous-time formulation of higher order Markov processes, as stochastic differential equations, and propose a method of parameter estimation by maximum likelihood methods.

Submitted 9 April, 2021; originally announced April 2021.

arXiv:2102.07940 [pdf, other]

Attitude Trajectory Optimization for Agile Satellites in Autonomous Remote Sensing Constellation

Authors: Emmanuel Sin, Sreeja Nag, Vinay Ravindra, Alan Li, Murat Arcak

Abstract: Agile attitude maneuvering maximizes the utility of remote sensing satellite constellations. By taking into account a satellite's physical properties and its actuator specifications, we may leverage the full performance potential of the attitude control system to conduct agile remote sensing beyond conventional slew-and-stabilize maneuvers. Employing a constellation of agile satellites, coordinated by an autonomous and responsive scheduler, can significantly improve overall response rate, revisit time and global coverage for the mission. In this paper, we use recent advances in sequential convex programming based trajectory optimization to enable rapid-target acquisition, pointing and tracking capabilities for a scheduler-based constellation. We present two problem formulations. The Minimum-Time Slew Optimal Control Problem determines the minimum time, required energy, and optimal trajectory to slew between any two orientations given nonlinear quaternion kinematics, gyrostat and actuator dynamics, and state/input constraints. By gridding the space of 3D rotations and efficiently solving this problem on the grid, we produce lookup tables or parametric fits off-line that can then be used on-line by a scheduler to compute accurate estimates of minimum-time and maneuver energy for real-time constellation scheduling. The Minimum-Effort Multi-Target Pointing Optimal Control Problem is used on-line by each satellite to produce continuous attitude-state and control-input trajectories that realize a given schedule while minimizing attitude error and control effort. The optimal trajectory may then be achieved by a low-level tracking controller. We demonstrate our approach with an example of a reference satellite in Sun-synchronous orbit passing over globally-distributed, Earth-observation targets.

Submitted 15 February, 2021; originally announced February 2021.

Comments: 24 pages, 27 figures

arXiv:2102.06038 [pdf]

A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction

Authors: Sayan Nag, Uddalok Sarkar, Shankha Sanyal, Archi Banerjee, Souparno Roy, Samir Karmakar, Ranjan Sengupta, Dipak Ghosh

Abstract: It is already known that both auditory and visual stimuli are able to convey emotions in the human mind to different extents. The strength or intensity of the emotional arousal varies depending on the type of stimulus chosen. In this study, we try to investigate the emotional arousal in a cross-modal scenario involving both auditory and visual stimuli while studying their source characteristics. A robust fractal analytic technique called Detrended Fluctuation Analysis (DFA) and its 2D analogue have been used to characterize three (3) standardized audio and video signals, quantifying their scaling exponents corresponding to positive and negative valence. It was found that there is a significant difference in scaling exponents corresponding to the two different modalities. Detrended Cross Correlation Analysis (DCCA) has also been applied to decipher the degree of cross-correlation among the individual audio and visual stimuli. This is the first study of its kind which proposes a novel algorithm with which emotional arousal can be classified in a cross-modal scenario using only the source audio and visual signals while also attempting a correlation between them.

Submitted 11 February, 2021; originally announced February 2021.

arXiv:2102.06003 [pdf]

Language Independent Emotion Quantification using Non linear Modelling of Speech

Authors: Uddalok Sarkar, Sayan Nag, Chirayata Bhattacharya, Shankha Sanyal, Archi Banerjee, Ranjan Sengupta, Dipak Ghosh

Abstract: At present emotion extraction from speech is a very important issue due to its diverse applications. Hence, it becomes absolutely necessary to obtain models that take into consideration the speaking styles of a person, vocal tract information, timbral qualities and other congenital information regarding his voice. Our speech production system is a nonlinear system like most other real world systems. Hence the need arises for modelling our speech information using nonlinear techniques. In this work we have modelled our articulation system using nonlinear multifractal analysis. The multifractal spectral width and scaling exponents essentially reveal the complexity associated with the speech signals taken. The multifractal spectrums are well distinguishable in the low fluctuation region in case of different emotions. The source characteristics have been quantified with the help of different non-linear models like Multi-Fractal Detrended Fluctuation Analysis, Wavelet Transform Modulus Maxima. The results obtained from this study give a very good result in emotion clustering.

Submitted 11 February, 2021; originally announced February 2021.

arXiv:2102.00616 [pdf]

Neural Network architectures to classify emotions in Indian Classical Music

Authors: Uddalok Sarkar, Sayan Nag, Medha Basu, Archi Banerjee, Shankha Sanyal, Ranjan Sengupta, Dipak Ghosh

Abstract: Music is often considered as the language of emotions. It has long been known to elicit emotions in human being and thus categorizing music based on the type of emotions they induce in human being is a very intriguing topic of research. When the task comes to classify emotions elicited by Indian Classical Music (ICM), it becomes much more challenging because of the inherent ambiguity associated with ICM. The fact that a single musical performance can evoke a variety of emotional response in the audience is implicit to the nature of ICM renditions. With the rapid advancements in the field of Deep Learning, this Music Emotion Recognition (MER) task is becoming more and more relevant and robust, hence can be applied to one of the most challenging test case i.e. classifying emotions elicited from ICM. In this paper we present a new dataset called JUMusEmoDB which presently has 400 audio clips (30 seconds each) where 200 clips correspond to happy emotions and the remaining 200 clips correspond to sad emotion. For supervised classification purposes, we have used 4 existing deep Convolutional Neural Network (CNN) based architectures (resnet18, mobilenet v2.0, squeezenet v1.0 and vgg16) on corresponding music spectrograms of the 2000 sub-clips (where every clip was segmented into 5 sub-clips of about 5 seconds each) which contain both time as well as frequency domain information. The initial results are quite inspiring, and we look forward to setting the baseline values for the dataset using this architecture. This type of CNN based classification algorithm using a rich corpus of Indian Classical Music is unique even in the global perspective and can be replicated in other modalities of music also. This dataset is still under development and we plan to include more data containing other emotional features as well. We plan to make the dataset publicly available soon.

Submitted 31 January, 2021; originally announced February 2021.

arXiv:2101.05458 [pdf, ps, other]

On the stability of equilibria of the physiologically-informed dynamic causal model

Authors: Sayan Nag

Abstract: Experimental manipulations perturb the neuronal activity. This phenomenon is manifested in the fMRI response. Dynamic causal model and its variants can model these neuronal responses along with the BOLD responses [1, 2, 3, 4, 5]. Physiologically-informed DCM (P-DCM) [5] gives state-of-the-art results in this aspect. But, P-DCM has more parameters compared to the standard DCM model and the stability of this particular model is still unexplored. In this work, we will try to explore the stability of the P-DCM model and find the ranges of the model parameters which make it stable.

Submitted 13 January, 2021; originally announced January 2021.

arXiv:2012.05694 [pdf]

Lookahead optimizer improves the performance of Convolutional Autoencoders for reconstruction of natural images

Authors: Sayan Nag

Abstract: Autoencoders are a class of artificial neural networks which have gained a lot of attention in the recent past. Using the encoder block of an autoencoder the input image can be compressed into a meaningful representation. Then a decoder is employed to reconstruct the compressed representation back to a version which looks like the input image. It has plenty of applications in the field of data compression and denoising. Another version of Autoencoders (AE) exist, called Variational AE (VAE) which acts as a generative model like GAN. Recently, an optimizer was introduced which is known as lookahead optimizer which significantly enhances the performances of Adam as well as SGD. In this paper, we implement Convolutional Autoencoders (CAE) and Convolutional Variational Autoencoders (CVAE) with lookahead optimizer (with Adam) and compare them with the Adam (only) optimizer counterparts. For this purpose, we have used a movie dataset comprising of natural images for the former case and CIFAR100 for the latter case. We show that lookahead optimizer (with Adam) improves the performance of CAEs for reconstruction of natural images.

Submitted 2 December, 2020; originally announced December 2020.

arXiv:2010.09946 [pdf]

Planning a Reference Constellation for Radiometric Cross-Calibration of Commercial Earth Observing Sensors

Authors: Sreeja Nag, Philip Dabney, Vinay Ravindra, Cody Anderson

Abstract: The Earth Observation planning community has access to tools that can propagate orbits and compute coverage of Earth observing imagers with customizable shapes and orientation, model the expected Earth Reflectance at various bands, epochs and directions, generate simplified instrument performance metrics for imagers and radars, and schedule single and multiple spacecraft payload operations. We are working toward integrating existing tools to design a planner that allows commercial small spacecraft to assess the opportunities for cross-calibration of their sensors against current satellite to be calibrated, specifications of the reference instruments, sensor stability, allowable latency between calibration measurements, differences in viewing and solar geometry between calibration measurements, etc. The planner would output cross-calibration opportunities for every reference target pair as a function of flexible user-defined parameters. We use a preliminary version of this planner to inform the design of a constellation of transfer radiometers that can serve as stable, radiometric references for commercial sensors to cross-calibrate with. We propose such a constellation for either vicarious cross-calibration using pre-selected sites, or top of the atmosphere (TOA) cross-calibration globally. Results from the calibration planner applied to a subset of informed architecture designs show that a 4 sat constellation provides multiple calibration opportunities within half a day planning horizon, for Cubesat sensors deployed into typical rideshare orbits. While such opportunities are available for cross calibration image pairs within 5 deg of solar or view directions, and within an hour (for TOA) and less than a day (vicariously), the planner allows us to identify many more by relaxing user-defined restrictions.

Submitted 19 October, 2020; originally announced October 2020.

Journal ref: International Workshop on Planning and Scheduling for Space, Berkeley CA, July 2019

arXiv:2010.09940 [pdf]

Autonomous Scheduling of Agile Spacecraft Constellations with Delay Tolerant Networking for Reactive Imaging

Authors: Sreeja Nag, Alan S. Li, Vinay Ravindra, Marc Sanchez Net, Kar-Ming Cheung, Rod Lammers, Brian Bledsoe

Abstract: Small spacecraft now have precise attitude control systems available commercially, allowing them to slew in 3 degrees of freedom, and capture images within short notice. When combined with appropriate software, this agility can significantly increase response rate, revisit time and coverage. In prior work, we have demonstrated an algorithmic framework that combines orbital mechanics, attitude control and scheduling optimization to plan the time-varying, full-body orientation of agile, small spacecraft in a constellation. The proposed schedule optimization would run at the ground station autonomously, and the resultant schedules uplinked to the spacecraft for execution. The algorithm is generalizable over small steerable spacecraft, control capability, sensor specs, imaging requirements, and regions of interest. In this article, we modify the algorithm to run onboard small spacecraft, such that the constellation can make time-sensitive decisions to slew and capture images autonomously, without ground control. We have developed a communication module based on Delay/Disruption Tolerant Networking (DTN) for onboard data management and routing among the satellites, which will work in conjunction with the other modules to optimize the schedule of agile communication and steering. We then apply this preliminary framework on representative constellations to simulate targeted measurements of episodic precipitation events and subsequent urban floods. The command and control efficiency of our agile algorithm is compared to non-agile (11.3x improvement) and non-DTN (21% improvement) constellations.

Submitted 19 October, 2020; originally announced October 2020.

Journal ref: International Conference on Automated Planning and Scheduling SPARK Workshop, Berkeley, July 2019

arXiv:2010.03350 [pdf, other]

Modeling the commodity prices of base metals in Indian commodity market using a Higher Order Markovian Approach

Authors: Suryadeepto Nag, Sankarshan Basu, Siddhartha P. Chakrabarty

Abstract: A Higher Order Markovian (HOM) model to capture the dynamics of commodity prices is proposed as an alternative to a Markovian model. In particular, the order of the former model is taken to be the delay in the response of the industry to the market information. This is then empirically analyzed for the prices of Copper Mini and four other base metals, namely Aluminum, Lead, Nickel and Zinc, in the Indian commodities market. In case of Copper Mini, the usage of the HOM approach consistently offers improvement over the Markovian approach in terms of the errors in forecasting. Similar trends were observed for the other base metals considered, with the exception of Aluminum, which can be attributed to the volatility in the Asian market during the COVID-19 outbreak.

Submitted 7 October, 2020; originally announced October 2020.

arXiv:2006.15100 [pdf, other]

doi 10.1109/VLSID49098.2020.00044

E2GC: Energy-efficient Group Convolution in Deep Neural Networks

Authors: Nandan Kumar Jha, Rajat Saini, Subhrajit Nag, Sparsh Mittal

Abstract: The number of groups ($g$) in group convolution (GConv) is selected to boost the predictive performance of deep neural networks (DNNs) in a compute and parameter efficient manner. However, we show that naive selection of $g$ in GConv creates an imbalance between the computational complexity and degree of data reuse, which leads to suboptimal energy efficiency in DNNs. We devise an optimum group size model, which enables a balance between computational cost and data movement cost, thus, optimize the energy-efficiency of DNNs. Based on the insights from this model, we propose an "energy-efficient group convolution" (E2GC) module where, unlike the previous implementations of GConv, the group size ($G$) remains constant. Further, to demonstrate the efficacy of the E2GC module, we incorporate this module in the design of MobileNet-V1 and ResNeXt-50 and perform experiments on two GPUs, P100 and P4000. We show that, at comparable computational complexity, DNNs with constant group size (E2GC) are more energy-efficient than DNNs with a fixed number of groups (F$g$GC). For example, on P100 GPU, the energy-efficiency of MobileNet-V1 and ResNeXt-50 is increased by 10.8% and 4.73% (respectively) when E2GC modules substitute the F$g$GC modules in both the DNNs. Furthermore, through our extensive experimentation with ImageNet-1K and Food-101 image classification datasets, we show that the E2GC module enables a trade-off between generalization ability and representational power of DNN. Thus, the predictive performance of DNNs can be optimized by selecting an appropriate $G$. The code and trained models are available at https://github.com/iithcandle/E2GC-release.

Submitted 26 June, 2020; originally announced June 2020.

Comments: Accepted as a conference paper in 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID)

ACM Class: I.5.1; I.5.2; I.5.5; C.0

Journal ref: VLSID (2020) 155-160

arXiv:2006.07334 [pdf, other]

doi 10.1140/epja/s10050-022-00809-4

High spin states of $^{204}$At: isomeric states and shears band structure

Authors: D. Kanjilal, S. K. Dey, S. S. Bhattacharjee, A. Bisoi, M. Das, C. C. Dey, S. Nag, R. Palit, S. Ray, S. Saha, J. Sethi, S. Saha

Abstract: High-spin states of neutron deficient Trans-Lead nucleus $^{204}$At were populated up to $\sim 8\,{\rm MeV}$ excitation through the $^{12}$C + $^{197}$Au fusion evaporation reaction. Decay of the associated levels through prompt and delayed $γ$-ray emissions were studied to evaluate the underlying nuclear structure. The level scheme, which was partly known, was extended further. An isomeric $16^+$ level with observed lifetime $τ=52 \pm 5\, {\rm ns}$, was established from our measurements. Attempts were made to interpret the excited states based on multi quasiparticle and hole structures involving $2f_{5/2}$, $1h_{9/2}$, and $1i_{13/2}$ shell model states, along with moderate core excitation. Magnetic dipole band structure over the spin parity range:~$16^+ - 23^+$ was confirmed and evaluated in more detail, including the missing cross-over $E2$ transitions. Band-crossing along the shears band was observed and compared with the evidence of similar phenomena in the neighbouring neutron deficient $^{202}$Bi, $^{205}$Rn isotones and the $^{203}$At isotope. Based on comparison of the measured $B(M1)/B(E2)$ values for transitions along the band with the semiclassical model based estimates, the shears band of $^{204}$At was established along with the level scheme.

Submitted 1 September, 2022; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in The European Physical Journal A and is available online at https://doi.org/10.1140/epja/s10050-022-00809-4

Journal ref: Eur. Phys. J. A (2022) 58:159

arXiv:2005.12524 [pdf]

A New Unified Method for Detecting Text from Marathon Runners and Sports Players in Video

Authors: Sauradip Nag, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein

Abstract: Detecting text located on the torsos of marathon runners and sports players in video is a challenging issue due to poor quality and adverse effects caused by flexible/colorful clothing, and different structures of human bodies or actions. This paper presents a new unified method for tackling the above challenges. The proposed method fuses gradient magnitude and direction coherence of text pixels in a new way for detecting candidate regions. Candidate regions are used for determining the number of temporal frame clusters obtained by K-means clustering on frame differences. This process in turn detects key frames. The proposed method explores Bayesian probability for skin portions using color values at both pixel and component levels of temporal frames, which provides fused images with skin components. Based on skin information, the proposed method then detects faces and torsos by finding structural and spatial coherences between them. We further propose adaptive pixels linking a deep learning model for text detection from torso regions. The proposed method is tested on our own dataset collected from marathon/sports video and three standard datasets, namely, RBNR, MMM and R-ID of marathon images, to evaluate the performance. In addition, the proposed method is also tested on the standard natural scene datasets, namely, CTW1500 and MS-COCO text datasets, to show the objectiveness of the proposed method. A comparative study with the state-of-the-art methods on bib number/text detection of different datasets shows that the proposed method outperforms the existing methods.

Submitted 26 May, 2020; originally announced May 2020.

Comments: Accepted in Pattern Recognition, Elsevier

arXiv:2004.08248 [pdf]

Acoustical classification of different speech acts using nonlinear methods

Authors: Chirayata Bhattacharyya, Sourya Sengupta, Sayan Nag, Shankha Sanyal, Archi Banerjee, Ranjan Sengupta, Dipak Ghosh

Abstract: A recitation is a way of combining the words together so that they have a sense of rhythm and thus an emotional content is imbibed within. In this study we envisaged to answer these questions in a scientific manner taking into consideration 5 (five) well known Bengali recitations of different poets conveying a variety of moods ranging from joy to sorrow. The clips were recited as well as read (in the form of flat speech without any rhythm) by the same person to avoid any perceptual difference arising out of timbre variation. Next, the emotional content from the 5 recitations were standardized with the help of listening test conducted on a pool of 50 participants. The recitations as well as the speech were analyzed with the help of a latest non linear technique called Detrended Fluctuation Analysis (DFA) that gives a scaling exponent α, which is essentially the measure of long range correlations present in the signal. Similar pieces (the parts which have the exact lyrical content in speech as well as in the recital) were extracted from the complete signal and analyzed with the help of DFA technique. Our analysis shows that the scaling exponent for all parts of recitation were much higher in general as compared to their counterparts in speech. We have also established a critical value from our analysis, above which a mere speech may become a recitation. The case may be similar to the conventional phase transition, wherein the measurement of external condition at which the transformation occurs (generally temperature) is called phase transition. Further, we have also categorized the 5 recitations on the basis of their emotional content with the help of the same DFA technique. Analysis with a greater variety of recitations is being carried out to yield more interesting results.

Submitted 5 August, 2020; v1 submitted 15 April, 2020; originally announced April 2020.

Comments: 6 pages, 2 figures; Proceedings of WESPAC 2018, New Delhi, India, November 11-15, 2018

arXiv:2004.07820 [pdf]

Speaker Recognition in Bengali Language from Nonlinear Features

Authors: Uddalok Sarkar, Soumyadeep Pal, Sayan Nag, Chirayata Bhattacharya, Shankha Sanyal, Archi Banerjee, Ranjan Sengupta, Dipak Ghosh

Abstract: At present Automatic Speaker Recognition system is a very important issue due to its diverse applications. Hence, it becomes absolutely necessary to obtain models that take into consideration the speaking style of a person, vocal tract information, timbral qualities of his voice and other congenital information regarding his voice. The study of Bengali speech recognition and speaker identification is scarce in the literature. Hence the need arises for involving Bengali subjects in modelling our speaker identification engine. In this work, we have extracted some acoustic features of speech using non linear multifractal analysis. The Multifractal Detrended Fluctuation Analysis reveals essentially the complexity associated with the speech signals taken. The source characteristics have been quantified with the help of different techniques like Correlation Matrix, skewness of MFDFA spectrum etc. The Results obtained from this study gives a good recognition rate for Bengali Speakers.

Submitted 15 April, 2020; originally announced April 2020.

Comments: arXiv admin note: text overlap with arXiv:1612.00171, arXiv:1601.07709

arXiv:2004.02071 [pdf, ps, other]

Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation

Authors: Sreyashi Nag, Mihir Kale, Varun Lakshminarasimhan, Swapnil Singhavi

Abstract: We explore ways of incorporating bilingual dictionaries to enable semi-supervised neural machine translation. Conventional back-translation methods have shown success in leveraging target side monolingual data. However, since the quality of back-translation models is tied to the size of the available parallel corpora, this could adversely impact the synthetically generated sentences in a low resource setting. We propose a simple data augmentation technique to address this shortcoming. We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences. This automatically expands the vocabulary of the model while maintaining high quality content. Our method shows an appreciable improvement in performance over strong baselines.

Submitted 4 April, 2020; originally announced April 2020.

arXiv:1912.05014 [pdf, other]

Hybrid Style Siamese Network: Incorporating style loss in complementary apparels retrieval

Authors: Mayukh Bhattacharyya, Sayan Nag

Abstract: Image Retrieval grows to be an integral part of fashion e-commerce ecosystem as it keeps expanding in multitudes. Other than the retrieval of visually similar items, the retrieval of visually compatible or complementary items is also an important aspect of it. Normal Siamese Networks tend to work well on complementary items retrieval. But it fails to identify low level style features which make items compatible in human eyes. These low level style features are captured to a large extent in techniques used in neural style transfer. This paper proposes a mechanism of utilising those methods in this retrieval task and capturing the low level style features through a hybrid siamese network coupled with a hybrid loss. The experimental results indicate that the proposed method outperforms traditional siamese networks in retrieval tasks for complementary items.

Submitted 9 June, 2020; v1 submitted 23 November, 2019; originally announced December 2019.

Comments: Paper Accepted in the Third Workshop on Computer Vision for Fashion, Art and Design, CVPR 2020

arXiv:1912.03641 [pdf, other]

SaLite : A light-weight model for salient object detection

Authors: Kitty Varghese, Sauradip Nag

Abstract: Salient object detection is a prevalent computer vision task that has applications ranging from abnormality detection to abnormality processing. Context modelling is an important criterion in the domain of saliency detection. A global context helps in determining the salient object in a given image by contrasting away other objects in the global view of the scene. However, the local context features detects the boundaries of the salient object with higher accuracy in a given region. To incorporate the best of both worlds, our proposed SaLite model uses both global and local contextual features. It is an encoder-decoder based architecture in which the encoder uses a lightweight SqueezeNet and decoder is modelled using convolution layers. Modern deep based models entitled for saliency detection use a large number of parameters, which is difficult to deploy on embedded systems. This paper attempts to solve the above problem using SaLite which is a lighter process for salient object detection without compromising on performance. Our approach is extensively evaluated on three publicly available datasets namely DUTS, MSRA10K, and SOC. Experimental results show that our proposed SaLite has significant and consistent improvements over the state-of-the-art methods.

Submitted 8 December, 2019; originally announced December 2019.

Comments: This was submitted to NCVPRIPG 2019

arXiv:1906.12039 [pdf, ps, other]

Supervised Contextual Embeddings for Transfer Learning in Natural Language Processing Tasks

Authors: Mihir Kale, Aditya Siddhant, Sreyashi Nag, Radhika Parik, Matthias Grabmair, Anthony Tomasic

Abstract: Pre-trained word embeddings are the primary method for transfer learning in several Natural Language Processing (NLP) tasks. Recent works have focused on using unsupervised techniques such as language modeling to obtain these embeddings. In contrast, this work focuses on extracting representations from multiple pre-trained supervised models, which enriches word embeddings with task and domain specific knowledge. Experiments performed in cross-task, cross-domain and cross-lingual settings indicate that such supervised embeddings are helpful, especially in the low-resource setting, but the extent of gains is dependent on the nature of the task and domain. We make our code publicly available.

Submitted 28 June, 2019; originally announced June 2019.

Comments: Appeared in 2nd Learning from Limited Labeled Data (LLD) Workshop at ICLR 2019

arXiv:1903.07354 [pdf, ps, other]

doi 10.1103/PhysRevB.99.224203

Can many-body localization persist in the presence of long-range interactions or long-range hopping?

Authors: Sabyasachi Nag, Arti Garg

Abstract: We study many-body localization (MBL) in a one-dimensional system of spinless fermions with a deterministic aperiodic potential in the presence of long-range interactions or long-range hopping. Based on perturbative arguments there is a common belief that MBL can exist only in systems with short-range interactions and short-range hopping. We analyze effects of power-law interactions and power-law hopping, separately, on a system which has all the single particle states localized in the absence of interactions. Since delocalization is driven by proliferation of resonances in the Fock space, we mapped this model to an effective Anderson model on a complex graph in the Fock space, and calculated the probability distribution of the number of resonances up to third order. Though the most-probable value of the number of resonances diverges for the system with long-range hopping ($t(r) \sim t_0/r^α$ with $α< 2$), there is no enhancement of the number of resonances as the range of power-law interactions increases. This indicates that the long-range hopping delocalizes the many-body localized system but in contrast to this, there is no signature of delocalization in the presence of long-range interactions. We further provide support in favor of this analysis based on dynamics of the system after a quench starting from a charge density wave ordered state, level spacing statistics, return probability, participation ratio and Shannon entropy in the Fock space.
We demonstrate that MBL persists in the presence of long-range interactions though long-range hopping with $1<α<2$ delocalizes the system partially, with all the states extended for $α<1$. Even in a system which has single-particle mobility edges in the non-interacting limit, turning on long-range interactions does not cause delocalization.

Submitted 13 June, 2019; v1 submitted 18 March, 2019; originally announced March 2019.

Comments: 13 Figures

Journal ref: Phys. Rev. B 99, 224203 (2019)
