Search | arXiv e-print repository

arXiv:2411.18007 [pdf]

AI-Driven Smartphone Solution for Digitizing Rapid Diagnostic Test Kits and Enhancing Accessibility for the Visually Impaired

Authors: R. B. Dastagir, J. T. Jami, S. Chanda, F. Hafiz, M. Rahman, K. Dey, M. M. Rahman, M. Qureshi, M. M. Chowdhury

Abstract: Rapid diagnostic tests are crucial for timely disease detection and management, yet accurate interpretation of test results remains challenging. In this study, we propose a novel approach to enhance the accuracy and reliability of rapid diagnostic test result interpretation by integrating artificial intelligence (AI) algorithms, including convolutional neural networks (CNN), within a smartphone-ba… ▽ More Rapid diagnostic tests are crucial for timely disease detection and management, yet accurate interpretation of test results remains challenging. In this study, we propose a novel approach to enhance the accuracy and reliability of rapid diagnostic test result interpretation by integrating artificial intelligence (AI) algorithms, including convolutional neural networks (CNN), within a smartphone-based application. The app enables users to take pictures of their test kits, which YOLOv8 then processes to precisely crop and extract the membrane region, even if the test kit is not centered in the frame or is positioned at the very edge of the image. This capability offers greater accessibility, allowing even visually impaired individuals to capture test images without needing perfect alignment, thus promoting user independence and inclusivity. The extracted image is analyzed by an additional CNN classifier that determines if the results are positive, negative, or invalid, providing users with the results and a confidence level. Through validation experiments with commonly used rapid test kits across various diagnostic applications, our results demonstrate that the synergistic integration of AI significantly improves sensitivity and specificity in test result interpretation. This improvement can be attributed to the extraction of the membrane zones from the test kit images using the state-of-the-art YOLO algorithm. Additionally, we performed SHapley Additive exPlanations (SHAP) analysis to investigate the factors influencing the model's decisions, identifying reasons behind both correct and incorrect classifications. By facilitating the differentiation of genuine test lines from background noise and providing valuable insights into test line intensity and uniformity, our approach offers a robust solution to challenges in rapid test interpretation. △ Less

Submitted 26 November, 2024; originally announced November 2024.

arXiv:2401.17026 [pdf]

doi 10.1109/TCYB.2017.2751740

Static and Dynamic Synthesis of Bengali and Devanagari Signatures

Authors: Miguel A. Ferrer, Sukalpa Chanda, Moises Diaz, Chayan Kr. Banerjee, Anirban Majumdar, Cristina Carmona-Duarte, Parikshit Acharya, Umapada Pal

Abstract: Developing an automatic signature verification system is challenging and demands a large number of training samples. This is why synthetic handwriting generation is an emerging topic in document image analysis. Some handwriting synthesizers use the motor equivalence model, the well-established hypothesis from neuroscience, which analyses how a human being accomplishes movement. Specifically, a mot… ▽ More Developing an automatic signature verification system is challenging and demands a large number of training samples. This is why synthetic handwriting generation is an emerging topic in document image analysis. Some handwriting synthesizers use the motor equivalence model, the well-established hypothesis from neuroscience, which analyses how a human being accomplishes movement. Specifically, a motor equivalence model divides human actions into two steps: 1) the effector independent step at cognitive level and 2) the effector dependent step at motor level. In fact, recent work reports the successful application to Western scripts of a handwriting synthesizer, based on this theory. This paper aims to adapt this scheme for the generation of synthetic signatures in two Indic scripts, Bengali (Bangla), and Devanagari (Hindi). For this purpose, we use two different online and offline databases for both Bengali and Devanagari signatures. This paper reports an effective synthesizer for static and dynamic signatures written in Devanagari or Bengali scripts. We obtain promising results with artificially generated signatures in terms of appearance and performance when we compare the results with those for real signatures. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted version. Published on IEEE Transactions on Cybernetics [ISSN 2168-2267], v. 48(10), p. 2896-2907

Journal ref: IEEE Transactions on Cybernetics, v. 48(10), p. 2896-2907, 2018

arXiv:2312.08010 [pdf, other]

EZ-CLIP: Efficient Zeroshot Video Action Recognition

Authors: Shahzad Ahmad, Sukalpa Chanda, Yogesh S Rawat

Abstract: Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, for videos extending their zero-shot capabilities to the video domain. While these adaptations have shown pr… ▽ More Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, for videos extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle with effectively modeling the crucial temporal aspects inherent to the video domain. In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal visual prompting for seamless temporal adaptation, requiring no fundamental alterations to the core CLIP architecture while preserving its remarkable generalization abilities. Moreover, we introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, thereby enhancing its learning capabilities from video data. We conducted extensive experiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization.Impressively, with a mere 5.2 million learnable parameters (as opposed to the 71.1 million in the prior best model), EZ-CLIP can be efficiently trained on a single GPU, outperforming existing approaches in several evaluations. △ Less

Submitted 19 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

arXiv:2311.11662 [pdf, other]

Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D Human Motion Recovery from Monocular Videos

Authors: Sushovan Chanda, Amogh Tiwari, Lokender Tiwari, Brojeshwar Bhowmick, Avinash Sharma, Hrishav Barua

Abstract: Recovering temporally consistent 3D human body pose, shape and motion from a monocular video is a challenging task due to (self-)occlusions, poor lighting conditions, complex articulated body poses, depth ambiguity, and limited availability of annotated data. Further, doing a simple perframe estimation is insufficient as it leads to jittery and implausible results. In this paper, we propose a nove… ▽ More Recovering temporally consistent 3D human body pose, shape and motion from a monocular video is a challenging task due to (self-)occlusions, poor lighting conditions, complex articulated body poses, depth ambiguity, and limited availability of annotated data. Further, doing a simple perframe estimation is insufficient as it leads to jittery and implausible results. In this paper, we propose a novel method for temporally consistent motion estimation from a monocular video. Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose and camera initialization over a temporal window followed by a novel spatio-temporal feature aggregation by using a combination of self-similarity and self-attention over the body-aware features and the perframe initialization. Together, they yield enhanced spatiotemporal context for every frame by considering remaining past and future frames. These features are used to predict the pose and shape parameters of the human body model, which are further refined using an LSTM. Experimental results on the publicly available benchmark data show that our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods over all key quantitative evaluation metrics, including complex scenarios like partial occlusion, complex poses and even relatively low illumination. △ Less

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2301.10172 [pdf, other]

MTTN: Multi-Pair Text to Text Narratives for Prompt Generation

Authors: Archan Ghosh, Debgandhar Ghosh, Madhurima Maji, Suchinta Chanda, Kalporup Goswami

Abstract: The increased interest in diffusion models has opened up opportunities for advancements in generative text modeling. These models can produce impressive images when given a well-crafted prompt, but creating a powerful or meaningful prompt can be hit-or-miss. To address this, we have created a large-scale dataset that is derived and synthesized from real prompts and indexed with popular image-text… ▽ More The increased interest in diffusion models has opened up opportunities for advancements in generative text modeling. These models can produce impressive images when given a well-crafted prompt, but creating a powerful or meaningful prompt can be hit-or-miss. To address this, we have created a large-scale dataset that is derived and synthesized from real prompts and indexed with popular image-text datasets such as MS-COCO and Flickr. We have also implemented stages that gradually reduce context and increase complexity, which will further enhance the output due to the complex annotations created. The dataset, called MTTN, includes over 2.4 million sentences divided into 5 stages, resulting in a total of over 12 million pairs, and a vocabulary of over 300,000 unique words, providing ample variation. The original 2.4 million pairs are designed to reflect the way language is used on the internet globally, making the dataset more robust for any model trained on it. △ Less

Submitted 29 January, 2023; v1 submitted 21 January, 2023; originally announced January 2023.

arXiv:2212.06959 [pdf, ps, other]

doi 10.1142/S0219887824500981

Mechanics of geodesics in Information geometry and Black Hole Thermodynamics

Authors: Sumanto Chanda, Tatsuaki Wada

Abstract: In this article we shall discuss the theory of geodesics in information geometry, and an application in astrophysics. We will study how gradient flows in information geometry describe geodesics, explore the related mechanics by introducing a constraint, and apply our theory to Gaussian model and black hole thermodynamics. Thus, we demonstrate how deformation of gradient flows leads to more general… ▽ More In this article we shall discuss the theory of geodesics in information geometry, and an application in astrophysics. We will study how gradient flows in information geometry describe geodesics, explore the related mechanics by introducing a constraint, and apply our theory to Gaussian model and black hole thermodynamics. Thus, we demonstrate how deformation of gradient flows leads to more general Randers-Finsler metrics, describe Hamiltonian mechanics that derive from a constraint, and prove duality via canonical transformation. We also verified our theories for a deformation of the Gaussian model, and described dynamical evolution of flat metrics for Kerr and Reissner-Nordström black holes. △ Less

Submitted 6 December, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: 24 pages. Corrections made. New section and 2 references added. Please comment

arXiv:2210.09922 [pdf, other]

Few-Shot Learning of Compact Models via Task-Specific Meta Distillation

Authors: Yong Wu, Shekhor Chanda, Mehrdad Hosseinzadeh, Zhi Liu, Yang Wang

Abstract: We consider a new problem of few-shot learning of compact models. Meta-learning is a popular approach for few-shot learning. Previous work in meta-learning typically assumes that the model architecture during meta-training is the same as the model architecture used for final deployment. In this paper, we challenge this basic assumption. For final deployment, we often need the model to be small. Bu… ▽ More We consider a new problem of few-shot learning of compact models. Meta-learning is a popular approach for few-shot learning. Previous work in meta-learning typically assumes that the model architecture during meta-training is the same as the model architecture used for final deployment. In this paper, we challenge this basic assumption. For final deployment, we often need the model to be small. But small models usually do not have enough capacity to effectively adapt to new tasks. In the mean time, we often have access to the large dataset and extensive computing power during meta-training since meta-training is typically performed on a server. In this paper, we propose task-specific meta distillation that simultaneously learns two models in meta-learning: a large teacher model and a small student model. These two models are jointly learned during meta-training. Given a new task during meta-testing, the teacher model is first adapted to this task, then the adapted teacher model is used to guide the adaptation of the student model. The adapted student model is used for final deployment. We demonstrate the effectiveness of our approach in few-shot image classification using model-agnostic meta-learning (MAML). Our proposed method outperforms other alternatives on several benchmark datasets. △ Less

Submitted 18 October, 2022; originally announced October 2022.

Comments: This paper has been accepted by WACV'2023

arXiv:2202.10401 [pdf, other]

Vision-Language Pre-Training with Triple Contrastive Learning

Authors: Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, Junzhou Huang

Abstract: Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result… ▽ More Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to its capability in maximizing the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores data potential within each modality, which may result in degraded representations. For instance, although CMA-based models are able to map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can get even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves the new state of the art on various common down-stream vision-language tasks such as image-text retrieval and visual question answering. △ Less

Submitted 28 March, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: CVPR 2022; code: https://github.com/uta-smile/TCL

arXiv:2111.10618 [pdf, other]

PAANet: Progressive Alternating Attention for Automatic Medical Image Segmentation

Authors: Abhishek Srivastava, Sukalpa Chanda, Debesh Jha, Michael A. Riegler, Pål Halvorsen, Dag Johansen, Umapada Pal

Abstract: Medical image segmentation can provide detailed information for clinical analysis which can be useful for scenarios where the detailed location of a finding is important. Knowing the location of disease can play a vital role in treatment and decision-making. Convolutional neural network (CNN) based encoder-decoder techniques have advanced the performance of automated medical image segmentation sys… ▽ More Medical image segmentation can provide detailed information for clinical analysis which can be useful for scenarios where the detailed location of a finding is important. Knowing the location of disease can play a vital role in treatment and decision-making. Convolutional neural network (CNN) based encoder-decoder techniques have advanced the performance of automated medical image segmentation systems. Several such CNN-based methodologies utilize techniques such as spatial- and channel-wise attention to enhance performance. Another technique that has drawn attention in recent years is residual dense blocks (RDBs). The successive convolutional layers in densely connected blocks are capable of extracting diverse features with varied receptive fields and thus, enhancing performance. However, consecutive stacked convolutional operators may not necessarily generate features that facilitate the identification of the target structures. In this paper, we propose a progressive alternating attention network (PAANet). We develop progressive alternating attention dense (PAAD) blocks, which construct a guiding attention map (GAM) after every convolutional layer in the dense blocks using features from all scales. The GAM allows the following layers in the dense blocks to focus on the spatial locations relevant to the target region. Every alternate PAAD block inverts the GAM to generate a reverse attention map which guides ensuing layers to extract boundary and edge-related information, refining the segmentation process. Our experiments on three different biomedical image segmentation datasets exhibit that our PAANet achieves favourable performance when compared to other state-of-the-art methods. △ Less

Submitted 20 November, 2021; originally announced November 2021.

arXiv:2111.10614 [pdf, other]

GMSRF-Net: An improved generalizability with global multi-scale residual fusion network for polyp segmentation

Authors: Abhishek Srivastava, Sukalpa Chanda, Debesh Jha, Umapada Pal, Sharib Ali

Abstract: Colonoscopy is a gold standard procedure but is highly operator-dependent. Efforts have been made to automate the detection and segmentation of polyps, a precancerous precursor, to effectively minimize missed rate. Widely used computer-aided polyp segmentation systems actuated by encoder-decoder have achieved high performance in terms of accuracy. However, polyp segmentation datasets collected fro… ▽ More Colonoscopy is a gold standard procedure but is highly operator-dependent. Efforts have been made to automate the detection and segmentation of polyps, a precancerous precursor, to effectively minimize missed rate. Widely used computer-aided polyp segmentation systems actuated by encoder-decoder have achieved high performance in terms of accuracy. However, polyp segmentation datasets collected from varied centers can follow different imaging protocols leading to difference in data distribution. As a result, most methods suffer from performance drop and require re-training for each specific dataset. We address this generalizability issue by proposing a global multi-scale residual fusion network (GMSRF-Net). Our proposed network maintains high-resolution representations while performing multi-scale fusion operations for all resolution scales. To further leverage scale information, we design cross multi-scale attention (CMSA) and multi-scale feature selection (MSFS) modules within the GMSRF-Net. The repeated fusion operations gated by CMSA and MSFS demonstrate improved generalizability of the network. Experiments conducted on two different polyp segmentation datasets show that our proposed GMSRF-Net outperforms the previous top-performing state-of-the-art method by 8.34% and 10.31% on unseen CVC-ClinicDB and unseen Kvasir-SEG, in terms of dice coefficient. △ Less

Submitted 20 November, 2021; originally announced November 2021.

arXiv:2111.10605 [pdf, other]

Exploiting Multi-Scale Fusion, Spatial Attention and Patch Interaction Techniques for Text-Independent Writer Identification

Authors: Abhishek Srivastava, Sukalpa Chanda, Umapada Pal

Abstract: Text independent writer identification is a challenging problem that differentiates between different handwriting styles to decide the author of the handwritten text. Earlier writer identification relied on handcrafted features to reveal pieces of differences between writers. Recent work with the advent of convolutional neural network, deep learning-based methods have evolved. In this paper, three… ▽ More Text independent writer identification is a challenging problem that differentiates between different handwriting styles to decide the author of the handwritten text. Earlier writer identification relied on handcrafted features to reveal pieces of differences between writers. Recent work with the advent of convolutional neural network, deep learning-based methods have evolved. In this paper, three different deep learning techniques - spatial attention mechanism, multi-scale feature fusion and patch-based CNN were proposed to effectively capture the difference between each writer's handwriting. Our methods are based on the hypothesis that handwritten text images have specific spatial regions which are more unique to a writer's style, multi-scale features propagate characteristic features with respect to individual writers and patch-based features give more general and robust representations that helps to discriminate handwriting from different writers. The proposed methods outperforms various state-of-the-art methodologies on word-level and page-level writer identification methods on three publicly available datasets - CVL, Firemaker, CERUG-EN datasets and give comparable performance on the IAM dataset. △ Less

Submitted 20 November, 2021; originally announced November 2021.

Comments: 14 pages, 4 figures

arXiv:2111.10591 [pdf, other]

AGA-GAN: Attribute Guided Attention Generative Adversarial Network with U-Net for Face Hallucination

Authors: Abhishek Srivastava, Sukalpa Chanda, Umapada Pal

Abstract: The performance of facial super-resolution methods relies on their ability to recover facial structures and salient features effectively. Even though the convolutional neural network and generative adversarial network-based methods deliver impressive performances on face hallucination tasks, the ability to use attributes associated with the low-resolution images to improve performance is unsatisfa… ▽ More The performance of facial super-resolution methods relies on their ability to recover facial structures and salient features effectively. Even though the convolutional neural network and generative adversarial network-based methods deliver impressive performances on face hallucination tasks, the ability to use attributes associated with the low-resolution images to improve performance is unsatisfactory. In this paper, we propose an Attribute Guided Attention Generative Adversarial Network which employs novel attribute guided attention (AGA) modules to identify and focus the generation process on various facial features in the image. Stacking multiple AGA modules enables the recovery of both high and low-level facial structures. We design the discriminator to learn discriminative features exploiting the relationship between the high-resolution image and their corresponding facial attribute annotations. We then explore the use of U-Net based architecture to refine existing predictions and synthesize further facial details. Extensive experiments across several metrics show that our AGA-GAN and AGA-GAN+U-Net framework outperforms several other cutting-edge face hallucination state-of-the-art methods. We also demonstrate the viability of our method when every attribute descriptor is not known and thus, establishing its application in real-world scenarios. △ Less

Submitted 20 November, 2021; originally announced November 2021.

Comments: 27 pages, 9 Figures

arXiv:2108.09335 [pdf, other]

LoOp: Looking for Optimal Hard Negative Embeddings for Deep Metric Learning

Authors: Bhavya Vasudeva, Puneesh Deora, Saumik Bhattacharya, Umapada Pal, Sukalpa Chanda

Abstract: Deep metric learning has been effectively used to learn distance metrics for different visual tasks like image retrieval, clustering, etc. In order to aid the training process, existing methods either use a hard mining strategy to extract the most informative samples or seek to generate hard synthetics using an additional network. Such approaches face different challenges and can lead to biased em… ▽ More Deep metric learning has been effectively used to learn distance metrics for different visual tasks like image retrieval, clustering, etc. In order to aid the training process, existing methods either use a hard mining strategy to extract the most informative samples or seek to generate hard synthetics using an additional network. Such approaches face different challenges and can lead to biased embeddings in the former case, and (i) harder optimization (ii) slower training speed (iii) higher model complexity in the latter case. In order to overcome these challenges, we propose a novel approach that looks for optimal hard negatives (LoOp) in the embedding space, taking full advantage of each tuple by calculating the minimum distance between a pair of positives and a pair of negatives. Unlike mining-based methods, our approach considers the entire space between pairs of embeddings to calculate the optimal hard negatives. Extensive experiments combining our approach and representative metric learning losses reveal a significant boost in performance on three benchmark datasets. △ Less

Submitted 20 August, 2021; originally announced August 2021.

Comments: 17 pages, 9 figures, 5 tables. Accepted at The IEEE/CVF International Conference on Computer Vision (ICCV) 2021

arXiv:2107.10756

Semantic Text-to-Face GAN -ST^2FG

Authors: Manan Oza, Sukalpa Chanda, David Doermann

Abstract: Faces generated using generative adversarial networks (GANs) have reached unprecedented realism. These faces, also known as "Deep Fakes", appear as realistic photographs with very little pixel-level distortions. While some work has enabled the training of models that lead to the generation of specific properties of the subject, generating a facial image based on a natural language description has… ▽ More Faces generated using generative adversarial networks (GANs) have reached unprecedented realism. These faces, also known as "Deep Fakes", appear as realistic photographs with very little pixel-level distortions. While some work has enabled the training of models that lead to the generation of specific properties of the subject, generating a facial image based on a natural language description has not been fully explored. For security and criminal identification, the ability to provide a GAN-based system that works like a sketch artist would be incredibly useful. In this paper, we present a novel approach to generate facial images from semantic text descriptions. The learned model is provided with a text description and an outline of the type of face, which the model uses to sketch the features. Our models are trained using an Affine Combination Module (ACM) mechanism to combine the text embedding from BERT and the GAN latent space using a self-attention matrix. This avoids the loss of features due to inadequate "attention", which may happen if text embedding and latent vector are simply concatenated. Our approach is capable of generating images that are very accurately aligned to the exhaustive textual descriptions of faces with many fine detail features of the face and helps in generating better images. The proposed method is also capable of making incremental changes to a previously generated image if it is provided with additional textual descriptions or sentences. △ Less

Submitted 13 December, 2023; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Experiments needs to be redone

arXiv:2105.15093 [pdf, other]

Pho(SC)-CTC -- A Hybrid Approach Towards Zero-shot Word Image Recognition

Authors: Ravi Bhatt, Anuj Rai, Narayanan C. Krishnan, Sukalpa Chanda

Abstract: Annotating words in a historical document image archive for word image recognition purpose demands time and skilled human resource (like historians, paleographers). In a real-life scenario, obtaining sample images for all possible words is also not feasible. However, Zero-shot learning methods could aptly be used to recognize unseen/out-of-lexicon words in such historical document images. Based on… ▽ More Annotating words in a historical document image archive for word image recognition purpose demands time and skilled human resource (like historians, paleographers). In a real-life scenario, obtaining sample images for all possible words is also not feasible. However, Zero-shot learning methods could aptly be used to recognize unseen/out-of-lexicon words in such historical document images. Based on previous state-of-the-art method for zero-shot word recognition Pho(SC)Net, we propose a hybrid model based on the CTC framework (Pho(SC)-CTC) that takes advantage of the rich features learned by Pho(SC)Net followed by a connectionist temporal classification (CTC) framework to perform the final classification. Encouraging results were obtained on two publicly available historical document datasets and one synthetic handwritten dataset, which justifies the efficacy of Pho(SC)-CTC and Pho(SC)Net. △ Less

Submitted 21 December, 2022; v1 submitted 31 May, 2021; originally announced May 2021.

Comments: Accepted (International Journal on Document Analysis and Recognition). This paper is the extension of the paper titled "Pho(SC)Net: An Approach Towards Zero-shot Word Image Recognition in Historical Documents" published in ICDAR 2021

arXiv:2105.09909 [pdf, other]

PLSM: A Parallelized Liquid State Machine for Unintentional Action Detection

Authors: Dipayan Das, Saumik Bhattacharya, Umapada Pal, Sukalpa Chanda

Abstract: Reservoir Computing (RC) offers a viable option to deploy AI algorithms on low-end embedded system platforms. Liquid State Machine (LSM) is a bio-inspired RC model that mimics the cortical microcircuits and uses spiking neural networks (SNN) that can be directly realized on neuromorphic hardware. In this paper, we present a novel Parallelized LSM (PLSM) architecture that incorporates spatio-tempor… ▽ More Reservoir Computing (RC) offers a viable option to deploy AI algorithms on low-end embedded system platforms. Liquid State Machine (LSM) is a bio-inspired RC model that mimics the cortical microcircuits and uses spiking neural networks (SNN) that can be directly realized on neuromorphic hardware. In this paper, we present a novel Parallelized LSM (PLSM) architecture that incorporates spatio-temporal read-out layer and semantic constraints on model output. To the best of our knowledge, such a formulation has been done for the first time in literature, and it offers a computationally lighter alternative to traditional deep-learning models. Additionally, we also present a comprehensive algorithm for the implementation of parallelizable SNNs and LSMs that are GPU-compatible. We implement the PLSM model to classify unintentional/accidental video clips, using the Oops dataset. From the experimental results on detecting unintentional action in video, it can be observed that our proposed model outperforms a self-supervised model and a fully supervised traditional deep learning model. All the implemented codes can be found at our repository https://github.com/anonymoussentience2020/Parallelized_LSM_for_Unintentional_Action_Recognition. △ Less

Submitted 6 May, 2021; originally announced May 2021.

arXiv:2105.07451 [pdf, other]

MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation

Authors: Abhishek Srivastava, Debesh Jha, Sukalpa Chanda, Umapada Pal, Håvard D. Johansen, Dag Johansen, Michael A. Riegler, Sharib Ali, Pål Halvorsen

Abstract: Methods based on convolutional neural networks have improved the performance of biomedical image segmentation. However, most of these methods cannot efficiently segment objects of variable sizes and train on small and biased datasets, which are common for biomedical use cases. While methods exist that incorporate multi-scale fusion approaches to address the challenges arising with variable sizes,… ▽ More Methods based on convolutional neural networks have improved the performance of biomedical image segmentation. However, most of these methods cannot efficiently segment objects of variable sizes and train on small and biased datasets, which are common for biomedical use cases. While methods exist that incorporate multi-scale fusion approaches to address the challenges arising with variable sizes, they usually use complex models that are more suitable for general semantic segmentation problems. In this paper, we propose a novel architecture called Multi-Scale Residual Fusion Network (MSRF-Net), which is specially designed for medical image segmentation. The proposed MSRF-Net is able to exchange multi-scale features of varying receptive fields using a Dual-Scale Dense Fusion (DSDF) block. Our DSDF block can exchange information rigorously across two different resolution scales, and our MSRF sub-network uses multiple DSDF blocks in sequence to perform multi-scale fusion. This allows the preservation of resolution, improved information flow and propagation of both high- and low-level features to obtain accurate segmentation maps. The proposed MSRF-Net allows to capture object variabilities and provides improved results on different biomedical datasets. Extensive experiments on MSRF-Net demonstrate that the proposed method outperforms the cutting-edge medical image segmentation methods on four publicly available datasets. We achieve the dice coefficient of 0.9217, 0.9420, and 0.9224, 0.8824 on Kvasir-SEG, CVC-ClinicDB, 2018 Data Science Bowl dataset, and ISIC-2018 skin lesion segmentation challenge dataset respectively. We further conducted generalizability tests and achieved a dice coefficient of 0.7921 and 0.7575 on CVC-ClinicDB and Kvasir-SEG, respectively. △ Less

Submitted 30 January, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

Journal ref: IEEE Journal of Biomedical and Health Informatics, 2022

arXiv:2105.05170 [pdf]

Mandating Code Disclosure is Unnecessary -- Strict Model Verification Does Not Require Accessing Original Computer Code

Authors: Sasanka Sekhar Chanda

Abstract: Mandating public availability of computer code underlying computational simulation modeling research ends up doing a disservice to the cause of model verification when inconsistencies between the specifications in the publication text and specifications in the computer code go unchallenged. Conversely, a model is verified when an independent researcher undertakes the set of mental processing tasks… ▽ More Mandating public availability of computer code underlying computational simulation modeling research ends up doing a disservice to the cause of model verification when inconsistencies between the specifications in the publication text and specifications in the computer code go unchallenged. Conversely, a model is verified when an independent researcher undertakes the set of mental processing tasks necessary to convert natural language specifications in a publication text into computer code instructions that produce numerical or graphical outputs identical to the outputs found in the original publication. The effort towards obtaining convergence with the numerical or graphical outputs directs intensive consideration of the publication text. The original computer code has little role to play in determining the verification status - verified/ failed verification. An insight is obtained that skillful deployment of human intelligence is feasible when effort-directing feedback processes are in place to appropriately go around the human frailty of giving up in the absence of actionable feedback. This principle can be put to use to develop better organizational configurations in business, government and society. △ Less

Submitted 11 May, 2021; originally announced May 2021.

Comments: 12 pages, 2 Figures

arXiv:2105.04311 [pdf]

Overcoming Complexity Catastrophe: An Algorithm for Beneficial Far-Reaching Adaptation under High Complexity

Authors: Sasanka Sekhar Chanda, Sai Yayavaram

Abstract: In his seminal work with NK algorithms, Kauffman noted that fitness outcomes from algorithms navigating an NK landscape show a sharp decline at high complexity arising from pervasive interdependence among problem dimensions. This phenomenon - where complexity effects dominate (Darwinian) adaptation efforts - is called complexity catastrophe. We present an algorithm - incremental change taking turn… ▽ More In his seminal work with NK algorithms, Kauffman noted that fitness outcomes from algorithms navigating an NK landscape show a sharp decline at high complexity arising from pervasive interdependence among problem dimensions. This phenomenon - where complexity effects dominate (Darwinian) adaptation efforts - is called complexity catastrophe. We present an algorithm - incremental change taking turns (ICTT) - that finds distant configurations having fitness superior to that reported in extant research, under high complexity. Thus, complexity catastrophe is not inevitable: a series of incremental changes can lead to excellent outcomes. △ Less

Submitted 10 May, 2021; originally announced May 2021.

Comments: 10 pages, 5 Figures

arXiv:2104.12620 [pdf]

An Algorithm to Effect Prompt Termination of Myopic Local Search on Kauffman-s NK Landscape

Authors: Sasanka Sekhar Chanda

Abstract: In Kauffman-s NK model, myopic local search involves flipping one randomly-chosen bit of an N-bit decision string in every time step and accepting the new configuration if that has higher fitness. One issue is that, this algorithm consumes the full extent of computational resources allocated - given by the number of alternative configurations inspected - even though search is expected to terminate… ▽ More In Kauffman-s NK model, myopic local search involves flipping one randomly-chosen bit of an N-bit decision string in every time step and accepting the new configuration if that has higher fitness. One issue is that, this algorithm consumes the full extent of computational resources allocated - given by the number of alternative configurations inspected - even though search is expected to terminate the moment there are no neighbors having higher fitness. Otherwise, the algorithm must compute the fitness of all N neighbors in every time step, consuming a high amount of resources. In order to get around this problem, I describe an algorithm that allows search to logically terminate relatively early, without having to evaluate fitness of all N neighbors at every time step. I further suggest that when the efficacy of two algorithms need to be compared head to head, imposing a common limit on the number of alternatives evaluated - metering - provides the necessary level field. △ Less

Submitted 11 May, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

Comments: 13 Pages, 10 Figures, 1 Table

arXiv:2101.06770 [pdf, other]

Improving Apparel Detection with Category Grouping and Multi-grained Branches

Authors: Qing Tian, Sampath Chanda, K C Amit Kumar, Douglas Gray

Abstract: Training an accurate object detector is expensive and time-consuming. One main reason lies in the laborious labeling process, i.e., annotating category and bounding box information for all instances in every image. In this paper, we examine ways to improve performance of deep object detectors without extra labeling. We first explore to group existing categories of high visual and semantic similari… ▽ More Training an accurate object detector is expensive and time-consuming. One main reason lies in the laborious labeling process, i.e., annotating category and bounding box information for all instances in every image. In this paper, we examine ways to improve performance of deep object detectors without extra labeling. We first explore to group existing categories of high visual and semantic similarities together as one super category (or, a superclass). Then, we study how this knowledge of hierarchical categories can be exploited to better detect object using multi-grained RCNN top branches. Experimental results on DeepFashion2 and OpenImagesV4-Clothing reveal that the proposed detection heads with multi-grained branches can boost the overall performance by 2.3 mAP for DeepFashion2 and 2.5 mAP for OpenImagesV4-Clothing with no additional time-consuming annotations. More importantly, classes that have fewer training samples tend to benefit more from the proposed multi-grained heads with superclass grouping. In particular, we improve the mAP for last 30% categories (in terms of training sample number) by 2.6 and 4.6 for DeepFashion2 and OpenImagesV4-Clothing, respectively. △ Less

Submitted 17 January, 2021; originally announced January 2021.

arXiv:2008.04073 [pdf]

AI Failures: A Review of Underlying Issues

Authors: Debarag Narayan Banerjee, Sasanka Sekhar Chanda

Abstract: Instances of Artificial Intelligence (AI) systems failing to deliver consistent, satisfactory performance are legion. We investigate why AI failures occur. We address only a narrow subset of the broader field of AI Safety. We focus on AI failures on account of flaws in conceptualization, design and deployment. Other AI Safety issues like trade-offs between privacy and security or convenience, bad… ▽ More Instances of Artificial Intelligence (AI) systems failing to deliver consistent, satisfactory performance are legion. We investigate why AI failures occur. We address only a narrow subset of the broader field of AI Safety. We focus on AI failures on account of flaws in conceptualization, design and deployment. Other AI Safety issues like trade-offs between privacy and security or convenience, bad actors hacking into AI systems to create mayhem or bad actors deploying AI for purposes harmful to humanity and are out of scope of our discussion. We find that AI systems fail on account of omission and commission errors in the design of the AI system, as well as upon failure to develop an appropriate interpretation of input information. Moreover, even when there is no significant flaw in the AI software, an AI system may fail because the hardware is incapable of robust performance across environments. Finally an AI system is quite likely to fail in situations where, in effect, it is called upon to deliver moral judgments -- a capability AI does not possess. We observe certain trade-offs in measures to mitigate a subset of AI failures and provide some recommendations. △ Less

Submitted 18 July, 2020; originally announced August 2020.

Comments: 8 pages

arXiv:2006.08333 [pdf]

An Algorithm to find Superior Fitness on NK Landscapes under High Complexity: Muddling Through

Authors: Sasanka Sekhar Chanda, Sai Yayavaram

Abstract: Under high complexity - given by pervasive interdependence between constituent elements of a decision in an NK landscape - our algorithm obtains fitness superior to that reported in extant research. We distribute the decision elements comprising a decision into clusters. When a change in value of a decision element is considered, a forward move is made if the aggregate fitness of the cluster membe… ▽ More Under high complexity - given by pervasive interdependence between constituent elements of a decision in an NK landscape - our algorithm obtains fitness superior to that reported in extant research. We distribute the decision elements comprising a decision into clusters. When a change in value of a decision element is considered, a forward move is made if the aggregate fitness of the cluster members residing alongside the decision element is higher. The decision configuration with the highest fitness in the path is selected. Increasing the number of clusters obtains even higher fitness. Further, implementing moves comprising of up to two changes in a cluster also obtains higher fitness. Our algorithm obtains superior outcomes by enabling more extensive search, allowing inspection of more distant configurations. We name this algorithm the muddling through algorithm, in memory of Charles Lindblom who spotted the efficacy of the process long before sophisticated computer simulations came into being. △ Less

Submitted 7 September, 2020; v1 submitted 6 June, 2020; originally announced June 2020.

Comments: 6 pages and 5 figures

MSC Class: Keywords. algorithm; complexity; fitness; interdependence; muddling through; NK model; policy making; public administration

arXiv:1907.02244 [pdf, other]

Searching for Apparel Products from Images in the Wild

Authors: Son Tran, Ming Du, Sampath Chanda, R. Manmatha, Cj Taylor

Abstract: In this age of social media, people often look at what others are wearing. In particular, Instagram and Twitter influencers often provide images of themselves wearing different outfits and their followers are often inspired to buy similar clothes.We propose a system to automatically find the closest visually similar clothes in the online Catalog (street-to-shop searching). The problem is challengi… ▽ More In this age of social media, people often look at what others are wearing. In particular, Instagram and Twitter influencers often provide images of themselves wearing different outfits and their followers are often inspired to buy similar clothes.We propose a system to automatically find the closest visually similar clothes in the online Catalog (street-to-shop searching). The problem is challenging since the original images are taken under different pose and lighting conditions. The system initially localizes high-level descriptive regions (top, bottom, wristwear. . . ) using multiple CNN detectors such as YOLO and SSD that are trained specifically for apparel domain. It then classifies these regions into more specific regions such as t-shirts, tunic or dresses. Finally, a feature embedding learned using a multi-task function is recovered for every item and then compared with corresponding items in the online Catalog database and ranked according to distance. We validate our approach component-wise using benchmark datasets and end-to-end using human evaluation. △ Less

Submitted 7 April, 2022; v1 submitted 4 July, 2019; originally announced July 2019.

Comments: KDD2019, AI for Fashion Workshop

Showing 1–24 of 24 results for author: Chanda, S