-
Improving AI-generated music with user-guided training
Authors:
Vishwa Mohan Singh,
Sai Anirudh Aryasomayajula,
Ahan Chatterjee,
Beste Aydemir,
Rifat Mehreen Amin
Abstract:
AI music generation has advanced rapidly, with models like diffusion and autoregressive algorithms enabling high-fidelity outputs. These tools can alter styles, mix instruments, or isolate them. Since sound can be visualized as spectrograms, image-generation algorithms can be applied to generate novel music. However, these algorithms are typically trained on fixed datasets, which makes it challenging for them to interpret and respond to user input accurately. This is especially problematic because music is highly subjective and requires a level of personalization that image generation does not provide. In this work, we propose a human-computation approach to gradually improve the performance of these algorithms based on user interactions. The human-computation element involves aggregating and selecting user ratings to use as the loss function for fine-tuning the model. We employ a genetic algorithm that incorporates user feedback to enhance the baseline performance of a model initially trained on a fixed dataset. The effectiveness of this approach is measured by the average increase in user ratings with each iteration. In the pilot test, the first iteration showed an average rating increase of 0.2 compared to the baseline. The second iteration further improved upon this, achieving an additional increase of 0.39 over the first iteration.
Submitted 5 June, 2025;
originally announced June 2025.
-
Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and Assessment
Authors:
Parismita Gogoi,
Vishwanath Pratap Singh,
Seema Khadirnaikar,
Soma Siddhartha,
Sishir Kalita,
Jagabandhu Mishra,
Md Sahidullah,
Priyankoo Sarmah,
S. R. M. Prasanna
Abstract:
This study explores the potential of Rhythm Formant Analysis (RFA) to capture long-term temporal modulations in dementia speech. Specifically, we introduce RFA-derived rhythm spectrograms as novel features for dementia classification and regression tasks. We propose two methodologies: (1) handcrafted features derived from rhythm spectrograms, and (2) a data-driven fusion approach that integrates the proposed RFA-derived rhythm spectrograms with a vision transformer (ViT) for acoustic representations, along with BERT-based linguistic embeddings. We compare these with existing features. Notably, our handcrafted features outperform eGeMAPs with a relative improvement of $14.2\%$ in classification accuracy and comparable performance in the regression task. The fusion approach also shows improvement, with RFA spectrograms surpassing Mel spectrograms in classification by a relative improvement of around $13.1\%$ and achieving a regression score comparable to the baselines.
Submitted 14 June, 2025; v1 submitted 1 June, 2025;
originally announced June 2025.
-
Causal Structure Discovery for Error Diagnostics of Children's ASR
Authors:
Vishwanath Pratap Singh,
Md. Sahidullah,
Tomi Kinnunen
Abstract:
Children's automatic speech recognition (ASR) often underperforms compared to that of adults due to a confluence of interdependent factors: physiological (e.g., smaller vocal tracts), cognitive (e.g., underdeveloped pronunciation), and extrinsic (e.g., vocabulary limitations, background noise). Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies, such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce causal structure discovery to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. Then, we employ causal quantification to measure each factor's impact on children's ASR. We extend the analysis to fine-tuned models to identify which factors are mitigated by fine-tuning and which remain largely unaffected. Experiments on Whisper and Wav2Vec2.0 demonstrate the generalizability of our findings across different ASR systems.
Submitted 31 May, 2025;
originally announced June 2025.
-
Convex Approximations of Random Constrained Markov Decision Processes
Authors:
V Varagapriya,
Vikas Vikram Singh,
Abdel Lisser
Abstract:
Constrained Markov decision processes (CMDPs) are used as a decision-making framework to study the long-run performance of a stochastic system. It is well-known that a stationary optimal policy of a CMDP problem under discounted cost criterion can be obtained by solving a linear programming problem when running costs and transition probabilities are exactly known. In this paper, we consider a discounted cost CMDP problem where the running costs and transition probabilities are defined using random variables. Consequently, both the objective function and constraints become random. We use chance constraints to model these uncertainties and formulate the uncertain CMDP problem as a joint chance-constrained Markov decision process (JCCMDP). Under random running costs, we assume that the dependency among random constraint vectors is driven by a Gumbel-Hougaard copula. Using standard probability inequalities, we construct convex upper bound approximations of the JCCMDP problem under certain conditions on random running costs. In addition, we propose a linear programming problem whose optimal value gives a lower bound to the optimal value of the JCCMDP problem. When both running costs and transition probabilities are random, we define the latter variables as a sum of their means and random perturbations. Under mild conditions on the random perturbations and random running costs, we construct convex upper and lower bound approximations of the JCCMDP problem. We analyse the quality of the derived bounds through numerical experiments on a queueing control problem for random running costs. For the case when both running costs and transition probabilities are random, we choose randomly generated Markov decision problems called Garnets for numerical experiments.
Submitted 30 May, 2025;
originally announced May 2025.
-
Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence
Authors:
Edem Ahadzi,
Vishwanath Pratap Singh,
Tomi Kinnunen,
Ville Hautamaki
Abstract:
In this work, we present the first study addressing automatic speech recognition (ASR) for children in an online learning setting. This is particularly important for both child-centric applications and the privacy protection of minors, where training models with sequentially arriving data is critical. The conventional approach of model fine-tuning often suffers from catastrophic forgetting. To tackle this issue, we explore two established techniques: elastic weight consolidation (EWC) and synaptic intelligence (SI). Using a custom protocol on the MyST corpus, tailored to the online learning setting, we achieve relative word error rate (WER) reductions of 5.21% with EWC and 4.36% with SI, compared to the fine-tuning baseline.
Submitted 26 May, 2025;
originally announced May 2025.
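The relative WER reductions quoted in the abstract above follow the usual definition: the drop in WER divided by the baseline WER. A minimal sketch of that arithmetic, assuming WERs expressed in percent; the function name and the example values are hypothetical and are not taken from the paper:

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer against baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), chosen only to illustrate the formula.
print(round(relative_wer_reduction(20.0, 19.0), 2))  # 5.0
```

A relative reduction is not the same as an absolute one: in the hypothetical case above the absolute drop is 1.0 WER point, while the relative reduction is 5%.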
-
STOPA: A Database of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution
Authors:
Anton Firc,
Manasi Chibber,
Jagabandhu Mishra,
Vishwanath Pratap Singh,
Tomi Kinnunen,
Kamil Malinka
Abstract:
A key research area in deepfake speech detection is source tracing - determining the origin of synthesised utterances. The approaches may involve identifying the acoustic model (AM), vocoder model (VM), or other generation-specific parameters. However, progress is limited by the lack of a dedicated, systematically curated dataset. To address this, we introduce STOPA, a systematically varied and metadata-rich dataset for deepfake speech source tracing, covering 8 AMs, 6 VMs, and diverse parameter settings across 700k samples from 13 distinct synthesisers. Unlike existing datasets, which often feature limited variation or sparse metadata, STOPA provides a systematically controlled framework covering a broader range of generative factors, such as the choice of the vocoder model, acoustic model, or pretrained weights, ensuring higher attribution reliability. This control improves attribution accuracy, aiding forensic analysis, deepfake detection, and generative model transparency.
Submitted 5 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Artificial Intelligence implementation of onboard flexible payload and adaptive beamforming using commercial off-the-shelf devices
Authors:
Luis Manuel Garcés-Socarrás,
Amirhosein Nik,
Flor Ortiz,
Juan A. Vásquez-Peralvo,
Jorge Luis González Rios,
Mouhamad Chehailty,
Marcele Kuhfuss,
Eva Lagunas,
Jan Thoemel,
Sumit Kumar,
Vishal Singh,
Juan Carlos Merlano Duncan,
Sahar Malmir,
Swetha Varadajulu,
Jorge Querol,
Symeon Chatzinotas
Abstract:
Very High Throughput satellites typically provide multibeam coverage; however, a common problem is that there can be a mismatch between the capacity of each beam and the traffic demand: some beams may fall short, while others exceed the requirements. This challenge can be addressed by integrating machine learning with flexible payload and adaptive beamforming techniques. These methods allow for dynamic allocation of payload resources based on real-time capacity needs. As artificial intelligence advances, its ability to automate tasks, enhance efficiency, and increase precision is proving invaluable, especially in satellite communications, where traditional optimization methods are often computationally intensive. AI-driven solutions offer faster, more effective ways to handle complex satellite communication tasks. Artificial intelligence in space has more constraints than in other fields, considering the radiation effects, the spaceship power capabilities, mass, and area. Current onboard processing uses legacy space-certified general-purpose processors, costly application-specific integrated circuits, or field-programmable gate arrays subjected to a highly stringent certification process. The increased performance demands of onboard processors to satisfy the accelerated data rates and autonomy requirements have rendered current space-graded processors obsolete. This work is focused on transforming the satellite payload using artificial intelligence and machine learning methodologies over available commercial off-the-shelf chips for onboard processing. The objectives include validating artificial intelligence-driven scenarios, focusing on flexible payload and adaptive beamforming as machine learning models onboard. Results show that machine learning models significantly improve signal quality, spectral efficiency, and throughput compared to a conventional payload.
Submitted 3 May, 2025;
originally announced May 2025.
-
Stability of Polling Systems for a Large Class of Markovian Switching Policies
Authors:
Konstantin Avrachenkov,
Kousik Das,
Veeraruna Kavitha,
Vartika Singh
Abstract:
We consider a polling system with two queues, where a single server attends the queues in a cyclic order and requires non-zero switching times to switch between the queues. Our aim is to identify a fairly general and comprehensive class of Markovian switching policies that renders the system stable. Ideally, this class should cover the Pareto frontier related to individual-queue-centric performance measures such as the stationary expected number of waiting customers in each queue; for instance, such a class of policies was identified recently for a polling system near the fluid regime (with large arrival and departure rates), and we aim to include that class. We also aim to include a second class that facilitates switching between the queues at the instant the occupancy in the opposite queue crosses a threshold and that in the visiting queue is below a threshold (this inclusion facilitates design of `robust' polling systems). Towards this, we consider a class of two-phase switching policies, which includes the above-mentioned classes. In full generality, our policies can be represented by eight parameters, while two parameters are sufficient to represent the aforementioned classes. We provide simple conditions to identify the sub-class of switching policies that ensure system stability. By numerically tuning the parameters of the proposed class, we illustrate that the proposed class can cover the Pareto frontier for the stationary expected number of customers in the two queues.
Submitted 17 April, 2025;
originally announced April 2025.
-
ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech
Authors:
Xin Wang,
Héctor Delgado,
Hemlata Tak,
Jee-weon Jung,
Hye-jin Shim,
Massimiliano Todisco,
Ivan Kukanov,
Xuechen Liu,
Md Sahidullah,
Tomi Kinnunen,
Nicholas Evans,
Kong Aik Lee,
Junichi Yamagishi,
Myeonghun Jeong,
Ge Zhu,
Yongyi Zang,
You Zhang,
Soumi Maiti,
Florian Lux,
Nicolas Müller,
Wangyou Zhang,
Chengzhe Sun,
Shuwei Hou,
Siwei Lyu,
Sébastien Le Maguer
, et al. (4 additional authors not shown)
Abstract:
ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database which is generated in a crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ~2,000 speakers (cf. ~100 earlier). The database contains attacks generated with 32 different algorithms, also crowdsourced, and optimised to varying degrees using new surrogate detection models. Among them are attacks generated with a mix of legacy and contemporary text-to-speech synthesis and voice conversion models, in addition to adversarial attacks which are incorporated for the first time. ASVspoof 5 protocols comprise seven speaker-disjoint partitions. They include two distinct partitions for the training of different sets of attack models, two more for the development and evaluation of surrogate detection models, and then three additional partitions which comprise the ASVspoof 5 training, development and evaluation sets. An auxiliary set of data collected from an additional 30k speakers can also be used to train speaker encoders for the implementation of attack algorithms. Also described herein is an experimental validation of the new ASVspoof 5 database using a set of automatic speaker verification and spoof/deepfake baseline detectors. With the exception of protocols and tools for the generation of spoofed/deepfake speech, the resources described in this paper, already used by participants of the ASVspoof 5 challenge in 2024, are now all freely available to the community.
Submitted 24 April, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
-
Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors
Authors:
Vishwanath Pratap Singh,
Md. Sahidullah,
Tomi Kinnunen
Abstract:
The increasing use of children's automatic speech recognition (ASR) systems has spurred research efforts to improve the accuracy of models designed for children's speech in recent years. The current approach either uses open-source speech foundation models (SFMs) directly or fine-tunes them with children's speech data. These SFMs, whether open-source or fine-tuned for children, often exhibit higher word error rates (WERs) on children's speech than on adult speech. However, there is a lack of systematic analysis of the cause of this degraded performance of SFMs. Understanding and addressing the reasons behind this performance disparity is crucial for improving the accuracy of SFMs for children's speech. Our study addresses this gap by investigating the causes of accuracy degradation and the primary contributors to WER in children's speech. In the first part of the study, we conduct a comprehensive benchmarking study on two self-supervised SFMs (Wav2Vec2.0 and HuBERT) and two weakly supervised SFMs (Whisper and MMS) across various age groups on two children's speech corpora, establishing the raw data for the causal inference analysis in the second part. In the second part of the study, we analyze the impact of physiological factors (age, gender), cognitive factors (pronunciation ability), and external factors (vocabulary difficulty, background noise, and word count) on SFM accuracy in children's speech using causal inference. The results indicate that physiology (age) and a particular external factor (number of words in audio) have the highest impact on accuracy, followed by background noise and pronunciation ability. Fine-tuning SFMs on children's speech reduces sensitivity to physiological and cognitive factors, while sensitivity to the number of words in audio persists.
Keywords: Children's ASR, Speech Foundational Models, Causal Inference, Physiology, Cognition, Pronunciation
Submitted 12 February, 2025;
originally announced February 2025.
-
Real-Time Brain Tumor Detection in Intraoperative Ultrasound Using YOLO11: From Model Training to Deployment in the Operating Room
Authors:
Santiago Cepeda,
Olga Esteban-Sinovas,
Roberto Romero,
Vikas Singh,
Prakash Shetty,
Aliasgar Moiyadi,
Ilyess Zemmoura,
Giuseppe Roberto Giammalva,
Massimiliano Del Bene,
Arianna Barbotti,
Francesco DiMeco,
Timothy R. West,
Brian V. Nahed,
Ignacio Arrese,
Roberto Hornero,
Rosario Sarabia
Abstract:
Intraoperative ultrasound (ioUS) is a valuable tool in brain tumor surgery due to its versatility, affordability, and seamless integration into the surgical workflow. However, its adoption remains limited, primarily because of the challenges associated with image interpretation and the steep learning curve required for effective use. This study aimed to enhance the interpretability of ioUS images by developing a real-time brain tumor detection system deployable in the operating room. We collected 2D ioUS images from the Brain Tumor Intraoperative Database (BraTioUS) and the public ReMIND dataset, annotated with expert-refined tumor labels. Using the YOLO11 architecture and its variants, we trained object detection models to identify brain tumors. The dataset included 1,732 images from 192 patients, divided into training, validation, and test sets. Data augmentation expanded the training set to 11,570 images. In the test dataset, YOLO11s achieved the best balance of precision and computational efficiency, with a mAP@50 of 0.95, mAP@50-95 of 0.65, and a processing speed of 34.16 frames per second. The proposed solution was prospectively validated in a cohort of 15 consecutively operated patients diagnosed with brain tumors. Neurosurgeons confirmed its seamless integration into the surgical workflow, with real-time predictions accurately delineating tumor regions. These findings highlight the potential of real-time object detection algorithms to enhance ioUS-guided brain tumor surgery, addressing key challenges in interpretation and providing a foundation for future development of computer vision-based tools for neuro-oncological surgery.
Submitted 27 January, 2025;
originally announced January 2025.
-
Revolutionizing Pharmaceutical Manufacturing: Advances and Challenges of 3D Printing System and Control
Authors:
Rahul Kumar,
Vikram Singh,
Priya Gupta
Abstract:
The advent of 3D printing has transformed the pharmaceutical industry, enabling precision drug manufacturing with controlled release profiles, dosing, and structural complexity. Additive manufacturing (AM) addresses the growing demand for personalized medicine, overcoming limitations of traditional methods. This technology facilitates tailored dosage forms, complex geometries, and real-time quality control. Recent advancements in drop-on-demand printing, UV curable inks, material science, and regulatory frameworks are discussed. Despite opportunities for cost reduction, flexibility, and decentralized manufacturing, challenges persist in scalability, reproducibility, and regulatory adaptation. This review provides an in-depth analysis of the current state of AM in pharmaceutical manufacturing, exploring recent developments, challenges, and future directions for mainstream integration.
Submitted 18 September, 2024;
originally announced September 2024.
-
Data-driven Modeling of Combined Sewer Systems for Urban Sustainability: An Empirical Evaluation
Authors:
Vipin Singh,
Tianheng Ling,
Teodor Chiaburu,
Felix Biessmann
Abstract:
Climate change poses complex challenges, with extreme weather events becoming increasingly frequent and difficult to model. Examples include the dynamics of Combined Sewer Systems (CSS). Overburdened CSS during heavy rainfall will overflow untreated wastewater into surface water bodies. Classical approaches to modeling the impact of extreme rainfall events rely on physical simulations, which are particularly challenging to create for large urban infrastructures. Deep Learning (DL) models offer a cost-effective alternative for modeling the complex dynamics of sewer systems. In this study, we present a comprehensive empirical evaluation of several state-of-the-art DL time series models for predicting sewer system dynamics in a large urban infrastructure, utilizing three years of measurement data. We especially investigate the potential of DL models to maintain predictive precision during network outages by comparing global models, which have access to all variables within the sewer system, and local models, which are limited to data from a restricted set of local sensors. Our findings demonstrate that DL models can accurately predict the dynamics of sewer system load, even under network outage conditions. These results suggest that DL models can effectively aid in balancing the load redistribution in CSS, thereby enhancing the sustainability and resilience of urban infrastructures.
Submitted 13 February, 2025; v1 submitted 21 August, 2024;
originally announced August 2024.
-
Optimal Linear Precoding Under Realistic Satellite Communications Scenarios
Authors:
Geoffrey Eappen,
Jorge Luis Gonzalez,
Vibhum Singh,
Rakesh Palisetty,
Alireza Haqiqtnejad,
Liz Martinez Marrero,
Jevgenij Krivochiza,
Jorge Querol,
Nicola Maturo,
Juan Carlos Merlano Duncan,
Eva Lagunas,
Stefano Andrenacci,
Symeon Chatzinotas
Abstract:
In this paper, optimal linear precoding for the multibeam geostationary earth orbit (GEO) satellite with the multi-user (MU) multiple-input-multiple-output (MIMO) downlink scenario is addressed. Multiple-user interference is one of the major issues faced by satellites serving multiple users that operate on a common time-frequency resource block in the downlink channel. To mitigate this issue, the optimal linear precoders are implemented at the gateways (GWs). The precoding computation is performed by utilizing the channel state information obtained at user terminals (UTs). The optimal linear precoders are derived considering beamformer update and power control with an iterative per-antenna power optimization algorithm that requires a limited number of iterations. The efficacy of the proposed algorithm is validated using an in-lab experiment for 16X16 precoding with a multi-beam satellite, transmitting and receiving the precoded data with the digital video broadcasting satellite-second generation extension (DVB-S2X) standard for the GW and the UTs. Software-defined radio platforms are employed for emulating the GWs, UTs, and satellite links. The validation is supported by comparing the proposed optimal linear precoder with full frequency reuse (FFR) and minimum mean square error (MMSE) schemes. The experimental results demonstrate that with the optimal linear precoders it is possible to successfully cancel the inter-user interference in the simulated satellite FFR link. Thus, optimal linear precoding brings gains in terms of enhanced signal-to-noise-and-interference ratio, and increased system throughput and spectral efficiency.
Submitted 16 August, 2024;
originally announced August 2024.
-
Segmentation of Mental Foramen in Orthopantomographs: A Deep Learning Approach
Authors:
Haider Raza,
Mohsin Ali,
Vishal Krishna Singh,
Agustin Wahjuningrum,
Rachel Sarig,
Akhilanand Chaurasia
Abstract:
Precise identification and detection of the Mental Foramen are crucial in dentistry, impacting procedures such as impacted tooth removal, cyst surgeries, and implants. Accurately identifying this anatomical feature helps avoid post-surgery issues and improves patient outcomes. Moreover, this study aims to accelerate dental procedures, elevating patient care and healthcare efficiency in dentistry. This research used Deep Learning methods to accurately detect and segment the Mental Foramen from panoramic radiograph images. Two mask types, circular and square, were used during model training. Multiple segmentation models were employed to identify and segment the Mental Foramen, and their effectiveness was evaluated using diverse metrics. An in-house dataset comprising 1000 panoramic radiographs was created for this study. Our experiments demonstrated that the classical UNet model performed exceptionally well on the test data, achieving a Dice Coefficient of 0.79 and an Intersection over Union (IoU) of 0.67. Moreover, ResUNet++ and UNet Attention models showed competitive performance, with Dice scores of 0.675 and 0.676, and IoU values of 0.683 and 0.671, respectively. We also investigated transfer learning models with varied backbone architectures, finding LinkNet to produce the best outcomes. In conclusion, our research highlights the efficacy of the classical UNet model in accurately identifying and outlining the Mental Foramen in panoramic radiographs. While vital, this task is comparatively simpler than segmenting complex medical datasets such as brain tumours or skin cancer, given their diverse sizes and shapes. This research also holds value in optimizing dental practice, benefiting practitioners and patients.
Submitted 8 August, 2024;
originally announced August 2024.
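The Dice Coefficient and IoU quoted in the abstract above are both overlap ratios between a predicted mask and a ground-truth mask. The following is a minimal sketch of how the two are commonly computed for a single pair of binary masks; it is not the authors' evaluation code, and the function name `dice_and_iou` and the nested-list mask format are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    predicted = [[0, 1, 1],
                 [0, 1, 0]]
    reference = [[0, 1, 0],
                 [0, 1, 1]]
    print(dice_and_iou(predicted, reference))
```

For a single mask pair the two scores are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU on the same pair.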
-
ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR
Authors:
Vishwanath Pratap Singh,
Federico Malato,
Ville Hautamaki,
Md. Sahidullah,
Tomi Kinnunen
Abstract:
While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one of the heuristic approaches associated with balancing the right amount of augmented data in ASR training by introducing a reinforcement learning (RL) based dynamic adjustment of original-to-augmented data ratio (OAR). Unlike the fixed OAR approach in conventional data augmentation, our proposed method employs a deep Q-network (DQN) as the RL mechanism to learn the optimal dynamics of OAR throughout the wav2vec2.0 based ASR training. We conduct experiments using the LibriSpeech dataset with varying amounts of training data, specifically, the 10Min, 1H, 10H, and 100H splits to evaluate the efficacy of the proposed method under different data conditions. Our proposed method, on average, achieves a relative improvement of 4.96% over the open-source wav2vec2.0 base model on standard LibriSpeech test sets.
Submitted 14 June, 2024;
originally announced June 2024.
-
Artificial Intelligence Satellite Telecommunication Testbed using Commercial Off-The-Shelf Chipsets
Authors:
Luis M. Garcés-Socarrás,
Amirhossein Nik,
Flor Ortiz,
Juan A. Vásquez-Peralvo,
Jorge L. González-Rios,
Mouhamad Chehailty,
Marcele Kuhfuss,
Eva Lagunas,
Jan Thoemel,
Sumit Kumar,
Vishal Singh,
Juan C. Merlano Duncan,
Sahar Malmir,
Swetha Varadajulu,
Jorge Querol,
Symeon Chatzinotas
Abstract:
The Artificial Intelligence Satellite Telecommunications Testbed (AISTT), part of the ESA project SPAICE, is focused on the transformation of the satellite payload by using artificial intelligence (AI) and machine learning (ML) methodologies over available commercial off-the-shelf (COTS) AI-capable chips for onboard processing. The objectives include validating artificial intelligence-driven SATCOM scenarios such as interference detection, spectrum sharing, radio resource management, decoding, and beamforming. The study highlights hardware selection and payload architecture. Preliminary results show that ML models significantly improve signal quality, spectral efficiency, and throughput compared to conventional payload. Moreover, the testbed aims to evaluate the performance and the use of AI-capable COTS chips in onboard SATCOM contexts.
Submitted 29 November, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification
Authors:
Vishwanath Pratap Singh,
Md Sahidullah,
Tomi Kinnunen
Abstract:
The accuracy of modern automatic speaker verification (ASV) systems, when trained exclusively on adult data, drops substantially when applied to children's speech. The scarcity of children's speech corpora hinders fine-tuning ASV systems for children's speech. Hence, there is a timely need to explore more effective ways of reusing adults' speech data. One promising approach is to align vocal-tract parameters between adults and children through children-specific data augmentation, referred to here as ChildAugment. Specifically, we modify the formant frequencies and formant bandwidths of adult speech to emulate children's speech. The modified spectra are used to train an ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time-delay neural network) recognizer for children. We compare ChildAugment against various state-of-the-art data augmentation techniques for children's ASV. We also extensively compare different scoring methods, including cosine scoring, PLDA (probabilistic linear discriminant analysis), and NPLDA (neural PLDA). We also propose a low-complexity weighted cosine score for extremely low-resource children's ASV. Our findings on the CSLU kids corpus indicate that ChildAugment holds promise as a simple, acoustics-motivated approach for improving state-of-the-art deep learning based ASV for children. We achieve up to 12.45% (boys) and 11.96% (girls) relative improvement over the baseline.
Submitted 23 February, 2024;
originally announced February 2024.
-
Cardiac ultrasound simulation for autonomous ultrasound navigation
Authors:
Abdoul Aziz Amadou,
Laura Peralta,
Paul Dryburgh,
Paul Klein,
Kaloian Petkov,
Richard James Housden,
Vivek Singh,
Rui Liao,
Young-Ho Kim,
Florin Christian Ghesu,
Tommaso Mansi,
Ronak Rajani,
Alistair Young,
Kawal Rhode
Abstract:
Ultrasound is well-established as an imaging modality for diagnostic and interventional purposes. However, the image quality varies with operator skills as acquiring and interpreting ultrasound images requires extensive training due to the imaging artefacts, the range of acquisition parameters and the variability of patient anatomies. Automating the image acquisition task could improve acquisition reproducibility and quality but training such an algorithm requires large amounts of navigation data, not saved in routine examinations. Thus, we propose a method to generate large amounts of ultrasound images from other modalities and from arbitrary positions, such that this pipeline can later be used by learning algorithms for navigation. We present a novel simulation pipeline which uses segmentations from other modalities, an optimized volumetric data representation and GPU-accelerated Monte Carlo path tracing to generate view-dependent and patient-specific ultrasound images. We extensively validate the correctness of our pipeline with a phantom experiment, where structures' sizes, contrast and speckle noise properties are assessed. Furthermore, we demonstrate its usability to train neural networks for navigation in an echocardiography view classification experiment by generating synthetic images from more than 1000 patients. Networks pre-trained with our simulations achieve significantly superior performance in settings where large real datasets are not available, especially for under-represented classes. The proposed approach allows for fast and accurate patient-specific ultrasound image generation, and its usability for training networks for navigation-related tasks is demonstrated.
Submitted 9 February, 2024;
originally announced February 2024.
-
EEG-Based Reaction Time Prediction with Fuzzy Common Spatial Patterns and Phase Cohesion using Deep Autoencoder Based Data Fusion
Authors:
Vivek Singh,
Tharun Kumar Reddy
Abstract:
The drowsiness state of a driver is a topic of extensive discussion due to its significant role in causing traffic accidents. This research presents a novel approach that combines fuzzy Common Spatial Patterns (CSP)-optimised Phase Cohesive Sequence (PCS) representations and fuzzy CSP-optimised signal amplitude representations. The research aims to examine alterations in Electroencephalogram (EEG) synchronisation between a state of alertness and drowsiness, forecast drivers' reaction times by analysing EEG data, and subsequently identify the presence of drowsiness. The study's findings indicate that this approach successfully distinguishes between alert and drowsy mental states. By employing a Deep Autoencoder-based data fusion technique and a regression model such as Support Vector Regression (SVR) or Least Absolute Shrinkage and Selection Operator (LASSO), the proposed method outperforms using individual feature sets in combination with a regressor model. This superiority is measured by evaluating the Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Correlation Coefficient (CC). In other words, the fusion of autoencoder-based amplitude EEG power features and PCS features, when used in regression, outperforms using either of these features alone in a regressor model. Specifically, the proposed data fusion method achieves a 14.36% reduction in RMSE, a 25.12% reduction in MAPE, and a 10.12% increase in CC compared to the baseline model using only individual amplitude EEG power features and regression.
Submitted 1 December, 2023;
originally announced December 2023.
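The three regression metrics named in the abstract above (RMSE, MAPE, and CC) have standard textbook definitions. The sketch below shows those definitions in plain Python; it is an illustration only, and the function names `rmse`, `mape`, and `pearson_cc`, as well as the assumption that CC means the Pearson correlation coefficient, are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; assumes no true value is zero."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n


def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient (assumed meaning of CC here)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    # Made-up reaction times in seconds, purely for demonstration.
    observed = [1.0, 2.0, 4.0]
    predicted = [1.0, 2.0, 2.0]
    print(rmse(observed, predicted), mape(observed, predicted),
          pearson_cc(observed, predicted))
```

Lower RMSE and MAPE and higher CC indicate a better fit, which matches the direction of the improvements reported in the abstract.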
-
Low Complexity High Speed Deep Neural Network Augmented Wireless Channel Estimation
Authors:
Syed Asrar ul haq,
Varun Singh,
Bhanu Teja Tanaji,
Sumit Darak
Abstract:
The channel estimation (CE) in wireless receivers is one of the most critical and computationally complex signal processing operations. Recently, various works have shown that the deep learning (DL) based CE outperforms conventional minimum mean square error (MMSE) based CE, and it is hardware-friendly. However, DL-based CE has higher complexity and latency than popularly used least square (LS) based CE. In this work, we propose a novel low complexity high-speed Deep Neural Network-Augmented Least Square (LC-LSDNN) algorithm for IEEE 802.11p wireless physical layer and efficiently implement it on Zynq system on chip (ZSoC). The novelty of the LC-LSDNN is to use different DNNs for real and imaginary values of received complex symbols. This helps reduce the size of DL by 59% and optimize the critical path, allowing it to operate at 60% higher clock frequency. We also explore three different architectures for MMSE-based CE. We show that LC-LSDNN significantly outperforms MMSE and state-of-the-art DL-based CE for a wide range of signal-to-noise ratios (SNR) and different wireless channels. Also, it is computationally efficient, with around 50% lower resources than existing DL-based CE.
Submitted 14 November, 2023;
originally announced November 2023.
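The least square (LS) estimate that the abstract above uses as its starting point is, in its textbook per-subcarrier form, the received pilot symbol divided by the known transmitted pilot. The sketch below shows that step and then splits the result into real and imaginary parts, mirroring the abstract's idea of handling the two parts with separate networks. It is not the LC-LSDNN implementation; the function names and the toy channel values are assumptions, and the networks themselves are omitted.

```python
def ls_channel_estimate(received, pilots):
    """Per-subcarrier LS estimate H_k = Y_k / X_k for known non-zero pilots X_k."""
    if len(received) != len(pilots):
        raise ValueError("received and pilots must have the same length")
    return [y / x for y, x in zip(received, pilots)]


def split_real_imag(estimates):
    """Split complex estimates into two real-valued lists.

    Hypothetical pre-processing step: each list could feed its own small
    network, as the abstract describes in general terms.
    """
    return [h.real for h in estimates], [h.imag for h in estimates]


if __name__ == "__main__":
    # Toy noise-free example: made-up channel taps and pilot symbols.
    channel = [0.5 + 0.5j, 1 - 2j, -0.3j]
    pilots = [1 + 0j, -1 + 0j, 1j]
    received = [h * x for h, x in zip(channel, pilots)]

    estimates = ls_channel_estimate(received, pilots)
    real_part, imag_part = split_real_imag(estimates)
    print(real_part, imag_part)
```

Without noise the LS estimate recovers the channel exactly; with noise it carries the full noise term Y_k / X_k, which is the error a learned refinement stage would aim to reduce.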
-
Automated CT Lung Cancer Screening Workflow using 3D Camera
Authors:
Brian Teixeira,
Vivek Singh,
Birgi Tamersoy,
Andreas Prokein,
Ankur Kapoor
Abstract:
Despite recent developments in CT planning that enabled automation in patient positioning, time-consuming scout scans are still needed to compute dose profile and ensure the patient is properly positioned. In this paper, we present a novel method which eliminates the need for scout scans in CT lung cancer screening by estimating patient scan range, isocenter, and Water Equivalent Diameter (WED) from 3D camera images. We achieve this task by training an implicit generative model on over 60,000 CT scans and introduce a novel approach for updating the prediction using real-time scan data. We demonstrate the effectiveness of our method on a testing set of 110 pairs of depth data and CT scan, resulting in an average error of 5mm in estimating the isocenter, 13mm in determining the scan range, 10mm and 16mm in estimating the AP and lateral WED respectively. The relative WED error of our method is 4%, which is well within the International Electrotechnical Commission (IEC) acceptance criteria of 10%.
Submitted 27 September, 2023;
originally announced September 2023.
-
pyParaOcean: A System for Visual Analysis of Ocean Data
Authors:
Toshit Jain,
Varun Singh,
Vijay Kumar Boda,
Upkar Singh,
Ingrid Hotz,
P. N. Vinayachandran,
Vijay Natarajan
Abstract:
Visual analysis is well adopted within the field of oceanography for the analysis of model simulations, detection of different phenomena and events, and tracking of dynamic processes. With increasing data sizes and the availability of multivariate dynamic data, there is a growing need for scalable and extensible tools for visualization and interactive exploration. We describe pyParaOcean, a visualization system that supports several tasks routinely used in the visual analysis of ocean data. The system is available as a plugin to ParaView and is hence able to leverage its distributed computing capabilities and its rich set of generic analysis and visualization functionalities. pyParaOcean provides modules to support different visual analysis tasks specific to ocean data, such as eddy identification and salinity movement tracking. These modules are available as ParaView filters and this seamless integration results in a system that is easy to install and use. A case study on the Bay of Bengal illustrates the utility of the system for the study of ocean phenomena and processes.
Submitted 25 September, 2023;
originally announced September 2023.
-
Optimal Closed Loop Control of G2V/V2G Action Using Model Predictive Controller
Authors:
Satya Vikram Pratap Singh,
Siddharth Kamila,
Prashanth Agnihotri
Abstract:
This paper develops a closed-loop control algorithm to operate the G2V/V2G action, tested under varying battery voltage conditions and load and source power differences. Under V2G action, to keep total harmonic distortion at a minimum level and grid frequency within the standard limit, a model predictive controller (MPC) has been used to control the gate driver circuit of the inverter. The state space model of the plant has been created using the System Identification Toolbox, and the MPC Controller block has been designed using the Model Predictive Control Toolbox of MATLAB. The proposed methodology is tested using MATLAB/Simulink and OPAL-RT (OP4510) in a real-time environment. This methodology reduces %THD to less than 0.5%, improves waveform quality of grid voltage, inverter output voltage, grid current, and inverter output current to nearly 99%, and maintains the grid frequency within the standard limit during G2V/V2G action.
Submitted 11 October, 2023; v1 submitted 17 August, 2023;
originally announced August 2023.
-
Fault Detection and Classification using Wavelet and ANN in DFIG and TCSC Connected Transmission Line
Authors:
Satya Vikram Pratap Singh,
Tanu Prasad,
Siddharth Kamila,
Prashant Agnihotri
Abstract:
This paper presents fault detection and classification using Wavelet and ANN based methods in a DFIG-based series compensated system. The state-of-the-art methods for fault detection and classification include systems based on the Wavelet transform, the Fourier transform, and Wavelet-neuro fuzzy methods. However, the accuracy of these state-of-the-art methods diminishes during variable conditions such as changes in wind speed, high impedance faults, and changes in the series compensation level. Specifically, in Wavelet transform based methods, the threshold values need to be adapted based on the variable field conditions. To solve this problem, this paper proposes a Wavelet-ANN based fault detection method where Wavelet is used as an identifier and ANN is used as a classifier for detecting various fault cases. This methodology is also effective under SSR condition. The proposed methodology is evaluated on various fault and non-fault cases generated on an IEEE first benchmark model under varying compensation levels from 20% to 55%, impedance faults, and wind velocity from 6 m/sec to 10 m/sec using MATLAB/Simulink, an OPAL-RT (OP4510) real-time digital simulator environment, and Arduino board I/O ports communicating with an external PC on which the ANN model is loaded, using the Arduino support package of MATLAB. The preliminary results are compared with the state-of-the-art fault detection method, where the proposed method shows robust performance under varying field conditions.
Submitted 17 August, 2023;
originally announced August 2023.
-
Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech
Authors:
Vishwanath Pratap Singh,
Md Sahidullah,
Tomi Kinnunen
Abstract:
In this paper, we study the impact of ageing on modern deep speaker embedding based automatic speaker verification (ASV) systems. We have selected two different datasets to examine ageing on the state-of-the-art ECAPA-TDNN system. The first dataset, used for addressing short-term ageing (up to 10 years' time difference between enrollment and test) under uncontrolled conditions, is VoxCeleb. The second dataset, used for addressing the long-term ageing effect (up to 40 years' difference) of Finnish speakers under a more controlled setup, is the Longitudinal Corpus of Finnish Spoken in Helsinki (LCFSH). Our study provides new insights into the impact of speaker ageing on modern ASV systems. Specifically, we establish a quantitative measure between ageing and ASV scores. Further, our research indicates that ageing affects female English speakers to a greater degree than male English speakers, while in the case of Finnish, it has a greater impact on male speakers than female speakers.
Submitted 12 June, 2023;
originally announced June 2023.
-
Breast Cancer Immunohistochemical Image Generation: a Benchmark Dataset and Challenge Review
Authors:
Chuang Zhu,
Shengjie Liu,
Zekuan Yu,
Feng Xu,
Arpit Aggarwal,
Germán Corredor,
Anant Madabhushi,
Qixun Qu,
Hongwei Fan,
Fangda Li,
Yueheng Li,
Xianchao Guan,
Yongbing Zhang,
Vivek Kumar Singh,
Farhan Akram,
Md. Mostafa Kamal Sarker,
Zhongyue Shi,
Mulan Jin
Abstract:
For invasive breast cancer, immunohistochemical (IHC) techniques are often used to detect the expression level of human epidermal growth factor receptor-2 (HER2) in breast tissue to formulate a precise treatment plan. From the perspective of saving manpower, material and time costs, directly generating IHC-stained images from Hematoxylin and Eosin (H&E) stained images is a valuable research direction. Therefore, we held the breast cancer immunohistochemical image generation challenge, aiming to explore novel ideas of deep learning technology in pathological image generation and promote research in this field. The challenge provided registered H&E and IHC-stained image pairs, and participants were required to use these images to train a model that can directly generate IHC-stained images from corresponding H&E-stained images. We selected and reviewed the five highest-ranking methods based on their PSNR and SSIM metrics, while also providing overviews of the corresponding pipelines and implementations. In this paper, we further analyze the current limitations in the field of breast cancer immunohistochemical image generation and forecast the future development of this field. We hope that the released dataset and the challenge will inspire more scholars to jointly study higher-quality IHC-stained image generation.
Submitted 22 September, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Exploring Deep Learning Methods for Classification of SAR Images: Towards NextGen Convolutions via Transformers
Authors:
Aakash Singh,
Vivek Kumar Singh
Abstract:
Images generated by high-resolution SAR have vast areas of application as they can work better in adverse light and weather conditions. One such area of application is in the military systems. This study is an attempt to explore the suitability of current state-of-the-art models introduced in the domain of computer vision for SAR target classification (MSTAR). Since the application of any solution produced for military systems would be strategic and real-time, accuracy is often not the only criterion to measure its performance. Other important parameters like prediction time and input resiliency are equally important. The paper deals with these issues in the context of SAR images. Experimental results show that deep learning models can be suitably applied in the domain of SAR image classification with the desired performance levels.
Submitted 28 March, 2023;
originally announced March 2023.
-
Estimating a scalar log-concave random variable, using a silence set based probabilistic sampling
Authors:
Maben Rabi,
Junfeng Wu,
Vyoma Singh,
Karl Henrik Johansson
Abstract:
We study the probabilistic sampling of a random variable, in which the variable is sampled only if it falls outside a given set, which is called the silence set. This helps us to understand optimal event-based sampling for the special case of IID random processes, and also to understand the design of a sub-optimal scheme for other cases. We consider the design of this probabilistic sampling for a scalar, log-concave random variable, to minimize either the mean square estimation error, or the mean absolute estimation error. We show that the optimal silence interval: (i) is essentially unique, and (ii) is the limit of an iterative procedure of centering. Further we show through numerical experiments that super-level intervals seem to be remarkably near-optimal for mean square estimation. Finally we use the Gauss inequality for scalar unimodal densities, to show that probabilistic sampling gives a mean square distortion that is less than a third of the distortion incurred by periodic sampling, if the average sampling rate is between 0.3 and 0.9 samples per tick.
Submitted 16 March, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
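The silence-set scheme described in the abstract above can be illustrated with a small Monte Carlo experiment. The sketch below assumes a standard Gaussian variable (one example of a log-concave density) and a silence interval symmetric about zero; both choices, the function name `simulate_silence_sampling`, and its parameters are assumptions for illustration and do not reproduce the paper's optimal design. When the variable falls inside the interval nothing is sent and the receiver uses the conditional mean over the interval; otherwise the value is sent and the error is zero.

```python
import random


def simulate_silence_sampling(half_width, n=50_000, seed=0):
    """Estimate (sample_rate, mean_square_distortion) for the silence
    interval [-half_width, half_width] on a standard Gaussian variable."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Draws inside the silence interval are not transmitted.
    silent = [x for x in draws if abs(x) <= half_width]
    sample_rate = 1.0 - len(silent) / n

    # Receiver's estimate during silence: empirical conditional mean.
    estimate = sum(silent) / len(silent) if silent else 0.0

    # Transmitted draws contribute zero error, so only silent draws count.
    distortion = sum((x - estimate) ** 2 for x in silent) / n
    return sample_rate, distortion


if __name__ == "__main__":
    for width in (0.5, 1.0, 2.0):
        rate, mse = simulate_silence_sampling(width)
        print(f"half_width={width}: rate={rate:.3f}, distortion={mse:.3f}")
```

Widening the interval lowers the sampling rate and raises the distortion, which is the trade-off that the choice of silence set controls.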
-
Modeling and Analysis of Multiple Electrostatic Actuators on the Response of Vibrotactile Haptic Device
Authors:
Santosh Mohan Rajkumar,
Kumar Vikram Singh,
Jeong-Hoi Koo
Abstract:
In this research, modeling and analysis of a beam-type touchscreen interface with multiple actuators is considered. A mechanical model of the touchscreen system, treated as thin beams, is developed with embedded electrostatic actuators at different spatial locations. This discrete finite element-based model is developed to compute the analytical and numerical vibrotactile response due to multiple actuators excited with varying frequency and amplitude. The model is tested with spring-damper boundary conditions incorporating sinusoidal excitations in the human haptic range. An analytical solution is proposed to obtain the vibrotactile response of the touch surface for different frequencies of excitations, the number of actuators, actuator stiffness, and actuator positions. The effect of the mechanical properties of the touch surface on the vibrotactile feedback provided to the user is explored. Investigation of optimal location and number of actuators for a desired localized response, such as the magnitude of acceleration and variation in acceleration response for a desired zone on the interface, is carried out. It has been shown that a wide variety of localizable vibrotactile feedback can be generated on the touch surface using different frequencies of excitations, different actuator stiffness, number of actuators, and actuator positions. Having a mechanical model will facilitate simulation studies capable of incorporating more testing scenarios that may not be feasible to physically test.
Submitted 14 February, 2023;
originally announced March 2023.
-
An analysis of the Internet of Things in wireless sensor network technologies
Authors:
Harshit Poddar,
Vansh Singh
Abstract:
Information may be accessed from a distance thanks to computer networks, which may be wireless or wired. Due to recent developments in wireless infrastructure, wireless sensor networks (WSNs) were developed. Activities or events occurring in the environment are monitored, recorded, and managed by WSNs. Through a variety of routing techniques, data relaying is done in these systems. The fourth industrial revolution, or Industry 4.0, is defined as the integration of complex physical automation systems made up of machinery and devices connected by sensors and managed by software. This is done to boost the efficiency and reliability of operations. Industry 4.0 is viewed as a possibility because of industrial IoT, the concept of leveraging IoT technology in manufacturing, which delivers, in an industrial setting, a means of connecting engines, power grids, and sensors to the cloud. In this essay, we'll try to comprehend how the Internet of Things (IoT) works in wireless sensor networks and how it might be used in various situations.
Submitted 23 September, 2022;
originally announced September 2022.
-
A Trio-Method for Retinal Vessel Segmentation using Image Processing
Authors:
Mahendra Kumar Gourisaria,
Vinayak Singh,
Manoj Sahni
Abstract:
Inner retinal neurons are an essential part of the retina and they are supplied with blood via retinal vessels. This paper primarily focuses on the segmentation of retinal vessels using a triple preprocessing approach. The DRIVE database was used and preprocessed by Gabor Filtering, Gaussian Blur, and Edge Detection by Sobel and Pruning. Segmentation was carried out by two proposed U-Net architectures. Both architectures were compared in terms of all the standard performance metrics. Preprocessing generated varied, interesting results which impacted the results shown by the U-Net architectures for segmentation. This real-time deployment can help in the efficient pre-processing of images with better segmentation and detection.
Submitted 19 September, 2022;
originally announced September 2022.
-
Multi Resolution Analysis (MRA) for Approximate Self-Attention
Authors:
Zhanpeng Zeng,
Sourav Pal,
Jeffery Kline,
Glenn M Fung,
Vikas Singh
Abstract:
Transformers have emerged as a preferred model for many tasks in natural langugage processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations based on empirical feedback and design choices informed by modern hardware and implementation challenges, eventually yield a MRA-based approach for self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and demonstrate that this multi-resolution scheme outperforms most efficient self-attention proposals and is favorable for both short and long sequences. Code is available at \url{https://github.com/mlpen/mra-attention}.
△ Less
Submitted 20 July, 2022;
originally announced July 2022.
-
ICOS Protein Expression Segmentation: Can Transformer Networks Give Better Results?
Authors:
Vivek Kumar Singh,
Paul O Reilly,
Jacqueline James,
Manuel Salto Tellez,
Perry Maxwell
Abstract:
Biomarkers identify a patient's response to treatment. Despite recent advances in artificial intelligence based on Transformer networks, only limited research has been done to measure their performance on challenging histopathology images. In this paper, we investigate the efficacy of numerous state-of-the-art Transformer networks for cell segmentation of the immune-checkpoint biomarker Inducible T-cell COStimulator (ICOS) protein in colon cancer from immunohistochemistry (IHC) slides. Extensive experimental results confirm that MiSSFormer achieved the highest Dice score, 74.85%, among the evaluated Transformer and Efficient U-Net methods.
Submitted 23 June, 2022;
originally announced June 2022.
-
Adaptive Traffic Signal Control for Developing Countries Using Fused Parameters Derived from Crowd-Source Data
Authors:
Sumit Mishra,
Vishal Singh,
Ankit Gupta,
Devanjan Bhattacharya,
Abhisek Mudgal
Abstract:
Advancement of mobile technologies has enabled economical collection, storage, processing, and sharing of traffic data. These data are made accessible to intended users through various application program interfaces (API) and can be used to recognize and mitigate congestion in real time. In this paper, quantitative (time of arrival) and qualitative (color-coded congestion levels) data were acquired from the Google traffic APIs. New parameters that reflect heterogeneous traffic conditions were defined and utilized for real-time control of traffic signals while maintaining the green-to-red time ratio. The proposed method utilizes a congestion-avoiding principle commonly used in computer networking. Adaptive congestion levels were observed on three different intersections of Delhi (India), in peak hours. It showed good variation, hence sensitive for the control algorithm to act efficiently. Also, simulation study establishes that proposed control algorithm decreases waiting time and congestion. The proposed method provides an inexpensive alternative for traffic sensing and tracking technologies.
Submitted 11 March, 2022;
originally announced May 2022.
-
Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children's Speech
Authors:
Vishwanath Pratap Singh,
Hardik Sailor,
Supratik Bhattacharya,
Abhishek Pandey
Abstract:
Training a robust Automatic Speech Recognition (ASR) system for children's speech recognition is a challenging task due to inherent differences in acoustic attributes of adult and child speech and scarcity of publicly available children's speech dataset. In this paper, a novel segmental spectrum warping and perturbations in formant energy are introduced, to generate a children-like speech spectrum from that of an adult's speech spectrum. Then, this modified adult spectrum is used as augmented data to improve end-to-end ASR systems for children's speech recognition. The proposed data augmentation methods give 6.5% and 6.1% relative reduction in WER on children dev and test sets respectively, compared to the vocal tract length perturbation (VTLP) baseline system trained on Librispeech 100 hours adult speech dataset. When children's speech data is added in training with Librispeech set, it gives a 3.7 % and 5.1% relative reduction in WER, compared to the VTLP baseline system.
Submitted 13 March, 2022;
originally announced March 2022.
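The "relative reduction in WER" quoted in the abstract above is conventionally the drop in word error rate divided by the baseline WER. A minimal sketch of that arithmetic follows; the function name and the input values are hypothetical and chosen only for illustration, they are not results from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative reduction of new_wer versus baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical WER values (in percent), not taken from the paper.
    baseline, augmented = 20.0, 18.7
    print(f"relative WER reduction: {relative_reduction(baseline, augmented):.1f}%")
```

A relative reduction is independent of the absolute WER level, which is why it is a common way to compare augmentation methods across dev and test sets.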
-
A Mixture of Expert Based Deep Neural Network for Improved ASR
Authors:
Vishwanath Pratap Singh,
Shakti P. Rath,
Abhishek Pandey
Abstract:
This paper presents a novel deep learning architecture for acoustic model in the context of Automatic Speech Recognition (ASR), termed as MixNet. Besides the conventional layers, such as fully connected layers in DNN-HMM and memory cells in LSTM-HMM, the model uses two additional layers based on Mixture of Experts (MoE). The first MoE layer operating at the input is based on pre-defined broad phonetic classes and the second layer operating at the penultimate layer is based on automatically learned acoustic classes. In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification. The ASR accuracy is expected to improve if the conventional architecture of acoustic model is modified to make them more suitable to account for such overlaps. MixNet is developed keeping this in mind. Analysis conducted by means of scatter diagram verifies that MoE indeed improves the separation between classes that translates to better ASR accuracy. Experiments are conducted on a large vocabulary ASR task which show that the proposed architecture provides 13.6% and 10.0% relative reduction in word error rates compared to the conventional models, namely, DNN and LSTM respectively, trained using sMBR criteria. In comparison to an existing method developed for phone-classification (by Eigen et al), our proposed method yields a significant improvement.
Submitted 2 December, 2021;
originally announced December 2021.
-
A higher order Minkowski loss for improved prediction ability of acoustic model in ASR
Authors:
Vishwanath Pratap Singh,
Shakti P. Rath,
Abhishek Pandey
Abstract:
Conventional automatic speech recognition (ASR) systems use a second-order Minkowski loss at inference time, which is suboptimal because it incorporates only first-order statistics in posterior estimation [2]. In this paper we propose higher-order Minkowski losses (4th order and 6th order) at inference time, without any changes at training time. The main contribution of the paper is to show that a higher-order loss uses higher-order statistics in posterior estimation, which improves the prediction ability of the acoustic model in an ASR system. We show mathematically that the posterior probability obtained from the higher-order loss is a function of the second-order posterior, and thus the method can easily be incorporated into a standard ASR system. Note that all changes are proposed at test (inference) time; we do not change the training pipeline. Multiple baseline systems, namely TDNN1, TDNN2, DNN and LSTM, are developed to verify the improvement due to the higher-order Minkowski loss. All experiments are conducted on the LibriSpeech dataset, and the performance metric is word error rate (WER) on the "dev-clean", "test-clean", "dev-other" and "test-other" sets.
Submitted 2 December, 2021;
originally announced December 2021.
-
SRIB Submission to Interspeech 2021 DiCOVA Challenge
Authors:
Vishwanath Pratap Singh,
Shashi Kumar,
Ravi Shekhar Jha,
Abhishek Pandey
Abstract:
The COVID-19 pandemic has resulted in more than 125 million infections and more than 2.7 million casualties. In this paper, we attempt to classify covid vs non-covid cough sounds using signal processing and deep learning methods. Air turbulence, the vibration of tissues, movement of fluid through airways, opening, and closure of glottis are some of the causes for the production of the acoustic sound signals during cough. Does the COVID-19 alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. In this paper, we incorporated novel data augmentation methods for cough sound augmentation and multiple deep neural network architectures and methods along with handcrafted features. Our proposed system gives 14% absolute improvement in area under the curve (AUC). The proposed system is developed as part of Interspeech 2021 special sessions and challenges viz. diagnosing of COVID-19 using acoustics (DiCOVA). Our proposed method secured the 5th position on the leaderboard among 29 participants.
Submitted 15 June, 2021;
originally announced June 2021.
-
Tchebichef Transform Domain-based Deep Learning Architecture for Image Super-resolution
Authors:
Ahlad Kumar,
Harsh Vardhan Singh
Abstract:
The recent outbreak of COVID-19 has motivated researchers to contribute in the area of medical imaging using artificial intelligence and deep learning. Super-resolution (SR), in the past few years, has produced remarkable results using deep learning methods. The ability of deep learning methods to learn the non-linear mapping from low-resolution (LR) images to their corresponding high-resolution (HR) images leads to compelling results for SR in diverse areas of research. In this paper, we propose a deep learning based image super-resolution architecture in Tchebichef transform domain. This is achieved by integrating a transform layer into the proposed architecture through a customized Tchebichef convolutional layer ($TCL$). The role of TCL is to convert the LR image from the spatial domain to the orthogonal transform domain using Tchebichef basis functions. The inversion of the aforementioned transformation is achieved using another layer known as the Inverse Tchebichef convolutional Layer (ITCL), which converts back the LR images from the transform domain to the spatial domain. It has been observed that using the Tchebichef transform domain for the task of SR takes the advantage of high and low-frequency representation of images that makes the task of super-resolution simplified. We, further, introduce transfer learning approach to enhance the quality of Covid based medical images. It is shown that our architecture enhances the quality of X-ray and CT images of COVID-19, providing a better image quality that helps in clinical diagnosis. Experimental results obtained using the proposed Tchebichef transform domain super-resolution (TTDSR) architecture provides competitive results when compared with most of the deep learning methods employed using a fewer number of trainable parameters.
Submitted 22 February, 2021; v1 submitted 21 February, 2021;
originally announced February 2021.
-
WDN: A Wide and Deep Network to Divide-and-Conquer Image Super-resolution
Authors:
Vikram Singh,
Anurag Mittal
Abstract:
Divide and conquer is an established algorithm design paradigm that has proven itself to solve a variety of problems efficiently. However, it is yet to be fully explored in solving problems with a neural network, particularly the problem of image super-resolution. In this work, we propose an approach to divide the problem of image super-resolution into multiple sub-problems and then solve/conquer them with the help of a neural network. Unlike a typical deep neural network, we design an alternate network architecture that is much wider (along with being deeper) than existing networks and is specially designed to implement the divide-and-conquer design paradigm with a neural network. Additionally, a technique to calibrate the intensities of feature map pixels is being introduced. Extensive experimentation on five datasets reveals that our approach towards the problem and the proposed architecture generate better and sharper results than current state-of-the-art methods.
Submitted 7 October, 2020;
originally announced October 2020.
-
Online Graph Completion: Multivariate Signal Recovery in Computer Vision
Authors:
Won Hwa Kim,
Mona Jalal,
Seongjae Hwang,
Sterling C. Johnson,
Vikas Singh
Abstract:
The adoption of "human-in-the-loop" paradigms in computer vision and machine learning is leading to various applications where the actual data acquisition (e.g., human supervision) and the underlying inference algorithms are closely intertwined. While classical work in active learning provides effective solutions when the learning module involves classification and regression tasks, many practical issues such as partially observed measurements, financial constraints and even additional distributional or structural aspects of the data typically fall outside the scope of this treatment. For instance, with sequential acquisition of partial measurements of data that manifest as a matrix (or tensor), novel strategies for completion (or collaborative filtering) of the remaining entries have only been studied recently. Motivated by vision problems where we seek to annotate a large dataset of images via a crowdsourced platform or, alternatively, complement results from a state-of-the-art object detector using human feedback, we study the "completion" problem defined on graphs, where requests for additional measurements must be made sequentially. We design the optimization model in the Fourier domain of the graph, describing how ideas based on adaptive submodularity provide algorithms that work well in practice. On a large set of images collected from Imgur, we see promising results on images that are otherwise difficult to categorize. We also show applications to an experimental design problem in neuroimaging.
Submitted 11 August, 2020;
originally announced August 2020.
-
Modeling and Uncertainty Analysis of Groundwater Level Using Six Evolutionary Optimization Algorithms Hybridized with ANFIS, SVM, and ANN
Authors:
Akram Seifi,
Mohammad Ehteram,
Vijay P. Singh,
Amir Mosavi
Abstract:
In the present study, six meta-heuristic schemes are hybridized with artificial neural network (ANN), adaptive neuro-fuzzy inference system (ANFIS), and support vector machine (SVM), to predict monthly groundwater level (GWL), evaluate uncertainty analysis of predictions and spatial variation analysis. The six schemes, including grasshopper optimization algorithm (GOA), cat swarm optimization (CSO), weed algorithm (WA), genetic algorithm (GA), krill algorithm (KA), and particle swarm optimization (PSO), were used to hybridize for improving the performance of ANN, SVM, and ANFIS models. Groundwater level (GWL) data of Ardebil plain (Iran) for a period of 144 months were selected to evaluate the hybrid models. The pre-processing technique of principal component analysis (PCA) was applied to reduce input combinations from monthly time series up to 12-month prediction intervals. The results showed that the ANFIS-GOA was superior to the other hybrid models for predicting GWL in the first piezometer and third piezometer in the testing stage. The performance of hybrid models with optimization algorithms was far better than that of classical ANN, ANFIS, and SVM models without hybridization. The percent of improvements in the ANFIS-GOA versus standalone ANFIS in piezometer 10 were 14.4%, 3%, 17.8%, and 181% for RMSE, MAE, NSE, and PBIAS in the training stage and 40.7%, 55%, 25%, and 132% in testing stage, respectively. The improvements for piezometer 6 in train step were 15%, 4%, 13%, and 208% and in the test step were 33%, 44.6%, 16.3%, and 173%, respectively, that clearly confirm the superiority of developed hybridization schemes in GWL modeling. Uncertainty analysis showed that ANFIS-GOA and SVM had, respectively, the best and worst performances among other models. In general, GOA enhanced the accuracy of the ANFIS, ANN, and SVM models.
Submitted 28 June, 2020;
originally announced June 2020.
-
View Invariant Human Body Detection and Pose Estimation from Multiple Depth Sensors
Authors:
Walid Bekhtaoui,
Ruhan Sa,
Brian Teixeira,
Vivek Singh,
Klaus Kirchberg,
Yao-jen Chang,
Ankur Kapoor
Abstract:
Point cloud based methods have produced promising results in areas such as 3D object detection in autonomous driving. However, most of the recent point cloud work focuses on single depth sensor data, whereas less work has been done on indoor monitoring applications, such as operation room monitoring in hospitals or indoor surveillance. In these scenarios multiple cameras are often used to tackle occlusion problems. We propose an end-to-end multi-person 3D pose estimation network, Point R-CNN, using multiple point cloud sources. We conduct extensive experiments to simulate challenging real world cases, such as individual camera failures, various target appearances, and complex cluttered scenes with the CMU panoptic dataset and the MVOR operation room dataset. Unlike most of the previous methods that attempt to use multiple sensor information by building complex fusion models, which often lead to poor generalization, we take advantage of the efficiency of concatenating point clouds to fuse the information at the input level. In the meantime, we show our end-to-end network greatly outperforms cascaded state-of-the-art models.
Submitted 8 May, 2020;
originally announced May 2020.
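The abstract above describes fusing multiple depth sensors at the input level by concatenating their point clouds. A minimal sketch of that idea follows. It assumes each sensor comes with a known 4x4 sensor-to-world extrinsic matrix, which is an assumption made here for illustration and is not stated in the abstract; all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix4 = Sequence[Sequence[float]]  # row-major 4x4 sensor-to-world transform (assumed known)


def to_world(points: Sequence[Point], extrinsic: Matrix4) -> List[Point]:
    """Apply a rigid transform to every point of one sensor's cloud."""
    out: List[Point] = []
    for x, y, z in points:
        out.append(tuple(
            extrinsic[r][0] * x + extrinsic[r][1] * y + extrinsic[r][2] * z + extrinsic[r][3]
            for r in range(3)
        ))
    return out


def fuse_point_clouds(clouds: Sequence[Sequence[Point]],
                      extrinsics: Sequence[Matrix4]) -> List[Point]:
    """Concatenate all clouds after moving them into one shared frame."""
    fused: List[Point] = []
    for points, extrinsic in zip(clouds, extrinsics):
        fused.extend(to_world(points, extrinsic))
    return fused


if __name__ == "__main__":
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    shifted = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    # The second sensor is offset by 2 units along x; the third has failed (empty cloud).
    print(fuse_point_clouds([[(0.0, 0.0, 0.0)], [(1.0, 1.0, 1.0)], []],
                            [identity, shifted, identity]))
```

In this sketch a failed camera simply contributes an empty list, so the fused cloud degrades gracefully instead of requiring a separate fusion model per sensor configuration.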
-
Automated Segmentation of Vertebrae on Lateral Chest Radiography Using Deep Learning
Authors:
Sanket Badhe,
Varun Singh,
Joy Li,
Paras Lakhani
Abstract:
The purpose of this study is to develop an automated algorithm for thoracic vertebral segmentation on chest radiography using deep learning. 124 de-identified lateral chest radiographs on unique patients were obtained. Segmentations of visible vertebrae were manually performed by a medical student and verified by a board-certified radiologist. 74 images were used for training, 10 for validation, and 40 were held out for testing. A U-Net deep convolutional neural network was employed for segmentation, using the sum of dice coefficient and binary cross-entropy as the loss function. On the test set, the algorithm demonstrated an average dice coefficient value of 90.5 and an average intersection-over-union (IoU) of 81.75. Deep learning demonstrates promise in the segmentation of vertebrae on lateral chest radiography.
Submitted 5 January, 2020;
originally announced January 2020.
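The abstract above trains with a loss that combines the Dice coefficient and binary cross-entropy, and reports Dice and IoU on the test set. The sketch below shows one common way to compute these quantities on flattened masks. It assumes the Dice term enters the loss as `1 - Dice` (the abstract does not spell this out), and the function names and smoothing constant `eps` are illustrative choices, not the authors' implementation.

```python
import math
from typing import Sequence


def dice_coefficient(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Dice = 2|A∩B| / (|A| + |B|), computed on flattened (soft) masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    return (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def iou(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Intersection over union, computed on flattened (soft) masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return (inter + eps) / (union + eps)


def binary_cross_entropy(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy, with predictions clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def dice_bce_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Assumed combination: (1 - Dice) + BCE, so that lower is better."""
    return (1.0 - dice_coefficient(pred, target)) + binary_cross_entropy(pred, target)


if __name__ == "__main__":
    # Toy 4-pixel masks, purely illustrative.
    prediction = [0.9, 0.8, 0.1, 0.2]
    ground_truth = [1.0, 1.0, 0.0, 0.0]
    print(dice_coefficient(prediction, ground_truth),
          iou(prediction, ground_truth),
          dice_bce_loss(prediction, ground_truth))
```

Dice rewards overlap regardless of how small the foreground is, while cross-entropy gives a per-pixel gradient, so combining them is a frequent choice when the segmented structure covers only a small part of the image.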
-
Attention Guided Anomaly Localization in Images
Authors:
Shashanka Venkataramanan,
Kuan-Chuan Peng,
Rajat Vikram Singh,
Abhijit Mahalanobis
Abstract:
Anomaly localization is an important problem in computer vision which involves localizing anomalous regions within images with applications in industrial inspection, surveillance, and medical imaging. This task is challenging due to the small sample size and pixel coverage of the anomaly in real-world scenarios. Most prior works need to use anomalous training images to compute a class-specific threshold to localize anomalies. Without the need of anomalous training images, we propose Convolutional Adversarial Variational autoencoder with Guided Attention (CAVGA), which localizes the anomaly with a convolutional latent variable to preserve the spatial information. In the unsupervised setting, we propose an attention expansion loss where we encourage CAVGA to focus on all normal regions in the image. Furthermore, in the weakly-supervised setting we propose a complementary guided attention loss, where we encourage the attention map to focus on all normal regions while minimizing the attention map corresponding to anomalous regions in the image. CAVGA outperforms the state-of-the-art (SOTA) anomaly localization methods on MVTec Anomaly Detection (MVTAD), modified ShanghaiTech Campus (mSTC) and Large-scale Attention based Glaucoma (LAG) datasets in the unsupervised setting and when using only 2% anomalous images in the weakly-supervised setting. CAVGA also outperforms SOTA anomaly detection methods on the MNIST, CIFAR-10, Fashion-MNIST, MVTAD, mSTC and LAG datasets.
Submitted 16 July, 2020; v1 submitted 19 November, 2019;
originally announced November 2019.
-
DUAL-GLOW: Conditional Flow-Based Generative Model for Modality Transfer
Authors:
Haoliang Sun,
Ronak Mehta,
Hao H. Zhou,
Zhichun Huang,
Sterling C. Johnson,
Vivek Prabhakaran,
Vikas Singh
Abstract:
Positron emission tomography (PET) imaging is an imaging modality for diagnosing a number of neurological diseases. In contrast to Magnetic Resonance Imaging (MRI), PET is costly and involves injecting a radioactive substance into the patient. Motivated by developments in modality transfer in vision, we study the generation of certain types of PET images from MRI data. We derive new flow-based generative models which we show perform well in this small sample size regime (much smaller than dataset sizes available in standard vision tasks). Our formulation, DUAL-GLOW, is based on two invertible networks and a relation network that maps the latent spaces to each other. We discuss how, given the prior distribution, learning the conditional distribution of PET given the MRI image reduces to obtaining the conditional distribution between the two latent codes w.r.t. the two image types. We also extend our framework to leverage 'side' information (or attributes) when available. By controlling the PET generation through 'conditioning' on age, our model is also able to capture brain FDG-PET (hypometabolism) changes as a function of age. We present experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with 826 subjects, and obtain good performance in PET image synthesis, qualitatively and quantitatively better than recent works.
Submitted 21 August, 2019;
originally announced August 2019.
-
Adversarial Learning with Multiscale Features and Kernel Factorization for Retinal Blood Vessel Segmentation
Authors:
Farhan Akram,
Vivek Kumar Singh,
Hatem A. Rashwan,
Mohamed Abdel-Nasser,
Md. Mostafa Kamal Sarker,
Nidhi Pandey,
Domenec Puig
Abstract:
In this paper, we propose an efficient blood vessel segmentation method for the eye fundus images using adversarial learning with multiscale features and kernel factorization. In the generator network of the adversarial framework, spatial pyramid pooling, kernel factorization and squeeze excitation block are employed to enhance the feature representation in spatial domain on different scales with reduced computational complexity. In turn, the discriminator network of the adversarial framework is formulated by combining convolutional layers with an additional squeeze excitation block to differentiate the generated segmentation mask from its respective ground truth. Before feeding the images to the network, we pre-processed them by using edge sharpening and Gaussian regularization to reach an optimized solution for vessel segmentation. The output of the trained model is post-processed using morphological operations to remove the small speckles of noise. The proposed method qualitatively and quantitatively outperforms state-of-the-art vessel segmentation methods using DRIVE and STARE datasets.
Submitted 5 July, 2019;
originally announced July 2019.
-
An Efficient Solution for Breast Tumor Segmentation and Classification in Ultrasound Images Using Deep Adversarial Learning
Authors:
Vivek Kumar Singh,
Hatem A. Rashwan,
Mohamed Abdel-Nasser,
Md. Mostafa Kamal Sarker,
Farhan Akram,
Nidhi Pandey,
Santiago Romani,
Domenec Puig
Abstract:
This paper proposes an efficient solution for tumor segmentation and classification in breast ultrasound (BUS) images. We propose to add an atrous convolution layer to the conditional generative adversarial network (cGAN) segmentation model to learn tumor features at different resolutions of BUS images. To automatically re-balance the relative impact of each of the highest level encoded features, we also propose to add a channel-wise weighting block in the network. In addition, the SSIM and L1-norm loss with the typical adversarial loss are used as a loss function to train the model. Our model outperforms the state-of-the-art segmentation models in terms of the Dice and IoU metrics, achieving top scores of 93.76% and 88.82%, respectively. In the classification stage, we show that a few statistical features extracted from the shape of the boundaries of the predicted masks can properly discriminate between benign and malignant tumors with an accuracy of 85%.
Submitted 1 July, 2019;
originally announced July 2019.
-
SLSNet: Skin lesion segmentation using a lightweight generative adversarial network
Authors:
Md. Mostafa Kamal Sarker,
Hatem A. Rashwan,
Farhan Akram,
Vivek Kumar Singh,
Syeda Furruka Banu,
Forhad U H Chowdhury,
Kabir Ahmed Choudhury,
Sylvie Chambon,
Petia Radeva,
Domenec Puig,
Mohamed Abdel-Nasser
Abstract:
The determination of precise skin lesion boundaries in dermoscopic images using automated methods faces many challenges, most importantly the presence of hair, inconspicuous lesion edges and low contrast in dermoscopic images, and variability in the color, texture and shape of skin lesions. Existing deep learning-based skin lesion segmentation algorithms are expensive in terms of computational time and memory. Consequently, running such segmentation algorithms requires a powerful GPU and high-bandwidth memory, which are not available in dermoscopy devices. Thus, this article aims to achieve precise skin lesion segmentation with minimal resources by proposing a lightweight, efficient generative adversarial network (GAN) model called SLSNet, which combines 1-D kernel factorized networks, position and channel attention, and multiscale aggregation mechanisms with a GAN model. The 1-D kernel factorized network reduces the computational cost of 2D filtering. The position and channel attention modules enhance the discriminative ability between the lesion and non-lesion feature representations in the spatial and channel dimensions, respectively. A multiscale block is also used to aggregate the coarse-to-fine features of input skin images and reduce the effect of artifacts. SLSNet is evaluated on two publicly available datasets: ISBI 2017 and ISIC 2018. Although SLSNet has only 2.35 million parameters, the experimental results demonstrate that it achieves segmentation results on a par with state-of-the-art skin lesion segmentation methods, with an accuracy of 97.61% and Dice and Jaccard similarity coefficients of 90.63% and 81.98%, respectively. SLSNet can run at more than 110 frames per second (FPS) on a single GTX1080Ti GPU, which is faster than well-known deep learning-based image segmentation models such as FCN. Therefore, SLSNet can be used for practical dermoscopic applications.
Submitted 17 June, 2021; v1 submitted 1 July, 2019;
originally announced July 2019.