Search | arXiv e-print repository

arXiv:2506.20496 [pdf, ps, other]

Critical Anatomy-Preserving & Terrain-Augmenting Navigation (CAPTAiN): Application to Laminectomy Surgical Education

Authors: Jonathan Wang, Hisashi Ishida, David Usevitch, Kesavan Venkatesh, Yi Wang, Mehran Armand, Rachel Bronheim, Amit Jain, Adnan Munawar

Abstract: Surgical training remains a crucial milestone in modern medicine, with procedures such as laminectomy exemplifying the high risks involved. Laminectomy drilling requires precise manual control to mill bony tissue while preserving spinal segment integrity and avoiding breaches in the dura: the protective membrane surrounding the spinal cord. Despite unintended tears occurring in up to 11.3% of cases, no assistive tools are currently utilized to reduce this risk. Variability in patient anatomy further complicates learning for novice surgeons. This study introduces CAPTAiN, a critical anatomy-preserving and terrain-augmenting navigation system that provides layered, color-coded voxel guidance to enhance anatomical awareness during spinal drilling. CAPTAiN was evaluated against a standard non-navigated approach through 110 virtual laminectomies performed by 11 orthopedic residents and medical students. CAPTAiN significantly improved surgical completion rates of target anatomy (87.99% vs. 74.42%) and reduced cognitive load across multiple NASA-TLX domains. It also minimized performance gaps across experience levels, enabling novices to perform on par with advanced trainees. These findings highlight CAPTAiN's potential to optimize surgical execution and support skill development across experience levels. Beyond laminectomy, it demonstrates potential for broader applications across various surgical and drilling procedures, including those in neurosurgery, otolaryngology, and other medical fields.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2505.09858 [pdf, other]

Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

Authors: Danush Kumar Venkatesh, Isabel Funke, Micha Pfeiffer, Fiona Kolbinger, Hanna Maria Schmeiser, Juergen Weitz, Marius Distler, Stefanie Speidel

Abstract: Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks, surgical action recognition and intra-operative event prediction, demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at https://gitlab.com/nct_tso_public/surgvgen.

Submitted 14 May, 2025; originally announced May 2025.

Comments: Early accept at MICCAI 2025

arXiv:2504.05306 [pdf, other]

CREA: A Collaborative Multi-Agent Framework for Creative Content Generation with Diffusion Models

Authors: Kavana Venkatesh, Connor Dunlop, Pinar Yanardag

Abstract: Creativity in AI imagery remains a fundamental challenge, requiring not only the generation of visually compelling content but also the capacity to add novel, expressive, and artistically rich transformations to images. Unlike conventional editing tasks that rely on direct prompt-based modifications, creative image editing demands an autonomous, iterative approach that balances originality, coherence, and artistic intent. To address this, we introduce CREA, a novel multi-agent collaborative framework that mimics the human creative process. Our framework leverages a team of specialized AI agents who dynamically collaborate to conceptualize, generate, critique, and enhance images. Through extensive qualitative and quantitative evaluations, we demonstrate that CREA significantly outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation. By structuring creativity as a dynamic, agentic process, CREA redefines the intersection of AI and art, paving the way for autonomous AI-driven artistic exploration, generative design, and human-AI co-creation. To the best of our knowledge, this is the first work to introduce the task of creative editing.

Submitted 7 April, 2025; originally announced April 2025.

Comments: Project URL: https://crea-diffusion.github.io

arXiv:2504.00022 [pdf, other]

Autonomous AI for Multi-Pathology Detection in Chest X-Rays: A Multi-Site Study in the Indian Healthcare System

Authors: Bargava Subramanian, Shajeev Jaikumar, Praveen Shastry, Naveen Kumarasami, Kalyan Sivasailam, Anandakumar D, Keerthana R, Mounigasri M, Kishore Prasath Venkatesh

Abstract: Study Design: The study outlines the development of an autonomous AI system for chest X-ray (CXR) interpretation, trained on a vast dataset of over 5 million X-rays sourced from healthcare systems across India. This AI system integrates advanced architectures including Vision Transformers, Faster R-CNN, and various U-Net models (such as Attention U-Net, U-Net++, and Dense U-Net) to enable comprehensive classification, detection, and segmentation of 75 distinct pathologies. To ensure robustness, the study design includes subgroup analyses across age, gender, and equipment type, validating the model's adaptability and performance across diverse patient demographics and imaging environments. Performance: The AI system achieved up to 98% precision and over 95% recall for multi-pathology classification, with stable performance across demographic and equipment subgroups. For normal vs. abnormal classification, it reached 99.8% precision, 99.6% recall, and 99.9% negative predictive value (NPV). It was deployed in 17 major healthcare systems in India including diagnostic centers, large hospitals, and government hospitals. Over the deployment period, the system processed over 150,000 scans, averaging 2,000 chest X-rays daily, resulting in reduced reporting times and improved diagnostic accuracy. Conclusion: The high precision and recall validate the AI's capability as a reliable tool for autonomous normal/abnormal classification, pathology localization, and segmentation. This scalable AI model addresses diagnostic gaps in underserved areas, optimizing radiology workflows and enhancing patient care across diverse healthcare settings in India.

Submitted 2 April, 2025; v1 submitted 28 March, 2025; originally announced April 2025.

Comments: 27 pages, 8 figures

MSC Class: 68T07

arXiv:2503.22176 [pdf, other]

A Multi-Site Study on AI-Driven Pathology Detection and Osteoarthritis Grading from Knee X-Ray

Authors: Bargava Subramanian, Naveen Kumarasami, Praveen Shastry, Kalyan Sivasailam, Anandakumar D, Keerthana R, Mounigasri M, Abilaasha G, Kishore Prasath Venkatesh

Abstract: Introduction: Bone health disorders like osteoarthritis and osteoporosis pose major global health challenges, often leading to delayed diagnoses due to limited diagnostic tools. This study presents an AI-powered system that analyzes knee X-rays to detect key pathologies, including joint space narrowing, sclerosis, osteophytes, tibial spikes, alignment issues, and soft tissue anomalies. It also grades osteoarthritis severity, enabling timely, personalized treatment. Study Design: The research used 1.3 million knee X-rays from a multi-site Indian clinical trial across government, private, and SME hospitals. The dataset ensured diversity in demographics, imaging equipment, and clinical settings. Rigorous annotation and preprocessing yielded high-quality training datasets for pathology-specific models like ResNet15 for joint space narrowing and DenseNet for osteoarthritis grading. Performance: The AI system achieved strong diagnostic accuracy across diverse imaging environments. Pathology-specific models excelled in precision, recall, and NPV, validated using Mean Squared Error (MSE), Intersection over Union (IoU), and Dice coefficient. Subgroup analyses across age, gender, and manufacturer variations confirmed generalizability for real-world applications. Conclusion: This scalable, cost-effective solution for bone health diagnostics demonstrated robust performance in a multi-site trial. It holds promise for widespread adoption, especially in resource-limited healthcare settings, transforming bone health management and enabling proactive patient care.

Submitted 28 March, 2025; originally announced March 2025.

Comments: 15 pages, 2 figures

MSC Class: 68T07

arXiv:2503.20316 [pdf, other]

AI-Driven MRI Spine Pathology Detection: A Comprehensive Deep Learning Approach for Automated Diagnosis in Diverse Clinical Settings

Authors: Bargava Subramanian, Naveen Kumarasami, Praveen Shastry, Raghotham Sripadraj, Kalyan Sivasailam, Anandakumar D, Abinaya Ramachandran, Sudhir MP, Gunakutti G, Kishore Prasath Venkatesh

Abstract: Study Design: This study presents the development of an autonomous AI system for MRI spine pathology detection, trained on a dataset of 2 million MRI spine scans sourced from diverse healthcare facilities across India. The AI system integrates advanced architectures, including Vision Transformers, U-Net with cross-attention, MedSAM, and Cascade R-CNN, enabling comprehensive classification, segmentation, and detection of 43 distinct spinal pathologies. The dataset is balanced across age groups, genders, and scanner manufacturers to ensure robustness and adaptability. Subgroup analyses were conducted to validate the model's performance across different patient demographics, imaging conditions, and equipment types. Performance: The AI system achieved up to 97.9 percent multi-pathology detection, demonstrating consistent performance across age, gender, and manufacturer subgroups. The normal vs. abnormal classification achieved 98.0 percent accuracy, and the system was deployed across 13 major healthcare enterprises in India, encompassing diagnostic centers, large hospitals, and government facilities. During deployment, it processed approximately 100,000 plus MRI spine scans, leading to reduced reporting times and increased diagnostic efficiency by automating the identification of common spinal conditions. Conclusion: The AI system's high precision and recall validate its capability as a reliable tool for autonomous normal/abnormal classification, pathology segmentation, and detection. Its scalability and adaptability address critical diagnostic gaps, optimize radiology workflows, and improve patient care across varied healthcare environments in India.

Submitted 28 March, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

Comments: 20 pages, 3 figures

MSC Class: 68T07

arXiv:2503.20306 [pdf, other]

3D Convolutional Neural Networks for Improved Detection of Intracranial bleeding in CT Imaging

Authors: Bargava Subramanian, Naveen Kumarasami, Praveen Shastry, Kalyan Sivasailam, Anandakumar D, Elakkiya R, Harsha KG, Rithanya V, Harini T, Afshin Hussain, Kishore Prasath Venkatesh

Abstract: Background: Intracranial bleeding (IB) is a life-threatening condition caused by traumatic brain injuries, including epidural, subdural, subarachnoid, and intraparenchymal hemorrhages. Rapid and accurate detection is crucial to prevent severe complications. Traditional imaging can be slow and prone to variability, especially in high-pressure scenarios. Artificial Intelligence (AI) provides a solution by quickly analyzing medical images, identifying subtle hemorrhages, and flagging urgent cases. By enhancing diagnostic speed and accuracy, AI improves workflows and patient care. This article explores AI's role in transforming IB detection in emergency settings. Methods: A U-shaped 3D Convolutional Neural Network (CNN) automates IB detection and classification in volumetric CT scans. Advanced preprocessing, including CLAHE and intensity normalization, enhances image quality. The architecture preserves spatial and contextual details for precise segmentation. A dataset of 2,912 annotated CT scans was used for training and evaluation. Results: The model achieved high performance across major bleed types, with precision, recall, and accuracy exceeding 90 percent in most cases (96 percent precision for epidural hemorrhages and 94 percent accuracy for subarachnoid hemorrhages). Its ability to classify and localize hemorrhages highlights its clinical reliability. Conclusion: This U-shaped 3D CNN offers a scalable solution for automating IB detection, reducing diagnostic delays, and improving emergency care outcomes. Future work will expand dataset diversity, optimize real-time processing, and integrate multimodal data for enhanced clinical applicability.

Submitted 26 March, 2025; originally announced March 2025.

Comments: 12 pages, 4 figures

MSC Class: 68T07

arXiv:2503.14538 [pdf, other]

Vision-Language Models for Acute Tuberculosis Diagnosis: A Multimodal Approach Combining Imaging and Clinical Data

Authors: Ananya Ganapthy, Praveen Shastry, Naveen Kumarasami, Anandakumar D, Keerthana R, Mounigasri M, Varshinipriya M, Kishore Prasath Venkatesh, Bargava Subramanian, Kalyan Sivasailam

Abstract: Background: This study introduces a Vision-Language Model (VLM) leveraging SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening. By integrating chest X-ray images and clinical notes, the model aims to enhance diagnostic accuracy and efficiency, particularly in resource-limited settings. Methods: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture employs SIGLIP for visual encoding and Gemma-3b for decoding, ensuring effective representation of acute TB-specific pathologies and clinical insights. Results: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97 percent) and recall (96 percent). The model demonstrated strong spatial localization capabilities and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The multimodal capability of the VLM reduces reliance on radiologists, providing a scalable solution for acute TB screening. Future work will focus on improving the detection of subtle pathologies and addressing dataset biases to enhance its generalizability and application in diverse global healthcare settings.

Submitted 1 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

Comments: 11 pages, 3 figures

MSC Class: 68T07; 68T45; 92C55; 92C50; 68U10

arXiv:2503.14536 [pdf, other]

Advancing Chronic Tuberculosis Diagnostics Using Vision-Language Models: A Multi modal Framework for Precision Analysis

Authors: Praveen Shastry, Sowmya Chowdary Muthulur, Naveen Kumarasami, Anandakumar D, Mounigasri M, Keerthana R, Kishore Prasath Venkatesh, Bargava Subramanian, Kalyan Sivasailam, Revathi Ezhumalai, Abitha Marimuthu

Abstract: Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding and a transformer-based text encoder to process clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned using 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94 percent) and recall (94 percent) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable and context-aware insights. Future work will address subtle pathologies and dataset biases to enhance the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.

Submitted 28 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

Comments: 10 pages, 3 figures

MSC Class: 68T07; 92C55; 68U10; 92C50; 60G35

arXiv:2503.11281 [pdf, other]

AI and Deep Learning for Automated Segmentation and Quantitative Measurement of Spinal Structures in MRI

Authors: Praveen Shastry, Bhawana Sonawane, Kavya Mohan, Naveen Kumarasami, Raghotham Sripadraj, Anandakumar D, Keerthana R, Mounigasri M, Kaviya SP, Kishore Prasath Venkatesh, Bargava Subramanian, Kalyan Sivasailam

Abstract: Background: Accurate spinal structure measurement is crucial for assessing spine health and diagnosing conditions like spondylosis, disc herniation, and stenosis. Manual methods for measuring intervertebral disc height and spinal canal diameter are subjective and time-consuming. Automated solutions are needed to improve accuracy, efficiency, and reproducibility in clinical practice. Purpose: This study develops an autonomous AI system for segmenting and measuring key spinal structures in MRI scans, focusing on intervertebral disc height and spinal canal anteroposterior (AP) diameter in the cervical, lumbar, and thoracic regions. The goal is to reduce clinician workload, enhance diagnostic consistency, and improve assessments. Methods: The AI model leverages deep learning architectures, including UNet, nnU-Net, and CNNs. Trained on a large proprietary MRI dataset, it was validated against expert annotations. Performance was evaluated using Dice coefficients and segmentation accuracy. Results: The AI model achieved Dice coefficients of 0.94 for lumbar, 0.91 for cervical, and 0.90 for dorsal spine segmentation (D1-D12). It precisely measured spinal parameters like disc height and canal diameter, demonstrating robustness and clinical applicability. Conclusion: The AI system effectively automates MRI-based spinal measurements, improving accuracy and reducing clinician workload. Its consistent performance across spinal regions supports clinical decision-making, particularly in high-demand settings, enhancing spinal assessments and patient outcomes.

Submitted 19 March, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

Comments: 16 pages, 2 figures

MSC Class: 92C55; 68T07; 68U10; 62P10; 65D18

arXiv:2503.10717 [pdf, other]

Deep Learning-Based Automated Workflow for Accurate Segmentation and Measurement of Abdominal Organs in CT Scans

Authors: Praveen Shastry, Ashok Sharma, Kavya Mohan, Naveen Kumarasami, Anandakumar D, Mounigasri M, Keerthana R, Kishore Prasath Venkatesh, Bargava Subramanian, Kalyan Sivasailam

Abstract: Background: Automated analysis of CT scans for abdominal organ measurement is crucial for improving diagnostic efficiency and reducing inter-observer variability. Manual segmentation and measurement of organs such as the kidneys, liver, spleen, and prostate are time-consuming and subject to inconsistency, underscoring the need for automated approaches. Purpose: The purpose of this study is to develop and validate an automated workflow for the segmentation and measurement of abdominal organs in CT scans using advanced deep learning models, in order to improve accuracy, reliability, and efficiency in clinical evaluations. Methods: The proposed workflow combines nnU-Net and U-Net++ for organ segmentation, followed by a 3D RCNN model for measuring organ volumes and dimensions. The models were trained and evaluated on CT datasets with metrics such as precision, recall, and Mean Squared Error (MSE) to assess performance. Segmentation quality was verified for its adaptability to variations in patient anatomy and scanner settings. Results: The developed workflow achieved high precision and recall values, exceeding 95 for all targeted organs. The Mean Squared Error (MSE) values were low, indicating a high level of consistency between predicted and ground truth measurements. The segmentation and measurement pipeline demonstrated robust performance, providing accurate delineation and quantification of the kidneys, liver, spleen, and prostate. Conclusion: The proposed approach offers an automated, efficient, and reliable solution for abdominal organ measurement in CT scans. By significantly reducing manual intervention, this workflow enhances measurement accuracy and consistency, with potential for widespread clinical implementation. Future work will focus on expanding the approach to other organs and addressing complex pathological cases.

Submitted 13 March, 2025; originally announced March 2025.

Comments: 13 pages, 3 figures

MSC Class: 68T99

arXiv:2412.09614 [pdf, other]

Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG

Authors: Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag

Abstract: We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. This capability significantly improves upon the limitations of existing T2I models, which often struggle with the accurate depiction of complex or culturally specific subjects due to dataset constraints. Furthermore, we propose a novel self-correcting mechanism for text-to-image models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. Our qualitative and quantitative experiments demonstrate that Context Canvas significantly enhances the capabilities of popular models such as Flux, Stable Diffusion, and DALL-E, and improves the functionality of ControlNet for fine-grained image editing tasks. To our knowledge, Context Canvas represents the first application of graph-based RAG in enhancing T2I models, representing a significant advancement for producing high-fidelity, context-aware multi-faceted images.

Submitted 12 December, 2024; originally announced December 2024.

Comments: Project Page: https://context-canvas.github.io/

arXiv:2412.09611 [pdf, other]

FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

Authors: Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag

Abstract: Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation prevents the ability to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method leveraging a representation space with the ability to control the semantics of images generated by rectified flow transformers, such as Flux. By leveraging the representations learned by the transformer blocks within the rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable and effective image editing approach, along with its disentanglement capabilities.

Submitted 12 December, 2024; originally announced December 2024.

Comments: Project Page: https://fluxspace.github.io

arXiv:2410.07753 [pdf, other]

Data Augmentation for Surgical Scene Segmentation with Anatomy-Aware Diffusion Models

Authors: Danush Kumar Venkatesh, Dominik Rivoir, Micha Pfeiffer, Fiona Kolbinger, Stefanie Speidel

Abstract: In computer-assisted surgery, automatically recognizing anatomical organs is crucial for understanding the surgical scene and providing intraoperative assistance. While machine learning models can identify such structures, their deployment is hindered by the need for labeled, diverse surgical datasets with anatomical annotations. Labeling multiple classes (i.e., organs) in a surgical scene is time-intensive, requiring medical experts. Although synthetically generated images can enhance segmentation performance, maintaining both organ structure and texture during generation is challenging. We introduce a multi-stage approach using diffusion models to generate multi-class surgical datasets with annotations. Our framework improves anatomy awareness by training organ specific models with an inpainting objective guided by binary segmentation masks. The organs are generated with an inference pipeline using pre-trained ControlNet to maintain the organ structure. The synthetic multi-class datasets are constructed through an image composition step, ensuring structural and textural consistency. This versatile approach allows the generation of multi-class datasets from real binary datasets and simulated surgical masks. We thoroughly evaluate the generated datasets on image quality and downstream segmentation, achieving a $15\%$ improvement in segmentation scores when combined with real images. The code is available at https://gitlab.com/nct_tso_public/muli-class-image-synthesis

Submitted 21 November, 2024; v1 submitted 10 October, 2024; originally announced October 2024.

Comments: Accepted at WACV 2025

arXiv:2409.09944 [pdf]

doi 10.1109/ICEECCOT43722.2018.9001543

Fault Analysis And Predictive Maintenance Of Induction Motor Using Machine Learning

Authors: Kavana Venkatesh, Neethi M

Abstract: Induction motors are among the most crucial pieces of electrical equipment and are extensively used in industries in a wide range of applications. This paper presents a machine learning model for the fault detection and classification of induction motor faults by using three-phase voltages and currents as inputs. The aim of this work is to protect vital electrical components and to prevent abnormal event progression through early detection and diagnosis. This work presents a fast forward artificial neural network model to detect some of the commonly occurring electrical faults such as overvoltage, undervoltage, single phasing, unbalanced voltage, overload, and ground fault. A separate model-free monitoring system wherein the motor itself acts like a sensor is presented and the only monitored signals are the input given to the motor. Limits for current and voltage values are set for the faulty and healthy conditions, which is done by a classifier. Real-time data from a 0.33 HP induction motor is used to train and test the neural network. The model so developed analyses the voltage and current values given at a particular instant and classifies the data into no fault or the specific fault. The model is then interfaced with a real motor to accurately detect and classify the faults so that further necessary action can be taken.

Submitted 15 September, 2024; originally announced September 2024.

Comments: Presented at ICEECCOT-2018, Published in IEEE Xplore, 6 pages, 3 figures

Journal ref: ICEECCOT-2018, Mysuru, India, 2018, pp. 1-6

arXiv:2408.09822 [pdf, other]

SurgicaL-CD: Generating Surgical Images via Unpaired Image Translation with Latent Consistency Diffusion Models

Authors: Danush Kumar Venkatesh, Dominik Rivoir, Micha Pfeiffer, Stefanie Speidel

Abstract: Computer-assisted surgery (CAS) systems are designed to assist surgeons during procedures, thereby reducing complications and enhancing patient care. Training machine learning models for these systems requires a large corpus of annotated datasets, which is challenging to obtain in the surgical domain due to patient privacy concerns and the significant labeling effort required from doctors. Previous methods have explored unpaired image translation using generative models to create realistic surgical images from simulations. However, these approaches have struggled to produce high-quality, diverse surgical images. In this work, we introduce SurgicaL-CD, a consistency-distilled diffusion method to generate realistic surgical images with only a few sampling steps without paired data. We evaluate our approach on three datasets, assessing the generated images in terms of quality and utility as downstream training datasets. Our results demonstrate that our method outperforms GANs and diffusion-based approaches. Our code is available at https://gitlab.com/nct_tso_public/gan2diffusion.

Submitted 11 October, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

Comments: Accepted at ECCV workshop on Synthetic Data for Computer Vision

arXiv:2402.08088 [pdf, other]

Out-of-Distribution Detection and Data Drift Monitoring using Statistical Process Control

Authors: Ghada Zamzmi, Kesavan Venkatesh, Brandon Nelson, Smriti Prathapan, Paul H. Yi, Berkman Sahiner, Jana G. Delfino

Abstract: Background: Machine learning (ML) methods often fail with data that deviates from their training distribution. This is a significant concern for ML-enabled devices in clinical settings, where data drift may cause unexpected performance that jeopardizes patient safety. Method: We propose an ML-enabled Statistical Process Control (SPC) framework for out-of-distribution (OOD) detection and drift monitoring. SPC is advantageous as it visually and statistically highlights deviations from the expected distribution. To demonstrate the utility of the proposed framework for monitoring data drift in radiological images, we investigated different design choices, including methods for extracting feature representations, drift quantification, and SPC parameter selection. Results: We demonstrate the effectiveness of our framework for two tasks: 1) differentiating axial vs. non-axial computed tomography (CT) images and 2) separating chest x-ray (CXR) from other modalities. For both tasks, we achieved high accuracy in detecting OOD inputs, with 0.913 in CT and 0.995 in CXR, and sensitivity of 0.980 in CT and 0.984 in CXR. Our framework was also adept at monitoring data streams and identifying the time a drift occurred. In a simulation with 100 daily CXR cases, we detected a drift in OOD input percentage from 0-1% to 3-5% within two days, maintaining a low false-positive rate. Through additional experimental results, we demonstrate the framework's data-agnostic nature and independence from the underlying model's structure. Conclusion: We propose a framework for OOD detection and drift monitoring that is agnostic to data, modality, and model. The framework is customizable and can be adapted for specific applications.

Submitted 12 February, 2024; originally announced February 2024.

arXiv:2312.08621 [pdf, other]

Quadrupedal Locomotion Control On Inclined Surfaces Using Collocation Method

Authors: Adarsh Salagame, Maria Gianello, Chenghao Wang, Kaushik Venkatesh, Shreyansh Pitroda, Rohit Rajput, Eric Sihite, Miriam Leeser, Alireza Ramezani

Abstract: Inspired by Chukars' wing-assisted incline running (WAIR), in this work, we employ a high-fidelity model of our Husky Carbon quadrupedal-legged robot to walk over steep slopes of up to 45 degrees. Chukars use the aerodynamic forces generated by their flapping wings to manipulate ground contact forces and traverse steep slopes and even overhangs. By exploiting the thrusters on Husky, we employed a collocation approach to rapidly resolve the joint and thruster actions. Our approach uses a polynomial approximation of the reduced-order dynamics of Husky, called HROM, to quickly and efficiently find optimal control actions that permit high-slope walking without violating friction cone conditions.

Submitted 13 December, 2023; originally announced December 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2306.00179

arXiv:2311.14878 [pdf, other]

How Strong a Kick Should be to Topple Northeastern's Tumbling Robot?

Authors: Adarsh Salagame, Neha Bhattachan, Andre Caetano, Ian McCarthy, Henry Noyes, Brandon Petersen, Alexander Qiu, Matthew Schroeter, Nolan Smithwick, Konrad Sroka, Jason Widjaja, Yash Bohra, Kaushik Venkatesh, Kruthika Gangaraju, Paul Ghanem, Ioannis Mandralis, Eric Sihite, Arash Kalantari, Alireza Ramezani

Abstract: Rough terrain locomotion has remained one of the most challenging mobility questions. In 2022, NASA's Innovative Advanced Concepts (NIAC) Program invited US academic institutions to participate in NASA's Breakthrough, Innovative & Game-changing (BIG) Idea competition by proposing novel mobility systems that can negotiate extremely rough terrain, such as bumpy lunar craters. In this competition, Northeastern University won NASA's top Artemis Award by proposing an articulated robot tumbler called COBRA (Crater Observing Bio-inspired Rolling Articulator). This report briefly explains the underlying principles that made COBRA successful in competing with other concepts ranging from cable-driven to multi-legged designs from six other participating US institutions.

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2309.03048 [pdf, other]

Exploring Semantic Consistency in Unpaired Image Translation to Generate Data for Surgical Applications

Authors: Danush Kumar Venkatesh, Dominik Rivoir, Micha Pfeiffer, Fiona Kolbinger, Marius Distler, Jürgen Weitz, Stefanie Speidel

Abstract: In surgical computer vision applications, obtaining labeled training data is challenging due to data-privacy concerns and the need for expert annotation. Unpaired image-to-image translation techniques have been explored to automatically generate large annotated datasets by translating synthetic images to the realistic domain. However, preserving the structure and semantic consistency between the input and translated images presents significant challenges, mainly when there is a distributional mismatch in the semantic characteristics of the domains. This study empirically investigates unpaired image translation methods for generating suitable data in surgical applications, explicitly focusing on semantic consistency. We extensively evaluate various state-of-the-art image translation models on two challenging surgical datasets and downstream semantic segmentation tasks. We find that a simple combination of structural-similarity loss and contrastive learning yields the most promising results. Quantitatively, we show that the data generated with this approach yields higher semantic consistency and can be used more effectively as training data. The code is available at https://gitlab.com/nct_tso_public/constructs.

Submitted 21 February, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: Accepted at IPCAI 2024

arXiv:2308.00183 [pdf, other]

Hovering Control of Flapping Wings in Tandem with Multi-Rotors

Authors: Aniket Dhole, Bibek Gupta, Adarsh Salagame, Xuejian Niu, Yizhe Xu, Kaushik Venkatesh, Paul Ghanem, Ioannis Mandralis, Eric Sihite, Alireza Ramezani

Abstract: This work briefly covers our efforts to stabilize the flight dynamics of Northeastern's tailless bat-inspired micro aerial vehicle, Aerobat. Flapping robots are not new. A plethora of examples is mainly dominated by insect-style design paradigms that are passively stable. However, Aerobat, in addition to being tailless, possesses morphing wings that add to the inherent complexity of flight control. The robot can dynamically adjust its wing platform configurations during gait cycles, increasing its efficiency and agility. We employ a guard design with manifold small thrusters to stabilize Aerobat's position and orientation in hovering, a flapping system in tandem with a multi-rotor. For flight control purposes, we take an approach based on assuming the guard cannot observe Aerobat's states. Then, we propose an observer to estimate the unknown states of the guard, which are then used for closed-loop hovering control of the Guard-Aerobat platform.

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2206.08738 [pdf, other]

Detecting Adversarial Examples in Batches -- a geometrical approach

Authors: Danush Kumar Venkatesh, Peter Steinbach

Abstract: Many deep learning methods have successfully solved complex tasks in computer vision and speech recognition applications. Nonetheless, the robustness of these models has been found to be vulnerable to perturbed inputs or adversarial examples, which are imperceptible to the human eye, but lead the model to erroneous output decisions. In this study, we adapt and introduce two geometric metrics, density and coverage, and evaluate their use in detecting adversarial samples in batches of unseen data. We empirically study these metrics using MNIST and two real-world biomedical datasets from MedMNIST, subjected to two different adversarial attacks. Our experiments show promising results for both metrics to detect adversarial examples. We believe that this work can lay the ground for further study on these metrics' use in deployed machine learning systems to monitor for possible attacks by adversarial examples or related pathologies such as dataset shift.

Submitted 17 June, 2022; originally announced June 2022.

Comments: Submitted to AdvML workshop at ICML2022

arXiv:2204.05591 [pdf]

Automatic detection of glaucoma via fundus imaging and artificial intelligence: A review

Authors: Lauren Coan, Bryan Williams, Krishna Adithya Venkatesh, Swati Upadhyaya, Silvester Czanner, Rengaraj Venkatesh, Colin E. Willoughby, Srinivasan Kavitha, Gabriela Czanner

Abstract: Glaucoma is a leading cause of irreversible vision impairment globally and cases are continuously rising worldwide. Early detection is crucial, allowing timely intervention which can prevent further visual field loss. To detect glaucoma, examination of the optic nerve head via fundus imaging can be performed, at the centre of which is the assessment of the optic cup and disc boundaries. Fundus imaging is non-invasive and low-cost; however, the image examination relies on subjective, time-consuming, and costly expert assessments. A timely question to ask is whether artificial intelligence can mimic glaucoma assessments made by experts. Namely, can artificial intelligence automatically find the boundaries of the optic cup and disc (providing a so-called segmented fundus image) and then use the segmented image to identify glaucoma with high accuracy? We conducted a comprehensive review of artificial intelligence-enabled glaucoma detection frameworks that produce and use segmented fundus images. We found 28 papers and identified two main approaches: 1) logical rule-based frameworks, based on a set of simplistic decision rules; and 2) machine learning/statistical modelling based frameworks. We summarise the state of the art of the two approaches and highlight the key hurdles to overcome for artificial intelligence-enabled glaucoma detection frameworks to be translated into clinical practice.

Submitted 12 April, 2022; originally announced April 2022.

arXiv:2112.14382 [pdf, other]

Self-Supervised Robustifying Guidance for Monocular 3D Face Reconstruction

Authors: Hitika Tiwari, Min-Hung Chen, Yi-Min Tsai, Hsien-Kai Kuo, Hung-Jen Chen, Kevin Jou, K. S. Venkatesh, Yong-Sheng Chen

Abstract: Despite the recent developments in 3D Face Reconstruction from occluded and noisy face images, the performance is still unsatisfactory. Moreover, most existing methods rely on additional dependencies, posing numerous constraints over the training procedure. Therefore, we propose a Self-Supervised RObustifying GUidancE (ROGUE) framework to obtain robustness against occlusions and noise in the face images. The proposed network contains 1) the Guidance Pipeline to obtain the 3D face coefficients for the clean faces and 2) the Robustification Pipeline to acquire the consistency between the estimated coefficients for occluded or noisy images and the clean counterpart. The proposed image- and feature-level loss functions aid the ROGUE learning process without posing additional dependencies. To facilitate model evaluation, we propose two challenging occlusion face datasets, ReaChOcc and SynChOcc, containing real-world and synthetic occlusion-based face images for robustness evaluation. Also, a noisy variant of the test dataset of CelebA is produced for evaluation. Our method outperforms the current state-of-the-art method by large margins (e.g., for the perceptual errors, a reduction of 23.8% for real-world occlusions, 26.4% for synthetic occlusions, and 22.7% for noisy images), demonstrating the effectiveness of the proposed approach. The occlusion datasets and the corresponding evaluation code are released publicly at https://github.com/ArcTrinity9/Datasets-ReaChOcc-and-SynChOcc.

Submitted 21 October, 2022; v1 submitted 28 December, 2021; originally announced December 2021.

Comments: Accepted by The 33rd British Machine Vision Conference (BMVC) 2022. Evaluation code and datasets: https://github.com/ArcTrinity9/Datasets-ReaChOcc-and-SynChOcc

arXiv:2111.08275 [pdf, other]

Deep Distilling: automated code generation using explainable deep learning

Authors: Paul J. Blazek, Kesavan Venkatesh, Milo M. Lin

Abstract: Human reasoning can distill principles from observed patterns and generalize them to explain and solve novel problems. The most powerful artificial intelligence systems lack explainability and symbolic reasoning ability, and have therefore not achieved supremacy in domains requiring human understanding, such as science or common sense reasoning. Here we introduce deep distilling, a machine learning method that learns patterns from data using explainable deep learning and then condenses it into concise, executable computer code. The code, which can contain loops, nested logical statements, and useful intermediate variables, is equivalent to the neural network but is generally orders of magnitude more compact and human-comprehensible. On a diverse set of problems involving arithmetic, computer vision, and optimization, we show that deep distilling generates concise code that generalizes out-of-distribution to solve problems orders-of-magnitude larger and more complex than the training data. For problems with a known ground-truth rule set, deep distilling discovers the rule set exactly with scalable guarantees. For problems that are ambiguous or computationally intractable, the distilled rules are similar to existing human-derived algorithms and perform at par or better. Our approach demonstrates that unassisted machine intelligence can build generalizable and intuitive rules explaining patterns in large datasets that would otherwise overwhelm human reasoning.

Submitted 16 November, 2021; originally announced November 2021.

MSC Class: 68T05 (Primary); 68T07; 68T20; 68T37 (Secondary) ACM Class: I.2.2; I.2.6

arXiv:2108.05287 [pdf, other]

Semantic Mobile Base Station Placement

Authors: Kritik Soman, K. S. Venkatesh

Abstract: Location of Base Stations (BS) in mobile networks plays an important role in coverage and received signal strength. As Internet of Things (IoT), autonomous vehicles and smart cities evolve, wireless network coverage will have an important role in ensuring seamless connectivity. Due to use of higher carrier frequencies, blockages cause communication to primarily be Line of Sight (LoS), increasing the importance of base station placement. In this paper, we propose a novel placement pipeline in which we perform semantic segmentation of aerial drone imagery using DeepLabv3+ and create its 2.5D model with the help of a Digital Surface Model (DSM). This is used along with the Vienna simulator for finding the best location for deploying base stations by formulating the problem as a multi-objective function and solving it using Non-Dominated Sorting Genetic Algorithm II (NSGA-II). The cases with and without a prior deployed base station are considered. We evaluate the base station deployment based on Signal to Interference Noise Ratio (SINR) coverage probability and user down-link throughput. This is followed by a comparison with other base station placement methods and the benefits offered by our approach. Our work is novel as it considers scenarios where there is high ground elevation and building density variation, and shows that irregular BS placement improves coverage.

Submitted 11 August, 2021; originally announced August 2021.

Comments: 12 pages

MSC Class: 68T01

arXiv:2104.02656 [pdf, other]

Collaborative Learning to Generate Audio-Video Jointly

Authors: Vinod K Kurmi, Vipul Bajaj, Badri N Patro, K S Venkatesh, Vinay P Namboodiri, Preethi Jyothi

Abstract: There have been a number of techniques that have demonstrated the generation of multimedia data for one modality at a time using GANs, such as the ability to generate images, videos, and audio. However, so far, the task of multi-modal generation of data, specifically for audio and videos both, has not been sufficiently well-explored. Towards this, we propose a method that demonstrates that we are able to generate naturalistic samples of video and audio data by the joint correlated generation of audio and video modalities. The proposed method uses multiple discriminators to ensure that the audio, video, and the joint output are also indistinguishable from real-world samples. We present a dataset for this task and show that we are able to generate realistic samples. This method is validated using various standard metrics such as Inception Score, Frechet Inception Distance (FID) and through human evaluation.

Submitted 31 March, 2021; originally announced April 2021.

Comments: ICASSP 2021 (Accepted)

arXiv:1705.07080 [pdf, other]

Bitwise Operations of Cellular Automaton on Gray-scale Images

Authors: Karttikeya Mangalam, K S Venkatesh

Abstract: Cellular Automata (CA) theory is a discrete model that represents the state of each of its cells from a finite set of possible values which evolve in time according to a pre-defined set of transition rules. CA have been applied to a number of image processing tasks such as Convex Hull Detection, Image Denoising etc. but mostly under the limitation of restricting the input to binary images. In general, a gray-scale image may be converted to a number of different binary images which are finally recombined after CA operations on each of them individually. We have developed a multinomial regression based weighted summation method to recombine binary images for better performance of CA based Image Processing algorithms. The recombination algorithm is tested for the specific case of denoising Salt and Pepper Noise to test against standard benchmark algorithms such as the Median Filter for various images and noise levels. The results indicate several interesting invariances in the application of the CA, such as the particular noise realization and the choice of sub-sampling of pixels to determine recombination weights. Additionally, it appears that simpler algorithms for weight optimization which seek local minima work as effectively as those that seek global minima such as Simulated Annealing.

Submitted 19 May, 2017; originally announced May 2017.

Comments: 5 Pages. The code is available at : https://github.com/karttikeya/Bitwise-CA-Opeartions/

arXiv:1703.02340 [pdf, ps, other]

Design and Development of an automated Robotic Pick & Stow System for an e-Commerce Warehouse

Authors: Swagat Kumar, Anima Majumder, Samrat Dutta, Rekha Raja, Sharath Jotawar, Ashish Kumar, Manish Soni, Venkat Raju, Olyvia Kundu, Ehtesham Hassan, Laxmidhar Behera, K. S. Venkatesh, Rajesh Sinha

Abstract: In this paper, we provide details of a robotic system that can automate the task of picking and stowing objects from and to a rack in an e-commerce fulfillment warehouse. The system primarily comprises four main modules: (1) Perception module responsible for recognizing query objects and localizing them in the 3-dimensional robot workspace; (2) Planning module that generates necessary paths that the robot end-effector has to take for reaching the objects in the rack or in the tote; (3) Calibration module that defines the physical workspace for the robot visible through the on-board vision system; and (4) Gripping and suction system for picking and stowing different kinds of objects. The perception module uses a faster region-based Convolutional Neural Network (R-CNN) to recognize objects. We designed a novel two-finger gripper that incorporates pneumatic valve based suction effect to enhance its ability to pick different kinds of objects. The system was developed by the IITK-TCS team for participation in the Amazon Picking Challenge 2016 event. The team secured fifth place in the stowing task in the event. The purpose of this article is to share our experiences with students and practicing engineers and enable them to build similar systems. The overall efficacy of the system is demonstrated through several simulations as well as real-world experiments with actual robots.

Submitted 7 March, 2017; originally announced March 2017.

Comments: 15 Pages, 25 Figures, 4 Tables, Journal Paper

arXiv:1507.08445 [pdf, other]

People Counting in High Density Crowds from Still Images

Authors: Ankan Bansal, K. S. Venkatesh

Abstract: We present a method of estimating the number of people in high density crowds from still images. The method estimates counts by fusing information from multiple sources. Most of the existing work on crowd counting deals with very small crowds (tens of individuals) and uses temporal information from videos. Our method uses only still images to estimate the counts in high density images (hundreds to thousands of individuals). At this scale, we cannot rely on only one set of features for count estimation. We, therefore, use multiple sources, viz. interest points (SIFT), Fourier analysis, wavelet decomposition, GLCM features and low confidence head detections, to estimate the counts. Each of these sources gives a separate estimate of the count along with confidences and other statistical measures which are then combined to obtain the final estimate. We test our method on an existing dataset of fifty images containing over 64000 individuals. Further, we added another fifty annotated images of crowds and tested on the complete dataset of hundred images containing over 87000 individuals. The counts per image range from 81 to 4633. We report the performance in terms of mean absolute error, which is a measure of accuracy of the method, and mean normalised absolute error, which is a measure of the robustness.

Submitted 30 July, 2015; originally announced July 2015.

Showing 1–30 of 30 results for author: Venkatesh, K