-
UAVs Meet Agentic AI: A Multidomain Survey of Autonomous Aerial Intelligence and Agentic UAVs
Authors:
Ranjan Sapkota,
Konstantinos I. Roumeliotis,
Manoj Karkee
Abstract:
Agentic UAVs represent a new frontier in autonomous aerial intelligence, integrating perception, decision-making, memory, and collaborative planning to operate adaptively in complex, real-world environments. Driven by recent advances in Agentic AI, these systems surpass traditional UAVs by exhibiting goal-driven behavior, contextual reasoning, and interactive autonomy. We provide a comprehensive foundation for understanding the architectural components and enabling technologies that distinguish Agentic UAVs from traditional autonomous UAVs. Furthermore, a detailed comparative analysis highlights advancements in autonomy with AI agents, learning, and mission flexibility. This study explores seven high-impact application domains: precision agriculture, construction & mining, disaster response, environmental monitoring, infrastructure inspection, logistics, security, and wildlife conservation, illustrating the broad societal value of agentic aerial intelligence. We also identify key challenges in technical constraints, regulatory limitations, and data-model reliability, and we present emerging solutions across hardware innovation, learning architectures, and human-AI interaction. Finally, a future roadmap is proposed, outlining pathways toward self-evolving aerial ecosystems, system-level collaboration, and sustainable, equitable deployments. This survey establishes a foundational framework for the future development, deployment, and governance of agentic aerial systems (Agentic UAVs) across diverse societal and industrial domains.
Submitted 7 June, 2025;
originally announced June 2025.
-
Lithography defined semiconductor moires with anomalous in-gap quantum Hall states
Authors:
Wei Pan,
D. Bruce Burckel,
Catalin D. Spataru,
Keshab R. Sapkota,
Aaron J. Muhowski,
Samuel D. Hawkins,
John F. Klem,
Layla S. Smith,
Doyle A. Temple,
Zachery A. Enderson,
Zhigang Jiang,
Komalavalli Thirunavukkuarasu,
Li Xiang,
Mykhaylo Ozerov,
Dmitry Smirnov,
Chang Niu,
Peide D. Ye,
Praveen Pai,
Fan Zhang
Abstract:
Quantum materials and phenomena have attracted great interest for their potential applications in next-generation microelectronics and quantum-information technologies. In one especially interesting class of quantum materials, moire superlattices (MSLs) formed by twisted bilayers of 2D materials, a wide range of novel phenomena are observed. However, using 2D MSLs for microelectronics and quantum technologies faces daunting challenges, such as reproducibility and scalability, because of their exfoliate-tear-stack fabrication method. Here, we propose lithography-defined semiconductor moire superlattices, in which three fundamental parameters, electron-electron interaction, spin-orbit coupling, and band topology, are designable. We experimentally investigate quantum transport properties in a moire specimen made in an InAs quantum well. Strong anomalous in-gap states are observed within the same integer quantum Hall state. Our work opens up new horizons for studying 2D quantum-materials phenomena in semiconductors featuring superior industry-level quality and state-of-the-art technologies, and it may potentially enable new quantum information and microelectronics technologies.
Submitted 6 June, 2025;
originally announced June 2025.
-
TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems
Authors:
Shaina Raza,
Ranjan Sapkota,
Manoj Karkee,
Christos Emmanouilidis
Abstract:
Agentic AI systems, built on large language models (LLMs) and deployed in multi-agent configurations, are redefining intelligent autonomy, collaboration and decision-making across enterprise and societal domains. This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM) in the context of LLM-based agentic multi-agent systems (AMAS). We begin by examining the conceptual foundations of agentic AI, its architectural differences from traditional AI agents, and the emerging system designs that enable scalable, tool-using autonomy. The TRiSM framework for agentic AI is then detailed through four pillars (governance, explainability, ModelOps, and privacy/security), each contextualized for agentic LLMs. We identify unique threat vectors and introduce a comprehensive risk taxonomy for agentic AI applications, supported by case studies illustrating real-world vulnerabilities. The paper also surveys trust-building mechanisms, transparency and oversight techniques, and state-of-the-art explainability strategies in distributed LLM agent systems. Additionally, metrics for evaluating trust, interpretability, and human-centered performance are reviewed alongside open benchmarking challenges. Security and privacy are addressed through encryption, adversarial defense, and compliance with evolving AI regulations. The paper concludes with a roadmap for responsible agentic AI, proposing research directions to align emerging multi-agent systems with robust TRiSM principles for safe, accountable, and transparent deployment.
Submitted 4 June, 2025;
originally announced June 2025.
-
Robust and Symmetric Magnetic Field Dependency of Superconducting Diode Effect in Asymmetric Dirac Semimetal SQUIDs
Authors:
H. C. Travaglini,
J. J. Cuozzo,
K. R. Sapkota,
I. A. Leahy,
A. D. Rice,
K. Alberi,
W. Pan
Abstract:
The recent demonstration of the superconducting diode effect (SDE) has generated renewed interest in superconducting electronics, in which devices such as compact superconducting diodes can perform signal rectification where low-energy operation is needed. In this article, we present our results on a robust and symmetric-in-magnetic-field SDE in asymmetric superconducting quantum interference devices (SQUIDs) realized in high-quality Dirac semimetal Cd3As2 thin film grown by the molecular beam epitaxy (MBE) technique. Consistent with previous work, a zero-magnetic-field SDE is observed. Furthermore, the difference in switching current is independent of the strength and polarity of an out-of-plane magnetic field in the range of -10 mT to 10 mT. We speculate that this robust symmetric-in-field SDE in our Dirac semimetal SQUIDs is due to the formation of a helical spin texture, theoretically predicted in Dirac semimetals.
Submitted 27 May, 2025;
originally announced May 2025.
-
Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
Authors:
Ranjan Sapkota,
Konstantinos I. Roumeliotis,
Manoj Karkee
Abstract:
This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.
Submitted 25 May, 2025;
originally announced May 2025.
-
AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
Authors:
Ranjan Sapkota,
Konstantinos I. Roumeliotis,
Manoj Karkee
Abstract:
This study critically distinguishes between AI Agents and Agentic AI, offering a structured conceptual taxonomy, application mapping, and challenge analysis to clarify their divergent design philosophies and capabilities. We begin by outlining the search strategy and foundational definitions, characterizing AI Agents as modular systems driven by Large Language Models (LLMs) and Large Image Models (LIMs) for narrow, task-specific automation. Generative AI is positioned as a precursor, with AI Agents advancing through tool integration, prompt engineering, and reasoning enhancements. In contrast, Agentic AI systems represent a paradigmatic shift marked by multi-agent collaboration, dynamic task decomposition, persistent memory, and orchestrated autonomy. Through a sequential evaluation of architectural evolution, operational mechanisms, interaction styles, and autonomy levels, we present a comparative analysis across both paradigms. AI Agent application domains such as customer support, scheduling, and data summarization are contrasted with Agentic AI deployments in research automation, robotic coordination, and medical decision support. We further examine unique challenges in each paradigm, including hallucination, brittleness, emergent behavior, and coordination failure, and propose targeted solutions such as ReAct loops, RAG, orchestration layers, and causal modeling. This work aims to provide a definitive roadmap for developing robust, scalable, and explainable AI agent and Agentic AI-driven systems. Keywords: AI Agents, Agent-driven, Vision-Language-Models, Agentic AI Decision Support System, Agentic-AI Applications
Submitted 27 May, 2025; v1 submitted 15 May, 2025;
originally announced May 2025.
-
Vision-Language-Action Models: Concepts, Progress, Applications and Challenges
Authors:
Ranjan Sapkota,
Yang Cao,
Konstantinos I. Roumeliotis,
Manoj Karkee
Abstract:
Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as humanoid robotics, autonomous vehicles, medical and industrial robotics, precision agriculture, and augmented reality navigation. The review further addresses major challenges across real-time control, multimodal action representation, system scalability, generalization to unseen tasks, and ethical deployment risks. Drawing from the state-of-the-art, we propose targeted solutions including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning. In our forward-looking discussion, we outline a future roadmap where VLA models, VLMs, and agentic AI converge to power socially aligned, adaptive, and general-purpose embodied agents. This work serves as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence. Keywords: Vision-language-action, Agentic AI, AI Agents, Vision-language Models
Submitted 7 May, 2025;
originally announced May 2025.
-
Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks
Authors:
Konstantinos I. Roumeliotis,
Ranjan Sapkota,
Manoj Karkee,
Nikolaos D. Tselikas,
Dimitrios K. Nasiopoulos
Abstract:
Automation in agriculture plays a vital role in addressing challenges related to crop monitoring and disease management, particularly through early detection systems. This study investigates the effectiveness of combining multimodal Large Language Models (LLMs), specifically GPT-4o, with Convolutional Neural Networks (CNNs) for automated plant disease classification using leaf imagery. Leveraging the PlantVillage dataset, we systematically evaluate model performance across zero-shot, few-shot, and progressive fine-tuning scenarios. A comparative analysis between GPT-4o and the widely used ResNet-50 model was conducted across three resolutions (100, 150, and 256 pixels) and two plant species (apple and corn). Results indicate that fine-tuned GPT-4o models performed slightly better than ResNet-50, achieving up to 98.12% classification accuracy on apple leaf images, compared to 96.88% achieved by ResNet-50, with improved generalization and near-zero training loss. However, zero-shot performance of GPT-4o was significantly lower, underscoring the need for minimal training. Additional evaluations on cross-resolution and cross-plant generalization revealed the models' adaptability and limitations when applied to new domains. The findings highlight the promise of integrating multimodal LLMs into automated disease detection pipelines, enhancing the scalability and intelligence of precision agriculture systems while reducing the dependence on large, labeled datasets and high-resolution sensor infrastructure. Keywords: Large Language Models, Vision Language Models, LLMs and CNNs, Disease Detection with Vision Language Models, VLMs
Submitted 29 April, 2025;
originally announced April 2025.
-
A Review of 3D Object Detection with Vision-Language Models
Authors:
Ranjan Sapkota,
Konstantinos I Roumeliotis,
Rahul Harsha Cheppally,
Marco Flores Calero,
Manoj Karkee
Abstract:
This review provides a comprehensive survey of 3D object detection with vision-language models (VLMs), a rapidly advancing area at the intersection of 3D vision and multimodal AI. By examining over 100 research papers, we provide the first systematic analysis dedicated to 3D object detection with vision-language models. We begin by outlining the unique challenges of 3D object detection with vision-language models, emphasizing differences from 2D detection in spatial reasoning and data complexity. Traditional approaches using point clouds and voxel grids are compared to modern vision-language frameworks like CLIP and 3D LLMs, which enable open-vocabulary detection and zero-shot generalization. We review key architectures, pretraining strategies, and prompt engineering methods that align textual and 3D features for effective 3D object detection with vision-language models. Visualization examples and evaluation benchmarks are discussed to illustrate performance and behavior. Finally, we highlight current challenges, such as limited 3D-language datasets and computational demands, and propose future research directions to advance 3D object detection with vision-language models. Keywords: Object Detection, Vision-Language Models, Agents, VLMs, LLMs, AI
Submitted 25 April, 2025;
originally announced April 2025.
-
RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity
Authors:
Ranjan Sapkota,
Rahul Harsha Cheppally,
Ajay Sharda,
Manoj Karkee
Abstract:
This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed it in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
Submitted 17 April, 2025;
originally announced April 2025.
-
Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10
Authors:
Ranjan Sapkota,
Manoj Karkee
Abstract:
This study evaluated the performance of the YOLOv12 object detection model and compared it against the performance of YOLOv11 and YOLOv10 for apple detection in commercial orchards, with model training completed entirely on synthetic images generated by Large Language Models (LLMs). The YOLOv12n configuration achieved the highest precision at 0.916, the highest recall at 0.969, and the highest mean Average Precision (mAP@50) at 0.978. In comparison, the YOLOv11 series was led by YOLO11x, which achieved the highest precision at 0.857, recall at 0.85, and mAP@50 at 0.91. For the YOLOv10 series, YOLOv10b and YOLOv10l both achieved the highest precision at 0.85, with YOLOv10n achieving the highest recall at 0.8 and mAP@50 at 0.89. These findings demonstrated that YOLOv12, when trained on realistic LLM-generated datasets, surpassed its predecessors in key performance metrics. The technique also offered a cost-effective solution by reducing the need for extensive manual data collection in the agricultural field. In addition, this study compared the computational efficiency of all versions of YOLOv12, v11 and v10, where YOLOv11n reported the lowest inference time at 4.7 ms, compared to YOLOv12n's 5.6 ms and YOLOv10n's 5.9 ms. Although YOLOv12 is newer and more accurate than YOLOv11 and YOLOv10, YOLO11n remains the fastest model among the YOLOv10, YOLOv11 and YOLOv12 series. (Index: YOLOv12, YOLOv11, YOLOv10, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO Object detection)
Submitted 19 March, 2025; v1 submitted 26 February, 2025;
originally announced March 2025.
-
Comprehensive Analysis of Transparency and Accessibility of ChatGPT, DeepSeek, And other SoTA Large Language Models
Authors:
Ranjan Sapkota,
Shaina Raza,
Manoj Karkee
Abstract:
Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs). The Open Source Initiative (OSI) has recently released its first formal definition of open-source software. This definition, when combined with standard dictionary definitions and the sparse published literature, provides an initial framework to support broader accessibility to AI models such as LLMs, but more work is essential to capture the unique dynamics of openness in AI. In addition, concerns have been raised about open-washing, where models claim openness but lack full transparency, which limits the reproducibility, bias mitigation, and domain adaptation of these models. In this context, our study critically analyzes SoTA LLMs from the last five years, including ChatGPT, DeepSeek, LLaMA, and others, to assess their adherence to transparency standards and the implications of partial openness. Specifically, we examine transparency and accessibility from two perspectives: open-source vs. open-weight models. Our findings reveal that while some models are labeled as open-source, this does not necessarily mean they are fully open-sourced. Even in the best cases, open-source models often do not report model training data and code, or key metrics such as weight accessibility and carbon emissions. To the best of our knowledge, this is the first study that systematically examines the transparency and accessibility of over 100 different SoTA LLMs through the dual lens of open-source and open-weight models. The findings open avenues for further research and call for responsible and sustainable AI practices to ensure greater transparency, accountability, and ethical deployment of these models. (DeepSeek transparency, ChatGPT accessibility, open source, DeepSeek open source)
Submitted 21 February, 2025;
originally announced February 2025.
-
Who is Responsible? The Data, Models, Users or Regulations? A Comprehensive Survey on Responsible Generative AI for a Sustainable Future
Authors:
Shaina Raza,
Rizwan Qureshi,
Anam Zahid,
Joseph Fioresi,
Ferhat Sadak,
Muhammad Saeed,
Ranjan Sapkota,
Aditya Jain,
Anas Zafar,
Muneeb Ul Hassan,
Aizan Zafar,
Hasan Maqbool,
Ashmal Vayani,
Jia Wu,
Maged Shoman
Abstract:
Responsible Artificial Intelligence (RAI) has emerged as a crucial framework for addressing ethical concerns in the development and deployment of Artificial Intelligence (AI) systems. A significant body of literature exists, primarily focusing on either RAI guidelines and principles or the technical aspects of RAI, largely within the realm of traditional AI. However, a notable gap persists in bridging theoretical frameworks with practical implementations in real-world settings, as well as transitioning from RAI to Responsible Generative AI (Gen AI). To bridge this gap, we present this article, which examines the challenges and opportunities in implementing ethical, transparent, and accountable AI systems in the post-ChatGPT era, an era significantly shaped by Gen AI. Our analysis includes governance and technical frameworks, the exploration of explainable AI as the backbone to achieve RAI, key performance indicators in RAI, alignment of Gen AI benchmarks with governance frameworks, reviews of AI-ready test beds, and RAI applications across multiple sectors. Additionally, we discuss challenges in RAI implementation and provide a philosophical perspective on the future of RAI. This comprehensive article aims to offer an overview of RAI, providing valuable insights for researchers, policymakers, users, and industry practitioners to develop and deploy AI systems that benefit individuals and society while minimizing potential risks and societal impacts. A curated list of resources and datasets covered in this survey is available on GitHub: https://github.com/anas-zafar/Responsible-AI.
Submitted 28 April, 2025; v1 submitted 15 January, 2025;
originally announced February 2025.
-
Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey
Authors:
Ranjan Sapkota,
Shaina Raza,
Maged Shoman,
Achyut Paudel,
Manoj Karkee
Abstract:
In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodal ones, for data augmentation to enhance generalization and combat overfitting in training deep convolutional neural networks. However, while existing surveys predominantly focus on ML and DL techniques or limited modalities (text or images), a gap remains in addressing the latest advancements and multi-modal applications of LLM-based methods. This survey fills that gap by exploring recent literature utilizing multimodal LLMs to augment image, text, and audio data, offering a comprehensive understanding of these processes. We outline various methods employed in LLM-based image, text and speech augmentation, and discuss the limitations identified in current approaches. Additionally, we identify potential solutions to these limitations from the literature to enhance the efficacy of data augmentation practices using multimodal LLMs. This survey serves as a foundation for future research, aiming to refine and expand the use of multimodal LLMs in enhancing dataset quality and diversity for deep learning applications. (Surveyed Paper GitHub Repo: https://github.com/WSUAgRobotics/data-aug-multi-modal-llm. Keywords: LLM data augmentation, Grok text data augmentation, DeepSeek image data augmentation, Grok speech data augmentation, GPT audio augmentation, voice augmentation, DeepSeek for data augmentation, DeepSeek R1 text data augmentation, DeepSeek R1 image augmentation, Image Augmentation using LLM, Text Augmentation using LLM, LLM data augmentation for deep learning applications)
Submitted 21 March, 2025; v1 submitted 29 January, 2025;
originally announced January 2025.
-
Self-Clustering Graph Transformer Approach to Model Resting-State Functional Brain Activity
Authors:
Bishal Thapaliya,
Esra Akbas,
Ram Sapkota,
Bhaskar Ray,
Vince Calhoun,
Jingyu Liu
Abstract:
Resting-state functional magnetic resonance imaging (rs-fMRI) offers valuable insights into the human brain's functional organization and is a powerful tool for investigating the relationship between brain function and cognitive processes, as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this study, we introduce a novel attention mechanism for graphs with subnetworks, named Self-Clustering Graph Transformer (SCGT), designed to handle the issue of uniform node updates in graph transformers. By using static functional connectivity (FC) correlation features as input to the transformer model, SCGT effectively captures the sub-network structure of the brain by performing cluster-specific updates to the nodes, unlike uniform node updates in vanilla graph transformers, further allowing us to learn and interpret the subclusters. We validate our approach on the Adolescent Brain Cognitive Development (ABCD) dataset, comprising 7,957 participants, for the prediction of total cognitive score and gender classification. Our results demonstrate that SCGT outperforms the vanilla graph transformer method and other recent models, offering a promising tool for modeling brain functional connectivity and interpreting the underlying subnetwork structures.
Submitted 7 February, 2025; v1 submitted 17 January, 2025;
originally announced January 2025.
-
Integrating YOLO11 and Convolution Block Attention Module for Multi-Season Segmentation of Tree Trunks and Branches in Commercial Apple Orchards
Authors:
Ranjan Sapkota,
Manoj Karkee
Abstract:
In this study, we developed a customized instance segmentation model by integrating the Convolutional Block Attention Module (CBAM) with the YOLO11 architecture. This model, trained on a mixed dataset of dormant and canopy season apple orchard images, aimed to enhance the segmentation of tree trunks and branches under varying seasonal conditions throughout the year. The model was individually validated across dormant and canopy season images after training the YOLO11-CBAM on the mixed dataset collected over the two seasons. Additional testing of the model during pre-bloom, flower bloom, fruit thinning, and harvest season was performed. The highest recall and precision metrics were observed in the YOLO11x-seg-CBAM and YOLO11m-seg-CBAM respectively. Particularly, YOLO11m-seg with CBAM showed the highest precision of 0.83 as performed for the Trunk class in training, while without the CBAM, YOLO11m-seg achieved 0.80 precision score for the Trunk class. Likewise, for branch class, YOLO11m-seg with CBAM achieved the highest precision score value of 0.75 while without the CBAM, the YOLO11m-seg achieved a precision of 0.73. For dormant season validation, YOLO11x-seg exhibited the highest precision at 0.91. Canopy season validation highlighted YOLO11s-seg with superior precision across all classes, achieving 0.516 for Branch, and 0.64 for Trunk. The modeling approach, trained on two season datasets as dormant and canopy season images, demonstrated the potential of the YOLO11-CBAM integration to effectively detect and segment tree trunks and branches year-round across all seasonal variations. Keywords: YOLOv11, YOLOv11 Tree Detection, YOLOv11 Branch Detection and Segmentation, Machine Vision, Deep Learning, Machine Learning
Submitted 7 December, 2024;
originally announced December 2024.
-
Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development
Authors:
Ranjan Sapkota,
Achyut Paudel,
Manoj Karkee
Abstract:
Currently, deep learning-based instance segmentation for various applications (e.g., Agriculture) is predominantly performed using a labor-intensive process involving extensive field data collection using sophisticated sensors, followed by careful manual annotation of images, presenting significant logistical and financial challenges to researchers and organizations. The process also slows down the model development and training process. In this study, we presented a novel method for deep learning-based instance segmentation of apples in commercial orchards that eliminates the need for labor-intensive field data collection and manual annotation. Utilizing a Large Language Model (LLM), we synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model. This method significantly reduces reliance on physical sensors and manual data processing, presenting a major advancement in "Agricultural AI". The synthetic, auto-annotated dataset was used to train the YOLO11 model for Apple instance segmentation, which was then validated on real orchard images. The results showed that the automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations. All YOLO11 configurations, trained solely on these synthetic datasets with automated annotations, accurately recognized and delineated apples, highlighting the method's efficacy. Specifically, the YOLO11m-seg configuration achieved a mask precision of 0.902 and a mask mAP@50 of 0.833 on test images collected from a commercial orchard. Additionally, the YOLO11l-seg configuration outperformed other models in validation on 40 LLM-generated images, achieving the highest mask precision and mAP@50 metrics.
Keywords: YOLO, SAM, SAMv2, YOLO11, YOLOv11, Segment Anything, YOLO-SAM
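For readers unfamiliar with the two mask-overlap scores quoted in this abstract, the following is a minimal sketch of how a Dice coefficient and an IoU are commonly computed for one pair of binary masks. The function name `mask_overlap` and the toy masks are hypothetical illustrations, not the authors' code or data, and the abstract does not state how the reported values were aggregated across images.

```python
def mask_overlap(pred, truth):
    """Return (dice, iou) for two equally sized binary masks (nested lists of 0/1)."""
    inter = pred_area = truth_area = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0      # pixel set in both masks
            pred_area += 1 if p else 0          # pixel set in the predicted mask
            truth_area += 1 if t else 0         # pixel set in the reference mask
    union = pred_area + truth_area - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0, 1.0
    dice = 2 * inter / (pred_area + truth_area)
    iou = inter / union
    return dice, iou


# Toy 3x3 masks (hypothetical): 4 pixels each, 3 of them shared.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 0],
         [0, 1, 0]]
dice, iou = mask_overlap(pred, truth)
print(f"Dice={dice:.3f} IoU={iou:.3f}")
```

For a single mask pair the two scores are tied by Dice = 2·IoU / (1 + IoU); values averaged over many images need not satisfy that identity exactly.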
Submitted 27 February, 2025; v1 submitted 18 November, 2024;
originally announced November 2024.
-
Comparing YOLOv11 and YOLOv8 for instance segmentation of occluded and non-occluded immature green fruits in complex orchard environment
Authors:
Ranjan Sapkota,
Manoj Karkee
Abstract:
This study conducted a comprehensive performance evaluation on YOLO11 (or YOLOv11) and YOLOv8, the latest in the "You Only Look Once" (YOLO) series, focusing on their instance segmentation capabilities for immature green apples in orchard environments. YOLO11n-seg achieved the highest mask precision across all categories with a notable score of 0.831, highlighting its effectiveness in fruit detection. YOLO11m-seg and YOLO11l-seg excelled in non-occluded and occluded fruitlet segmentation with scores of 0.851 and 0.829, respectively. Additionally, YOLOv11x-seg led in mask recall for all categories, achieving a score of 0.815, with YOLO11m-seg performing best for non-occluded immature green fruitlets at 0.858 and YOLOv8x-seg leading the occluded category with 0.800. In terms of mean average precision at a 50% intersection over union (mAP@50), YOLOv11m-seg consistently outperformed the other configurations, registering the highest scores for both box and mask segmentation, at 0.876 and 0.860 for the "All" class and 0.908 and 0.909 for non-occluded immature fruitlets, respectively. YOLO11l-seg and YOLOv8l-seg shared the top box mAP@50 for occluded immature fruitlets at 0.847, while YOLO11m-seg achieved the highest mask mAP@50 of 0.810. Despite the advancements in YOLO11, YOLOv8n surpassed its counterparts in image processing speed, with an impressive inference speed of 3.3 milliseconds, compared to the fastest YOLO11 series model at 4.8 milliseconds, underscoring its suitability for real-time agricultural applications related to complex green fruit environments. (YOLOv11 segmentation)
Submitted 26 January, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
YOLO11 and Vision Transformers based 3D Pose Estimation of Immature Green Fruits in Commercial Apple Orchards for Robotic Thinning
Authors:
Ranjan Sapkota,
Manoj Karkee
Abstract:
In this study, a robust method for 3D pose estimation of immature green apples (fruitlets) in commercial orchards was developed, utilizing the YOLO11 (or YOLOv11) object detection and pose estimation algorithm alongside Vision Transformers (ViT) for depth estimation (Dense Prediction Transformer (DPT) and Depth Anything V2). For object detection and pose estimation, performance comparisons of YOLO11 (YOLO11n, YOLO11s, YOLO11m, YOLO11l and YOLO11x) and YOLOv8 (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x) were made under identical hyperparameter settings across all configurations. It was observed that YOLO11n surpassed all configurations of YOLO11 and YOLOv8 in terms of box precision and pose precision, achieving scores of 0.91 and 0.915, respectively. Conversely, YOLOv8n exhibited the highest box and pose recall scores of 0.905 and 0.925, respectively. Regarding the mean average precision at 50% intersection over union (mAP@50), YOLO11s led all configurations with a box mAP@50 score of 0.94, while YOLOv8n achieved the highest pose mAP@50 score of 0.96. In terms of image processing speed, YOLO11n outperformed all configurations with an impressive inference speed of 2.7 ms, significantly faster than the quickest YOLOv8 configuration, YOLOv8n, which processed images in 7.8 ms. Subsequent integration of ViTs for the green fruit's pose depth estimation revealed that Depth Anything V2 outperformed Dense Prediction Transformer in 3D pose length validation, achieving the lowest Root Mean Square Error (RMSE) of 1.52 and Mean Absolute Error (MAE) of 1.28, demonstrating exceptional precision in estimating immature green fruit lengths. Integration of YOLO11 and the Depth Anything Model provides a promising solution to 3D pose estimation of immature green fruits for robotic thinning applications. (YOLOv11 pose detection, YOLOv11 Pose, YOLOv11 Keypoints detection, YOLOv11 pose estimation)
Submitted 30 March, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
Epitaxial aluminum layer on antimonide heterostructures for exploring Josephson junction effects
Authors:
W. Pan,
K. R. Sapkota,
P. Lu,
A. J. Muhowski,
W. M. Martinez,
C. L. H. Sovinec,
R. Reyna,
J. P. Mendez,
D. Mamaluy,
S. D. Hawkins,
J. F. Klem,
L. S. L. Smith,
D. A. Temple,
Z. Enderson,
Z. Jiang,
E. Rossi
Abstract:
In this article, we present results of our recent work on epitaxially grown aluminum (epi-Al) on antimonide heterostructures, where the epi-Al thin film is grown at either room temperature or below zero $^\circ$C. A sharp superconducting transition at $T \sim 1.3$ K is observed in these epi-Al films. We further show that supercurrent states are realized in Josephson junctions fabricated in the epi-Al/antimonide heterostructures with mobility $μ \sim 1.0 \times 10^6$ cm$^2$/Vs. These results clearly demonstrate that we have grown high-quality epi-Al/antimonide heterostructures, a promising platform for the exploration of Josephson junction effects for quantum information science and microelectronics applications.
Submitted 17 April, 2025; v1 submitted 8 October, 2024;
originally announced October 2024.
-
A vision-based robotic system for precision pollination of apples
Authors:
Uddhav Bhattarai,
Ranjan Sapkota,
Safal Kshetri,
Changki Mo,
Matthew D. Whiting,
Qin Zhang,
Manoj Karkee
Abstract:
Global food production depends upon successful pollination, a process that relies on natural and managed pollinators. However, natural pollinators are declining due to factors such as climate change, habitat loss, and pesticide use. This paper presents an integrated robotic system for precision pollination in apples. The system consisted of a machine vision system to identify target flower clusters and estimate their positions and orientations, and a manipulator motion planning and actuation system to guide the sprayer to apply charged pollen suspension to the target flower clusters. The system was tested in the lab, followed by field evaluation in Honeycrisp and Fuji orchards. In the Honeycrisp variety, the robotic pollination system achieved a fruit set of 34.8% of sprayed flowers with 87.5% of flower clusters having at least one fruit when a 2 g/L pollen suspension was used. In comparison, natural pollination achieved a fruit set of 43.1% with 94.9% of clusters having at least one fruit. In Fuji apples, the robotic system achieved lower pollination success, with 7.2% of sprayed flowers setting fruit and 20.6% of clusters having at least one fruit, compared to 33.1% and 80.6%, respectively, with natural pollination. Fruit quality analysis showed that robotically pollinated fruits were comparable to naturally pollinated fruits in terms of color, weight, diameter, firmness, soluble solids, and starch content. Additionally, the system cycle time was 6.5 seconds per cluster. The results showed promise for robotic pollination in apple orchards. However, further research and development are needed to improve the system and assess its suitability across diverse orchard environments and apple cultivars.
Submitted 9 March, 2025; v1 submitted 29 September, 2024;
originally announced September 2024.
-
Non-equilibrium States and Interactions in the Topological Insulator and Topological Crystalline Insulator Phases of NaCd4As3
Authors:
Tika R Kafle,
Yingchao Zhang,
Yi-yan Wang,
Xun Shi,
Na Li,
Richa Sapkota,
Jeremy Thurston,
Wenjing You,
Shunye Gao,
Qingxin Dong,
Kai Rossnagel,
Gen-Fu Chen,
James K Freericks,
Henry C Kapteyn,
Margaret M Murnane
Abstract:
Topological materials are of great interest because they can support metallic edge or surface states that are robust against perturbations, with the potential for technological applications. Here we experimentally explore the light-induced non-equilibrium properties of two distinct topological phases in NaCd4As3: a topological crystalline insulator (TCI) phase and a topological insulator (TI) phase. This material has surface states that are protected by mirror symmetry in the TCI phase at room temperature, while it undergoes a structural phase transition to a TI phase below 200 K. After exciting the TI phase by an ultrafast laser pulse, we observe a leading band edge shift of >150 meV, that slowly builds up and reaches a maximum after ~0.6 ps, and that persists for ~8 ps. The slow rise time of the excited electron population and electron temperature suggests that the electronic and structural orders are strongly coupled in this TI phase. It also suggests that the directly excited electronic states and the probed electronic states are weakly coupled. Both couplings are likely due to a partial relaxation of the lattice distortion, which is known to be associated with the TI phase. In contrast, no distinct excited state is observed in the TCI phase immediately or after photoexcitation, which we attribute to the low density of states and phase space available near the Fermi level. Our results show how ultrafast laser excitation can reveal the distinct excited states and interactions in phase-rich topological materials.
Submitted 20 August, 2024; v1 submitted 28 July, 2024;
originally announced July 2024.
-
Uncovering the Timescales of Spin Reorientation in $TbMn_{6}Sn_{6}$
Authors:
Sinéad A. Ryan,
Anya Grafov,
Na Li,
Hans T. Nembach,
Justin M. Shaw,
Hari Bhandari,
Tika Kafle,
Richa Sapkota,
Henry C. Kapteyn,
Nirmal J. Ghimire,
Margaret M. Murnane
Abstract:
$TbMn_{6}Sn_{6}$ is a ferrimagnetic material which exhibits a highly unusual phase transition near room temperature where spins remain collinear while the total magnetic moment rotates from out-of-plane to in-plane. The mechanisms underlying this phenomenon have been studied in the quasi-static limit and the reorientation has been attributed to the competing anisotropies of Tb and Mn, whose magnetic moments have very different temperature dependencies. In this work, we present the first measurement of the spin-reorientation transition in $TbMn_{6}Sn_{6}$. By probing very small signals with the transverse magneto-optical Kerr effect (TMOKE) at the Mn M-edge, we show that the reorientation timescale spans from 12 ps to 24 ps, depending on the laser excitation fluence. We then verify these data with a simple model of spin precession with a temperature-dependent magnetocrystalline anisotropy field to show that the spin reorientation timescale is consistent with the reorientation being driven by very large anisotropy energies on $\approx$ meV scales. Promisingly, the model predicts a possibility of 180$^\circ$ reorientation of the out-of-plane moment over a range of excitation fluences. This could facilitate optically controlled magnetization switching between very stable ground states, which could have useful applications in spintronics or data storage.
Submitted 26 July, 2024;
originally announced July 2024.
-
Comprehensive Performance Evaluation of YOLOv12, YOLO11, YOLOv10, YOLOv9 and YOLOv8 on Detecting and Counting Fruitlet in Complex Orchard Environments
Authors:
Ranjan Sapkota,
Zhichao Meng,
Martin Churuvija,
Xiaoqiang Du,
Zenghong Ma,
Manoj Karkee
Abstract:
This study systematically performed an extensive real-world evaluation of the performances of all configurations of YOLOv8, YOLOv9, YOLOv10, YOLO11 (or YOLOv11), and YOLOv12 object detection algorithms in terms of precision, recall, mean Average Precision at 50% Intersection over Union (mAP@50), and computational speeds including pre-processing, inference, and post-processing times for immature green apple (or fruitlet) detection in commercial orchards. Additionally, this research performed and validated in-field counting of the fruitlets using an iPhone and machine vision sensors. Among the configurations, YOLOv12l recorded the highest recall rate at 0.90, compared to all other configurations of YOLO models. Likewise, YOLOv10x achieved the highest precision score of 0.908, while YOLOv9 Gelan-c attained a precision of 0.903. Analysis of mAP@50 revealed that YOLOv9 Gelan-base and YOLOv9 Gelan-e reached peak scores of 0.935, with YOLO11s and YOLOv12l following closely at 0.933 and 0.931, respectively. For counting validation using images captured with an iPhone 14 Pro, the YOLO11n configuration demonstrated outstanding accuracy, recording RMSE values of 4.51 for Honeycrisp, 4.59 for Cosmic Crisp, 4.83 for Scilate, and 4.96 for Scifresh; corresponding MAE values were 4.07, 3.98, 7.73, and 3.85. Similar performance trends were observed with RGB-D sensor data. Moreover, sensor-specific training on Intel RealSense data significantly enhanced model performance. YOLOv11n achieved the fastest inference time of 2.4 ms, outperforming YOLOv8n (4.1 ms), YOLOv9 Gelan-s (11.5 ms), YOLOv10n (5.5 ms), and YOLOv12n (4.6 ms), underscoring its suitability for real-time object detection applications. (YOLOv12 architecture, YOLOv11 Architecture, YOLOv12 object detection, YOLOv11 object detection, YOLOv12 segmentation)
Submitted 25 February, 2025; v1 submitted 1 July, 2024;
originally announced July 2024.
-
YOLO advances to its genesis: a decadal and comprehensive review of the You Only Look Once (YOLO) series
Authors:
Ranjan Sapkota,
Marco Flores Calero,
Rizwan Qureshi,
Chetan Badgujar,
Upesh Nepal,
Alwin Poulose,
Peter Zeno,
Uday Bhanu Prakash Vaddevolu,
Sheheryar Khan,
Maged Shoman,
Hong Yan,
Manoj Karkee
Abstract:
This review systematically examines the progression of the You Only Look Once (YOLO) object detection algorithms from YOLOv1 to the recently unveiled YOLOv12. Employing a reverse chronological analysis, this study examines the advancements introduced by YOLO algorithms, beginning with YOLOv12 and progressing through YOLO11 (or YOLOv11), YOLOv10, YOLOv9, YOLOv8, and subsequent versions to explore each version's contributions to enhancing speed, detection accuracy, and computational efficiency in real-time object detection. Additionally, this study reviews the alternative versions derived from YOLO architectural advancements of YOLO-NAS, YOLO-X, YOLO-R, DAMO-YOLO, and Gold-YOLO. Moreover, the study highlights the transformative impact of YOLO models across five critical application areas: autonomous vehicles and traffic safety, healthcare and medical imaging, industrial manufacturing, surveillance and security, and agriculture. By detailing the incremental technological advancements in subsequent YOLO versions, this review chronicles the evolution of YOLO, and discusses the challenges and limitations in each of the earlier versions. The evolution signifies a path towards integrating YOLO with multimodal, context-aware, and Artificial General Intelligence (AGI) systems for the next YOLO decade, promising significant implications for future developments in AI-driven applications. YOLO Review, YOLO Advances, YOLOv13, YOLOv14, YOLOv15, YOLOv16, YOLOv17, YOLOv18, YOLOv19, YOLOv20, YOLO review, YOLO Object Detection
Submitted 30 May, 2025; v1 submitted 12 June, 2024;
originally announced June 2024.
-
DSAM: A Deep Learning Framework for Analyzing Temporal and Spatial Dynamics in Brain Networks
Authors:
Bishal Thapaliya,
Robyn Miller,
Jiayu Chen,
Yu-Ping Wang,
Esra Akbas,
Ram Sapkota,
Bhaskar Ray,
Pranav Suresh,
Santosh Ghimire,
Vince Calhoun,
Jingyu Liu
Abstract:
Resting-state functional magnetic resonance imaging (rs-fMRI) is a noninvasive technique pivotal for understanding human neural mechanisms of intricate cognitive processes. Most rs-fMRI studies compute a single static functional connectivity matrix across brain regions of interest, or dynamic functional connectivity matrices with a sliding window approach. These approaches are at risk of oversimplifying brain dynamics and lack proper consideration of the goal at hand. While deep learning has gained substantial popularity for modeling complex relational data, its application to uncovering the spatiotemporal dynamics of the brain is still limited. We propose a novel interpretable deep learning framework that learns a goal-specific functional connectivity matrix directly from time series and employs a specialized graph neural network for the final classification. Our model, DSAM, leverages temporal causal convolutional networks to capture the temporal dynamics in both low- and high-level feature representations, a temporal attention unit to identify important time points, a self-attention unit to construct the goal-specific connectivity matrix, and a novel variant of graph neural network to capture the spatial dynamics for downstream classification. To validate our approach, we conducted experiments on the Human Connectome Project dataset with 1075 samples to build and interpret the model for the classification of sex group, and the Adolescent Brain Cognitive Development Dataset with 8520 samples for independent testing. In comparisons with other state-of-the-art models, the results suggested that this novel approach goes beyond the assumption of a fixed connectivity matrix and provides evidence of goal-specific brain connectivity patterns, which opens up the potential to gain deeper insights into how the human brain adapts its functional connectivity specific to the task at hand.
Submitted 19 May, 2024;
originally announced May 2024.
-
Immature Green Apple Detection and Sizing in Commercial Orchards using YOLOv8 and Shape Fitting Techniques
Authors:
Ranjan Sapkota,
Dawood Ahmed,
Martin Churuvija,
Manoj Karkee
Abstract:
Detecting and estimating the size of apples during the early stages of growth is crucial for predicting yield, pest management, and making informed decisions related to crop-load management, harvest and post-harvest logistics, and marketing. Traditional fruit size measurement methods are laborious and time-consuming. This study employs the state-of-the-art YOLOv8 object detection and instance segmentation algorithm in conjunction with geometric shape fitting techniques on 3D point cloud data to accurately determine the size of immature green apples (or fruitlets) in a commercial orchard environment. The methodology utilized two RGB-D sensors: Intel RealSense D435i and Microsoft Azure Kinect DK. Notably, the YOLOv8 instance segmentation models exhibited proficiency in immature green apple detection, with the YOLOv8m-seg model achieving the highest AP@0.5 and AP@0.75 scores of 0.94 and 0.91, respectively. Using the ellipsoid fitting technique on images from the Azure Kinect, we achieved an RMSE of 2.35 mm, MAE of 1.66 mm, MAPE of 6.15%, and an R-squared value of 0.9 in estimating the size of apple fruitlets. Challenges such as partial occlusion caused some error in accurately delineating and sizing green apples using the YOLOv8-based segmentation technique, particularly in fruit clusters. In a comparison with 102 outdoor samples, the size estimation technique performed better on the images acquired with the Microsoft Azure Kinect than on those acquired with the Intel RealSense D435i. This superiority is evident from the metrics: the RMSE values (2.35 mm for Azure Kinect vs. 9.65 mm for RealSense D435i), MAE values (1.66 mm for Azure Kinect vs. 7.8 mm for RealSense D435i), and the R-squared values (0.9 for Azure Kinect vs. 0.77 for RealSense D435i).
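The sizing results above are reported as RMSE, MAE, MAPE, and R-squared. As a hedged illustration of how such figures are typically derived from paired measurements, the sketch below computes all four from hypothetical fruitlet diameters in millimetres. The function name `size_errors`, the sample values, and the choice of the coefficient-of-determination form of R-squared are assumptions made for illustration; the abstract does not specify the exact definitions used.

```python
import math


def size_errors(measured, estimated):
    """Return RMSE, MAE, MAPE (in percent) and R^2 for paired size values."""
    n = len(measured)
    residuals = [e - m for m, e in zip(measured, estimated)]
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    # MAPE is a relative error, so it is expressed in percent here.
    mape = 100 * sum(abs(r) / m for r, m in zip(residuals, measured)) / n
    mean_m = sum(measured) / n
    ss_tot = sum((m - mean_m) ** 2 for m in measured)   # total sum of squares
    r2 = 1 - ss_res / ss_tot                            # assumed definition of R^2
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}


# Hypothetical caliper measurements vs. estimated diameters (mm).
measured = [20.0, 25.0, 30.0, 35.0]
estimated = [21.0, 24.0, 32.0, 35.0]
errors = size_errors(measured, estimated)
print({name: round(value, 3) for name, value in errors.items()})
```

On any data set the MAE cannot exceed the RMSE, which gives a quick sanity check when reading reported error pairs.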
Submitted 2 April, 2024; v1 submitted 8 December, 2023;
originally announced January 2024.
-
Comparing YOLOv8 and Mask RCNN for object segmentation in complex orchard environments
Authors:
Ranjan Sapkota,
Dawood Ahmed,
Manoj Karkee
Abstract:
Instance segmentation, an important image processing operation for automation in agriculture, is used to precisely delineate individual objects of interest within images, which provides foundational information for various automated or robotic tasks such as selective harvesting and precision pruning. This study compares the one-stage YOLOv8 and the two-stage Mask R-CNN machine learning models for instance segmentation under varying orchard conditions across two datasets. Dataset 1, collected in the dormant season, includes images of dormant apple trees, which were used to train multi-object segmentation models delineating tree branches and trunks. Dataset 2, collected in the early growing season, includes images of apple tree canopies with green foliage and immature (green) apples (also called fruitlets), which were used to train single-object segmentation models delineating only immature green apples. The results showed that YOLOv8 performed better than Mask R-CNN, achieving good precision and near-perfect recall across both datasets at a confidence threshold of 0.5. Specifically, for Dataset 1, YOLOv8 achieved a precision of 0.90 and a recall of 0.95 for all classes. In comparison, Mask R-CNN demonstrated a precision of 0.81 and a recall of 0.81 for the same dataset. With Dataset 2, YOLOv8 achieved a precision of 0.93 and a recall of 0.97. Mask R-CNN, in this single-class scenario, achieved a precision of 0.85 and a recall of 0.88. Additionally, the inference times for YOLOv8 were 10.9 ms for multi-class segmentation (Dataset 1) and 7.8 ms for single-class segmentation (Dataset 2), compared to 15.6 ms and 12.8 ms, respectively, for Mask R-CNN.
Submitted 4 July, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Robotic Pollination of Apples in Commercial Orchards
Authors:
Ranjan Sapkota,
Dawood Ahmed,
Salik Ram Khanal,
Uddhav Bhattarai,
Changki Mo,
Matthew D. Whiting,
Manoj Karkee
Abstract:
This research presents a novel, robotic pollination system designed for targeted pollination of apple flowers in modern fruiting wall orchards. Developed in response to the challenges of global colony collapse disorder, climate change, and the need for sustainable alternatives to traditional pollinators, the system utilizes a commercial manipulator, a vision system, and a spray nozzle for pollen application. Initial tests in April 2022 pollinated 56% of the target flower clusters with at least one fruit with a cycle time of 6.5 s. Significant improvements were made in 2023, with the system accurately detecting 91% of available flowers and pollinating 84% of target flowers with a reduced cycle time of 4.8 s. This system showed potential for precision artificial pollination that can also minimize the need for labor-intensive field operations such as flower and fruitlet thinning.
Submitted 3 February, 2024; v1 submitted 10 November, 2023;
originally announced November 2023.
-
Brain Networks and Intelligence: A Graph Neural Network Based Approach to Resting State fMRI Data
Authors:
Bishal Thapaliya,
Esra Akbas,
Jiayu Chen,
Raam Sapkota,
Bhaskar Ray,
Pranav Suresh,
Vince Calhoun,
Jingyu Liu
Abstract:
Resting-state functional magnetic resonance imaging (rsfMRI) is a powerful tool for investigating the relationship between brain function and cognitive processes as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this paper, we present a novel modeling architecture called BrainRGIN for predicting intelligence (fluid, crystallized, and total intelligence) using graph neural networks on rsfMRI-derived static functional network connectivity matrices. Extending from the existing graph convolution networks, our approach incorporates a clustering-based embedding and graph isomorphism network in the graph convolutional layer to reflect the nature of the brain sub-network organization and efficient network expression, in combination with TopK pooling and attention-based readout functions. We evaluated our proposed architecture on a large dataset, specifically the Adolescent Brain Cognitive Development Dataset, and demonstrated its effectiveness in predicting individual differences in intelligence. Our model achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and other traditional machine learning models for all of the intelligence prediction tasks. The middle frontal gyrus exhibited a significant contribution to both fluid and crystallized intelligence, suggesting its pivotal role in these cognitive processes. Total composite scores identified a diverse set of brain regions as relevant, which underscores the complex nature of total intelligence.
Submitted 27 October, 2024; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Generative AI in Agriculture: Creating Image Datasets Using DALL.E's Advanced Large Language Model Capabilities
Authors:
Ranjan Sapkota,
Manoj Karkee
Abstract:
This research investigated the role of artificial intelligence (AI), specifically the DALL.E model by OpenAI, in advancing data generation and visualization techniques in agriculture. DALL.E, an advanced AI image generator, works alongside ChatGPT's language processing to transform text descriptions and image clues into realistic visual representations of the content. The study used both approaches of image generation: text-to-image and image-to-image (variation). Six types of datasets depicting fruit crop environments were generated. These AI-generated images were then compared against ground truth images captured by sensors in real agricultural fields. The comparison was based on Peak Signal-to-Noise Ratio (PSNR) and Feature Similarity Index (FSIM) metrics. The image-to-image generation exhibited a 5.78% increase in average PSNR over text-to-image methods, signifying superior image clarity and quality. However, this method also resulted in a 10.23% decrease in average FSIM, indicating a diminished structural and textural similarity to the original images. Similar to these measures, human evaluation also showed that images generated using the image-to-image method were more realistic compared to those generated with the text-to-image approach. The results highlighted DALL.E's potential in generating realistic agricultural image datasets and thus accelerating the development and adoption of imaging-based precision agricultural solutions. In the future, DALL.E along with other alternative LLM-based image generation models such as MidJourney, Stable Diffusion, Craiyon, Imagen, Parti, DreamStudio, Make-A-Scene, DeepDream, and VQ-GAN + CLIP could demonstrate further significant potential for enhancing image clarity, quality, and realism in depicting agricultural environments, which could revolutionize precision farming practices.
Submitted 15 March, 2025; v1 submitted 17 July, 2023;
originally announced July 2023.
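For readers unfamiliar with the PSNR metric named in the abstract above, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). The abstract does not say how the authors computed it, so the function name `psnr`, the flat-list image representation, and the 8-bit default `max_value` are illustrative assumptions, not the study's implementation.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities (an assumption
    made here for simplicity). max_value is the largest possible intensity;
    255 assumes 8-bit images.
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the generated image deviates less, pixel by pixel, from the reference. It says nothing about structural similarity, which is why the abstract reports FSIM separately.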
-
Machine Vision-Based Crop-Load Estimation Using YOLOv8
Authors:
Dawood Ahmed,
Ranjan Sapkota,
Martin Churuvija,
Manoj Karkee
Abstract:
Labor shortages in fruit crop production have prompted the development of mechanized and automated machines as alternatives to labor-intensive orchard operations such as harvesting, pruning, and thinning. Agricultural robots capable of identifying tree canopy parts and estimating geometric and topological parameters, such as branch diameter, length, and angles, can optimize crop yields through automated pruning and thinning platforms. In this study, we proposed a machine vision system to estimate canopy parameters in apple orchards and determine an optimal number of fruit for individual branches, providing a foundation for robotic pruning, flower thinning, and fruitlet thinning to achieve desired yield and quality. Using color and depth information from an RGB-D sensor (Microsoft Azure Kinect DK), a YOLOv8-based instance segmentation technique was developed to identify trunks and branches of apple trees during the dormant season. Principal Component Analysis was applied to estimate branch diameter (used to calculate limb cross-sectional area, or LCSA) and orientation. The estimated branch diameter was utilized to calculate LCSA, which served as an input for crop-load estimation, with larger LCSA values indicating a higher potential fruit-bearing capacity. RMSE for branch diameter estimation was 2.08 mm, and for crop-load estimation, 3.95. Based on commercial apple orchard management practices, the target crop-load (number of fruit) for each segmented branch was estimated with a mean absolute error (MAE) of 2.99 (ground truth crop-load was 6 apples per LCSA). This study demonstrated a promising workflow with high performance in identifying trunks and branches of apple trees in dynamic commercial orchard environments and integrating farm management practices into automated decision-making.
Submitted 26 April, 2023;
originally announced April 2023.
-
Machine Vision System for Early-stage Apple Flowers and Flower Clusters Detection for Precision Thinning and Pollination
Authors:
Salik Ram Khanal,
Ranjan Sapkota,
Dawood Ahmed,
Uddhav Bhattarai,
Manoj Karkee
Abstract:
Early-stage identification of fruit flowers, both opened and unopened, in an orchard environment provides important information for crop load management operations such as flower thinning and pollination using automated and robotic platforms. These operations are important in tree-fruit agriculture to enhance fruit quality, manage crop load, and enhance the overall profit. Recent developments in agricultural automation suggest that this can be done using robotics that includes machine vision technology. In this article, we proposed a vision system that detects early-stage flowers in an unstructured orchard environment using the YOLOv5 object detection algorithm. For the robotic implementation, the position of a flower cluster is important for navigating the robot and the end effector. The centroids of individual flowers (both opened and unopened) were identified and associated with flower clusters via K-means clustering. Detection of opened and unopened flowers achieved a mAP of up to 81.9% in commercial orchard images.
Submitted 18 April, 2023;
originally announced April 2023.
-
Site-specific weed management in corn using UAS imagery analysis and computer vision techniques
Authors:
Ranjan Sapkota,
John Stenger,
Michael Ostlie,
Paulo Flores
Abstract:
Currently, weed control in commercial corn production is performed without considering weed distribution information in the field. This kind of weed management practice leads to excessive amounts of chemical herbicides being applied in a given field. The objective of this study was to perform site-specific weed control (SSWC) in a corn field by 1) using an unmanned aerial system (UAS) to map the spatial distribution of weeds in the field; 2) creating a prescription map based on the weed distribution map; and 3) spraying the field using the prescription map and a commercial-size sprayer. In this study, we propose a Crop Row Identification (CRI) algorithm, a computer vision algorithm that identifies corn rows in UAS imagery. After being identified, the corn rows were removed from the imagery and the remaining vegetation fraction was classified as weeds. Based on that information, a grid-based weed prescription map was created and the weed control application was implemented through a commercial-size sprayer. The decision to spray herbicides on a particular grid cell was based on the presence of weeds in that cell. All the grid cells that contained at least one weed were sprayed, while the cells free of weeds were not. Using our SSWC approach, we were able to save 26.23% of the land (1.97 acres) from being sprayed with chemical herbicides compared to the existing method. This study presents a full workflow from UAS image collection to field weed control implementation using a commercial-size sprayer, and it shows that some level of savings can potentially be obtained even in a situation with high weed infestation, which might provide an opportunity to reduce chemical usage in corn production systems.
Submitted 31 December, 2022;
originally announced January 2023.
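The grid-based spray decision described in the abstract above (spray every grid cell that contains at least one weed, skip the rest) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the binary `weed_mask` input, the square `cell` size in pixels, and the function names `prescription_map` and `unsprayed_fraction` are all hypothetical, and the Crop Row Identification step that would produce the weed mask is not shown.

```python
def prescription_map(weed_mask, cell):
    """Build a spray/no-spray grid from a binary weed mask.

    weed_mask: 2D list of 0/1 values (1 = pixel classified as weed).
    cell: side length of a square grid cell, in pixels (hypothetical unit).
    Returns a 2D list of booleans; True means the cell is sprayed.
    """
    rows = len(weed_mask)
    cols = len(weed_mask[0]) if rows else 0
    grid = []
    for r0 in range(0, rows, cell):
        grid_row = []
        for c0 in range(0, cols, cell):
            # A cell is sprayed as soon as it contains at least one weed pixel.
            has_weed = any(
                weed_mask[r][c]
                for r in range(r0, min(r0 + cell, rows))
                for c in range(c0, min(c0 + cell, cols))
            )
            grid_row.append(has_weed)
        grid.append(grid_row)
    return grid


def unsprayed_fraction(grid):
    """Fraction of grid cells that are left unsprayed."""
    cells = [sprayed for grid_row in grid for sprayed in grid_row]
    return cells.count(False) / len(cells) if cells else 0.0
```

For example, a 4x4 mask with a single weed pixel and `cell=2` yields four cells, one of which is sprayed, so `unsprayed_fraction` returns 0.75. The savings figure reported in the abstract is the field-scale analogue of this fraction.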
-
An autonomous robot for pruning modern, planar fruit trees
Authors:
Alexander You,
Nidhi Parayil,
Josyula Gopala Krishna,
Uddhav Bhattarai,
Ranjan Sapkota,
Dawood Ahmed,
Matthew Whiting,
Manoj Karkee,
Cindy M. Grimm,
Joseph R. Davidson
Abstract:
Dormant pruning of fruit trees is an important task for maintaining tree health and ensuring high-quality fruit. Due to decreasing labor availability, pruning is a prime candidate for robotic automation. However, pruning also represents a uniquely difficult problem for robots, requiring robust systems for perception, pruning point determination, and manipulation that must operate under variable lighting conditions and in complex, highly unstructured environments. In this paper, we introduce a system for pruning sweet cherry trees (in a planar tree architecture called an upright fruiting offshoot configuration) that integrates various subsystems from our previous work on perception and manipulation. The resulting system is capable of operating completely autonomously and requires minimal control of the environment. We validate the performance of our system through field trials in a sweet cherry orchard, ultimately achieving a cutting success rate of 58%. Though not fully robust and requiring improvements in throughput, our system is the first to operate on fruit trees and represents a useful base platform to be improved in the future.
Submitted 14 June, 2022;
originally announced June 2022.
-
Using UAS Imagery and Computer Vision to Support Site-Specific Weed Control in Corn
Authors:
Ranjan Sapkota,
Paulo Flores
Abstract:
Currently, weed control in a corn field is performed by a blanket application of herbicides that does not consider the spatial distribution of weeds and uses an extensive amount of chemical herbicides. To reduce the amount of chemicals, we used drone-based high-resolution imagery and computer-vision techniques to perform site-specific weed control in corn.
Submitted 2 June, 2022;
originally announced June 2022.
-
UAS Imagery and Computer Vision for Site-Specific Weed Control in Corn
Authors:
Ranjan Sapkota,
Paulo Flores
Abstract:
Currently, weed control in a corn field is performed by a blanket application of herbicides that does not consider the spatial distribution of weeds and uses an extensive amount of chemical herbicides. In order to reduce the amount of chemicals, we used drone-based high-resolution imagery and a computer-vision technique to perform site-specific weed control in corn.
Submitted 28 April, 2022; v1 submitted 26 April, 2022;
originally announced April 2022.
-
Bulk transport properties of Bismuth selenide thin films approaching the two-dimensional limit
Authors:
Yub Raj Sapkota,
Dipanjan Mazumdar
Abstract:
We have investigated the transport properties of topological insulator Bi2Se3 thin films grown using magnetron sputtering with an emphasis on understanding the behavior as a function of thickness. We show that thickness has a strong influence on all aspects of transport as the two-dimensional limit is approached. Bulk resistivity and Hall mobility show disproportionately large changes below 6 quintuple layers, which we directly correlate to an increase in the bulk band gap of few-layer Bi2Se3, an effect that is concomitant with surface gap opening. A tendency to cross over from metallic to insulating behavior in temperature-dependent resistivity measurements in ultra-thin Bi2Se3 is also consistent with an increase in the bulk band gap along with enhanced disorder at the film-substrate interface. Our work highlights that the properties of few-layer Bi2Se3 are tunable, which may be attractive for a variety of device applications in areas such as optoelectronics, nanoelectronics and spintronics.
Submitted 17 March, 2018;
originally announced March 2018.
-
Optical evidence of blue shift in topological insulator bismuth selenide in the few-layer limit
Authors:
Yub Raj Sapkota,
Asma Alkabsh,
Aaron Walber,
Hassana Samassekou,
Dipanjan Mazumdar
Abstract:
Optical band gap properties of high-quality few-layer topological insulator Bi2Se3 thin films grown with magnetron sputtering are investigated using broadband absorption spectroscopy. We provide direct optical evidence of a rigid blue shift of up to 0.5 eV in the band gap of Bi2Se3 as it approaches the two-dimensional limit. The onset of this behavior is most significant below six quintuple layers. The blue shift is very robust and is observed in both protected (capped) and exposed (uncapped) thin films. Our results are consistent with observations that finite-size effects have a profound impact on the electronic character of topological insulators, particularly when the top and bottom surface states are coupled. Our results provide new insights into the scaling behavior of topological materials, and highlight the need for deeper investigations, before they can have a significant impact on electronic applications.
Submitted 2 March, 2017;
originally announced March 2017.
-
Estimation of spin relaxation lengths in spin valves of In and In2O3 nanostructures
Authors:
Keshab R Sapkota,
Parshu Gyawali,
Ian L. Pegg,
John Philip
Abstract:
We report the electrical injection and detection of spin-polarized current in lateral ferromagnet-nonmagnet-ferromagnet spin valve devices, the ferromagnet being cobalt and the nonmagnet being indium (In) or indium oxide (In2O3) nanostructures. The In nanostructures were grown by depositing pure In on lithographically pre-patterned structures. In2O3 nanostructures were obtained by oxidation of In nanostructures. Spin valve devices were fabricated by depositing micro magnets over the nanostructures with connecting nonmagnetic electrodes via two steps of e-beam lithography. Clear spin switching behavior was observed in both types of spin valve devices measured at 10 K. From the measured spin signal, the spin relaxation lengths (λN) of In and In2O3 nanostructures were estimated to be 449.6 nm and 788.6 nm, respectively.
Submitted 12 September, 2016;
originally announced September 2016.