
Showing 1–50 of 89 results for author: Khan, M H

  1. arXiv:2509.15226  [pdf, ps, other]

    cs.CV

    Calibration-Aware Prompt Learning for Medical Vision-Language Models

    Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan

    Abstract: Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration is largely unexplored, and so remains a significant challenge. As such, miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability.…

    Submitted 18 September, 2025; originally announced September 2025.

    Comments: Accepted in BMVC 2025

  2. arXiv:2509.09024  [pdf]

    cs.RO physics.app-ph

    Rapid Manufacturing of Lightweight Drone Frames Using Single-Tow Architected Composites

    Authors: Md Habib Ullah Khan, Kaiyue Deng, Ismail Mujtaba Khan, Kelvin Fu

    Abstract: The demand for lightweight and high-strength composite structures is rapidly growing in aerospace and robotics, particularly for optimized drone frames. However, conventional composite manufacturing methods struggle to achieve complex 3D architectures for weight savings and rely on assembling separate components, which introduce weak points at the joints. Additionally, maintaining continuous fiber…

    Submitted 10 September, 2025; originally announced September 2025.

    Comments: 23 pages, 5 figures

  3. arXiv:2508.18799  [pdf, ps, other]

    cs.CV

    Robust and Label-Efficient Deep Waste Detection

    Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan

    Abstract: Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Voc…

    Submitted 8 September, 2025; v1 submitted 26 August, 2025; originally announced August 2025.

    Comments: Accepted at BMVC 2025

  4. arXiv:2508.14660  [pdf, ps, other]

    cs.CV

    Towards PerSense++: Advancing Training-Free Personalized Instance Segmentation in Dense Images

    Authors: Muhammad Ibraheem Siddiqui, Muhammad Umer Sheikh, Hassan Abid, Kevin Henry, Muhammad Haris Khan

    Abstract: Segmentation in dense visual scenes poses significant challenges due to occlusions, background clutter, and scale variations. To address this, we introduce PerSense, an end-to-end, training-free, and model-agnostic one-shot framework for Personalized instance Segmentation in dense images. PerSense employs a novel Instance Detection Module (IDM) that leverages density maps (DMs) to generate instanc…

    Submitted 20 August, 2025; originally announced August 2025.

    Comments: arXiv admin note: text overlap with arXiv:2405.13518

  5. arXiv:2508.01361  [pdf, ps, other]

    cs.RO

    VLH: Vision-Language-Haptics Foundation Model

    Authors: Luis Francisco Moreno Fuentes, Muhammad Haris Khan, Miguel Altamirano Cabrera, Valerii Serpiva, Dmitri Iarchuk, Yara Mahmoud, Issatay Tokmurziyev, Dzmitry Tsetserukou

    Abstract: We present VLH, a novel Visual-Language-Haptic Foundation Model that unifies perception, language, and tactile feedback in aerial robotics and virtual reality. Unlike prior work that treats haptics as a secondary, reactive channel, VLH synthesizes mid-air force and vibration cues as a direct consequence of contextual visual understanding and natural language commands. Our platform comprises an 8-i…

    Submitted 2 August, 2025; originally announced August 2025.

  6. arXiv:2507.22075  [pdf, ps, other]

    cs.LG

    Prototype-Guided Pseudo-Labeling with Neighborhood-Aware Consistency for Unsupervised Adaptation

    Authors: Eman Ali, Chetan Arora, Muhammad Haris Khan

    Abstract: In unsupervised adaptation for vision-language models such as CLIP, pseudo-labels derived from zero-shot predictions often exhibit significant noise, particularly under domain shifts or in visually complex scenarios. Conventional pseudo-label filtering approaches, which rely on fixed confidence thresholds, tend to be unreliable in fully unsupervised settings. In this work, we propose a novel adapt…

    Submitted 22 July, 2025; originally announced July 2025.

  7. arXiv:2507.20579  [pdf, ps, other]

    cs.CV

    AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations

    Authors: Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, Abhinav Dhall

    Abstract: The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in the types of generation methods and perturbation strategies that are common in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M with 2 million video clips with diversified ma…

    Submitted 28 July, 2025; originally announced July 2025.

  8. arXiv:2507.09615  [pdf, ps, other]

    cs.CV

    Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score

    Authors: Eman Ali, Sathira Silva, Chetan Arora, Muhammad Haris Khan

    Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that cannot capture evolving, subtle class distinctions or use computationally expensive pseudo-labeling strategies that lim…

    Submitted 13 July, 2025; originally announced July 2025.

  9. arXiv:2507.05681  [pdf, ps, other]

    cs.AR cs.AI cs.LG

    GATMesh: Clock Mesh Timing Analysis using Graph Neural Networks

    Authors: Muhammad Hadir Khan, Matthew Guthaus

    Abstract: Clock meshes are essential in high-performance VLSI systems for minimizing skew and handling PVT variations, but analyzing them is difficult due to reconvergent paths, multi-source driving, and input mesh buffer skew. SPICE simulations are accurate but slow, while simplified models miss key effects like slew and input skew. We propose GATMesh, a Graph Neural Network (GNN)-based framework that models…

    Submitted 8 July, 2025; originally announced July 2025.

  10. arXiv:2506.22977  [pdf, ps, other]

    cs.CL cs.LG

    On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"

    Authors: Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss

    Abstract: We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: 22 pages, 25 figures. For an interactive dashboard with all figures, see https://comp-mech-generalizability.streamlit.app/ . For the accompanying code, see https://github.com/asendotsinski/comp-mech-generalizability . To be published in proceedings of the 2025 Machine Learning Reproducibility Challenge

    Journal ref: TMLR (2835-8856) 2025

  11. arXiv:2506.15649  [pdf, ps, other]

    cs.CV cs.LG

    Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

    Authors: Ankan Deria, Adinath Madhavrao Dukre, Feilong Tang, Sara Atito, Sudipta Roy, Muhammad Awais, Muhammad Haris Khan, Imran Razzak

    Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output f…

    Submitted 18 June, 2025; originally announced June 2025.

  12. arXiv:2506.06281  [pdf, other]

    cs.CV

    TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

    Authors: Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, Salman Khan

    Abstract: Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable re…

    Submitted 6 June, 2025; originally announced June 2025.

  13. arXiv:2505.23752  [pdf, ps, other]

    cs.CV

    ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

    Authors: Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, Salman Khan

    Abstract: Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark…

    Submitted 29 May, 2025; originally announced May 2025.

  14. arXiv:2505.22581  [pdf, other]

    cs.CV cs.AI

    Tell me Habibi, is it Real or Fake?

    Authors: Kartik Kuckreja, Parul Gupta, Injy Hamed, Thamar Solorio, Muhammad Haris Khan, Abhinav Dhall

    Abstract: Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, i…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 9 pages, 2 figures, 12 tables

  15. arXiv:2505.02582  [pdf, ps, other]

    cs.HC

    FlyHaptics: Flying Multi-contact Haptic Interface

    Authors: Luis Moreno, Miguel Altamirano Cabrera, Muhammad Haris Khan, Issatay Tokmurziyev, Yara Mahmoud, Valerii Serpiva, Dzmitry Tsetserukou

    Abstract: This work presents FlyHaptics, an aerial haptic interface tracked via a Vicon optical motion capture system and built around six five-bar linkage assemblies enclosed in a lightweight protective cage. We predefined five static tactile patterns - each characterized by distinct combinations of linkage contact points and vibration intensities - and evaluated them in a grounded pilot study, where parti…

    Submitted 5 May, 2025; originally announced May 2025.

  16. arXiv:2505.02569  [pdf, other]

    cs.RO cs.HC

    HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction

    Authors: Muhammad Haris Khan, Miguel Altamirano Cabrera, Dmitrii Iarchuk, Yara Mahmoud, Daria Trinitatova, Issatay Tokmurziyev, Dzmitry Tsetserukou

    Abstract: This paper introduces HapticVLM, a novel multimodal system that integrates vision-language reasoning with deep convolutional networks to enable real-time haptic feedback. HapticVLM leverages a ConvNeXt-based material recognition module to generate robust visual embeddings for accurate identification of object materials, while a state-of-the-art Vision-Language Model (Qwen2-VL-2B-Instruct) infers a…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: Submitted to IEEE conf

  17. arXiv:2504.16433  [pdf, other]

    cs.CV

    FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing

    Authors: Hariseetharam Gunduboina, Muhammad Haris Khan, Biplab Banerjee

    Abstract: In recent years, large-scale vision-language models (VLMs) like CLIP have gained attention for their zero-shot inference using instructional text prompts. While these models excel in general computer vision, their potential for domain generalization in remote sensing (RS) remains underexplored. Existing approaches enhance prompt learning by generating visual prompt tokens but rely on full-image fe…

    Submitted 23 April, 2025; originally announced April 2025.

  18. arXiv:2503.16475  [pdf, other]

    cs.HC cs.RO

    LLM-Glasses: GenAI-driven Glasses with Haptic Feedback for Navigation of Visually Impaired People

    Authors: Issatay Tokmurziyev, Miguel Altamirano Cabrera, Muhammad Haris Khan, Yara Mahmoud, Luis Moreno, Dzmitry Tsetserukou

    Abstract: We present LLM-Glasses, a wearable navigation system designed to assist visually impaired individuals by combining haptic feedback, YOLO-World object detection, and GPT-4o-driven reasoning. The system delivers real-time tactile guidance via temple-mounted actuators, enabling intuitive and independent navigation. Three user studies were conducted to evaluate its effectiveness: (1) a haptic pattern…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: Submitted to IEEE/RSJ IROS 2025

  19. arXiv:2503.16106  [pdf, other]

    cs.CV

    OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP

    Authors: Mohamad Hassan N C, Divyam Gupta, Mainak Singha, Sai Bhargav Rongali, Ankit Jha, Muhammad Haris Khan, Biplab Banerjee

    Abstract: We introduce Low-Shot Open-Set Domain Generalization (LSOSDG), a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). While prompt-based methods using models like CLIP have advanced DG, they falter in low-data regimes (e.g., 1-shot) and lack precision in detecting open-set samples with fine-grained semantics related to training classes. To address these challenges,…

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025

  20. arXiv:2503.12206  [pdf, other]

    cs.CV

    TLAC: Two-stage LMM Augmented CLIP for Zero-Shot Classification

    Authors: Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali

    Abstract: Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques like prompt learning and adapter-based tuning to optimize CLIP's performance. The necessity for fine-tuning significantly limits CLIP's adaptability to novel datasets and domains. This requirement mandates substanti…

    Submitted 21 April, 2025; v1 submitted 15 March, 2025; originally announced March 2025.

    Comments: Added code link in the abstract

  21. arXiv:2503.12096  [pdf, other]

    cs.CV

    O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models

    Authors: Ashshak Sharifdeen, Muhammad Akhtar Munir, Sanoojan Baliah, Salman Khan, Muhammad Haris Khan

    Abstract: Test-time prompt tuning for vision-language models (VLMs) is gaining attention because of its ability to learn with unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to demonstrate poor calibration, which casts doubts on the reliability and trustworthiness of these models. Notably, more attention needs to be devote…

    Submitted 15 March, 2025; originally announced March 2025.

    Comments: Accepted at CVPR 2025

  22. arXiv:2503.02723  [pdf, other]

    cs.RO

    ImpedanceGPT: VLM-driven Impedance Control of Swarm of Mini-drones for Intelligent Navigation in Dynamic Environment

    Authors: Faryal Batool, Malaika Zafar, Yasheerah Yaqoot, Roohan Ahmed Khan, Muhammad Haris Khan, Aleksey Fedoseev, Dzmitry Tsetserukou

    Abstract: Swarm robotics plays a crucial role in enabling autonomous operations in dynamic and unpredictable environments. However, a major challenge remains ensuring safe and efficient navigation in environments filled with both dynamic alive (e.g., humans) and dynamic inanimate (e.g., non-living objects) obstacles. In this paper, we propose ImpedanceGPT, a novel system that combines a Vision-Language Mode…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: Submitted to IROS 2025

  23. arXiv:2503.02572  [pdf, other]

    cs.RO cs.AI

    RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour

    Authors: Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, Dzmitry Tsetserukou

    Abstract: RaceVLA presents an innovative approach for autonomous racing drone navigation by leveraging Visual-Language-Action (VLA) to emulate human-like behavior. This research explores the integration of advanced algorithms that enable drones to adapt their navigation strategies based on real-time environmental feedback, mimicking the decision-making processes of human pilots. The model, fine-tuned on a c…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: 6 pages, 6 figures. Submitted to IROS 2025

  24. arXiv:2503.01378  [pdf, other]

    cs.RO

    CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs

    Authors: Artem Lykov, Valerii Serpiva, Muhammad Haris Khan, Oleg Sautenkov, Artyom Myshlyaev, Grik Tadevosyan, Yasheerah Yaqoot, Dzmitry Tsetserukou

    Abstract: This paper introduces CognitiveDrone, a novel Vision-Language-Action (VLA) model tailored for complex Unmanned Aerial Vehicle (UAV) tasks that demand advanced cognitive abilities. Trained on a dataset comprising over 8,000 simulated flight trajectories across three key categories (Human Recognition, Symbol Understanding, and Reasoning), the model generates real-time 4D action commands based on firs…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Paper submitted to the IEEE conference

  25. arXiv:2502.17034  [pdf, other]

    cs.RO cs.NE

    Evolution 6.0: Evolving Robotic Capabilities Through Generative Design

    Authors: Muhammad Haris Khan, Artyom Myshlyaev, Artem Lykov, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

    Abstract: We propose a new concept, Evolution 6.0, which represents the evolution of robotics driven by Generative AI. When a robot lacks the necessary tools to accomplish a task requested by a human, it autonomously designs the required instruments and learns how to use them to achieve the goal. Evolution 6.0 is an autonomous robotic system powered by Vision-Language Models (VLMs), Vision-Language Action (…

    Submitted 4 April, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: Submitted to IROS

  26. arXiv:2502.07811  [pdf, other]

    cs.CV

    CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

    Authors: Shihab Aaqil Ahamed, Malitha Gunawardhana, Liel David, Michael Sidorov, Daniel Harari, Muhammad Haris Khan

    Abstract: Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective, which may lead the model to prioritize general spatial-temporal patterns but often overlook nuanced semantic attributes like specific interactions or sequences that define actions - such as action-specific features that align more closely with human cogniti…

    Submitted 8 February, 2025; originally announced February 2025.

  27. arXiv:2501.07255  [pdf, other]

    cs.RO

    GazeGrasp: DNN-Driven Robotic Grasping with Wearable Eye-Gaze Interface

    Authors: Issatay Tokmurziyev, Miguel Altamirano Cabrera, Luis Moreno, Muhammad Haris Khan, Dzmitry Tsetserukou

    Abstract: We present GazeGrasp, a gaze-based manipulation system enabling individuals with motor impairments to control collaborative robots using eye-gaze. The system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and YOLOv8 for object localization, integrated with a Universal Robot UR10 for manipulation tasks. After user-specific calibration, the system allows intuitive object select…

    Submitted 14 January, 2025; v1 submitted 13 January, 2025; originally announced January 2025.

    Comments: Accepted to: IEEE/ACM International Conference on Human-Robot Interaction (HRI 2025)

  28. arXiv:2501.06919  [pdf, other]

    cs.RO

    Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing

    Authors: Muhammad Haris Khan, Selamawit Asfaw, Dmitrii Iarchuk, Miguel Altamirano Cabrera, Luis Moreno, Issatay Tokmurziyev, Dzmitry Tsetserukou

    Abstract: This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force Torque (FT…

    Submitted 12 January, 2025; originally announced January 2025.

    Comments: Accepted to IEEE/ACM HRI 2025

  29. arXiv:2411.11917  [pdf, other]

    cs.CV

    FCC: Fully Connected Correlation for Few-Shot Segmentation

    Authors: Seonghyeon Moon, Haein Kong, Muhammad Haris Khan, Yuewei Lin

    Abstract: Few-shot segmentation (FSS) aims to segment the target object in a query image using only a small set of support images and masks. Therefore, having strong prior information for the target object using the support set is essential for guiding the initial training of FSS, which leads to the success of few-shot segmentation in challenging cases, such as when the target object shows considerable vari…

    Submitted 17 November, 2024; originally announced November 2024.

  30. arXiv:2411.02614  [pdf, other]

    eess.IV cs.CV

    Divergent Domains, Convergent Grading: Enhancing Generalization in Diabetic Retinopathy Grading

    Authors: Sharon Chokuwa, Muhammad Haris Khan

    Abstract: Diabetic Retinopathy (DR) constitutes 5% of global blindness cases. While numerous deep learning approaches have sought to enhance traditional DR grading methods, they often falter when confronted with new out-of-distribution data, thereby impeding their widespread application. In this study, we introduce a novel deep learning method for achieving domain generalization (DG) in DR grading and make t…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: Accepted at WACV 2025

  31. arXiv:2410.20421  [pdf, other]

    cs.CV cs.AI

    NT-VOT211: A Large-Scale Benchmark for Night-time Visual Object Tracking

    Authors: Yu Liu, Arif Mahmood, Muhammad Haris Khan

    Abstract: Many current visual object tracking benchmarks, such as OTB100, NfS, UAV123, LaSOT, and GOT-10K, predominantly contain day-time scenarios, while the challenges posed by night-time have been less investigated. This is primarily because of the lack of a large-scale, well-annotated night-time benchmark for rigorously evaluating tracking algorithms. To this end, this paper presents NT-VOT211, a new ben…

    Submitted 27 October, 2024; originally announced October 2024.

    Comments: Oral Acceptance at the Asian Conference on Computer Vision (ACCV) 2024, Hanoi, Vietnam

  32. arXiv:2410.20395  [pdf, other]

    cs.CV eess.IV

    Depth Attention for Robust RGB Tracking

    Authors: Yu Liu, Arif Mahmood, Muhammad Haris Khan

    Abstract: RGB video object tracking is a fundamental task in computer vision. Its effectiveness can be improved using depth information, particularly for handling motion-blurred targets. However, depth information is often missing in commonly used tracking benchmarks. In this work, we propose a new framework that leverages monocular depth estimation to counter the challenges of tracking targets that are out…

    Submitted 27 October, 2024; originally announced October 2024.

    Comments: Oral Acceptance at the Asian Conference on Computer Vision (ACCV) 2024, Hanoi, Vietnam

  33. arXiv:2410.16202  [pdf, other]

    cs.HC

    Musinger: Communication of Music over a Distance with Wearable Haptic Display and Touch Sensitive Surface

    Authors: Miguel Altamirano Cabrera, Muhammad Haris Khan, Ali Alabbas, Luis Moreno, Issatay Tokmurziyev, Dzmitry Tsetserukou

    Abstract: This study explores the integration of auditory and tactile experiences in musical haptics, focusing on enhancing sensory dimensions of music through touch. Addressing the gap in translating auditory signals to meaningful tactile feedback, our research introduces a novel method involving a touch-sensitive recorder and a wearable haptic display that captures musical interactions via force sensors a…

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: This paper has been accepted for publication at ROBIO 2024 conference

  34. arXiv:2410.16146  [pdf, other]

    cs.LG cs.CV

    Towards Combating Frequency Simplicity-biased Learning for Domain Generalization

    Authors: Xilin He, Jingyu Hu, Qinliang Lin, Cheng Luo, Weicheng Xie, Siyang Song, Muhammad Haris Khan, Linlin Shen

    Abstract: Domain generalization methods aim to learn transferable knowledge from source domains that can generalize well to unseen target domains. Recent studies show that neural networks frequently suffer from a simplicity-biased learning behavior which leads to over-reliance on specific frequency sets, known as frequency shortcuts, instead of semantic information, resulting in poor generalization perform…

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: Accepted by NeurIPS 2024

  35. arXiv:2410.09865  [pdf, ps, other]

    cs.CV

    SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

    Authors: Xilin He, Cheng Luo, Xiaole Xian, Bing Li, Muhammad Haris Khan, Zongyuan Ge, Weicheng Xie, Siyang Song, Linlin Shen, Bernard Ghanem, Xiangyu Yue

    Abstract: Facial expression datasets remain limited in scale due to the subjectivity of annotations and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, that rely on large-scale data for optimal performance. To tackle the overarching and complex challenge,…

    Submitted 12 August, 2025; v1 submitted 13 October, 2024; originally announced October 2024.

    Comments: ICCV 2025

  36. arXiv:2410.05322  [pdf, other]

    cs.CV

    Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models

    Authors: Muhammad Haaris Khan, Hadrien Reynaud, Bernhard Kainz

    Abstract: Although powerful for image generation, consistent and controllable video is a longstanding problem for diffusion models. Video models require extensive training and computational resources, leading to high costs and large environmental impacts. Moreover, video models currently offer limited control of the output motion. This paper introduces a novel approach to video generation by augmenting imag…

    Submitted 5 October, 2024; originally announced October 2024.

  37. Multi-modal Medical Image Fusion For Non-Small Cell Lung Cancer Classification

    Authors: Salma Hassan, Hamad Al Hammadi, Ibrahim Mohammed, Muhammad Haris Khan

    Abstract: The early detection and nuanced subtype classification of non-small cell lung cancer (NSCLC), a predominant cause of cancer mortality worldwide, is a critical and complex issue. In this paper, we introduce an innovative integration of multi-modal data, synthesizing fused medical imaging (CT and PET scans) with clinical health records and genomic data. This unique fusion methodology leverages advan…

    Submitted 27 September, 2024; originally announced September 2024.

  38. arXiv:2409.10106  [pdf, other]

    cs.RO cs.AI

    Industry 6.0: New Generation of Industry driven by Generative AI and Swarm of Heterogeneous Robots

    Authors: Artem Lykov, Miguel Altamirano Cabrera, Mikhail Konenkov, Valerii Serpiva, Koffivi Fidèle Gbagbe, Ali Alabbas, Aleksey Fedoseev, Luis Moreno, Muhammad Haris Khan, Ziang Guo, Dzmitry Tsetserukou

    Abstract: This paper presents the concept of Industry 6.0, introducing the world's first fully automated production system that autonomously handles the entire product design and manufacturing process based on user-provided natural language descriptions. By leveraging generative AI, the system automates critical aspects of production, including product blueprint design, component manufacturing, logistics, a…

    Submitted 16 September, 2024; originally announced September 2024.

    Comments: submitted to IEEE conf

  39. arXiv:2409.07269  [pdf, other]

    cs.CV

    Realistic and Efficient Face Swapping: A Unified Approach with Diffusion Models

    Authors: Sanoojan Baliah, Qinliang Lin, Shengcai Liao, Xiaodan Liang, Muhammad Haris Khan

    Abstract: Despite promising progress in the face swapping task, realistic swapped images remain elusive, often marred by artifacts, particularly in scenarios involving high pose variation, color differences, and occlusion. To address these issues, we propose a novel approach that better harnesses diffusion models for face-swapping by making the following core contributions. (a) We propose to re-frame the face-swapp…

    Submitted 11 September, 2024; originally announced September 2024.

    Comments: Accepted as a conference paper at WACV 2025

  40. arXiv:2409.03509  [pdf, other]

    cs.CV

    Domain-Guided Weight Modulation for Semi-Supervised Domain Generalization

    Authors: Chamuditha Jayanaga Galappaththige, Zachary Izzo, Xilin He, Honglu Zhou, Muhammad Haris Khan

    Abstract: Unarguably, deep learning models capable of generalizing to unseen domain data while leveraging a few labels are of great practical significance due to low developmental costs. In search of this endeavor, we study the challenging problem of semi-supervised domain generalization (SSDG), where the goal is to learn a domain-generalizable model while using only a small fraction of labeled data and a r…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted at WACV25

  41. arXiv:2409.01387  [pdf, other]

    cs.AR cs.AI cs.DC cs.LG

    VLSI Hypergraph Partitioning with Deep Learning

    Authors: Muhammad Hadir Khan, Bugra Onal, Eren Dogan, Matthew R. Guthaus

    Abstract: Partitioning is a known problem in computer science and is critical in chip design workflows, as advancements in this area can significantly influence design quality and efficiency. Deep Learning (DL) techniques, particularly those involving Graph Neural Networks (GNNs), have demonstrated strong performance in various node, edge, and graph prediction tasks using both inductive and transductive lea…

    Submitted 2 September, 2024; originally announced September 2024.

  42. arXiv:2408.08855  [pdf, other]

    cs.CV

    DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models

    Authors: Eman Ali, Sathira Silva, Muhammad Haris Khan

    Abstract: Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labeled data is unavailable. Recent research has proposed pseudo-labeling approaches to adapt CLIP in an unsupervised manner using unlabeled target data. Nonetheless, these methods…

    Submitted 1 December, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

    Comments: Accepted at WACV 2025

  43. arXiv:2408.07445  [pdf, other]

    cs.CV

    Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach

    Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Zaigham Zaheer, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf, Hassan Sajjad, Tom De Schepper, Markus Schedl

    Abstract: Multimodal networks have demonstrated remarkable performance improvements over their unimodal counterparts. Existing multimodal networks are designed in a multi-branch fashion that, due to the reliance on fusion strategies, exhibit deteriorated performance if one or more modalities are missing. In this work, we propose a modality invariant multimodal learning method, which is less susceptible to t…

    Submitted 14 August, 2024; originally announced August 2024.

  44. arXiv:2408.00498  [pdf, other]

    cs.CV

    How Effective are Self-Supervised Models for Contact Identification in Videos

    Authors: Malitha Gunawardhana, Limalka Sadith, Liel David, Daniel Harari, Muhammad Haris Khan

    Abstract: The exploration of video content via Self-Supervised Learning (SSL) models has unveiled a dynamic field of study, emphasizing both the complex challenges and unique opportunities inherent in this area. Despite the growing body of research, the ability of SSL models to detect physical contacts in videos remains largely unexplored, particularly the effectiveness of methods such as downstream supervi…

    Submitted 25 September, 2024; v1 submitted 1 August, 2024; originally announced August 2024.

    Comments: 15 pages, 6 figures

  45. arXiv:2407.13715  [pdf, other]

    cs.CV cs.LG

    Attention Based Simple Primitives for Open World Compositional Zero-Shot Learning

    Authors: Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali

    Abstract: Compositional Zero-Shot Learning (CZSL) aims to predict unknown compositions made up of attribute and object pairs. Predicting compositions unseen during training is a challenging task. We are exploring Open World Compositional Zero-Shot Learning (OW-CZSL) in this study, where our test space encompasses all potential combinations of attributes and objects. Our approach involves utilizing the self-…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 10 pages, 6 figures

  46. arXiv:2407.04519  [pdf, ps, other]

    cs.CV

    Judging from Support-set: A New Way to Utilize Few-Shot Segmentation for Segmentation Refinement Process

    Authors: Seonghyeon Moon, Qingze Liu, Haein Kong, Muhammad Haris Khan

    Abstract: Segmentation refinement aims to enhance the initial coarse masks generated by segmentation algorithms. The refined masks are expected to capture more details and better contours of the target objects. Research on segmentation refinement has developed as a response to the need for high-quality image segmentations. However, to our knowledge, no method has been developed that can determine the succes…

    Submitted 9 July, 2025; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: ICIP 2025

  47. arXiv:2407.01440  [pdf, other]

    cs.LG

    GAT-Steiner: Rectilinear Steiner Minimal Tree Prediction Using GNNs

    Authors: Bugra Onal, Eren Dogan, Muhammad Hadir Khan, Matthew R. Guthaus

    Abstract: The Rectilinear Steiner Minimum Tree (RSMT) problem is a fundamental problem in VLSI placement and routing and is known to be NP-hard. Traditional RSMT algorithms spend a significant amount of time on finding Steiner points to reduce the total wire length or use heuristics to approximate, producing sub-optimal results. We show that Graph Neural Networks (GNNs) can be used to predict optimal Steiner…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Preprint for The 2024 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2024)

  48. arXiv:2405.14497  [pdf, other]

    cs.CV

    Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment

    Authors: Muhammad Sohail Danish, Muhammad Haris Khan, Muhammad Akhtar Munir, M. Saquib Sarfraz, Mohsen Ali

    Abstract: In this work, we tackle the problem of domain generalization for object detection, specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly, we demonstrate that by carefully selecting a set o…

    Submitted 23 May, 2024; originally announced May 2024.

  49. arXiv:2405.13518  [pdf, ps, other]

    cs.CV

    PerSense: Training-Free Personalized Instance Segmentation in Dense Images

    Authors: Muhammad Ibraheem Siddiqui, Muhammad Umer Sheikh, Hassan Abid, Muhammad Haris Khan

    Abstract: The emergence of foundational models has significantly advanced segmentation approaches. However, challenges still remain in dense scenarios, where occlusions, scale variations, and clutter impede precise instance delineation. To address this, we propose PerSense, an end-to-end, training-free, and model-agnostic one-shot framework for Personalized instance Segmentation in dense images. We start wi…

    Submitted 7 August, 2025; v1 submitted 22 May, 2024; originally announced May 2024.

    Comments: Technical report of PerSense

  50. arXiv:2404.09342  [pdf, other]

    cs.CV cs.SD eess.AS

    Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

    Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

    Abstract: Advances in technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are some of the most widely used. In recent years, associating the face and voice of a person has gained attention due to the presence of a unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2…

    Submitted 22 July, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: ACM Multimedia Conference - Grand Challenge