Search | arXiv e-print repository

Understanding Adversarial Training with Energy-based Models

Authors: Mujtaba Hussain Mirza, Maria Rosaria Briglia, Filippo Bartolucci, Senad Beadini, Giuseppe Lisanti, Iacopo Masi

Abstract: We aim to use the Energy-based Model (EBM) framework to better understand adversarial training (AT) in classifiers, and additionally to analyze the intrinsic generative capabilities of robust classifiers. By viewing standard classifiers through an energy lens, we begin by analyzing how the energies of adversarial examples, generated by various attacks, differ from those of the natural samples. The central focus of our work is to understand the critical phenomena of Catastrophic Overfitting (CO) and Robust Overfitting (RO) in AT from an energy perspective. We analyze the impact of existing AT approaches on the energy of samples during training and observe that the behavior of the "delta energy" -- the change in energy between an original sample and its adversarial counterpart -- diverges significantly when CO or RO occurs. After a thorough analysis of these energy dynamics and their relationship with overfitting, we propose a novel regularizer, the Delta Energy Regularizer (DER), designed to smooth the energy landscape during training. We demonstrate that DER is effective in mitigating both CO and RO across multiple benchmarks. We further show that robust classifiers, when used as generative models, have limits in handling the trade-off between image quality and variability. We propose an improved technique based on a local class-wise principal component analysis (PCA) and energy-based guidance for better class-specific initialization and adaptive stopping, enhancing sample diversity and generation quality. Considering that we do not explicitly train for generative modeling, we achieve a competitive Inception Score (IS) and Fréchet inception distance (FID) compared to hybrid discriminative-generative models.
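To make the "delta energy" quantity concrete, here is a minimal sketch. It assumes the common convention for reading a classifier as an EBM, in which the energy of an input is the negative log-sum-exp of its logits. The abstract does not state the exact definition or sign convention, so `energy`, `delta_energy`, and the example logits are illustrative assumptions rather than the authors' implementation.

```python
import math


def energy(logits):
    """Energy of one input under the assumed convention E(x) = -logsumexp(logits)."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def delta_energy(logits_natural, logits_adversarial):
    """Assumed definition: energy of the adversarial input minus energy of the original."""
    return energy(logits_adversarial) - energy(logits_natural)


# Hypothetical logits for a clean sample and its adversarial counterpart.
clean = [2.0, 0.5, -1.0]
adv = [0.1, 0.3, 0.2]
print(delta_energy(clean, adv))
```

Under this convention a positive value means the adversarial input is assigned higher energy than the original; tracking that value over training epochs is one plausible way to observe the divergence the abstract describes.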

Submitted 28 May, 2025; originally announced May 2025.

Comments: Under review for TPAMI

arXiv:2505.21742 [pdf, ps, other]

What is Adversarial Training for Diffusion Models?

Authors: Briglia Maria Rosaria, Mujtaba Hussain Mirza, Giuseppe Lisanti, Iacopo Masi

Abstract: We answer the question in the title, showing that adversarial training (AT) for diffusion models (DMs) fundamentally differs from classifiers: while AT in classifiers enforces output invariance, AT in DMs requires equivariance to keep the diffusion process aligned with the data distribution. AT is a way to enforce smoothness in the diffusion flow, improving robustness to outliers and corrupted data. Unlike prior art, our method makes no assumptions about the noise model and integrates seamlessly into diffusion training by adding random noise, similar to randomized smoothing, or adversarial noise, akin to AT. This enables intrinsic capabilities such as handling noisy data, dealing with extreme variability such as outliers, preventing memorization, and improving robustness. We rigorously evaluate our approach with proof-of-concept datasets with known distributions in low- and high-dimensional space, thereby taking a perfect measure of errors; we further evaluate on standard benchmarks such as CIFAR-10, CelebA and LSUN Bedroom, showing strong performance under severe noise, data corruption, and iterative adversarial attacks.

Submitted 27 May, 2025; originally announced May 2025.

Comments: 40 pages

arXiv:2505.18115 [pdf, ps, other]

Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

Authors: Jacob Hansen, Wei Lin, Junmo Kang, Muhammad Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James Glass, Leonid Karlinsky

Abstract: Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ~3% on average and up to 12% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. Additionally, our approach enables effective performance scaling - both in quantity and quality - by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilitates future metadata-to-VisIT data conversion for niche domains, is released at https://github.com/jacob-hansen/Instructify.

Submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.07793 [pdf, ps, other]

Overflow Prevention Enhances Long-Context Recurrent LLMs

Authors: Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, James Glass, Leonid Karlinsky, Raja Giryes

Abstract: A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, long contexts remain underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.
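The chunk-based idea in this abstract (split the input, keep only the most relevant portion) can be sketched as follows. The abstract does not say how relevance is measured, so the token-overlap score below is a stand-in chosen for illustration; `split_into_chunks`, `overlap_score`, and `select_most_relevant_chunk` are hypothetical names, not the authors' method.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def overlap_score(chunk, query_tokens):
    """Hypothetical relevance score: number of chunk tokens that also appear in the query."""
    query = set(query_tokens)
    return sum(1 for tok in chunk if tok in query)


def select_most_relevant_chunk(context, query, chunk_size):
    """Return the highest-scoring chunk; the earliest chunk wins ties, [] if context is empty."""
    chunks = split_into_chunks(context.split(), chunk_size)
    query_tokens = query.split()
    return max(chunks, key=lambda c: overlap_score(c, query_tokens), default=[])


context = "the cat sat on the mat while the dog chased a red ball in the park"
best = select_most_relevant_chunk(context, "dog chased ball", chunk_size=4)
print(" ".join(best))
```

In a real pipeline the selected chunk, rather than the full context, would be passed to the recurrent model, so the fixed-size memory only has to hold a short, relevant span.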

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2504.14739 [pdf, ps, other]

doi 10.1177/02783649251339680

A Modularized Design Approach for GelSight Family of Vision-based Tactile Sensors

Authors: Arpit Agarwal, Mohammad Amin Mirzaee, Xiping Sun, Wenzhen Yuan

Abstract: The GelSight family of vision-based tactile sensors has proven to be effective for multiple robot perception and manipulation tasks. These sensors are based on an internal optical system and an embedded camera to capture the deformation of the soft sensor surface, inferring the high-resolution geometry of the objects in contact. However, customizing the sensors for different robot hands requires a tedious trial-and-error process to re-design the optical system. In this paper, we formulate the GelSight sensor design process as a systematic and objective-driven design problem and perform the design optimization with a physically accurate optical simulation. The method is based on modularizing and parameterizing the sensor's optical components and designing four generalizable objective functions to evaluate the sensor. We implement the method with an interactive and easy-to-use toolbox called OptiSense Studio. With the toolbox, non-sensor experts can quickly optimize their sensor design in both forward and inverse ways following our predefined modules and steps. We demonstrate our system with four different GelSight sensors by quickly optimizing their initial design in simulation and transferring it to the real sensors.

Submitted 20 April, 2025; originally announced April 2025.

Comments: The paper is accepted to International Journal of Robotics Research with DOI 10.1177/02783649251339680

arXiv:2504.14231 [pdf, other]

Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation

Authors: Johannes Spoecklberger, Wei Lin, Pedro Hermosilla, Sivan Doveh, Horst Possegger, M. Jehanzeb Mirza

Abstract: Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains. For example, we achieve an average improvement of 6.5 mIoU (over all tasks) compared with the previous state-of-the-art.

Submitted 19 April, 2025; originally announced April 2025.

arXiv:2504.00220 [pdf, other]

Can Diffusion Models Disentangle? A Theoretical Perspective

Authors: Liming Wang, Muhammad Jehanzeb Mirza, Yishu Gong, Yuan Gong, Jiaqi Zhang, Brian H. Tracey, Katerina Placek, Marco Vilela, James R. Glass

Abstract: This paper presents a novel theoretical framework for understanding how diffusion models can learn disentangled representations. Within this framework, we establish identifiability conditions for general disentangled latent variable models, analyze training dynamics, and derive sample complexity bounds for disentangled latent subspace models. To validate our theory, we conduct disentanglement experiments across diverse tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Additionally, our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.

Submitted 31 March, 2025; originally announced April 2025.

arXiv:2501.06263 [pdf, other]

doi 10.1109/LRA.2025.3527306

GelBelt: A Vision-based Tactile Sensor for Continuous Sensing of Large Surfaces

Authors: Mohammad Amin Mirzaee, Hung-Jui Huang, Wenzhen Yuan

Abstract: Scanning large-scale surfaces is widely demanded in surface reconstruction applications and detecting defects in industries' quality control and maintenance stages. Traditional vision-based tactile sensors have shown promising performance in high-resolution shape reconstruction while suffering limitations such as small sensing areas or susceptibility to damage when slid across surfaces, making them unsuitable for continuous sensing on large surfaces. To address these shortcomings, we introduce a novel vision-based tactile sensor designed for continuous surface sensing applications. Our design uses an elastomeric belt and two wheels to continuously scan the target surface. The proposed sensor showed promising results in both shape reconstruction and surface fusion, indicating its applicability. The dot product of the estimated and reference surface normal map is reported over the sensing area and for different scanning speeds. Results indicate that the proposed sensor can rapidly scan large-scale surfaces with high accuracy at speeds up to 45 mm/s.
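The normal-map comparison mentioned in this abstract can be illustrated with a short sketch. It assumes each map is a flat list of non-zero 3D normal vectors, one per pixel, and that the per-pixel dot products are averaged over the sensing area; the abstract does not specify the aggregation, so the averaging and the names `normalize` and `mean_normal_dot` are assumptions.

```python
import math


def normalize(v):
    """Scale a non-zero 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def mean_normal_dot(estimated, reference):
    """Average dot product between matching unit normals of two normal maps."""
    if not estimated or len(estimated) != len(reference):
        raise ValueError("normal maps must be non-empty and have the same number of pixels")
    total = 0.0
    for e, r in zip(estimated, reference):
        e_u, r_u = normalize(e), normalize(r)
        total += sum(a * b for a, b in zip(e_u, r_u))
    return total / len(estimated)


# Two hypothetical pixels: the first normal matches, the second is perpendicular.
estimated = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
print(mean_normal_dot(estimated, reference))
```

A score of 1 means the estimated normals coincide with the reference everywhere, and lower values indicate larger angular error.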

Submitted 9 January, 2025; originally announced January 2025.

Comments: Accepted to IEEE RA-L. 8 pages, 7 figures, webpage: https://aminmirz.github.io/GelBelt/

arXiv:2411.13317 [pdf, other]

Teaching VLMs to Localize Specific Objects from In-context Examples

Authors: Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hilde Kuehne, Raja Giryes, Rogerio Feris, Leonid Karlinsky, James Glass, Assaf Arbelle, Shimon Ullman, M. Jehanzeb Mirza

Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important in cases of ambiguity of several related objects that can respond to a text or an object that is hard to describe with words. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances the few-shot localization performance of recent VLMs ranging from 7B to 72B in size, without sacrificing generalization, as demonstrated on several benchmarks tailored towards evaluating personalized localization abilities. This work is the first to explore and benchmark personalized few-shot localization for VLMs -- exposing critical weaknesses in present-day VLMs, and laying a foundation for future research in context-driven vision-language applications.

Submitted 12 March, 2025; v1 submitted 20 November, 2024; originally announced November 2024.

arXiv:2410.10783 [pdf, other]

LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Authors: Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Chosen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes

Abstract: The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models we propose LiveXiv: A scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.

Submitted 22 April, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

arXiv:2410.06154 [pdf, other]

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Authors: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass

Abstract: In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to their fitness for the downstream vision task. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of prompts preferred by the downstream VLM. Furthermore, we explicitly guide the LLM's generation at each optimization step by adding an offset vector -- calculated from the embedding differences between previous positive and negative solutions -- to the intermediate layer of the network for the next generation. This offset vector biases the LLM generation toward the type of language the downstream VLM prefers, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on two tasks: object recognition and the critical task of enhancing VLM safety. Our GLOV shows performance improvement by up to 15.0% and 57.5% for dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LlaVA) models for object recognition and reduces the attack success rate (ASR) on state-of-the-art VLMs by up to 60.7%.
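The offset-vector guidance described in this abstract can be sketched in a few lines. The sketch assumes the offset is the mean embedding of the positive solutions minus the mean embedding of the negative ones, added to an intermediate activation with a scaling factor. The abstract only says the offset is "calculated from the embedding differences", so this exact form, the `scale` parameter, and all function names are illustrative assumptions and not the released GLOV code.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def offset_vector(positive_embeddings, negative_embeddings):
    """Assumed form: mean positive embedding minus mean negative embedding."""
    pos = mean_vector(positive_embeddings)
    neg = mean_vector(negative_embeddings)
    return [p - q for p, q in zip(pos, neg)]


def apply_offset(hidden_state, offset, scale=1.0):
    """Add the scaled offset to a hidden activation; `scale` is a hypothetical knob."""
    return [h + scale * o for h, o in zip(hidden_state, offset)]


# Hypothetical 2-dimensional embeddings of well-ranked and poorly-ranked prompts.
positives = [[1.0, 2.0], [3.0, 4.0]]
negatives = [[0.0, 1.0], [2.0, 1.0]]
offset = offset_vector(positives, negatives)
steered = apply_offset([0.5, 0.5], offset, scale=0.5)
print(steered)
```

In an actual LLM the addition would happen inside a chosen intermediate layer during generation; plain lists stand in for activation tensors here to keep the example dependency-free.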

Submitted 5 February, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

Comments: Code: https://github.com/jmiemirza/GLOV

arXiv:2410.00700 [pdf, other]

Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

Authors: Saurav Jha, Shiqi Yang, Masato Ishii, Mengjie Zhao, Christian Simon, Muhammad Jehanzeb Mirza, Dong Gong, Lina Yao, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones -- a challenge that continual personalization (CP) aims to solve. Inspired by the successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known as diffusion classifier (DC) scores, for continual personalization of text-to-image diffusion models. Namely, we propose using DC scores for regularizing the parameter-space and function-space of text-to-image diffusion models, to achieve continual personalization. Using several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA, and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and parameter overhead, respectively, over the state-of-the-art. Our project page: https://srvcodes.github.io/continual_personalization/

Submitted 9 February, 2025; v1 submitted 1 October, 2024; originally announced October 2024.

Comments: Accepted to ICLR 2025

arXiv:2407.18309 [pdf]

Adaptive Terminal Sliding Mode Control Using Deep Reinforcement Learning for Zero-Force Control of Exoskeleton Robot Systems

Authors: Morteza Mirzaee, Reza Kazemi

Abstract: This paper introduces a novel zero-force control method for upper-limb exoskeleton robots, which are used in a variety of applications including rehabilitation, assistance, and human physical capability enhancement. The proposed control method employs an Adaptive Integral Terminal Sliding Mode (AITSM) controller, combined with an exponential reaching law and Proximal Policy Optimization (PPO), a type of Deep Reinforcement Learning (DRL). The PPO system incorporates an attention mechanism and Long Short-Term Memory (LSTM) neural networks, enabling the controller to selectively focus on relevant system states, adapt to changing behavior, and capture long-term dependencies. This controller is designed to manage a 5-DOF upper-limb exoskeleton robot with zero force, even amidst system uncertainties. The controller uses an integral terminal sliding surface to ensure finite-time convergence to the desired state, a crucial feature for applications requiring quick responses. It also includes an exponential switching control term to reduce chattering and improve system accuracy. The controller's adaptability, facilitated by the PPO system, allows real-time parameter adjustments based on system feedback, making the controller robust and capable of dealing with uncertainties and disturbances that could affect the performance of the exoskeleton. The proposed control method's effectiveness and superiority are confirmed through numerical simulations and comparisons with existing control methods.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.06315 [pdf, other]

Shedding More Light on Robust Classifiers under the lens of Energy-based Models

Authors: Mujtaba Hussain Mirza, Maria Rosaria Briglia, Senad Beadini, Iacopo Masi

Abstract: By reinterpreting a robust discriminative classifier as an Energy-based Model (EBM), we offer a new take on the dynamics of adversarial training (AT). Our analysis of the energy landscape during AT reveals that untargeted attacks generate adversarial images much more in-distribution (lower energy) than the original data from the point of view of the model. Conversely, we observe the opposite for targeted attacks. On the basis of our thorough analysis, we present new theoretical and practical results that show how interpreting AT energy dynamics unlocks a better understanding: (1) AT dynamics are governed by three phases, and robust overfitting occurs in the third phase with a drastic divergence between natural and adversarial energies; (2) by rewriting the loss of TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization (TRADES) in terms of energies, we show that TRADES implicitly alleviates overfitting by means of aligning the natural energy with the adversarial one; (3) we empirically show that all recent state-of-the-art robust classifiers are smoothing the energy landscape, and we reconcile a variety of studies about understanding AT and weighting the loss function under the umbrella of EBMs. Motivated by rigorous evidence, we propose Weighted Energy Adversarial Training (WEAT), a novel sample weighting scheme that yields robust accuracy matching the state-of-the-art on multiple benchmarks such as CIFAR-10 and SVHN and going beyond in CIFAR-100 and Tiny-ImageNet. We further show that robust classifiers vary in the intensity and quality of their generative capabilities, and offer a simple method to push this capability, reaching a remarkable Inception Score (IS) and FID using a robust classifier without training for generative modeling. The code to reproduce our results is available at http://github.com/OmnAI-Lab/Robust-Classifiers-under-the-lens-of-EBM/ .

Submitted 10 September, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted at European Conference on Computer Vision (ECCV) 2024

arXiv:2407.00346 [pdf, other]

Photon routing in disordered chiral waveguide QED ladders: Interplay between photonic localization and collective atomic effects

Authors: Nishan Amgain, Imran M. Mirza

Abstract: In recent years, photon routing has garnered considerable research activity due to its key applications in quantum networking and optical communications. This paper studies the single photon routing scheme in many-emitter disordered chiral waveguide quantum electrodynamics (wQED) ladders. The wQED ladder consists of two one-dimensional lossless waveguides simultaneously and chirally coupled with a chain of dipole-dipole interacting two-level quantum emitters (QEs) or atoms. In particular, we analyze how a departure from the periodic placement of the QEs due to temperature-induced position disorder can impact the routing probability. This involves analyzing how the interplay between the collective atomic effects originating from the dipole-dipole interaction and disorder in the atomic location leading to single-photon localization can change the routing probabilities. As for some key results, we find that the routing probability exhibits a considerable improvement (more than 90% value) for periodic and disordered wQED ladders when considering lattices consisting of twenty QEs. This robustness of collective effects against spontaneous emission loss and weak disorders is further confirmed by examining the routing efficiency and localization length for up to twenty QE chains. These results may find applications in quantum networking and distributed quantum computing under the realistic conditions of imperfect emitter trappings.

Submitted 29 June, 2024; originally announced July 2024.

Comments: 10 pages, 6 figures

arXiv:2406.09240 [pdf, other]

Comparison Visual Instruction Tuning

Authors: Wei Lin, Muhammad Jehanzeb Mirza, Sivan Doveh, Rogerio Feris, Raja Giryes, Sepp Hochreiter, Leonid Karlinsky

Abstract: Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, allowing automatic targeted refinement of those resources increasing their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Project page: https://wlin-at.github.io/cad_vi ; Huggingface dataset repo: https://huggingface.co/datasets/wlin21at/CaD-Inst

arXiv:2406.08164 [pdf, other]

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Authors: Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky

Abstract: Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.

Submitted 12 November, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: NeurIPS 2024 Camera Ready

arXiv:2406.06638 [pdf, other]

Particle Multi-Axis Transformer for Jet Tagging

Authors: Muhammad Usman, M Husnain Shahid, Maheen Ejaz, Ummay Hani, Nayab Fatima, Abdul Rehman Khan, Asifullah Khan, Nasir Majid Mirza

Abstract: Jet tagging is an essential categorization problem in high energy physics. In recent times, Deep Learning has not only risen to the challenge of jet tagging but also significantly improved its performance. In this article, we proposed an idea of a new architecture, Particle Multi-Axis transformer (ParMAT) which is a modified version of Particle transformer (ParT). ParMAT contains local and global spatial interactions within a single unit which improves its ability to handle various input lengths. We trained our model on JETCLASS, a publicly available large dataset that contains 100M jets of 10 different classes of particles. By integrating a parallel attention mechanism and pairwise interactions of particles in the attention mechanism, ParMAT achieves robustness and higher accuracy over the ParT and ParticleNet. The scalability of the model to huge datasets and its ability to automatically extract essential features demonstrate its potential for enhancing jet tagging.

Submitted 16 July, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

arXiv:2404.10534 [pdf, other]

Into the Fog: Evaluating Robustness of Multiple Object Tracking

Authors: Nadezda Kirillova, M. Jehanzeb Mirza, Horst Bischof, Horst Possegger

Abstract: State-of-the-art Multiple Object Tracking (MOT) approaches have shown remarkable performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear weather scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of trackers against these challenging conditions remains underexplored. To address this gap, we introduce physics-based volumetric fog simulation method for arbitrary MOT datasets, utilizing frame-by-frame monocular depth estimation and a fog formation optical model. We enhance our simulation by rendering both homogeneous and heterogeneous fog and propose to use the dark channel prior method to estimate atmospheric light, showing promising results even in night and indoor scenes. We present the leading benchmark MOTChallenge (third release) augmented with fog (smoke for indoor scenes) of various intensities and conduct a comprehensive evaluation of MOT methods, revealing their limitations under fog and fog-like challenges.

Submitted 13 November, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Journal ref: BMVC 2024

arXiv:2403.12736 [pdf, other]

Towards Multimodal In-Context Learning for Vision & Language Models

Authors: Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

Abstract: State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (eg image captioning, question answers, etc), still little emphasis has been put on transferring one of the core LLM capability of In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task with a few examples demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that the state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot demonstrations (in an ICL way), likely due to their lack of direct ICL instruction tuning. To enhance the ICL abilities of the present VLM, we propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes, leading up to a significant 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines and a variety of ICL benchmarks. Furthermore, we also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art.

Submitted 17 July, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.11755 [pdf, other]

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger

Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.

Submitted 7 August, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: ECCV Camera Ready. Code & Data: https://jmiemirza.github.io/Meta-Prompting/

arXiv:2403.11691 [pdf, other]

TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models

Authors: Lisa Weijler, Muhammad Jehanzeb Mirza, Leon Sick, Can Ekkazan, Pedro Hermosilla

Abstract: Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g. DINOv2) as a self-supervised objective for adaptation to distribution shifts at test-time. Given access to paired image-pointcloud (2D-3D) data, we first optimize a 3D segmentation backbone for the main task of semantic segmentation using the pointclouds and the task of 2D $\to$ 3D KD by using an off-the-shelf 2D pre-trained foundation model. At test-time, our TTT-KD updates the 3D segmentation backbone for each test sample, by using the self-supervised task of knowledge distillation, before performing the final prediction. Extensive evaluations on multiple indoor and outdoor 3D segmentation benchmarks show the utility of TTT-KD, as it improves performance for both in-distribution (ID) and out-of-distribution (OOD) test datasets. We achieve a gain of up to 13% mIoU (7% on average) when the train and test distributions are similar and up to 45% (20% on average) when adapting to OOD test samples.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.09193 [pdf, other]

Can We Talk Models Into Seeing the World Differently?

Authors: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, Janis Keuper

Abstract: Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the vision encoder come with their own set of biases, cue preferences, and shortcuts, which have been rigorously studied in uni-modal models. A timely question is how such (potentially misaligned) biases and cue preferences behave under multi-modal fusion in VLMs. As a first step towards a better understanding, we investigate a particularly well-studied vision-only bias - the texture vs. shape bias and the dominance of local over global information. As expected, we find that VLMs inherit this bias to some extent from their vision encoders. Surprisingly, the multi-modality alone proves to have important effects on the model behavior, i.e., the joint training and the language querying change the way visual cues are processed. While this direct impact of language-informed training on a model's visual perception is intriguing, it raises further questions on our ability to actively steer a model's output so that its prediction is based on particular visual cues of the user's choice. Interestingly, VLMs have an inherent tendency to recognize objects based on shape information, which is different from what a plain vision encoder would do. Further active steering towards shape-based classifications through language prompts is however limited.
In contrast, active VLM steering towards texture-based decisions through simple natural language prompts is often more successful. URL: https://github.com/paulgavrikov/vlm_shapebias

Submitted 5 March, 2025; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: Accepted at ICLR 2025

arXiv:2401.15231 [pdf, other]

doi 10.1364/JOSAB.520000

Band Gap Engineering and Controlling Transport Properties of Single Photons in Periodic and Disordered Jaynes-Cummings Arrays

Authors: Tiberius Berndsen, Nishan Amgain, Imran M. Mirza

Abstract: We theoretically study the single photon transport properties in periodic and position-disordered Jaynes-Cummings (or JC) arrays of waveguide-coupled microtoroidal ring resonators, each interacting with a single two-level quantum emitter. Employing the real-space formalism of quantum optics, we focus on various parameter regimes of cavity quantum electrodynamics (cQED) to gain better control of single photon propagation in such a many-body quantum optical setting. As for some of the key findings, we observe that the periodic setting leads to the formation of the band structure in the photon transmission spectra, which is most evident in the strong coupling regime of cQED. However, under the resonant conditions with no losses, the application of Bloch's theorem indicates that the width of forbidden gaps can be altered by tuning the emitter-cavity coupling to small values. Moreover, in the disordered case, we find that the single photon transmission curves show the disappearance of band formation. However, spectral features originating from cQED interactions observed for single atom-cavity problem remain robust against weak-disordered conditions. The results of this work may find application in the study of quantum many-body effects in the optical domain as well as in different areas of quantum computation and quantum networking.

Submitted 26 January, 2024; originally announced January 2024.

Comments: 12 pages, 5 figures

Journal ref: J. Opt. Soc. Amer. B; Vol. 41, Issue 8, pp. C9-C19 (2024)

arXiv:2309.06809 [pdf, other]

TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, Horst Bischof

Abstract: Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been shown that it is possible to effectively tune VLMs without any paired data, and in particular to effectively improve VLMs visual recognition performance using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore ways it can be significantly further improved taking the specifics of the downstream task into account when sampling text data from LLMs. In particular, compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross) domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification compared to strong baselines.

Submitted 13 September, 2023; originally announced September 2023.

Comments: Code is available at: https://github.com/jmiemirza/TAP

arXiv:2308.12441 [pdf, other]

doi 10.1038/s41598-024-61043-0

Chirality-assisted enhancement of tripartite entanglement in waveguide QED

Authors: Logan Patrick, Umar Arshad, Dingyu Guo, Imran M. Mirza

Abstract: We study the generation and control of genuine tripartite entanglement among quantum emitters (QEs) that are side coupled to one-dimensional spin-momentum locked (or chiral) waveguides. By applying the machinery of Fock state master equations along with the recently proposed concurrence fill measure of tripartite entanglement [S. Xie and J. H. Eberly, Phys. Rev. Lett. 127, 040403 (2021)], we analyze how three-photon Gaussian wavepackets can distribute entanglement among two and three QEs. We show that with a five times larger waveguide decay rate in the right direction as compared to the left direction, the maximum value of tripartite entanglement can be elevated by 35% as compared to the symmetric scenario where both left and right direction decay rates are equal. Additionally, chirality can maintain the tripartite entanglement for longer times in comparison to the corresponding symmetric decay rate situation. Finally, we study the influence of detunings and spontaneous emission on the resulting entanglement. We envision quantum networking and long-distance quantum communication as two main areas of applications of this work.

Submitted 23 August, 2023; originally announced August 2023.

Comments: 14 pages, 6 figures

Journal ref: Sci Rep 14, 11175 (2024)

arXiv:2308.01096 [pdf, other]

Learning Fourier-Constrained Diffusion Bridges for MRI Reconstruction

Authors: Muhammad U. Mirza, Onat Dalmaz, Hasan A. Bedel, Gokberk Elmas, Yilmaz Korkmaz, Alper Gungor, Salman UH Dar, Tolga Çukur

Abstract: Deep generative models have gained recent traction in accelerated MRI reconstruction. Diffusion priors are particularly promising given their representational fidelity. Instead of the target transformation from undersampled to fully-sampled data required for MRI reconstruction, common diffusion priors are trained to learn a task-agnostic transformation from an asymptotic start-point of Gaussian noise onto the finite end-point of fully-sampled data. During inference, data-consistency projections are injected in between reverse diffusion steps to reach a compromise solution within the span of both the trained diffusion prior and the imaging operator for an accelerated MRI acquisition. Unfortunately, performance losses can occur due to the discrepancy between target and learned transformations given the asymptotic normality assumption in diffusion priors. To address this discrepancy, here we introduce a novel Fourier-constrained diffusion bridge (FDB) for MRI reconstruction that transforms between a finite start-point of moderately undersampled data and an end-point of fully-sampled data. We derive the theoretical formulation of FDB as a generalized diffusion process based on a stochastic degradation operator that performs random spatial-frequency removal. We propose an enhanced sampling algorithm with a learned correction term for soft dealiasing across reverse diffusion steps. Demonstrations on brain MRI indicate that FDB outperforms state-of-the-art methods including non-diffusion and diffusion priors.

Submitted 16 December, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

arXiv:2307.03836 [pdf, other]

doi 10.1103/PhysRevA.108.063702

Electromagnetically induced transparency in many-emitter waveguide quantum electrodynamics: linear versus nonlinear waveguide dispersions

Authors: Tiberius Berndsen, Imran M. Mirza

Abstract: We study single-photon induced electromagnetically induced transparency (EIT) in many-emitter waveguide quantum electrodynamics (wQED) with linear and nonlinear waveguide dispersion relations. In the single-emitter problem, in addition to the robustness of the EIT spectral features in the over-coupled regime of wQED, we find that the nonlinear dispersion results in the appearance of a side peak for frequencies smaller than the resonant EIT frequency which turns into a pronounced plateau as the nonlinearity is enhanced. Consequently, for many-emitter scenarios, our results indicate the formation of band structure which for higher values of nonlinearity leads to narrow band gaps as compared to the corresponding linear dispersion case. Long-distance quantum networking aided with quantum memories can serve as one of the targeted applications of this work.

Submitted 11 July, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

Comments: 7 pages, 11 figures

Journal ref: Phys. Rev. A 108, 063702 (2023)

arXiv:2305.18953 [pdf, other]

Sit Back and Relax: Learning to Drive Incrementally in All Weather Conditions

Authors: Stefan Leitner, M. Jehanzeb Mirza, Wei Lin, Jakub Micorek, Marc Masana, Mateusz Kozinski, Horst Possegger, Horst Bischof

Abstract: In autonomous driving scenarios, current object detection models show strong performance when tested in clear weather. However, their performance deteriorates significantly when tested in degrading weather conditions. In addition, even when adapted to perform robustly in a sequence of different weather conditions, they are often unable to perform well in all of them and suffer from catastrophic forgetting. To efficiently mitigate forgetting, we propose Domain-Incremental Learning through Activation Matching (DILAM), which employs unsupervised feature alignment to adapt only the affine parameters of a clear weather pre-trained network to different weather conditions. We propose to store these affine parameters as a memory bank for each weather condition and plug-in their weather-specific parameters during driving (i.e. test time) when the respective weather conditions are encountered. Our memory bank is extremely lightweight, since affine parameters account for less than 2% of a typical object detector. Furthermore, contrary to previous domain-incremental learning approaches, we do not require the weather label when testing and propose to automatically infer the weather condition by a majority voting linear classifier.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Intelligent Vehicle Conference (oral presentation)

arXiv:2305.18287 [pdf, other]

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Mateusz Kozinski, Horst Possegger, Rogerio Feris, Horst Bischof

Abstract: Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification enabling open-vocabulary recognition of potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zeroshot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated using a Large Language Model (LLM) describing the categories of interest and effectively substituting labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating absolute improvement of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision.

Submitted 23 October, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023 (Camera Ready) - Project Page: https://jmiemirza.github.io/LaFTer/

arXiv:2305.12654 [pdf, other]

doi 10.1155/2023/8897375

Theoretical and Experimental Challenges in the Measurement of Neutrino Mass

Authors: Jyotsna Singh, M. Ibrahim Mirza

Abstract: Neutrino masses are yet unknown. We discuss the present state of effective electron anti-neutrino mass from $β$ decay experiments; effective Majorana neutrino mass from neutrinoless double-beta decay experiments; neutrino mass squared differences from neutrino oscillation: solar, atmospheric, reactor and accelerator based experiments; sum of neutrino masses from cosmological observations. Current experimental challenges in the determination of neutrino masses are briefly discussed. The main focus is devoted to contemporary experiments.

Submitted 28 September, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: 14 pages, 6 figures

Journal ref: Advances in High Energy Physics, vol. 2023, Article ID 8897375, 14 pages, 2023

arXiv:2303.14131 [pdf, ps, other]

Phase Space Analysis of Fluorine-Oxygen-Nitrogen Network and Energy Generation in Flourine-Oxygen Reaction

Authors: Babur M. Mirza

Abstract: Reaction network of fluorine-18, oxygen-15 and nitrogen-15 is considered for its temperature dependent energy output. The main reactions for generation and annihilation of oxygen and fluorine are coupled in the reaction equations while nitrogen is produced as a decay product. We find that the governing set of equations for F18(p, alpha)O15 process in the phase diagram exhibit a predominance of the direct reaction rather than the reverse reaction consuming O15. The time-scale determined by the exact solution of this system yields relatively short time-scale conversion of fluorine into oxygen indicating an energy generation of 2.8MeV per reaction. The temperature dependence shows that the effective reaction occurs at temperature about 0.04GK or above.

Submitted 11 March, 2023; originally announced March 2023.

arXiv:2302.13627 [pdf, other]

doi 10.1103/PhysRevA.107.033507

Nonreciprocal slow or fast light in anti-$\mathcal{PT}$-symmetric optomechanics

Authors: Meiyu Peng, Huilai Zhang, Qian Zhang, Tian-Xiang Lu, Imran M. Mirza, Hui Jing

Abstract: Non-Hermitian systems with anti-parity-time ($\mathcal{APT}$) symmetry have revealed rich physics beyond conventional systems. Here, we study optomechanics in an $\mathcal{APT}$-symmetric spinning resonator and show that, by tuning the rotating speed to approach the exceptional point (EP) or the non-Hermitian spectral degeneracy, nonreciprocal light transmission with a high isolation ratio can be realized. Accompanying this process, nonreciprocal group delay or advance is also identified in the vicinity of EP. Our work sheds new light on manipulating laser propagation with optomechanical EP devices and, in a broader view, can be extended to explore a wide range of $\mathcal{APT}$-symmetric effects, such as $\mathcal{APT}$-symmetric phonon lasers, $\mathcal{APT}$-symmetric topological effects, and $\mathcal{APT}$-symmetric force sensing or accelerator.

Submitted 27 February, 2023; originally announced February 2023.

Comments: 9 pages, 4 figures. It has been accepted for publication as a Regular Article in Physical Review A

arXiv:2212.09729 [pdf]

Bistable perception, precision and neuromodulation

Authors: Filip Novicky, Thomas Parr, Karl Friston, M. Berk Mirza, Noor Sajid

Abstract: Bistable perception follows from observing a static, ambiguous, (visual) stimulus with two possible interpretations. Here, we present an active (Bayesian) inference account of bistable perception and posit that perceptual transitions between different interpretations (i.e., inferences) of the same stimulus ensue from specific eye movements that shift the focus to a different visual feature. Formally, these inferences are a consequence of precision control that determines how confident beliefs are and change the frequency with which one can perceive - and alternate between - two distinct percepts. We hypothesised that there are multiple, but distinct, ways in which precision modulation can interact to give rise to a similar frequency of bistable perception. We validated this using numerical simulations of the Necker's cube paradigm and demonstrate the multiple routes that underwrite the frequency of perceptual alternation. Our results provide an (enactive) computational account of the intricate precision balance underwriting bistable perception. Importantly, these precision parameters can be considered the computational homologues of particular neurotransmitters - i.e., acetylcholine, noradrenaline, dopamine - that have been previously implicated in controlling bistable perception, providing a computational link between the neurochemistry and perception.

Submitted 19 December, 2022; originally announced December 2022.

arXiv:2211.15393 [pdf, other]

Video Test-Time Adaptation for Action Recognition

Authors: Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, Horst Bischof

Abstract: Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a step. It consists in a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance on both, the state of the art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts. Code will be available at https://github.com/wlin-at/ViTTA.

Submitted 20 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted at CVPR 2023

arXiv:2211.12870 [pdf, other]

ActMAD: Activation Matching to Align Distributions for Test-Time-Training

Authors: Muhammad Jehanzeb Mirza, Pol Jané Soneira, Wei Lin, Mateusz Kozinski, Horst Possegger, Horst Bischof

Abstract: Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the distribution of entire channels in the ultimate layer of the feature extractor, we model the distribution of each feature in multiple layers across the network. This results in a more fine-grained supervision and makes ActMAD attain state of the art performance on CIFAR-100C and Imagenet-C. ActMAD is also architecture- and task-agnostic, which lets us go beyond image classification, and score 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Our experiments highlight that ActMAD can be applied to online adaptation in realistic scenarios, requiring little data to attain its full performance.

Submitted 23 March, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

Comments: CVPR 2023 - Project Page: https://jmiemirza.github.io/ActMAD/

arXiv:2211.11432 [pdf, other]

MATE: Masked Autoencoders are Online 3D Test-Time Learners

Authors: M. Jehanzeb Mirza, Inkyu Shin, Wei Lin, Andreas Schriebl, Kunyang Sun, Jaesung Choe, Horst Possegger, Mateusz Kozinski, In So Kweon, Kun-Jin Yoon, Horst Bischof

Abstract: Our MATE is the first Test-Time-Training (TTT) method designed for 3D data, which makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is removed before it is fed to the network, tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We test MATE on several 3D object classification datasets and show that it significantly improves robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. We show that MATE is very efficient in terms of the fraction of points it needs for the adaptation. It can effectively adapt given as few as 5% of tokens of each test sample, making it extremely lightweight. Our experiments show that MATE also achieves competitive performance by adapting sparsely on the test data, which further reduces its computational overhead, making it ideal for real-time applications.

Submitted 20 March, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: Code is available at this repository: https://github.com/jmiemirza/MATE

arXiv:2211.05854 [pdf, other]

Test-time adversarial detection and robustness for localizing humans using ultra wide band channel impulse responses

Authors: Abhiram Kolli, Muhammad Jehanzeb Mirza, Horst Possegger, Horst Bischof

Abstract: Keyless entry systems in cars are adopting neural networks for localizing their operators. Using test-time adversarial defences equips such systems with the ability to defend against adversarial attacks without prior training on adversarial samples. We propose a test-time adversarial example detector which detects the input adversarial example through quantifying the localized intermediate responses of a pre-trained neural network and confidence scores of an auxiliary softmax layer. Furthermore, in order to make the network robust, we extenuate the non-relevant features by non-iterative input sample clipping. Using our approach, mean performance over 15 levels of adversarial perturbations is increased by 55.33% for the fast gradient sign method (FGSM) and 6.3% for both the basic iterative method (BIM) and the projected gradient method (PGD).

Submitted 10 November, 2022; originally announced November 2022.

Comments: 5 pages, 4 figures, ICASSP Conference

arXiv:2210.07330 [pdf, other]

doi 10.1364/JOSAB.478320

Engineering Optomechanically Induced Transparency by coupling a qubit to a spinning resonator

Authors: Jessica Burns, Owen Root, Hui Jing, Imran M. Mirza

Abstract: We theoretically study the spectral properties of a pump-probe driven hybrid spinning optomechanical ring resonator optically coupled with a two-level quantum emitter (QE or qubit). Recently we have shown [arXiv:1810.03709] that in the absence of the emitter the coupled cavity version of this setup is not only capable of nonreciprocal light propagation but can also exhibit slow & fast light propagation. In this work, we investigate in what ways the presence of a single QE coupled with the optical whispering gallery modes of the spinning optomechanical resonator can alter the probe light nonreciprocity. Under the weak-excitation assumption and mean-field approximation, we find that the interplay between the rotational/spinning Sagnac-effect and the qubit coupling can lead to the enhancement both in the optomechanically induced transparency (OMIT) peak value and in the width of the transparency window due to the opening of qubit-assisted back reflection channel. However, compared to the no-qubit case, we notice that such an enhancement comes at the cost of degrading the group delay in probe light transmission by a factor of 1/2 for clockwise rotary directions. The target applications of these results can be in the areas of quantum circuitry and in non-reciprocal quantum communication protocols where QEs are a key component.

Submitted 13 October, 2022; originally announced October 2022.

Comments: 9 pages, 6 figures

Journal ref: J. Opt. Soc. Am. Vol. 40, Issue 5, pp. 958-965 (2023)

arXiv:2209.07598 [pdf, other]

doi 10.1063/5.0161061

Mitigation Strategies for ${}^{42}$Ar/${}^{42}$K Background Reduction using Encapsulation with Ultra-Pure Plastic for the LEGEND Experiment

Authors: M. Ibrahim Mirza

Abstract: Neutrinoless double-beta (0$νββ$) decay is the most compelling approach to determine the Majorana nature of neutrino and measure effective Majorana neutrino mass. The LEGEND collaboration is aiming to look for 0$νββ$ decay of ${}^{76}$Ge with unprecedented sensitivity. If underground-sourced argon is not available, the cosmogenically-induced isotope ${}^{42}$Ar and its decay progeny ${}^{42}$K in the liquid argon active veto could create a challenging background for the 0$νββ$ signal. We are studying methodologies to mitigate the ${}^{42}$K background. In order to achieve this, encapsulation of germanium detectors with 3D-printed technologies using low background material is currently under investigation. Simulation results of Poly(ethylene 2,6-naphthalate) (PEN) encapsulation of germanium detectors and plans to study other potential materials are presented.

Submitted 6 October, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: 4 pages, 3 figures, conference

Journal ref: AIP Conf. Proc. 2908, 100006 (2023)

arXiv:2206.15340 [pdf, ps, other]

Work Extracting From Nonextensive Small System With Feedback and Second Law-Like Inequalities with Quantum Tsallis Entropy

Authors: Saman Amiri, Mahdi Mirzaee, Mohammad Mazhari

Abstract: Gibbs-Boltzmann entropy leads to systems that have a strong dependence on initial conditions. In reality, most materials behave quite independently of initial conditions. Nonextensive entropy or Tsallis entropy leads to nonextensive statistical mechanics. In this paper, we calculate the Tsallis form of Clausius inequality and then determine the upper bound for extracting work from the small system in the nonextensive statistical mechanics with mutual information. In the following, we extract mutual information and adjust Maxwell's demon with quantum feedback control.

Submitted 30 June, 2022; originally announced June 2022.

Comments: 13 pages

arXiv:2204.08817 [pdf, other]

An Efficient Domain-Incremental Learning Approach to Drive in All Weather Conditions

Authors: M. Jehanzeb Mirza, Marc Masana, Horst Possegger, Horst Bischof

Abstract: Although deep neural networks enable impressive visual perception performance for autonomous driving, their robustness to varying weather conditions still requires attention. When adapting these models for changed environments, such as different weather conditions, they are prone to forgetting previously learned information. This catastrophic forgetting is typically addressed via incremental learning approaches which usually re-train the model by either keeping a memory bank of training samples or keeping a copy of the entire model or model parameters for each scenario. While these approaches show impressive results, they can be prone to scalability issues and their applicability for autonomous driving in all weather conditions has not been shown. In this paper we propose DISC -- Domain Incremental through Statistical Correction -- a simple online zero-forgetting approach which can incrementally learn new tasks (i.e., weather conditions) without requiring re-training or expensive memory banks. The only information we store for each task are the statistical parameters as we categorize each domain by the change in first and second order statistics. Thus, as each task arrives, we simply 'plug and play' the statistical vectors for the corresponding task into the model and it immediately starts to perform well on that task. We show the efficacy of our approach by testing it for object detection in a challenging domain-incremental autonomous driving scenario where we encounter different adverse weather conditions, such as heavy rain, fog, and snow.

Submitted 21 April, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Accepted to CVPR Workshops - Camera Ready Version

arXiv:2204.05901 [pdf, other]

doi 10.1364/OPTCON.459740

Non-Markovianity in photosynthetic reaction centers: A noise-induced quantum coherence perspective

Authors: Zibo Wang, Antonio V. Lim, Imran M. Mirza

Abstract: The long-standing problem of nearly perfect photosynthetic yield in some types of bacteria and nearly all kinds of plants despite the interaction with a hot and noisy environment has witnessed quantum optical explanations in the last decade or so. Typically in these explanations, photosynthetic reaction centers are modeled as five-level quantum heat engines where the generation of Fano-type interference due to the coupling of discrete state transitions with a common Markovian reservoir is held responsible for the enhancement of the photosynthetic efficiency. In this work, we go beyond the Born-Markov approximation used in the earlier works and study the impact of non-Markovian environments with Lorentzian spectral densities on the dynamics of light-harvesting complexes.

Submitted 3 May, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

Comments: 12 pages, 5 figures

Journal ref: Optics Continuum Vol. 1, Issue 8, pp. 1848-1858 (2022)

arXiv:2202.08417 [pdf, other]

Retrieval-Augmented Reinforcement Learning

Authors: Anirudh Goyal, Abram L. Friesen, Andrea Banino, Theophane Weber, Nan Rosemary Ke, Adria Puigdomenech Badia, Arthur Guez, Mehdi Mirza, Peter C. Humphreys, Ksenia Konyushkova, Laurent Sifre, Michal Valko, Simon Osindero, Timothy Lillicrap, Nicolas Heess, Charles Blundell

Abstract: Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. The proposed method facilitates learning agents that at test-time can condition their behavior on the entire dataset and not only the current state, or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.

Submitted 24 May, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

arXiv:2202.00053 [pdf]

doi 10.1088/1402-4896/ac8cbd

Linear and nonlinear analysis of Ion-Temperature-Gradient (ITG) Driven mode in the asymmetric Pair-Ion Magnetoplasma

Authors: Javaria Razzaq, Zahida Ehsan, Arshad M. Mirza

Abstract: We have investigated linear and nonlinear dynamics of ion-temperature-gradient driven drift mode for Maxwellian and non-Maxwellian pair-ion plasma embedded in an inhomogeneous magnetic field having gradients in ion's temperature and number density. Linear dispersion relations are derived and analyzed analytically as well as numerically for different cases. It has been found that growth rate of instability increases with increasing eta. By using the transport equations of the Braginskii model, a set of nonlinear equations is derived. In the nonlinear regime, soliton structures are found to exist. Our numerical analysis shows that amplitude of solitary waves increases by increasing ion to electron number density ratio. These solitary structures are also found to be sensitive to non-thermal kappa and Cairns distributed electrons. Our present work may contribute a good illustration of the observation of nonlinear solitary waves driven by the ITG mode in magnetically confined pair-ion plasmas and space pair-ion plasmas as the formation of localized structures along drift modes is one of the striking reasons for L-H transition in the region of improved confinements in magnetically confined devices like tokamaks.

Submitted 31 January, 2022; originally announced February 2022.

Comments: 18

MSC Class: na

arXiv:2112.07769 [pdf, other]

doi 10.1364/JOSAB.441224

On the dissipative dynamics of entangled states in coupled-cavity quantum electrodynamics arrays

Authors: Imran M. Mirza, Adriana S. Cruz

Abstract: We examine the dissipative dynamics of N00N states with an arbitrary photon number N in two architectures of fiber-coupled optical ring resonators (RRs) interacting with two-level quantum emitters. One architecture consists of a two-way cascaded array of emitter-cavity systems, while in the other architecture we consider two fiber-coupled RRs each coupled to multiple dipole-dipole interacting (DDI) quantum emitters (QEs). Our focus in this paper is to study how an initially prepared multiple-excitation atomic N00N state transfers to the RRs and then how rapidly it decays in these open cavity quantum electrodynamics (CQED) setups while varying the emitter-cavity coupling strengths, emitter-cavity detuning, and backscattering from cavity modes. We present a general theoretical formalism valid for any arbitrary numbers of QEs, RRs, and N number in the N00N state for both schemes. As examples, we discuss the cases of single and two-excitation N00N states and report the comparison of our findings in both schemes. As one of the main results, we conclude that the array scheme tends to store N00N for longer times while the DDI scheme supports higher fidelity values. The results of this study may find applications in designing new multiparty entanglement-based protocols in quantum metrology and quantum lithography.

Submitted 14 December, 2021; originally announced December 2021.

Comments: © 2022 Optica Publishing Group. One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modifications of the content of this paper are prohibited.

Journal ref: J. Opt. Soc. Am. B 39 (1), 177-187 (2022)

arXiv:2112.00463 [pdf, other]

The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization

Authors: M. Jehanzeb Mirza, Jakub Micorek, Horst Possegger, Horst Bischof

Abstract: Domain adaptation is crucial to adapt a learned model to new scenarios, such as domain shifts or changing data distributions. Current approaches usually require a large amount of labeled or unlabeled data from the shifted domain. This can be a hurdle in fields which require continuous dynamic adaptation or suffer from scarcity of data, e.g. autonomous driving in challenging weather conditions. To address this problem of continuous adaptation to distribution shifts, we propose Dynamic Unsupervised Adaptation (DUA). By continuously adapting the statistics of the batch normalization layers we modify the feature representations of the model. We show that by sequentially adapting a model with only a fraction of unlabeled data, a strong performance gain can be achieved. With even less than 1% of unlabeled data from the target domain, DUA already achieves competitive results to strong baselines. In addition, the computational overhead is minimal in contrast to previous approaches. Our approach is simple, yet effective and can be applied to any architecture which uses batch normalization as one of its components. We show the utility of DUA by evaluating it on a variety of domain adaptation datasets and tasks including object recognition, digit recognition and object detection.

Submitted 4 April, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: Accepted to CVPR 2022 - Camera Ready Version - Code: https://github.com/jmiemirza/DUA

arXiv:2110.03363 [pdf, other]

Evaluating model-based planning and planner amortization for continuous control

Authors: Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, Martin Riedmiller

Abstract: There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We find that well-tuned model-free agents are strong baselines even for high DoF control problems but MPC with learned proposals and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency in hard multi-task/multi-goal settings. Finally, we show that it is possible to distil a model-based planner into a policy that amortizes the planning computation without any loss of performance. Videos of agents performing different tasks can be seen at https://sites.google.com/view/mbrl-amortization/home.

Submitted 7 October, 2021; originally announced October 2021.

Comments: 9 pages main text, 30 pages with references and appendix including several ablations and additional experiments. Submitted to ICLR 2022

arXiv:2108.05917

doi 10.1364/OE.449275

Coherent Perfect Absorption in Tavis-Cummings Models

Authors: Zibo Wang, Pawan Khatiwada, Dan Wang, Imran M. Mirza

Abstract: We theoretically study the conditions under which two laser fields can undergo Coherent Perfect Absorption (CPA) when shined on a single-mode bi-directional optical cavity coupled with two two-level quantum emitters (natural atoms, artificial atoms, quantum dots, qubits, etc.). In addition to being indirectly coupled through the cavity-mediated field, in our Tavis-Cummings model the two quantum emitters (QEs) are allowed to interact directly via the dipole-dipole interaction (DDI). Under the mean-field approximation and low-excitation assumption, in this work, we particularly focus on the impact of DDI on the existence of CPA in the presence of decoherence mechanisms (spontaneous emission from the QEs and the leakage of photons from the cavity walls). We also present a dressed-state analysis of the problem to discuss the underlying physics related to the allowed polariton state transitions in the Jaynes-Tavis-Cummings ladder. As a key result, we find that in the strong-coupling regime of cavity quantum electrodynamics, the strong DDI and the emitter-cavity detuning can act together to achieve the CPA at two laser frequencies tunable by the inter-atomic separation which are not possible to attain with a single QE in the presence of detuning. Our CPA results are potentially applicable in building quantum memories that are an essential component in long-distance quantum networking.

Submitted 12 August, 2021; originally announced August 2021.

Comments: 14 pages, 7 figures

arXiv:2107.11462

LEGEND-1000 Preconceptual Design Report

Authors: LEGEND Collaboration, N. Abgrall, I. Abt, M. Agostini, A. Alexander, C. Andreoiu, G. R. Araujo, F. T. Avignone III, W. Bae, A. Bakalyarov, M. Balata, M. Bantel, I. Barabanov, A. S. Barabash, P. S. Barbeau, C. J. Barton, P. J. Barton, L. Baudis, C. Bauer, E. Bernieri, L. Bezrukov, K. H. Bhimani, V. Biancacci, E. Blalock, A. Bolozdynya , et al. (239 additional authors not shown)

Abstract: We propose the construction of LEGEND-1000, the ton-scale Large Enriched Germanium Experiment for Neutrinoless $ββ$ Decay. This international experiment is designed to answer one of the highest priority questions in fundamental physics. It consists of 1000 kg of Ge detectors enriched to more than 90% in the $^{76}$Ge isotope operated in a liquid argon active shield at a deep underground laboratory. By combining the lowest background levels with the best energy resolution in the field, LEGEND-1000 will perform a quasi-background-free search and can make an unambiguous discovery of neutrinoless double-beta decay with just a handful of counts at the decay $Q$ value. The experiment is designed to probe this decay with a 99.7%-CL discovery sensitivity in the $^{76}$Ge half-life of $1.3\times10^{28}$ years, corresponding to an effective Majorana mass upper limit in the range of 9-21 meV, to cover the inverted-ordering neutrino mass scale with 10 yr of live time.

Submitted 23 July, 2021; originally announced July 2021.

Showing 1–50 of 124 results for author: Mirza, M