-
MedGemma Technical Report
Authors:
Andrew Sellergren,
Sahar Kazemzadeh,
Tiam Jaroensri,
Atilla Kiraly,
Madeleine Traverse,
Timo Kohlberger,
Shawn Xu,
Fayaz Jamil,
Cían Hughes,
Charles Lau,
Justin Chen,
Fereshteh Mahvar,
Liron Yatziv,
Tiffany Chen,
Bram Sterling,
Stefanie Anna Baby,
Susanna Maria Baby,
Jeremy Lai,
Samuel Schmidgall,
Lu Yang,
Kejia Chen,
Per Bjornsson,
Shashir Reddy,
Ryan Brush,
Kenneth Philbrick,
et al. (54 additional authors not shown)
Abstract:
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
Submitted 8 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison
Authors:
Aiswarya Baby,
Tintu Thankom Koshy
Abstract:
Visual Question Answering (VQA) has emerged as a pivotal task at the intersection of computer vision and natural language processing, requiring models to understand and reason about visual content in response to natural language questions. Analyzing VQA datasets is essential for developing robust models that can handle the complexities of multimodal reasoning. Several approaches have been developed to examine these datasets, each offering distinct perspectives on question diversity, answer distribution, and visual-textual correlations. Despite significant progress, existing VQA models face challenges related to dataset bias, limited model complexity, commonsense reasoning gaps, rigid evaluation methods, and generalization to real-world scenarios. This paper offers a detailed study of the original VQA dataset, baseline models, and methods, along with a comparative study of five advanced VQA models (ABC-CNN, KICNLE, Masked Vision and Language Modeling, BLIP-2, and OFA), each employing distinct methods to address these ongoing challenges.
Submitted 4 March, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
A New Variant of Benes Network: Its Topological Characterisation and Comparative Analysis
Authors:
Parvez Ali,
Annmaria Baby,
D. Antony Xavier,
Eddith Sarah Varghese,
Theertha Nair A.,
Haidar Ali
Abstract:
The modern era always looks to advancements in technology. The design and topology of interconnection networks play a mutual role in the development of technology. Analysing the topological properties and characteristics of an interconnection network is not an easy task. Graph theory helps in solving this task analytically and efficiently through the use of numerical parameters known as distance-based topological descriptors. These descriptors have considerable applications in various fields such as computer science, chemistry, and biology. This paper deals with the evaluation of topological descriptors for an n-dimensional multistage interconnection network, the Benes network, BB(n). Also, a new variant of interconnection network is derived from the Benes network, named the augmented Benes network and denoted BB^*(n). The topological descriptors for the Benes-derived network are also determined in this work. Further, the Benes network and the augmented Benes network undergo a comparative analysis based on a few network parameters, which helps to understand the efficiency of the newly derived Benes network. A broadcasting algorithm for the augmented Benes network is also provided.
Submitted 25 October, 2024;
originally announced November 2024.
-
Study on (r,s)- Generalised Transformation Graphs, A Novel Perspective Based on Transformation Graphs
Authors:
Parvez Ali,
Annmaria Baby,
D. Antony Xavier,
Theertha Nair A,
Haidar Ali,
Syed Ajaz K. Kirmani
Abstract:
For a graph $\mathbb{Q}=(\mathbb{V},\mathbb{E})$, the transformation graphs are defined as graphs whose vertex set is $\mathbb{V(Q)} \cup \mathbb{E(Q)}$ and whose edge set is defined by certain conditions. In comparison to the structure descriptor of the original graph $\mathbb{Q}$, the topological descriptor of its transformation graphs displays distinct characteristics related to structure. Thus, the descriptors of a compound's transformation graphs can be used to model a variety of structural features of the underlying chemical. In this work, the concept of transformation graphs is extended, giving rise to a novel class of graphs, the $(r,s)$-generalised transformation graphs, whose vertex set is the union of $r$ copies of $\mathbb{V(Q)}$ and $s$ copies of $\mathbb{E(Q)}$, where $r, s \in N$, and whose edge set is defined under certain conditions. Further, this class of graphs is analysed with the help of the first Zagreb index. There are eight main transformation graphs based on the criteria for the edge set, but under the concept of $(r,s)$-generalised transformation graphs, an infinite number of graphs can be described and analysed.
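As a rough illustration of the descriptor mentioned in this abstract: the first Zagreb index of a graph is commonly defined as $M_1(\mathbb{Q}) = \sum_{v \in \mathbb{V(Q)}} \deg(v)^2$. The sketch below computes it for a plain undirected graph given as an edge list. The function name `first_zagreb_index` and the edge-list representation are assumptions made for illustration; the abstract does not state the edge conditions of the $(r,s)$-generalised transformation graphs, so those are not constructed here.

```python
from collections import defaultdict


def first_zagreb_index(edges):
    """Return the sum of squared vertex degrees of an undirected graph.

    `edges` is an iterable of (u, v) pairs. Isolated vertices have
    degree 0 and contribute nothing, so they need not be listed.
    """
    degree = defaultdict(int)
    for u, v in edges:
        # Each edge adds one to the degree of both of its endpoints.
        degree[u] += 1
        degree[v] += 1
    return sum(d * d for d in degree.values())


# Hypothetical example graphs: a path on 3 vertices and a cycle on 4 vertices.
path_p3 = [("a", "b"), ("b", "c")]
cycle_c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

For the path the degrees are 1, 2, 1, and for the cycle every vertex has degree 2, so the index can be checked by hand on both.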
Submitted 11 October, 2024;
originally announced October 2024.
-
Context-based out-of-vocabulary word recovery for ASR systems in Indian languages
Authors:
Arun Baby,
Saranya Vinnaitherthan,
Akhil Kerhalkar,
Pranav Jawale,
Sharath Adavanne,
Nagaraj Adiga
Abstract:
Detecting and recovering out-of-vocabulary (OOV) words is always challenging for Automatic Speech Recognition (ASR) systems. Many existing methods focus on modeling OOV words by modifying acoustic and language models and integrating context words cleverly into models. To train such complex models, we need a large amount of data with context words, additional training time, and increased model size. However, post-processing the ASR transcription to recover context-based OOV words has not been explored much. In this work, we propose a post-processing technique to improve the performance of context-based OOV recovery. We created an acoustically boosted language model with a sub-graph made at the phone level from an OOV word list. We proposed two methods to determine a suitable cost function to retrieve the OOV words based on the context. The cost function is defined based on phonetic and acoustic knowledge for matching and recovering the correct context words in the decode. The effectiveness of the proposed cost function is evaluated at both the word level and the sentence level. The evaluation results show that this approach can recover an average of 50% of context-based OOV words across multiple categories.
Submitted 9 June, 2022;
originally announced June 2022.
-
Non-native English lexicon creation for bilingual speech synthesis
Authors:
Arun Baby,
Pranav Jawale,
Saranya Vinnaitherthan,
Sumukh Badam,
Nagaraj Adiga,
Sharath Adavanne
Abstract:
Bilingual English speakers speak English as one of their languages. Their English is of a non-native kind, and their conversations are of a code-mixed fashion. The intelligibility of a bilingual text-to-speech (TTS) system for such non-native English speakers depends on a lexicon that captures the phoneme sequence used by non-native speakers. However, due to the lack of a non-native English lexicon, existing bilingual TTS systems employ native English lexicons that are widely available, in addition to their native language lexicon. Due to the inconsistency between the non-native English pronunciation in the audio and the native English lexicon in the text, the intelligibility of synthesized speech in such TTS systems is significantly reduced.
This paper is motivated by the knowledge that the native language of the speaker highly influences non-native English pronunciation. We propose a generic approach to obtain rules based on letter-to-phoneme alignment to map a native English lexicon to its non-native version. The effectiveness of such mapping is studied by comparing bilingual (Indian English and Hindi) TTS systems trained with and without the proposed rules. The subjective evaluation shows that the bilingual TTS system trained with the proposed non-native English lexicon rules obtains a 6% absolute improvement in preference.
Submitted 21 June, 2021;
originally announced June 2021.
-
An ASR Guided Speech Intelligibility Measure for TTS Model Selection
Authors:
Arun Baby,
Saranya Vinnaitherthan,
Nagaraj Adiga,
Pranav Jawale,
Sumukh Badam,
Sharath Adavanne,
Srikanth Konjeti
Abstract:
The perceptual quality of neural text-to-speech (TTS) is highly dependent on the choice of the model during training. Selecting the model using a training-objective metric such as the least mean squared error does not always correlate with human perception. In this paper, we propose an objective metric based on the phone error rate (PER) to select the TTS model with the best speech intelligibility. The PER is computed between the input text to the TTS model and the text decoded from the synthesized speech using an automatic speech recognition (ASR) model, which is trained on the same data as the TTS model. With the help of subjective studies, we show that the TTS model chosen with the least PER on the validation split has significantly higher speech intelligibility compared to the model with the least training-objective metric loss. Finally, using the proposed PER and subjective evaluation, we show that the choice of the best TTS model depends on the genre of the target domain text. All our experiments are conducted on a Hindi language dataset. However, the proposed model selection method is language independent.
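The PER-based selection described in this abstract can be sketched roughly as follows, assuming PER is the Levenshtein (edit) distance between the reference and decoded phone sequences divided by the reference length, which is the usual definition and not necessarily the authors' exact formulation. The function names, the checkpoint labels, and the example phone sequences are hypothetical; the TTS and ASR models themselves are outside the scope of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def phone_error_rate(ref_phones, hyp_phones):
    """Edit distance normalised by the number of reference phones."""
    if not ref_phones:
        raise ValueError("reference phone sequence must be non-empty")
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)


def select_checkpoint(per_by_checkpoint):
    """Return the checkpoint label with the lowest validation PER."""
    return min(per_by_checkpoint, key=per_by_checkpoint.get)


# Hypothetical data: phones of the input text vs. phones decoded by the ASR
# model from the speech that two TTS checkpoints synthesised.
reference = ["n", "a", "m", "a", "s", "t", "e"]
decoded = {
    "ckpt_100k": ["n", "a", "m", "a", "s", "t", "e"],
    "ckpt_050k": ["n", "a", "m", "s", "t", "i"],
}
per = {name: phone_error_rate(reference, hyp) for name, hyp in decoded.items()}
best = select_checkpoint(per)
```

In practice the PER would be averaged over a whole validation split before picking the minimum, as the abstract describes; the single utterance here only keeps the example short.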
Submitted 2 June, 2020;
originally announced June 2020.
-
Dynamic Vision Sensors for Human Activity Recognition
Authors:
Stefanie Anna Baby,
Bimal Vinod,
Chaitanya Chinni,
Kaushik Mitra
Abstract:
Unlike conventional cameras, which capture video at a fixed frame rate, Dynamic Vision Sensors (DVS) record only changes in pixel intensity values. The output of DVS is simply a stream of discrete ON/OFF events based on the polarity of change in its pixel values. DVS has many attractive features such as low power consumption, high temporal resolution, high dynamic range, and lower storage requirements. All these make DVS a very promising camera for potential applications in wearable platforms where power consumption is a major concern.
In this paper, we explore the feasibility of using DVS for Human Activity Recognition (HAR). We propose to use the various slices (such as $x-y$, $x-t$, and $y-t$) of the DVS video as a feature map for HAR and denote them as Motion Maps. We show that fusing Motion Maps with Motion Boundary Histogram (MBH) gives good performance on the benchmark DVS dataset as well as on a real DVS gesture dataset collected by us. Interestingly, the performance of DVS is comparable to that of conventional videos although DVS captures only sparse motion information.
Submitted 13 March, 2018;
originally announced March 2018.