-
Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
Authors:
Seyed Amir Ahmad Safavi-Naini,
Shuhaib Ali,
Omer Shahab,
Zahra Shahhoseini,
Thomas Savage,
Sara Rafiee,
Jamil S Samaan,
Reem Al Shabeeb,
Farah Ladak,
Jamie O Yang,
Juan Echavarria,
Sumbal Babar,
Aasma Shaukat,
Samuel Margolis,
Nicholas P Tatonetti,
Girish Nadkarni,
Bara El Kurdi,
Ali Soroush
Abstract:
Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology.
Methods: We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images, to systematically assess the impact of model configurations and parameters and prompt engineering strategies utilizing GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (versions), including GPT (3.5, 4, 4o, 4o-mini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline.
Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy, outperforming the top open-source models: Llama3.1-405b (64%), Llama3.1-70b (58.3%), and Mixtral-8x7b (54.3%). Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%) performed best. The scores of the quantized models were comparable to those of the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM performance on image-containing questions did not improve when the images were provided and worsened when LLM-generated captions were provided. In contrast, a 10% increase in accuracy was observed when images were accompanied by human-crafted image descriptions.
Conclusion: While LLMs exhibit robust zero-shot performance in medical reasoning, the integration of visual data remains a challenge for VLMs. Effective deployment involves carefully determining optimal model configurations, encouraging users to consider either the high performance of proprietary models or the flexible adaptability of open-source models.
Submitted 4 September, 2024; v1 submitted 25 August, 2024;
originally announced September 2024.
-
Training dynamic models using early exits for automatic speech recognition on resource-constrained devices
Authors:
George August Wright,
Umberto Cappellazzo,
Salah Zaiem,
Desh Raj,
Lucas Ondel Yang,
Daniele Falavigna,
Mohamed Nabih Ali,
Alessio Brutti
Abstract:
The ability to dynamically adjust the computational load of neural models during inference is crucial for on-device processing scenarios characterised by limited and time-varying computational resources. A promising solution is presented by early-exit architectures, in which additional exit branches are appended to intermediate layers of the encoder. In self-attention models for automatic speech recognition (ASR), early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands. Previous research on early-exiting ASR models has relied on pre-trained self-supervised models, fine-tuned with an early-exit loss. In this paper, we undertake an experimental comparison between fine-tuning pre-trained backbones and training models from scratch with the early-exiting objective. Experiments conducted on public datasets reveal that early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also exhibit enhanced task accuracy compared to single-exit or pre-trained models. Furthermore, we explore an exit selection strategy grounded in posterior probabilities as an alternative to the conventional frame-based entropy approach. Results provide insights into the training dynamics of early-exit architectures for ASR models, particularly the efficacy of training strategies and exit selection methods.
Submitted 22 February, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition
Authors:
Kyoung Ok Yang,
Junho Koh,
Jun Won Choi
Abstract:
Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer), designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs the Unified Transformer to capture the inter-dependency among embeddings in both time and modality domains. We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across various modalities, thus generating semantically aligned features for information fusion. Performance evaluation conducted on two popular datasets, UTD-MHAD and NTU RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance, outperforming competing methods by considerable margins.
Submitted 10 September, 2023;
originally announced September 2023.
-
Bias Mitigation Framework for Intersectional Subgroups in Neural Networks
Authors:
Narine Kokhlikyan,
Bilal Alsallakh,
Fulton Wang,
Vivek Miglani,
Oliver Aobo Yang,
David Adkins
Abstract:
We propose a fairness-aware learning framework that mitigates intersectional subgroup bias associated with protected attributes. Prior research has primarily focused on mitigating one kind of bias by incorporating complex fairness-driven constraints into optimization objectives or designing additional layers that focus on specific protected attributes. We introduce a simple and generic bias mitigation approach that prevents models from learning relationships between protected attributes and the output variable by reducing mutual information between them. We demonstrate that our approach is effective in reducing bias with little or no drop in accuracy. We also show that the models trained with our learning framework become causally fair and insensitive to the values of protected attributes. Finally, we validate our approach by studying feature interactions between protected and non-protected attributes. We demonstrate that these interactions are significantly reduced when applying our bias mitigation.
Submitted 25 December, 2022;
originally announced December 2022.
-
Algebraic Learning: Towards Interpretable Information Modeling
Authors:
Tong Owen Yang
Abstract:
Along with the proliferation of digital data collected using sensor technologies and a boost of computing power, Deep Learning (DL) based approaches have drawn enormous attention in the past decade due to their impressive performance in extracting complex relations from raw data and representing valuable information. Meanwhile, though, rooted in its notorious black-box nature, the appreciation of DL has been highly debated due to the lack of interpretability. On the one hand, DL only utilizes statistical features contained in raw data while ignoring human knowledge of the underlying system, which results in both data inefficiency and trust issues; on the other hand, a trained DL model does not provide to researchers any extra insight about the underlying system beyond its output, which, however, is the essence of most fields of science, e.g. physics and economics.
This thesis addresses the issue of interpretability in general information modeling and endeavors to ease the problem from two scopes. Firstly, a problem-oriented perspective is applied to incorporate knowledge into modeling practice, where interesting mathematical properties emerge naturally which cast constraints on modeling. Secondly, given a trained model, various methods could be applied to extract further insights about the underlying system. These two pathways are termed guided model design and secondary measurements. Remarkably, a novel scheme emerges for the modeling practice in statistical learning: Algebraic Learning (AgLr). Instead of being restricted to the discussion of any specific model, AgLr starts from idiosyncrasies of a learning task itself and studies the structure of a legitimate model class. This novel scheme demonstrates the noteworthy value of abstract algebra for general AI, which has been overlooked in recent progress, and could shed further light on interpretable information modeling.
Submitted 13 March, 2022;
originally announced March 2022.
-
Robust and Efficient Multilevel-ILU Preconditioning of Hybrid Newton-GMRES for Incompressible Navier-Stokes Equations
Authors:
Qiao Chen,
Xiangmin Jiao,
Oliver Yang
Abstract:
We introduce a robust and efficient preconditioner for a hybrid Newton-GMRES method for solving the nonlinear systems arising from incompressible Navier-Stokes equations. When the Reynolds number is relatively high, these systems often involve millions of degrees of freedom (DOFs), and the nonlinear systems are difficult to converge, partially due to the strong asymmetry of the system and the saddle-point structure. In this work, we propose to alleviate these issues by leveraging a multilevel ILU preconditioner called HILUCSI, which is particularly effective for saddle-point problems and can enable robust and rapid convergence of the inner iterations in Newton-GMRES. We further use Picard iterations with the Oseen systems to hot-start Newton-GMRES to achieve global convergence, also preconditioned using HILUCSI. To further improve efficiency and robustness, we use the Oseen operators as physics-based sparsifiers when building preconditioners for Newton iterations and introduce adaptive refactorization and iterative refinement in HILUCSI. We refer to the resulting preconditioned hybrid Newton-GMRES as HILUNG. We demonstrate the effectiveness of HILUNG by solving the standard 2D driven-cavity problem with Re 5000 and a 3D flow-over-cylinder problem with low viscosity. We compare HILUNG with some state-of-the-art customized preconditioners for INS, including two variants of augmented Lagrangian preconditioners and two physics-based preconditioners, as well as some general-purpose approximate-factorization techniques. Our comparison shows that HILUNG is much more robust for solving high-Re problems and it is also more efficient in both memory and runtime for moderate-Re problems.
Submitted 9 August, 2021; v1 submitted 14 November, 2020;
originally announced November 2020.