-
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
Authors:
Fatemeh Ghezloo,
Mehmet Saygin Seyfioglu,
Rustin Soraki,
Wisdom O. Ikezogwo,
Beibin Li,
Tejoram Vivekanandan,
Joann G. Elmore,
Ranjay Krishna,
Linda Shapiro
Abstract:
Diagnosing diseases through histopathology whole slide images (WSIs) is fundamental in modern pathology but is challenged by the gigapixel scale and complexity of WSIs. Trained histopathologists overcome this challenge by navigating the WSI, looking for relevant patches, taking notes, and compiling them to produce a final holistic diagnostic. Traditional AI approaches, such as multiple instance le…
▽ More
Diagnosing diseases through histopathology whole slide images (WSIs) is fundamental in modern pathology but is challenged by the gigapixel scale and complexity of WSIs. Trained histopathologists overcome this challenge by navigating the WSI, looking for relevant patches, taking notes, and compiling them to produce a final holistic diagnostic. Traditional AI approaches, such as multiple instance learning and transformer-based models, fail short of such a holistic, iterative, multi-scale diagnostic procedure, limiting their adoption in the real-world. We introduce PathFinder, a multi-modal, multi-agent framework that emulates the decision-making process of expert pathologists. PathFinder integrates four AI agents, the Triage Agent, Navigation Agent, Description Agent, and Diagnosis Agent, that collaboratively navigate WSIs, gather evidence, and provide comprehensive diagnoses with natural language explanations. The Triage Agent classifies the WSI as benign or risky; if risky, the Navigation and Description Agents iteratively focus on significant regions, generating importance maps and descriptive insights of sampled patches. Finally, the Diagnosis Agent synthesizes the findings to determine the patient's diagnostic classification. Our Experiments show that PathFinder outperforms state-of-the-art methods in skin melanoma diagnosis by 8% while offering inherent explainability through natural language descriptions of diagnostically relevant patches. Qualitative analysis by pathologists shows that the Description Agent's outputs are of high quality and comparable to GPT-4o. PathFinder is also the first AI-based system to surpass the average performance of pathologists in this challenging melanoma classification task by 9%, setting a new record for efficient, accurate, and interpretable AI-assisted diagnostics in pathology. Data, code and models available at https://pathfinder-dx.github.io/
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives
Authors:
Wisdom O. Ikezogwo,
Kevin Zhang,
Mehmet Saygin Seyfioglu,
Fatemeh Ghezloo,
Linda Shapiro,
Ranjay Krishna
Abstract:
We propose MedicalNarratives, a dataset curated from medical pedagogical videos similar in nature to data collected in Think-Aloud studies and inspired by Localized Narratives, which collects grounded image-text data by curating instructors' speech and mouse cursor movements synchronized in time. MedicalNarratives enables pretraining of both semantic and dense objectives, alleviating the need to t…
▽ More
We propose MedicalNarratives, a dataset curated from medical pedagogical videos similar in nature to data collected in Think-Aloud studies and inspired by Localized Narratives, which collects grounded image-text data by curating instructors' speech and mouse cursor movements synchronized in time. MedicalNarratives enables pretraining of both semantic and dense objectives, alleviating the need to train medical semantic and dense tasks disparately due to the lack of reasonably sized datasets. Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples containing dense annotations in the form of traces and bounding boxes. To evaluate the utility of MedicalNarratives, we train GenMedClip based on the CLIP architecture using our dataset spanning 12 medical domains and demonstrate that it outperforms previous state-of-the-art models on a newly constructed medical imaging benchmark that comprehensively evaluates performance across all modalities. Data, demo, code and models available at https://medical-narratives.github.io
△ Less
Submitted 12 January, 2025; v1 submitted 7 January, 2025;
originally announced January 2025.
-
Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
Authors:
Mehmet Saygin Seyfioglu,
Karim Bouyarmane,
Suren Kumar,
Amir Tavanaei,
Ismail B. Tutar
Abstract:
As online shopping is growing, the ability for buyers to virtually visualize products in their settings-a phenomenon we define as "Virtual Try-All"-has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of…
▽ More
As online shopping is growing, the ability for buyers to virtually visualize products in their settings-a phenomenon we define as "Virtual Try-All"-has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of products. In contrast, personalization-driven models such as DreamPaint are good at preserving the item's details but they are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details in a given reference item while ensuring accurate semantic manipulations in the given scene content. Our approach is based on incorporating fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, alongside with a perceptual loss to further preserve the reference item's details. We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot diffusion inpainting methods as well as few-shot diffusion personalization algorithms like DreamPaint.
△ Less
Submitted 24 January, 2024;
originally announced January 2024.
-
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
Authors:
Mehmet Saygin Seyfioglu,
Wisdom O. Ikezogwo,
Fatemeh Ghezloo,
Ranjay Krishna,
Linda Shapiro
Abstract:
Diagnosis in histopathology requires a global whole slide images (WSIs) analysis, requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-model models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches, without a…
▽ More
Diagnosis in histopathology requires a global whole slide images (WSIs) analysis, requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-model models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches, without a spatial grounding of the concepts within each patch and without a wider view of the WSI. Therefore, they lack sufficient diagnostic capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provides spatial localization of narrations by automatically extracting the narrators' cursor positions. Quilt-Instruct supports contextual reasoning by extracting diagnosis and supporting facts from the entire WSI. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1283 human-generated question-answers. We also thoroughly evaluate Quilt-LLaVA using public histopathology datasets, where Quilt-LLaVA significantly outperforms SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set VQA. Our code, data, and model are publicly accessible at quilt-llava.github.io.
△ Less
Submitted 13 January, 2025; v1 submitted 7 December, 2023;
originally announced December 2023.
-
Quilt-1M: One Million Image-Text Pairs for Histopathology
Authors:
Wisdom Oluchi Ikezogwo,
Mehmet Saygin Seyfioglu,
Fatemeh Ghezloo,
Dylan Stefan Chan Geva,
Fatwir Sheikh Mohammed,
Pavan Kumar Anand,
Ranjay Krishna,
Linda Shapiro
Abstract:
Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of analogous data in the medical field, specifically in histopathology, has slowed comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering $1,087$ hours of va…
▽ More
Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of analogous data in the medical field, specifically in histopathology, has slowed comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering $1,087$ hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate QUILT: a large-scale vision-language dataset consisting of $802, 144$ image and text pairs. QUILT was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around $200$K samples. We combine QUILT with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: QUILT-1M, with $1$M paired image-text samples, marking it as the largest vision-language histopathology dataset to date. We demonstrate the value of QUILT-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images across $13$ diverse patch-level datasets of $8$ different sub-pathologies and cross-modal retrieval tasks.
△ Less
Submitted 13 January, 2025; v1 submitted 19 June, 2023;
originally announced June 2023.
-
DreamPaint: Few-Shot Inpainting of E-Commerce Items for Virtual Try-On without 3D Modeling
Authors:
Mehmet Saygin Seyfioglu,
Karim Bouyarmane,
Suren Kumar,
Amir Tavanaei,
Ismail B. Tutar
Abstract:
We introduce DreamPaint, a framework to intelligently inpaint any e-commerce product on any user-provided context image. The context image can be, for example, the user's own image for virtual try-on of clothes from the e-commerce catalog on themselves, the user's room image for virtual try-on of a piece of furniture from the e-commerce catalog in their room, etc. As opposed to previous augmented-…
▽ More
We introduce DreamPaint, a framework to intelligently inpaint any e-commerce product on any user-provided context image. The context image can be, for example, the user's own image for virtual try-on of clothes from the e-commerce catalog on themselves, the user's room image for virtual try-on of a piece of furniture from the e-commerce catalog in their room, etc. As opposed to previous augmented-reality (AR)-based virtual try-on methods, DreamPaint does not use, nor does it require, 3D modeling of neither the e-commerce product nor the user context. Instead, it directly uses 2D images of the product as available in product catalog database, and a 2D picture of the context, for example taken from the user's phone camera. The method relies on few-shot fine tuning a pre-trained diffusion model with the masked latents (e.g., Masked DreamBooth) of the catalog images per item, whose weights are then loaded on a pre-trained inpainting module that is capable of preserving the characteristics of the context image. DreamPaint allows to preserve both the product image and the context (environment/user) image without requiring text guidance to describe the missing part (product/context). DreamPaint also allows to intelligently infer the best 3D angle of the product to place at the desired location on the user context, even if that angle was previously unseen in the product's reference 2D images. We compare our results against both text-guided and image-guided inpainting modules and show that DreamPaint yields superior performance in both subjective human study and quantitative metrics.
△ Less
Submitted 2 May, 2023;
originally announced May 2023.
-
Multi-modal Masked Autoencoders Learn Compositional Histopathological Representations
Authors:
Wisdom Oluchi Ikezogwo,
Mehmet Saygin Seyfioglu,
Linda Shapiro
Abstract:
Self-supervised learning (SSL) enables learning useful inductive biases through utilizing pretext tasks that require no labels. The unlabeled nature of SSL makes it especially important for whole slide histopathological images (WSIs), where patch-level human annotation is difficult. Masked Autoencoders (MAE) is a recent SSL method suitable for digital pathology as it does not require negative samp…
▽ More
Self-supervised learning (SSL) enables learning useful inductive biases through utilizing pretext tasks that require no labels. The unlabeled nature of SSL makes it especially important for whole slide histopathological images (WSIs), where patch-level human annotation is difficult. Masked Autoencoders (MAE) is a recent SSL method suitable for digital pathology as it does not require negative sampling and requires little to no data augmentations. However, the domain shift between natural images and digital pathology images requires further research in designing MAE for patch-level WSIs. In this paper, we investigate several design choices for MAE in histopathology. Furthermore, we introduce a multi-modal MAE (MMAE) that leverages the specific compositionality of Hematoxylin & Eosin (H&E) stained WSIs. We performed our experiments on the public patch-level dataset NCT-CRC-HE-100K. The results show that the MMAE architecture outperforms supervised baselines and other state-of-the-art SSL techniques for an eight-class tissue phenotyping task, utilizing only 100 labeled samples for fine-tuning. Our code is available at https://github.com/wisdomikezogwo/MMAE_Pathology
△ Less
Submitted 14 November, 2022; v1 submitted 4 September, 2022;
originally announced September 2022.
-
Brain-Aware Replacements for Supervised Contrastive Learning in Detection of Alzheimer's Disease
Authors:
Mehmet Saygın Seyfioğlu,
Zixuan Liu,
Pranav Kamath,
Sadjyot Gangolli,
Sheng Wang,
Thomas Grabowski,
Linda Shapiro
Abstract:
We propose a novel framework for Alzheimer's disease (AD) detection using brain MRIs. The framework starts with a data augmentation method called Brain-Aware Replacements (BAR), which leverages a standard brain parcellation to replace medically-relevant 3D brain regions in an anchor MRI from a randomly picked MRI to create synthetic samples. Ground truth "hard" labels are also linearly mixed depen…
▽ More
We propose a novel framework for Alzheimer's disease (AD) detection using brain MRIs. The framework starts with a data augmentation method called Brain-Aware Replacements (BAR), which leverages a standard brain parcellation to replace medically-relevant 3D brain regions in an anchor MRI from a randomly picked MRI to create synthetic samples. Ground truth "hard" labels are also linearly mixed depending on the replacement ratio in order to create "soft" labels. BAR produces a great variety of realistic-looking synthetic MRIs with higher local variability compared to other mix-based methods, such as CutMix. On top of BAR, we propose using a soft-label-capable supervised contrastive loss, aiming to learn the relative similarity of representations that reflect how mixed are the synthetic MRIs using our soft labels. This way, we do not fully exhaust the entropic capacity of our hard labels, since we only use them to create soft labels and synthetic MRIs through BAR. We show that a model pre-trained using our framework can be further fine-tuned with a cross-entropy loss using the hard labels that were used to create the synthetic samples. We validated the performance of our framework in a binary AD detection task against both from-scratch supervised training and state-of-the-art self-supervised training plus fine-tuning approaches. Then we evaluated BAR's individual performance compared to another mix-based method CutMix by integrating it within our framework. We show that our framework yields superior results in both precision and recall for the AD detection task.
△ Less
Submitted 20 July, 2022; v1 submitted 10 July, 2022;
originally announced July 2022.
-
MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling
Authors:
Tarik Arici,
Mehmet Saygin Seyfioglu,
Tal Neiman,
Yi Xu,
Son Train,
Trishul Chilimbi,
Belinda Zeng,
Ismail Tutar
Abstract:
Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in a…
▽ More
Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked Image-Region Modeling (MIRM). Both alignment and MIRM objectives mostly do not have ground truth. Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models. In this paper, we present Masked Language and Image Modeling (MLIM) for VLP. MLIM uses two loss functions: Masked Language Modeling (MLM) loss and image reconstruction (RECON) loss. We propose Modality Aware Masking (MAM) to boost cross-modality interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality. Using MLM + RECON tasks coupled with MAM, we present a simplified VLP methodology and show that it has better downstream task performance on a proprietary e-commerce multi-modal dataset.
△ Less
Submitted 24 September, 2021;
originally announced September 2021.
-
Autonomous UAV Base Stations for Next Generation Wireless Networks: A Deep Learning Approach
Authors:
Ali Murat Demirtas,
Mehmet Saygin Seyfioglu,
Irem Bor-Yaliniz,
Bulent Tavli,
Halim Yanikomeroglu
Abstract:
To address the ever-growing connectivity demands of wireless communications, the adoption of ingenious solutions, such as Unmanned Aerial Vehicles (UAVs) as mobile Base Stations (BSs), is imperative. In general, the location of a UAV Base Station (UAV-BS) is determined by optimization algorithms, which have high computationally complexities and place heavy demands on UAV resources. In this paper,…
▽ More
To address the ever-growing connectivity demands of wireless communications, the adoption of ingenious solutions, such as Unmanned Aerial Vehicles (UAVs) as mobile Base Stations (BSs), is imperative. In general, the location of a UAV Base Station (UAV-BS) is determined by optimization algorithms, which have high computationally complexities and place heavy demands on UAV resources. In this paper, we show that a Convolutional Neural Network (CNN) model can be trained to infer the location of a UAV-BS in real time. In so doing, we create a framework to determine the UAV locations that considers the deployment of Mobile Users (MUs) to generate labels by using the data obtained from an optimization algorithm. Performance evaluations reveal that once the CNN model is trained with the given labels and locations of MUs, the proposed approach is capable of approximating the results given by the adopted optimization algorithm with high fidelity, outperforming Reinforcement Learning (RL)-based approaches. We also explore future research challenges and highlight key issues.
△ Less
Submitted 11 April, 2022; v1 submitted 29 July, 2021;
originally announced July 2021.
-
Detecting Cybersecurity Events from Noisy Short Text
Authors:
Semih Yagcioglu,
Mehmet Saygin Seyfioglu,
Begum Citamak,
Batuhan Bardak,
Seren Guldamlasioglu,
Azmi Yuksel,
Emin Islam Tatli
Abstract:
It is very critical to analyze messages shared over social networks for cyber threat intelligence and cyber-crime prevention. In this study, we propose a method that leverages both domain-specific word embeddings and task-specific features to detect cyber security events from tweets. Our model employs a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network…
▽ More
It is very critical to analyze messages shared over social networks for cyber threat intelligence and cyber-crime prevention. In this study, we propose a method that leverages both domain-specific word embeddings and task-specific features to detect cyber security events from tweets. Our model employs a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network which takes word level meta-embeddings as inputs and incorporates contextual embeddings to classify noisy short text. We collected a new dataset of cyber security related tweets from Twitter and manually annotated a subset of 2K of them. We experimented with this dataset and concluded that the proposed model outperforms both traditional and neural baselines. The results suggest that our method works well for detecting cyber security events from noisy short text.
△ Less
Submitted 2 June, 2019; v1 submitted 10 April, 2019;
originally announced April 2019.