-
Multimodal Survival Modeling in the Age of Foundation Models
Authors:
Steven Song,
Morgan Borjigin-Wang,
Irene Madejski,
Robert L. Grossman
Abstract:
The Cancer Genome Atlas (TCGA) has enabled novel discoveries and served as a large-scale reference through its harmonized genomics, clinical, and image data. Prior studies have trained bespoke cancer survival prediction models from unimodal or multimodal TCGA data. A modern paradigm in biomedical deep learning is the development of foundation models (FMs) to derive meaningful feature embeddings, agnostic to a specific modeling task. Biomedical text especially has seen growing development of FMs. While TCGA contains free-text data as pathology reports, these have been historically underutilized. Here, we investigate the feasibility of training classical, multimodal survival models over zero-shot embeddings extracted by FMs. We show the ease and additive effect of multimodal fusion, outperforming unimodal models. We demonstrate the benefit of including pathology report text and rigorously evaluate the effect of model-based text summarization and hallucination. Overall, we modernize survival modeling by leveraging FMs and information extraction from pathology reports.
Submitted 12 May, 2025;
originally announced May 2025.
-
A Proposed End-To-End Principle for Data Commons
Authors:
Robert L. Grossman
Abstract:
A data commons brings together (or co-locates) data with cloud computing infrastructure and commonly used software services, tools and applications for managing, analyzing and sharing data to create an interoperable resource for a research community. We introduce an architectural design principle for data commons called the narrow middle architecture that is broadly based upon the end-to-end argument in systems design. We also discuss important core services for data commons and the role of standards.
Submitted 17 February, 2025;
originally announced February 2025.
-
LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation
Authors:
Steven Song,
Anirudh Subramanyam,
Irene Madejski,
Robert L. Grossman
Abstract:
In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine-tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data leakage.
Submitted 25 November, 2024;
originally announced November 2024.
-
Ten Pillars for Data Meshes
Authors:
Robert L. Grossman,
Ceilyn Boyd,
Nhan Do,
Danne C. Elbers,
Michael S. Fitzsimons,
Maryellen L. Giger,
Anthony Juehne,
Brienna Larrick,
Jerry S. H. Lee,
Dawei Lin,
Michael Lukowski,
James D. Myers,
L. Philip Schumm,
Aarti Venkat
Abstract:
Over the past few years, a growing number of data platforms have emerged, including data commons, data repositories, and databases containing biomedical, environmental, social determinants of health and other data relevant to improving health outcomes. With the growing number of data platforms, interoperating multiple data platforms to form data meshes, data fabrics and other types of data ecosystems reduces data silos, expands data use, and increases the potential for new discoveries. In this paper, we introduce ten principles, which we call pillars, for data meshes. The goals of the principles are 1) to make it easier, faster, and more uniform to set up a data mesh from multiple data platforms; and, 2) to make it easier, faster, and more uniform, for a data platform to join one or more data meshes. The hope is that the greater availability of data through data meshes will accelerate research and that the greater uniformity of meshes will lower the cost of developing meshes and connecting a data platform to them.
Submitted 7 November, 2024;
originally announced November 2024.
-
An Annotated Glossary for Data Commons, Data Meshes, and Other Data Platforms
Authors:
Robert L. Grossman
Abstract:
Cloud-based data commons, data meshes, data hubs, and other data platforms are important ways to manage, analyze and share data to accelerate research and to support reproducible research. This is an annotated glossary of some of the more common terms used in articles and discussions about these platforms.
Submitted 23 April, 2024;
originally announced April 2024.
-
Enhancing Instance-Level Image Classification with Set-Level Labels
Authors:
Renyu Zhang,
Aly A. Khan,
Yuxin Chen,
Robert L. Grossman
Abstract:
Instance-level image classification tasks have traditionally relied on single-instance labels to train models, e.g., few-shot learning and transfer learning. However, set-level coarse-grained labels that capture relationships among instances can provide richer information in real-world scenarios. In this paper, we present a novel approach to enhance instance-level image classification by leveraging set-level labels. We provide a theoretical analysis of the proposed method, including recognition conditions for fast excess risk rate, shedding light on the theoretical foundations of our approach. We conducted experiments on two distinct categories of datasets: natural image datasets and histopathology image datasets. Our experimental results demonstrate the effectiveness of our approach, showcasing improved classification performance compared to traditional single-instance label-based methods. Notably, our algorithm achieves a 13% improvement in classification accuracy compared to the strongest baseline on the histopathology image classification benchmarks. Importantly, our experimental findings align with the theoretical analysis, reinforcing the robustness and reliability of our proposed method. This work bridges the gap between instance-level and set-level image classification, offering a promising avenue for advancing the capabilities of image classification models with set-level coarse-grained labels.
Submitted 17 November, 2023; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Ten Lessons for Data Sharing With a Data Commons
Authors:
Robert L. Grossman
Abstract:
A data commons is a cloud-based data platform with a governance structure that allows a community to manage, analyze and share its data. Data commons provide a research community with the ability to manage and analyze large datasets using the elastic scalability provided by cloud computing and to share data securely and compliantly, and, in this way, accelerate the pace of research. Over the past decade, a number of data commons have been developed and we discuss some of the lessons learned from this effort.
Submitted 22 July, 2022;
originally announced July 2022.
-
A Framework for the Interoperability of Cloud Platforms: Towards FAIR Data in SAFE Environments
Authors:
Robert L. Grossman,
Rebecca R. Boyles,
Brandi N. Davis-Dusenbery,
Amanda Haddock,
Allison P. Heath,
Brian D. O'Connor,
Adam C. Resnick,
Deanne M. Taylor,
Stan Ahalt
Abstract:
As the number of cloud platforms supporting scientific research grows, there is an increasing need to support interoperability between two or more cloud platforms, as a growing amount of data is being hosted in cloud-based platforms. A well-accepted core concept is to make data in cloud platforms Findable, Accessible, Interoperable and Reusable (FAIR). We introduce a companion concept that applies to cloud-based computing environments that we call a Secure and Authorized FAIR Environment (SAFE). SAFE environments require data and platform governance structures and are designed to support the interoperability of sensitive or controlled-access data, such as biomedical data. A SAFE environment is a cloud platform that has been approved through a defined data and platform governance process as authorized to hold data from another cloud platform and exposes appropriate APIs for the two platforms to interoperate.
Submitted 15 February, 2024; v1 submitted 9 March, 2022;
originally announced March 2022.
-
Scalable Batch-Mode Deep Bayesian Active Learning via Equivalence Class Annealing
Authors:
Renyu Zhang,
Aly A. Khan,
Robert L. Grossman,
Yuxin Chen
Abstract:
Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of batch-mode deep Bayesian active models, rely heavily on the quality of uncertainty estimations of the model, and are often challenging to scale to large batches. In this paper, we propose Batch-BALanCe, a scalable batch-mode active learning algorithm, which combines insights from decision-theoretic active learning, combinatorial information measure, and diversity sampling. At its core, Batch-BALanCe relies on a novel decision-theoretic acquisition function that facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of hypotheses (e.g., posterior samples of deep neural networks) with similar predictions, and Batch-BALanCe adaptively adjusts the size of the equivalence classes as learning progresses. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel information measure defined through the acquisition function. We show that our algorithm can effectively handle realistic multi-class classification tasks, and achieves compelling performance on several benchmark datasets for active learning under both low- and large-batch regimes. Reference code is released at https://github.com/zhangrenyuuchicago/BALanCe.
Submitted 20 February, 2023; v1 submitted 27 December, 2021;
originally announced December 2021.
-
Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data
Authors:
Robert L. Grossman
Abstract:
Data commons collate data with cloud computing infrastructure and commonly used software services, tools and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize and share large scale genomics datasets. Data ecosystems can be built by interoperating multiple data commons. It can be quite labor intensive to curate, import and analyze the data in a data commons. Data lakes provide an alternative to data commons and simply provide access to data, with the data curation and analysis deferred until later and delegated to those that access the data. We review software platforms for managing, analyzing and sharing genomic data, with an emphasis on data commons, but also covering data ecosystems and data lakes.
Submitted 24 December, 2018; v1 submitted 5 September, 2018;
originally announced September 2018.
-
A Case for Data Commons: Towards Data Science as a Service
Authors:
Robert L. Grossman,
Allison Heath,
Mark Murphy,
Maria Patterson,
Walt Wells
Abstract:
As the amount of scientific data continues to grow at ever faster rates, the research community is increasingly in need of flexible computational infrastructure that can support the entirety of the data science lifecycle, including long-term data storage, data exploration and discovery services, and compute capabilities to support data analysis and re-analysis, as new data are added and as scientific pipelines are refined. We describe our experience developing data commons (interoperable infrastructure that co-locates data, storage, and compute with common analysis tools) and present several case studies. Across these case studies, several common requirements emerge, including the need for persistent digital identifier and metadata services, APIs, data portability, pay-for-compute capabilities, and data peering agreements between data commons. Though many challenges, including sustainability and developing appropriate standards, remain, interoperable data commons bring us one step closer to effective Data Science as a Service for the scientific research community.
Submitted 9 April, 2016;
originally announced April 2016.
-
The Design of a Community Science Cloud: The Open Science Data Cloud Perspective
Authors:
Robert L. Grossman,
Matthew Greenway,
Allison P. Heath,
Ray Powell,
Rafael D. Suarez,
Walt Wells,
Kevin White,
Malcolm Atkinson,
Iraklis Klampanos,
Heidi L. Alvarez,
Christine Harvey,
Joe J. Mambretti
Abstract:
In this paper we describe the design and implementation of the Open Science Data Cloud, or OSDC. The goal of the OSDC is to provide petabyte-scale data cloud infrastructure and related services for scientists working with large quantities of data. Currently, the OSDC consists of more than 2000 cores and 2 PB of storage distributed across four data centers connected by 10G networks. We discuss some of the lessons learned during the past three years of operation and describe the software stacks used in the OSDC. We also describe some of the research projects in biology, the earth sciences, and social sciences enabled by the OSDC.
Submitted 3 January, 2016;
originally announced January 2016.
-
MalStone: Towards A Benchmark for Analytics on Large Data Clouds
Authors:
Collin Bennett,
Robert L. Grossman,
David Locke,
Jonathan Seidman,
Steve Vejcik
Abstract:
Developing data mining algorithms that are suitable for cloud computing platforms is currently an active area of research, as is developing cloud computing platforms appropriate for data mining. Currently, the most common benchmarks for cloud computing are the Terasort Benchmark and related benchmarks. Although the Terasort Benchmark is quite useful, it was not designed for data mining per se. In this paper, we introduce a benchmark called MalStone that is specifically designed to measure the performance of cloud computing middleware that supports the type of data-intensive computing common when building data mining models. We also introduce MalGen, which is a utility for generating data on clouds that can be used with MalStone.
Submitted 7 July, 2010;
originally announced July 2010.
-
Sector and Sphere: Towards Simplified Storage and Processing of Large Scale Distributed Data
Authors:
Yunhong Gu,
Robert L Grossman
Abstract:
Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. In contrast to existing storage and compute clouds, Sector can manage data not only within a data center, but also across geographically distributed data centers. Similarly, the Sphere compute cloud supports User Defined Functions (UDF) over data both within a data center and across data centers. As a special case, MapReduce style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe some experimental studies comparing Sector/Sphere and Hadoop using the Terasort Benchmark. In these studies, Sector is about twice as fast as Hadoop. Sector/Sphere is open source.
Submitted 16 January, 2009; v1 submitted 6 September, 2008;
originally announced September 2008.
-
Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
Authors:
Robert L Grossman,
Yunhong Gu
Abstract:
We describe the design and implementation of a high performance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an infrastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters connected with wide area high performance networks (for example, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.
Submitted 21 August, 2008;
originally announced August 2008.
-
Compute and Storage Clouds Using Wide Area High Performance Networks
Authors:
Robert L. Grossman,
Yunhong Gu,
Michael Sabala,
Wanzhi Zhang
Abstract:
We describe a cloud based infrastructure that we have developed that is optimized for wide area, high performance networks and designed to support data mining applications. The infrastructure consists of a storage cloud called Sector and a compute cloud called Sphere. We describe two applications that we have built using the cloud and some experimental studies.
Submitted 13 August, 2008;
originally announced August 2008.