
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models

Authors: Adrian Mirza, Nawaf Alampara, Martiño Ríos-García, Mohamed Abdelalim, Jack Butler, Bethany Connolly, Tunca Dogan, Marianna Nezhurina, Bünyamin Şen, Santosh Tirunagari, Mark Worrall, Adamo Young, Philippe Schwaller, Michael Pieler, Kevin Maik Jablonka

Abstract: Foundation models have shown remarkable success across scientific domains, yet their impact in chemistry remains limited due to the absence of diverse, large-scale, high-quality datasets that reflect the field's multifaceted nature. We present the ChemPile, an open dataset containing over 75 billion tokens of curated chemical data, specifically built for training and evaluating general-purpose models in the chemical sciences. The dataset mirrors the human learning journey through chemistry -- from educational foundations to specialized expertise -- spanning multiple modalities and content types including structured data in diverse chemical representations (SMILES, SELFIES, IUPAC names, InChI, molecular renderings), scientific and educational text, executable code, and chemical images. ChemPile integrates foundational knowledge (textbooks, lecture notes), specialized expertise (scientific articles and language-interfaced data), visual understanding (molecular structures, diagrams), and advanced reasoning (problem-solving traces and code) -- mirroring how human chemists develop expertise through diverse learning materials and experiences. Constructed through hundreds of hours of expert curation, the ChemPile captures both foundational concepts and domain-specific complexity. We provide standardized training, validation, and test splits, enabling robust benchmarking. ChemPile is openly released via HuggingFace with a consistent API, permissive license, and detailed documentation. We hope the ChemPile will serve as a catalyst for chemical AI, enabling the development of the next generation of chemical foundation models.

Submitted 18 May, 2025; originally announced May 2025.

arXiv:2401.01414 [pdf, other]

VALD-MD: Visual Attribution via Latent Diffusion for Medical Diagnostics

Authors: Ammar A. Siddiqui, Santosh Tirunagari, Tehseen Zia, David Windridge

Abstract: Visual attribution in medical imaging seeks to make evident the diagnostically-relevant components of a medical image, in contrast to the more common detection of diseased tissue deployed in standard machine vision pipelines (which are less straightforwardly interpretable/explainable to clinicians). We here present a novel generative visual attribution technique, one that leverages latent diffusion models in combination with domain-specific large language models, in order to generate normal counterparts of abnormal images. The discrepancy between the two hence gives rise to a mapping indicating the diagnostically-relevant image components. To achieve this, we deploy image priors in conjunction with appropriate conditioning mechanisms in order to control the image generative process, including natural language text prompts acquired from medical science and applied radiology. We perform experiments and quantitatively evaluate our results on the COVID-19 Radiography Database containing labelled chest X-rays with differing pathologies via the Fréchet Inception Distance (FID), Structural Similarity (SSIM) and Multi Scale Structural Similarity Metric (MS-SSIM) metrics obtained between real and generated images. The resulting system also exhibits a range of latent capabilities including zero-shot localized disease induction, which are evaluated with real examples from the CheXpert dataset.

Submitted 2 January, 2024; originally announced January 2024.

arXiv:1905.11387 [pdf, other]

Automatic Delineation of Kidney Region in DCE-MRI

Authors: Santosh Tirunagari, Norman Poh, Kevin Wells, Miroslaw Bober, Isky Gorden, David Windridge

Abstract: Delineation of the kidney region in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is required during post-acquisition analysis in order to quantify various aspects of renal function, such as filtration and perfusion or blood flow. However, this can be obfuscated by the Partial Volume Effect (PVE), caused by the mixing, within any single voxel, of two or more signal intensities from adjacent regions such as the liver and other tissues. To avoid this problem, a kidney region of interest (ROI) first needs to be defined for the analysis. A clinician may choose to select a region avoiding edges where PV mixing is likely to be significant. However, this approach is time-consuming and labour-intensive. To address this issue, we present Dynamic Mode Decomposition (DMD) coupled with thresholding and blob analysis as a framework for automatic delineation of the kidney region. This method is first validated on synthetically generated data with ground-truth available and then applied to ten healthy volunteers' kidney DCE-MRI datasets. We found that the result obtained from our proposed framework is comparable to that of a human expert. For example, while our result gives an average Root Mean Square Error (RMSE) of 0.0097, the baseline achieves an average RMSE of 0.1196 across the 10 datasets. As a result, we conclude that automatic modelling via the DMD framework is a promising approach.

Submitted 26 May, 2019; originally announced May 2019.

Comments: arXiv admin note: text overlap with arXiv:1905.10218

arXiv:1905.10218 [pdf, other]

Functional Segmentation through Dynamic Mode Decomposition: Automatic Quantification of Kidney Function in DCE-MRI Images

Authors: Santosh Tirunagari, Norman Poh, Kevin Wells, Miroslaw Bober, Isky Gorden, David Windridge

Abstract: Quantification of kidney function in Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) requires careful segmentation of the renal region of interest (ROI). Traditionally, human experts are required to manually delineate the kidney ROI across multiple images in the dynamic sequence. This approach is costly, time-consuming and labour-intensive, and therefore limits patient throughput and is one of the factors limiting the wider adoption of DCE-MRI in clinical practice. To address this issue, we present the first use of Dynamic Mode Decomposition (DMD) as a basis for automatic segmentation of a dynamic sequence, in this case, kidney ROIs in DCE-MRI. DMD combined with thresholding and connected component analysis is first validated on synthetically generated data with known ground-truth, and then applied to ten healthy volunteers' DCE-MRI datasets. We find that the segmentation result obtained from our proposed DMD framework is comparable to that of expert observers and very significantly better than that of an a-priori bounding box segmentation. Our result gives a mean Jaccard coefficient of 0.87, compared to mean scores of 0.85, 0.88 and 0.87 produced from three independent manual annotations. This represents the first use of DMD as a robust automatic data-driven segmentation approach without requiring any human intervention. This is a viable, efficient alternative approach to current manual methods of isolation of kidney function in DCE-MRI.

Submitted 24 May, 2019; originally announced May 2019.

arXiv:1612.01409 [pdf, other]

Probabilistic Broken-Stick Model: A Regression Algorithm for Irregularly Sampled Data with Application to eGFR

Authors: Norman Poh, Simon Bull, Santosh Tirunagari, Nicholas Cole, Simon de Lusignan

Abstract: In order for clinicians to manage disease progression and make effective decisions about drug dosage, treatment regimens or scheduling follow-up appointments, it is necessary to be able to identify both short- and long-term trends in repeated biomedical measurements. However, this is complicated by the fact that these measurements are irregularly sampled and influenced by both genuine physiological changes and external factors. In their current forms, existing regression algorithms often do not fulfil all of a clinician's requirements for identifying short-term events while still being able to identify long-term trends in disease progression. Therefore, in order to balance both short-term interpretability and long-term flexibility, an extension to broken-stick regression models is proposed in order to make them more suitable for modelling clinical time series. The proposed probabilistic broken-stick model can robustly estimate both short-term and long-term trends simultaneously, while also accommodating the unequal length and irregularly sampled nature of clinical time series. Moreover, since the model is parametric and completely generative, its first derivative provides a long-term non-linear estimate of the annual rate of change in the measurements more reliably than linear regression. The benefits of the proposed model are illustrated using estimated glomerular filtration rate as a case study for managing patients with chronic kidney disease.

Submitted 30 November, 2016; originally announced December 2016.

Comments: Preprint submitted to Journal of Biomedical Informatics

arXiv:1609.05716 [pdf, other]

Visualisation of Survey Responses using Self-Organising Maps: A Case Study on Diabetes Self-care Factors

Authors: Santosh Tirunagari, Simon Bull, Samaneh Kouchaki, Deborah Cooke, Norman Poh

Abstract: Due to the chronic nature of diabetes, patient self-care factors play an important role in any treatment plan. In order to understand the behaviour of patients in response to medical advice on self-care, clinicians often conduct cross-sectional surveys. When analysing the survey data, statistical machine learning methods can potentially provide additional insight into the data either through deeper understanding of the patterns present or making information available to clinicians in an intuitive manner. In this study, we use self-organising maps (SOMs) to visualise the responses of patients who share similar responses to survey questions, with the goal of helping clinicians understand how patients are managing their treatment and where action should be taken. The principal behavioural patterns revealed through this are that: patients who take the correct dose of insulin also tend to take their injections at the correct time, patients who eat on time also tend to correctly manage their food portions and patients who check their blood glucose with a monitor also tend to adjust their insulin dosage and carry snacks to counter low blood glucose. The identification of these positive behavioural patterns can also help to inform treatment by exploiting their negative corollaries.

Submitted 30 August, 2016; originally announced September 2016.

arXiv:1609.04214 [pdf, ps, other]

"Flow Size Difference" Can Make a Difference: Detecting Malicious TCP Network Flows Based on Benford's Law

Authors: Aamo Iorliam, Santosh Tirunagari, Anthony T. S. Ho, Shujun Li, Adrian Waller, Norman Poh

Abstract: Statistical characteristics of network traffic have attracted a significant amount of research for automated network intrusion detection, some of which looked at applications of natural statistical laws such as Zipf's law, Benford's law and the Pareto distribution. In this paper, we present the application of Benford's law to a new network flow metric "flow size difference", which has not been studied before by other researchers, to build an unsupervised flow-based intrusion detection system (IDS). The method was inspired by our observation on a large number of TCP flow datasets where normal flows tend to follow Benford's law closely but malicious flows tend to deviate significantly from it. The proposed IDS is unsupervised, so it can be easily deployed without any training. It has two simple operational parameters with a clear semantic meaning, allowing the IDS operator to set and adapt their values intuitively to adjust the overall performance of the IDS. We tested the proposed IDS on two (one closed and one public) datasets, and proved its efficiency in terms of AUC (area under the ROC curve). Our work showed that the "flow size difference" has great potential to improve the performance of any flow-based network IDS.

Submitted 20 January, 2017; v1 submitted 14 September, 2016; originally announced September 2016.

Comments: 13 pages, 3 figures

ACM Class: C.2; K.6.5

arXiv:1607.06783 [pdf]

Can DMD obtain a Scene Background in Color?

Authors: Santosh Tirunagari, Norman Poh, Miroslaw Bober, David Windridge

Abstract: A background model describes a scene without any foreground objects and has a number of applications, ranging from video surveillance to computational photography. Recent studies have introduced the method of Dynamic Mode Decomposition (DMD) for robustly separating video frames into a background model and foreground components. While the method introduced operates by converting color images to grayscale, in this study we propose a technique to obtain the background model in the color domain. The effectiveness of our technique is demonstrated using a publicly available Scene Background Initialisation (SBI) dataset. Our results show both qualitatively and quantitatively that DMD can successfully obtain a colored background model.

Submitted 22 July, 2016; originally announced July 2016.

Comments: International Conference on Image, Vision and Computing (ICIVC 2016), August 3-5, 2016, Portsmouth, UK

arXiv:1605.05142 [pdf, other]

Automatic Classification of Irregularly Sampled Time Series with Unequal Lengths: A Case Study on Estimated Glomerular Filtration Rate

Authors: Santosh Tirunagari, Simon Bull, Norman Poh

Abstract: A patient's estimated glomerular filtration rate (eGFR) can provide important information about disease progression and kidney function. Traditionally, an eGFR time series is interpreted by a human expert labelling it as stable or unstable. While this approach works for individual patients, its time-consuming nature precludes the quick evaluation of risk in large numbers of patients. However, automating this process poses significant challenges as eGFR measurements are usually recorded at irregular intervals and the series of measurements differs in length between patients. Here we present a two-tier system to automatically classify an eGFR trend. First, we model the time series using Gaussian process regression (GPR) to fill in 'gaps' by resampling a fixed-size vector of fifty time-dependent observations. Second, we classify the resampled eGFR time series using a K-NN/SVM classifier, and evaluate its performance via 5-fold cross-validation. Using this approach we achieved an F-score of 0.90, compared to 0.96 for 5 human experts when scored amongst themselves.

Submitted 17 May, 2016; originally announced May 2016.

Report number: CS-CKD-2016-01

arXiv:1507.02447 [pdf, other]

Data Mining of Causal Relations from Text: Analysing Maritime Accident Investigation Reports

Authors: Santosh Tirunagari

Abstract: Text mining is a process of extracting information of interest from text. Such a method includes techniques from various areas such as Information Retrieval (IR), Natural Language Processing (NLP), and Information Extraction (IE). In this study, text mining methods are applied to extract causal relations from maritime accident investigation reports collected from the Marine Accident Investigation Branch (MAIB). These causal relations provide information on various mechanisms behind accidents, including human and organizational factors relating to the accident. The objective of this study is to facilitate the analysis of the maritime accident investigation reports by making the extraction of contributory causes more feasible. A careful investigation of contributory causes from the reports provides an opportunity to improve safety in the future. Two methods have been employed in this study to extract the causal relations: 1) a pattern classification method and 2) a connectives method. The former uses naive Bayes and Support Vector Machines (SVM) as classifiers. The latter simply searches for the words connecting cause and effect in sentences. The causal patterns extracted using these two methods are compared to the manual (human expert) extraction. The pattern classification method showed a fair and sensible performance with F-measure(average) = 65% when compared to the connectives method with F-measure(average) = 58%. This study provides evidence that text mining methods could be employed in extracting causal relations from marine accident investigation reports.

Submitted 9 July, 2015; originally announced July 2015.

arXiv:1503.06331 [pdf, other]

Exploratory Data Analysis of The Kelvin-Helmholtz instability in Jets

Authors: Santosh Tirunagari

Abstract: The Kelvin-Helmholtz (KH) instability is a fundamental wave instability that is frequently observed in all kinds of shear layers (jets, wakes, atmospheric air currents, etc.). The study of KH-instability coherent flow structures has a major impact on understanding the fundamentals of fluid dynamics. Therefore there is a need for methods that can identify and analyse these structures. In this final assignment, we use machine-learning methods such as Proper Orthogonal Decomposition (POD) and Dynamic Mode Decomposition (DMD) to analyse the coherent flow structures. We used a 2D co-axial jet as our data, with a Reynolds number of Re = 10,000. Results for POD modes and DMD modes are discussed and compared.

Submitted 21 March, 2015; originally announced March 2015.

Report number: DNS-Report-2012

arXiv:1503.06316 [pdf, other]

Identifying Similar Patients Using Self-Organising Maps: A Case Study on Type-1 Diabetes Self-care Survey Responses

Authors: Santosh Tirunagari, Norman Poh, Guosheng Hu, David Windridge

Abstract: Diabetes is considered a lifestyle disease, and well-managed self-care plays an important role in the treatment. Clinicians often conduct surveys to understand the self-care behaviors of their patients. In this context, we propose to use Self-Organising Maps (SOM) to explore the survey data for assessing the self-care behaviors in Type-1 diabetic patients. Specifically, SOM is used to visualize high-dimensional similar patient profiles, which is rarely discussed. Experiments demonstrate that our findings through SOM analysis correspond well to the expectations of the clinicians. In addition, our findings inspire the experts to improve their understanding of the self-care behaviors of their patients. The principal findings in our study show that: 1) patients who take the correct dose of insulin also inject insulin at the right time, 2) patients who take correct food portions also undertake regular physical activity and 3) patients who eat on time also take correct food portions.

Submitted 21 March, 2015; originally announced March 2015.

Comments: 01-05 pages

Report number: TR-DoC-02

arXiv:1503.03680 [pdf, other]

Breast Cancer Data Analytics With Missing Values: A study on Ethnic, Age and Income Groups

Authors: Santosh Tirunagari, Norman Poh, Hajara Abdulrahman, Nawal Nemmour, David Windridge

Abstract: An analysis of breast cancer incidences in women and the relationship between ethnicity and survival rate has been an ongoing study with recorded incidences of missing values in the secondary data. In this paper, we study and report the results of breast cancer survival rate by ethnicity, age and income groups from the dataset collected for 53593 patients in South East England between the years 1998 and 2003. In addition to this, we also predict the missing values for the ethnic groups in the dataset. The principal findings in our study suggest that: 1) women of white ethnicity in South East England have a higher survival rate than women of black ethnicity, 2) high-income groups have higher survival rates than lower-income groups and 3) age groups between 80 and 95 have a lower survival rate.

Submitted 12 March, 2015; originally announced March 2015.

Comments: The paper analyzes breast cancer data with missing values, where the missing values of ethnicity are imputed using a Naive Bayes classifier. Further, the data were also analysed from a domain perspective, such as the effect of ethnicity, age, and income on breast cancer survival

Showing 1–13 of 13 results for author: Tirunagari, S