-
AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
Authors:
Yifei Li,
Hanane Nour Moussa,
Ziru Chen,
Shijie Chen,
Botao Yu,
Mingyi Xue,
Benjamin Burns,
Tzu-Yao Chiu,
Vishal Dey,
Zitong Lu,
Chen Wei,
Qianheng Zhang,
Tianyu Zhang,
Song Gao,
Xuhui Huang,
Xia Ning,
Nesreen K. Ahmed,
Ali Payani,
Huan Sun
Abstract:
Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of 256 tasks shows the effectiveness of AutoSDT: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. Trained on AutoSDT-5K, the Qwen2.5-Coder-Instruct LLM series, dubbed AutoSDT-Coder, show substantial improvement on two challenging data-driven discovery benchmarks, ScienceAgentBench and DiscoveryBench. Most notably, AutoSDT-Coder-32B reaches the same level of performance as GPT-4o on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model. On DiscoveryBench, it lifts the hypothesis matching score to 8.1, bringing a 17.4% relative improvement and closing the gap between open-weight models and GPT-4o.
Submitted 9 June, 2025;
originally announced June 2025.
-
Large Language Models for Controllable Multi-property Multi-objective Molecule Optimization
Authors:
Vishal Dey,
Xiao Hu,
Xia Ning
Abstract:
In real-world drug design, molecule optimization requires selectively improving multiple molecular properties up to pharmaceutically relevant levels, while maintaining others that already meet such criteria. However, existing computational approaches and instruction-tuned LLMs fail to capture such nuanced property-specific objectives, limiting their practical applicability. To address this, we introduce C-MuMOInstruct, the first instruction-tuning dataset focused on multi-property optimization with explicit, property-specific objectives. Leveraging C-MuMOInstruct, we develop GeLLMO-Cs, a series of instruction-tuned LLMs that can perform targeted property-specific optimization. Our experiments across 5 in-distribution and 5 out-of-distribution tasks show that GeLLMO-Cs consistently outperform strong baselines, achieving up to 126% higher success rate. Notably, GeLLMO-Cs exhibit impressive 0-shot generalization to novel optimization tasks and unseen instructions. This offers a step toward a foundational LLM to support realistic, diverse optimizations with property-specific objectives. C-MuMOInstruct and code are accessible through https://github.com/ninglab/GeLLMO-C.
Submitted 29 May, 2025;
originally announced May 2025.
-
GeLLMO: Generalizing Large Language Models for Multi-property Molecule Optimization
Authors:
Vishal Dey,
Xiao Hu,
Xia Ning
Abstract:
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLMOs, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLMOs consistently outperform state-of-the-art baselines. GeLLMOs also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLMOs as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct, models, and code are accessible through https://github.com/ninglab/GeLLMO.
Submitted 27 May, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
Authors:
Ziru Chen,
Shijie Chen,
Yuting Ning,
Qianheng Zhang,
Boshi Wang,
Botao Yu,
Yifei Li,
Zeyi Liao,
Chen Wei,
Zitong Lu,
Vishal Dey,
Mingyi Xue,
Frazier N. Baker,
Benjamin Burns,
Daniel Adu-Ampratwum,
Xuhui Huang,
Xia Ning,
Song Gao,
Yu Su,
Huan Sun
Abstract:
The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using ScienceAgentBench, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1-preview with direct prompting and self-debug, which can boost the performance to 42.2%, demonstrating the effectiveness of increasing inference-time compute but with more than 10 times the cost of other LLMs. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
Submitted 31 March, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
Network-theory based modeling of avalanche dynamics in percolative tunnelling networks
Authors:
Vivek Dey,
Steffen Kampman,
Rafael Gutierrez,
Gianaurelio Cuniberti,
Pavan Nukala
Abstract:
Brain-like self-assembled networks can infer and analyze information out of unorganized noisy signals with minimal power consumption. These networks are characterized by spatiotemporal avalanches and their crackling behavior, and their physical models are expected to predict and understand their computational capabilities. Here, we use a network theory-based approach to provide a physical model for percolative tunnelling networks, found in the Ag-hBN system, consisting of nodes (atomic clusters) of Ag intercalated in the hBN van der Waals layers. By modeling a single edge plasticity through constitutive electrochemical filament formation, and annihilation through Joule heating, we identify independent parameters that determine the network connectivity. We construct a phase diagram and show that a small region of the parameter space contains signals which are long-range temporally correlated, and only a subset of them contains crackling avalanche dynamics. Physical systems spontaneously self-organize to this region, possibly to maximize the efficiency of information transfer.
Submitted 29 April, 2024;
originally announced April 2024.
-
Enhancing Molecular Property Prediction with Auxiliary Learning and Task-Specific Adaptation
Authors:
Vishal Dey,
Xia Ning
Abstract:
Pretrained Graph Neural Networks have been widely adopted for various molecular property prediction tasks. Despite their ability to encode structural and relational features of molecules, traditional fine-tuning of such pretrained GNNs on the target task can lead to poor generalization. To address this, we explore the adaptation of pretrained GNNs to the target task by jointly training them with multiple auxiliary tasks. This could enable the GNNs to learn both general and task-specific features, which may benefit the target task. However, a major challenge is to determine the relatedness of auxiliary tasks with the target task. To address this, we investigate multiple strategies to measure the relevance of auxiliary tasks and integrate such tasks by adaptively combining task gradients or by learning task weights via bi-level optimization. Additionally, we propose a novel gradient surgery-based approach, Rotation of Conflicting Gradients (RCGrad), that learns to align conflicting auxiliary task gradients through rotation. Our experiments with state-of-the-art pretrained GNNs demonstrate the efficacy of our proposed methods, with improvements of up to 7.7% over fine-tuning. This suggests that incorporating auxiliary tasks along with target task fine-tuning can be an effective way to improve the generalizability of pretrained GNNs for molecular property prediction.
Submitted 29 January, 2024;
originally announced January 2024.
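The abstract above describes aligning auxiliary-task gradients that conflict with the target-task gradient. The sketch below is not RCGrad, whose learned rotation the abstract does not specify. It shows the simpler projection-style gradient surgery (in the spirit of PCGrad) as a point of reference: a conflict is assumed to mean a negative dot product, and the conflicting component is removed by projection. The function names and the plain-list gradient representation are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(aux_grad, target_grad):
    """Return aux_grad with its conflicting component removed.

    Assumption: two gradients "conflict" when their dot product is negative.
    In that case the auxiliary gradient is projected onto the plane
    orthogonal to the target gradient; otherwise it is returned unchanged.
    """
    d = dot(aux_grad, target_grad)
    if d >= 0:
        return list(aux_grad)
    scale = d / dot(target_grad, target_grad)
    return [a - scale * t for a, t in zip(aux_grad, target_grad)]


target = [1.0, 0.0]
conflicting_aux = [-1.0, 1.0]
adjusted = project_if_conflicting(conflicting_aux, target)
```

After projection the adjusted auxiliary gradient no longer opposes the target gradient, so summing the two cannot undo progress on the target task along that direction.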
-
Precision Anti-Cancer Drug Selection via Neural Ranking
Authors:
Vishal Dey,
Xia Ning
Abstract:
Personalized cancer treatment requires a thorough understanding of complex interactions between drugs and cancer cell lines in varying genetic and molecular contexts. To address this, high-throughput screening has been used to generate large-scale drug response data, facilitating data-driven computational models. Such models can capture complex drug-cell line interactions across various contexts in a fully data-driven manner. However, accurately prioritizing the most sensitive drugs for each cell line remains a significant challenge. To address this, we developed neural ranking approaches that leverage large-scale drug response data across multiple cell lines from diverse cancer types. Unlike existing approaches that primarily utilize regression and classification techniques for drug response prediction, we formulated the objective of drug selection and prioritization as a drug ranking problem. In this work, we proposed two neural listwise ranking methods that learn latent representations of drugs and cell lines, and then use those representations to score drugs in each cell line via a learnable scoring function. Specifically, we developed a neural listwise ranking method, List-One, on top of the existing method ListNet. Additionally, we proposed a novel listwise ranking method, List-All, that focuses on all the sensitive drugs instead of the top sensitive drug, unlike List-One. Our results demonstrate that List-All outperforms the best baseline with significant improvements of as much as 8.6% in hit@20 across 50% of test cell lines. Furthermore, our analyses suggest that the learned latent spaces from our proposed methods demonstrate informative clustering structures and capture relevant underlying biological features. Moreover, our comprehensive empirical evaluation provides a thorough and objective comparison of the performance of different methods (including our proposed ones).
Submitted 30 June, 2023;
originally announced June 2023.
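The listwise ranking setup in the abstract above builds on ListNet. As a rough illustration, the sketch below computes the ListNet top-one cross-entropy loss for one cell line and a hit@k check. It is not the List-One or List-All objective, which the abstract does not spell out. The hit@k definition used here (at least one truly sensitive drug among the top k) is one common convention and is an assumption, as are all function and variable names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top_one_loss(predicted_scores, true_scores):
    """Cross-entropy between the top-one distributions of true and predicted scores."""
    p_true = softmax(true_scores)
    p_pred = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


def hit_at_k(predicted_scores, sensitive_drugs, k):
    """True if any index in sensitive_drugs appears among the k highest-scored drugs."""
    ranked = sorted(range(len(predicted_scores)),
                    key=lambda i: predicted_scores[i], reverse=True)
    return any(i in sensitive_drugs for i in ranked[:k])


# Hypothetical sensitivity scores for three drugs in one cell line.
true_scores = [3.0, 1.0, 0.0]
good_loss = listnet_top_one_loss([3.0, 1.0, 0.0], true_scores)
bad_loss = listnet_top_one_loss([0.0, 1.0, 3.0], true_scores)
```

A prediction that orders drugs like the ground truth yields a lower loss than one that reverses the order, which is the behaviour a listwise objective is meant to reward.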
-
Entity Alignment For Knowledge Graphs: Progress, Challenges, and Empirical Studies
Authors:
Deepak Chaurasiya,
Anil Surisetty,
Nitish Kumar,
Alok Singh,
Vikrant Dey,
Aakarsh Malhotra,
Gaurav Dhama,
Ankur Arora
Abstract:
Entity Alignment (EA) identifies entities across databases that refer to the same entity. Knowledge graph-based embedding methods have recently dominated EA techniques. Such methods map entities to a low-dimensional space and align them based on their similarities. With the corpus of EA methodologies growing rapidly, this paper presents a comprehensive analysis of various existing EA methods, elaborating their applications and limitations. Further, we distinguish the methods based on their underlying algorithms and the information they incorporate to learn entity representations. Based on challenges in industrial datasets, we bring forward four research questions (RQs). These RQs empirically analyse the algorithms from the perspective of hubness, degree distribution, non-isomorphic neighbourhood, and name bias. For hubness, where one entity turns up as the nearest neighbour of many other entities, we define an h-score to quantify its effect on the performance of various algorithms. Additionally, we try to level the playing field for algorithms that rely primarily on name bias existing in the benchmarking open-source datasets by creating a low name bias dataset. We further create an open-source repository for 14 embedding-based EA methods and present the analysis for invoking further research motivations in the field of EA.
Submitted 18 May, 2022;
originally announced May 2022.
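Hubness, as described in the abstract above, means that one entity turns up as the nearest neighbour of many others. The sketch below counts how often each target embedding is the nearest neighbour of a source embedding under cosine similarity. This is a generic occurrence count, not the paper's h-score, whose definition the abstract does not give. The toy embeddings and all names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbour_counts(source, target):
    """Count how many source entities pick each target index as nearest neighbour."""
    counts = Counter()
    for s in source:
        best = max(range(len(target)), key=lambda j: cosine(s, target[j]))
        counts[best] += 1
    return counts


# Toy 2-D embeddings: three source entities crowd around target 0.
source = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
target = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
counts = nearest_neighbour_counts(source, target)
```

A target index with a count far above the average acts as a hub. Under one-to-one alignment at most one of the source entities pointing at it can be correct.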
-
Improving Compound Activity Classification via Deep Transfer and Representation Learning
Authors:
Vishal Dey,
Raghu Machiraju,
Xia Ning
Abstract:
Recent advances in molecular machine learning, especially deep neural networks such as Graph Neural Networks (GNNs) for predicting structure activity relationships (SAR), have shown tremendous potential in computer-aided drug discovery. However, the applicability of such deep neural networks is limited by the requirement of large amounts of training data. In order to cope with limited training data for a target task, transfer learning for SAR modeling has been recently adopted to leverage information from data of related tasks. In this work, in contrast to the popular parameter-based transfer learning such as pretraining, we develop novel deep transfer learning methods TAc and TAc-fc to leverage source domain data and transfer useful information to the target domain. TAc learns to generate effective molecular features that can generalize well from one domain to another, and increase the classification performance in the target domain. Additionally, TAc-fc extends TAc by incorporating novel components to selectively learn feature-wise and compound-wise transferability. We used the bioassay screening data from PubChem, and identified 120 pairs of bioassays such that the active compounds in each pair are more similar to each other than to its inactive compounds. Our experiments clearly demonstrate that TAc achieves significant improvement over all baselines across a large number of target tasks. Furthermore, although TAc-fc achieves slightly worse ROC-AUC on average compared to TAc, TAc-fc still achieves the best performance on more tasks in terms of PR-AUC and F1 compared to other methods. In summary, TAc-fc is also found to be a strong model with competitive or even better performance than TAc on a notable number of target tasks.
Submitted 8 March, 2022; v1 submitted 14 November, 2021;
originally announced November 2021.
-
A Pipeline to Understand Emerging Illness via Social Media Data Analysis: A Case Study on Breast Implant Illness
Authors:
Vishal Dey,
Peter Krasniak,
Minh Nguyen,
Clara Lee,
Xia Ning
Abstract:
Background: A new illness could first come to the public attention over social media before it is medically defined, formally documented or systematically studied. One example is a phenomenon known as breast implant illness (BII) that has been extensively discussed on social media, though vaguely defined in medical literature. Objectives: The objective of this study is to construct a data analysis pipeline to understand emerging illness using social media data, and to apply the pipeline to understand key attributes of BII. Methods: We conducted a pipeline of social media data analysis using Natural Language Processing (NLP) and topic modeling. We extracted mentions related to signs/symptoms, diseases/disorders and medical procedures using the Clinical Text Analysis and Knowledge Extraction System (cTAKES) from social media data. We mapped the mentions to standard medical concepts. We summarized mapped concepts to topics using Latent Dirichlet Allocation (LDA). Finally, we applied this pipeline to understand BII from several BII-dedicated social media sites. Results: Our pipeline identified topics related to toxicity, cancer and mental health issues that are highly associated with BII. Our pipeline also shows that cancers, autoimmune disorders and mental health problems are emerging concerns associated with breast implants based on social media discussions. The pipeline also identified mentions such as rupture, infection, pain and fatigue as common self-reported issues among the public, as well as toxicity from silicone implants. Conclusions: Our study could inspire future work studying the suggested symptoms and factors of BII. Our study provides the first analysis and derived knowledge of BII from social media using NLP techniques, and demonstrates the potential of using social media information to better understand similar emerging illnesses.
Submitted 8 March, 2022; v1 submitted 25 August, 2020;
originally announced August 2020.
-
Adversarial Attacks and Defences: A Survey
Authors:
Anirban Chakraborty,
Manaar Alam,
Vishal Dey,
Anupam Chattopadhyay,
Debdeep Mukhopadhyay
Abstract:
Deep learning has emerged as a strong and efficient framework that can be applied to a broad spectrum of complex learning problems which were difficult to solve using the traditional machine learning techniques in the past. In the last few years, deep learning has advanced radically in such a way that it can surpass human-level performance on a number of tasks. As a consequence, deep learning is being extensively used in most of the recent day-to-day applications. However, deep learning systems are vulnerable to crafted adversarial examples, which may be imperceptible to the human eye, but can lead the model to misclassify the output. In recent times, different types of adversaries based on their threat model leverage these vulnerabilities to compromise a deep learning system where adversaries have high incentives. Hence, it is extremely important to provide robustness to deep learning algorithms against these adversaries. However, there are only a few strong countermeasures which can be used in all types of attack scenarios to design a robust deep learning system. In this paper, we attempt to provide a detailed discussion on different types of adversarial attacks with various threat models and also elaborate the efficiency and challenges of recent countermeasures against them.
Submitted 28 September, 2018;
originally announced October 2018.