-
OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery
Authors:
Vignesh Prabhakar,
Md Amirul Islam,
Adam Atanas,
Yao-Ting Wang,
Joah Han,
Aastha Jhunjhunwala,
Rucha Apte,
Robert Clark,
Kang Xu,
Zihan Wang,
Kai Liu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable potential in advancing scientific knowledge and addressing complex challenges. In this work, we introduce OmniScience, a specialized large reasoning model for general science, developed through three key components: (1) domain adaptive pretraining on a carefully curated corpus of scientific literature, (2) instruction tuning on a specialized dataset to guide the model in following domain-specific tasks, and (3) reasoning-based knowledge distillation through fine-tuning to significantly enhance its ability to generate contextually relevant and logically sound responses. We demonstrate the versatility of OmniScience by developing a battery agent that efficiently ranks molecules as potential electrolyte solvents or additives. Comprehensive evaluations reveal that OmniScience is competitive with state-of-the-art large reasoning models on the GPQA Diamond and domain-specific battery benchmarks, while outperforming all public reasoning and non-reasoning models with similar parameter counts. We further demonstrate via ablation experiments that domain adaptive pretraining and reasoning-based knowledge distillation are critical to attaining our performance levels across benchmarks.
Submitted 22 April, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model
Authors:
Yu-Chen Lin,
Akhilesh Kumar,
Norman Chang,
Wenliang Zhang,
Muhammad Zakir,
Rucha Apte,
Haiyang He,
Chao Wang,
Jyh-Shing Roger Jang
Abstract:
We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of the embedding space; (ii) introducing the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing data renovation reliability; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt technique; and (iv) effectively refactoring existing scripts to generate new and high-quality scripts with LLMs. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques enhance the Retrieval-Augmented Generation (RAG) method in retrieving more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" for code generation problems in MapReduce applications.
Submitted 30 January, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Diffusion model based data generation for partial differential equations
Authors:
Rucha Apte,
Sheel Nidhan,
Rishikesh Ranade,
Jay Pathak
Abstract:
In a preliminary attempt to address the problem of data scarcity in physics-based machine learning, we introduce a novel methodology for data generation in physics-based simulations. Our motivation is to overcome the limitations posed by the limited availability of numerical data. To achieve this, we leverage a diffusion model that allows us to generate synthetic data samples and test them for two canonical cases: (a) the steady 2-D Poisson equation, and (b) the forced unsteady 2-D Navier-Stokes (NS) (vorticity-transport) equation in a confined box. By comparing the generated data samples against outputs from classical solvers, we assess their accuracy and examine their adherence to the underlying physics laws. In this way, we emphasize the importance of not only satisfying visual and statistical comparisons with solver data but also ensuring the generated data's conformity to physics laws, thus enabling their effective utilization in downstream tasks.
Submitted 19 June, 2023;
originally announced June 2023.
-
Systematic Analysis of Image Generation using GANs
Authors:
Rohan Akut,
Sumukh Marathe,
Rucha Apte,
Ishan Joshi,
Siddhivinayak Kulkarni
Abstract:
Generative Adversarial Networks have been crucial in the developments made in unsupervised learning in recent times. Exemplars of image synthesis from text or other images, these networks have shown remarkable improvements over conventional methods in terms of performance. Trained on the adversarial training philosophy, these networks aim to estimate the potential distribution from the real data and then use this as input to generate the synthetic data. Based on this fundamental principle, several frameworks can be generated that are paragon implementations in several real-life applications such as art synthesis, generation of high-resolution outputs and synthesis of images from human-drawn sketches, to name a few. While theoretically GANs present better results and prove to be an improvement over conventional methods in many factors, the implementation of these frameworks for dedicated applications remains a challenge. This study explores and presents a taxonomy of these frameworks and their use in various image-to-image synthesis and text-to-image synthesis applications. The basic GANs, as well as a variety of different niche frameworks, are critically analyzed. The advantages of GANs for image generation over conventional methods as well as their disadvantages amongst other frameworks are presented. The future applications of GANs in industries such as healthcare, art and entertainment are also discussed.
Submitted 30 August, 2019;
originally announced August 2019.