-
From Words to Code: Harnessing Data for Program Synthesis from Natural Language
Authors:
Anirudh Khatry,
Joyce Cahoon,
Jordan Henkel,
Shaleen Deep,
Venkatesh Emani,
Avrilia Floratou,
Sumit Gulwani,
Vu Le,
Mohammad Raza,
Sherry Shi,
Mukul Singh,
Ashish Tiwari
Abstract:
Creating programs to correctly manipulate data is a difficult task, as the underlying programming languages and APIs can be challenging to learn for many users who are not skilled programmers. Large language models (LLMs) demonstrate remarkable potential for generating code from natural language, but in the data manipulation domain, apart from the natural language (NL) description of the intended…
▽ More
Creating programs to correctly manipulate data is a difficult task, as the underlying programming languages and APIs can be challenging to learn for many users who are not skilled programmers. Large language models (LLMs) demonstrate remarkable potential for generating code from natural language, but in the data manipulation domain, apart from the natural language (NL) description of the intended task, we also have the dataset on which the task is to be performed, or the "data context". Existing approaches have utilized data context in a limited way by simply adding relevant information from the input data into the prompts sent to the LLM.
In this work, we utilize the available input data to execute the candidate programs generated by the LLMs and gather their outputs. We introduce semantic reranking, a technique to rerank the programs generated by LLMs based on three signals coming the program outputs: (a) semantic filtering and well-formedness based score tuning: do programs even generate well-formed outputs, (b) semantic interleaving: how do the outputs from different candidates compare to each other, and (c) output-based score tuning: how do the outputs compare to outputs predicted for the same task. We provide theoretical justification for semantic interleaving. We also introduce temperature mixing, where we combine samples generated by LLMs using both high and low temperatures. We extensively evaluate our approach in three domains, namely databases (SQL), data science (Pandas) and business intelligence (Excel's Power Query M) on a variety of new and existing benchmarks. We observe substantial gains across domains, with improvements of up to 45% in top-1 accuracy and 34% in top-3 accuracy.
△ Less
Submitted 3 May, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Cobra: A Framework for Cost Based Rewriting of Database Applications
Authors:
K. Venkatesh Emani,
S. Sudarshan
Abstract:
Database applications are typically written using a mixture of imperative languages and declarative frameworks for data processing. Application logic gets distributed across the declarative and imperative parts of a program. Often, there is more than one way to implement the same program, whose efficiency may depend on a number of parameters. In this paper, we propose a framework that automaticall…
▽ More
Database applications are typically written using a mixture of imperative languages and declarative frameworks for data processing. Application logic gets distributed across the declarative and imperative parts of a program. Often, there is more than one way to implement the same program, whose efficiency may depend on a number of parameters. In this paper, we propose a framework that automatically generates all equivalent alternatives of a given program using a given set of program transformations, and chooses the least cost alternative. We use the concept of program regions as an algebraic abstraction of a program and extend the Volcano/Cascades framework for optimization of algebraic expressions, to optimize programs. We illustrate the use of our framework for optimizing database applications. We show through experimental results, that our framework has wide applicability in real world applications and provides significant performance benefits.
△ Less
Submitted 26 February, 2018; v1 submitted 15 January, 2018;
originally announced January 2018.
-
Optimization of Imperative Programs in a Relational Database
Authors:
Karthik Ramachandra,
Kwanghyun Park,
K. Venkatesh Emani,
Alan Halverson,
Cesar Galindo-Legaria,
Conor Cunningham
Abstract:
For decades, RDBMSs have supported declarative SQL as well as imperative functions and procedures as ways for users to express data processing tasks. While the evaluation of declarative SQL has received a lot of attention resulting in highly sophisticated techniques, the evaluation of imperative programs has remained naive and highly inefficient. Imperative programs offer several benefits over SQL…
▽ More
For decades, RDBMSs have supported declarative SQL as well as imperative functions and procedures as ways for users to express data processing tasks. While the evaluation of declarative SQL has received a lot of attention resulting in highly sophisticated techniques, the evaluation of imperative programs has remained naive and highly inefficient. Imperative programs offer several benefits over SQL and hence are often preferred and widely used. But unfortunately, their abysmal performance discourages, and even prohibits their use in many situations. We address this important problem that has hitherto received little attention.
We present Froid, an extensible framework for optimizing imperative programs in relational databases. Froid's novel approach automatically transforms entire User Defined Functions (UDFs) into relational algebraic expressions, and embeds them into the calling SQL query. This form is now amenable to cost-based optimization and results in efficient, set-oriented, parallel plans as opposed to inefficient, iterative, serial execution of UDFs. Froid's approach additionally brings the benefits of many compiler optimizations to UDFs with no additional implementation effort. We describe the design of Froid and present our experimental evaluation that demonstrates performance improvements of up to multiple orders of magnitude on real workloads.
△ Less
Submitted 20 August, 2019; v1 submitted 1 December, 2017;
originally announced December 2017.