Search | arXiv e-print repository

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Authors: Dhruv Gautam, Spandan Garg, Jinu Jang, Neel Sundaresan, Roshanak Zilouchian Moghaddam

Abstract: Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. Solving tasks within RefactorBench requires thorough exploration of dependencies across multiple files and strong adherence to relevant instructions. Every task is defined by 3 natural language instructions of varying specificity and is mutually exclusive, allowing for the creation of longer combined tasks on the same repository. Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer with short time constraints solving 87%. Through trajectory analysis, we identify various unique failure modes of LM agents, and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. We further extend our state-aware approach to encompass entire digital environments and outline potential directions for future research. RefactorBench aims to support the study of LM agents by providing a set of real-world, multi-hop tasks within the realm of code.

Submitted 10 March, 2025; originally announced March 2025.

Comments: ICLR 2025 Camera Ready

ACM Class: I.2.5

arXiv:2502.15034

Randomized benchmarking of a high-fidelity remote CNOT gate over a meter-scale microwave interconnect

Authors: Kentaro Heya, Timothy Phung, Moein Malekakhlagh, Rachel Steiner, Marco Turchetti, William Shanks, John Mamin, Wen-Sen Lu, Yadav Prasad Kandel, Neereja Sundaresan, Jason Orcutt

Abstract: In the modular superconducting quantum processor architecture, high-fidelity, meter-scale microwave interconnect between processor modules is a key technology for extending system size beyond constraints imposed by device manufacturing equipment, yield, and signal delivery. While there have been many demonstrations of remote state transfer between modules, these relied on tomographic experiments for benchmarking, but this technique does not reliably separate State Preparation And Measurement (SPAM) error from error per state transfer. Recent developments based on randomized benchmarking provide a compatible theory for separating these two errors. In this work, we present a module-to-module interconnect based on Tunable-Coupling Qubits (TCQs) and benchmark, in a SPAM error tolerant manner, a remote state transfer fidelity of 0.988 across a 60cm long coplanar waveguide (CPW). The state transfer is implemented via superadiabatic transitionless driving method, which suppresses intermediate excitation in internal modes of CPW. We also introduce the frame tracking technique to correct unintended qubit phase rotations before and after the state transfers, which enables the SPAM-error-tolerant benchmarking of the state transfers. We further propose and construct a remote CNOT gate between modules, composed of local CZ gates in each module and remote state transfers, and report a high gate fidelity of 0.933 using randomized benchmarking method. The remote CNOT construction and benchmarking we present is a more complete metric that fully characterizes the module to module link operation going forward as it more closely represents interconnect operation in a circuit.

Submitted 20 February, 2025; originally announced February 2025.

arXiv:2412.14308

Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation

Authors: Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, Alexey Svyatkovskiy

Abstract: Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and show that LLMs frequently do generate undesirable test smells -- up to 37% of the time. Then, we implemented lightweight static analysis-based reward model and trained LLMs using this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23%, and generated nearly 100% syntactically-correct code. RLSQM also outperformed GPT-4 on all code quality metrics, in spite of training a substantially cheaper Codex model. We provide insights into how reliably utilize RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing. Our data are available at https://doi.org/10.6084/m9.figshare.25983166.

Submitted 6 January, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

Comments: This work was intended as a replacement of arXiv:2310.02368 and any subsequent updates will appear there

arXiv:2409.04634

Mechanically-intermixed indium superconducting connections for microwave quantum interconnects

Authors: Yves Martin, Neereja Sundaresan, Jae-woong Nah, Rachel Steiner, Marco Turchetti, Kevin Stawiasz, Chi Xiong, Jason S. Orcutt

Abstract: Superconducting coaxial cables represent critical communication channels for interconnecting superconducting quantum processors. Here, we report mechanically-intermixed indium joins to aluminum coaxial cables for low loss quantum interconnects. We describe an ABCD matrix formalism to characterize the total resonator internal quality factor ($Q_i$) and any contact ($R_{cont}$) or shunt resistance ($R_{shunt}$) associated with the mechanically-intermixed indium joins. We present four resonator test systems incorporating three indium join methods over the typical frequency range of interest (3-5.5GHz) at temperatures below $20mK$. We measure high internal quality factor aluminum cables ($Q_i = 1.55 \pm 0.37 x 10^6$) through a push-to-connect indium join of the outer conductor that capacitively couples the inner conductor for reflection measurements. We then characterize the total internal quality factors of modes of a cable resonator with a push-to-connect superconducting cable-splice at the midpoint to find mean $Q_i = 1.40 x 10^6$ and $Q_i = 9.39 x 10^5$ for even and odd-modes respectively and use an ABCD matrix model of the system to extract $R_{cont} = 6x10^{-4} Ω$ for the indium join of the inner conductor. Finally, we demonstrate indium press-mold cable-to-chip connections where the cable-to-chip join is placed at a current node and voltage node through varying on-chip waveguide lengths with mean $Q_i = 1.24 x 10^6$ and $Q_i = 1.07 x 10^6$ respectively to extract $R_{cont} = 8.5x10^{-4} Ω$ and $R_{shunt} = 1.3x10^7 Ω$ for the interface. With these techniques, we demonstrate a set of low-loss methods to join superconducting cables for future quantum

Submitted 6 September, 2024; originally announced September 2024.

Comments: 6 pages, 5 figures

arXiv:2404.08885

Is Next Token Prediction Sufficient for GPT? Exploration on Code Logic Comprehension

Authors: Mengnan Qi, Yufan Huang, Yongqiang Yao, Maoquan Wang, Bin Gu, Neel Sundaresan

Abstract: Large language models (LLMs) has experienced exponential growth, they demonstrate remarkable performance across various tasks. Notwithstanding, contemporary research primarily centers on enhancing the size and quality of pretraining data, still utilizing the next token prediction task on autoregressive transformer model structure. The efficacy of this task in truly facilitating the model's comprehension of code logic remains questionable, we speculate that it still interprets code as mere text, while human emphasizes the underlying logical knowledge. In order to prove it, we introduce a new task, "Logically Equivalent Code Selection," which necessitates the selection of logically equivalent code from a candidate set, given a query code. Our experimental findings indicate that current LLMs underperform in this task, since they understand code by unordered bag of keywords. To ameliorate their performance, we propose an advanced pretraining task, "Next Token Prediction+". This task aims to modify the sentence embedding distribution of the LLM without sacrificing its generative capabilities. Our experimental results reveal that following this pretraining, both Code Llama and StarCoder, the prevalent code domain pretraining models, display significant improvements on our logically equivalent code selection task and the code completion task.

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2403.08299

AutoDev: Automated AI-Driven Development

Authors: Michele Tufano, Anisha Agarwal, Jinu Jang, Roshanak Zilouchian Moghaddam, Neel Sundaresan

Abstract: The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the potential capabilities available in an IDE such as building, testing, executing code, git operations, etc. Therefore, they are constrained by their limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, we present AutoDev, a fully automated AI-driven software development framework, designed for autonomous planning and execution of intricate software engineering tasks. AutoDev enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI Agents to achieve. These AI agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and git operations. They also have access to files, compiler output, build and testing logs, static analysis tools, and more. This enables the AI Agents to execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required. Furthermore, AutoDev establishes a secure development environment by confining all operations within Docker containers. This framework incorporates guardrails to ensure user privacy and file security, allowing users to define specific permitted or restricted commands and operations within AutoDev. In our evaluation, we tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2402.14261

Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming

Authors: Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, Michele Tufano

Abstract: The integration of Large Language Models (LLMs) into Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, utilizing LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state of the art evaluation systems. We design and compute both static and execution based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug-fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM guided IDEs.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2401.09663

Enhanced Quantum State Transfer and Bell State Generation over Long-Range Multimode Interconnects via Superadiabatic Transitionless Driving

Authors: Moein Malekakhlagh, Timothy Phung, Daniel Puzzuoli, Kentaro Heya, Neereja Sundaresan, Jason Orcutt

Abstract: Achieving high-fidelity direct two-qubit gates over meter-scale long quantum interconnects is challenging in part due to the multimode nature of such systems. One alternative scheme is to combine local operations with remote quantum state transfer or remote entanglement. Here, we study quantum state transfer and entanglement generation for two distant qubits, equipped with tunable interactions, over a common multimode interconnect. We employ the SuperAdiabatic Transitionless Driving (SATD) solutions for adiabatic passage and demonstrate various favorable improvements over the standard protocol. In particular, by suppressing leakage to a select (resonant) interconnect mode, SATD breaks the speed-limit relation imposed by the qubit-interconnect interaction $g$, where instead the operation time is limited by leakage to the adjacent modes, i.e. free spectral range $Δ_c$ of the interconnect, allowing for fast operations even with weak $g$. Furthermore, we identify a multimode error mechanism for Bell state generation using such adiabatic protocols, in which the even/odd modal dependence of qubit-interconnect interaction breaks down the dark state symmetry, leading to detrimental adiabatic overlap with the odd modes growing as $(g/Δ_c)^2$. Therefore, adopting a weak coupling, imposed by a multimode interconnect, SATD provides a significant improvement in terms of operation speed and consequently sensitivity to incoherent error.

Submitted 17 January, 2024; originally announced January 2024.

Comments: 14 pages, 12 figures, 4 appendices

arXiv:2312.11508

Rethinking the Instruction Quality: LIFT is What You Need

Authors: Yang Xu, Yongqiang Yao, Yufan Huang, Mengnan Qi, Maoquan Wang, Bin Gu, Neel Sundaresan

Abstract: Instruction tuning, a specialized technique to enhance large language model (LLM) performance via instruction datasets, relies heavily on the quality of employed data. Existing quality improvement methods alter instruction data through dataset expansion or curation. However, the expansion method risks data redundancy, potentially compromising LLM performance, while the curation approach confines the LLM's potential to the original dataset. Our aim is to surpass the original data quality without encountering these shortcomings. To achieve this, we propose LIFT (LLM Instruction Fusion Transfer), a novel and versatile paradigm designed to elevate the instruction quality to new heights. LIFT strategically broadens data distribution to encompass more high-quality subspaces and eliminates redundancy, concentrating on high-quality segments across overall data subspaces. Experimental results demonstrate that, even with a limited quantity of high-quality instruction data selected by our paradigm, LLMs not only consistently uphold robust performance across various tasks but also surpass some state-of-the-art results, highlighting the significant improvement in instruction quality achieved by our paradigm.

Submitted 27 December, 2023; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2310.14209

SUT: Active Defects Probing for Transcompiler Models

Authors: Mengnan Qi, Yufan Huang, Maoquan Wang, Yongqiang Yao, Zihan Liu, Bin Gu, Colin Clement, Neel Sundaresan

Abstract: Automatic Program translation has enormous application value and hence has been attracting significant interest from AI researchers. However, we observe that current program translation models still make elementary syntax errors, particularly, when the target language does not have syntax elements in the source language. Metrics like BLUE, CodeBLUE and computation accuracy may not expose these issues. In this paper we introduce a new metrics for programming language translation and these metrics address these basic syntax errors. We develop a novel active defects probing suite called Syntactic Unit Tests (SUT) which includes a highly interpretable evaluation harness for accuracy and test scoring. Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests. Specifically, compared to previous program translation task evaluation dataset, its pass rate on our unit tests has decreased by 26.15%. Further our evaluation harness reveal syntactic element errors in which these models exhibit deficiencies.

Submitted 22 October, 2023; originally announced October 2023.

arXiv:2310.11476

Program Translation via Code Distillation

Authors: Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, Neel Sundaresan

Abstract: Software version migration and program translation are an important and costly part of the lifecycle of large codebases. Traditional machine translation relies on parallel corpora for supervised translation, which is not feasible for program translation due to a dearth of aligned data. Recent unsupervised neural machine translation techniques have overcome data limitations by included techniques such as back translation and low level compiler intermediate representations (IR). These methods face significant challenges due to the noise in code snippet alignment and the diversity of IRs respectively. In this paper we propose a novel model called Code Distillation (CoDist) whereby we capture the semantic and structural equivalence of code in a language agnostic intermediate representation. Distilled code serves as a translation pivot for any programming language, leading by construction to parallel corpora which scale to all available source code by simply applying the distillation compiler. We demonstrate that our approach achieves state-of-the-art performance on CodeXGLUE and TransCoder GeeksForGeeks translation benchmarks, with an average absolute increase of 12.7% on the TransCoder GeeksforGeeks translation benchmark compare to TransCoder-ST.

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2310.02368

Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation

Authors: Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, Alexey Svyatkovskiy

Abstract: Software testing is a crucial aspect of software development, and the creation of high-quality tests that adhere to best practices is essential for effective maintenance. Recently, Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. However, these LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM). To begin, we analyze the anti-patterns generated by the LLM and show that LLMs can generate undesirable test smells. Thus, we train specific reward models for each static quality metric, then utilize Proximal Policy Optimization (PPO) to train models for optimizing a single quality metric at a time. Furthermore, we amalgamate these rewards into a unified reward model aimed at capturing different best practices and quality aspects of tests. By comparing RL-trained models with those trained using supervised learning, we provide insights into how reliably utilize RL to improve test generation quality and into the effects of various training strategies. Our experimental results demonstrate that the RL-optimized model consistently generated high-quality test cases compared to the base LLM, improving the model by up to 21%, and successfully generates nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on four out of seven metrics. This represents a significant step towards enhancing the overall efficiency and reliability of software testing through Reinforcement Learning and static quality metrics. Our data are available at https://figshare.com/s/ded476c8d4c221222849.

Submitted 6 January, 2025; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: Accepted to DeepTest 2025 (ICSE Workshop). Previously this version appeared as arXiv:2412.14308 which was submitted as a new work by accident

arXiv:2307.13383

Predicting Code Coverage without Execution

Authors: Michele Tufano, Shubham Chandel, Anisha Agarwal, Neel Sundaresan, Colin Clement

Abstract: Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Calculating code coverage is resource-intensive, requiring code building and execution with additional overhead for the instrumentation. Furthermore, computing coverage of any snippet of code requires the whole program context. Using Machine Learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can be a novel benchmark for judging the ability of models to understand code. We propose a novel benchmark task called Code Coverage Prediction for Large Language Models (LLMs). We formalize this task to evaluate the capability of LLMs in understanding code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset we call COVERAGEEVAL by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, including OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage as a metric and pre-training data source are valuable for overall LLM performance on software engineering tasks.

Submitted 25 July, 2023; originally announced July 2023.

arXiv:2306.17077

RAPGen: An Approach for Fixing Code Inefficiencies in Zero-Shot

Authors: Spandan Garg, Roshanak Zilouchian Moghaddam, Neel Sundaresan

Abstract: Performance bugs are non-functional bugs that can even manifest in well-tested commercial products. Fixing these performance bugs is an important yet challenging problem. In this work, we address this challenge and present a new approach called Retrieval-Augmented Prompt Generation (RAPGen). Given a code snippet with a performance issue, RAPGen first retrieves a prompt instruction from a pre-constructed knowledge-base of previous performance bug fixes and then generates a prompt using the retrieved instruction. It then uses this prompt on a Large Language Model (such as Codex) in zero-shot to generate a fix. We compare our approach with the various prompt variations and state of the art methods in the task of performance bug fixing. Our evaluation shows that RAPGen can generate performance improvement suggestions equivalent or better than a developer in ~60% of the cases, getting ~42% of them verbatim, in an expert-verified dataset of past performance changes made by C# developers.

Submitted 8 January, 2025; v1 submitted 29 June, 2023; originally announced June 2023.

arXiv:2306.01754

Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?

Authors: Aaron Chan, Anant Kharkar, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Alec Helyar, Eslam Kamal, Mohamed Elkamhawy, Neel Sundaresan

Abstract: Software vulnerabilities bear enterprises significant costs. Despite extensive efforts in research and development of software vulnerability detection methods, uncaught vulnerabilities continue to put software owners and users at risk. Many current vulnerability detection methods require that code snippets can compile and build before attempting detection. This, unfortunately, introduces a long latency between the time a vulnerability is injected to the time it is removed, which can substantially increases the cost of fixing a vulnerability. We recognize that the current advances in machine learning can be used to detect vulnerable code patterns on syntactically incomplete code snippets as the developer is writing the code at EditTime. In this paper we present a practical system that leverages deep learning on a large-scale data set of vulnerable code patterns to learn complex manifestations of more than 250 vulnerability types and detect vulnerable code patterns at EditTime. We discuss zero-shot, few-shot, and fine-tuning approaches on state of the art pre-trained Large Language Models (LLMs). We show that in comparison with state of the art vulnerability detection models our approach improves the state of the art by 10%. We also evaluate our approach to detect vulnerability in auto-generated code by code LLMs. Evaluation on a benchmark of high-risk code scenarios shows a reduction of up to 90% vulnerability reduction.

Submitted 22 May, 2023; originally announced June 2023.

arXiv:2305.13581 [pdf, other]

doi 10.1038/s41586-023-06846-3

Encoding a magic state with beyond break-even fidelity

Authors: Riddhi S. Gupta, Neereja Sundaresan, Thomas Alexander, Christopher J. Wood, Seth T. Merkel, Michael B. Healy, Marius Hillenbrand, Tomas Jochym-O'Connor, James R. Wootton, Theodore J. Yoder, Andrew W. Cross, Maika Takita, Benjamin J. Brown

Abstract: To run large-scale algorithms on a quantum computer, error-correcting codes must be able to perform a fundamental set of operations, called logic gates, while isolating the encoded information from noise. We can complete a universal set of logic gates by producing special resources called magic states. It is therefore important to produce high-fidelity magic states to conduct algorithms while introducing a minimal amount of noise to the computation. Here, we propose and implement a scheme to prepare a magic state on a superconducting qubit array using error correction. We find that our scheme produces better magic states than those we can prepare using the individual qubits of the device. This demonstrates a fundamental principle of fault-tolerant quantum computing, namely, that we can use error correction to improve the quality of logic gates with noisy qubits. Additionally, we show we can increase the yield of magic states using adaptive circuits, where circuit elements are changed depending on the outcome of mid-circuit measurements. This demonstrates an essential capability we will need for many error-correction subroutines. Our prototype will be invaluable in the future as it can reduce the number of physical qubits needed to produce high-fidelity magic states in large-scale quantum-computing architectures.

Submitted 13 March, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: 19 pages, 13 figures, 3 tables, comments welcome; v2 - Updated draft including new appendices following peer review. Includes a section on injecting the encoded magic state into larger codes (explicitly studying the surface code, the heavy-hex code and the color code) and a numerical section interrogating the fault-tolerant properties of the circuit

Journal ref: Nature 625, 259 (2024)

arXiv:2305.05383 [pdf, other]

Code Execution with Pre-trained Language Models

Authors: Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan

Abstract: Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution.

Submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted to the Findings of ACL 2023

arXiv:2303.07263 [pdf, other]

InferFix: End-to-End Program Repair with LLMs

Authors: Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, Alexey Svyatkovskiy

Abstract: The software development life cycle is profoundly influenced by bugs: their introduction, identification, and eventual resolution account for a significant portion of software cost. This has motivated software engineering researchers and practitioners to propose different approaches for automating the identification and repair of software defects. Large language models have been adapted to the program repair task through few-shot demonstration learning and instruction prompting, treating this as an infilling task. However, these models have only focused on learning general bug-fixing patterns for uncategorized bugs mined from public repositories. In this paper, we propose InferFix: a transformer-based program repair framework paired with a state-of-the-art static analyzer to fix critical security and performance bugs. InferFix combines a Retriever -- a transformer encoder model pretrained via a contrastive learning objective, which aims at searching for semantically equivalent bugs and corresponding fixes; and a Generator -- a large language model (Codex Cushman) finetuned on supervised bug-fix data with prompts augmented via bug type annotations and semantically similar fixes retrieved from an external non-parametric memory. To train and evaluate our approach, we curated InferredBugs, a novel, metadata-rich dataset of bugs extracted by executing the Infer static analyzer on the change histories of thousands of Java and C# repositories. Our evaluation demonstrates that InferFix outperforms strong LLM baselines, with a top-1 accuracy of 65.6% for generating fixes in C# and 76.8% in Java. We discuss the deployment of InferFix alongside Infer at Microsoft which offers an end-to-end solution for detection, classification, and localization of bugs, as well as fixing and validation of candidate patches, integrated in the continuous integration pipeline to automate the software development workflow.

Submitted 13 March, 2023; originally announced March 2023.

arXiv:2208.13928 [pdf, other]

doi 10.1145/3540250.3558959

Exploring and Evaluating Personalized Models for Code Generation

Authors: Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin Clement, Neel Sundaresan, Michele Tufano

Abstract: Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.

Submitted 19 September, 2022; v1 submitted 29 August, 2022; originally announced August 2022.

Comments: Accepted to the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022), Industry Track - Singapore, November 14-18, 2022, to appear 9 pages

arXiv:2206.13619 [pdf, other]

DeepPERF: A Deep Learning-Based Approach For Improving Software Performance

Authors: Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, Chen Wu

Abstract: Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the wide-spread availability of open source data create a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepPERF on English and source code corpora, followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in ~53% of the cases, getting ~34% of them verbatim in our expert-verified dataset of performance changes made by C# developers. Additionally, we evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that our model is able to suggest valid performance improvements that can improve both CPU usage and memory allocations. So far we've submitted 19 pull-requests with 28 different performance optimizations and 11 of these PRs have been approved by the project owners.

Submitted 27 June, 2022; originally announced June 2022.

arXiv:2205.11023 [pdf, other]

AdaptivePaste: Code Adaptation through Learning Semantics-aware Variable Usage Representations

Authors: Xiaoyu Liu, Jinu Jang, Neel Sundaresan, Miltiadis Allamanis, Alexey Svyatkovskiy

Abstract: In software development, it is common for programmers to copy-paste or port code snippets and then adapt them to their use case. This scenario motivates the code adaptation task -- a variant of program repair which aims to adapt variable identifiers in a pasted snippet of code to the surrounding, preexisting source code. However, no existing approach has been shown to effectively address this task. In this paper, we introduce AdaptivePaste, a learning-based approach to source code adaptation, based on transformers and a dedicated dataflow-aware deobfuscation pre-training task to learn meaningful representations of variable usage patterns. We evaluate AdaptivePaste on a dataset of code snippets in Python. Results suggest that our model can learn to adapt source code with 79.8% accuracy. To evaluate how valuable AdaptivePaste is in practice, we perform a user study with 10 Python developers on a hundred real-world copy-paste instances. The results show that AdaptivePaste reduces the dwell time to nearly half the time it takes for manual code adaptation, and helps to avoid bugs. In addition, we utilize the participant feedback to identify potential avenues for improvement of AdaptivePaste.

Submitted 6 October, 2023; v1 submitted 22 May, 2022; originally announced May 2022.

arXiv:2204.12648 [pdf, other]

Generating Examples From CLI Usage: Can Transformers Help?

Authors: Roshanak Zilouchian Moghaddam, Spandan Garg, Colin B. Clement, Yevhen Mohylevskyy, Neel Sundaresan

Abstract: Continuous evolution in modern software often causes documentation, tutorials, and examples to be out of sync with changing interfaces and frameworks. Relying on outdated documentation and examples can lead programs to fail or be less efficient or even less secure. In response, programmers need to regularly turn to other resources on the web such as StackOverflow for examples to guide them in writing software. We recognize that this inconvenient, error-prone, and expensive process can be improved by using machine learning applied to software usage data. In this paper, we present our practical system which uses machine learning on large-scale telemetry data and documentation corpora, generating appropriate and complex examples that can be used to improve documentation. We discuss both feature-based and transformer-based machine learning approaches and demonstrate that our system achieves 100% coverage for the used functionalities in the product, provides up-to-date examples upon every release, and reduces the number of PRs submitted by software owners writing and editing documentation by >68%. We also share valuable lessons learnt during the 3 years that our production quality system has been deployed for Azure Cloud Command Line Interface (Azure CLI).

Submitted 26 April, 2022; originally announced April 2022.

arXiv:2203.12776 [pdf, other]

doi 10.1145/3524842.3528009

Methods2Test: A dataset of focal methods mapped to test cases

Authors: Michele Tufano, Shao Kun Deng, Neel Sundaresan, Alexey Svyatkovskiy

Abstract: Unit testing is an essential part of the software development process, which helps to identify issues with source code in early stages of development and prevent regressions. Machine learning has emerged as a viable approach to help software developers generate automated unit tests. However, generating reliable unit test cases that are semantically correct and capable of catching software bugs or unintended behavior via machine learning requires large, metadata-rich datasets. In this paper we present Methods2Test, a large, supervised dataset of test cases mapped to corresponding methods under test (i.e., focal methods). This dataset contains 780,944 pairs of JUnit tests and focal methods, extracted from a total of 91,385 Java open source projects hosted on GitHub with licenses permitting re-distribution. The main challenge behind the creation of Methods2Test was to establish a reliable mapping between a test case and the relevant focal method. To this aim, we designed a set of heuristics, based on developers' best practices in software testing, which identify the likely focal method for a given test case. To facilitate further analysis, we store a rich set of metadata for each method-test pair in JSON-formatted files. Additionally, we extract a textual corpus from the dataset at different context levels, which we provide both in raw and tokenized forms, in order to enable researchers to train and evaluate machine learning models for Automated Test Generation. Methods2Test is publicly available at: https://github.com/microsoft/methods2test

Submitted 23 March, 2022; originally announced March 2022.

Comments: Accepted for publication in the proceedings of The 2022 Mining Software Repositories Conference (MSR 2022) - Data and Tool track

arXiv:2203.09907 [pdf, ps, other]

doi 10.1145/3510003.3510153

Learning to Reduce False Positives in Analytic Bug Detectors

Authors: Anant Kharkar, Roshanak Zilouchian Moghaddam, Matthew Jin, Xiaoyu Liu, Xin Shi, Colin Clement, Neel Sundaresan

Abstract: Due to increasingly complex software design and rapid iterative development, code defects and security vulnerabilities are prevalent in modern software. In response, programmers rely on static analysis tools to regularly scan their codebases and find potential bugs. In order to maximize coverage, however, these tools generally tend to report a significant number of false positives, requiring developers to manually verify each warning. To address this problem, we propose a Transformer-based learning approach to identify false positive bug warnings. We demonstrate that our models can improve the precision of static analysis by 17.5%. In addition, we validated the generalizability of this approach across two major bug types: null dereference and resource leak.

Submitted 7 March, 2022; originally announced March 2022.

Comments: Accepted for publication at ICSE 2022

arXiv:2203.09095 [pdf, other]

Automating Code Review Activities by Large-Scale Pre-training

Authors: Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan

Abstract: Code review is an essential part of the software development lifecycle since it aims at guaranteeing the quality of code. Modern code review activities necessitate developers viewing, understanding and even running the programs to assess logic, functionality, latency, style and other factors. It turns out that developers have to spend far too much time reviewing the code of their peers. Accordingly, there is significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real-world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario. To evaluate our model, we focus on three key tasks related to code review activities, including code change quality estimation, review comment generation and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis shows that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model on the understanding of code changes and reviews.

Submitted 11 October, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

Comments: ESEC/FSE 2022, camera-ready version

arXiv:2203.07205 [pdf, other]

doi 10.1038/s41467-023-38247-5

Matching and maximum likelihood decoding of a multi-round subsystem quantum error correction experiment

Authors: Neereja Sundaresan, Theodore J. Yoder, Youngseok Kim, Muyuan Li, Edward H. Chen, Grace Harper, Ted Thorbeck, Andrew W. Cross, Antonio D. Córcoles, Maika Takita

Abstract: Quantum error correction offers a promising path for performing quantum computations with low errors. Although a fully fault-tolerant execution of a quantum algorithm remains unrealized, recent experimental developments, along with improvements in control electronics, are enabling increasingly advanced demonstrations of the necessary operations for applying quantum error correction. Here, we perform quantum error correction on superconducting qubits connected in a heavy-hexagon lattice. The full processor can encode a logical qubit with distance three and perform several rounds of fault-tolerant syndrome measurements that allow the correction of any single fault in the circuitry. Furthermore, by using dynamic circuits and classical computation as part of our syndrome extraction protocols, we can exploit real-time feedback to reduce the impact of energy relaxation error in the syndrome and flag qubits. We show that the logical error varies depending on the use of a perfect matching decoder compared to a maximum likelihood decoder. We observe a logical error per syndrome measurement round as low as $\sim0.04$ for the matching decoder and as low as $\sim0.03$ for the maximum likelihood decoder. Our results suggest that more significant improvements to decoders are likely on the horizon as quantum hardware has reached a new stage of development towards fully fault-tolerant operations.

Submitted 19 April, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

Comments: 15 pages, 6 figures, 5 tables

Journal ref: Nat Commun 14, 2852 (2023)

arXiv:2201.12901 [pdf, other]

Training and Evaluating a Jupyter Notebook Data Science Assistant

Authors: Shubham Chandel, Colin B. Clement, Guillermo Serrato, Neel Sundaresan

Abstract: We study the feasibility of a Data Science assistant powered by a sequence-to-sequence transformer by training a new model JuPyT5 on all publicly available Jupyter Notebook GitHub repositories and developing a new metric: Data Science Problems (DSP). DSP is a collection of 1119 problems curated from 306 pedagogical notebooks with 92 dataset dependencies, natural language and Markdown problem descriptions, and assert-based unit tests. These notebooks were designed to test university students' mastery of various Python implementations of Math and Data Science, and we now leverage them to study the ability of JuPyT5 to understand and pass the tests. We analyze the content of DSP, validate its quality, and we find that given 100 sampling attempts JuPyT5 is able to solve 77.5% of the DSP problems. We further present various ablation and statistical analyses and compare DSP to other recent natural language to code benchmarks.

Submitted 30 January, 2022; originally announced January 2022.

arXiv:2110.04285 [pdf, other]

doi 10.1103/PhysRevLett.128.110504

Calibrated decoders for experimental quantum error correction

Authors: Edward H. Chen, Theodore J. Yoder, Youngseok Kim, Neereja Sundaresan, Srikanth Srinivasan, Muyuan Li, Antonio D. Córcoles, Andrew W. Cross, Maika Takita

Abstract: Arbitrarily long quantum computations require quantum memories that can be repeatedly measured without being corrupted. Here, we preserve the state of a quantum memory, notably with the additional use of flagged error events. All error events were extracted using fast, mid-circuit measurements and resets of the physical qubits. Among the error decoders we considered, we introduce a perfect matching decoder that was calibrated from measurements containing up to size-4 correlated events. To compare the decoders, we used a partial post-selection scheme shown to retain ten times more data than full post-selection. We observed logical errors per round of $2.2\pm0.1\times10^{-2}$ (decoded without post-selection) and $5.1\pm0.7\times10^{-4}$ (full post-selection), which was less than the physical measurement error of $7\times10^{-3}$ and therefore surpasses a pseudo-threshold for repeated logical measurements.

Submitted 8 October, 2021; originally announced October 2021.

Comments: 16 pages, 14 figures, 5 tables, for peer-review

MSC Class: 81P73 (Primary) 81P73 (Secondary) ACM Class: J.2

Journal ref: Phys. Rev. Lett. 128, 110504 (2022)

arXiv:2109.08780 [pdf, other]

Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy

Authors: Colin B. Clement, Shuai Lu, Xiaoyu Liu, Michele Tufano, Dawn Drain, Nan Duan, Neel Sundaresan, Alexey Svyatkovskiy

Abstract: Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an architecture-independent approach for leveraging the syntactic hierarchies of source code for incorporating entire file-level context into a fixed-length window. Using concrete syntax trees of each source file we extract syntactic hierarchies and integrate them into the context window by selectively removing from view more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in the Python programming language, achieving a new state-of-the-art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, method body completion/code summarization conditioned on file-level context.

Submitted 17 September, 2021; originally announced September 2021.

Comments: EMNLP 2021 camera ready

arXiv:2109.00084 [pdf, other]

doi 10.1145/3540250.3549163

Program Merge Conflict Resolution via Neural Transformers

Authors: Alexey Svyatkovskiy, Sarah Fakhoury, Negar Ghorbani, Todd Mytkowicz, Elizabeth Dinella, Christian Bird, Jinu Jang, Neel Sundaresan, Shuvendu Lahiri

Abstract: Collaborative software development is an integral part of the modern software development life cycle, essential to the success of large-scale software projects. When multiple developers make concurrent changes around the same lines of code, a merge conflict may occur. Such conflicts stall pull requests and continuous integration pipelines for hours to several days, seriously hurting developer productivity. To address this problem, we introduce MergeBERT, a novel neural program merge framework based on token-level three-way differencing and a transformer encoder model. By exploiting the restricted nature of merge conflict resolutions, we reformulate the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data. Our model achieves 63-68% accuracy for merge resolution synthesis, yielding nearly a 3x performance improvement over existing semi-structured, and 2x improvement over neural program merge tools. Finally, we demonstrate that MergeBERT is sufficiently flexible to work with source code files in Java, JavaScript, TypeScript, and C# programming languages. To measure the practical use of MergeBERT, we conduct a user study to evaluate MergeBERT suggestions with 25 developers from large OSS projects on 122 real-world conflicts they encountered. Results suggest that in practice, MergeBERT resolutions would be accepted at a higher rate than estimated by automatic metrics for precision and accuracy. Additionally, we use participant feedback to identify future avenues for improvement of MergeBERT.

Submitted 29 November, 2022; v1 submitted 31 August, 2021; originally announced September 2021.

Comments: ESEC/FSE '22 camera ready version. 12 pages, 4 figures, online appendix

arXiv:2108.12518 [pdf, other]

doi 10.1103/PRXQuantum.2.040326

Scalable mitigation of measurement errors on quantum computers

Authors: Paul D. Nation, Hwajung Kang, Neereja Sundaresan, Jay M. Gambetta

Abstract: We present a method for mitigating measurement errors on quantum computing platforms that does not form the full assignment matrix, or its inverse, and works in a subspace defined by the noisy input bit-strings. This method accommodates both uncorrelated and correlated errors, and allows for computing accurate error bounds. Additionally, we detail a matrix-free preconditioned iterative solution method that converges in $\mathcal{O}(1)$ steps that is performant and uses orders of magnitude less memory than direct factorization. We demonstrate the validity of our method, and mitigate errors in a few seconds on numbers of qubits that would otherwise be intractable.

Submitted 27 August, 2021; originally announced August 2021.

Comments: 9 pages, 8 figures, 1 table

Journal ref: PRX Quantum 2, 040326 (2021)

arXiv:2108.03322 [pdf, other]

Distilling Transformers for Neural Cross-Domain Search

Authors: Colin B. Clement, Chen Wu, Dawn Drain, Neel Sundaresan

Abstract: Pre-trained transformers have recently clinched top spots in the gamut of natural language tasks and pioneered solutions to software engineering tasks. Even information retrieval has not been immune to the charm of the transformer, though their large size and cost is generally a barrier to deployment. While there has been much work in streamlining, caching, and modifying transformer architectures for production, here we explore a new direction: distilling a large pre-trained translation model into a lightweight bi-encoder which can be efficiently cached and queried. We argue from a probabilistic perspective that sequence-to-sequence models are a conceptually ideal---albeit highly impractical---retriever. We derive a new distillation objective, implementing it as a data augmentation scheme. Using natural language source code search as a case study for cross-domain search, we demonstrate the validity of this idea by significantly improving upon the current leader of the CodeSearchNet challenge, a recent natural language code search benchmark.

Submitted 6 August, 2021; originally announced August 2021.

Comments: 4 pages, 1 figure, emnlp formatting

arXiv:2106.00675 [pdf, other]

doi 10.1103/PhysRevLett.129.060501

Quantum crosstalk cancellation for fast entangling gates and improved multi-qubit performance

Authors: K. X. Wei, E. Magesan, I. Lauer, S. Srinivasan, D. F. Bogorin, S. Carnevale, G. A. Keefe, Y. Kim, D. Klaus, W. Landers, N. Sundaresan, C. Wang, E. J. Zhang, M. Steffen, O. E. Dial, D. C. McKay, A. Kandala

Abstract: Quantum computers built with superconducting artificial atoms already stretch the limits of their classical counterparts. While the lowest energy states of these artificial atoms serve as the qubit basis, the higher levels are responsible for both a host of attractive gate schemes as well as generating undesired interactions. In particular, when coupling these atoms to generate entanglement, the higher levels cause shifts in the computational levels that leads to unwanted $ZZ$ quantum crosstalk. Here, we present a novel technique to manipulate the energy levels and mitigate this crosstalk via a simultaneous AC Stark effect on coupled qubits. This breaks a fundamental deadlock between qubit-qubit coupling and crosstalk, leading to a 90ns CNOT with a gate error of (0.19 $\pm$ 0.02) $\%$ and the demonstration of a novel CZ gate with fixed-coupling single-junction transmon qubits. Furthermore, we show a definitive improvement in circuit performance with crosstalk cancellation over seven qubits, demonstrating the scalability of the technique. This work paves the way for superconducting hardware with faster gates and greatly improved multi-qubit circuit fidelities.

Submitted 1 June, 2021; originally announced June 2021.

Comments: 8 pages, 5 figures plus Supplementary Information (8 pages, 7 figures)

Journal ref: Phys. Rev. Lett. 129, 060501 (2022)

arXiv:2105.09352 [pdf, other]

DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons

Authors: Dawn Drain, Colin B. Clement, Guillermo Serrato, Neel Sundaresan

Abstract: The joint task of bug localization and program repair is an integral part of the software development process. In this work we present DeepDebug, an approach to automated debugging using large, pretrained transformers. We begin by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs. We apply these synthetic bugs toward two ends. First, we directly train a backtranslation model on all functions from 200K repositories. Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions in those repositories that are covered by passing tests. This provides us with rich debugging information such as stack traces and print statements, which we use to finetune our model which was pretrained on raw source code. Finally, we strengthen all our models by expanding the context window beyond the buggy function itself, and adding a skeleton consisting of that function's parent class, imports, signatures, docstrings, and method bodies, in order of priority. On the QuixBugs benchmark, we increase the total number of fixes found by over 50%, while also decreasing the false positive rate from 35% to 5% and decreasing the timeout from six hours to one minute. On our own benchmark of executable tests, our model fixes 68% of all bugs on its first attempt without using traces, and after adding traces it fixes 75% on first attempt. We will open-source our framework and validation set for evaluating on executable tests.

Submitted 19 May, 2021; originally announced May 2021.

arXiv:2104.07896 [pdf, other]

doi 10.1145/3460945.3464951

Generating Bug-Fixes Using Pretrained Transformers

Authors: Dawn Drain, Chen Wu, Alexey Svyatkovskiy, Neel Sundaresan

Abstract: Detecting and fixing bugs are two of the most important yet frustrating parts of the software development cycle. Existing bug detection tools are based mainly on static analyzers, which rely on mathematical logic and symbolic reasoning about the program execution to detect common types of bugs. Fixing bugs is typically left out to the developer. In this work we introduce DeepDebug: a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories. We frame bug-patching as a sequence-to-sequence learning task consisting of two steps: (i) denoising pretraining, and (ii) supervised finetuning on the target translation task. We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch, while domain-adaptive pretraining from natural language to code further improves the accuracy by another 32%. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art. In contrast to prior work, we attain our best results when generating raw code, as opposed to working with abstracted code that tends to only benefit smaller capacity models. Finally, we observe a subtle improvement from adding syntax embeddings along with the standard positional embeddings, as well as with adding an auxiliary task to predict each token's syntactic class. Despite focusing on Java, our approach is language agnostic, requiring only a general-purpose parser such as tree-sitter.

Submitted 28 April, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

arXiv:2104.05310 [pdf, other]

Generating Code with the Help of Retrieved Template Functions and Stack Overflow Answers

Authors: Dawn Drain, Changran Hu, Chen Wu, Mikhail Breslav, Neel Sundaresan

Abstract: We approach the important challenge of code autocompletion as an open-domain task, in which a sequence-to-sequence code generator model is enhanced with the ability to attend to reference code snippets supplied by a semantic code search engine. In this work, we present a novel framework to precisely retrieve template functions as well as intent-snippet pairs and effectively train such a retrieval-guided code generator. To demonstrate the effectiveness of our model designs, we perform extensive experiments with CodeSearchNet which contains template functions and CoNaLa which contains Stack Overflow intent-snippet pairs. We also investigate different retrieval models, including Elasticsearch, DPR, and our fusion representation search model, which currently holds the number one spot on the CodeSearchNet leaderboard. We observe improvements by leveraging multiple database elements and further gain from retrieving diverse data points by using Maximal Marginal Relevance. Overall, we see a 4% improvement to cross-entropy loss, a 15% improvement to edit distance, and a 44% improvement to BLEU score when retrieving template functions. We see subtler improvements of 2%, 11%, and 6% respectively when retrieving Stack Overflow intent-snippet pairs. We also create a novel Stack Overflow-Function Alignment dataset, which consists of 150K tuples of functions and Stack Overflow intent-snippet pairs that are of help in writing the associated function, of which 1.7K are manually curated.

Submitted 12 April, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: 8 pages

arXiv:2102.04664 [pdf, other]

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Authors: Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu

Abstract: Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

Submitted 16 March, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

Comments: 14 pages; Revise CodeBLEU scores for all models on text-to-code task

arXiv:2012.08475 [pdf, other]

High-fidelity superconducting quantum processors via laser-annealing of transmon qubits

Authors: Eric J. Zhang, Srikanth Srinivasan, Neereja Sundaresan, Daniela F. Bogorin, Yves Martin, Jared B. Hertzberg, John Timmerwilke, Emily J. Pritchett, Jeng-Bang Yau, Cindy Wang, William Landers, Eric P. Lewandowski, Adinath Narasgond, Sami Rosenblatt, George A. Keefe, Isaac Lauer, Mary Beth Rothwell, Douglas T. McClure, Oliver E. Dial, Jason S. Orcutt, Markus Brink, Jerry M. Chow

Abstract: Scaling the number of qubits while maintaining high-fidelity quantum gates remains a key challenge for quantum computing. Presently, superconducting quantum processors with >50-qubits are actively available. For such systems, fixed-frequency transmons are attractive due to their long coherence and noise immunity. However, scaling fixed-frequency architectures proves challenging due to precise relative frequency requirements. Here we employ laser annealing to selectively tune transmon qubits into desired frequency patterns. Statistics over hundreds of annealed qubits demonstrate an empirical tuning precision of 18.5 MHz, with no measurable impact on qubit coherence. We quantify gate error statistics on a tuned 65-qubit processor, with median two-qubit gate fidelity of 98.7%. Baseline tuning statistics yield a frequency-equivalent resistance precision of 4.7 MHz, sufficient for high-yield scaling beyond 1000-qubit levels. Moving forward, we anticipate selective laser annealing to play a central role in scaling fixed-frequency architectures.

Submitted 15 December, 2020; originally announced December 2020.

Comments: 9 pages, 8 figures, Supplementary Information

arXiv:2010.03150 [pdf, other]

PyMT5: multi-mode translation of natural language and Python code with transformers

Authors: Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan

Abstract: Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation.

Submitted 7 October, 2020; originally announced October 2020.

Comments: 14 pages, 7 figures, 5 tables, EMNLP 2020 camera ready version

arXiv:2009.10297 [pdf, other]

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Authors: Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma

Abstract: Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.

Submitted 27 September, 2020; v1 submitted 21 September, 2020; originally announced September 2020.

Comments: 8 pages, 6 figures

arXiv:2009.08366 [pdf, other]

GraphCodeBERT: Pre-training Code Representations with Data Flow

Authors: Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, Ming Zhou

Abstract: Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

Submitted 13 September, 2021; v1 submitted 17 September, 2020; originally announced September 2020.

Comments: Accepted by ICLR2021

arXiv:2009.05634 [pdf, other]

doi 10.1145/3524481.3527220

Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers

Authors: Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Neel Sundaresan

Abstract: Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. Automated software testing researchers have proposed a variety of techniques to assist developers in this time-consuming task. In this paper we present an approach to support developers in writing unit test cases by generating accurate and useful assert statements. Our approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus. This semantically rich model is then trained in a semi-supervised fashion on a large corpus of source code. Finally, we finetune this model on the task of generating assert statements for unit tests. The resulting model is able to generate accurate assert statements for a given method under test. In our empirical evaluation, the model was able to predict the exact assert statements written by developers in 62% of the cases in the first attempt. The results show 80% relative improvement for top-1 accuracy over the previous RNN-based approach in the literature. We also show the substantial impact of the pretraining process on the performances of our model, as well as comparing it with assert auto-completion task. Finally, we demonstrate how our approach can be used to augment EvoSuite test cases, with additional asserts leading to improved test coverage.

Submitted 11 September, 2020; originally announced September 2020.

arXiv:2009.05617 [pdf, other]

Unit Test Case Generation with Transformers and Focal Context

Authors: Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, Neel Sundaresan

Abstract: Automated unit test case generation tools facilitate test-driven development and support developers by suggesting tests intended to identify flaws in their code. Existing approaches are usually guided by the test coverage criteria, generating synthetic test cases that are often difficult for developers to read or understand. In this paper we propose AthenaTest, an approach that aims to generate unit test cases by learning from real-world focal methods and developer-written testcases. We formulate unit test case generation as a sequence-to-sequence learning task, adopting a two-step training procedure consisting of denoising pretraining on a large unsupervised Java corpus, and supervised finetuning for a downstream translation task of generating unit tests. We investigate the impact of natural language and source code pretraining, as well as the focal context information surrounding the focal method. Both techniques provide improvements in terms of validation loss, with pretraining yielding 25% relative improvement and focal context providing additional 11.1% improvement. We also introduce Methods2Test, the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java, which comprises 780K test cases mined from 91K open-source repositories from GitHub. We evaluate AthenaTest on five defects4j projects, generating 25K passing test cases covering 43.7% of the focal methods with only 30 attempts. We execute the test cases, collect test coverage information, and compare them with test cases generated by EvoSuite and GPT-3, finding that our approach outperforms GPT-3 and has comparable coverage w.r.t. EvoSuite. Finally, we survey professional developers on their preference in terms of readability, understandability, and testing effectiveness of the generated tests, showing overwhelmingly preference towards AthenaTest.

Submitted 20 May, 2021; v1 submitted 11 September, 2020; originally announced September 2020.

arXiv:2008.08571 [pdf, other]

doi 10.1088/2058-9565/abe519

Demonstration of quantum volume 64 on a superconducting quantum computing system

Authors: Petar Jurcevic, Ali Javadi-Abhari, Lev S. Bishop, Isaac Lauer, Daniela F. Bogorin, Markus Brink, Lauren Capelluto, Oktay Günlük, Toshinari Itoko, Naoki Kanazawa, Abhinav Kandala, George A. Keefe, Kevin Krsulich, William Landers, Eric P. Lewandowski, Douglas T. McClure, Giacomo Nannicini, Adinath Narasgond, Hasan M. Nayfeh, Emily Pritchett, Mary Beth Rothwell, Srikanth Srinivasan, Neereja Sundaresan, Cindy Wang, Ken X. Wei, et al. (6 additional authors not shown)

Abstract: We improve the quality of quantum circuits on superconducting quantum computing systems, as measured by the quantum volume, with a combination of dynamical decoupling, compiler optimizations, shorter two-qubit gates, and excited state promoted readout. This result shows that the path to larger quantum volume systems requires the simultaneous increase of coherence, control gate fidelities, measurement fidelities, and smarter software which takes into account hardware details, thereby demonstrating the need to continue to co-design the software and hardware stack for the foreseeable future.

Submitted 4 September, 2020; v1 submitted 19 August, 2020; originally announced August 2020.

Comments: Fixed typo in author list. Added references [38], [49] and [52]

Journal ref: Quantum Sci. Technol. 6 025020 (2021)

arXiv:2007.02925 [pdf, other]

doi 10.1103/PRXQuantum.1.020318

Reducing unitary and spectator errors in cross resonance with optimized rotary echoes

Authors: Neereja Sundaresan, Isaac Lauer, Emily Pritchett, Easwar Magesan, Petar Jurcevic, Jay M. Gambetta

Abstract: We present an improvement to the cross resonance gate realized with the addition of resonant, target rotary pulses. These pulses, applied directly to the target qubit, are simultaneous to and in phase with the echoed cross resonance pulses. Using specialized Hamiltonian error amplifying tomography, we confirm a reduction of error terms with target rotary -- directly translating to improved two-qubit gate fidelity. Beyond improvement in the control-target subspace, the target rotary reduces entanglement between target and target spectators caused by residual quantum interactions. We further characterize multi-qubit performance improvement enabled by target rotary pulsing using unitarity benchmarking and quantum volume measurements, achieving a new record quantum volume for a superconducting qubit system.

Submitted 6 July, 2020; originally announced July 2020.

Journal ref: PRX Quantum 1, 020318 (2020)

arXiv:2005.08025 [pdf, other]

IntelliCode Compose: Code Generation Using Transformer

Authors: Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, Neel Sundaresan

Abstract: In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments. In this paper, we introduce IntelliCode Compose $-$ a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, $C\#$, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. Our best model yields an average edit similarity of $86.7\%$ and a perplexity of 1.82 for Python programming language.

Submitted 29 October, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

Comments: Accepted for publication at ESEC/FSE conference

arXiv:1912.00742 [pdf, other]

doi 10.1145/3292500.3330699

Pythia: AI-assisted Code Completion System

Authors: Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, Neel Sundaresan

Abstract: In this paper, we propose a novel end-to-end approach for AI-assisted code completion called Pythia. It generates ranked lists of method and API recommendations which can be used by software developers at edit time. The system is currently deployed as part of Intellicode extension in Visual Studio Code IDE. Pythia exploits state-of-the-art large-scale deep learning models trained on code contexts extracted from abstract syntax trees. It is designed to work at a high throughput predicting the best matching code completions on the order of 100 $ms$. We describe the architecture of the system, perform comparisons to frequency-based approach and invocation-based Markov Chain language model, and discuss challenges serving Pythia models on lightweight client devices. The offline evaluation results obtained on 2700 Python open source software GitHub repositories show a top-5 accuracy of 92\%, surpassing the baseline models by 20\% averaged over classes, for both intra and cross-project settings.

Submitted 28 November, 2019; originally announced December 2019.

Comments: Published in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19)

arXiv:1905.05720 [pdf, other]

doi 10.1103/PhysRevA.101.032343

Verifying Multipartite Entangled GHZ States via Multiple Quantum Coherences

Authors: Ken X. Wei, Isaac Lauer, Srikanth Srinivasan, Neereja Sundaresan, Douglas T. McClure, David Toyli, David C. McKay, Jay M. Gambetta, Sarah Sheldon

Abstract: The ability to generate and verify multipartite entanglement is an important benchmark for near-term quantum devices. We develop a scalable entanglement metric based on multiple quantum coherences, and demonstrate experimentally on a 20-qubit superconducting device - the IBM Q System One. We report a state fidelity of 0.5165$\pm$0.0036 for an 18-qubit GHZ state, indicating multipartite entanglement across all 18 qubits. Our entanglement metric is robust to noise and only requires measuring the population in the ground state; it can be readily applied to other quantum devices to verify multipartite entanglement.

Submitted 14 May, 2019; originally announced May 2019.

Comments: 7+4 pages, comments welcome

Journal ref: Phys. Rev. A 101, 032343 (2020)

arXiv:1801.10167 [pdf, other]

doi 10.1103/PhysRevX.9.011021

Interacting Qubit-Photon Bound States with Superconducting Circuits

Authors: Neereja M. Sundaresan, Rex Lundgren, Guanyu Zhu, Alexey V. Gorshkov, Andrew A. Houck

Abstract: Qubits strongly coupled to a photonic crystal give rise to many exotic physical scenarios, beginning with single and multi-excitation qubit-photon dressed bound states comprising induced spatially localized photonic modes, centered around the qubits, and the qubits themselves. The localization of these states changes with qubit detuning from the band-edge, offering an avenue of in situ control of bound state interaction. Here, we present experimental results from a device with two qubits coupled to a superconducting microwave photonic crystal and realize tunable on-site and inter-bound state interactions. We observe a fourth-order two photon virtual process between bound states indicating strong coupling between the photonic crystal and qubits. Due to their localization-dependent interaction, these states offer the ability to create one-dimensional chains of bound states with tunable and potentially long-range interactions that preserve the qubits' spatial organization, a key criterion for realization of certain quantum many-body models. The widely tunable, strong and robust interactions demonstrated with this system are promising benchmarks towards realizing larger, more complex systems of bound states.

Submitted 30 January, 2018; originally announced January 2018.

Journal ref: Phys. Rev. X 9, 011021 (2019)

arXiv:1607.06895 [pdf, other]

doi 10.1103/PhysRevX.7.011016

Observation of a dissipative phase transition in a one-dimensional circuit QED lattice

Authors: Mattias Fitzpatrick, Neereja M. Sundaresan, Andy C. Y. Li, Jens Koch, A. A. Houck

Abstract: Condensed matter physics has been driven forward by significant experimental and theoretical progress in the study and understanding of equilibrium phase transitions based on symmetry and topology. However, nonequilibrium phase transitions have remained a challenge, in part due to their complexity in theoretical descriptions and the additional experimental difficulties in systematically controlling systems out of equilibrium. Here, we study a one-dimensional chain of 72 microwave cavities, each coupled to a superconducting qubit, and coherently drive the system into a nonequilibrium steady state. We find experimental evidence for a dissipative phase transition in the system in which the steady state changes dramatically as the mean photon number is increased. Near the boundary between the two observed phases, the system demonstrates bistability, with characteristic switching times as long as 60 ms -- far longer than any of the intrinsic rates known for the system. This experiment demonstrates the power of circuit QED systems for studying nonequilibrium condensed matter physics and paves the way for future experiments exploring nonequilibrium physics with many-body quantum optics.

Submitted 23 July, 2016; originally announced July 2016.

Journal ref: Phys. Rev. X 7, 011016 (2017)

Showing 1–50 of 54 results for author: Sundaresan, N