Skip to main content

Showing 1–9 of 9 results for author: Kraska, T

Searching in archive stat. Search in all archives.
.
  1. arXiv:2006.05265  [pdf, other

    cs.LG cs.SE stat.ML

    MISIM: A Neural Code Semantics Similarity System Using the Context-Aware Semantics Structure

    Authors: Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Niranjan Hasabnis, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich

    Abstract: Code semantics similarity can be used for many tasks such as code recommendation, automated software defect correction, and clone detection. Yet, the accuracy of such systems has not yet reached a level of general purpose reliability. To help address this, we present Machine Inferred Code Similarity (MISIM), a neural code semantics similarity system consisting of two core components: (i)MISIM uses… ▽ More

    Submitted 2 June, 2021; v1 submitted 5 June, 2020; originally announced June 2020.

    Comments: arXiv admin note: text overlap with arXiv:2003.11118

  2. arXiv:2003.09758  [pdf, other

    cs.LG cs.DB stat.ML

    ARDA: Automatic Relational Data Augmentation for Machine Learning

    Authors: Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

    Abstract: Automatic machine learning (\AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmen… ▽ More

    Submitted 21 March, 2020; originally announced March 2020.

  3. arXiv:1905.10688  [pdf, other

    cs.LG cs.DB cs.IR stat.ML

    Sherlock: A Deep Learning Approach to Semantic Data Type Detection

    Authors: Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo

    Abstract: Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number o… ▽ More

    Submitted 25 May, 2019; originally announced May 2019.

    Comments: KDD'19

  4. arXiv:1904.03257  [pdf, ps, other

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood , et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne… ▽ More

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  5. arXiv:1901.10875  [pdf, other

    cs.CR stat.OT

    STAR: Statistical Tests with Auditable Results

    Authors: Sacha Servan-Schreiber, Olga Ohrimenko, Tim Kraska, Emanuel Zgraggen

    Abstract: We present STAR: a novel system aimed at solving the complex issue of "p-hacking" and false discoveries in scientific studies. STAR provides a concrete way for ensuring the application of false discovery control procedures in hypothesis testing, using mathematically provable guarantees, with the goal of reducing the risk of data dredging. STAR generates an efficiently auditable certificate which a… ▽ More

    Submitted 23 October, 2019; v1 submitted 19 January, 2019; originally announced January 2019.

  6. arXiv:1808.08294  [pdf, other

    cs.LG stat.ML

    Unknown Examples & Machine Learning Model Generalization

    Authors: Yeounoh Chung, Peter J. Haas, Eli Upfal, Tim Kraska

    Abstract: Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, m… ▽ More

    Submitted 11 October, 2019; v1 submitted 24 August, 2018; originally announced August 2018.

  7. arXiv:1806.03723  [pdf, other

    stat.ML cs.LG

    Smallify: Learning Network Size while Training

    Authors: Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, Samuel Madden

    Abstract: As neural networks become widely deployed in different applications and on different hardware, it has become increasingly important to optimize inference time and model size along with model accuracy. Most current techniques optimize model size, model accuracy and inference time in different stages, resulting in suboptimal results and computational inefficiency. In this work, we propose a new tech… ▽ More

    Submitted 10 June, 2018; originally announced June 2018.

    Comments: 11 pages, 3 figures

  8. arXiv:1612.01040  [pdf, other

    cs.DB stat.ME

    Controlling False Discoveries During Interactive Data Exploration

    Authors: Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska

    Abstract: Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks thus incurring in the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypot… ▽ More

    Submitted 3 December, 2016; originally announced December 2016.

  9. arXiv:1310.5426  [pdf, other

    cs.LG cs.DC stat.ML

    MLI: An API for Distributed Machine Learning

    Authors: Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph Gonzalez, Michael J. Franklin, Michael I. Jordan, Tim Kraska

    Abstract: MLI is an Application Programming Interface designed to address the challenges of building Machine Learn- ing algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implement… ▽ More

    Submitted 25 October, 2013; v1 submitted 21 October, 2013; originally announced October 2013.