-
Testing Juntas Optimally with Samples
Authors:
Lorenzo Beretta,
Nathaniel Harms,
Caleb Koch
Abstract:
We prove tight upper and lower bounds of $\Theta\left(\tfrac{1}{\varepsilon}\left( \sqrt{2^k \log\binom{n}{k} } + \log\binom{n}{k} \right)\right)$ on the number of samples required for distribution-free $k$-junta testing. This is the first tight bound for testing a natural class of Boolean functions in the distribution-free sample-based model. Our bounds also hold for the feature selection problem, showing that a junta tester must learn the set of relevant variables. For tolerant junta testing, we prove a sample lower bound of $\Omega(2^{(1-o(1)) k} + \log\binom{n}{k})$ showing that, unlike standard testing, there is no large gap between tolerant testing and learning.
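To get a feel for how the two terms of the bound above trade off, the following minimal sketch evaluates the expression inside the $\Theta(\cdot)$. It is illustrative only: the function name `junta_sample_bound` is hypothetical, the hidden constant is assumed to be 1, and the natural logarithm is assumed (a different base only changes constants).

```python
import math


def junta_sample_bound(n: int, k: int, eps: float) -> float:
    """Evaluate (1/eps) * (sqrt(2^k * log C(n,k)) + log C(n,k)).

    Assumptions (not from the abstract): hidden constant 1, natural log.
    """
    if not (0 <= k <= n):
        raise ValueError("need 0 <= k <= n")
    if not (0 < eps <= 1):
        raise ValueError("need 0 < eps <= 1")
    # log of the number of candidate sets of k relevant variables
    log_binom = math.log(math.comb(n, k))
    return (math.sqrt(2**k * log_binom) + log_binom) / eps


if __name__ == "__main__":
    # Compare the growth in k for a fixed n and eps.
    for k in (2, 5, 10):
        print(k, junta_sample_bound(n=1000, k=k, eps=0.1))
```

For small $k$ the $\log\binom{n}{k}$ term dominates, while for larger $k$ the $\sqrt{2^k \log\binom{n}{k}}$ term takes over.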
Submitted 7 May, 2025;
originally announced May 2025.
-
Hybrid Reinforcement Learning and Model Predictive Control for Adaptive Control of Hydrogen-Diesel Dual-Fuel Combustion
Authors:
Julian Bedei,
Murray McBain,
Alexander Winkler,
Charles Robert Koch,
Jakob Andert,
David Gordon
Abstract:
Reinforcement Learning (RL) and Machine Learning Integrated Model Predictive Control (ML-MPC) are promising approaches for optimizing hydrogen-diesel dual-fuel engine control, as they can effectively control multiple-input multiple-output systems and nonlinear processes. ML-MPC is advantageous for providing safe and optimal controls, ensuring the engine operates within predefined safety limits. In contrast, RL is distinguished by its adaptability to changing conditions through its learning-based approach. However, the practical implementation of either method alone poses challenges. RL requires high variance in control inputs during early learning phases, which can pose risks to the system by potentially executing unsafe actions, leading to mechanical damage. Conversely, ML-MPC relies on an accurate system model to generate optimal control inputs and has limited adaptability to system drifts, such as injector aging, which naturally occur in engine applications. To address these limitations, this study proposes a hybrid RL and ML-MPC approach that uses an ML-MPC framework while incorporating an RL agent to dynamically adjust the ML-MPC load tracking reference in response to changes in the environment. At the same time, the ML-MPC ensures that actions stay safe throughout the RL agent's exploration. To evaluate the effectiveness of this approach, fuel pressure is deliberately varied to introduce a model-plant mismatch between the ML-MPC and the engine test bench. The result of this mismatch is a root mean square error (RMSE) in indicated mean effective pressure of 0.57 bar when running the ML-MPC. The experimental results demonstrate that RL successfully adapts to changing boundary conditions by altering the tracking reference while ML-MPC ensures safe control inputs. The quantitative improvement in load tracking by implementing RL is an RMSE of 0.44 bar.
Submitted 6 May, 2025; v1 submitted 23 April, 2025;
originally announced April 2025.
-
The Cambridge Report on Database Research
Authors:
Anastasia Ailamaki,
Samuel Madden,
Daniel Abadi,
Gustavo Alonso,
Sihem Amer-Yahia,
Magdalena Balazinska,
Philip A. Bernstein,
Peter Boncz,
Michael Cafarella,
Surajit Chaudhuri,
Susan Davidson,
David DeWitt,
Yanlei Diao,
Xin Luna Dong,
Michael Franklin,
Juliana Freire,
Johannes Gehrke,
Alon Halevy,
Joseph M. Hellerstein,
Mark D. Hill,
Stratos Idreos,
Yannis Ioannidis,
Christoph Koch,
Donald Kossmann,
Tim Kraska,
et al. (21 additional authors not shown)
Abstract:
On October 19 and 20, 2023, the authors of this report convened in Cambridge, MA, to discuss the state of the database research field, its recent accomplishments and ongoing challenges, and future directions for research and community engagement. This gathering continues a long standing tradition in the database community, dating back to the late 1980s, in which researchers meet roughly every five years to produce a forward looking report.
This report summarizes the key takeaways from our discussions. We begin with a retrospective on the academic, open source, and commercial successes of the community over the past five years. We then turn to future opportunities, with a focus on core data systems, particularly in the context of cloud computing and emerging hardware, as well as on the growing impact of data science, data governance, and generative AI.
This document is not intended as an exhaustive survey of all technical challenges or industry innovations in the field. Rather, it reflects the perspectives of senior community members on the most pressing challenges and promising opportunities ahead.
Submitted 15 April, 2025;
originally announced April 2025.
-
Using Process Calculus for Optimizing Data and Computation Sharing in Complex Stateful Parallel Computations
Authors:
Zilu Tian,
Dan Olteanu,
Christoph Koch
Abstract:
We propose novel techniques that exploit data and computation sharing to improve the performance of complex stateful parallel computations, like agent-based simulations. Parallel computations are translated into behavioral equations, a novel formalism layered on top of the foundational process calculus $\pi$-calculus. Behavioral equations blend code and data, allowing a system to easily compose and transform parallel programs into specialized programs. We show how optimizations like merging programs, synthesizing efficient message data structures, eliminating local messaging, rewriting communication instructions into local computations, and aggregation pushdown can be expressed as transformations of behavioral equations. We have also built a system called OptiFusion that implements behavioral equations and the aforementioned optimizations. Our experiments showed that OptiFusion is over 10$\times$ faster than state-of-the-art stateful systems benchmarked via complex stateful workloads. Generating specialized instructions that are impractical to write by hand allows OptiFusion to outperform even the hand-optimized implementations by up to 2$\times$.
Submitted 13 April, 2025;
originally announced April 2025.
-
OpenAI o1 System Card
Authors:
OpenAI,
Aaron Jaech,
Adam Kalai,
Adam Lerer,
Adam Richardson,
Ahmed El-Kishky,
Aiden Low,
Alec Helyar,
Aleksander Madry,
Alex Beutel,
Alex Carney,
Alex Iftimie,
Alex Karpenko,
Alex Tachard Passos,
Alexander Neitz,
Alexander Prokofiev,
Alexander Wei,
Allison Tam,
Ally Bennett,
Ananya Kumar,
Andre Saraiva,
Andrea Vallone,
Andrew Duberstein,
Andrew Kondrich,
et al. (238 additional authors not shown)
Abstract:
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
Submitted 21 December, 2024;
originally announced December 2024.
-
Dissociating Artificial Intelligence from Artificial Consciousness
Authors:
Graham Findlay,
William Marshall,
Larissa Albantakis,
Isaac David,
William GP Mayner,
Christof Koch,
Giulio Tononi
Abstract:
Developments in machine learning and computing power suggest that artificial general intelligence is within reach. This raises the question of artificial consciousness: if a computer were to be functionally equivalent to a human, being able to do all we do, would it experience sights, sounds, and thoughts, as we do when we are conscious? Answering this question in a principled manner can only be done on the basis of a theory of consciousness that is grounded in phenomenology and that states the necessary and sufficient conditions for any system, evolved or engineered, to support subjective experience. Here we employ Integrated Information Theory (IIT), which provides principled tools to determine whether a system is conscious, to what degree, and the content of its experience. We consider pairs of systems constituted of simple Boolean units, one of which -- a basic stored-program computer -- simulates the other with full functional equivalence. By applying the principles of IIT, we demonstrate that (i) two systems can be functionally equivalent without being phenomenally equivalent, and (ii) that this conclusion is not dependent on the simulated system's function. We further demonstrate that, according to IIT, it is possible for a digital computer to simulate our behavior, possibly even by simulating the neurons in our brain, without replicating our experience. This contrasts sharply with computational functionalism, the thesis that performing computations of the right kind is necessary and sufficient for consciousness.
Submitted 3 March, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
GPT-4o System Card
Authors:
OpenAI,
Aaron Hurst,
Adam Lerer,
Adam P. Goucher,
Adam Perelman,
Aditya Ramesh,
Aidan Clark,
AJ Ostrow,
Akila Welihinda,
Alan Hayes,
Alec Radford,
Aleksander Mądry,
Alex Baker-Whitcomb,
Alex Beutel,
Alex Borzunov,
Alex Carney,
Alex Chow,
Alex Kirillov,
Alex Nichol,
Alex Paino,
Alex Renzin,
Alex Tachard Passos,
Alexander Kirillov,
Alexi Christakis,
et al. (395 additional authors not shown)
Abstract:
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
Submitted 25 October, 2024;
originally announced October 2024.
-
Fast decision tree learning solves hard coding-theoretic problems
Authors:
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
We connect the problem of properly PAC learning decision trees to the parameterized Nearest Codeword Problem ($k$-NCP). Despite significant effort by the respective communities, algorithmic progress on both problems has been stuck: the fastest known algorithm for the former runs in quasipolynomial time (Ehrenfeucht and Haussler 1989) and the best known approximation ratio for the latter is $O(n/\log n)$ (Berman and Karpinsky 2002; Alon, Panigrahy, and Yekhanin 2009). Research on both problems has thus far proceeded independently with no known connections.
We show that $\textit{any}$ improvement of Ehrenfeucht and Haussler's algorithm will yield $O(\log n)$-approximation algorithms for $k$-NCP, an exponential improvement of the current state of the art. This can be interpreted either as a new avenue for designing algorithms for $k$-NCP, or as one for establishing the optimality of Ehrenfeucht and Haussler's algorithm. Furthermore, our reduction along with existing inapproximability results for $k$-NCP already rule out polynomial-time algorithms for properly learning decision trees. A notable aspect of our hardness results is that they hold even in the setting of $\textit{weak}$ learning whereas prior ones were limited to the setting of strong learning.
Submitted 25 September, 2024; v1 submitted 19 September, 2024;
originally announced September 2024.
-
The Sample Complexity of Smooth Boosting and the Tightness of the Hardcore Theorem
Authors:
Guy Blanc,
Alexandre Hayderi,
Caleb Koch,
Li-Yang Tan
Abstract:
Smooth boosters generate distributions that do not place too much weight on any given example. Originally introduced for their noise-tolerant properties, such boosters have also found applications in differential privacy, reproducibility, and quantum learning theory. We study and settle the sample complexity of smooth boosting: we exhibit a class that can be weak learned to $\gamma$-advantage over smooth distributions with $m$ samples, for which strong learning over the uniform distribution requires $\tilde{\Omega}(1/\gamma^2)\cdot m$ samples. This matches the overhead of existing smooth boosters and provides the first separation from the setting of distribution-independent boosting, for which the corresponding overhead is $O(1/\gamma)$.
Our work also sheds new light on Impagliazzo's hardcore theorem from complexity theory, all known proofs of which can be cast in the framework of smooth boosting. For a function $f$ that is mildly hard against size-$s$ circuits, the hardcore theorem provides a set of inputs on which $f$ is extremely hard against size-$s'$ circuits. A downside of this important result is the loss in circuit size, i.e. that $s' \ll s$. Answering a question of Trevisan, we show that this size loss is necessary and in fact, the parameters achieved by known proofs are the best possible.
Submitted 17 September, 2024;
originally announced September 2024.
-
From Text to Insight: Large Language Models for Materials Science Data Extraction
Authors:
Mara Schilling-Wilhelmi,
Martiño Ríos-García,
Sherjeel Shabih,
María Victoria Gil,
Santiago Miret,
Christoph T. Koch,
José A. Márquez,
Kevin Maik Jablonka
Abstract:
The vast majority of materials science knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling efficient extraction of structured, actionable data from unstructured text by non-experts. While applying LLMs to materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This review provides a comprehensive overview of LLM-based structured data extraction in materials science, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and materials science expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven materials research. The insights presented here could significantly enhance how researchers across disciplines access and utilize scientific information, potentially accelerating the development of novel materials for critical societal needs.
△ Less
Submitted 2 December, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Superconstant Inapproximability of Decision Tree Learning
Authors:
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
We consider the task of properly PAC learning decision trees with queries. Recent work of Koch, Strassle, and Tan showed that the strictest version of this task, where the hypothesis tree $T$ is required to be optimally small, is NP-hard. Their work leaves open the question of whether the task remains intractable if $T$ is only required to be close to optimal, say within a factor of 2, rather than exactly optimal.
We answer this affirmatively and show that the task indeed remains NP-hard even if $T$ is allowed to be within any constant factor of optimal. More generally, our result allows for a smooth tradeoff between the hardness assumption and the inapproximability factor. As Koch et al.'s techniques do not appear to be amenable to such a strengthening, we first recover their result with a new and simpler proof, which we couple with a new XOR lemma for decision trees. While there is a large body of work on XOR lemmas for decision trees, our setting necessitates parameters that are extremely sharp, and are not known to be attainable by existing XOR lemmas. Our work also carries new implications for the related problem of Decision Tree Minimization.
Submitted 1 July, 2024;
originally announced July 2024.
-
A Strong Direct Sum Theorem for Distributional Query Complexity
Authors:
Guy Blanc,
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
Consider the expected query complexity of computing the $k$-fold direct product $f^{\otimes k}$ of a function $f$ to error $\varepsilon$ with respect to a distribution $\mu^k$. One strategy is to sequentially compute each of the $k$ copies to error $\varepsilon/k$ with respect to $\mu$ and apply the union bound. We prove a strong direct sum theorem showing that this naive strategy is essentially optimal. In particular, computing a direct product necessitates a blowup in both query complexity and error.
Strong direct sum theorems contrast with results that only show a blowup in query complexity or error but not both. There has been a long line of such results for distributional query complexity, dating back to (Impagliazzo, Raz, Wigderson 1994) and (Nisan, Rudich, Saks 1994), but a strong direct sum theorem had been elusive.
A key idea in our work is the first use of the Hardcore Theorem (Impagliazzo 1995) in the context of query complexity. We prove a new "resilience lemma" that accompanies it, showing that the hardcore of $f^{\otimes k}$ is likely to remain dense under arbitrary partitions of the input space.
Submitted 25 May, 2024;
originally announced May 2024.
-
Transfer of Reinforcement Learning-Based Controllers from Model- to Hardware-in-the-Loop
Authors:
Mario Picerno,
Lucas Koch,
Kevin Badalian,
Marius Wegener,
Joschka Schaub,
Charles Robert Koch,
Jakob Andert
Abstract:
The process of developing control functions for embedded systems is resource-, time-, and data-intensive, often resulting in sub-optimal costs and solution approaches. Reinforcement Learning (RL) has great potential for autonomously training agents to perform complex control tasks with minimal human intervention. Due to costly data generation and safety constraints, however, its application is mostly limited to purely simulated domains. To use RL effectively in embedded system function development, the generated agents must be able to handle real-world applications. In this context, this work focuses on accelerating the training process of RL agents by combining Transfer Learning (TL) and X-in-the-Loop (XiL) simulation. For the use case of transient exhaust gas re-circulation control for an internal combustion engine, a computationally cheap Model-in-the-Loop (MiL) simulation is used to select a suitable algorithm, fine-tune hyperparameters, and finally train candidate agents for the transfer. These pre-trained RL agents are then fine-tuned in a Hardware-in-the-Loop (HiL) system via TL. The transfer revealed the need for adjusting the reward parameters when advancing to real hardware. Further, the comparison between a purely HiL-trained and a transferred agent showed a reduction of training time by a factor of 5.9. The results emphasize the necessity to train RL agents with real hardware, and demonstrate that the maturity of the transferred policies affects both training time and performance, highlighting the strong synergies between TL and XiL simulation.
△ Less
Submitted 25 October, 2023;
originally announced October 2023.
-
Introducing a Deep Neural Network-based Model Predictive Control Framework for Rapid Controller Implementation
Authors:
David C. Gordon,
Alexander Winkler,
Julian Bedei,
Patrick Schaber,
Jakob Andert,
Charles R. Koch
Abstract:
Model Predictive Control (MPC) provides an optimal control solution based on a cost function while allowing for the implementation of process constraints. As a model-based optimal control technique, the performance of MPC strongly depends on the model used, where a trade-off between model computation time and prediction performance exists. One solution is the integration of MPC with a machine learning (ML) based process model, which is quick to evaluate online. This work presents the experimental implementation of a deep neural network (DNN) based nonlinear MPC for Homogeneous Charge Compression Ignition (HCCI) combustion control. The DNN model consists of a Long Short-Term Memory (LSTM) network surrounded by fully connected layers. It was trained using experimental engine data and showed acceptable prediction performance with under 5% error for all outputs. Using this model, the MPC is designed to track the Indicated Mean Effective Pressure (IMEP) and combustion phasing trajectories, while minimizing several parameters. Using the acados software package to enable the real-time implementation of the MPC on an ARM Cortex A72, the optimization calculations are completed within 1.4 ms. The external A72 processor is integrated with the prototyping engine controller using a UDP connection allowing for rapid experimental deployment of the NMPC. The IMEP trajectory following of the developed controller was excellent, with a root-mean-square error of 0.133 bar, in addition to observing process constraints.
△ Less
Submitted 12 October, 2023;
originally announced October 2023.
-
Properly Learning Decision Trees with Queries Is NP-Hard
Authors:
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
We prove that it is NP-hard to properly PAC learn decision trees with queries, resolving a longstanding open problem in learning theory (Bshouty 1993; Guijarro-Lavin-Raghavan 1999; Mehta-Raghavan 2002; Feldman 2016). While there has been a long line of work, dating back to (Pitt-Valiant 1988), establishing the hardness of properly learning decision trees from random examples, the more challenging setting of query learners necessitates different techniques and there were no previous lower bounds. En route to our main result, we simplify and strengthen the best known lower bounds for a different problem of Decision Tree Minimization (Zantema-Bodlaender 2000; Sieling 2003).
On a technical level, we introduce the notion of hardness distillation, which we study for decision tree complexity but can be considered for any complexity measure: for a function that requires large decision trees, we give a general method for identifying a small set of inputs that is responsible for its complexity. Our technique even rules out query learners that are allowed constant error. This contrasts with existing lower bounds for the setting of random examples which only hold for inverse-polynomial error.
Our result, taken together with a recent almost-polynomial time query algorithm for properly learning decision trees under the uniform distribution (Blanc-Lange-Qiao-Tan 2022), demonstrates the dramatic impact of distributional assumptions on the problem.
Submitted 9 July, 2023;
originally announced July 2023.
-
A Strong Composition Theorem for Junta Complexity and the Boosting of Property Testers
Authors:
Guy Blanc,
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
We prove a strong composition theorem for junta complexity and show how such theorems can be used to generically boost the performance of property testers.
The $\varepsilon$-approximate junta complexity of a function $f$ is the smallest integer $r$ such that $f$ is $\varepsilon$-close to a function that depends only on $r$ variables. A strong composition theorem states that if $f$ has large $\varepsilon$-approximate junta complexity, then $g \circ f$ has even larger $\varepsilon'$-approximate junta complexity, even for $\varepsilon' \gg \varepsilon$. We develop a fairly complete understanding of this behavior, proving that the junta complexity of $g \circ f$ is characterized by that of $f$ along with the multivariate noise sensitivity of $g$. For the important case of symmetric functions $g$, we relate their multivariate noise sensitivity to the simpler and well-studied case of univariate noise sensitivity.
We then show how strong composition theorems yield boosting algorithms for property testers: with a strong composition theorem for any class of functions, a large-distance tester for that class is immediately upgraded into one for small distances. Combining our contributions yields a booster for junta testers, and with it new implications for junta testing. This is the first boosting-type result in property testing, and we hope that the connection to composition theorems adds compelling motivation to the study of both topics.
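The definition of $\varepsilon$-approximate junta complexity given above can be made concrete with a brute-force sketch for tiny $n$. This is not the paper's method. It assumes closeness is measured under the uniform distribution on $\{0,1\}^n$, and the names `distance_to_junta` and `approx_junta_complexity` are hypothetical. For a fixed variable set, the closest junta takes the majority value of $f$ on each restriction, so the distance is the total minority mass.

```python
from itertools import combinations, product
from typing import Callable, Sequence, Tuple

BoolFn = Callable[[Tuple[int, ...]], int]  # maps a 0/1 tuple to 0 or 1


def distance_to_junta(f: BoolFn, n: int, subset: Sequence[int]) -> float:
    """Uniform-distribution distance from f to the nearest function
    that depends only on the variables indexed by `subset`."""
    counts = {}  # assignment to subset -> (number of ones, number of inputs)
    for x in product((0, 1), repeat=n):
        key = tuple(x[i] for i in subset)
        ones, total = counts.get(key, (0, 0))
        counts[key] = (ones + f(x), total + 1)
    # The best junta outputs the majority value on each restriction.
    disagreements = sum(min(ones, total - ones) for ones, total in counts.values())
    return disagreements / 2**n


def approx_junta_complexity(f: BoolFn, n: int, eps: float) -> int:
    """Smallest r such that f is eps-close to some r-junta (exhaustive search)."""
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if distance_to_junta(f, n, subset) <= eps:
                return r
    return n  # unreachable: the full variable set always gives distance 0


if __name__ == "__main__":
    and2 = lambda x: x[0] & x[1]  # depends on two of three variables
    print(approx_junta_complexity(and2, 3, 0.0))
    print(approx_junta_complexity(and2, 3, 0.25))
```

The search enumerates all $2^n$ inputs for every candidate subset, so it is only meant for checking small examples by hand.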
Submitted 8 July, 2023;
originally announced July 2023.
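To make the definition of $\varepsilon$-approximate junta complexity in the abstract above concrete, here is a minimal brute-force sketch for very small $n$. It is an illustration only, not the paper's method: it assumes distance is measured under the uniform distribution on $\{0,1\}^n$, and the names `distance_to_junta_on` and `approx_junta_complexity` are hypothetical.

```python
from itertools import combinations, product


def distance_to_junta_on(f, n, subset):
    """Uniform-distribution distance from f to the closest function that
    depends only on the coordinates in `subset` (assumption: uniform inputs)."""
    counts = {}  # restriction of x to subset -> (number of 1-outputs, total)
    for x in product((0, 1), repeat=n):
        key = tuple(x[i] for i in subset)
        ones, total = counts.get(key, (0, 0))
        counts[key] = (ones + int(f(x)), total + 1)
    # The best junta on `subset` outputs the majority value in each bucket,
    # so it errs on the minority of each bucket.
    errors = sum(min(ones, total - ones) for ones, total in counts.values())
    return errors / 2 ** n


def approx_junta_complexity(f, n, eps):
    """Smallest r such that f is eps-close to a function of r variables."""
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if distance_to_junta_on(f, n, subset) <= eps:
                return r
    return n


def majority3(x):
    return int(sum(x) >= 2)
```

For instance, `majority3` depends on all three variables exactly, but it agrees with any single input bit on 3/4 of inputs, so its approximate junta complexity drops to 1 once $\varepsilon \ge 1/4$. The search is exponential in $n$ and is meant only to illustrate the definition.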
-
Detecting Robustness against MVRC for Transaction Programs with Predicate Reads
Authors:
Brecht Vandevoort,
Bas Ketsman,
Christoph Koch,
Frank Neven
Abstract:
The transactional robustness problem revolves around deciding whether, for a given workload, a lower isolation level than Serializable is sufficient to guarantee serializability. The paper presents a new characterization for robustness against isolation level (multi-version) Read Committed. It supports transaction programs with control structures (loops and conditionals) and inserts, deletes, and predicate reads -- scenarios that trigger the phantom problem, which is known to be hard to analyze in this context. The characterization is graph-theoretic and not unlike previous decision mechanisms known from the concurrency control literature that database researchers and practitioners are comfortable with. We show experimentally that our characterization pushes the frontier by recognizing more workloads, and more complex workloads, as robust than before.
Submitted 17 February, 2023;
originally announced February 2023.
-
Certification with an NP Oracle
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Carmen Strassle,
Li-Yang Tan
Abstract:
In the certification problem, the algorithm is given a function $f$ with certificate complexity $k$ and an input $x^\star$, and the goal is to find a certificate of size $\le \text{poly}(k)$ for $f$'s value at $x^\star$. This problem is in $\mathsf{NP}^{\mathsf{NP}}$, and assuming $\mathsf{P} \ne \mathsf{NP}$, is not in $\mathsf{P}$. Prior works, dating back to Valiant in 1984, have therefore sought to design efficient algorithms by imposing assumptions on $f$ such as monotonicity.
Our first result is a $\mathsf{BPP}^{\mathsf{NP}}$ algorithm for the general problem. The key ingredient is a new notion of the balanced influence of variables, a natural variant of influence that corrects for the bias of the function. Balanced influences can be accurately estimated via uniform generation, and classic $\mathsf{BPP}^{\mathsf{NP}}$ algorithms are known for the latter task.
We then consider certification with stricter instance-wise guarantees: for each $x^\star$, find a certificate whose size scales with that of the smallest certificate for $x^\star$. In sharp contrast with our first result, we show that this problem is $\mathsf{NP}^{\mathsf{NP}}$-hard even to approximate. We obtain an optimal inapproximability ratio, adding to a small handful of problems in the higher levels of the polynomial hierarchy for which optimal inapproximability is known. Our proof involves the novel use of bit-fixing dispersers for gap amplification.
Submitted 4 November, 2022;
originally announced November 2022.
-
Superpolynomial Lower Bounds for Decision Tree Learning and Testing
Authors:
Caleb Koch,
Carmen Strassle,
Li-Yang Tan
Abstract:
We establish new hardness results for decision tree optimization problems, adding to a line of work that dates back to Hyafil and Rivest in 1976. We prove, under randomized ETH, superpolynomial lower bounds for two basic problems: given an explicit representation of a function $f$ and a generator for a distribution $\mathcal{D}$, construct a small decision tree approximator for $f$ under $\mathcal{D}$, and decide if there is a small decision tree approximator for $f$ under $\mathcal{D}$.
Our results imply new lower bounds for distribution-free PAC learning and testing of decision trees, settings in which the algorithm only has restricted access to $f$ and $\mathcal{D}$. Specifically, we show: $n$-variable size-$s$ decision trees cannot be properly PAC learned in time $n^{\tilde{O}(\log\log s)}$, and depth-$d$ decision trees cannot be tested in time $\exp(d^{\,O(1)})$. For learning, the previous best lower bound only ruled out $\text{poly}(n)$-time algorithms (Alekhnovich, Braverman, Feldman, Klivans, and Pitassi, 2009). For testing, recent work gives similar though incomparable bounds in the setting where $f$ is random and $\mathcal{D}$ is nonexplicit (Blais, Ferreira Pinto Jr., and Harms, 2021). Assuming a plausible conjecture on the hardness of Set-Cover, we show our lower bound for learning decision trees can be improved to $n^{Ω(\log s)}$, matching the best known upper bound of $n^{O(\log s)}$ due to Ehrenfeucht and Haussler (1989).
We obtain our results within a unified framework that leverages recent progress in two lines of work: the inapproximability of Set-Cover and XOR lemmas for query complexity. Our framework is versatile and yields results for related concept classes such as juntas and DNF formulas.
Submitted 12 October, 2022;
originally announced October 2022.
-
A Query-Optimal Algorithm for Finding Counterfactuals
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Li-Yang Tan
Abstract:
We design an algorithm for finding counterfactuals with strong theoretical guarantees on its performance. For any monotone model $f : X^d \to \{0,1\}$ and instance $x^\star$, our algorithm makes $S(f)^{O(Δ_f(x^\star))}\cdot \log d$ queries to $f$ and returns an optimal counterfactual for $x^\star$: a nearest instance $x'$ to $x^\star$ for which $f(x')\ne f(x^\star)$. Here $S(f)$ is the sensitivity of $f$, a discrete analogue of the Lipschitz constant, and $Δ_f(x^\star)$ is the distance from $x^\star$ to its nearest counterfactuals. The previous best known query complexity was $d^{\,O(Δ_f(x^\star))}$, achievable by brute-force local search. We further prove a lower bound of $S(f)^{Ω(Δ_f(x^\star))} + Ω(\log d)$ on the query complexity of any algorithm, thereby showing that the guarantees of our algorithm are essentially optimal.
Submitted 14 July, 2022;
originally announced July 2022.
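The brute-force local search baseline mentioned in the abstract above can be sketched as follows. This is not the paper's query-optimal algorithm; it is only the $d^{\,O(Δ_f(x^\star))}$ baseline, written under the assumptions that $X = \{0,1\}$ and that distance is Hamming distance. The function name `nearest_counterfactual` is hypothetical.

```python
from itertools import combinations


def nearest_counterfactual(f, x):
    """Return (x', distance) for a nearest x' with f(x') != f(x), or None.

    Assumptions for this sketch: inputs are 0/1 tuples and distance is
    Hamming distance. Radius r is searched in increasing order, so the
    first hit is a nearest counterfactual.
    """
    base = f(x)
    d = len(x)
    for r in range(1, d + 1):
        for flipped in combinations(range(d), r):
            y = list(x)
            for i in flipped:
                y[i] = 1 - y[i]
            y = tuple(y)
            if f(y) != base:
                return y, r
    return None  # f is constant, so no counterfactual exists


def all_three(x):
    return int(x[0] and x[1] and x[2])
```

The number of queries at radius $r$ is $\binom{d}{r}$, which is where the $d^{\,O(Δ)}$ cost of the baseline comes from.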
-
Deep Reinforcement Learning for Data-Driven Adaptive Scanning in Ptychography
Authors:
Marcel Schloz,
Johannes Müller,
Thomas C. Pekin,
Wouter Van den Broek,
Christoph T. Koch
Abstract:
We present a method that lowers the dose required for a ptychographic reconstruction by adaptively scanning the specimen, thereby providing the required spatial information redundancy in the regions of highest importance. The proposed method is built upon a deep learning model that is trained by reinforcement learning (RL), using prior knowledge of the specimen structure from training data sets. We show that equivalent low-dose experiments using adaptive scanning outperform conventional ptychography experiments in terms of reconstruction resolution.
Submitted 29 March, 2022;
originally announced March 2022.
-
Which programming languages do hackers use? A survey at the German Chaos Computer Club
Authors:
Christian Koch,
Katharina Müller,
Eldar Sultanow
Abstract:
There are numerous articles about the programming languages most commonly used by hackers. Among them, however, there are hardly any scientific studies. One reason might be that hackers mainly operate anonymously and are difficult to reach. This paper aims to shed light on this interesting and relevant research question. In order to find answers, we conducted a survey among the members of the German Chaos Computer Club. As one of the world's largest organisations for information security and hacking, the club provides a good basis for our study. We examine the question of which programming languages are used by hackers as well as the importance of the programming language for their work. The paper offers first insights into the topic and can provide a starting point for further research.
Submitted 23 March, 2022;
originally announced March 2022.
-
A Matheuristic Approach for Solving a Simultaneous Lot Sizing and Scheduling Problem with Client Prioritization in Tire Industry
Authors:
Cyril Koch,
Taha Arbaoui,
Yassine Ouazene,
Farouk Yalaoui,
Humbert De Brunier,
Nicolas Jaunet,
Antoine De Wulf
Abstract:
This paper introduces an integrated lot sizing and scheduling problem inspired by a real-world application in the off-the-road tire industry. This problem considers the assignment of different items on parallel machines with complex eligibility constraints within a finite planning horizon. It also considers a large panel of specific constraints such as backordering, a limited number of setups, upstream resource saturation and customer prioritization. A novel mixed integer formulation is proposed with the objective of optimizing different normalized criteria related to the inventory and service level performance. Based on this mathematical formulation, a problem-based matheuristic method that solves the lot sizing and assignment problems separately is proposed to solve the industrial case. A computational study and sensitivity analysis are carried out based on real-world data with up to 170 products, 70 unrelated parallel machines and 42 periods. The obtained results show the effectiveness of the proposed approach in improving the company's solution. Indeed, the two most important KPIs for the management have been improved by 32% for the backorders and 13% for the overstock, respectively. Moreover, the computational time has been reduced significantly.
Submitted 21 January, 2022;
originally announced January 2022.
-
The Query Complexity of Certification
Authors:
Guy Blanc,
Caleb Koch,
Jane Lange,
Li-Yang Tan
Abstract:
We study the problem of certification: given queries to a function $f : \{0,1\}^n \to \{0,1\}$ with certificate complexity $\le k$ and an input $x^\star$, output a size-$k$ certificate for $f$'s value on $x^\star$. This abstractly models a central problem in explainable machine learning, where we think of $f$ as a blackbox model that we seek to explain the predictions of.
For monotone functions, a classic local search algorithm of Angluin accomplishes this task with $n$ queries, which we show is optimal for local search algorithms. Our main result is a new algorithm for certifying monotone functions with $O(k^8 \log n)$ queries, which comes close to matching the information-theoretic lower bound of $Ω(k \log n)$. The design and analysis of our algorithm are based on a new connection to threshold phenomena in monotone functions.
We further prove exponential-in-$k$ lower bounds when $f$ is non-monotone, and when $f$ is monotone but the algorithm is only given random examples of $f$. These lower bounds show that assumptions on the structure of $f$ and query access to it are both necessary for the polynomial dependence on $k$ that we achieve.
Submitted 6 April, 2022; v1 submitted 19 January, 2022;
originally announced January 2022.
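The classic local search for monotone functions referred to in the abstract above can be sketched roughly as follows. This is a hedged reconstruction of the standard greedy idea, not code from the paper and not its $O(k^8 \log n)$ algorithm; the function name `certify_monotone` and the dictionary representation of a certificate are assumptions made for illustration.

```python
def certify_monotone(f, x):
    """Greedy local search for a certificate of a monotone f at input x.

    Returns {coordinate: value}: fixing these coordinates of x forces f's
    value. Uses one query for f(x) plus at most len(x) further queries.
    Correctness relies on f being monotone (an assumption of this sketch).
    """
    y = list(x)
    target = f(tuple(y))
    # If f(x) = 1, try lowering 1s to 0; if f(x) = 0, try raising 0s to 1.
    # Coordinates that cannot be moved without changing f stay in the certificate.
    for i in range(len(y)):
        if y[i] != target:
            continue  # this coordinate can never be part of the certificate
        y[i] = 1 - target
        if f(tuple(y)) != target:
            y[i] = target  # flipping changed the output, so coordinate i is needed
    return {i: target for i, v in enumerate(y) if v == target}


def example(x):
    # Monotone function (x0 AND x1) OR x2, used only for illustration.
    return int((x[0] and x[1]) or x[2])
```

By monotonicity, any input that agrees with the returned coordinates dominates (for $f = 1$) or is dominated by (for $f = 0$) the final point `y`, so its value is forced. The linear number of queries is what the paper's main result improves on for small $k$.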
-
Robustness against Read Committed for Transaction Templates with Functional Constraints
Authors:
Brecht Vandevoort,
Bas Ketsman,
Christoph Koch,
Frank Neven
Abstract:
The popular isolation level Multiversion Read Committed (RC) trades some of the strong guarantees of serializability for increased transaction throughput. Sometimes, transaction workloads can be safely executed under RC obtaining serializability at the lower cost of RC. Such workloads are said to be robust against RC. Previous work has yielded a tractable procedure for deciding robustness against RC for workloads generated by transaction programs modeled as transaction templates. An important insight of that work is that, by more accurately modeling transaction programs, we are able to recognize larger sets of workloads as robust. In this work, we increase the modeling power of transaction templates by extending them with functional constraints, which are useful for capturing data dependencies like foreign keys. We show that the incorporation of functional constraints can identify more workloads as robust that otherwise would not be. Even though we establish that the robustness problem becomes undecidable in its most general form, we show that various restrictions on functional constraints lead to decidable and even tractable fragments that can be used to model and test for robustness against RC for realistic scenarios.
Submitted 22 December, 2023; v1 submitted 13 January, 2022;
originally announced January 2022.
-
Detecting Slag Formations with Deep Convolutional Neural Networks
Authors:
Christian von Koch,
William Anzén,
Max Fischer,
Raazesh Sainudiin
Abstract:
We investigate the ability to detect slag formations in images from inside a Grate-Kiln system furnace with two deep convolutional neural networks. The conditions inside the furnace cause occasional obstructions of the camera view. Our approach suggests dealing with this problem by introducing a convLSTM-layer in the deep convolutional neural network. The results show that it is possible to achieve sufficient performance to automate the decision of timely countermeasures in the industrial operational setting. Furthermore, the addition of the convLSTM-layer results in fewer outlying predictions and a lower running variance of the fraction of detected slag in the image time series.
Submitted 13 October, 2021;
originally announced October 2021.
-
Robustness against Read Committed for Transaction Templates
Authors:
Brecht Vandevoort,
Bas Ketsman,
Christoph Koch,
Frank Neven
Abstract:
The isolation level Multiversion Read Committed (RC), offered by many database systems, is known to trade consistency for increased transaction throughput. Sometimes, transaction workloads can be safely executed under RC obtaining the perfect isolation of serializability at the lower cost of RC. To identify such cases, we introduce an expressive model of transaction programs to better reason about the serializability of transactional workloads. We develop tractable algorithms to decide whether any possible schedule of a workload executed under RC is serializable (referred to as the robustness problem). Our approach yields robust subsets that are larger than those identified by previous methods. We provide experimental evidence that workloads that are robust against RC can be evaluated faster under RC compared to stronger isolation levels. We discuss techniques for making workloads robust against RC by promoting selective read operations to updates. Depending on the scenario, the performance improvements can be considerable. Robustness testing and safely executing transactions under the lower isolation level RC can therefore provide a direct way to increase transaction throughput without changing DBMS internals.
Submitted 26 July, 2021;
originally announced July 2021.
-
Extraction and Analysis of Highway On-Ramp Merging Scenarios from Naturalistic Trajectory Data
Authors:
Lars Klitzke,
Kay Gimm,
Carsten Koch,
Frank Köster
Abstract:
Connected and Automated Vehicles (CAVs) are envisioned to transform the future industrial and private transportation sectors. However, due to the system's enormous complexity, functional verification and validation of safety aspects are essential before the technology merges into the public domain. Therefore, in recent years, a scenario-driven approach has gained acceptance, emphasizing the requirement of a solid data basis of scenarios. The large-scale research facility Test Bed Lower Saxony (TFNDS) enables the provision of ample information for a database of scenarios on highways. For that purpose, however, the scenarios of interest must be identified and extracted from the collected Naturalistic Trajectory Data (NTD). This work addresses this problem and proposes a methodology for on-ramp scenario extraction, enabling scenario categorization and assessment. A Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) are utilized for extraction, and a decision tree with the Surrogate Measure of Safety (SMoS) Post Encroachment Time (PET) for categorization and assessment. The efficacy of the approach is shown with a dataset of NTD collected on the TFNDS.
Submitted 3 March, 2022; v1 submitted 12 April, 2021;
originally announced April 2021.
-
Increasing the Quality of 360° Video Streaming by Transitioning between Viewport Quality Adaptation Mechanisms
Authors:
Christian Koch,
Arne-Tobias Rak,
Michael Zink,
Ralf Steinmetz,
Amr Rizk
Abstract:
Virtual reality has been gaining popularity in recent years owing to the proliferation of affordable consumer-grade devices such as Oculus Rift, HTC Vive, and Samsung VR. Amongst the various VR applications, 360° video streaming is currently one of the most popular ones. It allows users to change their field-of-view (FoV) based on head movement, which enables them to freely select an area anywhere from the sphere the video is (virtually) projected to. While 360° video streaming offers new exciting ways of consuming content for viewers, it poses a series of challenges to the systems that are responsible for the distribution of such content from the origin to the viewer. One challenge is the significantly increased bandwidth requirement for streaming such content in real time. Recent research has shown that only streaming the content that is in the user's FoV in high quality can lead to strong bandwidth savings. This can be achieved by analyzing the viewer's head orientation and movement based on sensor information. Alternatively, historic information from users that watched the content in the past can be taken into account to prefetch 360° video data in high quality assuming the viewer will direct the FoV to these areas. In this paper, we present a 360° video streaming system that transitions between sensor- and content-based predictive mechanisms. We evaluate the effects of this transition-based approach on the Quality of Experience (QoE) of such a VR streaming system and show that the perceived quality can be increased by between 50% and 80% compared to systems that only apply either one of the two approaches.
Submitted 6 October, 2019;
originally announced October 2019.
-
A Compiler-Compiler for DSL Embedding
Authors:
Amir Shaikhha,
Vojin Jovanovic,
Christoph Koch
Abstract:
In this paper, we present a framework to generate compilers for embedded domain-specific languages (EDSLs). This framework provides facilities to automatically generate the boilerplate code required for building DSL compilers on top of extensible optimizing compilers. We evaluate the practicality of our framework by demonstrating several use-cases successfully built with it.
Submitted 3 August, 2018;
originally announced August 2018.
-
Compiling Database Application Programs
Authors:
Mohammad Dashti,
Sachin Basil John,
Thierry Coppey,
Amir Shaikhha,
Vojin Jovanovic,
Christoph Koch
Abstract:
There is a trend towards increased specialization of data management software for performance reasons. In this paper, we study the automatic specialization and optimization of database application programs -- sequences of queries and updates, augmented with control flow constructs as they appear in database scripts, UDFs, transactional workloads and triggers in languages such as PL/SQL. We show how to build an optimizing compiler for database application programs using generative programming and state-of-the-art compiler technology.
We evaluate a hand-optimized low-level implementation of TPC-C, and identify the key optimization techniques that account for its good performance. Our compiler fully automates these optimizations and, applied to this benchmark, outperforms the manually optimized baseline by a factor of two. By selectively disabling some of the optimizations in the compiler, we derive a clinical and precise way of obtaining insight into their individual performance contributions.
Submitted 25 July, 2018;
originally announced July 2018.
-
Detection and Analysis of Content Creator Collaborations in YouTube Videos using Face- and Speaker-Recognition
Authors:
Moritz Lode,
Michael Örtl,
Christian Koch,
Amr Rizk,
Ralf Steinmetz
Abstract:
This work discusses and implements the application of speaker recognition for the detection of collaborations in YouTube videos. CATANA, an existing framework for detection and analysis of YouTube collaborations, uses face recognition for the detection of collaborators, which naturally performs poorly on video content in which no faces appear. This work proposes an extension of CATANA using active speaker detection and speaker recognition to improve the detection accuracy.
Submitted 5 July, 2018;
originally announced July 2018.
-
Efficient Differentiable Programming in a Functional Array-Processing Language
Authors:
Amir Shaikhha,
Andrew Fitzgibbon,
Dimitrios Vytiniotis,
Simon Peyton Jones,
Christoph Koch
Abstract:
We present a system for the automatic differentiation of a higher-order functional array-processing language. The core functional language underlying this system simultaneously supports both source-to-source automatic differentiation and global optimizations such as loop transformations. Thanks to this feature, we demonstrate how for some real-world machine learning and computer vision benchmarks, the system outperforms the state-of-the-art automatic differentiation tools.
Submitted 6 June, 2018;
originally announced June 2018.
-
Collaborations on YouTube: From Unsupervised Detection to the Impact on Video and Channel Popularity
Authors:
Christian Koch,
Moritz Lode,
Denny Stohr,
Amr Rizk,
Ralf Steinmetz
Abstract:
YouTube is one of the most popular platforms for streaming of user-generated video. Nowadays, professional YouTubers are organized in so-called multi-channel networks (MCNs). These networks offer services such as brand deals, equipment, and strategic advice in exchange for a share of the YouTubers' revenue. A major strategy to gain more subscribers and, hence, revenue is collaborating with other YouTubers. Yet, collaborations on YouTube have not been studied in a detailed quantitative manner. This paper aims to close this gap with the following contributions. First, we collect a YouTube dataset covering video statistics over three months for 7,942 channels. Second, we design a framework for collaboration detection given a previously unknown number of persons featuring in YouTube videos. We denote this framework for the analysis of collaborations in YouTube videos using a Deep Neural Network (DNN) based approach as CATANA. Third, we analyze about 2.4 years of video content and use CATANA to answer research questions providing guidance for YouTubers and MCNs for efficient collaboration strategies. Thereby, we focus on (i) collaboration frequency and partner selectivity, (ii) the influence of MCNs on channel collaborations, (iii) collaborating channel types, and (iv) the impact of collaborations on video and channel popularity. Our results show that collaborations are in many cases significantly beneficial in terms of viewers and newly attracted subscribers for both collaborating channels, often showing more than 100% popularity growth compared with non-collaboration videos.
Submitted 1 May, 2018;
originally announced May 2018.
-
Java Extensions for OMNeT++
Authors:
Henning Puttnies,
Peter Danielis,
Christian Koch,
Dirk Timmermann
Abstract:
On the one side, network simulation frameworks are important tools for research and development activities to evaluate novel approaches in a time- and cost-efficient way. On the other side, Java as a highly platform-independent programming language is ideally suited for rapid prototyping in heterogeneous scenarios. Consequently, Java simulation frameworks could be used to firstly perform functional verification of new approaches (and protocols) in a simulation environment and afterwards, to evaluate these approaches in real testbeds using prototype Java implementations. Finally, the simulation models can be refined using real world measurement data. Unfortunately, there is to the best of our knowledge no satisfying Java framework for network simulation, as the OMNeT++ Java support ended with OMNeT++ version 4.6. Hence, our contributions are as follows: we present Java extensions for OMNeT++ 5.0 that enable the execution of Java simulation models and give a detailed explanation of the working principles of the OMNeT++ Java extensions that are based on Java Native Interface. We conduct several case studies to evaluate the concept of Java extensions for OMNeT++. Most importantly, we show that the combined use of Java simulation models and C++ models (e.g., from the INET framework) is possible.
Submitted 8 September, 2017;
originally announced September 2017.
-
Hyperprofile-based Computation Offloading for Mobile Edge Networks
Authors:
Andrew Crutcher,
Caleb Koch,
Kyle Coleman,
Jon Patman,
Flavio Esposito,
Prasad Calyam
Abstract:
In recent studies, researchers have developed various computation offloading frameworks for bringing cloud services closer to the user via edge networks. Specifically, an edge device needs to offload computationally intensive tasks because of energy and processing constraints. These constraints present the challenge of identifying which edge nodes should receive tasks to reduce overall resource consumption. We propose a unique solution to this problem which incorporates elements from Knowledge-Defined Networking (KDN) to make intelligent predictions about offloading costs based on historical data. Each server instance can be represented in a multidimensional feature space where each dimension corresponds to a predicted metric. We compute features for a "hyperprofile" and position nodes based on the predicted costs of offloading a particular task. We then perform a k-Nearest Neighbor (kNN) query within the hyperprofile to select nodes for offloading computation. This paper formalizes our hyperprofile-based solution and explores the viability of using machine learning (ML) techniques to predict metrics useful for computation offloading. We also investigate the effects of using different distance metrics for the queries. Our results show that various network metrics can be modeled accurately with regression, and that there are circumstances where kNN queries using Euclidean distance, as opposed to rectilinear distance, are more favorable.
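The kNN selection step described above can be illustrated with a minimal sketch. The node names, the two cost dimensions, and the choice of the origin as the "ideal" query point are assumptions made for illustration only; they are not taken from the paper. The sketch shows that the nearest node can differ between Euclidean and rectilinear distance.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rectilinear(a, b):
    """Rectilinear (L1, Manhattan) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn(query, nodes, k, distance):
    """Return the names of the k nodes closest to the query point."""
    ranked = sorted(nodes, key=lambda name: distance(query, nodes[name]))
    return ranked[:k]


# Hypothetical predicted costs per edge node: (latency, energy).
nodes = {
    "edge-a": (3.0, 3.0),
    "edge-b": (0.0, 5.0),
    "edge-c": (6.0, 6.0),
}
ideal = (0.0, 0.0)  # assumed query point: zero predicted cost

print(knn(ideal, nodes, 1, euclidean))
print(knn(ideal, nodes, 1, rectilinear))
```

With these made-up values, Euclidean distance prefers the node with balanced costs, while rectilinear distance prefers the node with the lower total cost.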
Submitted 28 July, 2017;
originally announced July 2017.
-
Building Efficient Query Engines in a High-Level Language
Authors:
Amir Shaikhha,
Yannis Klonatos,
Christoph Koch
Abstract:
Abstraction without regret refers to the vision of using high-level programming languages for systems development without experiencing a negative impact on performance. A database system designed according to this vision offers both increased productivity and high performance, instead of sacrificing the former for the latter as is the case with existing, monolithic implementations that are hard to maintain and extend. In this article, we realize this vision in the domain of analytical query processing. We present LegoBase, a query engine written in the high-level language Scala. The key technique to regain efficiency is to apply generative programming: LegoBase performs source-to-source compilation and optimizes the entire query engine by converting the high-level Scala code to specialized, low-level C code. We show how generative programming makes it easy to implement a wide spectrum of optimizations, such as introducing data partitioning or switching from a row to a column data layout, which are difficult to achieve with existing low-level query compilers that handle only queries. We demonstrate that sufficiently powerful abstractions are essential for dealing with the complexity of the optimization effort, shielding developers from compiler internals and decoupling individual optimizations from each other. We evaluate our approach with the TPC-H benchmark and show that: (a) With all optimizations enabled, LegoBase significantly outperforms a commercial database and an existing query compiler. (b) Programmers need to provide just a few hundred lines of high-level code for implementing the optimizations, instead of the complicated low-level code that is required by existing query compilation approaches. (c) The compilation overhead is low compared to the overall execution time, thus making our approach usable in practice for compiling query engines.
Submitted 16 December, 2016;
originally announced December 2016.
-
Push vs. Pull-Based Loop Fusion in Query Engines
Authors:
Amir Shaikhha,
Mohammad Dashti,
Christoph Koch
Abstract:
Database query engines use pull-based or push-based approaches to avoid the materialization of data across query operators. In this paper, we study these two types of query engines in depth and present the limitations and advantages of each engine. Similarly, the programming languages community has developed loop fusion techniques to remove intermediate collections in the context of collection programming. We draw parallels between the DB and PL communities by demonstrating the connection between pipelined query engines and loop fusion techniques. Based on this connection, we propose a new type of pull-based engine, inspired by a loop fusion technique, which combines the benefits of both approaches. Then we experimentally evaluate the various engines, in the context of query compilation, for the first time in a fair environment, eliminating the biasing impact of ancillary optimizations that have traditionally only been used with one of the approaches. We show that for realistic analytical workloads, there is no considerable advantage for either form of pipelined query engine, as opposed to what recent research suggests. Also, by using microbenchmarks we show that our proposed engine dominates the existing engines by combining the benefits of both.
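The distinction between the two pipelining styles can be sketched as follows. This is a generic illustration, not the engine proposed in the paper: the operator names and the example query (keep even numbers, multiply by 10) are assumptions. In the pull style each operator asks its child for the next row; in the push style each operator hands rows to its consumer. Neither variant materializes an intermediate collection.

```python
# Pull-based pipeline: operators are iterators that request rows from their child.
def scan_pull(rows):
    for row in rows:
        yield row


def select_pull(child, predicate):
    for row in child:
        if predicate(row):
            yield row


def map_pull(child, fn):
    for row in child:
        yield fn(row)


# Push-based pipeline: operators are callbacks that forward rows to a consumer.
def select_push(predicate, consumer):
    def on_row(row):
        if predicate(row):
            consumer(row)
    return on_row


def map_push(fn, consumer):
    def on_row(row):
        consumer(fn(row))
    return on_row


def scan_push(rows, consumer):
    for row in rows:
        consumer(row)


def run_pull(rows):
    plan = map_pull(select_pull(scan_pull(rows), lambda r: r % 2 == 0), lambda r: r * 10)
    return list(plan)  # the root operator drives execution by pulling


def run_push(rows):
    out = []
    plan = select_push(lambda r: r % 2 == 0, map_push(lambda r: r * 10, out.append))
    scan_push(rows, plan)  # the scan drives execution by pushing
    return out


print(run_pull(range(6)))
print(run_push(range(6)))
```

Both functions compute the same result; they differ only in which end of the pipeline controls the loop.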
Submitted 28 October, 2016;
originally announced October 2016.
-
Bootstrap percolation on geometric inhomogeneous random graphs
Authors:
Christoph Koch,
Johannes Lengler
Abstract:
Geometric inhomogeneous random graphs (GIRGs) are a model for scale-free networks with underlying geometry. We study bootstrap percolation on these graphs, which is a process modelling the spread of an infection of vertices starting within a (small) local region. We show that the process exhibits a phase transition in terms of the initial infection rate in this region. We determine the speed of the process in the supercritical case, up to lower order terms, and show that its evolution is fundamentally influenced by the underlying geometry. For vertices with given position and expected degree, we determine the infection time up to lower order terms. Finally, we show how this knowledge can be used to contain the infection locally by removing relatively few edges from the graph. This is the first time that the role of geometry on bootstrap percolation is analysed mathematically for geometric scale-free networks.
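For readers unfamiliar with the process, the following is a minimal sketch of bootstrap percolation in its standard threshold form: a vertex becomes infected once at least `threshold` of its neighbours are infected, and infected vertices stay infected. The small hand-written graph is an assumption for illustration; it is not a GIRG, and the sketch does not reproduce the paper's analysis.

```python
from collections import deque


def bootstrap_percolation(adjacency, seeds, threshold):
    """Return the final infected set for threshold-based bootstrap percolation."""
    infected = set(seeds)
    infected_neighbours = {v: 0 for v in adjacency}
    queue = deque(infected)
    while queue:
        v = queue.popleft()
        # Each infected vertex is processed once and notifies its neighbours.
        for w in adjacency[v]:
            if w in infected:
                continue
            infected_neighbours[w] += 1
            if infected_neighbours[w] >= threshold:
                infected.add(w)
                queue.append(w)
    return infected


# Hypothetical undirected graph given as adjacency lists.
graph = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [3],
}

final = bootstrap_percolation(graph, seeds={0, 1}, threshold=2)
print(sorted(final))
```

In this toy graph, vertex 4 has only one neighbour and so can never reach a threshold of 2, which is a small-scale analogue of containing the infection by limiting edges.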
Submitted 1 September, 2020; v1 submitted 18 February, 2016;
originally announced March 2016.
-
Repairing Conflicts among MVCC Transactions
Authors:
Mohammad Dashti,
Sachin Basil John,
Amir Shaikhha,
Christoph Koch
Abstract:
The optimistic variants of MVCC (Multi-Version Concurrency Control) avoid blocking concurrent transactions at the cost of having a validation phase. Upon failure in the validation phase, the transaction is usually aborted and restarted from scratch. The "abort and restart" approach becomes a performance bottleneck for use cases with high-contention objects or long-running transactions. In addition, restarting from scratch creates a negative feedback loop in the system, because the system incurs additional overhead that may create even further conflicts.
In this paper, we propose a novel approach for conflict resolution in MVCC for in-memory databases. This low overhead approach summarizes the transaction programs in the form of a dependency graph. The dependency graph also contains the constructs used in the validation phase of the MVCC algorithm. Then, in the case of encountering conflicts among transactions, the conflict locations in the program are quickly detected, and the conflicting transactions are partially re-executed. This approach maximizes the reuse of the computations done in the initial execution round, and increases the transaction processing throughput.
Submitted 1 March, 2016;
originally announced March 2016.
-
A Fast Randomized Algorithm for Multi-Objective Query Optimization
Authors:
Immanuel Trummer,
Christoph Koch
Abstract:
Query plans are compared according to multiple cost metrics in multi-objective query optimization. The goal is to find the set of Pareto plans realizing optimal cost tradeoffs for a given query. So far, only algorithms with exponential complexity in the number of query tables have been proposed for multi-objective query optimization. In this work, we present the first algorithm with polynomial complexity in the query size.
Our algorithm is randomized and iterative. It improves query plans via a multi-objective version of hill climbing that applies multiple transformations in each climbing step for maximal efficiency. Based on a locally optimal plan, we approximate the Pareto plan set within the restricted space of plans with similar join orders. We maintain a cache of Pareto-optimal plans for each potentially useful intermediate result to share partial plans that were discovered in different iterations. We show that each iteration of our algorithm performs in expected polynomial time based on an analysis of the expected path length between a random plan and local optima reached by hill climbing. We experimentally show that our algorithm can optimize queries with hundreds of tables and outperforms other randomized algorithms such as the NSGA-II genetic algorithm over a wide range of scenarios.
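The notion of a Pareto plan set used above can be made concrete with a small sketch. It shows only Pareto dominance and filtering over a fixed set of candidate plans; it is not the randomized hill-climbing algorithm from the paper. The plan names and the two cost metrics are hypothetical, and lower cost is assumed to be better in every metric.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(plans):
    """Keep only the plans that no other plan dominates."""
    return {
        name: cost
        for name, cost in plans.items()
        if not any(dominates(other, cost) for other in plans.values())
    }


# Hypothetical plans with cost vectors (execution time, monetary fee).
plans = {
    "p1": (10.0, 5.0),
    "p2": (4.0, 9.0),
    "p3": (12.0, 6.0),  # dominated by p1
    "p4": (4.0, 9.5),   # dominated by p2
}

print(sorted(pareto_front(plans)))
```

The surviving plans p1 and p2 represent different tradeoffs: neither is better than the other in both metrics.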
Submitted 1 March, 2016;
originally announced March 2016.
-
Solving the Join Ordering Problem via Mixed Integer Linear Programming
Authors:
Immanuel Trummer,
Christoph Koch
Abstract:
We transform join ordering into a mixed integer linear program (MILP). This allows query optimization to be addressed by mature MILP solver implementations that have evolved over decades and steadily improved their performance. They offer features such as anytime optimization and parallel search that are highly relevant for query optimization.
We present a MILP formulation for searching left-deep query plans. We use sets of binary variables to represent join operands and intermediate results, operator implementation choices, or the presence of interesting orders. Linear constraints restrict value assignments to the ones representing valid query plans. We approximate the cost of scan and join operations via linear functions, which allows approximation precision to be increased to an arbitrary degree. Our experimental results are encouraging: we are able to find optimal plans for joins between 60 tables, a query size that is beyond the capabilities of prior exhaustive query optimization methods.
Submitted 6 November, 2015;
originally announced November 2015.
-
Probably Approximately Optimal Query Optimization
Authors:
Immanuel Trummer,
Christoph Koch
Abstract:
Evaluating query predicates on data samples is the only way to estimate their selectivity in certain scenarios. Finding a guaranteed optimal query plan is not a reasonable optimization goal in those cases as it might require an infinite number of samples. We therefore introduce probably approximately optimal query optimization (PAO) where the goal is to find a query plan whose cost is near-optimal with a certain probability. We will justify why PAO is a suitable formalism to model scenarios in which predicate sampling and optimization need to be interleaved.
We present the first algorithm for PAO. Our algorithm is non-intrusive and uses standard query optimizers and sampling components as sub-functions. It is generic and can be applied to a wide range of scenarios. Our algorithm is iterative and calculates in each iteration a query plan together with a region in the selectivity space where the plan has near-optimal cost. It determines the confidence that the true selectivity values fall within the aforementioned region and chooses the next samples to take based on the current state if the confidence does not reach the threshold specified as problem input. We devise different algorithm variants and analyze their complexity. We experimentally compare them in terms of the number of optimizer invocations, samples, and iterations over many different query classes.
Submitted 5 November, 2015;
originally announced November 2015.
-
Parallelizing Query Optimization on Shared-Nothing Architectures
Authors:
Immanuel Trummer,
Christoph Koch
Abstract:
Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query evaluation. We show how to parallelize query optimization at a massive scale.
We present algorithms for parallel query optimization in left-deep and bushy plan spaces. At optimization start, we divide the plan space for a given query into partitions of equal size that are explored in parallel by worker nodes. At the end of optimization, each worker returns the optimal plan in its partition to the master which determines the globally optimal plan from the partition-optimal plans. No synchronization or data exchange is required during the actual optimization phase. The amount of data sent over the network, at the start and at the end of optimization, as well as the complexity of serial steps within our algorithms increase only linearly in the number of workers and in the query size. The time and space complexity of optimization within one partition decreases uniformly in the number of workers. We parallelize single- and multi-objective query optimization over a cluster with 100 nodes in our experiments, using more than 250 concurrent worker threads (Spark executors). Despite high network latency and task assignment overheads, parallelization yields speedups of up to one order of magnitude for large queries whose optimization takes minutes on a single node.
Submitted 5 November, 2015;
originally announced November 2015.
-
Multiple Query Optimization on the D-Wave 2X Adiabatic Quantum Computer
Authors:
Immanuel Trummer,
Christoph Koch
Abstract:
The D-Wave adiabatic quantum annealer solves hard combinatorial optimization problems leveraging quantum physics. The newest version features over 1000 qubits and was released in August 2015. We were given access to such a machine, currently hosted at NASA Ames Research Center in California, to explore the potential for hard optimization problems that arise in the context of databases.
In this paper, we tackle the problem of multiple query optimization (MQO). We show how an MQO problem instance can be transformed into a mathematical formula that complies with the restrictive input format accepted by the quantum annealer. This formula is translated into weights on and between qubits such that the configuration minimizing the input formula can be found via a process called adiabatic quantum annealing. We analyze the asymptotic growth rate of the number of required qubits in the MQO problem dimensions as the number of qubits is currently the main factor restricting applicability. We experimentally compare the performance of the quantum annealer against other MQO algorithms executed on a traditional computer. While the problem sizes that can be treated are currently limited, we already find a class of problem instances where the quantum annealer is three orders of magnitude faster than other approaches.
Submitted 21 October, 2015;
originally announced October 2015.
-
Incremental View Maintenance For Collection Programming
Authors:
Christoph Koch,
Daniel Lupei,
Val Tannen
Abstract:
In the context of incremental view maintenance (IVM), delta query derivation is an essential technique for speeding up the processing of large, dynamic datasets. The goal is to generate delta queries that, given a small change in the input, can update the materialized view more efficiently than via recomputation. In this work we propose the first solution for the efficient incrementalization of positive nested relational calculus (NRC+) on bags (with integer multiplicities). More precisely, we model the cost of NRC+ operators and classify queries as efficiently incrementalizable if their delta has a strictly lower cost than full re-evaluation. Then, we identify IncNRC+, a large fragment of NRC+ that is efficiently incrementalizable, and we provide a semantics-preserving translation that takes any NRC+ query to a collection of IncNRC+ queries. Furthermore, we prove that incremental maintenance for NRC+ is within the complexity class NC0 and we showcase how recursive IVM, a technique that has provided significant speedups over traditional IVM in the case of flat queries [25], can also be applied to IncNRC+.
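The idea of a delta query can be illustrated on a much simpler case than NRC+: a flat join over bags with integer multiplicities, where deletions are represented as negative multiplicities. When only one input changes, the join is linear in that input, so the change to the view is the join of the change with the other, unchanged input. The relations, the helper names, and the restriction to flat tuples are assumptions for illustration; this is not the translation described in the paper.

```python
from collections import Counter


def normalize(bag):
    """Drop tuples whose multiplicity is zero."""
    return Counter({t: m for t, m in bag.items() if m != 0})


def bag_add(x, y):
    """Add two bags; multiplicities may be negative to express deletions."""
    out = Counter(x)
    for t, m in y.items():
        out[t] += m
    return normalize(out)


def join(r, s):
    """Join bags of (a, b) and (b, c) tuples on b; multiplicities multiply."""
    out = Counter()
    for (a, b), m in r.items():
        for (b2, c), n in s.items():
            if b == b2:
                out[(a, b, c)] += m * n
    return normalize(out)


def delta_join(delta_r, s):
    """Change to join(r, s) when r changes by delta_r and s stays fixed."""
    return join(delta_r, s)


r = Counter({(1, "x"): 1, (2, "y"): 2})
s = Counter({("x", "p"): 1, ("y", "q"): 1})
delta_r = Counter({(3, "x"): 1, (2, "y"): -1})  # one insertion, one deletion

old_view = join(r, s)
new_view = bag_add(old_view, delta_join(delta_r, s))  # incremental update
print(new_view == join(bag_add(r, delta_r), s))       # compare with recomputation
```

The delta only touches tuples that match the changed rows, which is why it can be cheaper than re-evaluating the join over the full input.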
Submitted 11 April, 2016; v1 submitted 14 December, 2014;
originally announced December 2014.
-
The Secrets of Salient Object Segmentation
Authors:
Yin Li,
Xiaodi Hou,
Christof Koch,
James M. Rehg,
Alan L. Yuille
Abstract:
In this paper we provide an extensive evaluation of fixation prediction and salient object segmentation algorithms as well as statistics of major datasets. Our analysis identifies serious design flaws of existing salient object benchmarks, called the dataset design bias, which arise from overemphasizing stereotypical concepts of saliency. The dataset design bias not only creates a discomforting disconnection between fixations and salient object segmentation, but also misleads algorithm design. Based on our analysis, we propose a new high-quality dataset that offers both fixation and salient object segmentation ground truth. With fixations and salient objects presented simultaneously, we are able to bridge the gap between fixations and salient objects, and propose a novel method for salient object segmentation. Finally, we report significant benchmark progress on three existing datasets for segmenting salient objects.
Submitted 12 June, 2014; v1 submitted 11 June, 2014;
originally announced June 2014.
-
Approximation Schemes for Many-Objective Query Optimization
Authors:
Immanuel Trummer,
Christoph Koch
Abstract:
The goal of multi-objective query optimization (MOQO) is to find query plans that realize a good compromise between conflicting objectives such as minimizing execution time and minimizing monetary fees in a Cloud scenario. A previously proposed exhaustive MOQO algorithm needs hours to optimize even simple TPC-H queries. This is why we propose several approximation schemes for MOQO that generate guaranteed near-optimal plans in seconds where exhaustive optimization takes hours.
We integrated all MOQO algorithms into the Postgres optimizer and present experimental results for TPC-H queries; we extended the Postgres cost model and optimize for up to nine conflicting objectives in our experiments. The proposed algorithms are based on a formal analysis of typical cost functions that occur in the context of MOQO. We identify properties that hold for a broad range of objectives and can be exploited for the design of future MOQO algorithms.
Submitted 31 March, 2014;
originally announced April 2014.
-
LINVIEW: Incremental View Maintenance for Complex Analytical Queries
Authors:
Milos Nikolic,
Mohammed ElSeidy,
Christoph Koch
Abstract:
Many analytics tasks and machine learning problems can be naturally expressed by iterative linear algebra programs. In this paper, we study the incremental view maintenance problem for such complex analytical queries. We develop a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change. As a consequence, our techniques make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our evaluation demonstrates the efficiency of LINVIEW in generating parallel incremental programs that outperform re-evaluation techniques by more than an order of magnitude.
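Why a factored representation of a change helps can be seen on a single matrix product. The sketch below assumes the change to A is given as a rank-1 outer product of vectors u and v (an assumption made for illustration, not a statement of how LINVIEW represents deltas). Because (A + u vᵀ)B = AB + u(vᵀB), the product can be updated with a vector-matrix product and an outer product instead of a full matrix multiplication. Plain Python lists are used so the example has no dependencies.

```python
def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def outer(u, v):
    """Outer product u v^T."""
    return [[x * y for y in v] for x in u]


def mat_add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def vec_mat(v, b):
    """Row vector times matrix: v^T B."""
    return [sum(v[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]


def incremental_product(c, u, v, b):
    """Given c = A B and the update A += u v^T, return the new product.

    Uses (A + u v^T) B = A B + u (v^T B), so A B is never recomputed.
    """
    return mat_add(c, outer(u, vec_mat(v, b)))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
u, v = [1, 0], [0, 2]  # hypothetical rank-1 change: adds 2 to a[0][1]

c = matmul(a, b)
updated = incremental_product(c, u, v, b)
recomputed = matmul(mat_add(a, outer(u, v)), b)
print(updated == recomputed)
```

For n-by-n matrices the incremental path costs on the order of n² operations, compared with n³ for the naive recomputation, as long as the delta stays in factored form rather than being expanded into a dense matrix.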
Submitted 9 May, 2014; v1 submitted 27 March, 2014;
originally announced March 2014.
-
The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis
Authors:
Sudip Roy,
Lucja Kot,
Gabriel Bender,
Bailu Ding,
Hossein Hojjat,
Christoph Koch,
Nate Foster,
Johannes Gehrke
Abstract:
Datastores today rely on distribution and replication to achieve improved performance and fault-tolerance. But correctness of many applications depends on strong consistency properties - something that can impose substantial overheads, since it requires coordinating the behavior of multiple nodes. This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes. The key insight is to allow the state of the system to be inconsistent during execution, as long as this inconsistency is bounded and does not affect transaction correctness. In contrast to previous work, our approach uses program analysis to extract semantic information about permissible levels of inconsistency and is fully automated. We then employ a novel homeostasis protocol to allow sites to operate independently, without communicating, as long as any inconsistency is governed by appropriate treaties between the nodes. We discuss mechanisms for optimizing treaties based on workload characteristics to minimize communication, as well as a prototype implementation and experiments that demonstrate the benefits of our approach on common transactional benchmarks.
Submitted 19 January, 2015; v1 submitted 10 March, 2014;
originally announced March 2014.