-
Automated Profile-Guided Replacement of Data Structures to Reduce Memory Allocation
Authors:
Lukas Makor,
Sebastian Kloibhofer,
Peter Hofer,
David Leopoldseder,
Hanspeter Mössenböck
Abstract:
Data structures are a cornerstone of most modern programming languages. Whether they are provided via separate libraries, built into the language specification, or as part of the language's standard library -- data structures such as lists, maps, sets, or arrays provide programmers with a large repertoire of tools to deal with data. Moreover, each kind of data structure typically comes with a vari…
▽ More
Data structures are a cornerstone of most modern programming languages. Whether they are provided via separate libraries, built into the language specification, or as part of the language's standard library -- data structures such as lists, maps, sets, or arrays provide programmers with a large repertoire of tools to deal with data. Moreover, each kind of data structure typically comes with a variety of implementations that focus on scalability, memory efficiency, performance, thread-safety, or similar aspects. Choosing the *right* data structure for a particular use case can be difficult or even impossible if the data structure is part of a framework over which the user has no control. It typically requires in-depth knowledge about the program and, in particular, about the usage of the data structure in question. However, it is usually not feasible for developers to obtain such information about programs in advance. Hence, it makes sense to look for a more automated way for optimizing data structures.
We present an approach to automatically replace data structures in Java applications. We use profiling to determine allocation-site-specific metrics about data structures and their usages, and then automatically replace their allocations with customized versions, focusing on memory efficiency. Our approach is integrated into GraalVM Native Image, an Ahead-of-Time compiler for Java applications. By analyzing the generated data structure profiles, we show how standard benchmarks and microservice-based applications use data structures and demonstrate the impact of customized data structures on the memory usage of applications. We conducted an evaluation on standard and microservice-based benchmarks, in which the memory usage was reduced by up to 13.85 % in benchmarks that make heavy use of data structures. While others are only slightly affected, we could still reduce the average memory usage by 1.63 % in standard benchmarks and by 2.94 % in microservice-based benchmarks. We argue that our work demonstrates that choosing appropriate data structures can reduce the memory usage of applications. While acknowledge that our approach does not provide benefits for all kinds of workloads, our work nevertheless shows how automated profiling and replacement can be used to optimize data structures in general. Hence, we argue that our work could pave the way for future optimizations of data structures.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Control Flow Duplication for Columnar Arrays in a Dynamic Compiler
Authors:
Sebastian Kloibhofer,
Lukas Makor,
David Leopoldseder,
Daniele Bonetta,
Lukas Stadler,
Hanspeter Mössenböck
Abstract:
Columnar databases are an established way to speed up online analytical processing (OLAP) queries. Nowadays, data processing (e.g., storage, visualization, and analytics) is often performed at the programming language level, hence it is desirable to also adopt columnar data structures for common language runtimes. While there are frameworks, libraries, and APIs to enable columnar data stores in pr…
▽ More
Columnar databases are an established way to speed up online analytical processing (OLAP) queries. Nowadays, data processing (e.g., storage, visualization, and analytics) is often performed at the programming language level, hence it is desirable to also adopt columnar data structures for common language runtimes. While there are frameworks, libraries, and APIs to enable columnar data stores in programming languages, their integration into applications typically requires developer interference. In prior work, researchers implemented an approach for *automated* transformation of arrays into columnar arrays in the GraalVM JavaScript runtime. However, this approach suffers from performance issues on smaller workloads as well as on more complex nested data structures. We find that the key to optimizing accesses to columnar arrays is to identify queries and apply specific optimizations to them. In this paper, we describe novel compiler optimizations in the GraalVM Compiler that optimize queries on columnar arrays. At JIT compile time, we identify loops that access potentially columnar arrays and duplicate them in order to specifically optimize accesses to columnar arrays. Additionally, we describe a new approach for creating columnar arrays from arrays consisting of complex objects by performing **multi-level storage transformation**. We demonstrate our approach via an implementation for JavaScript `Date` objects.
[ full abstract at https://doi.org/10.22152/programming-journal.org/2023/7/9 ]
△ Less
Submitted 20 February, 2023;
originally announced February 2023.
-
Compilation Forking: A Fast and Flexible Way of Generating Data for Compiler-Internal Machine Learning Tasks
Authors:
Raphael Mosaner,
David Leopoldseder,
Wolfgang Kisling,
Lukas Stadler,
Hanspeter Mössenböck
Abstract:
Compiler optimization decisions are often based on hand-crafted heuristics centered around a few established benchmark suites. Alternatively, they can be learned from feature and performance data produced during compilation. However, data-driven compiler optimizations based on machine learning models require large sets of quality data for training in order to match or even outperform existing huma…
▽ More
Compiler optimization decisions are often based on hand-crafted heuristics centered around a few established benchmark suites. Alternatively, they can be learned from feature and performance data produced during compilation. However, data-driven compiler optimizations based on machine learning models require large sets of quality data for training in order to match or even outperform existing human-crafted heuristics. In static compilation setups, related work has addressed this problem with iterative compilation. However, a dynamic compiler may produce different data depending on dynamically-chosen compilation strategies, which aggravates the generation of comparable data. We propose compilation forking, a technique for generating consistent feature and performance data from arbitrary, dynamically-compiled programs. Different versions of program parts with the same profiling and compilation history are executed within single program runs to minimize noise stemming from dynamic compilation and the runtime environment. Our approach facilitates large-scale performance evaluations of compiler optimization decisions. Additionally, compilation forking supports creating domain-specific compilation strategies based on machine learning by providing the data for model training. We implemented compilation forking in the GraalVM compiler in a programming-language-agnostic way. To assess the quality of the generated data, we trained several machine learning models to replace compiler heuristics for loop-related optimizations. The trained models perform equally well to the highly-tuned compiler heuristics when comparing the geometric means of benchmark suite performances. Larger impacts on few single benchmarks range from speedups of 20% to slowdowns of 17%. The presented approach can be implemented in any dynamic compiler. We believe that it can help to analyze compilation decisions and leverage the use of machine learning into dynamic compilation.
△ Less
Submitted 28 June, 2022;
originally announced June 2022.
-
Capturing High-level Nondeterminism in Concurrent Programs for Practical Concurrency Model Agnostic Record & Replay
Authors:
Dominik Aumayr,
Stefan Marr,
Sophie Kaleba,
Elisa Gonzalez Boix,
Hanspeter Mössenböck
Abstract:
With concurrency being integral to most software systems, developers combine high-level concurrency models in the same application to tackle each problem with appropriate abstractions. While languages and libraries offer a wide range of concurrency models, debugging support for applications that combine them has not yet gained much attention. Record & replay aids debugging by deterministically rep…
▽ More
With concurrency being integral to most software systems, developers combine high-level concurrency models in the same application to tackle each problem with appropriate abstractions. While languages and libraries offer a wide range of concurrency models, debugging support for applications that combine them has not yet gained much attention. Record & replay aids debugging by deterministically reproducing recorded bugs, but is typically designed for a single concurrency model only.
This paper proposes a practical concurrency-model-agnostic record & replay approach for multi-paradigm concurrent programs, i.e. applications that combine concurrency models. Our approach traces high-level nondeterministic events by using a uniform model-agnostic trace format and infrastructure. This enables orderingbased record & replay support for a wide range of concurrency models, and thereby enables debugging of applications that combine them. In addition, it allows language implementors to add new concurrency models and reuse the model-agnostic record & replay support.
We argue that a concurrency-model-agnostic record & replay is practical and enables advanced debugging support for a wide range of concurrency models. The evaluation shows that our approach is expressive and flexible enough to support record & replay of applications using threads & locks, communicating event loops, communicating sequential processes, software transactional memory and combinations of those concurrency models. For the actor model, we reach recording performance competitive with an optimized special-purpose record & replay solution. The average recording overhead on the Savina actor benchmark suite is 10% (min. 0%, max. 23%). The performance for other concurrency models and combinations thereof is at a similar level.
We believe our concurrency-model-agnostic approach helps developers of applications that mix and match concurrency models. We hope that this substrate inspires new tools and languages making building and maintaining of multi-paradigm concurrent applications simpler and safer.
△ Less
Submitted 26 February, 2021;
originally announced March 2021.
-
Supporting On-Stack Replacement in Unstructured Languages by Loop Reconstruction and Extraction
Authors:
Raphael Mosaner,
David Leopoldseder,
Manuel Rigger,
Roland Schatz,
Hanspeter Mössenböck
Abstract:
On-stack replacement (OSR) is a common technique employed by dynamic compilers to reduce program warm-up time. OSR allows switching from interpreted to compiled code during the execution of this code. The main targets are long running loops, which need to be represented explicitly, with dedicated information about condition and body, to be optimized at run time. Bytecode interpreters, however, rep…
▽ More
On-stack replacement (OSR) is a common technique employed by dynamic compilers to reduce program warm-up time. OSR allows switching from interpreted to compiled code during the execution of this code. The main targets are long running loops, which need to be represented explicitly, with dedicated information about condition and body, to be optimized at run time. Bytecode interpreters, however, represent control flow implicitly via unstructured jumps and thus do not exhibit the required high-level loop representation. To enable OSR also for jump-based - often called unstructured - languages, we propose the partial reconstruction of loops in order to explicitly represent them in a bytecode interpreter. Besides an outline of the general idea, we implemented our approach in Sulong, a bytecode interpreter for LLVM bitcode, which allows the execution of C/C++. We conducted an evaluation with a set of C benchmarks, which showed speed-ups in warm-up of up to 9x for certain benchmarks. This facilitates execution of programs with long-running loops in rarely called functions, which would yield significant slowdown without OSR. While shown with a prototype implementation, the overall idea of our approach is generalizable for all bytecode interpreters.
△ Less
Submitted 19 September, 2019;
originally announced September 2019.
-
Asynchronous Snapshots of Actor Systems for Latency-Sensitive Applications
Authors:
Dominik Aumayr,
Stefan Marr,
Elisa Gonzalez Boix,
Hanspeter Mössenböck
Abstract:
The actor model is popular for many types of server applications. Efficient snapshotting of applications is crucial in the deployment of pre-initialized applications or moving running applications to different machines, e.g for debugging purposes. A key issue is that snapshotting blocks all other operations. In modern latency-sensitive applications, stopping the application to persist its state ne…
▽ More
The actor model is popular for many types of server applications. Efficient snapshotting of applications is crucial in the deployment of pre-initialized applications or moving running applications to different machines, e.g for debugging purposes. A key issue is that snapshotting blocks all other operations. In modern latency-sensitive applications, stopping the application to persist its state needs to be avoided, because users may not tolerate the increased request latency. In order to minimize the impact of snapshotting on request latency, our approach persists the application's state asynchronously by capturing partial heaps, completing snapshots step by step. Additionally, our solution is transparent and supports arbitrary object graphs. We prototyped our snapshotting approach on top of the Truffle/Graal platform and evaluated it with the Savina benchmarks and the Acme Air microservice application. When performing a snapshot every thousand Acme Air requests, the number of slow requests ( 0.007% of all requests) with latency above 100ms increases by 5.43%. Our Savina microbenchmark results detail how different utilization patterns impact snapshotting cost. To the best of our knowledge, this is the first system that enables asynchronous snapshotting of actor applications, i.e. without stop-the-world synchronization, and thereby minimizes the impact on latency. We thus believe it enables new deployment and debugging options for actor systems.
△ Less
Submitted 18 September, 2019; v1 submitted 18 July, 2019;
originally announced July 2019.
-
Understanding GCC Builtins to Develop Better Tools
Authors:
Manuel Rigger,
Stefan Marr,
Bram Adams,
Hanspeter Mössenböck
Abstract:
C programs can use compiler builtins to provide functionality that the C language lacks. On Linux, GCC provides several thousands of builtins that are also supported by other mature compilers, such as Clang and ICC. Maintainers of other tools lack guidance on whether and which builtins should be implemented to support popular projects. To assist tool developers who want to support GCC builtins, we…
▽ More
C programs can use compiler builtins to provide functionality that the C language lacks. On Linux, GCC provides several thousands of builtins that are also supported by other mature compilers, such as Clang and ICC. Maintainers of other tools lack guidance on whether and which builtins should be implemented to support popular projects. To assist tool developers who want to support GCC builtins, we analyzed builtin use in 4,913 C projects from GitHub. We found that 37% of these projects relied on at least one builtin. Supporting an increasing proportion of projects requires support of an exponentially increasing number of builtins; however, implementing only 10 builtins already covers over 30% of the projects. Since we found that many builtins in our corpus remained unused, the effort needed to support 90% of the projects is moderate, requiring about 110 builtins to be implemented. For each project, we analyzed the evolution of builtin use over time and found that the majority of projects mostly added builtins. This suggests that builtins are not a legacy feature and must be supported in future tools. Systematic testing of builtin support in existing tools revealed that many lacked support for builtins either partially or completely; we also discovered incorrect implementations in various tools, including the formally verified CompCert compiler.
△ Less
Submitted 1 July, 2019;
originally announced July 2019.
-
Debugging Native Extensions of Dynamic Languages
Authors:
Jacob Kreindl,
Manuel Rigger,
Hanspeter Mössenböck
Abstract:
Many dynamic programming languages such as Ruby and Python enable developers to use so called native extensions, code implemented in typically statically compiled languages like C and C++. However, debuggers for these dynamic languages usually lack support for also debugging these native extensions. GraalVM can execute programs implemented in various dynamic programming languages and, by using the…
▽ More
Many dynamic programming languages such as Ruby and Python enable developers to use so called native extensions, code implemented in typically statically compiled languages like C and C++. However, debuggers for these dynamic languages usually lack support for also debugging these native extensions. GraalVM can execute programs implemented in various dynamic programming languages and, by using the LLVM-IR interpreter Sulong, also their native extensions. We added support for source-level debugging to Sulong based on GraalVM's debugging framework by associating run-time debug information from the LLVM-IR level to the original program code. As a result, developers can now use GraalVM to debug source code written in multiple LLVM-based programming languages as well as programs implemented in various dynamic languages that invoke it in a common debugger front-end.
△ Less
Submitted 2 August, 2018;
originally announced August 2018.
-
Context-aware Failure-oblivious Computing as a Means of Preventing Buffer Overflows
Authors:
Manuel Rigger,
Daniel Pekarek,
Hanspeter Mössenböck
Abstract:
In languages like C, buffer overflows are widespread. A common mitigation technique is to use tools that detect them during execution and abort the program to prevent the leakage of data or the diversion of control flow. However, for server applications, it would be desirable to prevent such errors while maintaining availability of the system. To this end, we present an approach to handle buffer o…
▽ More
In languages like C, buffer overflows are widespread. A common mitigation technique is to use tools that detect them during execution and abort the program to prevent the leakage of data or the diversion of control flow. However, for server applications, it would be desirable to prevent such errors while maintaining availability of the system. To this end, we present an approach to handle buffer overflows without aborting the program. This approach involves implementing a continuation logic in library functions based on an introspection function that allows querying the size of a buffer. We demonstrate that introspection can be implemented in popular bug-finding and bug-mitigation tools such as LLVM's AddressSanitizer, SoftBound, and Intel-MPX-based bounds checking. We evaluated our approach in a case study of real-world bugs and show that for tools that explicitly track bounds data, introspection results in a low performance overhead.
△ Less
Submitted 22 November, 2018; v1 submitted 23 June, 2018;
originally announced June 2018.
-
Efficient and Deterministic Record & Replay for Actor Languages
Authors:
Dominik Aumayr,
Stefan Marr,
Clément Béra,
Elisa Gonzalez Boix,
Hanspeter Mössenböck
Abstract:
With the ubiquity of parallel commodity hardware, developers turn to high-level concurrency models such as the actor model to lower the complexity of concurrent software. However, debugging concurrent software is hard, especially for concurrency models with a limited set of supporting tools. Such tools often deal only with the underlying threads and locks, which is at the wrong abstraction level a…
▽ More
With the ubiquity of parallel commodity hardware, developers turn to high-level concurrency models such as the actor model to lower the complexity of concurrent software. However, debugging concurrent software is hard, especially for concurrency models with a limited set of supporting tools. Such tools often deal only with the underlying threads and locks, which is at the wrong abstraction level and may even introduce additional complexity. To improve on this situation, we present a low-overhead record & replay approach for actor languages. It allows one to debug concurrency issues deterministically based on a previously recorded trace. Our evaluation shows that the average run-time overhead for tracing on benchmarks from the Savina suite is 10% (min. 0%, max. 20%). For Acme-Air, a modern web application, we see a maximum increase of 1% in latency for HTTP requests and about 1.4 MB/s of trace data. These results are a first step towards deterministic replay debugging of actor systems in production.
△ Less
Submitted 20 August, 2018; v1 submitted 16 May, 2018;
originally announced May 2018.
-
Introspection for C and its Applications to Library Robustness
Authors:
Manuel Rigger,
Rene Mayrhofer,
Roland Schatz,
Matthias Grimmer,
Hanspeter Mössenböck
Abstract:
Context: In C, low-level errors, such as buffer overflow and use-after-free, are a major problem, as they cause security vulnerabilities and hard-to-find bugs. C lacks automatic checks, and programmers cannot apply defensive programming techniques because objects (e.g., arrays or structs) lack run-time information about bounds, lifetime, and types.
Inquiry: Current approaches to tackling low-level…
▽ More
Context: In C, low-level errors, such as buffer overflow and use-after-free, are a major problem, as they cause security vulnerabilities and hard-to-find bugs. C lacks automatic checks, and programmers cannot apply defensive programming techniques because objects (e.g., arrays or structs) lack run-time information about bounds, lifetime, and types.
Inquiry: Current approaches to tackling low-level errors include dynamic tools, such as bounds or type checkers, that check for certain actions during program execution. If they detect an error, they typically abort execution. Although they track run-time information as part of their runtimes, they do not expose this information to programmers.
Approach: We devised an introspection interface that allows C programmers to access run-time information and to query object bounds, object lifetimes, object types, and information about variadic arguments. This enables library writers to check for invalid input or program states and thus, for example, to implement custom error handling that maintains system availability and does not terminate on benign errors. As we assume that introspection is used together with a dynamic tool that implements automatic checks, errors that are not handled in the application logic continue to cause the dynamic tool to abort execution.
Knowledge: Using the introspection interface, we implemented a more robust, source-compatible version of the C standard library that validates parameters to its functions. The library functions react to otherwise undefined behavior; for example, they can detect lurking flaws, handle unterminated strings, check format string arguments, and set errno when they detect benign usage errors.
Grounding: Existing dynamic tools maintain run-time information that can be used to implement the introspection interface, and we demonstrate its implementation in Safe Sulong, an interpreter and dynamic bug-finding tool for C that runs on a Java Virtual Machine and can thus easily expose relevant run-time information.
Importance: Using introspection in user code is a novel approach to tackling the long-standing problem of low-level errors in C. As new approaches are lowering the performance overhead of run-time information maintenance, the usage of dynamic runtimes for C could become more common, which could ultimately facilitate a more widespread implementation of such an introspection interface.
△ Less
Submitted 4 December, 2017;
originally announced December 2017.
-
A Study of Concurrency Bugs and Advanced Development Support for Actor-based Programs
Authors:
Carmen Torres Lopez,
Stefan Marr,
Hanspeter Mössenböck,
Elisa Gonzalez Boix
Abstract:
The actor model is an attractive foundation for developing concurrent applications because actors are isolated concurrent entities that communicate through asynchronous messages and do not share state. Thereby, they avoid concurrency bugs such as data races, but are not immune to concurrency bugs in general. This study taxonomizes concurrency bugs in actor-based programs reported in literature. Fu…
▽ More
The actor model is an attractive foundation for developing concurrent applications because actors are isolated concurrent entities that communicate through asynchronous messages and do not share state. Thereby, they avoid concurrency bugs such as data races, but are not immune to concurrency bugs in general. This study taxonomizes concurrency bugs in actor-based programs reported in literature. Furthermore, it analyzes the bugs to identify the patterns causing them as well as their observable behavior. Based on this taxonomy, we further analyze the literature and find that current approaches to static analysis and testing focus on communication deadlocks and message protocol violations. However, they do not provide solutions to identify livelocks and behavioral deadlocks. The insights obtained in this study can be used to improve debugging support for actor-based programs with new debugging techniques to identify the root cause of complex concurrency bugs.
△ Less
Submitted 24 April, 2018; v1 submitted 22 June, 2017;
originally announced June 2017.
-
A Concurrency-Agnostic Protocol for Multi-Paradigm Concurrent Debugging Tools
Authors:
Stefan Marr,
Carmen Torres Lopez,
Dominik Aumayr,
Elisa Gonzalez Boix,
Hanspeter Mössenböck
Abstract:
Today's complex software systems combine high-level concurrency models. Each model is used to solve a specific set of problems. Unfortunately, debuggers support only the low-level notions of threads and shared memory, forcing developers to reason about these notions instead of the high-level concurrency models they chose.
This paper proposes a concurrency-agnostic debugger protocol that decouple…
▽ More
Today's complex software systems combine high-level concurrency models. Each model is used to solve a specific set of problems. Unfortunately, debuggers support only the low-level notions of threads and shared memory, forcing developers to reason about these notions instead of the high-level concurrency models they chose.
This paper proposes a concurrency-agnostic debugger protocol that decouples the debugger from the concurrency models employed by the target application. As a result, the underlying language runtime can define custom breakpoints, stepping operations, and execution events for each concurrency model it supports, and a debugger can expose them without having to be specifically adapted.
We evaluated the generality of the protocol by applying it to SOMns, a Newspeak implementation, which supports a diversity of concurrency models including communicating sequential processes, communicating event loops, threads and locks, fork/join parallelism, and software transactional memory. We implemented 21 breakpoints and 20 stepping operations for these concurrency models. For none of these, the debugger needed to be changed. Furthermore, we visualize all concurrent interactions independently of a specific concurrency model. To show that tooling for a specific concurrency model is possible, we visualize actor turns and message sends separately.
△ Less
Submitted 29 October, 2017; v1 submitted 1 June, 2017;
originally announced June 2017.