-
An Efficient Load Balancing Method for Tree Algorithms
Authors:
Osama Talaat Ibrahim,
Ahmed El-Mahdy
Abstract:
Multiprocessing is now mainstream, with an exponentially increasing number of processors. Load balancing is therefore a critical operation for the efficient execution of parallel algorithms. In this paper, we consider the fundamental class of tree-based algorithms, which are notoriously irregular and hard to load-balance with existing static techniques. We propose a hybrid load balancing method that uses statistical random sampling to estimate the tree depth and node count distributions in order to uniformly partition an input tree. To conduct an initial performance study, we implemented the method on an Intel Xeon Phi accelerator system. We considered the tree traversal operation on both regular and irregular unbalanced trees, represented by Fibonacci trees and unbalanced (biased) randomly generated trees, respectively. The results show scalable performance for up to the 60 physical processors of the accelerator, as well as for an extrapolated 128-processor case.
Submitted 29 September, 2017;
originally announced October 2017.
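To make the idea of estimating node counts by random sampling concrete, the following is a minimal sketch of Knuth's classic random-probe estimator applied to a Fibonacci call tree. It is not the method from the paper, whose details the abstract does not give; the function names (`fib_children`, `probe`, `estimate_size`) and the sample count are assumptions made for illustration only.

```python
import random

def fib_children(n):
    """Children of a node in a Fibonacci call tree: fib(n) calls fib(n-1) and fib(n-2)."""
    return [n - 1, n - 2] if n >= 2 else []

def exact_size(n):
    """Exact node count of the Fibonacci tree rooted at n (full traversal)."""
    return 1 + sum(exact_size(c) for c in fib_children(n))

def probe(root, children, rng):
    """One random root-to-leaf walk.

    At each level the running weight is multiplied by the branching factor,
    so the sum of weights is an unbiased estimate of the total node count.
    """
    estimate, weight, node = 1, 1, root
    while True:
        kids = children(node)
        if not kids:
            return estimate
        weight *= len(kids)
        estimate += weight
        node = rng.choice(kids)

def estimate_size(root, children, samples=2000, seed=0):
    """Average many independent probes (hypothetical helper, fixed seed)."""
    rng = random.Random(seed)
    return sum(probe(root, children, rng) for _ in range(samples)) / samples

if __name__ == "__main__":
    print("exact:", exact_size(10))
    print("estimated:", estimate_size(10, fib_children))
```

Each probe visits only one root-to-leaf path, so the estimate costs far less than a full traversal. A partitioner could, in principle, apply such estimates to subtrees to decide how to split work, although how the paper does this is not stated in the abstract.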
-
If-Conversion Optimization using Neuro Evolution of Augmenting Topologies
Authors:
Reem Elkhouly,
Keiji Kimura,
Ahmed El-Mahdy
Abstract:
Control-flow dependence is an intrinsic limiting factor for program acceleration. With the availability of instruction-level parallel architectures, if-conversion optimization has therefore become pivotal for extracting parallelism from serial programs. While many if-conversion optimization heuristics have been proposed in the literature, most of them apply rigid criteria regardless of the underlying hardware and input programs. In this paper, we propose a novel if-conversion scheme that performs an efficient if-conversion transformation using a machine learning technique (NEAT). This method enables if-conversion customization over all branches within a program, unlike prior work, which considered individual branches. Our technique also provides the flexibility required when compiling for heterogeneous systems. The efficacy of our approach is shown by experiments and reported results, which illustrate that programs can be accelerated on the same architecture without modifying the original code. Our technique applies to general-purpose programming languages (e.g. C/C++) and is transparent to the programmer. We implemented our technique in the LLVM 3.6.1 compilation infrastructure and experimented on the kernels of the SPEC-CPU2006 v1.1 benchmark suite running on a multicore system of Intel(R) Xeon(R) 3.50GHz processors. Our findings show a performance gain of up to 8.6% over the standard optimized code (LLVM -O2 with if-conversion included), indicating the need for an if-conversion compilation optimization that can adapt to the unique characteristics of every individual branch.
Submitted 3 March, 2016;
originally announced March 2016.
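For readers unfamiliar with the transformation itself, the sketch below illustrates what if-conversion does at the source level: a control dependence (a branch) is replaced by a data dependence (a predicate that selects between two computed values). It says nothing about the NEAT-based heuristic or the LLVM pass described in the abstract; the function names are hypothetical, and a real compiler performs this on machine-level code using predicated or select instructions.

```python
def clamp_branchy(x, limit):
    """Original form: the result depends on which path the branch takes."""
    if x > limit:
        return limit
    return x

def clamp_if_converted(x, limit):
    """If-converted form: both candidate values are available and a
    predicate selects one, so no branch is needed."""
    p = int(x > limit)              # predicate as 0 or 1
    return p * limit + (1 - p) * x  # arithmetic select

if __name__ == "__main__":
    for x in (3, 7, 12):
        print(x, clamp_branchy(x, 7), clamp_if_converted(x, 7))
```

Whether the converted form is faster depends on branch predictability and on the cost of evaluating both sides, which is why the choice of which branches to convert is treated as a tuning problem.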
-
A Linear-Time and Space Algorithm for Optimal Traffic Signal Durations at an Intersection
Authors:
Sameh Samra,
Ahmed El-Mahdy,
Yasutaka Wada
Abstract:
Finding an optimal solution for traffic signal control durations is a computationally intensive task. It is typically O(T^3) in time and O(T^2) in space, where T is the length of the control interval in discrete time steps. In this paper, we propose a linear time and space algorithm for the same problem. The algorithm provides an efficient dynamic programming formulation of the state space, which prunes non-optimal states early on. The paper proves the correctness of the algorithm and provides an initial experimental validation.
Submitted 2 November, 2013;
originally announced November 2013.
-
Thread-Based Obfuscation through Control-Flow Mangling
Authors:
Rasha Salah Omar,
Ahmed El-Mahdy,
Erven Rohou
Abstract:
The increasing use of cloud computing and remote execution has made program security especially important. Code obfuscation has been proposed to make programs harder for attackers to understand. In this paper, we exploit multi-core processing to substantially increase the complexity of programs, making reverse engineering more difficult. We propose a novel method that automatically partitions any serial thread into an arbitrary number of parallel threads at the basic-block level. The method generates new control-flow graphs, preserving the blocks' serial successor relations and using guards to guarantee that only one basic block is active at a time. The method generates m^n different combinations for m threads and n basic blocks, significantly complicating the execution state. We provide a correctness proof for the algorithm and implement it in the LLVM compilation framework.
Submitted 31 October, 2013;
originally announced November 2013.
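The m^n count and the role of guards can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the LLVM implementation from the paper: the program is a straight-line chain of blocks, threads are simulated by polling a shared guard in turn, and all names (`all_assignments`, `run_partitioned`) are hypothetical.

```python
from itertools import product

def all_assignments(m, n):
    """Every way to place n basic blocks onto m threads (m**n in total)."""
    return list(product(range(m), repeat=n))

def run_partitioned(blocks, successor, assignment, num_threads, entry=0):
    """Simulate threads polling a shared guard.

    `active` is the guard: the index of the only block allowed to run.
    A thread executes a block only if it owns the active block, then
    hands the guard to that block's serial successor.
    """
    trace, active = [], entry
    while active is not None:
        for thread in range(num_threads):
            if active is not None and assignment[active] == thread:
                trace.append(blocks[active])
                active = successor[active]
    return trace

if __name__ == "__main__":
    blocks = ["A", "B", "C"]
    successor = [1, 2, None]        # serial order A -> B -> C
    options = all_assignments(2, 3)
    print(len(options))             # 2**3 assignments
    for assignment in options:
        print(assignment, run_partitioned(blocks, successor, assignment, 2))
```

Every assignment yields the same execution trace, which reflects the property stated in the abstract: the serial successor relations are preserved while the mapping of blocks to threads varies.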