-
A universal bound on the space complexity of Directed Acyclic Graph computations
Authors:
Gianfranco Bilardi,
Lorenzo De Stefani
Abstract:
It is shown that $S(G) = O\left(m/\log_2 m + d\right)$ pebbles are sufficient to pebble any DAG $G=(V,E)$, with $m$ edges and maximum in-degree $d$. It was previously known that $S(G) = O\left(d n/\log n\right)$. The result builds on two novel ideas. The first is the notion of a $B$-budget decomposition of a DAG $G$, an efficiently computable partition of $G$ into at most $2^{\lfloor \frac{m}{B} \rfloor}$ sub-DAGs, whose cumulative space requirement is at most $B$. The second is the challenging vertices technique, which constructs a pebbling schedule for $G$ from a pebbling schedule for a simplified DAG $G'$, obtained by removing from $G$ a selected set of vertices $W$ and their incident edges. This technique also yields improved pebbling upper bounds for DAGs with bounded genus and for DAGs with bounded topological depth.
Submitted 27 October, 2024;
originally announced October 2024.
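To make the quantity $S(G)$ in the abstract above concrete, here is a minimal sketch that checks a pebbling schedule and reports its peak pebble count. It assumes the standard black pebble game (a pebble may be placed on a vertex only when all its predecessors carry pebbles, and may be removed at any time); the function name `pebbles_used` and the schedule encoding are illustrative choices, not taken from the paper.

```python
def pebbles_used(preds, schedule):
    """Return the peak number of pebbles used by a schedule.

    preds:    dict mapping each vertex to a list of its predecessors.
    schedule: list of ("place", v) or ("remove", v) moves (hypothetical encoding).
    Raises ValueError if a move violates the black pebble game rules.
    """
    pebbled = set()
    peak = 0
    for move, v in schedule:
        if move == "place":
            if v in pebbled:
                raise ValueError(f"{v!r} already carries a pebble")
            missing = [u for u in preds.get(v, []) if u not in pebbled]
            if missing:
                raise ValueError(f"cannot pebble {v!r}: unpebbled predecessors {missing}")
            pebbled.add(v)
            peak = max(peak, len(pebbled))
        elif move == "remove":
            if v not in pebbled:
                raise ValueError(f"{v!r} carries no pebble")
            pebbled.remove(v)
        else:
            raise ValueError(f"unknown move {move!r}")
    return peak


# A complete binary tree with four leaves: x = f(a, b), y = f(c, d), r = f(x, y).
tree = {"a": [], "b": [], "c": [], "d": [], "x": ["a", "b"], "y": ["c", "d"], "r": ["x", "y"]}
tree_schedule = [
    ("place", "a"), ("place", "b"), ("place", "x"), ("remove", "a"), ("remove", "b"),
    ("place", "c"), ("place", "d"), ("place", "y"), ("remove", "c"), ("remove", "d"),
    ("place", "r"),
]
tree_peak = pebbles_used(tree, tree_schedule)
```

The pebbling number of a DAG is the minimum of this peak over all valid schedules that pebble the outputs; the sketch only evaluates one given schedule.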
-
Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance
Authors:
Gianfranco Bilardi,
Michele Schimd
Abstract:
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $α_{k}(n) = e_k(n)/n$ as well as of $α_{k} = \lim_{n \to \infty} α_{k}(n)$, a limit known to exist.
This paper shows that $α_k(n)-Q(n) \leq α_k \leq α_k(n)$, for a specific $Q(n)=Θ(\sqrt{\log n / n})$, a result which implies that $α_k$ is computable. The exact computation of $α_k(n)$ is explored, leading to an algorithm running in time $T=\mathcal{O}(n^2k\min(3^n,k^n))$, a complexity that makes it of limited practical use.
An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how $α_k(n)$ can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of $n$ up to, say, a quarter million. Correspondingly, 99.9\% confidence intervals of width approximately $10^{-2}$ are obtained for $α_k$.
Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound $β_k^*$ to $α_k$, such that $ \lim_{k \to \infty} β_k^*=1$. In general, $β_k^* \leq α_k \leq 1-1/k$; for $k$ greater than a few dozen, computing $β_k^*$ is much faster than generating good statistical estimates with confidence intervals of width $1-1/k-β_k^*$.
The techniques developed in the paper yield improvements on most previously published numerical values as well as results for alphabet sizes and string lengths not reported before.
Submitted 6 April, 2024; v1 submitted 13 November, 2022;
originally announced November 2022.
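The Monte Carlo idea in the abstract above can be sketched as follows. This is not the paper's procedure: it is a minimal illustration that samples random string pairs, averages their edit distance, and attaches a McDiarmid-style confidence half-width for $α_k(n)$. The half-width rests on the assumption that changing one character changes a single edit distance by at most 1, so with $N$ sample pairs the average of $e/n$ changes by at most $1/(nN)$ per character, which gives $t = \sqrt{\ln(2/δ)/(nN)}$. The names `edit_distance` and `estimate_alpha` are hypothetical.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def estimate_alpha(k, n, samples, delta=0.001, seed=0):
    """Estimate alpha_k(n) = e_k(n)/n from `samples` random string pairs.

    Returns (mean, half_width); under the bounded-difference assumption stated
    above, alpha_k(n) lies within mean +/- half_width with probability >= 1 - delta.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    mean = total / (samples * n)
    half_width = math.sqrt(math.log(2 / delta) / (n * samples))
    return mean, half_width
```

A call such as `estimate_alpha(4, 200, 50)` returns an estimate for a quaternary alphabet. The interval concerns $α_k(n)$ only; turning it into a statement about the limit $α_k$ additionally needs a bound like the $Q(n)$ term mentioned in the abstract.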
-
The DAG Visit approach for Pebbling and I/O Lower Bounds
Authors:
Gianfranco Bilardi,
Lorenzo De Stefani
Abstract:
We introduce the notion of an $r$-visit of a Directed Acyclic Graph (DAG) $G=(V,E)$, a sequence of the vertices of the DAG complying with a given rule $r$. A rule $r$ specifies for each vertex $v\in V$ a family of $r$-enabling sets of (immediate) predecessors: before visiting $v$, at least one of its enabling sets must have been visited. Special cases are the $r^{(top)}$-rule (or, topological rule), for which the only enabling set is the set of all predecessors, and the $r^{(sin)}$-rule (or, singleton rule), for which the enabling sets are the singletons containing exactly one predecessor. The $r$-boundary complexity of a DAG $G$, $b_{r}\left(G\right)$, is the minimum integer $b$ such that there is an $r$-visit where, at each stage, for at most $b$ of the vertices yet to be visited an enabling set has already been visited. By a reformulation of known results, it is shown that the boundary complexity of a DAG $G$ is a lower bound to the pebbling number of the reverse DAG, $G^R$. Several known pebbling lower bounds can be cast in terms of the $r^{(sin)}$-boundary complexity.
A visit partition technique for I/O lower bounds is also introduced, which generalizes the $S$-partition I/O technique introduced by Hong and Kung in their classic paper "I/O complexity: The Red-Blue pebble game". The visit partition approach yields tight I/O bounds for some DAGs for which the $S$-partition technique can only yield an $Ω(1)$ lower bound.
Submitted 4 October, 2022;
originally announced October 2022.
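The definitions in the abstract above can be illustrated with a small sketch that validates an $r$-visit and measures its largest boundary. Two points are assumptions rather than the paper's conventions: a "stage" is taken to be the state right after each visited vertex, and a vertex without predecessors is treated as enabled by the empty set under both rules. All function names are hypothetical.

```python
def topological_rule(preds):
    """Only enabling set of v: the set of all its predecessors."""
    return {v: [frozenset(p)] for v, p in preds.items()}


def singleton_rule(preds):
    """Enabling sets of v: one singleton per predecessor.

    Assumption: a vertex with no predecessors is enabled by the empty set.
    """
    return {v: [frozenset([u]) for u in p] or [frozenset()] for v, p in preds.items()}


def visit_boundary(enabling, visit):
    """Return the largest boundary size reached by `visit` under the given rule.

    enabling: dict mapping each vertex to a list of frozensets (its enabling sets).
    visit:    sequence listing every vertex exactly once.
    Raises ValueError if the sequence is not a valid visit for the rule.
    """
    visited = set()

    def enabled(v):
        return any(s <= visited for s in enabling[v])

    worst = 0
    for v in visit:
        if v in visited or not enabled(v):
            raise ValueError(f"vertex {v!r} cannot be visited at this stage")
        visited.add(v)
        # Boundary: vertices yet to be visited that already have a visited enabling set.
        boundary = sum(1 for u in enabling if u not in visited and enabled(u))
        worst = max(worst, boundary)
    if len(visited) != len(enabling):
        raise ValueError("the visit must cover every vertex")
    return worst


# Diamond DAG: a -> b, a -> c, b -> d, c -> d.
diamond = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
top_value = visit_boundary(topological_rule(diamond), ["a", "b", "c", "d"])
sin_value = visit_boundary(singleton_rule(diamond), ["a", "b", "d", "c"])
```

The boundary complexity $b_r(G)$ would be the minimum of this value over all valid $r$-visits; the sketch evaluates a single visit only. Note that the order `a, b, d, c` is accepted by the singleton rule but rejected by the topological rule.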
-
Encrypted Data Processing
Authors:
Jessica Tseng,
Gianfranco Bilardi,
Kattamuri Ekanadham,
Manoj Kumar,
Jose Moreira,
P. C. Pattnaik
Abstract:
In this paper, we present a comprehensive architecture for confidential computing, which we show to be general purpose and quite efficient. It executes the application as is, without any added burden or discipline requirements from the application developers. Furthermore, it does not require the trust of system software at the computing server and does not impose any added burden on the communication subsystem. The proposed Encrypted Data Processing (EDAP) architecture accomplishes confidentiality, authenticity, and freshness of the key-based cryptographic data protection by adopting data encryption with a multi-level key protection scheme. It guarantees that the user data is visible only in non-privileged mode to a designated program trusted by the data owner on designated hardware, thus protecting the data from untrusted hardware, hypervisor, OS, or other users' applications. The cryptographic keys and protocols used for achieving these confidential computing requirements are described in a use case example. Encrypting and decrypting data in an EDAP-enabled processor can lead to performance degradation as it adds cycle time to the overall execution. However, our simulation results show that the slowdown is only 6% on average across a collection of commercial workloads when the data encryption engine is placed between the L1 and L2 cache. We demonstrate that the EDAP architecture is valuable and practicable in the modern cloud environment for confidential computing. EDAP delivers a zero trust model of computing where the user software does not trust system software and vice versa.
Submitted 20 September, 2021;
originally announced September 2021.
-
A Lower Bound Technique for Communication in BSP
Authors:
Gianfranco Bilardi,
Michele Scquizzato,
Francesco Silvestri
Abstract:
Communication is a major factor determining the performance of algorithms on current computing systems; it is therefore valuable to provide tight lower bounds on the communication complexity of computations. This paper presents a lower bound technique for the communication complexity in the bulk-synchronous parallel (BSP) model of a given class of DAG computations. The derived bound is expressed in terms of the switching potential of a DAG, that is, the number of permutations that the DAG can realize when viewed as a switching network. The proposed technique yields tight lower bounds for the fast Fourier transform (FFT), and for any sorting and permutation network. A stronger bound is also derived for the periodic balanced sorting network, by applying this technique to suitable subnetworks. Finally, we demonstrate that the switching potential captures communication requirements even in computational models different from BSP, such as the I/O model and the LPRAM.
Submitted 25 November, 2017; v1 submitted 7 July, 2017;
originally announced July 2017.
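The abstract above describes the switching potential as the number of permutations a DAG can realize when viewed as a switching network. A brute-force sketch for tiny networks follows; it assumes the network is given as stages of disjoint 2x2 switches, each either straight or crossed, which is an illustrative encoding rather than the paper's formal definition. The name `realizable_permutations` is hypothetical.

```python
from itertools import product


def realizable_permutations(n_wires, stages):
    """Enumerate the permutations realizable by a network of 2x2 switches.

    stages: list of stages, each a list of disjoint (i, j) wire pairs.
    Every switch is independently set to straight or crossed.
    The cost is exponential in the number of switches, so this suits small networks only.
    """
    switches = [pair for stage in stages for pair in stage]
    perms = set()
    for setting in product((False, True), repeat=len(switches)):
        wires = list(range(n_wires))
        for (i, j), crossed in zip(switches, setting):
            if crossed:
                wires[i], wires[j] = wires[j], wires[i]
        perms.add(tuple(wires))
    return perms


# 4-input butterfly (the FFT dataflow pattern): two stages of two switches each.
butterfly4 = [[(0, 1), (2, 3)], [(0, 2), (1, 3)]]
butterfly_count = len(realizable_permutations(4, butterfly4))
```

For this 4-input butterfly, all 16 switch settings yield distinct permutations, so the network realizes 16 of the 24 permutations of its inputs.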
-
The I/O complexity of Strassen's matrix multiplication with recomputation
Authors:
Gianfranco Bilardi,
Lorenzo De Stefani
Abstract:
A tight $Ω((n/\sqrt{M})^{\log_2 7}M)$ lower bound is derived on the I/O complexity of Strassen's algorithm to multiply two $n \times n$ matrices, in a two-level storage hierarchy with $M$ words of fast memory. A proof technique is introduced, which exploits Grigoriev's flow of the matrix multiplication function as well as some combinatorial properties of the Strassen computational directed acyclic graph (CDAG). Applications to parallel computation are also developed. The result generalizes a similar bound previously obtained under the constraint of no-recomputation, that is, that intermediate results cannot be computed more than once. For this restricted case, another lower bound technique is presented, which leads to a simpler analysis of the I/O complexity of Strassen's algorithm and can be readily extended to other "Strassen-like" algorithms.
Submitted 7 May, 2016;
originally announced May 2016.
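To get a feel for the bound in the abstract above, the following sketch evaluates the expression $(n/\sqrt{M})^{\log_2 7}M$ next to the classical Hong-Kung expression $n^3/\sqrt{M} = (n/\sqrt{M})^{3}M$ for standard matrix multiplication. Constant factors hidden by the $Ω$ notation are ignored, so only the growth trend is meaningful; the function names are hypothetical.

```python
import math


def strassen_io_bound(n, M):
    """Leading term (n / sqrt(M)) ** log2(7) * M, constants omitted."""
    return (n / math.sqrt(M)) ** math.log2(7) * M


def classical_io_bound(n, M):
    """Leading term n ** 3 / sqrt(M) for standard multiplication, constants omitted."""
    return n ** 3 / math.sqrt(M)


# Ratio of the two leading terms for a few matrix sizes and a fixed fast memory.
ratios = {n: strassen_io_bound(n, 1024) / classical_io_bound(n, 1024)
          for n in (1024, 4096, 16384)}
```

Because the two expressions differ only in the exponent ($\log_2 7 \approx 2.81$ versus $3$), the ratio shrinks as $n/\sqrt{M}$ grows.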
-
Network-Oblivious Algorithms
Authors:
Gianfranco Bilardi,
Andrea Pietracaprina,
Geppino Pucci,
Michele Scquizzato,
Francesco Silvestri
Abstract:
A framework is proposed for the design and analysis of \emph{network-oblivious algorithms}, namely, algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network-oblivious algorithm be specified on a parallel model of computation where the only parameter is the problem's input size, and then evaluated on a model with two parameters, capturing parallelism granularity and communication latency. It is shown that, for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in the Decomposable BSP model, which is known to effectively describe a wide and significant class of parallel platforms. The proposed framework can be regarded as an attempt to port the notion of obliviousness, well established in the context of cache hierarchies, to the realm of parallel computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms for a number of key problems. Some limitations of the oblivious approach are also discussed.
Submitted 12 April, 2014;
originally announced April 2014.
-
Optimal Eviction Policies for Stochastic Address Traces
Authors:
Gianfranco Bilardi,
Francesco Versaci
Abstract:
The eviction problem for memory hierarchies is studied for the Hidden Markov Reference Model (HMRM) of the memory trace, showing how miss minimization can be naturally formulated in the optimal control setting. In addition to the traditional version assuming a buffer of fixed capacity, a relaxed version is also considered, in which buffer occupancy can vary and its average is constrained. Resorting to multiobjective optimization, viewing occupancy as a cost rather than as a constraint, the optimal eviction policy is obtained by composing solutions for the individual addressable items.
This approach is then specialized to the Least Recently Used Stack Model (LRUSM), a type of HMRM often considered for traces, which includes V-1 parameters, where V is the size of the virtual space. A gain optimal policy for any target average occupancy is obtained which (i) is computable in time O(V) from the model parameters, (ii) is optimal also for the fixed capacity case, and (iii) is characterized in terms of priorities and named the Least Profit Rate (LPR) policy. An O(log C) upper bound (where C is the buffer capacity) is derived for the ratio between the expected miss rate of LPR and that of OPT, the optimal off-line policy; the upper bound is tightened to O(1), under reasonable constraints on the LRUSM parameters. Using the stack-distance framework, an algorithm is developed to compute the number of misses incurred by LPR on a given input trace, simultaneously for all buffer capacities, in time O(log V) per access.
Finally, some results are provided for miss minimization over a finite horizon and over an infinite horizon under bias optimality, a criterion more stringent than gain optimality.
Submitted 25 October, 2013; v1 submitted 29 September, 2011;
originally announced September 2011.
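The stack-distance framework mentioned in the abstract above can be illustrated with its simplest instance. The sketch below handles plain LRU, not the LPR policy of the paper, and uses a list-based stack that costs time proportional to the stack depth per access rather than the O(log V) achieved there. One pass over the trace yields the miss count for every buffer capacity at once; the function names are hypothetical.

```python
def lru_stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first reference)."""
    stack = []  # most recently used item first
    distances = []
    for item in trace:
        if item in stack:
            distance = stack.index(item) + 1  # 1-based depth in the LRU stack
            stack.remove(item)
        else:
            distance = None
        stack.insert(0, item)
        distances.append(distance)
    return distances


def lru_misses_by_capacity(trace, max_capacity):
    """Number of LRU misses for every capacity 1..max_capacity from one pass.

    An access misses in a buffer of capacity c exactly when it is a first
    reference or its stack distance exceeds c.
    """
    distances = lru_stack_distances(trace)
    return {c: sum(1 for d in distances if d is None or d > c)
            for c in range(1, max_capacity + 1)}


cyclic_misses = lru_misses_by_capacity("abcabc", 3)
```

On the cyclic trace `abcabc`, every access misses for capacities 1 and 2, while capacity 3 incurs only the three first-reference misses.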
-
QCD on the Cell Broadband Engine
Authors:
F. Belletti,
G. Bilardi,
M. Drochner,
N. Eicker,
Z. Fodor,
D. Hierl,
H. Kaldass,
T. Lippert,
T. Maurer,
N. Meyer,
A. Nobile,
D. Pleiter,
A. Schaefer,
F. Schifano,
H. Simma,
S. Solbrig,
T. Streuer,
R. Tripiccione,
T. Wettig
Abstract:
We evaluate IBM's Enhanced Cell Broadband Engine (BE) as a possible building block of a new generation of lattice QCD machines. The Enhanced Cell BE will provide full support of double-precision floating-point arithmetic, including IEEE-compliant rounding. We have developed a performance model and applied it to relevant lattice QCD kernels. The performance estimates are supported by micro- and application-benchmarks that have been obtained on currently available Cell BE-based computers, such as IBM QS20 blades and PlayStation 3. The results are encouraging and show that this processor is an interesting option for lattice QCD applications. For a massively parallel machine on the basis of the Cell BE, an application-optimized network needs to be developed.
Submitted 12 October, 2007;
originally announced October 2007.