Portable High-Performance Programs

by Matteo Frigo.

Ph. D. Thesis, MIT Department of Electrical Engineering and Computer Science.
June 1999.
[Full text]


This dissertation discusses how to write computer programs that attain both high performance and portability, despite the fact that current computer systems have different degrees of parallelism, deep memory hierarchies, and diverse processor architectures.

To cope with parallelism portably in high-performance programs, we present the Cilk multithreaded programming system. In the Cilk-5 system, parallel programs scale up to run efficiently on multiple processors, but unlike existing parallel-programming environments, such as MPI and HPF, Cilk programs ``scale down'' to run on one processor as efficiently as a comparable C program. The typical cost of spawning a parallel thread in Cilk-5 is only between 2 and 6 times the cost of a C function call. This efficient implementation was guided by the work-first principle , which dictates that scheduling overheads should be borne by the critical path of the computation and not by the work. We show how the work-first principle inspired Cilk's novel ``two-clone'' compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.

To cope portably with the memory hierarchy, we present asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$ where $Z=\Omega(L2)$ the number of cache misses for an $m\times n$ matrix transpose is $\Theta(1+mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1+(n/L)(1+\log_Z n))$. We also give a $\Theta(mnp)$-work algorithm to multiply an $m\times n$ matrix by an $n\times p$ matrix that incurs $\Theta(1 + (mn+np+mp)/L + mnp/L\sqrt{Z})$ cache faults.

To attain portability in the face of both parallelism and the memory hierarchy at the same time, we examine the location consistency memory model and the Backer coherence algorithm for maintaining it. We prove good asymptotic bounds on the execution time of Cilk programs that use location-consistent shared memory.

To cope with the diversity of processor architectures, we develop the FFTW self-optimizing program, a portable C library that computes Fourier transforms. FFTW is unique in that it can automatically tune itself to the underlying hardware in order to achieve high performance. Through extensive benchmarking, FFTW has been shown to be typically faster than all other publicly available FFT software, including codes such as Sun's Performance Library and IBM's ESSL that are tuned to a specific machine. Most of the performance-critical code of FFTW was generated automatically by a special-purpose compiler written in Objective Caml, which uses symbolic evaluation and other compiler techniques to produce ``codelets''---optimized sequences of C code that can be assembled into ``plans'' to compute a Fourier transform. At runtime, FFTW measures the execution time of many plans and uses dynamic programming to select the fastest. Finally, the plan drives a special interpreter that computes the actual transforms.

This page maintained by: Matteo Frigo. Last updated: Sun Jun 21 09:29:38 2009