June 1999.


This dissertation discusses how to write computer programs that attain both high performance and portability, despite the fact that current computer systems have different degrees of parallelism, deep memory hierarchies, and diverse processor architectures.

To cope with parallelism portably in
high-performance programs, we present the **Cilk** multithreaded programming system. In the Cilk-5 system, parallel
programs scale up to run efficiently on multiple processors, but
unlike existing parallel-programming environments, such as MPI and
HPF, Cilk programs ``scale down'' to run on one processor as
efficiently as a comparable C program. The typical cost of spawning a
parallel thread in Cilk-5 is only between 2 and 6 times the cost of a
C function call. This efficient implementation was guided by the **work-first principle**, which dictates that
scheduling overheads should be borne by the critical path of the
computation and not by the work. We show how the work-first principle
inspired Cilk's novel ``two-clone'' compilation strategy and its
Dijkstra-like mutual-exclusion protocol for implementing the ready
deque in the work-stealing scheduler.

To cope portably with the memory hierarchy, we
present asymptotically optimal algorithms for rectangular matrix
transpose, FFT, and sorting on computers with multiple levels of
caching. Unlike previous optimal algorithms, these algorithms are **cache oblivious**: no variables dependent on
hardware parameters, such as cache size and cache-line length, need to
be tuned to achieve optimality. Nevertheless, these algorithms use an
optimal amount of work and move data optimally among multiple levels
of cache. For a cache with size $Z$ and cache-line length $L$ where
$Z=\Omega(L^{2})$, the number of cache misses for an
$m\times n$ matrix transpose is $\Theta(1+mn/L)$. The number of cache
misses for either an $n$-point FFT or the sorting of $n$ numbers is
$\Theta(1+(n/L)(1+\log_Z n))$. We also give a $\Theta(mnp)$-work
algorithm to multiply an $m\times n$ matrix by an $n\times p$ matrix
that incurs $\Theta(1 + (mn+np+mp)/L + mnp/L\sqrt{Z})$ cache
faults.
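The flavor of a cache-oblivious algorithm can be conveyed with a small sketch of the transpose (this is an illustration of the divide-and-conquer idea, not the dissertation's code; the base-case cutoff here is an arbitrary constant for efficiency, not a tuned cache parameter). The recursion always halves the larger dimension, so every submatrix eventually fits in whatever cache exists, without the code ever naming $Z$ or $L$.

```c
#include <stddef.h>

/* Illustrative cache-oblivious transpose sketch: recursively split
 * the larger dimension of the current submatrix.  No cache-size or
 * line-length parameter appears anywhere in the code. */
enum { CUTOFF = 8 };  /* small base case; any constant works */

/* Transpose rows r0..r1-1, cols c0..c1-1 of the m-by-n matrix a
 * (row stride n) into the n-by-m matrix b (row stride m). */
static void transpose_rec(const double *a, double *b, int n, int m,
                          int r0, int r1, int c0, int c1)
{
    int dr = r1 - r0, dc = c1 - c0;
    if (dr <= CUTOFF && dc <= CUTOFF) {
        for (int i = r0; i < r1; i++)
            for (int j = c0; j < c1; j++)
                b[(size_t)j * m + i] = a[(size_t)i * n + j];
    } else if (dr >= dc) {
        int mid = r0 + dr / 2;  /* split the taller dimension */
        transpose_rec(a, b, n, m, r0, mid, c0, c1);
        transpose_rec(a, b, n, m, mid, r1, c0, c1);
    } else {
        int mid = c0 + dc / 2;  /* split the wider dimension */
        transpose_rec(a, b, n, m, r0, r1, c0, mid);
        transpose_rec(a, b, n, m, r0, r1, mid, c1);
    }
}

/* Transpose the m-by-n matrix a into the n-by-m matrix b. */
void transpose(const double *a, double *b, int m, int n)
{
    transpose_rec(a, b, n, m, 0, m, 0, n);
}
```

Once a recursive subproblem fits in a cache of any size, all further work on it incurs no additional misses at that level, which is how the $\Theta(1+mn/L)$ bound arises under the tall-cache assumption $Z=\Omega(L^2)$.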

To attain portability in the face of both
parallelism and the memory hierarchy at the same time, we examine the **location consistency** memory model and the **Backer** coherence algorithm for maintaining it. We
prove good asymptotic bounds on the execution time of Cilk programs
that use location-consistent shared memory.

To cope with the diversity of processor architectures, we develop the FFTW self-optimizing program, a portable C library that computes Fourier transforms. FFTW is unique in that it can automatically tune itself to the underlying hardware in order to achieve high performance. Through extensive benchmarking, FFTW has been shown to be typically faster than all other publicly available FFT software, including codes such as Sun's Performance Library and IBM's ESSL that are tuned to a specific machine. Most of the performance-critical code of FFTW was generated automatically by a special-purpose compiler written in Objective Caml, which uses symbolic evaluation and other compiler techniques to produce ``codelets''---optimized sequences of C code that can be assembled into ``plans'' to compute a Fourier transform. At runtime, FFTW measures the execution time of many plans and uses dynamic programming to select the fastest. Finally, the plan drives a special interpreter that computes the actual transforms.
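The measure-and-select step can be sketched in a few lines of C. This toy is purely illustrative and is not FFTW's API: the candidate ``plans'' here are interchangeable functions for a trivial task, and the real planner composes codelets and applies dynamic programming over subproblems rather than timing whole monolithic candidates.

```c
#include <time.h>

/* Hypothetical sketch of the measure-and-select idea (not FFTW's
 * actual interface).  Each candidate "plan" computes the same result;
 * the planner times each candidate and keeps the fastest one. */
typedef void (*plan_fn)(double *x, int n);

/* Two interchangeable candidate implementations of "scale by 2". */
static void plan_scalar(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= 2.0;
}

static void plan_unrolled(double *x, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* 4-way unrolled main loop */
        x[i]     *= 2.0;
        x[i + 1] *= 2.0;
        x[i + 2] *= 2.0;
        x[i + 3] *= 2.0;
    }
    for (; i < n; i++)             /* leftover elements */
        x[i] *= 2.0;
}

/* Time `reps` executions of one plan on a scratch buffer. */
static double measure(plan_fn p, double *scratch, int n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        p(scratch, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Return the fastest of `ncands` candidate plans, by measurement.
 * Requires n <= 4096 for the scratch buffer in this sketch. */
plan_fn choose_plan(plan_fn *cands, int ncands, int n)
{
    static double scratch[4096];
    plan_fn best = cands[0];
    double best_t = measure(cands[0], scratch, n, 1000);
    for (int i = 1; i < ncands; i++) {
        double t = measure(cands[i], scratch, n, 1000);
        if (t < best_t) {
            best_t = t;
            best = cands[i];
        }
    }
    return best;
}
```

Whichever candidate wins on a given machine, the selected plan computes the same answer; only its speed differs, which is exactly the property that lets measurement substitute for a hand-tuned, machine-specific choice.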

This page maintained by: Matteo Frigo. Last updated: Sun Jun 21 09:29:38 2009