
We are currently experimenting with several parallel versions of FFTW. These have not yet been optimized nearly as fully as FFTW itself, nor have they been benchmarked against other parallel FFT software (mainly because parallel FFT codes are few and far between). They are, however, a start in the right direction.

**Cilk**: Cilk is a superset of C developed at MIT. It allows easy parallelization of C code and automates load balancing between processors. It is currently available only on shared-memory SMP architectures. The Cilk FFTW routines parallelize both one- and multi-dimensional transforms, and are included in FFTW.

**Threads**: Threads are a low-level mechanism for managing concurrent computations that share a single address space. They are available in some form on most SMP systems. FFTW currently supports Solaris, POSIX, and BeOS threads, and should be easy to port to other systems. (Some experimental code is in place for MacOS MP and Win32 threads.) The threads FFTW routines parallelize both one- and multi-dimensional transforms, and are included in FFTW.

**MPI**: MPI is a standard message-passing library available on many platforms. Currently, we have in-place, multi-dimensional parallel transforms for FFTW that use MPI. These routines differ from the threads and Cilk versions of FFTW in that they work on distributed-memory machines in addition to shared-memory architectures. The MPI FFTW routines are included in FFTW.
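To illustrate the usage pattern for the threads version, here is a minimal sketch of planning and executing a multi-threaded transform. It is written against the FFTW 3 threads interface (`fftw_init_threads` / `fftw_plan_with_nthreads`), which may differ from the interface shipped with the parallel code described on this page, so treat it only as an illustration of the idea: initialize the threading layer, ask for a number of threads, and then plan and execute as usual.

```c
/* Minimal sketch of multi-threaded FFTW usage (FFTW 3 threads API).
 * Link with -lfftw3_threads -lfftw3 -lm (and -lpthread on some systems). */
#include <fftw3.h>

int main(void)
{
    const int n0 = 512, n1 = 512;

    fftw_init_threads();          /* initialize the threading layer          */
    fftw_plan_with_nthreads(4);   /* plans created from now on use 4 threads */

    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * n0 * n1);
    fftw_plan p = fftw_plan_dft_2d(n0, n1, data, data,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill data[] with the input here ... */

    fftw_execute(p);              /* the transform itself runs in parallel */

    fftw_destroy_plan(p);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}
```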

Benchmark results are available for the following machines:

- 8 Processor Sun HPC 5000 (167 MHz Sun UltraSPARC-I). (Shared memory SMP.)
- Cray T3D (150 MHz DEC Alphas) at the Pittsburgh Supercomputing Center. (Distributed memory.)

In the Cilk version, the transform is parallelized simply by placing `spawn` keywords in front of subroutine calls (thanks to FFTW's explicitly recursive structure). Currently, we keep the data in the same order as we do for the uniprocessor code; in the future, we may reorder the data at each phase of the computation to achieve better segregation of the data between processors.
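To give a flavor of what this looks like, here is a toy recursive radix-2 transform written in the Cilk-5 style: the parallelism comes entirely from the `spawn` keywords in front of the two independent recursive calls and the `sync` before the combining pass. This is an illustrative sketch, not FFTW's actual code.

```c
/* Illustrative only: a recursive radix-2 FFT parallelized in the Cilk
 * style.  The two half-size sub-transforms are independent, so each
 * gets a "spawn"; "sync" waits for both before the combining pass. */
#include <complex.h>
#include <math.h>

cilk void fft(double complex *in, double complex *out, int n, int stride)
{
    if (n == 1) {
        out[0] = in[0];
        return;
    }

    spawn fft(in,          out,         n / 2, 2 * stride);  /* even terms */
    spawn fft(in + stride, out + n / 2, n / 2, 2 * stride);  /* odd terms  */
    sync;

    /* Combine the halves with the twiddle factors. */
    for (int k = 0; k < n / 2; k++) {
        double complex t = cexp(-2 * M_PI * I * k / n) * out[k + n / 2];
        out[k + n / 2] = out[k] - t;
        out[k]         = out[k] + t;
    }
}
```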
The MPI code works differently, since it is designed for distributed-memory systems: here, each processor has its own memory or address space in which it holds a portion of the data. The data on a processor is inaccessible to other processors except by explicit message passing.

In the MPI version of FFTW, we assume that a multi-dimensional array is distributed across the rows (the first dimension) of the data. To perform the FFT of this data, each processor first transforms all the dimensions of the data that are completely local to it (e.g. the rows). Then, the processors have to perform a transpose of the data in order to get the remaining dimension (the columns) local to a processor. This dimension is then Fourier transformed, and the data is (optionally) transposed back to its original order.
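For concreteness, here is a small serial sketch of that decomposition for a two-dimensional array, with a naive O(n²) DFT standing in for FFTW's transforms. In the MPI code the rows are spread across processors and the transpose becomes the communication step described below; the function names here are made up for illustration.

```c
/* Serial illustration of the decomposition used by the distributed code:
 * a 2-D DFT is computed as (a) a 1-D DFT of every row, (b) a transpose,
 * (c) a 1-D DFT of every former column, (d) an optional transpose back. */
#include <complex.h>
#include <stdlib.h>

static void dft_1d(double complex *x, int n)   /* naive O(n^2) stand-in */
{
    const double two_pi = 6.283185307179586;
    double complex *y = malloc(n * sizeof *y);
    for (int k = 0; k < n; k++) {
        y[k] = 0;
        for (int j = 0; j < n; j++)
            y[k] += x[j] * cexp(-I * two_pi * (double)j * k / n);
    }
    for (int k = 0; k < n; k++) x[k] = y[k];
    free(y);
}

static void transpose(const double complex *a, double complex *b, int n0, int n1)
{
    for (int i = 0; i < n0; i++)           /* a is n0 x n1, b becomes n1 x n0 */
        for (int j = 0; j < n1; j++)
            b[j * n0 + i] = a[i * n1 + j];
}

void fft_2d_by_rows(double complex *a, double complex *tmp, int n0, int n1)
{
    for (int i = 0; i < n0; i++)           /* (a) transform each row          */
        dft_1d(a + i * n1, n1);
    transpose(a, tmp, n0, n1);             /* (b) columns become rows         */
    for (int j = 0; j < n1; j++)           /* (c) transform the former columns */
        dft_1d(tmp + j * n0, n0);
    transpose(tmp, a, n1, n0);             /* (d) restore the original order  */
}
```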

The hardest part of the multi-dimensional, distributed FFT is the
transpose. A transpose is one of those operations that sounds simple,
but is very tricky to implement in practice. The real subtlety comes
in when you want to do a transpose in place and with minimal
communications. Essentially, the way we do this is to do an in-place
transpose of the data local to each processor (itself a non-trivial
problem), exchange blocks of data with every other processor, and then
complete the transpose by a further (precomputed) permutation on the
data. All of this involves an appalling amount of bookkeeping,
especially when the array dimensions aren't divisible by the number of
processors. The communications phase is an example of a *complete
exchange*: an operation in which every processor sends a different
message to every other processor. (This is probably the worst thing
that you can do to a memory architecture.)
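To make the communication phase concrete, here is a rough sketch of such a complete exchange written with `MPI_Alltoall`. It assumes the simplest case (dimensions divisible by the number of processors), packs into separate send/receive buffers rather than working in place, and omits the local pre- and post-permutations, so it sidesteps exactly the bookkeeping described above; it is not FFTW's actual transpose code.

```c
/* Communication phase of a distributed transpose as a complete exchange.
 * Each process owns n0/nproc rows of an n0 x n1 row-major matrix and
 * ends up owning n1/nproc columns, stored as rows of length n0. */
#include <mpi.h>
#include <stdlib.h>

void transpose_exchange(const double *local_rows, double *local_cols,
                        int n0, int n1, MPI_Comm comm)
{
    int nproc, rank;
    MPI_Comm_size(comm, &nproc);
    MPI_Comm_rank(comm, &rank);

    int rows_per_proc = n0 / nproc;              /* assumes exact division */
    int cols_per_proc = n1 / nproc;
    int block = rows_per_proc * cols_per_proc;   /* doubles per block      */

    double *sendbuf = malloc((size_t)block * nproc * sizeof *sendbuf);
    double *recvbuf = malloc((size_t)block * nproc * sizeof *recvbuf);

    /* Pack: block p holds the columns destined for process p. */
    for (int p = 0; p < nproc; p++)
        for (int i = 0; i < rows_per_proc; i++)
            for (int j = 0; j < cols_per_proc; j++)
                sendbuf[p * block + i * cols_per_proc + j] =
                    local_rows[i * n1 + p * cols_per_proc + j];

    /* The complete exchange: every process sends a distinct block
     * to every other process. */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, comm);

    /* Unpack: block p came from process p's rows; lay it out so that
     * local_cols is cols_per_proc rows of length n0 of the transpose. */
    for (int p = 0; p < nproc; p++)
        for (int i = 0; i < rows_per_proc; i++)
            for (int j = 0; j < cols_per_proc; j++)
                local_cols[j * n0 + p * rows_per_proc + i] =
                    recvbuf[p * block + i * cols_per_proc + j];

    free(sendbuf);
    free(recvbuf);
}
```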

The whole process would be simpler, and possibly faster, if the transpose weren't required to be in-place. We felt, however, that if a problem is large enough to require a distributed memory multiprocessor, it is probably large enough that the capability to perform a transform in-place is essential.
